๐Ÿ—„๏ธ Blob-Based Storage Optimization

Transforming Nexio with Content-Addressable Storage

📅 2025-12-21

🧭 Introduction

As Nexio evolved, I started noticing inefficiencies in how it stored file snapshots. Every commit was duplicating files, even when they hadn't changed. This approach, while simple, doesn't scale well. Imagine committing a 1MB file 10 times without modifications: you'd end up with 10MB of redundant data. It was time for an optimization.

In this post, I'll walk you through how I transformed Nexio from raw file storage to a content-addressable blob store, achieving up to 97% storage savings while significantly improving performance.

🎯 The Problem

The original Nexio storage had some fundamental issues:

| Issue | Impact |
|---|---|
| Full file copies per commit | Disk usage grows linearly with commits |
| Byte-by-byte comparison | Slow change detection for large files |
| Flat directory structure | Performance degrades with many files |
| No deduplication | Identical files stored multiple times |

For a version control system to be practical, especially at scale, these problems needed to be addressed.

🛠️ The Solution: Content-Addressable Storage

The key insight is simple: store content by its hash. If two files have the same content, they produce the same hash and only need to be stored once. This is the same principle Git uses with its object database.

The Optimization Stack

| Component | Technology | Purpose |
|---|---|---|
| Hashing | BLAKE3 | Fastest cryptographic hash, enables deduplication |
| Compression | Zlib (level 6) | 50-90% size reduction for text files |
| Sharding | 2-character prefix | Distributes blobs across ~256 subdirectories |

โ“ Why BLAKE3?

When choosing a hash algorithm, I had several options: MD5, SHA-1, SHA-256, or BLAKE3. I chose BLAKE3 because:

  1. Speed: BLAKE3 is 3-4x faster than SHA-256
  2. Security: Cryptographically secure (unlike MD5 or SHA-1)
  3. Simplicity: Single algorithm for all file sizes

The Go implementation I used is lukechampine.com/blake3, which provides excellent performance with a simple API.

🔄 How It Works

Here's the flow when adding a file to Nexio:

```
1. File: src/main.go (10KB)
         |
         v
2. BLAKE3 hash   -> "ab3f7c9e2d1a8b4f6e..."
         |
         v
3. Zlib compress -> ~3KB
         |
         v
4. Shard path    -> .nexio/objects/ab/3f7c9e2d1a8b4f6e...
         |
         v
5. Dedup check   -> Skip write if blob exists
```

The magic happens at step 5: if the blob already exists (same hash = same content), we skip writing entirely. This is where the massive storage savings come from.

📁 New Directory Structure

The updated .nexio directory now includes an objects folder:

```
.nexio/
├── objects/                      # Content-addressable blob store
│   ├── 00/
│   ├── 01/
│   ├── ...
│   ├── ab/
│   │   ├── 3f7c9e2d1a8b4f6e...   # Compressed blob
│   │   └── cdef123456789012...   # Compressed blob
│   ├── ...
│   └── ff/
├── staging/
│   └── logs.json                 # Enhanced with blobHash field
├── commits/
│   └── <commit-hash>/
│       ├── fileList.json         # Enhanced with blobHash + mode
│       ├── metadata.json
│       └── logs.json
├── branches/
└── config.json
```

The raw file copies that previously lived in staging/added/, staging/modified/, and commits/<hash>/<file-id>/ are now gone. All file content lives in objects/ with automatic deduplication.

🧩 The Blob Module

I created a new blob.go file with the following core functions:

| Function | Description |
|---|---|
| `HashFile(path)` | Compute BLAKE3 hash of file (streaming, memory efficient) |
| `HashBytes(data)` | Compute BLAKE3 hash of byte slice |
| `BlobPath(hash)` | Return sharded path: `ab3f...` -> `.nexio/objects/ab/3f...` |
| `BlobExists(hash)` | Check if blob exists (for deduplication) |
| `WriteBlob(path)` | Hash, compress, store blob. Skip if exists. Return hash. |
| `ReadBlob(hash)` | Read and decompress blob content |
| `RestoreBlob(hash, destPath, mode)` | Decompress blob to destination with permissions |

The WriteBlob function is the workhorse: it handles the entire pipeline from reading the source file to storing the compressed blob.

🧹 Garbage Collection

With content-addressable storage, we need a way to clean up orphaned blobs. I added a new nexio clean command:

```shell
nexio clean
```

The algorithm is straightforward:

  1. Collect all blob hashes referenced in commits' fileList.json and staging logs.json
  2. Walk .nexio/objects/**/*
  3. Delete any blob not in the referenced set
  4. Delete any shard directory with no remaining blobs
  5. Report: "Cleaned X blobs, freed Y MB"

Cleanup will run automatically before nexio push and after nexio pull (both still to be implemented), and it can also be run manually to keep storage tidy.

📊 Results

The storage improvements are dramatic:

| Scenario | Before (Raw) | After (Blobs) | Savings |
|---|---|---|---|
| 10 commits, same 1MB file | 10MB | ~300KB | 97% |
| 100KB source file | 100KB | ~30KB | 70% |
| 10 identical files | 1MB | 100KB | 90% |
| 10,000 objects | 1 directory | ~39 files/shard | O(1) lookup |

Performance also improved significantly:

| Operation | Before | After |
|---|---|---|
| File comparison | Byte-by-byte (slow) | Hash comparison (instant) |
| Duplicate detection | None | Automatic via content hash |
| Storage per commit | Full file copies | Only new/changed blobs |
| Directory listing | Degrades with scale | Constant via sharding |

🎨 Design Decisions

Several key decisions shaped this implementation:

| Decision | Choice | Rationale |
|---|---|---|
| Hash algorithm | BLAKE3 | 3-4x faster than SHA-256, cryptographically secure |
| Compression | Zlib level 6 | Good balance of speed and compression ratio |
| Shard prefix | 2 characters | ~256 directories, handles millions of objects |
| Staging storage | Hash reference only | Most efficient, no duplicate storage |
| Orphan cleanup | Manual + auto on push/pull | Clean before upload, after download |
| File permissions | Full uint32 | Preserves exact Unix permissions |

What I Didn't Implement

| Feature | Reason to Defer |
|---|---|
| Chunking | Overhead exceeds benefit for source code files |
| Delta compression | Significant complexity; whole-file dedup is sufficient |
| Packfiles | Only needed for very large repos (100k+ objects) |
| Migration | Fresh implementation; no legacy repos to support |

These features would add complexity without proportional benefit for typical source code repositories. If Nexio grows to handle very large repos, they can be added later.

💡 Lessons Learned

  1. Hash-based deduplication is powerful: The simplicity of "same content = same hash = store once" provides enormous benefits with minimal complexity.

  2. Sharding prevents filesystem bottlenecks: A single directory with thousands of files performs poorly on most filesystems. The 2-character prefix sharding keeps directories small.

  3. Compression compounds savings: Combining deduplication with compression means you're both eliminating duplicates AND shrinking what remains.

  4. Keep it simple: I deliberately avoided features like chunking and delta compression. For source code, whole-file deduplication is usually sufficient.

🔮 Future

This blob storage system sets the foundation for several future features:

  • Remote sync: Efficiently transfer only missing blobs between remotes
  • Shallow clones: Fetch only the blobs needed for a specific commit
  • Integrity verification: Use hashes to detect storage corruption

The content-addressable architecture makes all of these features much easier to implement.

🔗 Resources

If you're interested in learning more about content-addressable storage:

💻 Check out Nexio at GitHub.
