Key Concepts
Domain concepts and API design patterns
This page explains the core domain concepts and API design patterns that underpin Nabu Store. Understanding these building blocks — blobs, identities, replication policies, the consistent hash ring, erasure coding, and the metadata index — will help you make informed decisions when designing integrations, choosing durability strategies, and diagnosing cluster behaviour. All interaction with Nabu Store happens over HTTP/gRPC APIs; the concepts here map directly to the fields, parameters, and error codes you will encounter in those requests.
Blobs: the fundamental unit of storage
Everything Nabu Store manages is a blob: an immutable, binary sequence of bytes addressed by a 128-bit identifier. Blobs are write-once — once a Put succeeds the content can never be overwritten. This constraint is not a limitation; it is the property that makes distributed consistency tractable. Because no blob ever changes, any replica of it is authoritative, and the cluster never needs to reconcile conflicting versions.
Blobs carry a metadata record (BlobMeta) that travels with the data. The fields you set at write time — policy, labels, and (optionally) content_hash — are stored verbatim. The backend overwrites size and created_at on every successful Put, so you should treat those as read-only from the API consumer's perspective.
Multi-stripe blobs
When a blob exceeds the per-stripe capacity (64 MiB for EC 4+2 with 16 MiB shards), Nabu Store transparently splits it into a stripe set: a sequence of independent blobs, each carrying stripe_size, stripe_index, and stripe_total in its metadata. The coordination layer above the storage backend handles splitting and reassembly; each stripe is a fully independent blob as far as the BlobBackend interface is concerned. You will see these fields populated on Stat and List responses for large objects.
Blob identity and the BlobID
Every blob has a BlobID: a 128-bit opaque value encoded as a 32-character lowercase hex string (for example, a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4). You will pass BlobIDs as path parameters or request body fields in API calls.
The system derives BlobIDs in one of two ways, and the choice matters:
| Derivation | How it works | Best for |
|---|---|---|
| Content-addressable | SHA-256(data)[:16] — the same bytes always produce the same ID | Model weights, KV cache entries — enables deduplication and cluster-wide cache hits on repeated prompt prefixes |
| Random | UUIDv4 bit layout — unique per write | Training checkpoints, any object where uniqueness per write is required |
When you send a Put request without an explicit ID, the cluster computes a content-addressable ID for you. If you supply a content_hash (full SHA-256) in the metadata, the backend verifies the hash and rejects the write with ErrBlobCorrupt if there is a mismatch — giving you end-to-end integrity verification at write time.
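For orientation, here is a minimal Go sketch (not part of any official client) showing how you could pre-compute both the content-addressable BlobID and the full content_hash before issuing a Put:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentID derives the 32-character hex BlobID from the blob bytes:
// the first 16 bytes (128 bits) of SHA-256(data), lowercase hex encoded.
func contentID(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:16])
}

// contentHash returns the full SHA-256 digest, suitable for the
// content_hash metadata field used for write-time verification.
func contentHash(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	data := []byte("example blob payload")
	fmt.Println("blob id:      ", contentID(data))   // 32 hex chars
	fmt.Println("content_hash: ", contentHash(data)) // 64 hex chars
}
```

Pre-computing the ID client-side also lets you check for an existing blob with a Stat call and skip the upload entirely on a cache hit.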
If you Put a BlobID that already exists the API returns ErrBlobExists. This is intentional: write-once semantics mean idempotent re-upload of identical content is safe (content-addressable IDs deduplicate automatically), but silent overwrites are not permitted.
Replication policies
Every blob is associated with exactly one replication policy, chosen at write time and stored immutably in its metadata. The policy controls both how much durability overhead you pay and how the cluster distributes data across nodes.
| Policy | Description | Storage overhead | Fault tolerance | Recommended for |
|---|---|---|---|---|
| replica-3 | Three full copies on distinct nodes | 3× | Any 2 simultaneous node failures | Model weights (write-once / read-many; overhead acceptable) |
| ec-4+2 | 4 data shards + 2 parity shards (Reed-Solomon) | 1.5× | Any 2 simultaneous node failures | KV cache blobs, training checkpoints |
| ec-8+2 | 8 data shards + 2 parity shards (Reed-Solomon) | 1.25× | Any 2 simultaneous node failures | Large training corpora where storage efficiency matters most |
Choose replica-3 when read latency and simplicity outweigh storage cost. Choose ec-4+2 or ec-8+2 when you need the same fault tolerance at a fraction of the space. Note that EC reconstruction requires reading from multiple nodes, so very small, latency-sensitive blobs may be better served by replica-3.
Tip: You cannot change a blob's replication policy after it has been written. If you need a different policy, delete and re-upload the blob.
Erasure coding
When you choose ec-4+2 or ec-8+2, Nabu Store applies Reed-Solomon erasure coding before placing data. The blob is split into equal-sized data shards; parity shards are computed using arithmetic over GF(2⁸) (finite field, where addition is XOR). The shards are then distributed across distinct storage nodes via the consistent hash ring.
Original 64 KB blob → split into 4 data shards (D0–D3, 16 KB each)
→ compute 2 parity shards (P0, P1, 16 KB each)
→ place each shard on a distinct node
If up to two nodes fail, the cluster can reconstruct any missing shards from the surviving ones. You do not need to take any action for transparent reconstruction to occur during a Get — the API response returns the reassembled blob as if nothing happened. Reconstruction is visible only in cluster health metrics.
For blobs larger than the per-stripe capacity, the data is broken into multiple stripes, each encoded independently. This allows parallel placement and parallel reconstruction across the ring.
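To make the shard arithmetic concrete, the following standalone Go sketch encodes a 64 KB buffer with a 4+2 Reed-Solomon code and reconstructs it after losing two shards. It uses the open-source github.com/klauspost/reedsolomon package purely for illustration; it is not Nabu Store's internal encoder.

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// 4 data shards + 2 parity shards, matching the ec-4+2 policy.
	enc, err := reedsolomon.New(4, 2)
	if err != nil {
		log.Fatal(err)
	}

	blob := bytes.Repeat([]byte("nabu"), 16*1024) // a 64 KB example blob

	// Split into 4 equal data shards and allocate 2 empty parity shards.
	shards, err := enc.Split(blob)
	if err != nil {
		log.Fatal(err)
	}
	// Compute the parity shards over GF(2^8).
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate losing any two shards (two node failures).
	shards[1], shards[5] = nil, nil

	// Rebuild the missing shards from the surviving four.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	ok, _ := enc.Verify(shards)
	fmt.Println("all shards consistent after reconstruction:", ok)
}
```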
The consistent hash ring
Nabu Store uses a consistent hash ring to decide which storage node owns any given blob or shard, without requiring a central directory. The 128-bit BlobID keyspace is arranged in a circle; each physical node contributes 150 virtual nodes (vnodes) spread around that circle. To locate the node responsible for a given ID, the system binary-searches for the first vnode whose position is ≥ the ID and walks clockwise until it has collected the required number of distinct physical nodes.
The 150-vnode default keeps load imbalance to roughly ±8% across nodes. Fewer vnodes reduce memory usage but increase imbalance; more vnodes improve balance at the cost of higher ring metadata overhead.
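The sketch below is a simplified, self-contained Go illustration of this lookup: vnodes are hashed onto a circle, and a lookup binary-searches to the first vnode at or after the key's position, then walks clockwise collecting distinct physical nodes. It uses a 64-bit hash for brevity, whereas the real ring operates on the 128-bit BlobID keyspace.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type vnode struct {
	pos  uint64 // position on the ring (simplified to 64 bits here)
	node string // owning physical node
}

type ring struct{ vnodes []vnode }

func hashKey(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// newRing places perNode vnodes per physical node on the circle.
func newRing(nodes []string, perNode int) *ring {
	r := &ring{}
	for _, n := range nodes {
		for i := 0; i < perNode; i++ {
			r.vnodes = append(r.vnodes, vnode{hashKey(fmt.Sprintf("%s#%d", n, i)), n})
		}
	}
	sort.Slice(r.vnodes, func(i, j int) bool { return r.vnodes[i].pos < r.vnodes[j].pos })
	return r
}

// lookup finds the first vnode at or after the key's position, then walks
// clockwise until it has collected count distinct physical nodes.
func (r *ring) lookup(blobID string, count int) []string {
	start := sort.Search(len(r.vnodes), func(i int) bool {
		return r.vnodes[i].pos >= hashKey(blobID)
	})
	seen := map[string]bool{}
	var out []string
	for i := 0; len(out) < count && i < len(r.vnodes); i++ {
		n := r.vnodes[(start+i)%len(r.vnodes)].node
		if !seen[n] {
			seen[n] = true
			out = append(out, n)
		}
	}
	return out
}

func main() {
	r := newRing([]string{"node-a", "node-b", "node-c", "node-d"}, 150)
	fmt.Println(r.lookup("a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4", 3))
}
```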
Why this matters for your API usage
You never call the ring directly — the cluster routes requests internally. However, the ring has two observable effects:
- Adding a node triggers a rebalancing migration: blobs whose IDs now fall under the new node's vnode range are copied to it. During migration, reads continue to succeed because the old node still holds the data until migration completes.
- Removing a node triggers re-replication: the cluster copies the departing node's blobs to their next responsible node. Graceful removal (graceful: true in the Leave RPC) waits for re-replication to complete before the node stops serving requests.
The CXL metadata index
Every blob lookup involves two logical steps: find which node owns the blob, then read the blob. Without caching, the first step requires hashing the BlobID against the ring on every request (~1 µs). Nabu Store accelerates this with a metadata index: a Robin Hood hash map memory-mapped to a file (or to a CXL DIMM in production) that caches the node location, NVMe offset, size, and replication policy for each blob at sub-microsecond latency (~300 ns on CXL).
The index is an internal implementation detail — you do not interact with it through the API. Its existence affects you in two ways:
- Read-path performance: The first access after a write goes through the ring; subsequent accesses are served from the index cache.
- CXL tiering: When the cluster is configured with CXL memory, the index is backed by a CXL DIMM (/dev/dax), giving it byte-addressable persistence with near-DRAM latency. When running without CXL (development or test), the index falls back to a regular memory-mapped file with ~1 µs latency.
API design patterns
Write-once and idempotency
Because blobs are immutable, Put requests are naturally idempotent when you use content-addressable IDs: uploading the same bytes twice results in the same BlobID and the second call returns ErrBlobExists without modifying any state. For random IDs, each Put creates a new, unique object. Design your application retry logic accordingly — retrying a failed Put with a content-addressable ID is always safe; retrying with a random ID will create a duplicate if the first attempt partially succeeded.
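As a sketch of what that retry logic can look like, the snippet below treats an "already exists" result as success when the ID is content-addressable. The putBlob helper and the error value are placeholders; map them onto whichever client or HTTP wrapper you actually use.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrBlobExists stands in for however your client surfaces the API's
// ErrBlobExists error; it is a placeholder in this sketch.
var ErrBlobExists = errors.New("blob: exists")

// putBlob is a placeholder for your actual Put call (HTTP or gRPC).
func putBlob(id string, data []byte) error {
	// ... issue the request and map the response to an error ...
	return nil
}

// putWithRetry retries transient failures. With a content-addressable ID,
// an ErrBlobExists response means the bytes are already stored, so it is
// treated as success rather than a conflict.
func putWithRetry(id string, data []byte, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = putBlob(id, data)
		if err == nil || errors.Is(err, ErrBlobExists) {
			return nil
		}
		time.Sleep(time.Duration(i+1) * 100 * time.Millisecond) // simple backoff
	}
	return fmt.Errorf("put failed after %d attempts: %w", attempts, err)
}

func main() {
	data := []byte("example payload")
	if err := putWithRetry("a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4", data, 3); err != nil {
		fmt.Println("upload failed:", err)
	}
}
```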
Range reads
GetRange lets you retrieve a byte range [offset, offset+length) without fetching the whole blob. This is the primary access pattern for tensor-slice reads and for EC stripe reconstruction. Passing length = 0 reads from offset to EOF. Passing offset + length > blob size also reads to EOF without error. Use range reads to minimise network transfer when your application only needs a portion of a large model weight blob.
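If you call the HTTP API directly, a range read is an ordinary GET with a Range header (see Example 2 below). This Go sketch wraps that pattern with net/http; the base URL is a placeholder and error handling is deliberately minimal.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

// getRange fetches [offset, offset+length) of a blob over the HTTP API.
// A length of 0 is expressed as an open-ended range ("bytes=offset-"),
// which reads from offset to EOF.
func getRange(base, blobID string, offset, length int64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, base+"/v1/blobs/"+blobID, nil)
	if err != nil {
		return nil, err
	}
	if length == 0 {
		req.Header.Set("Range", fmt.Sprintf("bytes=%d-", offset))
	} else {
		req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, offset+length-1))
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent && resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// Read the second 16 MiB slice of a large blob (placeholder base URL).
	data, err := getRange("http://localhost:8080", "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4",
		16*1024*1024, 16*1024*1024)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("read", len(data), "bytes")
}
```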
Cursor-based listing
List uses cursor-based pagination rather than page numbers. Each response includes a next_cursor and a has_more flag. Pass the cursor from one response as the cursor of the next request to advance through the keyspace. Cursors are BlobIDs, so the listing order is ascending by BlobID (lexicographic byte order). A zero cursor starts from the beginning of the keyspace.
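In practice you consume this with a loop that feeds next_cursor back into the next request until has_more is false. Below is a minimal Go sketch against the HTTP endpoint shown in Example 4 (the base URL is a placeholder):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// listPage mirrors the List response fields shown in Example 4.
type listPage struct {
	IDs        []string `json:"ids"`
	NextCursor string   `json:"next_cursor"`
	HasMore    bool     `json:"has_more"`
}

// listAll walks the keyspace in ascending BlobID order, 100 IDs at a time.
func listAll(base string) ([]string, error) {
	var all []string
	cursor := "" // zero cursor: start from the beginning of the keyspace
	for {
		url := base + "/v1/blobs?limit=100"
		if cursor != "" {
			url += "&cursor=" + cursor
		}
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		var page listPage
		err = json.NewDecoder(resp.Body).Decode(&page)
		resp.Body.Close()
		if err != nil {
			return nil, err
		}
		all = append(all, page.IDs...)
		if !page.HasMore {
			return all, nil
		}
		cursor = page.NextCursor
	}
}

func main() {
	ids, err := listAll("http://localhost:8080")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("total blobs:", len(ids))
}
```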
Streaming for large objects
For blobs too large to fit comfortably in a single request body, the gRPC interface exposes PutStream (client-side streaming) and GetStream (server-side streaming). Each chunk carries a final boolean; set final: true on the last chunk of a PutStream to commit the write atomically. Partial streams that are never finalised are never visible to readers.
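A client-side streaming upload typically follows the shape sketched below. The generated package (nabupb), service client, and message field names are placeholders invented for this illustration; only the final-chunk semantics follow the description above, so check the actual .proto definitions for the real stubs.

```go
// Sketch only: nabupb is a placeholder for the package generated from
// Nabu Store's .proto files; type and field names here are assumptions.
func uploadStream(ctx context.Context, client nabupb.BlobServiceClient, r io.Reader) error {
	stream, err := client.PutStream(ctx)
	if err != nil {
		return err
	}
	buf := make([]byte, 4*1024*1024) // chunk size is arbitrary in this sketch
	sentFinal := false
	for !sentFinal {
		n, readErr := r.Read(buf)
		if readErr != nil && readErr != io.EOF {
			return readErr
		}
		sentFinal = readErr == io.EOF
		if n > 0 || sentFinal {
			// Setting final on the last chunk commits the write atomically.
			if err := stream.Send(&nabupb.PutChunk{Data: buf[:n], Final: sentFinal}); err != nil {
				return err
			}
		}
	}
	// CloseAndRecv completes the client stream and returns the server's response.
	_, err = stream.CloseAndRecv()
	return err
}
```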
Error semantics
| Error | Meaning | Suggested action |
|---|---|---|
| ErrBlobNotFound | No blob with the given ID exists | Verify the ID; check if the blob was ever written |
| ErrBlobExists | A blob with this ID already exists | For content-addressable IDs, this is normal — the blob is already present |
| ErrBlobCorrupt | SHA-256 mismatch on Put, or data file missing | Retry the upload; if persistent, the node may need recovery |
| ErrInvalidRange | offset < 0 or offset > blob size | Correct the offset before retrying |
| ErrBackendClosed | The backend was shut down | The node is restarting or leaving the cluster; retry against another node |
Example 1: Uploading a blob with content integrity verification
Upload a small blob, supplying the full SHA-256 hash so the cluster verifies integrity at write time. On success the response contains the assigned BlobID and the stored size.
POST /v1/blobs HTTP/1.1
Content-Type: application/octet-stream
X-Nabu-Policy: replica-3
X-Nabu-Content-Hash: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Nabu-Label-purpose: model-weights
X-Nabu-Label-version: v2.1
<binary model weight bytes>
Expected response:
{
"id": "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4",
"size": 134217728
}
If the supplied hash does not match the uploaded bytes, the server rejects the write:
{
"error": "blob: corrupt",
"message": "SHA-256 mismatch: content does not match provided content_hash"
}
Example 2: Fetching a byte range from a large blob
Read bytes 16,777,216 through 33,554,431 (the second 16 MiB slice) of a stored blob without downloading the entire object. This is the typical access pattern for tensor-slice inference.
GET /v1/blobs/a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4 HTTP/1.1
Range: bytes=16777216-33554431
Expected response (HTTP 206 Partial Content):
HTTP/1.1 206 Partial Content
Content-Range: bytes 16777216-33554431/134217728
Content-Length: 16777216
Content-Type: application/octet-stream
<16 MiB binary slice>
Passing an offset beyond the blob size returns an error:
{
"error": "blob: invalid range",
"message": "offset 200000000 exceeds blob size 134217728"
}
Example 3: Uploading with an EC policy for a KV cache blob
Choose ec-4+2 to store a KV cache checkpoint at 1.5× overhead with tolerance for any two simultaneous node failures.
PUT /v1/blobs/content HTTP/1.1
Content-Type: application/octet-stream
X-Nabu-Policy: ec-4+2
X-Nabu-Label-type: kv-cache
X-Nabu-Label-session: sess_abc123
<binary KV cache bytes>
Expected response:
{
"id": "f7e6d5c4b3a2f7e6d5c4b3a2f7e6d5c4",
"size": 67108864
}
Example 4: Paginating through stored blobs
List the first 100 blobs, then advance to the next page using the returned cursor.
First request — start from the beginning of the keyspace:
GET /v1/blobs?limit=100 HTTP/1.1
Expected response:
{
"ids": [
"0000000000000000000000000000000a",
"0000000000000000000000000000001f",
"..."
],
"next_cursor": "0000000000000000000000000000001f",
"has_more": true
}
Second request — continue from the cursor:
GET /v1/blobs?cursor=0000000000000000000000000000001f&limit=100 HTTP/1.1
When has_more is false, you have reached the end of the keyspace.
Example 5: Checking blob metadata without downloading data
Verify that a blob exists and inspect its replication policy and labels before deciding whether to re-upload.
HEAD /v1/blobs/a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4 HTTP/1.1
Expected response (blob exists):
{
"exists": true,
"meta": {
"size": 134217728,
"created_at": 1740000000000000000,
"checksum": "e3b0c44298fc1c149afbf4c8996fb924...",
"labels": {
"purpose": "model-weights",
"version": "v2.1"
}
}
}
Expected response (blob does not exist):
{
"exists": false,
"meta": null
}
Next steps
- Install and start a single-node cluster — Before you can store or retrieve blobs, you need a running cluster. Start here if you have not yet deployed Nabu Store.
- Store a blob — A step-by-step walkthrough of your first Put request, including how to set labels and choose a replication policy.
- Retrieve a blob — How to perform full and range Get requests, including streaming for large objects.
- Choose and apply a replication policy — A deeper decision guide for selecting between replica-3, ec-4+2, and ec-8+2 based on your workload's durability, latency, and cost requirements.
- Deploy a multi-node cluster on Kubernetes — How the consistent hash ring is initialised when you add nodes, and what ring version synchronisation means for your deployment.
- Expand the cluster (add a node) — How the ring rebalances when a node joins, what migration looks like in practice, and how to monitor progress via the API.
- Diagnose and recover from a node failure — How erasure coding enables transparent reconstruction, when the cluster falls back to ring lookup versus the metadata index, and which API error codes signal partial degradation.
- Enable CXL memory tiering — How to configure the metadata index to use a CXL DIMM for sub-microsecond lookup latency instead of the default memory-mapped file.
- Authenticate and issue API requests programmatically — How to obtain and present credentials for all the API calls described on this page.
- Register a custom backend or EC plugin — How the BlobBackend interface contract works in practice if you are implementing your own storage backend.
