Architecture
Service architecture, components, and data flow
This page describes the internal architecture of Nabu Store — the four core subsystems that make up the hyper-converged AI inference storage system, how they are composed into a running server, and how data moves through the system from the moment you issue an HTTP/gRPC request to the moment bytes land on NVMe. Understanding this architecture helps you make informed decisions about replication policies, cluster topology, hardware selection, and plugin configuration before you deploy at scale.
Nabu Store is built from four tightly coupled subsystems plus the gRPC server that binds them together. Each component has a single, well-scoped responsibility.
Blob Store (blob/)
The Blob Store is the public storage interface — everything you read or write is a blob. Every blob is identified by a 128-bit BlobID that can be derived in three ways: content-addressable (SHA-256 of the data, enabling automatic deduplication), random (UUID v4 layout, guaranteeing uniqueness per write), or parsed from a hex string you already have. The BlobBackend interface — Put, Get, GetRange, Delete, Stat, List, Capacity — is what all higher-level code calls, and it is the extension point for custom storage hardware. Blobs are write-once and immutable after a successful Put; partial writes are never visible, because the default LocalFSBackend stages data in a tmp/ directory and renames it to the live path only after the SHA-256 checksum is verified. For blobs larger than 64 MiB the store automatically applies a multi-stripe layout, splitting the data across independently placed stripe groups tracked by the StripeIndex and StripeTotal fields in BlobMeta.
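To make the contract concrete, here is a minimal Go sketch of what the BlobBackend interface and its supporting types might look like. The method names come from the list above; the exact signatures, the context parameters, and the BlobMeta field set are illustrative assumptions, not the canonical definitions:

```go
package blob

import "context"

// BlobID is the 128-bit identifier described above.
type BlobID [16]byte

// BlobMeta carries per-blob bookkeeping, including the multi-stripe
// fields mentioned in the text (field set is illustrative).
type BlobMeta struct {
	ID          BlobID
	Size        int64
	StripeIndex uint32
	StripeTotal uint32
}

// BlobBackend is the extension point for custom storage hardware.
// Hypothetical signatures; the real interface may differ in detail.
type BlobBackend interface {
	Put(ctx context.Context, id BlobID, data []byte) error
	Get(ctx context.Context, id BlobID) ([]byte, error)
	GetRange(ctx context.Context, id BlobID, offset, length int64) ([]byte, error)
	Delete(ctx context.Context, id BlobID) error
	Stat(ctx context.Context, id BlobID) (BlobMeta, error)
	List(ctx context.Context) ([]BlobID, error)
	Capacity(ctx context.Context) (used, total int64, err error)
}
```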
Three replication policies are available as first-class options on every Put:
| Policy | Mechanism | Storage overhead | Primary use case |
|---|---|---|---|
| Replica3 | 3 full copies | 3× | Model weights — maximum read parallelism, tolerates 2 full node losses |
| EC42 | 4 data + 2 parity shards | 1.5× | KV cache, training checkpoints — good balance of durability and space |
| EC82 | 8 data + 2 parity shards | 1.25× | Large training corpora — lowest overhead, still tolerates any 2 failures |
Consistent Hash Ring (ring/)
The Ring decides where each blob or shard lives in the cluster. It implements a distributed hash table using consistent hashing over a 128-bit keyspace. Each physical node is mapped to 150 virtual nodes (vnodes), whose positions are derived from SHA256(nodeID + ":" + i)[:16]. This level of over-provisioning keeps load standard deviation around 8 % across the cluster — halving vnodes to 75 raises that to roughly 14 %, while doubling to 300 drops it to about 5 % at significant memory cost. When you call ring.Lookup(blobID, n), it performs a binary search for the first vnode at or clockwise from blobID, then walks the ring collecting distinct physical nodes until it has n candidates — O(log V + n) complexity. Node join and removal are fully online: before a node joins, KeysToMigrate identifies which blob IDs will shift ownership so you can copy them first; before a node leaves, KeysToEvict identifies blobs that need re-replication to maintain policy guarantees.
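The placement logic is compact enough to sketch end to end. This toy Go version derives vnode positions as described (SHA256(nodeID + ":" + i), truncated — to 64 bits here rather than 128, for brevity) and implements the binary-search-plus-clockwise-walk lookup; anything beyond what the text states is an illustrative assumption:

```go
package ring

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
)

// vnode is one virtual position on the ring. The real keyspace is
// 128-bit; this sketch keeps only the top 64 bits of each position.
type vnode struct {
	pos    uint64
	nodeID string
}

// Ring holds all vnodes sorted by position.
type Ring struct {
	vnodes []vnode
}

// vnodePos derives position i for a node as SHA256(nodeID + ":" + i),
// truncated (to 64 bits here, 128 in the real system).
func vnodePos(nodeID string, i int) uint64 {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", nodeID, i)))
	return binary.BigEndian.Uint64(sum[:8])
}

// AddNode inserts 150 vnodes for one physical node and re-sorts.
func (r *Ring) AddNode(nodeID string) {
	for i := 0; i < 150; i++ {
		r.vnodes = append(r.vnodes, vnode{pos: vnodePos(nodeID, i), nodeID: nodeID})
	}
	sort.Slice(r.vnodes, func(a, b int) bool { return r.vnodes[a].pos < r.vnodes[b].pos })
}

// Lookup binary-searches for the first vnode at or clockwise from key,
// then walks clockwise collecting n distinct physical nodes, giving
// the O(log V + n) behaviour described above.
func (r *Ring) Lookup(key uint64, n int) []string {
	if len(r.vnodes) == 0 {
		return nil
	}
	start := sort.Search(len(r.vnodes), func(i int) bool { return r.vnodes[i].pos >= key })
	seen := make(map[string]bool, n)
	nodes := make([]string, 0, n)
	for i := 0; i < len(r.vnodes) && len(nodes) < n; i++ {
		v := r.vnodes[(start+i)%len(r.vnodes)]
		if !seen[v.nodeID] {
			seen[v.nodeID] = true
			nodes = append(nodes, v.nodeID)
		}
	}
	return nodes
}
```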
Erasure Coding Engine (ec/)
The EC engine encodes blobs into shards using Reed-Solomon codes over GF(2⁸) — a finite field where addition is XOR and multiplication uses pre-computed log/exp tables, enabling SIMD acceleration (AVX-512 via Intel ISA-L in production). For EC42 a 64 KiB blob is split into four 16 KiB data shards and two 16 KiB parity shards; any two of the six can be lost and the original data is fully reconstructed. For large blobs the engine applies multi-stripe encoding: ec.EncodeBlob(data, codec, shardSize) returns a slice of StripeGroup values, each with its own data and parity shards, and each stripe is placed independently through the Ring. Reconstruction calls codec.Reconstruct(shards) with nil entries for missing shards; ReconstructData is a faster variant that skips rebuilding parity when you only need the original bytes.
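The ec/ API itself is not reproduced here, but the encode, lose, reconstruct cycle it implements can be demonstrated end to end with the open-source klauspost/reedsolomon Go library (chosen purely for illustration; the text does not say this is the engine behind ec/):

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// EC42: 4 data shards + 2 parity shards.
	enc, err := reedsolomon.New(4, 2)
	if err != nil {
		panic(err)
	}

	data := bytes.Repeat([]byte("nabu"), 16*1024) // a 64 KiB payload

	// Split allocates six shards (four 16 KiB data, two empty parity);
	// Encode fills the parity shards using GF(2^8) arithmetic.
	shards, err := enc.Split(data)
	if err != nil {
		panic(err)
	}
	if err := enc.Encode(shards); err != nil {
		panic(err)
	}

	// Lose any two shards: one data, one parity.
	shards[1], shards[5] = nil, nil

	// ReconstructData rebuilds only the missing data shards and leaves
	// the lost parity shard nil, the faster path when you just want
	// the original bytes back.
	if err := enc.ReconstructData(shards); err != nil {
		panic(err)
	}

	var out bytes.Buffer
	if err := enc.Join(&out, shards, len(data)); err != nil {
		panic(err)
	}
	fmt.Println("recovered intact:", bytes.Equal(out.Bytes(), data))
}
```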
CXL Metadata Index (index/)
The Index is a Robin Hood open-addressing hash map backed by a memory-mapped file, designed to be swapped for a real CXL DIMM (/dev/dax) in production. Its purpose is to eliminate a full ring traversal on every read: a direct index.Get(blobID) returns the primary node ID, NVMe byte offset, size, and policy in roughly 300 ns from CXL (versus ~1 µs from the ring lookup), keeping the read fast path entirely in userspace. Each 128-byte slot holds the 16-byte BlobID key, a probe-sequence length counter used by Robin Hood eviction, an occupied/empty flag, and a BlobIndexEntry value. When the load factor exceeds 0.72 the map automatically snapshots all entries, doubles capacity, and rehashes — you never need to pre-size it manually. A miss falls back gracefully to a ring lookup, so the index is an optimistic cache, not a single point of failure.
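A condensed Go sketch of the lookup side, showing the slot layout and the PSL early-exit rule; the hash function, field names, and the in-memory (rather than mmap-backed) slot array are simplifications for illustration:

```go
package index

// BlobID and BlobIndexEntry mirror the descriptions above; the field
// sets and hash function are simplified for illustration.
type BlobID [16]byte

type BlobIndexEntry struct {
	NodeID string
	Offset int64
	Size   int64
	Policy uint8
}

// slot mirrors the 128-byte layout: key, probe-sequence length,
// occupancy flag, and the value entry.
type slot struct {
	key      BlobID
	psl      uint8
	occupied bool
	val      BlobIndexEntry
}

// Index is shown with an ordinary slice; the real system maps the
// slot array from an mmap'd file or a /dev/dax CXL device.
type Index struct {
	slots []slot
}

func (ix *Index) home(id BlobID) int {
	h := uint64(0)
	for _, b := range id[:8] {
		h = h<<8 | uint64(b)
	}
	return int(h % uint64(len(ix.slots)))
}

// Get probes from the home slot. The Robin Hood invariant yields an
// early exit: at probe distance d, an empty slot or a resident with
// PSL < d proves the key is absent, so the caller can fall back to a
// ring lookup immediately.
func (ix *Index) Get(id BlobID) (BlobIndexEntry, bool) {
	n := len(ix.slots)
	if n == 0 {
		return BlobIndexEntry{}, false
	}
	h := ix.home(id)
	for d := 0; d < n; d++ {
		s := &ix.slots[(h+d)%n]
		if !s.occupied || int(s.psl) < d {
			return BlobIndexEntry{}, false
		}
		if s.key == id {
			return s.val, true
		}
	}
	return BlobIndexEntry{}, false
}
```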
gRPC Server (server/)
The server is the runtime that composes all four subsystems into a network-accessible service. It exposes three gRPC service groups: BlobService (client-facing Put and Get), ClusterService (node join/leave, heartbeat), and InternalService (shard replication and inter-node transfers). On the HTTP side it serves a Prometheus /metrics endpoint and a health check at /health. The server maintains a gossip-style cluster view: every HeartbeatInterval (default 5 s) each node exchanges capacity, blob count, and ring version with all peers it knows about, converging on a consistent cluster state without a central coordinator. Configuration is handled through the Config struct, which controls the storage backend (localfs or spdk), CXL tiering, vnode count, seed nodes, and graceful shutdown timeout.
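A sketch of what that Config surface might look like. The concepts all come from the text (backend selection, CXL tiering, vnode count, seed nodes, heartbeat, shutdown timeout), but most of the field names here are guesses rather than the real identifiers:

```go
package server

import "time"

// Config sketches the knobs listed above. The Backend values
// ("localfs", "spdk"), the EnableCXL flag, and the 5 s heartbeat
// default appear in the documentation; other names are illustrative.
type Config struct {
	Backend           string        // "localfs" for development, "spdk" for production
	EnableCXL         bool          // back the index with /dev/dax instead of plain mmap
	VNodesPerNode     int           // virtual nodes per physical node, default 150
	SeedNodes         []string      // peers contacted on startup to join the cluster
	HeartbeatInterval time.Duration // gossip period, default 5 * time.Second
	ShutdownTimeout   time.Duration // grace period for in-flight requests on shutdown
}
```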
The following traces walk a blob through the complete write path and then the read path, including the EC variant, so you can see exactly which components are involved at each step.
Write path — EC42 policy
Step 1 — Client issues PUT
You send a PutRequest over gRPC (or HTTP) with the raw blob bytes and policy = EC42. The blobServiceImpl.Put handler in server/grpc_services.go receives the request.
Step 2 — Content-addressed BlobID generation
The handler computes SHA-256(data) and takes the first 16 bytes as the BlobID. This means two identical payloads always produce the same ID, and a second Put of the same content is a no-op — deduplication comes for free.
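The derivation is only a few lines of Go; this standalone sketch (the helper name is hypothetical) shows why identical payloads dedupe automatically:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// blobIDFromContent (hypothetical helper name) derives a
// content-addressed ID as SHA-256(data)[:16].
func blobIDFromContent(data []byte) [16]byte {
	sum := sha256.Sum256(data)
	var id [16]byte
	copy(id[:], sum[:16])
	return id
}

func main() {
	a := blobIDFromContent([]byte("model weights v1"))
	b := blobIDFromContent([]byte("model weights v1"))
	// Identical payloads yield identical IDs, so a second Put of the
	// same content is a no-op and deduplication falls out for free.
	fmt.Printf("%x\n%x\nequal: %v\n", a, b, a == b)
}
```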
Step 3 — Erasure encode
ec.EncodeBlob(data, ec42Codec, shardSize) splits the payload into stripes. For data under 64 MiB the shard size is 1 MiB; for larger blobs it rises to 16 MiB. Each stripe produces four data shards and two parity shards. The encoding is entirely in userspace using GF(2⁸) arithmetic.
Step 4 — Shard placement via the Ring
For each stripe, a deterministic stripe key is derived by hashing the BlobID together with the stripe index. ring.Lookup(stripeKey, 6) returns six distinct physical nodes — one per shard — by walking the 128-bit vnode ring clockwise. This ensures shards from the same stripe are spread across different machines.
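A standalone sketch of the key derivation, assuming the stripe key is the truncated SHA-256 of the BlobID concatenated with the stripe index (the exact byte layout is not specified in the text):

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// stripeKey derives a deterministic placement key for one stripe. The
// byte layout is an assumption; what matters is that every node
// computes the same key from the same inputs, with no coordination.
func stripeKey(blobID [16]byte, stripe uint32) [16]byte {
	buf := make([]byte, 0, 20)
	buf = append(buf, blobID[:]...)
	buf = binary.BigEndian.AppendUint32(buf, stripe)
	sum := sha256.Sum256(buf)
	var key [16]byte
	copy(key[:], sum[:16])
	return key
}

func main() {
	var id [16]byte // a BlobID from an earlier Put
	// Each stripe gets its own key, so ring.Lookup(key, 6) places each
	// stripe's six shards on an independently chosen set of nodes.
	for stripe := uint32(0); stripe < 3; stripe++ {
		fmt.Printf("stripe %d → key %x\n", stripe, stripeKey(id, stripe))
	}
}
```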
Step 5 — Concurrent shard storage
The server iterates over all stripes and shards. Shards assigned to the local node are written directly to BlobBackend.Put — staged atomically through a temp file with SHA-256 verification, then renamed into data/{shard}/{id}.blob. Shards on remote nodes are dispatched in parallel goroutines via InternalService.StoreShard gRPC calls. The server waits for all goroutines with a sync.WaitGroup. Up to parityShards failures are tolerated before the write is declared failed.
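The fan-out is ordinary Go concurrency. This self-contained sketch (storeShard is a hypothetical stand-in for the local Put or the remote StoreShard RPC) shows the WaitGroup dispatch and the parityShards failure budget:

```go
package main

import (
	"fmt"
	"sync"
)

// storeShard is a hypothetical stand-in for either a local
// BlobBackend.Put or a remote InternalService.StoreShard RPC.
func storeShard(node string, shard []byte) error {
	if node == "node-4" {
		return fmt.Errorf("node-4 unreachable") // simulate one failure
	}
	return nil
}

func main() {
	const parityShards = 2
	nodes := []string{"node-0", "node-1", "node-2", "node-3", "node-4", "node-5"}
	shards := make([][]byte, len(nodes)) // 4 data + 2 parity for EC42

	// Dispatch all six stores concurrently and wait for the fan-out.
	var wg sync.WaitGroup
	errs := make([]error, len(shards))
	for i := range shards {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			errs[i] = storeShard(nodes[i], shards[i])
		}(i)
	}
	wg.Wait()

	// The write succeeds as long as no more than parityShards fail.
	failed := 0
	for _, err := range errs {
		if err != nil {
			failed++
		}
	}
	if failed > parityShards {
		fmt.Println("write failed: too many shard losses")
	} else {
		fmt.Printf("write succeeded with %d shard failure(s)\n", failed)
	}
}
```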
Step 6 — Index update
Once shards are durable, the coordinator node calls index.Put(blobID, BlobIndexEntry{NodeID, Size, Policy}) to record the blob's primary location, size, and policy. Subsequent reads can use this entry to skip the ring lookup.
Step 7 — Response
A PutResponse containing the BlobID and final size is returned to the caller.
Read path — EC42 policy (with reconstruction)
Step 1 — Client issues GET
You send a GetRequest with the BlobID bytes you received from the earlier Put.
Step 2 — Index fast-path lookup
index.Get(blobID) is attempted first (~300 ns on CXL). If the entry exists and shows an EC policy, the getEC path is taken. If the index misses, the handler falls back to getReplicated, which tries the local backend then queries up to three ring candidates.
Step 3 — Stripe and shard topology reconstruction
The handler re-derives the number of stripes from entry.Size and shardSize, then for each stripe calls ring.Lookup(stripeKey, 6) to learn which node holds each shard — the ring is deterministic, so this always returns the same placement without any extra coordination.
Step 4 — Parallel shard fetch
For each stripe, the handler attempts to fetch all six shards concurrently. Local shards are read from BlobBackend.Get; remote shards are fetched via InternalService gRPC calls. Shards that are unavailable (node down, network error) are left as nil in the shard slice.
Step 5 — Reconstruction (if needed)
If any shards are nil, codec.ReconstructData(shards) is called. As long as at least four of the six shards are available — any four — the original data shards are fully recovered using the GF(2⁸) parity equations. This is transparent to your application.
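The fetch-and-count logic from steps 4 and 5 condenses to a few lines; fetchShard here is a hypothetical stand-in for the local backend read or the remote gRPC fetch:

```go
package main

import (
	"fmt"
	"sync"
)

// fetchShard is a hypothetical stand-in for a local BlobBackend.Get
// or a remote InternalService fetch; it returns nil when the shard's
// node is unreachable.
func fetchShard(node string) []byte {
	if node == "node-2" || node == "node-5" {
		return nil // two nodes down: still within EC42's tolerance
	}
	return make([]byte, 1<<20) // a 1 MiB shard
}

func main() {
	const dataShards = 4
	nodes := []string{"node-0", "node-1", "node-2", "node-3", "node-4", "node-5"}
	shards := make([][]byte, len(nodes))

	// Fetch all six shards concurrently; nil marks a missing shard.
	var wg sync.WaitGroup
	for i := range shards {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			shards[i] = fetchShard(nodes[i])
		}(i)
	}
	wg.Wait()

	available := 0
	for _, s := range shards {
		if s != nil {
			available++
		}
	}
	// With at least dataShards of six present, codec.ReconstructData
	// can rebuild the rest; below that the blob is unrecoverable.
	if available >= dataShards {
		fmt.Printf("%d/6 shards available: reconstruction possible\n", available)
	} else {
		fmt.Printf("%d/6 shards available: blob unrecoverable\n", available)
	}
}
```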
Step 6 — Stripe reassembly and response
Data shards from each stripe are concatenated in order. The final assembled bytes are returned in a GetResponse.
Cluster membership and heartbeat flow
In parallel with all data operations, each node runs a background runHeartbeat goroutine that fires every HeartbeatInterval (default 5 s). Each heartbeat exchange carries usedBytes, blobCount, and ringVersion. If a peer's LastHeartbeat goes stale beyond the configured threshold, the node is marked NodeStatusOffline in the local cluster view; when it responds again it transitions back to NodeStatusActive. This gossip loop keeps the consistent hash ring and node pool converged without a central metadata service.
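A minimal sketch of the staleness transition; the 3× interval threshold is an illustrative choice, since the text only says the threshold is configurable:

```go
package main

import (
	"fmt"
	"time"
)

type nodeStatus int

const (
	nodeStatusActive nodeStatus = iota
	nodeStatusOffline
)

type peer struct {
	lastHeartbeat time.Time
	status        nodeStatus
}

// markStale flips peers whose last heartbeat is older than the
// threshold to offline, and recovers them when a fresh heartbeat
// arrives. The 3x multiple is illustrative, not a documented default.
func markStale(peers map[string]*peer, interval time.Duration, now time.Time) {
	threshold := 3 * interval
	for id, p := range peers {
		switch {
		case now.Sub(p.lastHeartbeat) > threshold && p.status == nodeStatusActive:
			p.status = nodeStatusOffline
			fmt.Println(id, "→ offline")
		case now.Sub(p.lastHeartbeat) <= threshold && p.status == nodeStatusOffline:
			p.status = nodeStatusActive
			fmt.Println(id, "→ active")
		}
	}
}

func main() {
	now := time.Now()
	peers := map[string]*peer{
		"node-1": {lastHeartbeat: now.Add(-2 * time.Second), status: nodeStatusActive},
		"node-2": {lastHeartbeat: now.Add(-30 * time.Second), status: nodeStatusActive},
	}
	markStale(peers, 5*time.Second, now) // node-2 → offline
}
```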
Design decisions
1. Content-addressable BlobIDs by default
BlobIDs are derived from SHA-256(data)[:16] rather than assigned by a central allocator. This choice eliminates a round-trip to a metadata server on every write, makes deduplication automatic, and means any node can verify data integrity without out-of-band coordination. The trade-off is that two logically distinct blobs with identical bytes will silently alias — callers that need guaranteed uniqueness should use BlobIDRandom() explicitly.
2. 150 virtual nodes per physical node
The vnode count of 150 was chosen as the pragmatic midpoint in the load-distribution/memory curve. At 100 vnodes load standard deviation is ~10 %; at 150 it is ~8 %; at 300 it is ~5 % but memory consumption doubles. For a storage system serving AI inference workloads where per-node capacity is large (tens of TiB of NVMe), the 150-vnode regime provides acceptable balance while keeping the ring table small enough to fit in L3 cache on the control path.
3. Robin Hood hashing for the index
Standard open addressing degrades under high load because long probe chains cluster around hot slots. Robin Hood hashing moves entries with shorter-than-average probe sequences to make room for entries with longer ones, keeping the maximum probe length bounded and enabling an early-exit rule: if a lookup's current PSL exceeds the slot's PSL, the key provably does not exist. This is critical for the read fast path, where a cache miss must be detected quickly so the system can fall back to the ring without incurring a long linear scan.
4. Auto-resize at 0.72 load factor
The 0.72 threshold was chosen because Robin Hood hashing degrades noticeably above 75 % occupancy even with PSL balancing. Resizing at 72 % provides a safety margin without triggering excessive resize cycles during typical write-heavy ingestion bursts. Resize doubles capacity and rehashes atomically from a snapshot, so reads and writes can continue on the old map while the new one is being built.
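The growth rule itself is simple. This toy table (plain linear probing with no Robin Hood displacement, in-memory rather than mmap-backed) isolates just the snapshot, double, rehash behaviour at the 0.72 boundary:

```go
package main

import "fmt"

// table is a toy open-addressing map showing only the resize rule:
// grow when the load factor would exceed 0.72, by snapshotting
// entries, doubling capacity, and rehashing.
type table struct {
	keys  []uint64
	used  []bool
	count int
}

func newTable(capacity int) *table {
	return &table{keys: make([]uint64, capacity), used: make([]bool, capacity)}
}

func (t *table) put(k uint64) {
	if float64(t.count+1)/float64(len(t.keys)) > 0.72 {
		old := t
		grown := newTable(2 * len(t.keys)) // double capacity
		for i, u := range old.used {
			if u {
				grown.put(old.keys[i]) // rehash from the snapshot
			}
		}
		*t = *grown
	}
	i := int(k % uint64(len(t.keys)))
	for t.used[i] {
		if t.keys[i] == k {
			return
		}
		i = (i + 1) % len(t.keys)
	}
	t.keys[i], t.used[i] = k, true
	t.count++
}

func main() {
	t := newTable(8)
	for k := uint64(1); k <= 20; k++ {
		t.put(k * 2654435761) // spread keys around the table
	}
	fmt.Printf("entries=%d capacity=%d load=%.2f\n",
		t.count, len(t.keys), float64(t.count)/float64(len(t.keys)))
}
```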
5. Write-once blob semantics
Blobs are immutable after a successful Put. This decision eliminates the need for distributed locking on updates, makes content verification trivially checkable at any time, and aligns naturally with AI workload access patterns — model weights and KV cache entries are written once and read many times. Applications that need versioned mutable objects must manage versioning above the blob layer (for example, by storing a new blob and updating a pointer).
6. EC over full replication for large data
EC42 (1.5× overhead, tolerates 2 failures) and EC82 (1.25× overhead, tolerates 2 failures) were chosen alongside Replica3 (3× overhead) rather than replacing replication entirely. Full replication provides lower read latency and higher read parallelism at the cost of space — which is acceptable for model weights that must be served with minimal latency across many concurrent inference pods. EC is reserved for workloads where storage efficiency matters more than absolute read throughput, such as KV cache and training corpora.
7. Decentralised gossip over a central metadata store
Cluster membership and node health are maintained through periodic peer-to-peer heartbeats rather than a consensus store such as etcd. This eliminates etcd as a data-path dependency: a storage node can continue serving reads and writes even if the Kubernetes control plane is temporarily unavailable, because ring topology and node liveness are known locally. The trade-off is eventual consistency — a freshly joined node may not immediately appear in every peer's view — but for a storage system where blob placement is deterministic, a few seconds of lag is acceptable.
8. Plugin architecture for backends and EC engines
The BlobBackend, ec.Codec, and index.Index interfaces are explicitly designed as extension points. server.New uses built-in constructors for development; server.NewWithPlugins accepts pre-opened plugin implementations. This allows system integrators to substitute the SPDK NVMe backend, a custom ISA-L-accelerated EC engine, or a hardware-backed CXL index without forking the core server code.
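A sketch of the dependency-injection shape this implies; the one-method interfaces and the NewWithPlugins signature below are stand-ins for the real, much richer ones:

```go
package main

import "fmt"

// One-method stand-ins for the three extension-point interfaces named
// in the text; the real method sets are of course larger.
type BlobBackend interface{ Name() string }
type Codec interface{ Name() string }
type Index interface{ Name() string }

type spdkBackend struct{}

func (spdkBackend) Name() string { return "spdk" }

type isalCodec struct{}

func (isalCodec) Name() string { return "isa-l" }

type cxlIndex struct{}

func (cxlIndex) Name() string { return "cxl-dax" }

// NewWithPlugins mirrors the composition pattern described above:
// pre-opened plugin implementations are injected in place of the
// built-in development constructors (signature is illustrative).
func NewWithPlugins(b BlobBackend, c Codec, ix Index) string {
	return fmt.Sprintf("server[backend=%s ec=%s index=%s]", b.Name(), c.Name(), ix.Name())
}

func main() {
	fmt.Println(NewWithPlugins(spdkBackend{}, isalCodec{}, cxlIndex{}))
}
```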
Known limitations
Single-stripe EC for small blobs. Blobs smaller than shardSize × dataShards (1 MiB × 4 = 4 MiB for EC42 at default settings) are padded out to full-size shards, parity included. A 100 KiB blob encoded with EC42 therefore produces six 1 MiB shards, consuming 6 MiB of NVMe instead of the nominal 150 KiB. If your workload involves many small blobs, Replica3 or EC42 with a reduced shardSize is more space-efficient.
Index does not survive a hard crash without msync / clflush. In the default mmap (development) mode, index state is not guaranteed to be flushed to disk after every write. A hard power loss can leave the index inconsistent. In production you must either enable the CXL path (EnableCXL: true) with clflush/clwb semantics or tolerate a cold rebuild of the index from the blob store on restart.
Ring convergence is eventual, not immediate. Because membership state is propagated by gossip heartbeats every 5 s, a newly added or removed node may not be reflected in every peer's ring for up to one heartbeat interval. During this window, a small fraction of requests may be routed to a node that no longer owns the target range. The application-level retry behaviour on codes.NotFound is the primary mitigation.
No POSIX compatibility on the hot path. The BlobBackend interface is intentionally non-POSIX. If you need to mount the store as a filesystem for tools that require directory semantics, you must use the POSIX compatibility shim, which is explicitly not on the fast path and carries the overhead of inode and directory emulation.
EC reconstruction requires at least dataShards available shards. EC42 tolerates any two simultaneous failures. A third simultaneous failure makes the blob unrecoverable until a failed shard is repaired. If your failure domain analysis requires tolerating three or more simultaneous node losses, you must use Replica3 or a wider EC configuration (which is not currently a built-in policy and would require a custom EC plugin).
Alternatives not chosen
Centralised metadata server (etcd/ZooKeeper). Using a strongly consistent metadata store would give linearisable ring state and eliminate the gossip convergence window, but it introduces a network round-trip on every blob placement decision and makes the data path dependent on control-plane availability. The gossip model was preferred for latency and availability reasons.
POSIX filesystem backend in production. A VFS-based backend (ext4, XFS) was considered for simplicity but rejected because kernel page-cache overhead and VFS lock contention degrade throughput at the queue depths required by AI inference workloads. SPDK with poll-mode NVMe drivers is the production path.
Fixed shard-to-node mapping (static partitioning). A static partition table (like Cassandra's token ranges) would avoid the vnode overhead entirely, but it makes adding a single node require manual re-partitioning of the entire keyspace. Consistent hashing with vnodes rebalances incrementally with minimal data movement.
When to choose a different approach
- If you need sub-100 µs object access latency end-to-end, consider whether your workload fits entirely in GPU HBM or CXL DRAM — Nabu Store is optimised for NVMe-backed workloads in the microsecond-to-millisecond range.
- If your blobs are primarily small (< 64 KiB) and access patterns are random, the per-blob index entry overhead and EC padding may make an object store with inline data (e.g., a key-value store) more efficient.
- If you require strict linearisability for concurrent writers to the same logical object, you will need to implement optimistic concurrency control above the blob layer, since the blob interface is write-once by design.
