NabuStore
Concept


Overview

AI inference workloads impose storage demands that are fundamentally different from traditional enterprise applications. This page explains why the I/O patterns, latency requirements, and bandwidth characteristics of inference pipelines expose the limits of conventional SAN-based storage architectures, and how Nabu Store is designed to address those gaps directly.


Content

How AI Inference Generates a Different I/O Profile

Traditional enterprise storage systems were designed around transactional workloads: many small, random reads and writes with modest throughput requirements and latency tolerances measured in milliseconds. AI inference breaks every one of those assumptions.

Large, Immutable Objects at High Concurrency

At the core of any inference pipeline are model weights — binary blobs ranging from hundreds of megabytes to hundreds of gigabytes. Unlike database records, these objects are written once and then read repeatedly, often by many inference workers simultaneously. A single GPU server may load a 70-billion-parameter model in parallel with dozens of other servers all pulling from the same storage pool. This creates a sustained, high-concurrency sequential read pattern that SAN controllers and spinning-disk RAID groups were never dimensioned to serve.
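
For concreteness, the following minimal Python sketch reproduces that read pattern: many workers streaming the same model shards as large sequential reads through the GET /blobs/{id} route shown in the Examples section below. The base URL, shard IDs, worker count, and chunk size are illustrative assumptions, not part of the documented API.

# Many inference workers pulling the same shards concurrently, each as a
# large sequential read. Base URL and shard IDs are hypothetical.
import concurrent.futures
import urllib.request

BASE_URL = "http://nabu-store.local:8080"                      # hypothetical endpoint
SHARD_IDS = [f"llama-70b/shard-{i:05d}" for i in range(16)]    # hypothetical blob IDs

def pull_shard(blob_id: str, chunk_size: int = 8 << 20) -> int:
    """Stream one model shard sequentially; return the number of bytes read."""
    total = 0
    with urllib.request.urlopen(f"{BASE_URL}/blobs/{blob_id}") as resp:
        while True:
            chunk = resp.read(chunk_size)      # 8 MiB sequential reads
            if not chunk:
                break
            total += len(chunk)
    return total

# Dozens of workers issue the same reads at once: the sustained,
# high-concurrency sequential pattern a shared SAN controller must absorb.
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    sizes = list(pool.map(pull_shard, SHARD_IDS))
print(f"loaded {sum(sizes) / 2**30:.1f} GiB across {len(SHARD_IDS)} shards")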

The KV Cache Amplification Problem

Modern transformer-based models use a key-value (KV) cache to avoid recomputing attention scores across tokens in a sequence. In high-throughput deployments, this cache is externalized to storage so that it can be shared across inference replicas and survive GPU restarts. KV cache entries are small (often 4–64 KB per sequence), written at the rate of incoming requests, and read back within tens of milliseconds to continue generation. This produces a mixed small-write, small-read workload layered on top of the large sequential reads for model weights — a combination that overwhelms the queue depths and cache hit rates of traditional SAN controllers.
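
The sketch below shows what that externalized KV-cache traffic looks like from a client: one small write per generation step, read back shortly afterwards, interleaved with the large weight reads. It assumes the GET /blobs/{id} route from the Examples section; the PUT method, the key scheme, and the entry size are illustrative assumptions.

# Small writes at request rate, read back within tens of milliseconds to
# continue generation on another replica. Endpoint and key scheme are hypothetical.
import os
import urllib.request

BASE_URL = "http://nabu-store.local:8080"   # hypothetical endpoint

def put_kv_entry(session_id: str, step: int, payload: bytes) -> None:
    """Persist one KV-cache entry (typically 4-64 KB) for a generation step."""
    req = urllib.request.Request(
        f"{BASE_URL}/blobs/kv/{session_id}/{step}", data=payload, method="PUT")
    urllib.request.urlopen(req).close()

def get_kv_entry(session_id: str, step: int) -> bytes:
    """Fetch the entry back before the next token is generated."""
    with urllib.request.urlopen(f"{BASE_URL}/blobs/kv/{session_id}/{step}") as resp:
        return resp.read()

# One entry per generated token, layered on top of the sequential weight reads.
for step in range(8):
    put_kv_entry("session-42", step, os.urandom(32 * 1024))    # ~32 KB entry
    _ = get_kv_entry("session-42", step)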

Stringent Latency Requirements

Inference has a hard latency budget. A user waiting for a response can tolerate a few hundred milliseconds end-to-end, but the storage layer must contribute only a small fraction of that. Once you subtract GPU compute time, network round-trips, and tokenization overhead, the budget for a storage read is often measured in single-digit milliseconds or less. Sub-millisecond access to hot data — model shards already loaded, KV cache entries for active sessions — is not a performance aspiration but an operational requirement.
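
A back-of-the-envelope version of that budget, with every number chosen purely for illustration:

# Illustrative latency budget for one interactive request (all values assumed).
end_to_end_budget_ms = 300.0    # what a user waiting for a response tolerates
gpu_compute_ms       = 250.0    # prefill plus the first decode steps
network_rtt_ms       = 20.0     # client <-> gateway <-> inference server
tokenization_ms      = 10.0

storage_budget_ms = end_to_end_budget_ms - gpu_compute_ms - network_rtt_ms - tokenization_ms
reads_on_critical_path = 8      # e.g. KV-cache lookups for an active session

per_read_budget_ms = storage_budget_ms / reads_on_critical_path
print(f"total storage budget: {storage_budget_ms:.0f} ms")      # 20 ms
print(f"per-read budget:      {per_read_budget_ms:.2f} ms")     # 2.50 ms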

Conventional SAN stacks introduce multiple sources of latency that compound poorly at this timescale:

Latency Source                                       |  Typical Contribution
-----------------------------------------------------|------------------------
Kernel block I/O stack (VFS, page cache, scheduler)  |  50–200 µs
HBA and fabric traversal                             |  100–500 µs
SAN controller queuing and cache lookup              |  200 µs–2 ms
Spinning disk seek (if applicable)                   |  3–10 ms

Even with all-flash SAN arrays, the software stack alone can add hundreds of microseconds per operation — enough to breach the latency budget for KV cache lookups.
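
Summing only the low-end software and fabric contributions from the table (the all-flash case, with no seek) makes the point concrete:

# Low-end values taken from the table above; no media latency included.
kernel_block_stack_us = 50
hba_fabric_us         = 100
controller_queuing_us = 200

san_path_us = kernel_block_stack_us + hba_fabric_us + controller_queuing_us
print(f"SAN software/fabric path: ~{san_path_us} us per operation")   # ~350 us
# Roughly 350 us before the flash media is even touched, consuming a large
# share of a single-digit-millisecond KV-cache read budget.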

Bandwidth That Scales with the Model, Not the Array

Loading a large language model for inference is not a one-time event. Models are reloaded after crashes, swapped when serving different workloads, and distributed across multiple nodes for tensor-parallel inference. Peak bandwidth demand during a cold start or a rolling deployment can reach tens of gigabytes per second across the cluster. A traditional SAN array has a fixed number of controller ports and a shared backplane; adding more storage nodes does not proportionally increase read bandwidth. You are bound by the controller ceiling.
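
A quick sizing sketch of that demand; the model size, replica count, and cold-start window are illustrative assumptions:

# Aggregate read bandwidth needed to reload a large model across many replicas.
model_size_gb = 140     # roughly a 70B-parameter model at FP16
replicas      = 24      # tensor-parallel groups reloading in a rolling deployment
target_load_s = 120     # acceptable cold-start window

aggregate_bandwidth_gbps = model_size_gb * replicas / target_load_s
print(f"required aggregate read bandwidth: ~{aggregate_bandwidth_gbps:.0f} GB/s")   # ~28 GB/s
# A fixed set of SAN controller ports caps this figure; in a scale-out design
# it grows with every node added.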

Horizontal scale-out storage — where every node added to the cluster contributes both capacity and bandwidth — is structurally necessary for inference at scale.

Why SAN Architectures Fall Short

SAN-based storage was architected for a world where storage is a shared utility: separated from compute by a dedicated fabric and managed through a centralized controller. That model creates three structural problems for AI inference:

  1. Centralized bottleneck. All I/O passes through a controller or a small set of controllers. As GPU counts grow, the controller becomes the ceiling on achievable throughput.

  2. Protocol overhead. Fibre Channel and iSCSI were designed for block-level access with transactional semantics. They carry significant per-operation overhead that is visible at the microsecond timescales inference requires.

  3. No awareness of object semantics. Inference workloads benefit from content-addressable storage (deduplication of identical model weights across tenants), tiering based on access temperature (hot KV cache vs. cold checkpoint), and erasure coding tuned to object size. SAN presents a flat block device with none of these affordances.

What Nabu Store Provides Instead

Nabu Store is a software-defined, hyper-converged storage system built specifically to match the I/O profile described above. Rather than separating storage from compute behind a fabric, it runs on the same nodes as your inference workload and exposes a direct API over HTTP.

  • Horizontal bandwidth scaling. Every node added to the cluster contributes its full NVMe throughput to the shared pool. There is no centralized controller to saturate.

  • Sub-millisecond access via SPDK. The optional SPDK backend bypasses the Linux kernel block layer entirely, using user-space NVMe drivers to achieve the single-digit-microsecond device latencies that NVMe hardware is capable of — without SAN controller overhead on top.

  • CXL memory tiering for hot data. Frequently accessed blobs — live KV cache entries, recently loaded model shards — can be pinned in CXL-attached persistent memory, bringing access latency below 1 µs for the hottest working set.

  • Object-aware storage policies. You choose erasure coding (EC 4+2 for moderate overhead, EC 8+2 for large corpora) or full 3× replication on a per-blob basis, matched to the durability and access pattern of each object type.

  • Content-addressable deduplication. Model weights shared across multiple tenants or inference replicas are stored once and retrieved by content hash, eliminating redundant copies that waste capacity and bandwidth.

The net effect is a storage layer whose throughput, latency, and capacity grow linearly with your inference cluster, rather than acting as a fixed ceiling imposed by a shared appliance.
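
To make the workflow concrete, here is a minimal client sketch in Python. The GET /blobs/{id} route is taken from the Examples section below; the PUT route, the X-Nabu-Policy header, and the policy names are hypothetical illustrations of the per-blob policy selection and content-hash addressing described above, not the documented API.

# Store a blob under its SHA-256 content hash with a per-blob durability policy.
import hashlib
import os
import urllib.request

BASE_URL = "http://nabu-store.local:8080"   # hypothetical endpoint

def store_blob(data: bytes, policy: str) -> str:
    """Write a blob under its content hash; identical content is stored once."""
    blob_id = hashlib.sha256(data).hexdigest()
    req = urllib.request.Request(
        f"{BASE_URL}/blobs/{blob_id}",
        data=data,
        method="PUT",
        headers={"X-Nabu-Policy": policy},   # hypothetical per-blob policy header
    )
    urllib.request.urlopen(req).close()
    return blob_id

def load_blob(blob_id: str) -> bytes:
    with urllib.request.urlopen(f"{BASE_URL}/blobs/{blob_id}") as resp:
        return resp.read()

# Large, immutable, widely shared objects suit erasure coding; small, hot,
# frequently rewritten objects (KV cache) suit replication. The payload here
# is a stand-in for a real model shard.
shard_id = store_blob(os.urandom(1024), policy="ec-8+2")
restored = load_blob(shard_id)

As a rule of thumb, EC 4+2 stores 1.5× the raw data and EC 8+2 stores 1.25×, versus 3× for full replication, which is why erasure coding fits large immutable corpora while replication fits small, hot objects.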


Examples

The following example illustrates the latency difference you can expect between a conventional block storage path and Nabu Store's SPDK backend when retrieving a typical KV cache blob.

Latency comparison: kernel block path vs. SPDK user-space path

This is a representative benchmark output showing p50 and p99 read latencies for a 32 KB blob (typical KV cache entry size) under moderate concurrency.

# Kernel LocalFS backend (representative baseline)
GET /blobs/{id}  32KB
  p50 latency:  480 µs
  p99 latency: 1200 µs
  throughput:   2.1 GB/s (8-node cluster)

# SPDK NVMe backend (user-space, kernel bypass)
GET /blobs/{id}  32KB
  p50 latency:   38 µs
  p99 latency:  120 µs
  throughput:   9.4 GB/s (8-node cluster)

Bandwidth scaling: adding nodes to the cluster

The following shows aggregate sequential read throughput as nodes are added, demonstrating the linear scaling property that SAN architectures cannot provide.

Nodes  |  Aggregate Read Bandwidth
-------|---------------------------
1      |  ~12 GB/s
2      |  ~24 GB/s
4      |  ~48 GB/s
8      |  ~94 GB/s   (near-linear)

Related concepts
  • Enable the SPDK NVMe Backend — Configure user-space NVMe access to achieve the sub-millisecond device latencies discussed on this page.
  • Enable CXL Memory Tiering — Pin hot blobs (KV cache, active model shards) into CXL-attached persistent memory for sub-microsecond access.
  • Choose and Apply a Replication Policy — Understand how EC 4+2, EC 8+2, and 3× replication map to the different object types in an inference pipeline (model weights, KV cache, checkpoints).
  • Deploy a Multi-Node Cluster on Kubernetes — Realize the horizontal bandwidth scaling described on this page by deploying a multi-node Nabu Store cluster.
  • Monitor Cluster Health and Capacity — Observe throughput, latency, and capacity headroom across your Nabu Store nodes in production.