Deployment Guide
Deploying to production (Docker, Kubernetes, serverless)
This guide walks you through deploying Nabu Store (AIStore) to production environments using Docker, Kubernetes with raw manifests, or the official Helm chart. You will learn how to run a single-node instance for validation, scale to a multi-node cluster for production resilience, and enable the SPDK NVMe backend for workloads that demand the lowest possible latency. Follow the sections in order for a first deployment, or jump to the topology that matches your infrastructure.
Before you begin, ensure you have the following:
Container runtime and orchestration

- Docker 24.0+ (for single-node or local testing)
- Kubernetes 1.27+ with kubectl configured against your target cluster
- Helm 3.12+ (for Helm-based deployments)

Hardware

- At least 3 worker nodes with persistent block storage for a production cluster
- NVMe drives addressable by PCI address and available to dedicate to SPDK (which unbinds them from the kernel driver), if you plan to use the SPDK backend
- Hugepage support enabled in the Linux kernel if using SPDK (/proc/sys/vm/nr_hugepages writable)

Image access

- Network access to ghcr.io/trilio/aistore (Helm deployments) or xyzmd/aistore:latest (raw manifest deployments)
- A valid imagePullSecret if your cluster requires authenticated registry access

Permissions

- cluster-admin or an equivalent RBAC role to create namespaces, StatefulSets, DaemonSets, and PersistentVolumeClaims
- A privileged pod security policy or a permissive admission controller if deploying the SPDK DaemonSet

Networking

- Port 50051 (gRPC) reachable between nodes and from API clients
- Port 9090 reachable from your monitoring stack if metrics are enabled
Choose the deployment path that matches your environment. All paths expose AIStore's gRPC API on port 50051.
Option A — Docker (single node, validation)
Use this path to verify connectivity and API behaviour before committing to a full cluster.
1. Pull the image:

   docker pull ghcr.io/trilio/aistore:latest

2. Create a data directory on the host:

   mkdir -p /data/aistore

3. Start the container:

   docker run -d \
     --name aistore-node1 \
     -p 50051:50051 \
     -v /data/aistore:/data/aistore \
     ghcr.io/trilio/aistore:latest \
     -node-id node1 \
     -listen :50051 \
     -data-dir /data/aistore

4. Verify the node is healthy:

   wget --spider --quiet http://localhost:50051/healthz && echo "healthy"

   You should see healthy printed within a few seconds.
Option B — Kubernetes raw manifests (multi-node, lab or staging)
The deploy/k8s/aistore-cluster.yaml manifest deploys three AIStore pods on a single Kubernetes host using host networking and separate ports. It is suited for functional validation on a single machine before moving to a StatefulSet topology.
1. Create the namespace:

   kubectl create namespace aistore

2. Apply the cluster manifest:

   kubectl apply -f deploy/k8s/aistore-cluster.yaml

3. Wait for all pods to become Ready:

   kubectl -n aistore wait pod \
     -l app=aistore \
     --for=condition=Ready \
     --timeout=120s

4. Verify each node joined the cluster:

   kubectl -n aistore exec aistore-node1 -- \
     testclient --addr=localhost:50051
   kubectl -n aistore exec aistore-node2 -- \
     testclient --addr=localhost:50052
   kubectl -n aistore exec aistore-node3 -- \
     testclient --addr=localhost:50053
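Because the three single-host nodes differ only in their ordinal and port, the verification step can be scripted. The sketch below prints the per-node commands (pipe its output to sh to execute them against a live cluster):

```shell
# Print the verification command for each of the three single-host nodes.
# Node i listens on port 50050 + i (50051, 50052, 50053).
for i in 1 2 3; do
  port=$(( 50050 + i ))
  echo "kubectl -n aistore exec aistore-node${i} -- testclient --addr=localhost:${port}"
done
```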
Option C — Helm (multi-node, production)
The Helm chart deploys AIStore as a Kubernetes StatefulSet with persistent volumes, liveness/readiness probes, a Pod Disruption Budget, and optional Prometheus metrics. This is the recommended path for production.
1. Add the chart repository (replace with the actual Helm repo URL when available):

   helm repo add trilio https://charts.trilio.io
   helm repo update

   Note for reviewer: the source material does not include a published Helm repository URL. Update this step when the chart is hosted.

2. Create the namespace:

   kubectl create namespace aistore

3. Install with default values (3 replicas, 100 Gi per node):

   helm install aistore trilio/aistore \
     --namespace aistore \
     --version 0.1.0

   Or install from a local chart checkout:

   helm install aistore ./deploy/helm/aistore \
     --namespace aistore

4. Watch the StatefulSet roll out:

   kubectl -n aistore rollout status statefulset/aistore

5. Confirm all pods are Running and Ready:

   kubectl -n aistore get pods -l app.kubernetes.io/name=aistore

   Expected output:

   NAME        READY   STATUS    RESTARTS   AGE
   aistore-0   1/1     Running   0          2m
   aistore-1   1/1     Running   0          2m
   aistore-2   1/1     Running   0          2m
Option D — SPDK NVMe backend (production, low-latency)
The SPDK DaemonSet runs AIStore in user-space NVMe mode, bypassing the kernel I/O stack. It requires privileged pods and pre-configured hugepages.
1. Label each NVMe-capable worker node:

   kubectl label node <node-name> aistore.trilio.io/storage=nvme

2. Apply the SPDK DaemonSet manifest:

   kubectl apply -f deploy/spdk/daemonset.yaml

3. Monitor the init container (it binds NVMe devices and allocates hugepages):

   kubectl -n aistore-spdk logs -l app=aistore,backend=spdk \
     -c spdk-setup --follow

4. Confirm the DaemonSet is fully ready:

   kubectl -n aistore-spdk rollout status daemonset/aistore-spdk

   Each pod exposes gRPC on port 50051 and metrics on port 9090 via host networking.
All runtime behaviour is controlled through command-line flags passed to the aistore binary. When deploying via Helm, these flags are surfaced as values.yaml keys; when using raw manifests, set them directly in the args array of your pod spec.
Core node flags
| Flag | Default | Description |
|---|---|---|
| -node-id | host name | Unique identifier for this node in the cluster ring. Must be stable across restarts; use the pod name or a DNS-stable hostname. |
| -listen | :50051 | gRPC listen address. Change only if you need to run multiple nodes on the same host. |
| -data-dir | /data/aistore | Directory where blobs and the BoltDB index are stored. Back this with a persistent volume in production. |
| -index-path | /data/aistore/index.db | Path to the BoltDB index file. Defaults to a location inside -data-dir. |
| -backend | localfs | Storage backend. Valid values: localfs (default, kernel-managed I/O) or spdk (user-space NVMe). |
| -seed-nodes | (none) | Comma-separated host:port addresses of existing cluster members. The first node in a new cluster leaves this empty; every subsequent node must point to at least one running peer. |
| -vnodes | 150 | Virtual nodes per physical node in the consistent hash ring. Higher values improve key distribution but increase ring state size. Do not change after the cluster has stored data. |
| -heartbeat | 5s | Interval at which this node announces itself to its peers. Lower values detect failures faster at the cost of more network traffic. |
| -shutdown-timeout | 30s | How long the node waits for in-flight gRPC calls to complete before it exits. The Helm chart automatically adds 10 s to this value for terminationGracePeriodSeconds. |
| -enable-cxl | false | Enable CXL memory tiering for hot-data caching. Requires compatible CXL hardware. |
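When deploying from raw manifests, these flags go in the container's args array. The fragment below is an illustrative sketch of a second node joining an existing peer via -seed-nodes; the pod layout, node names, and peer address are assumptions, not values taken from the shipped manifests:

```yaml
# Illustrative pod spec fragment -- names and addresses are hypothetical.
containers:
  - name: aistore
    image: xyzmd/aistore:latest
    args:
      - "-node-id"
      - "node2"                          # must be stable across restarts (e.g. the pod name)
      - "-listen"
      - ":50052"                         # a second node on the same host needs its own port
      - "-data-dir"
      - "/data/aistore"
      - "-seed-nodes"
      - "node1.example.internal:50051"   # at least one running peer of the existing cluster
```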
SPDK-specific flags
Only applicable when -backend spdk is set.
| Flag | Default | Description |
|---|---|---|
| -spdk-core-mask | 0x3 | Hexadecimal bitmask of CPU cores dedicated to the SPDK reactor. Cores assigned here are fully consumed by SPDK polling loops; do not overlap with OS-critical cores. |
| -spdk-mem | 8192 | Memory in MB reserved for SPDK. Must be backed by pre-allocated hugepages. |
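The default mask 0x3 dedicates cores 0 and 1. If you need a mask for different cores, or want to size the hugepage pool for a given -spdk-mem value, the arithmetic is a one-liner; this sketch builds both:

```shell
# Build an SPDK core mask for cores 2 and 3 (by the same rule, 0x3 covers cores 0 and 1)
MASK=0
for c in 2 3; do
  MASK=$(( MASK | (1 << c) ))
done
printf -- '-spdk-core-mask 0x%x\n' "$MASK"      # prints -spdk-core-mask 0xc

# Number of 1 GiB hugepages needed to back -spdk-mem 8192 (the flag value is in MB)
SPDK_MEM_MB=8192
echo "nr_hugepages=$(( SPDK_MEM_MB / 1024 ))"   # prints nr_hugepages=8
```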
Helm values.yaml knobs
When using the Helm chart, the most important values to override for production are:
| Key | Default | Guidance |
|---|---|---|
| replicaCount | 3 | Set to at least 3 for quorum-safe operation. The Pod Disruption Budget enforces minAvailable: 2 by default. |
| persistence.storageClass | "" (cluster default) | Set to nvme-ssd or your site-specific SSD storage class. Using local-path is acceptable only for development. |
| persistence.size | 100Gi | Provision enough capacity for your expected blob volume plus a 30% overhead buffer. |
| resources.limits.cpu | 4 | Tune based on your workload profile. Inference serving is typically I/O bound, so CPU limits can often be reduced. |
| resources.limits.memory | 8Gi | Must comfortably exceed your anticipated working set plus the BoltDB index size. |
| node.vnodes | 150 | Keep consistent across all nodes and do not change after initial cluster formation. |
| metrics.enabled | true | Leave enabled in production; metrics are exposed on port 9090 for Prometheus scraping. |
| podDisruptionBudget.minAvailable | 2 | Adjust to match your replication factor minus one to allow rolling maintenance. |
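Rather than passing many --set flags, these overrides can be collected in a values file and applied with -f. The file below is a sketch for a production profile; the storage class name and sizes are illustrative and should be adjusted per site:

```yaml
# production-values.yaml -- illustrative overrides, adjust for your site
replicaCount: 3
persistence:
  storageClass: nvme-ssd     # site-specific SSD storage class
  size: 130Gi                # 100Gi expected blob volume + 30% overhead buffer
resources:
  limits:
    cpu: 4
    memory: 16Gi             # working set + BoltDB index headroom
metrics:
  enabled: true
podDisruptionBudget:
  minAvailable: 2            # replication factor minus one
```

Apply it with: helm install aistore trilio/aistore --namespace aistore -f production-values.yaml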
Once your cluster is running, you interact with it exclusively through its gRPC API on port 50051. The sections below show the most common operational patterns.
Sending API requests
You can use the bundled testclient binary for ad-hoc testing, or any gRPC client generated from the proto/ definitions for programmatic access.
Test connectivity with testclient
# Against a Docker single-node deployment
testclient --addr=localhost:50051
# Against a specific pod in a Kubernetes cluster
kubectl -n aistore exec aistore-0 -- \
testclient --addr=localhost:50051
Resolve the gRPC endpoint for a Helm-deployed cluster
The Helm chart creates two Services:
- A ClusterIP service (aistore) on port 50051 for internal client traffic.
- A headless service (aistore-headless) for StatefulSet pod DNS, used for gossip and seed-node resolution.
For API clients running inside the same namespace:
aistore.aistore.svc.cluster.local:50051
For clients outside the cluster, expose the service via a LoadBalancer or NodePort, or use kubectl port-forward:
kubectl -n aistore port-forward svc/aistore 50051:50051
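The in-cluster endpoint follows the standard Kubernetes service-DNS pattern, so it can be assembled from the service name, namespace, and port; a small sketch using the defaults from this guide:

```shell
# Assemble the in-cluster gRPC endpoint (values match this guide's defaults)
SERVICE=aistore
NAMESPACE=aistore
PORT=50051
ENDPOINT="${SERVICE}.${NAMESPACE}.svc.cluster.local:${PORT}"
echo "$ENDPOINT"   # prints aistore.aistore.svc.cluster.local:50051
```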
Scaling the cluster (adding nodes)
With the Helm chart, add nodes by increasing replicaCount. New pods automatically discover the cluster through the headless service seed address (aistore-0.aistore-headless:50051).
helm upgrade aistore ./deploy/helm/aistore \
--namespace aistore \
--set replicaCount=5
Monitor the ring rebalance by watching pod logs:
kubectl -n aistore logs -f aistore-3
Performing a rolling update
Update the image tag and let the StatefulSet controller handle the rollout one pod at a time:
helm upgrade aistore ./deploy/helm/aistore \
--namespace aistore \
--set image.tag=0.2.0
kubectl -n aistore rollout status statefulset/aistore
The -shutdown-timeout 30s flag ensures in-flight requests complete before each pod terminates.
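As noted in the flags table, the chart derives terminationGracePeriodSeconds by adding 10 s to -shutdown-timeout, so raising the timeout lengthens the grace period accordingly:

```shell
# terminationGracePeriodSeconds = -shutdown-timeout + 10 (per the flags table)
SHUTDOWN_TIMEOUT_S=30
GRACE_PERIOD_S=$(( SHUTDOWN_TIMEOUT_S + 10 ))
echo "terminationGracePeriodSeconds=${GRACE_PERIOD_S}"   # prints terminationGracePeriodSeconds=40
```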
Monitoring cluster health
If metrics.enabled: true (the default), each pod exposes Prometheus metrics on port 9090. Scrape them directly or configure a ServiceMonitor if you use the Prometheus Operator:
# Quick spot-check from inside the cluster
kubectl -n aistore exec aistore-0 -- \
wget -qO- http://localhost:9090/metrics | head -40
The examples below are self-contained and runnable. They assume a Helm-deployed cluster in the aistore namespace with kubectl port-forward active on localhost:50051.
Example 1 — Deploy a 3-node cluster with a custom storage class
Override the storage class and disk size for a production NVMe environment:
helm install aistore ./deploy/helm/aistore \
--namespace aistore \
--create-namespace \
--set replicaCount=3 \
--set persistence.storageClass=nvme-ssd \
--set persistence.size=500Gi \
--set resources.limits.memory=16Gi
Expected output:
NAME: aistore
LAST DEPLOYED: Tue Jan  6 10:00:00 2026
NAMESPACE: aistore
STATUS: deployed
REVISION: 1
Example 2 — Run a single Docker node and verify the gRPC endpoint
docker run -d \
--name aistore-node1 \
-p 50051:50051 \
-v /data/aistore:/data/aistore \
ghcr.io/trilio/aistore:latest \
-node-id node1 \
-listen :50051 \
-data-dir /data/aistore
sleep 5
docker exec aistore-node1 \
testclient --addr=localhost:50051
Expected output (abbreviated):
Connected to localhost:50051
Put blob OK key=test-key size=128
Get blob OK key=test-key size=128
All operations succeeded
Example 3 — Deploy the SPDK DaemonSet on NVMe nodes
# Label the nodes that have NVMe hardware
kubectl label node worker-01 aistore.trilio.io/storage=nvme
kubectl label node worker-02 aistore.trilio.io/storage=nvme
# Apply the DaemonSet
kubectl apply -f deploy/spdk/daemonset.yaml
# Watch init container bind NVMe devices
kubectl -n aistore-spdk logs \
-l app=aistore,backend=spdk \
-c spdk-setup
Expected init container output:
Setting up SPDK...
[auto-detect] Binding 0000:01:00.0 to vfio-pci
SPDK setup complete
Example 4 — Scale from 3 to 5 nodes using Helm
helm upgrade aistore ./deploy/helm/aistore \
--namespace aistore \
--reuse-values \
--set replicaCount=5
kubectl -n aistore rollout status statefulset/aistore
Expected output:
Waiting for 2 pods to be ready...
statefulset rolling update complete 5 pods at revision aistore-6d8f7c9b4...
Example 5 — Apply a custom node selector and toleration via Helm
Pin AIStore pods to dedicated storage nodes that carry a NoSchedule taint:
helm upgrade aistore ./deploy/helm/aistore \
--namespace aistore \
--reuse-values \
--set nodeSelector."node-role\.kubernetes\.io/storage"=true \
--set-json 'tolerations=[{"key":"node-role.kubernetes.io/storage","operator":"Exists","effect":"NoSchedule"}]'
Use the following reference to diagnose and resolve the most common deployment problems.
Problem: Pods stuck in Pending state after Helm install
Symptom: kubectl -n aistore get pods shows Pending for one or more pods indefinitely.
Likely cause: No PersistentVolume can be dynamically provisioned because the storage class does not exist or no nodes satisfy the nodeSelector.
Fix:
# Check PVC status
kubectl -n aistore get pvc
# Describe the pending pod for scheduling events
kubectl -n aistore describe pod aistore-0
Ensure persistence.storageClass in values.yaml matches an existing StorageClass (kubectl get storageclass). If no storage class is set, your cluster must have a default one.
Problem: Pods crash with exec format error or exit immediately
Symptom: Pods enter CrashLoopBackOff; logs show exec format error.
Likely cause: The image was built for linux/amd64 but your nodes run a different architecture (for example, ARM).
Fix: The Dockerfile builds with GOARCH=amd64 explicitly. Rebuild the binary and image for your target architecture or verify your node hardware.
kubectl -n aistore logs aistore-0 --previous
Problem: New node cannot join the cluster (seed-node connection refused)
Symptom: Logs on a new pod show repeated connection errors to the seed address.
Likely cause: The first pod (aistore-0) is not yet Ready when subsequent pods start, or the headless service DNS has not propagated.
Fix: The Helm chart uses podManagementPolicy: Parallel, which starts all pods simultaneously. If your cluster has slow DNS, switch to OrderedReady or add an init container that waits for the seed node:
# Check headless service DNS from inside a pod
kubectl -n aistore exec aistore-1 -- \
nslookup aistore-0.aistore-headless.aistore.svc.cluster.local
If the lookup fails, verify the headless service selector matches pod labels.
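One way to wait for the seed node is an init container along these lines. This is a hypothetical sketch, not part of the shipped chart; the container name and busybox image are assumptions:

```yaml
# Hypothetical wait-for-seed init container -- not part of the shipped chart
initContainers:
  - name: wait-for-seed
    image: busybox:1.36
    command:
      - sh
      - -c
      - |
        until nc -z aistore-0.aistore-headless.aistore.svc.cluster.local 50051; do
          echo "waiting for seed node..."
          sleep 2
        done
```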
Problem: SPDK init container fails with hugepage allocation error
Symptom: Init container exits non-zero; logs show cannot allocate hugepages.
Likely cause: The kernel hugepage pool on that node is not large enough, or the node's LimitRange denies the requested hugepages-1Gi resource.
Fix:
# Check current hugepage availability on the node
grep Huge /proc/meminfo
# Increase the pool (run on the node, or via a privileged DaemonSet)
echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
The SPDK manifest requests and limits hugepages-1Gi: 8Gi. Ensure the node has at least 8 free 1 GiB hugepages and that the namespace LimitRange allows up to 16Gi.
Problem: Health check fails (/healthz returns connection refused)
Symptom: Docker HEALTHCHECK or Kubernetes liveness probe marks the container unhealthy shortly after startup.
Likely cause: The gRPC server did not finish initialising before the probe fired, or the -listen address does not match the probe target.
Fix: Increase livenessProbe.initialDelaySeconds to give the node more startup time:
helm upgrade aistore ./deploy/helm/aistore \
--namespace aistore \
--reuse-values \
--set livenessProbe.initialDelaySeconds=30
For Docker, adjust the --start-period flag in your docker run command or your docker-compose.yml.
Problem: Rolling update causes quorum loss
Symptom: During helm upgrade, more than one pod goes offline simultaneously and the cluster stops serving requests.
Likely cause: The Pod Disruption Budget minimum (minAvailable: 2) was not enforced, or the StatefulSet update strategy replaced too many pods at once.
Fix: Verify the PDB is in place and check its status:
kubectl -n aistore get pdb
kubectl -n aistore describe pdb aistore
If minAvailable was changed to a value lower than 2 for a 3-node cluster, reset it:
helm upgrade aistore ./deploy/helm/aistore \
--namespace aistore \
--reuse-values \
--set podDisruptionBudget.minAvailable=2
