
Deployment Guide

Deploying to production (Docker, Kubernetes, serverless)


Overview

This guide walks you through deploying Nabu Store (AIStore) to production environments using Docker, Kubernetes with raw manifests, or the official Helm chart. You will learn how to run a single-node instance for validation, scale to a multi-node cluster for production resilience, and enable the SPDK NVMe backend for workloads that demand the lowest possible latency. Follow the sections in order for a first deployment, or jump to the topology that matches your infrastructure.


Prerequisites

Before you begin, ensure you have the following:

Container runtime and orchestration

  • Docker 24.0+ (for single-node or local testing)
  • Kubernetes 1.27+ with kubectl configured against your target cluster
  • Helm 3.12+ (for Helm-based deployments)

Hardware

  • At least 3 worker nodes with persistent block storage for a production cluster
  • NVMe drives identifiable by PCI address if you plan to use the SPDK backend (the SPDK setup step rebinds them from the kernel driver to vfio-pci)
  • Hugepage support enabled in the Linux kernel if using SPDK (/proc/sys/vm/nr_hugepages writable)

Image access

  • Network access to ghcr.io/trilio/aistore (Helm deployments) or xyzmd/aistore:latest (raw manifest deployments)
  • A valid imagePullSecret if your cluster requires authenticated registry access

Permissions

  • cluster-admin or equivalent RBAC role to create namespaces, StatefulSets, DaemonSets, and PersistentVolumeClaims
  • Privileged pod security policy or a permissive admission controller if deploying the SPDK DaemonSet

Networking

  • Port 50051 (gRPC) reachable between nodes and from API clients
  • Port 9090 reachable from your monitoring stack if metrics are enabled

Installation

Choose the deployment path that matches your environment. All paths expose AIStore's gRPC API on port 50051 (Option B runs three nodes on one host and therefore also uses ports 50052 and 50053).


Option A — Docker (single node, validation)

Use this path to verify connectivity and API behaviour before committing to a full cluster.

  1. Pull the image

    docker pull ghcr.io/trilio/aistore:latest
    
  2. Create a data directory on the host

    mkdir -p /data/aistore
    
  3. Start the container

    docker run -d \
      --name aistore-node1 \
      -p 50051:50051 \
      -v /data/aistore:/data/aistore \
      ghcr.io/trilio/aistore:latest \
      -node-id node1 \
      -listen :50051 \
      -data-dir /data/aistore
    
  4. Verify the node is healthy

    wget --spider --quiet http://localhost:50051/healthz && echo "healthy"
    

    You should see healthy printed within a few seconds.
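The same single-node setup can also be captured in a Compose file, so the container restarts automatically and Docker tracks its health. This is a sketch built only from the assumptions already made in this section (image name, flags, the /healthz endpoint, and wget being available inside the image); adjust paths and the start period for your host:

```yaml
# docker-compose.yml sketch for the single-node validation setup above
services:
  aistore:
    image: ghcr.io/trilio/aistore:latest
    container_name: aistore-node1
    restart: unless-stopped
    ports:
      - "50051:50051"
    volumes:
      - /data/aistore:/data/aistore
    command: ["-node-id", "node1", "-listen", ":50051", "-data-dir", "/data/aistore"]
    healthcheck:
      # Mirrors the manual check above; assumes wget exists in the image
      test: ["CMD", "wget", "--spider", "--quiet", "http://localhost:50051/healthz"]
      interval: 10s
      timeout: 3s
      start_period: 15s
```

Start it with docker compose up -d and confirm the healthy status in docker ps.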


Option B — Kubernetes raw manifests (multi-node, lab or staging)

The deploy/k8s/aistore-cluster.yaml manifest deploys three AIStore pods on a single Kubernetes host using host networking and separate ports. It is suited for functional validation on a single machine before moving to a StatefulSet topology.

  1. Create the namespace

    kubectl create namespace aistore
    
  2. Apply the cluster manifest

    kubectl apply -f deploy/k8s/aistore-cluster.yaml
    
  3. Wait for all pods to become Ready

    kubectl -n aistore wait pod \
      -l app=aistore \
      --for=condition=Ready \
      --timeout=120s
    
  4. Verify each node joined the cluster

    kubectl -n aistore exec aistore-node1 -- \
      testclient --addr=localhost:50051
    
    kubectl -n aistore exec aistore-node2 -- \
      testclient --addr=localhost:50052
    
    kubectl -n aistore exec aistore-node3 -- \
      testclient --addr=localhost:50053
    
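For orientation, each of the three pods in the manifest follows roughly this shape. The file in the repository is authoritative; the labels, image, and volume layout below are assumptions inferred from the steps above:

```yaml
# Sketch of one of the three pods: host networking plus a per-pod port
apiVersion: v1
kind: Pod
metadata:
  name: aistore-node2
  namespace: aistore
  labels:
    app: aistore
spec:
  hostNetwork: true            # all three pods share the host's network namespace
  containers:
    - name: aistore
      image: xyzmd/aistore:latest
      args:
        - -node-id
        - node2
        - -listen
        - :50052               # node1 uses :50051, node3 uses :50053
        - -data-dir
        - /data/aistore-node2
        - -seed-nodes
        - 127.0.0.1:50051      # join the cluster via node1 on the same host
      volumeMounts:
        - name: data
          mountPath: /data/aistore-node2
  volumes:
    - name: data
      hostPath:
        path: /data/aistore-node2
        type: DirectoryOrCreate
```

Because every pod listens on a distinct port of the same host, no Service objects are needed for node-to-node traffic in this topology.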

Option C — Helm (multi-node, production)

The Helm chart deploys AIStore as a Kubernetes StatefulSet with persistent volumes, liveness/readiness probes, a Pod Disruption Budget, and optional Prometheus metrics. This is the recommended path for production.

  1. Add the chart repository (replace with the actual Helm repo URL when available)

    helm repo add trilio https://charts.trilio.io
    helm repo update
    

    Note: a published Helm repository URL is not yet available, so the URL above is a placeholder. Until the chart is hosted, install from the local chart checkout as shown in step 4.

  2. Create the namespace

    kubectl create namespace aistore
    
  3. Install with default values (3 replicas, 100 Gi per node)

    helm install aistore trilio/aistore \
      --namespace aistore \
      --version 0.1.0
    
  4. Alternatively, install from a local chart checkout

    helm install aistore ./deploy/helm/aistore \
      --namespace aistore
    
  5. Watch the StatefulSet roll out

    kubectl -n aistore rollout status statefulset/aistore
    
  6. Confirm all pods are Running and Ready

    kubectl -n aistore get pods -l app.kubernetes.io/name=aistore
    

    Expected output:

    NAME        READY   STATUS    RESTARTS   AGE
    aistore-0   1/1     Running   0          2m
    aistore-1   1/1     Running   0          2m
    aistore-2   1/1     Running   0          2m
    

Option D — SPDK NVMe backend (production, low-latency)

The SPDK DaemonSet runs AIStore in user-space NVMe mode, bypassing the kernel I/O stack. It requires privileged pods and pre-configured hugepages.

  1. Label each NVMe-capable worker node

    kubectl label node <node-name> aistore.trilio.io/storage=nvme
    
  2. Apply the SPDK DaemonSet manifest

    kubectl apply -f deploy/spdk/daemonset.yaml
    
  3. Monitor the init container (it binds NVMe devices and allocates hugepages)

    kubectl -n aistore-spdk logs -l app=aistore,backend=spdk \
      -c spdk-setup --follow
    
  4. Confirm the DaemonSet is fully ready

    kubectl -n aistore-spdk rollout status daemonset/aistore-spdk
    

    Each pod exposes gRPC on port 50051 and metrics on port 9090 via host networking.


Configuration

All runtime behaviour is controlled through command-line flags passed to the aistore binary. When deploying via Helm, these flags are surfaced as values.yaml keys; when using raw manifests, set them directly in the args array of your pod spec.


Core node flags

  -node-id (default: host name)
      Unique identifier for this node in the cluster ring. Must be stable across restarts; use the pod name or a DNS-stable hostname.

  -listen (default: :50051)
      gRPC listen address. Change only if you need to run multiple nodes on the same host.

  -data-dir (default: /data/aistore)
      Directory where blobs and the BoltDB index are stored. Back this with a persistent volume in production.

  -index-path (default: /data/aistore/index.db)
      Path to the BoltDB index file. Defaults to a location inside -data-dir.

  -backend (default: localfs)
      Storage backend. Valid values: localfs (kernel-managed I/O) or spdk (user-space NVMe).

  -seed-nodes (default: none)
      Comma-separated host:port addresses of existing cluster members. The first node in a new cluster leaves this empty; every subsequent node must point to at least one running peer.

  -vnodes (default: 150)
      Virtual nodes per physical node in the consistent hash ring. Higher values improve key distribution but increase ring state size. Do not change after the cluster has stored data.

  -heartbeat (default: 5s)
      Interval at which this node announces itself to its peers. Lower values detect failures faster at the cost of more network traffic.

  -shutdown-timeout (default: 30s)
      How long the node waits for in-flight gRPC calls to complete before it exits. The Helm chart automatically adds 10 s to this value for terminationGracePeriodSeconds.

  -enable-cxl (default: false)
      Enable CXL memory tiering for hot-data caching. Requires compatible CXL hardware.
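As a concrete illustration, a second node joining an existing cluster might be started like this. The peer address 10.0.0.11:50051 is a placeholder for one of your running nodes; every flag other than -node-id and -seed-nodes is shown at its default and could be omitted:

```shell
aistore \
  -node-id node2 \
  -listen :50051 \
  -data-dir /data/aistore \
  -seed-nodes 10.0.0.11:50051 \
  -heartbeat 5s \
  -shutdown-timeout 30s
```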

SPDK-specific flags

Only applicable when -backend spdk is set.

  -spdk-core-mask (default: 0x3)
      Hexadecimal bitmask of CPU cores dedicated to the SPDK reactor. Cores assigned here are fully consumed by SPDK polling loops; do not overlap them with OS-critical cores.

  -spdk-mem (default: 8192)
      Memory in MB reserved for SPDK. Must be backed by pre-allocated hugepages.
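Core masks are easy to get wrong. This small shell sketch decodes a hex mask into the CPU core numbers it selects, so you can double-check which cores you are dedicating before setting -spdk-core-mask:

```shell
# Decode an SPDK-style hex core mask into the CPU core numbers it selects.
# The default 0x3 selects cores 0 and 1; 0xC would select cores 2 and 3.
mask=$((0x3))
cores=""
for i in $(seq 0 31); do
  if [ $(( (mask >> i) & 1 )) -eq 1 ]; then
    cores="$cores $i"
  fi
done
echo "cores:$cores"
```

With the default mask this prints cores: 0 1. Verify the listed cores are not needed by the OS or other workloads before handing them to the SPDK reactor.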

Helm values.yaml knobs

When using the Helm chart, the most important values to override for production are:

  replicaCount (default: 3)
      Set to at least 3 for quorum-safe operation. The Pod Disruption Budget enforces minAvailable: 2 by default.

  persistence.storageClass (default: "", the cluster default)
      Set to nvme-ssd or your site-specific SSD storage class. Using local-path is acceptable only for development.

  persistence.size (default: 100Gi)
      Provision enough capacity for your expected blob volume plus a 30 % overhead buffer.

  resources.limits.cpu (default: 4)
      Tune based on your workload profile. Inference serving is typically I/O bound, so CPU limits can often be reduced.

  resources.limits.memory (default: 8Gi)
      Must comfortably exceed your anticipated working set plus the BoltDB index size.

  node.vnodes (default: 150)
      Keep consistent across all nodes and do not change after initial cluster formation.

  metrics.enabled (default: true)
      Leave enabled in production; metrics are exposed on port 9090 for Prometheus scraping.

  podDisruptionBudget.minAvailable (default: 2)
      Adjust to match your replication factor minus one to allow rolling maintenance.
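Rather than passing many --set flags, you can collect production overrides in a values file. This sketch uses the keys above; the specific sizes and storage class name are illustrative, not recommendations:

```yaml
# values-prod.yaml: production overrides (illustrative values)
replicaCount: 3
persistence:
  storageClass: nvme-ssd
  size: 500Gi
resources:
  limits:
    cpu: 4
    memory: 16Gi
node:
  vnodes: 150
metrics:
  enabled: true
podDisruptionBudget:
  minAvailable: 2
```

Install with helm install aistore ./deploy/helm/aistore --namespace aistore -f values-prod.yaml. Keeping the file in version control makes later helm upgrade runs reproducible.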

Usage

Once your cluster is running, all data operations go through its gRPC API on port 50051 (metrics, when enabled, are served separately on port 9090). The sections below show the most common operational patterns.


Sending API requests

You can use the bundled testclient binary for ad-hoc testing, or any gRPC client generated from the proto/ definitions for programmatic access.

Test connectivity with testclient

# Against a Docker single-node deployment
testclient --addr=localhost:50051

# Against a specific pod in a Kubernetes cluster
kubectl -n aistore exec aistore-0 -- \
  testclient --addr=localhost:50051

Resolve the gRPC endpoint for a Helm-deployed cluster

The Helm chart creates two Services:

  • A ClusterIP service (aistore) on port 50051 for internal client traffic.
  • A headless service (aistore-headless) for StatefulSet pod DNS, used for gossip and seed-node resolution.

For API clients running inside the same namespace:

aistore.aistore.svc.cluster.local:50051

For clients outside the cluster, expose the service via a LoadBalancer or NodePort, or use kubectl port-forward:

kubectl -n aistore port-forward svc/aistore 50051:50051

Scaling the cluster (adding nodes)

With the Helm chart, add nodes by increasing replicaCount. New pods automatically discover the cluster through the headless service seed address (aistore-0.aistore-headless:50051).

helm upgrade aistore ./deploy/helm/aistore \
  --namespace aistore \
  --set replicaCount=5

Monitor the ring rebalance by watching pod logs:

kubectl -n aistore logs -f aistore-3

Performing a rolling update

Update the image tag and let the StatefulSet controller handle the rollout one pod at a time:

helm upgrade aistore ./deploy/helm/aistore \
  --namespace aistore \
  --set image.tag=0.2.0

kubectl -n aistore rollout status statefulset/aistore

The -shutdown-timeout 30s flag ensures in-flight requests complete before each pod terminates.


Monitoring cluster health

If metrics.enabled: true (the default), each pod exposes Prometheus metrics on port 9090. Scrape them directly or configure a ServiceMonitor if you use the Prometheus Operator:

# Quick spot-check from inside the cluster
kubectl -n aistore exec aistore-0 -- \
  wget -qO- http://localhost:9090/metrics | head -40
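If you run the Prometheus Operator, a ServiceMonitor lets Prometheus discover the pods automatically. This is a sketch: the port name (metrics) and the label selector are assumptions to verify against the rendered chart before applying:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aistore
  namespace: aistore
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: aistore   # must match the chart's Service labels
  endpoints:
    - port: metrics                     # assumed name of the 9090 Service port
      interval: 30s
```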

Examples

The examples below are self-contained and runnable. Examples 1, 3, 4, and 5 assume kubectl and Helm are configured against your target cluster; Example 2 requires only Docker on the local host.


Example 1 — Deploy a 3-node cluster with a custom storage class

Override the storage class and disk size for a production NVMe environment:

helm install aistore ./deploy/helm/aistore \
  --namespace aistore \
  --create-namespace \
  --set replicaCount=3 \
  --set persistence.storageClass=nvme-ssd \
  --set persistence.size=500Gi \
  --set resources.limits.memory=16Gi

Expected output:

NAME: aistore
LAST DEPLOYED: Tue Jan  6 10:00:00 2026
NAMESPACE: aistore
STATUS: deployed
REVISION: 1

Example 2 — Run a single Docker node and verify the gRPC endpoint

docker run -d \
  --name aistore-node1 \
  -p 50051:50051 \
  -v /data/aistore:/data/aistore \
  ghcr.io/trilio/aistore:latest \
  -node-id node1 \
  -listen :50051 \
  -data-dir /data/aistore

sleep 5

docker exec aistore-node1 \
  testclient --addr=localhost:50051

Expected output (abbreviated):

Connected to localhost:50051
Put blob OK  key=test-key size=128
Get blob OK  key=test-key size=128
All operations succeeded

Example 3 — Deploy the SPDK DaemonSet on NVMe nodes

# Label the nodes that have NVMe hardware
kubectl label node worker-01 aistore.trilio.io/storage=nvme
kubectl label node worker-02 aistore.trilio.io/storage=nvme

# Apply the DaemonSet
kubectl apply -f deploy/spdk/daemonset.yaml

# Watch init container bind NVMe devices
kubectl -n aistore-spdk logs \
  -l app=aistore,backend=spdk \
  -c spdk-setup

Expected init container output:

Setting up SPDK...
[auto-detect] Binding 0000:01:00.0 to vfio-pci
SPDK setup complete

Example 4 — Scale from 3 to 5 nodes using Helm

helm upgrade aistore ./deploy/helm/aistore \
  --namespace aistore \
  --reuse-values \
  --set replicaCount=5

kubectl -n aistore rollout status statefulset/aistore

Expected output:

Waiting for 2 pods to be ready...
statefulset rolling update complete 5 pods at revision aistore-6d8f7c9b4...

Example 5 — Apply a custom node selector and toleration via Helm

Pin AIStore pods to dedicated storage nodes that carry a NoSchedule taint:

helm upgrade aistore ./deploy/helm/aistore \
  --namespace aistore \
  --reuse-values \
  --set-string nodeSelector."node-role\.kubernetes\.io/storage"=true \
  --set-json 'tolerations=[{"key":"node-role.kubernetes.io/storage","operator":"Exists","effect":"NoSchedule"}]'
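The same pinning can live in a values file, which avoids the escaping needed for dotted keys on the command line. Note the quoted "true": nodeSelector values must be strings, not booleans. The file name is arbitrary:

```yaml
# pinning.yaml: equivalent values-file form of the overrides above
nodeSelector:
  node-role.kubernetes.io/storage: "true"
tolerations:
  - key: node-role.kubernetes.io/storage
    operator: Exists
    effect: NoSchedule
```

Apply with helm upgrade aistore ./deploy/helm/aistore --namespace aistore --reuse-values -f pinning.yaml.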

Troubleshooting

Use the following reference to diagnose and resolve the most common deployment problems.


Problem: Pods stuck in Pending state after Helm install

Symptom: kubectl -n aistore get pods shows Pending for one or more pods indefinitely.

Likely cause: No PersistentVolume can be dynamically provisioned because the storage class does not exist or no nodes satisfy the nodeSelector.

Fix:

# Check PVC status
kubectl -n aistore get pvc

# Describe the pending pod for scheduling events
kubectl -n aistore describe pod aistore-0

Ensure persistence.storageClass in values.yaml matches an existing StorageClass (kubectl get storageclass). If no storage class is set, your cluster must have a default one.


Problem: Pods crash with exec format error or exit immediately

Symptom: Pods enter CrashLoopBackOff; logs show exec format error.

Likely cause: The image was built for linux/amd64 but your nodes run a different architecture (for example, ARM).

Fix: The Dockerfile builds with GOARCH=amd64 explicitly. Rebuild the binary and image for your target architecture or verify your node hardware.

# Inspect the crashed container's logs to confirm the error
kubectl -n aistore logs aistore-0 --previous

Problem: New node cannot join the cluster (seed-node connection refused)

Symptom: Logs on a new pod show repeated connection errors to the seed address.

Likely cause: The first pod (aistore-0) is not yet Ready when subsequent pods start, or the headless service DNS has not propagated.

Fix: The Helm chart uses podManagementPolicy: Parallel, which starts all pods simultaneously. If your cluster has slow DNS, switch to OrderedReady or add an init container that waits for the seed node:

# Check headless service DNS from inside a pod
kubectl -n aistore exec aistore-1 -- \
  nslookup aistore-0.aistore-headless.aistore.svc.cluster.local

If the lookup fails, verify the headless service selector matches pod labels.
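One way to implement the wait is an initContainer that blocks until the seed answers on its gRPC port. This is a sketch; the busybox image is an assumption, and the DNS name follows the conventions used elsewhere in this guide:

```yaml
# initContainer sketch: block pod startup until the seed node is reachable
initContainers:
  - name: wait-for-seed
    image: busybox:1.36
    command:
      - sh
      - -c
      - |
        until nc -z aistore-0.aistore-headless.aistore.svc.cluster.local 50051; do
          echo "waiting for seed node"
          sleep 2
        done
```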


Problem: SPDK init container fails with hugepage allocation error

Symptom: Init container exits non-zero; logs show cannot allocate hugepages.

Likely cause: The kernel hugepage pool on that node is not large enough, or the namespace's LimitRange denies the requested hugepages-1Gi resource.

Fix:

# Check current hugepage availability on the node
cat /proc/meminfo | grep Huge

# Increase the pool (run on the node, or via a privileged DaemonSet)
echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

The SPDK manifest requests and limits hugepages-1Gi: 8Gi. Ensure the node has at least 8 free 1 GiB hugepages and that the namespace LimitRange allows up to 16Gi.
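Runtime allocation of 1 GiB pages often fails on a long-running node because physically contiguous memory becomes fragmented, so allocating them at boot is more reliable. A sketch for GRUB-based distributions (paths and tooling differ on other boot loaders):

```
# Add to GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate the
# bootloader config (update-grub or grub2-mkconfig) and reboot:
default_hugepagesz=1G hugepages=8
```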


Problem: Health check fails (/healthz returns connection refused)

Symptom: Docker HEALTHCHECK or Kubernetes liveness probe marks the container unhealthy shortly after startup.

Likely cause: The gRPC server did not finish initialising before the probe fired, or the -listen address does not match the probe target.

Fix: Increase livenessProbe.initialDelaySeconds to give the node more startup time:

helm upgrade aistore ./deploy/helm/aistore \
  --namespace aistore \
  --reuse-values \
  --set livenessProbe.initialDelaySeconds=30

For Docker, adjust the --health-start-period flag in your docker run command or the start_period key in your docker-compose.yml.


Problem: Rolling update causes quorum loss

Symptom: During helm upgrade, more than one pod goes offline simultaneously and the cluster stops serving requests.

Likely cause: The Pod Disruption Budget minimum (minAvailable: 2) was not enforced, or the StatefulSet update strategy replaced too many pods at once.

Fix: Verify the PDB is in place and check its status:

kubectl -n aistore get pdb
kubectl -n aistore describe pdb aistore

If minAvailable was changed to a value lower than 2 for a 3-node cluster, reset it:

helm upgrade aistore ./deploy/helm/aistore \
  --namespace aistore \
  --reuse-values \
  --set podDisruptionBudget.minAvailable=2