NabuStore
Runbook

Troubleshooting

Common API errors and debugging


Objective

Use this runbook to diagnose and resolve common API errors, connectivity failures, and operational issues in a Nabu Store (AIStore) cluster so that blob operations return to a healthy state.


Scope

This runbook covers gRPC API error codes and their remediation, connection failures between clients and nodes, blob operation failures (Put, Get, Delete, Stat, List), replication and erasure-coding errors surfaced through the API, and basic cluster-level diagnostic steps. It does not cover hardware-level SPDK or CXL driver debugging, Kubernetes infrastructure failures outside the AIStore pods, or performance tuning.


Prerequisites

Before you begin, confirm the following:

  • Nabu Store (AIStore) binary is built and deployed (aistore server, testclient diagnostic binary).
  • gRPC client tooling is available: the testclient binary or a compatible gRPC client (grpcurl, language SDK).
  • Cluster endpoint is known (default: localhost:50051; each node listens on its own port).
  • kubectl access (Kubernetes deployments) with permissions to exec into pods and read logs.
  • You have read access to the node's data directory (default: /data/aistore) and index database (/data/aistore/index.db).
  • The Go toolchain is available if you need to rebuild diagnostic binaries (go build).
  • Network access from your workstation to all node gRPC ports (50051–50053 in the default three-node cluster).

Steps

Work through these steps in order. Stop as soon as the issue is resolved and proceed to Verification.


Step 1 — Confirm the node is reachable

Run the testclient binary against the target node. A successful connection prints Connected! before any operation output.

./testclient --addr=<node-host>:50051

Success: Connected! followed by test output.

Failure — Failed to connect: The gRPC dial timed out or was refused. Check:

  • The node process is running: ps aux | grep aistore or kubectl get pods.
  • The --listen flag matches the port you are targeting (default :50051).
  • No firewall or Kubernetes NetworkPolicy is blocking the port.
  • In Kubernetes, exec into the pod and test loopback: kubectl exec -it <pod> -- testclient --addr=localhost:50051.

Step 2 — Read the node logs

Node logs are the primary source of error detail. Retrieve them before attempting any remediation.

# Bare-metal / Docker
journalctl -u aistore --since "10 minutes ago"

# Kubernetes
kubectl logs <pod-name> --tail=200
kubectl logs <pod-name> --previous   # if the pod restarted

Look for lines containing ERROR, WARN, or gRPC status strings such as codes.NotFound, codes.Unavailable, or codes.Internal.


Step 3 — Diagnose by gRPC status code

Match the status code in your client error to the relevant sub-procedure below.

codes.Unavailable — Node or backend not ready

  1. Confirm the aistore process is running (Step 1).
  2. If using the SPDK backend (--backend=spdk), verify the SPDK environment is initialised and the NVMe device is visible to the process. A missing device prevents the storage layer from starting.
  3. If using CXL tiering (--enable-cxl=true), check that the CXL device is mounted and the kernel module is loaded. The node will log a startup error and continue without CXL, but misconfigured paths can cause backend initialisation to block.
  4. Restart the node after fixing the underlying cause:
    # Kubernetes
    kubectl rollout restart deployment/<aistore-deployment>
    

codes.NotFound — Blob does not exist

  1. Verify the blob ID used in the request is correct. IDs are 16-byte values returned by a Put response (putResp.Id).
  2. Run Stat with the same ID to confirm absence:
    ./testclient --addr=<node>:50051
    # The testclient runs Stat as Test 2 and Test 6 automatically.
    
  3. If the blob was stored under a replication policy (replica2, replica3) or erasure coding (ec42, ec82), query all nodes — the consistent-hash ring routes requests, and a misconfigured or recently-joined node may not have the index entry. Check seed node connectivity:
    # Confirm the ring is healthy by running testclient against each node
    kubectl exec -it aistore-node1 -- testclient --addr=localhost:50051
    kubectl exec -it aistore-node2 -- testclient --addr=localhost:50052
    kubectl exec -it aistore-node3 -- testclient --addr=localhost:50053
    
  4. If the index (BoltDB at /data/aistore/index.db) was lost or corrupted, the blob metadata is gone even if the raw data exists on disk. This requires data recovery — see Escalation.
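The routing behaviour behind step 3 can be illustrated with a minimal consistent-hash ring. This is an illustrative sketch only, not the store's actual implementation (the node names and FNV hash are assumptions); it shows why a lookup through a node with a stale ring view can land on the wrong owner and return codes.NotFound:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring places each node at a hash point; a lookup walks clockwise to
// the first point at or after the key's hash (wrapping at the end).
type ring struct {
	points []uint32
	owner  map[uint32]string
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(nodes []string) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, n := range nodes {
		p := hashKey(n)
		r.points = append(r.points, p)
		r.owner[p] = n
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

func (r *ring) nodeFor(blobID string) string {
	h := hashKey(blobID)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	full := newRing([]string{"node1", "node2", "node3"})
	stale := newRing([]string{"node1", "node3"}) // as if node2 never joined
	id := "a1b2c3d4"
	fmt.Println("full ring routes to: ", full.nodeFor(id))
	fmt.Println("stale ring routes to:", stale.nodeFor(id))
}
```

Any key whose owner differs between the two rings will be served by the wrong node until the ring converges, which is why step 3 queries every node.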

codes.Internal — Server-side failure

  1. Collect full log output (Step 2). Internal errors are accompanied by a server-side log entry with a stack trace or cause string.
  2. Common causes:
    • Disk full: Check available space on the data directory. df -h /data/aistore.
    • BoltDB lock conflict: Only one aistore process may own the index database. Ensure no duplicate processes are running against the same --index-path.
    • Erasure-coding reconstruction failure: With ec42 or ec82 policies, the cluster needs at least 4 or 8 live data-shard nodes respectively. If failures push the live-node count below that threshold, reconstruction fails with an internal error. Restore the missing nodes or re-store the affected blobs under a less demanding policy.
  3. After resolving the root cause, retry the failed operation.
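The shard arithmetic behind the erasure-coding cause above can be sketched as a quick feasibility check. The shard counts are inferred from the policy names (ec42 = 4 data + 2 parity, ec82 = 8 + 2); treat the helper as an assumption for reasoning about failures, not a store API:

```go
package main

import "fmt"

// ecShape returns the data and parity shard counts for a policy name.
func ecShape(policy string) (data, parity int, ok bool) {
	switch policy {
	case "ec42":
		return 4, 2, true
	case "ec82":
		return 8, 2, true
	}
	return 0, 0, false
}

// canReconstruct reports whether a blob is recoverable: the cluster
// needs at least `data` live shards out of the data+parity total.
func canReconstruct(policy string, liveShards int) bool {
	data, _, ok := ecShape(policy)
	return ok && liveShards >= data
}

func main() {
	// ec42 writes 6 shards in total; losing up to 2 is survivable.
	fmt.Println(canReconstruct("ec42", 4)) // true: at threshold
	fmt.Println(canReconstruct("ec42", 3)) // false: Get fails with codes.Internal
}
```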

codes.InvalidArgument — Malformed request

  1. Confirm the ReplicationPolicy enum value is one of the accepted strings: none, replica2, replica3, ec42, ec82.
    ./testclient --addr=<node>:50051 --policy=replica3
    
  2. Confirm the blob ID passed to Get, Stat, or Delete is a valid 16-byte value returned by a prior Put.
  3. For List requests, confirm Limit is a positive integer.
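All three checks above can be applied client-side before the RPC is issued. A hedged sketch: the helper names are illustrative and mirror the checks listed, not the store's generated protobuf types:

```go
package main

import (
	"errors"
	"fmt"
)

// validPolicies lists the accepted ReplicationPolicy strings.
var validPolicies = map[string]bool{
	"none": true, "replica2": true, "replica3": true,
	"ec42": true, "ec82": true,
}

// These helpers mirror the server's InvalidArgument checks.
func validatePolicy(p string) error {
	if !validPolicies[p] {
		return fmt.Errorf("unknown replication policy %q", p)
	}
	return nil
}

func validateBlobID(id []byte) error {
	if len(id) != 16 {
		return fmt.Errorf("blob ID must be 16 bytes, got %d", len(id))
	}
	return nil
}

func validateListLimit(limit int) error {
	if limit <= 0 {
		return errors.New("List limit must be a positive integer")
	}
	return nil
}

func main() {
	fmt.Println(validatePolicy("replica3"))      // <nil>
	fmt.Println(validateBlobID(make([]byte, 8))) // 16-byte check fails
}
```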

codes.DeadlineExceeded — Request timed out

  1. The default client timeout in testclient is 120 seconds. For large blobs (--size flag) or slow backends, this may be insufficient.
  2. For production clients, increase the context deadline:
    ctx, cancel := context.WithTimeout(context.Background(), 300*time.Second)
    defer cancel()
    
  3. Check node CPU and I/O utilisation. SPDK and CXL backends are designed for sub-millisecond latency; unexpected timeouts may indicate a misconfigured backend falling back to synchronous I/O.

Step 4 — Verify data-directory permissions

The aistore process must have read/write access to both --data-dir and --index-path.

ls -la /data/aistore
# Expected: owned by the user running the aistore process

If permissions are wrong:

chown -R <aistore-user>:<aistore-user> /data/aistore

Then restart the node.


Step 5 — Verify cluster ring membership

After a node failure or restart, confirm that the node has re-joined the consistent-hash ring. A node that is running but not in the ring will accept connections but return codes.NotFound for blobs it should serve.

  1. Check that --seed-nodes is correctly configured with at least one live peer address.
  2. Inspect logs for join or ClusterService messages confirming ring membership.
  3. If the node is not rejoining, restart it with the correct --seed-nodes value:
    ./aistore --node-id=node2 --listen=:50052 --data-dir=/data/aistore \
      --seed-nodes=node1:50051,node3:50053
    

Step 6 — Run the full diagnostic test suite

Once you believe the issue is resolved, run the full testclient test sequence against the affected node. All six tests (Put, Stat, Get, List, Delete, Verify deletion) must pass.

./testclient --addr=<node-host>:<port> --policy=replica3 --size=1048576

Success output ends with:

✓ All tests passed!

Verification

After completing the remediation steps, confirm the following observable outcomes:

  1. Connectivity: testclient prints Connected! within the dial timeout.
  2. Put succeeds: A blob ID is returned and printed as a hex string (e.g., Blob ID: a1b2c3...).
  3. Stat confirms existence: exists=true with a non-zero Size and CreatedAt timestamp.
  4. Get returns correct data: Retrieved byte count matches the uploaded size, and byte-level verification passes (Data verified!).
  5. List includes the blob: The blob ID appears in the List response.
  6. Delete and re-Stat confirm removal: After deletion, Stat returns exists=false or codes.NotFound.
  7. No ERROR lines in logs during the test window:
    kubectl logs <pod-name> --since=2m | grep -i error
    
  8. All cluster nodes healthy: In a multi-node deployment, running testclient against each node returns ✓ All tests passed!.

Rollback

The troubleshooting steps in this runbook are primarily diagnostic and non-destructive. However, if any of the following actions were taken, use the corresponding rollback:

  • Restarted a node → Restore the previous --seed-nodes, --backend, and --enable-cxl flags before restarting again.
  • Changed --data-dir or --index-path → Revert to the original paths and restart the node. The BoltDB index must match the data directory it was built against.
  • Ran chown on the data directory → Restore the original ownership: chown -R <original-user> /data/aistore.
  • Applied a kubectl rollout restart → Roll back to the previous pod revision: kubectl rollout undo deployment/<aistore-deployment>.
  • Deleted a blob during testing → Blobs deleted via the Delete RPC cannot be automatically restored. If the blob had a replication policy (replica2, replica3) and the deletion was issued before all replicas were confirmed gone, check peer nodes for surviving copies using Stat with the same blob ID. EC-encoded blobs (ec42, ec82) cannot be reconstructed once the Delete RPC completes successfully across shards.

Escalation

If this runbook does not resolve the issue, escalate with the following information:

Who to contact:

  • Your internal platform or infrastructure team responsible for Nabu Store deployments.
  • For bugs or unexpected gRPC status codes not covered here, open an issue in the project repository referencing the Apache-2.0 licensed AIStore codebase.

Information to collect before escalating:

  1. Full node logs from the time of the first error to now:
    kubectl logs <pod-name> --since=1h > aistore-node-logs.txt
    
  2. Node configuration flags — capture the exact startup command, including --node-id, --listen, --data-dir, --index-path, --backend, --seed-nodes, and --enable-cxl.
  3. gRPC error string — the full status code, message, and any trailing metadata returned by the client.
  4. Replication policy in use at the time of failure (none, replica2, replica3, ec42, ec82).
  5. Cluster topology — number of nodes, their addresses, and which nodes (if any) are currently unavailable.
  6. Disk and memory state on the affected node:
    df -h /data/aistore
    free -h
    
  7. Kubernetes events if running on Kubernetes:
    kubectl describe pod <pod-name>
    kubectl get events --sort-by=.lastTimestamp
    
  8. Steps already attempted from this runbook and their outcomes.