Troubleshooting
Common API errors and debugging
Use this runbook to diagnose and resolve common API errors, connectivity failures, and operational issues in a Nabu Store (AIStore) cluster so that blob operations return to a healthy state.
This runbook covers gRPC API error codes and their remediation, connection failures between clients and nodes, blob operation failures (Put, Get, Delete, Stat, List), replication and erasure-coding errors surfaced through the API, and basic cluster-level diagnostic steps. It does not cover hardware-level SPDK or CXL driver debugging, Kubernetes infrastructure failures outside the AIStore pods, or performance tuning.
Before you begin, confirm the following:
- The Nabu Store (AIStore) binaries are built and deployed (the aistore server and the testclient diagnostic binary).
- gRPC client tooling is available: the testclient binary or a compatible gRPC client (grpcurl, a language SDK).
- The cluster endpoint is known (default: localhost:50051; each node listens on its own port).
- kubectl access (for Kubernetes deployments) with permissions to exec into pods and read logs.
- You have read access to the node's data directory (default: /data/aistore) and index database (/data/aistore/index.db).
- The Go toolchain is available if you need to rebuild diagnostic binaries (go build).
- Network access from your workstation to all node gRPC ports (50051–50053 in the default three-node cluster).
Work through these steps in order. Stop as soon as the issue is resolved and proceed to Verification.
Step 1 — Confirm the node is reachable
Run the testclient binary against the target node. A successful connection prints Connected! before any operation output.
./testclient --addr=<node-host>:50051
Success: Connected! followed by test output.
Failure — Failed to connect: The gRPC dial timed out or was refused. Check:
- The node process is running: ps aux | grep aistore or kubectl get pods.
- The --listen flag matches the port you are targeting (default: 50051).
- No firewall or Kubernetes NetworkPolicy is blocking the port.
- In Kubernetes, exec into the pod and test loopback: kubectl exec -it <pod> -- testclient --addr=localhost:50051.
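If you prefer to script this check, the sketch below performs the same dial-and-wait that testclient does before printing Connected!. It is a minimal example that assumes a plaintext (insecure) gRPC connection and uses an illustrative 10-second dial timeout; adjust both for your deployment.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Assumption: the node accepts plaintext gRPC on this address.
	addr := "localhost:50051"

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Block until the connection is established or the deadline expires.
	conn, err := grpc.DialContext(ctx, addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock())
	if err != nil {
		log.Fatalf("Failed to connect to %s: %v", addr, err)
	}
	defer conn.Close()
	fmt.Println("Connected!")
}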
Step 2 — Read the node logs
Node logs are the primary source of error detail. Retrieve them before attempting any remediation.
# Bare-metal / Docker
journalctl -u aistore --since "10 minutes ago"
# Kubernetes
kubectl logs <pod-name> --tail=200
kubectl logs <pod-name> --previous # if the pod restarted
Look for lines containing ERROR, WARN, or gRPC status strings such as codes.NotFound, codes.Unavailable, or codes.Internal.
Step 3 — Diagnose by gRPC status code
Match the status code in your client error to the relevant sub-procedure below.
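If your client code needs to branch on these codes programmatically, the standard grpc-go status and codes packages expose them directly. A minimal sketch that maps an RPC error to the sub-procedures below:

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// classify maps a gRPC client error from any blob RPC (Put, Get, Stat,
// Delete, List) to the remediation branch it falls under in this step.
func classify(err error) string {
	st, ok := status.FromError(err)
	if !ok {
		return "not a gRPC status error: check the client itself"
	}
	switch st.Code() {
	case codes.Unavailable:
		return "node or backend not ready"
	case codes.NotFound:
		return "blob does not exist"
	case codes.Internal:
		return "server-side failure"
	case codes.InvalidArgument:
		return "malformed request"
	case codes.DeadlineExceeded:
		return "request timed out"
	default:
		return "unhandled code " + st.Code().String() + ": escalate"
	}
}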
codes.Unavailable — Node or backend not ready
- Confirm the aistore process is running (Step 1).
- If using the SPDK backend (--backend=spdk), verify the SPDK environment is initialised and the NVMe device is visible to the process. A missing device prevents the storage layer from starting.
- If using CXL tiering (--enable-cxl=true), check that the CXL device is mounted and the kernel module is loaded. The node will log a startup error and continue without CXL, but misconfigured paths can cause backend initialisation to block.
- Restart the node after fixing the underlying cause:
# Kubernetes
kubectl rollout restart deployment/<aistore-deployment>
codes.NotFound — Blob does not exist
- Verify the blob ID used in the request is correct. IDs are 16-byte values returned in a Put response (putResp.Id).
- Run Stat with the same ID to confirm whether the blob is truly absent:
./testclient --addr=<node>:50051
# The testclient runs Stat as Test 2 and Test 6 automatically.
- If the blob was stored under a replication policy (replica2, replica3) or erasure coding (ec42, ec82), query all nodes — the consistent-hash ring routes requests, and a misconfigured or recently-joined node may not have the index entry. Check seed node connectivity (a client-side sketch of this per-node check follows this list):
# Confirm the ring is healthy by running testclient against each node
kubectl exec -it aistore-node1 -- testclient --addr=localhost:50051
kubectl exec -it aistore-node2 -- testclient --addr=localhost:50052
kubectl exec -it aistore-node3 -- testclient --addr=localhost:50053
- If the index (BoltDB at /data/aistore/index.db) was lost or corrupted, the blob metadata is gone even if the raw data exists on disk. This requires data recovery — see Escalation.
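The exact client API depends on the protobuf package generated for your build, so the following per-node Stat loop is only a rough sketch: the import path, the NewBlobStoreClient constructor, and the StatRequest/response field names are hypothetical placeholders and may differ from the real generated code.

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "example.com/nabustore/proto" // hypothetical import path for the generated API
)

// findBlob asks every node whether it has the blob, which helps distinguish
// a genuinely missing blob from a ring-membership or routing problem.
func findBlob(ctx context.Context, nodes []string, blobID []byte) {
	for _, addr := range nodes {
		conn, err := grpc.DialContext(ctx, addr,
			grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err != nil {
			log.Printf("%s: dial failed: %v", addr, err)
			continue
		}
		// Hypothetical generated client and message names.
		client := pb.NewBlobStoreClient(conn)
		resp, err := client.Stat(ctx, &pb.StatRequest{Id: blobID})
		conn.Close()
		switch {
		case err != nil:
			log.Printf("%s: Stat error: %v", addr, err)
		case resp.Exists:
			log.Printf("%s: blob present, size=%d", addr, resp.Size)
		default:
			log.Printf("%s: blob not found", addr)
		}
	}
}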
codes.Internal — Server-side failure
- Collect full log output (Step 2). Internal errors always produce a server-side log entry with a stack trace or cause string.
- Common causes:
  - Disk full: Check available space on the data directory with df -h /data/aistore.
  - BoltDB lock conflict: Only one aistore process may own the index database. Ensure no duplicate processes are running against the same --index-path.
  - Erasure coding parity failure: With ec42 or ec82 policies, the cluster needs at least 4 or 8 live data-shard nodes respectively. A node failure below that threshold causes reconstruction to fail with an internal error. Restore the missing node or reduce the policy.
- After resolving the root cause, retry the failed operation.
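Transient Internal (and Unavailable) responses after the fix can be retried rather than failed hard. A minimal retry-with-backoff sketch; the op callback is a placeholder for whichever blob RPC originally failed, and the backoff values are illustrative:

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// retry re-runs op with exponential backoff, giving up after maxAttempts.
// Only Unavailable and Internal are treated as retryable here; adjust as needed.
func retry(ctx context.Context, maxAttempts int, op func(context.Context) error) error {
	backoff := 250 * time.Millisecond
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(ctx); err == nil {
			return nil
		}
		code := status.Code(err)
		if code != codes.Unavailable && code != codes.Internal {
			return err // not retryable: surface immediately
		}
		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}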
codes.InvalidArgument — Malformed request
- Confirm the ReplicationPolicy enum value is one of the accepted strings: none, replica2, replica3, ec42, ec82.
./testclient --addr=<node>:50051 --policy=replica3
- Confirm the blob ID passed to Get, Stat, or Delete is a valid 16-byte value returned by a prior Put.
- For List requests, confirm Limit is a positive integer.
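Most of these mistakes can be caught on the client before the RPC is sent. An illustrative validation helper built only from the constraints listed above; the function and parameter names are placeholders, not part of the Nabu Store API:

import (
	"errors"
	"fmt"
)

// Accepted ReplicationPolicy strings, as listed above.
var validPolicies = map[string]bool{
	"none": true, "replica2": true, "replica3": true, "ec42": true, "ec82": true,
}

// validateRequest checks the fields that most often trigger codes.InvalidArgument.
func validateRequest(blobID []byte, policy string, listLimit int) error {
	if len(blobID) != 16 {
		return fmt.Errorf("blob ID must be 16 bytes, got %d", len(blobID))
	}
	if !validPolicies[policy] {
		return fmt.Errorf("unknown replication policy %q", policy)
	}
	if listLimit <= 0 {
		return errors.New("List limit must be a positive integer")
	}
	return nil
}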
codes.DeadlineExceeded — Request timed out
- The default client timeout in testclient is 120 seconds. For large blobs (--size flag) or slow backends, this may be insufficient.
- For production clients, increase the context deadline:
ctx, cancel := context.WithTimeout(context.Background(), 300*time.Second)
- Check node CPU and I/O utilisation. SPDK and CXL backends are designed for sub-millisecond latency; unexpected timeouts may indicate a misconfigured backend falling back to synchronous I/O.
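A slightly fuller version of that snippet, showing the deferred cancel and how to recognise the timeout in the returned error. It assumes the same status, codes, and log imports as the earlier sketches; doGet is a placeholder for whichever blob RPC is timing out, and 300 seconds is just an example value:

ctx, cancel := context.WithTimeout(context.Background(), 300*time.Second)
defer cancel()

err := doGet(ctx) // placeholder for the failing blob RPC
if status.Code(err) == codes.DeadlineExceeded {
	// The server did not answer within the deadline; check backend latency
	// before simply raising the deadline further.
	log.Printf("request timed out: %v", err)
}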
Step 4 — Verify data-directory permissions
The aistore process must have read/write access to both --data-dir and --index-path.
ls -la /data/aistore
# Expected: owned by the user running the aistore process
If permissions are wrong:
chown -R <aistore-user>:<aistore-user> /data/aistore
Then restart the node.
Step 5 — Verify cluster ring membership
After a node failure or restart, confirm that the node has re-joined the consistent-hash ring. A node that is running but not in the ring will accept connections but return codes.NotFound for blobs it should serve.
- Check that --seed-nodes is correctly configured with at least one live peer address.
- Inspect logs for join or ClusterService messages confirming ring membership.
- If the node is not rejoining, restart it with the correct --seed-nodes value:
./aistore --node-id=node2 --listen=:50052 --data-dir=/data/aistore \
  --seed-nodes=node1:50051,node3:50053
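For intuition on why ring membership matters: blob placement is decided by hashing blob IDs onto the ring of member nodes, so a node that has not rejoined is invisible to placement and may lack index entries for blobs it would otherwise own. The sketch below is an illustrative, simplified consistent-hash lookup (FNV hashing, no virtual nodes); it is not Nabu Store's actual ring implementation.

import (
	"hash/fnv"
	"sort"
)

func hash64(b []byte) uint64 {
	h := fnv.New64a()
	h.Write(b)
	return h.Sum64()
}

// ownerOf returns the first member node clockwise from the blob's hash
// position. A node missing from the nodes slice can never be selected,
// which is why an unjoined node appears to have "lost" blobs it should serve.
func ownerOf(blobID []byte, nodes []string) string {
	if len(nodes) == 0 {
		return ""
	}
	type point struct {
		pos  uint64
		node string
	}
	ring := make([]point, 0, len(nodes))
	for _, n := range nodes {
		ring = append(ring, point{pos: hash64([]byte(n)), node: n})
	}
	sort.Slice(ring, func(i, j int) bool { return ring[i].pos < ring[j].pos })

	target := hash64(blobID)
	for _, p := range ring {
		if p.pos >= target {
			return p.node
		}
	}
	return ring[0].node // wrap around the ring
}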
Step 6 — Run the full diagnostic test suite
Once you believe the issue is resolved, run the full testclient test sequence against the affected node. All six tests (Put, Stat, Get, List, Delete, Verify deletion) must pass.
./testclient --addr=<node-host>:<port> --policy=replica3 --size=1048576
Success output ends with:
✓ All tests passed!
Verification
After completing the remediation steps, confirm the following observable outcomes:
- Connectivity: testclient prints Connected! within the dial timeout.
- Put succeeds: A blob ID is returned and printed as a hex string (e.g., Blob ID: a1b2c3...).
- Stat confirms existence: exists=true with a non-zero Size and CreatedAt timestamp.
- Get returns correct data: The retrieved byte count matches the uploaded size, and byte-level verification passes (Data verified!).
- List includes the blob: The blob ID appears in the List response.
- Delete and re-Stat confirm removal: After deletion, Stat returns exists=false or codes.NotFound.
- No ERROR lines in logs during the test window: kubectl logs <pod-name> --since=2m | grep -i error
- All cluster nodes healthy: In a multi-node deployment, running testclient against each node returns ✓ All tests passed! on every node.
Rollback
The troubleshooting steps in this runbook are primarily diagnostic and non-destructive. However, if any of the following actions were taken, use the corresponding rollback:
| Action taken | Rollback |
|---|---|
| Restarted a node | Restore the previous --seed-nodes, --backend, and --enable-cxl flags before restarting again. |
| Changed --data-dir or --index-path | Revert to the original paths and restart the node. The BoltDB index must match the data directory it was built against. |
| Changed ownership (chown) of the data directory | Restore the original ownership: chown -R <original-user> /data/aistore. |
| Applied a kubectl rollout restart | Roll back to the previous pod revision: kubectl rollout undo deployment/<aistore-deployment>. |
| Deleted a blob during testing | Blobs deleted via the Delete RPC cannot be automatically restored. If the blob had a replication policy (replica2, replica3) and the deletion was issued before all replicas were confirmed gone, check peer nodes for surviving copies using Stat with the same blob ID. EC-encoded blobs (ec42, ec82) cannot be reconstructed once the Delete RPC completes successfully across shards. |
Escalation
If this runbook does not resolve the issue, escalate with the following information:
Who to contact:
- Your internal platform or infrastructure team responsible for Nabu Store deployments.
- For bugs or unexpected gRPC status codes not covered here, open an issue in the project repository referencing the Apache-2.0 licensed AIStore codebase.
Information to collect before escalating:
- Full node logs from the time of the first error to now: kubectl logs <pod-name> --since=1h > aistore-node-logs.txt
- Node configuration flags — capture the exact startup command, including --node-id, --listen, --data-dir, --index-path, --backend, --seed-nodes, and --enable-cxl.
- gRPC error string — the full status code, message, and any trailing metadata returned by the client.
- Replication policy in use at the time of failure (none, replica2, replica3, ec42, ec82).
- Cluster topology — number of nodes, their addresses, and which nodes (if any) are currently unavailable.
- Disk and memory state on the affected node:
df -h /data/aistore
free -h
- Kubernetes events if running on Kubernetes:
kubectl describe pod <pod-name>
kubectl get events --sort-by=.lastTimestamp
- Steps already attempted from this runbook and their outcomes.
