Health Checks
Liveness and readiness endpoints
Nabu Store exposes two health-check surfaces: an HTTP endpoint (/health) served on the metrics port, and the standard gRPC Health Checking Protocol served on the main gRPC port. Together these give you a lightweight way to verify that a node is alive and ready to serve traffic — whether you are running a single-node cluster on bare metal, integrating with a Kubernetes liveness/readiness probe, or building an external monitoring pipeline. This page explains how each endpoint works, what its response codes mean, and how to configure probes correctly in your deployment.
Before using the health endpoints, make sure you have:
- A running Nabu Store node (see the single-node cluster quick-start guide).
- Network access to the node's metrics/HTTP port (default `9090`) and its gRPC port (default `50051`).
- A tool capable of making HTTP requests (`curl`, `wget`) and, optionally, a gRPC client or `grpc-health-probe` for the gRPC surface.
- If deploying on Kubernetes: Kubernetes 1.24+ (native `grpc` probe type) or an older release that supports `exec`-based probes.
- The metrics HTTP endpoint must be enabled — it is on by default (`MetricsAddr: ":9090"`). If your deployment sets `MetricsAddr` to an empty string, the HTTP `/health` route is not registered and only the gRPC health service is available.
No additional installation is required. Both health surfaces start automatically when the Nabu Store server starts.
1. Start the server. When you run the Nabu Store binary or pod, it immediately begins listening on both ports:

   ```
   AIStore node node-1 listening on :50051
   Web UI and metrics server listening on :9090
   ```

2. Confirm the HTTP health endpoint is reachable:

   ```
   curl -i http://<node-address>:9090/health
   ```

   A healthy node returns:

   ```
   HTTP/1.1 200 OK
   Content-Type: text/plain; charset=utf-8

   OK
   ```

3. Confirm the gRPC health service is reachable (requires `grpc-health-probe`):

   ```
   grpc-health-probe -addr=<node-address>:50051
   ```

   A healthy node prints:

   ```
   status: SERVING
   ```

4. Kubernetes — enable probes via Helm. The Helm chart configures both probes automatically when the corresponding values are set to `enabled: true` (see Configuration). Render and inspect the generated manifest to verify:

   ```
   helm template my-aistore ./deploy/helm/aistore -f my-values.yaml | grep -A 10 livenessProbe
   ```
HTTP health endpoint
| Config field | Default | Effect |
|---|---|---|
MetricsAddr | ":9090" | Address and port for the HTTP server that serves /health and /metrics. Set to "" to disable the entire HTTP server, which also disables the HTTP health endpoint. |
The /health handler is registered on the same ServeMux as the Prometheus /metrics route, so both share the same port.
gRPC health service
The server registers the standard grpc.health.v1.Health service on the main gRPC port (ListenAddr, default ":50051"). The overall (empty-string service name) status is set to SERVING on startup and to NOT_SERVING when graceful shutdown begins. There is no separate configuration knob; the service is always enabled.
Kubernetes probe configuration (Helm)
The Helm chart exposes these values in values.yaml:
| Value | Default | Description |
|---|---|---|
livenessProbe.enabled | true | Enable or disable the liveness probe. |
livenessProbe.initialDelaySeconds | (set in values) | Seconds after container start before the first probe. Increase this on slow-start hardware. |
livenessProbe.periodSeconds | (set in values) | How frequently Kubernetes re-probes. |
livenessProbe.timeoutSeconds | (set in values) | Probe timeout. |
livenessProbe.failureThreshold | (set in values) | Consecutive failures before the container is restarted. |
readinessProbe.enabled | true | Enable or disable the readiness probe. |
readinessProbe.initialDelaySeconds | (set in values) | Delay before the first readiness probe. |
readinessProbe.periodSeconds | (set in values) | Probe frequency. |
readinessProbe.timeoutSeconds | (set in values) | Probe timeout. |
readinessProbe.failureThreshold | (set in values) | Failures before the pod is removed from service endpoints. |
node.port | (set in values) | The gRPC port probed by both liveness and readiness. |
Both Helm probes use the Kubernetes-native grpc probe type, which targets the gRPC health service directly — no sidecar or exec wrapper is needed on Kubernetes 1.24+.
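For orientation, the probe rendered into the pod spec looks roughly like the following. This is an illustrative sketch, not the chart's exact output; it assumes `node.port` is `50051` and uses example timing values, so verify against `helm template` for your chart version:

```yaml
# Illustrative rendered liveness probe (values are examples, not defaults).
livenessProbe:
  grpc:
    port: 50051        # taken from node.port
  initialDelaySeconds: 15
  periodSeconds: 20
  timeoutSeconds: 5
  failureThreshold: 3
```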
Checking node health from the command line
Poll the HTTP endpoint to confirm a node is healthy before sending API traffic:
curl -sf http://node-1:9090/health && echo "Node is healthy"
The -sf flags suppress progress output and treat HTTP error codes as failures, making this safe to use in scripts and CI pipelines.
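Because `-f` maps HTTP error responses to a non-zero exit status, the pattern composes naturally with shell conditionals. A self-contained sketch of that control flow (no live server involved; `health_check` is a hypothetical stand-in for the `curl -sf` call):

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `curl -sf .../health`: exits with the given code.
health_check() { return "${1:-0}"; }

# Exit 0 (HTTP 2xx): the && branch runs.
health_check 0 && echo "healthy: proceed with API traffic"

# curl exits with code 22 when -f encounters an HTTP error: the || branch runs.
health_check 22 || echo "unhealthy: skip this node"
```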
Distinguishing liveness from readiness
Nabu Store does not expose separate /live and /ready HTTP paths — the single /health route performs a backend storage check and is suitable for both purposes in HTTP-based monitoring:
- Liveness — use `/health` to detect a hard fault (storage backend unreachable). A `503` response indicates the backend has failed and the process should be restarted.
- Readiness — use `/health` to gate traffic. If the check fails, stop routing requests to this node until it recovers.
On Kubernetes, the Helm chart maps both liveness and readiness probes to the gRPC health service (not the HTTP endpoint), which is the recommended approach for gRPC-native workloads. The gRPC status transitions to NOT_SERVING during graceful shutdown so that Kubernetes stops routing traffic before the process exits.
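To give load balancers and service endpoints time to observe the `NOT_SERVING` transition before the process exits, a short drain window can be added to the pod spec. The following fragment is an illustration with example values (the `sleep 5` window and grace period are not chart defaults; tune them for your environment):

```yaml
# Illustrative pod-spec fragment: delay SIGTERM briefly so traffic drains first.
lifecycle:
  preStop:
    exec:
      command: ["sleep", "5"]
terminationGracePeriodSeconds: 30
```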
Watching status changes during shutdown
When you call Stop() or send SIGTERM to a pod, the gRPC health status changes to NOT_SERVING and any in-flight /health HTTP request will return 503. You can observe this with a polling loop:
while true; do
STATUS=$(curl -so /dev/null -w "%{http_code}" http://node-1:9090/health)
echo "$(date +%T) HTTP status: $STATUS"
sleep 2
done
Integrating with load balancers and service meshes
Point your load balancer's health check at GET /health on port 9090. Configure the load balancer to:
- Expect HTTP `200` for a healthy node.
- Remove a node from the pool on `503` or a connection timeout.
- Re-add the node when it returns `200` for a configurable number of consecutive checks.
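As a concrete illustration, a minimal HAProxy backend implementing these rules might look like the sketch below. This is not a tested production configuration; hostnames and thresholds are placeholders, and `fall`/`rise` correspond to the removal and re-add thresholds described above:

```
backend nabu_grpc
    mode tcp
    # Route gRPC traffic to :50051, but health-check the HTTP endpoint on :9090.
    option httpchk GET /health
    http-check expect status 200
    default-server inter 5s fall 3 rise 2
    server node-1 node-1:50051 check port 9090
    server node-2 node-2:50051 check port 9090
```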
Example 1 — Simple HTTP health check on a single-node cluster
Assumes a node started with default settings.
curl -i http://localhost:9090/health
Expected output (healthy):
HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
Date: Mon, 01 Jan 2026 12:00:00 GMT
Content-Length: 2
OK
Example 2 — HTTP health check when the storage backend is unavailable
curl -i http://localhost:9090/health
Expected output (unhealthy):
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain; charset=utf-8
Date: Mon, 01 Jan 2026 12:00:00 GMT
Content-Length: 30
UNHEALTHY: backend error
The body includes the prefix UNHEALTHY: followed by the underlying error description, which you can capture for alerting.
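Since the body follows a fixed `UNHEALTHY: <reason>` shape, the reason is straightforward to extract for an alert payload. A self-contained sketch (the sample body is hard-coded here in place of a live `curl` response):

```shell
#!/usr/bin/env bash
# Sample body, as a failing /health check would return it.
BODY='UNHEALTHY: backend error'

# Strip the fixed "UNHEALTHY: " prefix to isolate the error description.
REASON="${BODY#UNHEALTHY: }"
echo "alert reason: $REASON"
```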
Example 3 — gRPC health probe
grpc-health-probe -addr=localhost:50051
Expected output (healthy):
status: SERVING
Expected output (during graceful shutdown):
healthy: false
error: service unhealthy (responded with "NOT_SERVING")
Example 4 — Script that waits for a node to become healthy
Useful in automation scripts after starting a pod or adding a node to the cluster.
#!/usr/bin/env bash
NODE_URL="http://node-1:9090/health"
MAX_RETRIES=30
SLEEP_SEC=5
for i in $(seq 1 $MAX_RETRIES); do
HTTP_CODE=$(curl -so /dev/null -w "%{http_code}" "$NODE_URL")
if [ "$HTTP_CODE" = "200" ]; then
echo "Node is healthy after $((i * SLEEP_SEC))s"
exit 0
fi
echo "Attempt $i/$MAX_RETRIES — got HTTP $HTTP_CODE, waiting ${SLEEP_SEC}s..."
sleep $SLEEP_SEC
done
echo "Node did not become healthy in time" >&2
exit 1
Expected output (node starts within 20 seconds):
Attempt 1/30 — got HTTP 000, waiting 5s...
Attempt 2/30 — got HTTP 503, waiting 5s...
Attempt 3/30 — got HTTP 503, waiting 5s...
Attempt 4/30 — got HTTP 200, waiting 5s...
Node is healthy after 20s
Example 5 — Kubernetes Helm values enabling both probes
# values-production.yaml
livenessProbe:
enabled: true
initialDelaySeconds: 15
periodSeconds: 20
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
enabled: true
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
Apply with:
helm upgrade --install my-aistore ./deploy/helm/aistore -f values-production.yaml
Kubernetes will probe the gRPC health service on node.port using the native grpc probe type.
Issue: curl /health returns connection refused
Symptom: curl: (7) Failed to connect to localhost port 9090: Connection refused
Likely cause: The HTTP metrics server is disabled. Either MetricsAddr is set to an empty string in the server configuration, or the server has not finished starting.
Fix:
- Verify `MetricsAddr` is set to a non-empty value (e.g., `":9090"`) in your server configuration or Helm values.
- Check server logs for the line `Web UI and metrics server listening on :9090`. If it is absent, the metrics server did not start.
- If the server is still starting up, wait and retry — the listener is registered during `Start()`.
Issue: curl /health returns 503 Service Unavailable with body UNHEALTHY: backend error
Symptom: The HTTP health endpoint returns a 503 status and the body begins with UNHEALTHY:.
Likely cause: The storage backend failed its capacity check. This can happen if the data directory is unmounted, the SPDK socket is unavailable, or the underlying device has an I/O error.
Fix:
- Inspect server logs for storage-related errors.
- Confirm the data directory (`/data/aistore` by default) is mounted and writable.
- If using the SPDK backend, verify the SPDK RPC socket (`SPDKSocket`) exists and the SPDK process is running.
- Check the `nabu_backend_healthy` and `nabu_node_up` Prometheus metrics on `:9090/metrics` for historical trend data.
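These metrics can also drive proactive alerting rather than waiting for a probe failure. An illustrative Prometheus alerting rule (a sketch; it assumes `nabu_backend_healthy` is a gauge that reads 1 when healthy and 0 otherwise — verify the semantics against your `/metrics` output):

```yaml
# Illustrative alerting rule; metric semantics assumed as described above.
groups:
  - name: nabu-health
    rules:
      - alert: NabuBackendUnhealthy
        expr: nabu_backend_healthy == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Nabu Store backend unhealthy on {{ $labels.instance }}"
```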
Issue: grpc-health-probe returns NOT_SERVING
Symptom: healthy: false / service unhealthy (responded with "NOT_SERVING")
Likely cause: The node is in graceful shutdown (it received SIGTERM or Stop() was called), or the gRPC server has not yet registered the health service.
Fix:
- Check whether the process is shutting down: look for `Shutting down AIStore server...` in the logs.
- If the node is shutting down unexpectedly, investigate the cause (OOM kill, crashed background goroutine, pod eviction).
- If this occurs on startup, increase `livenessProbe.initialDelaySeconds` to give the server time to register the health service before probes begin.
Issue: Kubernetes liveness probe fails and restarts the pod in a loop
Symptom: Pod status cycles between Running and CrashLoopBackOff; kubectl describe pod shows liveness probe failures.
Likely cause: initialDelaySeconds is too short for your hardware (e.g., slow NVMe initialization or large data directory on restart), or failureThreshold is set too low.
Fix:
- Increase `livenessProbe.initialDelaySeconds` in your Helm values.
- Increase `livenessProbe.failureThreshold` to tolerate brief storage hiccups without restarting.
- Check pod logs for backend errors that indicate a genuine fault rather than a slow start.
Issue: HTTP health endpoint is healthy but gRPC calls fail
Symptom: GET /health returns 200 OK, but blob store or cluster API calls return errors.
Likely cause: The HTTP health check only verifies backend storage capacity. A partial failure — such as a broken gRPC listener, a crashed internal service, or a ring-state inconsistency — will not be reflected in the HTTP check.
Fix:
- Use `grpc-health-probe` to check the gRPC surface independently.
- Review `/metrics` for elevated error counters (heartbeat failures, gRPC error rates).
- For production Kubernetes deployments, rely on the gRPC probe (configured by the Helm chart) as your primary liveness/readiness signal, and treat the HTTP endpoint as a supplementary check.
