Health Checks

Liveness and readiness endpoints


Overview

Nabu Store exposes two health-check surfaces: an HTTP endpoint (/health) served on the metrics port, and the standard gRPC Health Checking Protocol served on the main gRPC port. Together these give you a lightweight way to verify that a node is alive and ready to serve traffic — whether you are running a single-node cluster on bare metal, integrating with a Kubernetes liveness/readiness probe, or building an external monitoring pipeline. This page explains how each endpoint works, what its response codes mean, and how to configure probes correctly in your deployment.


Prerequisites

Before using the health endpoints, make sure you have:

  • A running Nabu Store node (see the single-node cluster quick-start guide).
  • Network access to the node's metrics/HTTP port (default 9090) and its gRPC port (default 50051).
  • A tool capable of making HTTP requests (curl, wget) and, optionally, a gRPC client or grpc-health-probe for the gRPC surface.
  • If deploying on Kubernetes: Kubernetes 1.24+ (native grpc probe type) or an older release that supports exec-based probes.
  • The metrics HTTP endpoint must be enabled — it is on by default (MetricsAddr: ":9090"). If your deployment sets MetricsAddr to an empty string, the HTTP /health route is not registered and only the gRPC health service is available.

Installation

No additional installation is required. Both health surfaces start automatically when the Nabu Store server starts.

  1. Start the server. When you run the Nabu Store binary or pod, it immediately begins listening on both ports:

    AIStore node node-1 listening on :50051
    Web UI and metrics server listening on :9090
    
  2. Confirm the HTTP health endpoint is reachable:

    curl -i http://<node-address>:9090/health
    

    A healthy node returns:

    HTTP/1.1 200 OK
    Content-Type: text/plain; charset=utf-8
    
    OK
    
  3. Confirm the gRPC health service is reachable (requires grpc-health-probe):

    grpc-health-probe -addr=<node-address>:50051
    

    A healthy node prints:

    status: SERVING
    
  4. Kubernetes — enable probes via Helm. The Helm chart configures both probes automatically when the corresponding values are set to enabled: true (see Configuration). Render and inspect the generated manifest to verify:

    helm template my-aistore ./deploy/helm/aistore -f my-values.yaml | grep -A 10 livenessProbe
    

Configuration

HTTP health endpoint

Config field   Default    Effect
MetricsAddr    ":9090"    Address and port for the HTTP server that serves /health and /metrics. Set to "" to disable the entire HTTP server, which also disables the HTTP health endpoint.

The /health handler is registered on the same ServeMux as the Prometheus /metrics route, so both share the same port.
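
For orientation, here is a minimal sketch in Go of that wiring pattern: one net/http ServeMux serving both a health handler and the Prometheus handler on a single port. It is illustrative only, not NabuStore's actual code; checkBackend is a hypothetical stand-in for the backend storage check.

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// checkBackend is a hypothetical stand-in for the backend storage check.
func checkBackend() error { return nil }

func main() {
    mux := http.NewServeMux()

    // /metrics and /health are registered on the same ServeMux, so they share one port.
    mux.Handle("/metrics", promhttp.Handler())
    mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        if err := checkBackend(); err != nil {
            // Unhealthy: 503 with an "UNHEALTHY:" prefix, matching the examples below.
            http.Error(w, fmt.Sprintf("UNHEALTHY: %v", err), http.StatusServiceUnavailable)
            return
        }
        w.Write([]byte("OK")) // 200 OK with a plain-text body
    })

    // MetricsAddr defaults to ":9090"; when it is empty, this server is never started.
    log.Fatal(http.ListenAndServe(":9090", mux))
}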


gRPC health service

The server registers the standard grpc.health.v1.Health service on the main gRPC port (ListenAddr, default ":50051"). The overall (empty-string service name) status is set to SERVING on startup and to NOT_SERVING when graceful shutdown begins. There is no separate configuration knob; the service is always enabled.
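
The grpc-go ecosystem ships a stock implementation of this protocol, and the sketch below shows roughly what such a registration and the SERVING/NOT_SERVING transition look like. It is a generic illustration of the standard health package, not NabuStore's source.

package main

import (
    "log"
    "net"

    "google.golang.org/grpc"
    "google.golang.org/grpc/health"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
    lis, err := net.Listen("tcp", ":50051") // ListenAddr default
    if err != nil {
        log.Fatal(err)
    }

    server := grpc.NewServer()
    healthSrv := health.NewServer()

    // Register the standard grpc.health.v1.Health service on the main gRPC port.
    healthpb.RegisterHealthServer(server, healthSrv)

    // Overall status (empty service name) is SERVING once startup completes.
    healthSrv.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)

    // On graceful shutdown, flip to NOT_SERVING before stopping the server so
    // probes start failing ahead of the process exiting:
    //   healthSrv.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
    //   server.GracefulStop()

    log.Fatal(server.Serve(lis))
}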


Kubernetes probe configuration (Helm)

The Helm chart exposes these values in values.yaml:

Value                               Default          Description
livenessProbe.enabled               true             Enable or disable the liveness probe.
livenessProbe.initialDelaySeconds   (set in values)  Seconds after container start before the first probe. Increase this on slow-start hardware.
livenessProbe.periodSeconds         (set in values)  How frequently Kubernetes re-probes.
livenessProbe.timeoutSeconds        (set in values)  Probe timeout.
livenessProbe.failureThreshold      (set in values)  Consecutive failures before the container is restarted.
readinessProbe.enabled              true             Enable or disable the readiness probe.
readinessProbe.initialDelaySeconds  (set in values)  Delay before the first readiness probe.
readinessProbe.periodSeconds        (set in values)  Probe frequency.
readinessProbe.timeoutSeconds       (set in values)  Probe timeout.
readinessProbe.failureThreshold     (set in values)  Failures before the pod is removed from service endpoints.
node.port                           (set in values)  The gRPC port probed by both liveness and readiness probes.

Both Helm probes use the Kubernetes-native grpc probe type, which targets the gRPC health service directly — no sidecar or exec wrapper is needed on Kubernetes 1.24+.


Usage

Checking node health from the command line

Poll the HTTP endpoint to confirm a node is healthy before sending API traffic:

curl -sf http://node-1:9090/health && echo "Node is healthy"

The -sf flags suppress progress output and treat HTTP error codes as failures, making this safe to use in scripts and CI pipelines.


Distinguishing liveness from readiness

Nabu Store does not expose separate /live and /ready HTTP paths — the single /health route performs a backend storage check and is suitable for both purposes in HTTP-based monitoring:

  • Liveness — use /health to detect a hard fault (storage backend unreachable). A 503 response indicates the backend has failed and the process should be restarted.
  • Readiness — use /health to gate traffic. If the check fails, stop routing requests to this node until it recovers.

On Kubernetes, the Helm chart maps both liveness and readiness probes to the gRPC health service (not the HTTP endpoint), which is the recommended approach for gRPC-native workloads. The gRPC status transitions to NOT_SERVING during graceful shutdown so that Kubernetes stops routing traffic before the process exits.


Watching status changes during shutdown

When you call Stop() or send SIGTERM to a pod, the gRPC health status changes to NOT_SERVING and the HTTP /health endpoint begins returning 503. You can observe this with a polling loop:

while true; do
  STATUS=$(curl -so /dev/null -w "%{http_code}" http://node-1:9090/health)
  echo "$(date +%T) HTTP status: $STATUS"
  sleep 2
done
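
Polling works for HTTP, but the grpc.health.v1 protocol also defines a server-streaming Watch RPC, so a gRPC client can receive the SERVING to NOT_SERVING transition as a push instead. The Go sketch below assumes plaintext connectivity to port 50051 and a health implementation that streams updates, as the stock grpc-go health server does.

package main

import (
    "context"
    "log"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
    conn, err := grpc.NewClient("node-1:50051",
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    // Watch the overall status (empty service name); every change is streamed.
    stream, err := healthpb.NewHealthClient(conn).Watch(context.Background(),
        &healthpb.HealthCheckRequest{Service: ""})
    if err != nil {
        log.Fatal(err)
    }
    for {
        resp, err := stream.Recv()
        if err != nil {
            log.Fatalf("watch ended: %v", err)
        }
        // Expect SERVING at first, then NOT_SERVING when shutdown begins.
        log.Printf("health status: %s", resp.GetStatus())
    }
}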

Integrating with load balancers and service meshes

Point your load balancer's health check at GET /health on port 9090. Configure the load balancer to:

  • Expect HTTP 200 for a healthy node.
  • Remove a node from the pool on 503 or a connection timeout.
  • Re-add the node when it returns 200 for a configurable number of consecutive checks.
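
If your load balancer cannot express the consecutive-success rule directly, the same policy can live in a small external monitor. The following Go sketch is one illustrative way to do it; the node URL, threshold, and interval are assumptions, and the log calls stand in for whatever pool-management API you use.

package main

import (
    "log"
    "net/http"
    "time"
)

const (
    healthURL         = "http://node-1:9090/health" // assumed node address
    requiredSuccesses = 3                           // consecutive 200s before re-adding
    interval          = 5 * time.Second
)

func main() {
    client := &http.Client{Timeout: 2 * time.Second}
    healthy := false
    successes := 0

    for {
        resp, err := client.Get(healthURL)
        ok := err == nil && resp.StatusCode == http.StatusOK
        if resp != nil {
            resp.Body.Close()
        }

        switch {
        case !ok:
            // Any 503 or connection failure removes the node immediately.
            successes = 0
            if healthy {
                healthy = false
                log.Println("removing node from pool")
            }
        case !healthy:
            // Re-add only after the required number of consecutive healthy checks.
            successes++
            if successes >= requiredSuccesses {
                healthy = true
                log.Println("re-adding node to pool")
            }
        }
        time.Sleep(interval)
    }
}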

Examples

Example 1 — Simple HTTP health check on a single-node cluster

Assumes a node started with default settings.

curl -i http://localhost:9090/health

Expected output (healthy):

HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
Date: Mon, 01 Jan 2026 12:00:00 GMT
Content-Length: 2

OK

Example 2 — HTTP health check when the storage backend is unavailable

curl -i http://localhost:9090/health

Expected output (unhealthy):

HTTP/1.1 503 Service Unavailable
Content-Type: text/plain; charset=utf-8
Date: Mon, 01 Jan 2026 12:00:00 GMT
Content-Length: 24

UNHEALTHY: backend error

The body includes the prefix UNHEALTHY: followed by the underlying error description, which you can capture for alerting.
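
To capture that text programmatically, you can strip the documented UNHEALTHY: prefix and forward the remainder to your alerting system. The Go helper below is a sketch based on the response format shown above; the node URL is an assumption.

package main

import (
    "fmt"
    "io"
    "net/http"
    "strings"
)

// healthError returns an empty string when /health answers 200, or the error
// text taken from the "UNHEALTHY: <reason>" body when it answers 503.
func healthError(url string) (string, error) {
    resp, err := http.Get(url)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }
    if resp.StatusCode == http.StatusOK {
        return "", nil
    }
    // Strip the documented prefix and keep the underlying error description.
    return strings.TrimPrefix(strings.TrimSpace(string(body)), "UNHEALTHY: "), nil
}

func main() {
    reason, err := healthError("http://localhost:9090/health")
    if err != nil {
        fmt.Println("probe failed:", err)
    } else if reason != "" {
        fmt.Println("node unhealthy:", reason) // e.g. feed this into your alerting pipeline
    }
}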


Example 3 — gRPC health probe

grpc-health-probe -addr=localhost:50051

Expected output (healthy):

status: SERVING

Expected output (during graceful shutdown):

healthy: false
error: service unhealthy (responded with "NOT_SERVING")
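
If shipping the grpc-health-probe binary is inconvenient (for example, on clusters that still rely on exec-based probes), the same one-shot check can be done with a short Go program against the standard health API. This sketch assumes plaintext gRPC on port 50051.

package main

import (
    "context"
    "fmt"
    "os"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
    conn, err := grpc.NewClient("localhost:50051",
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer conn.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()

    // Empty service name asks for the overall server status.
    resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{Service: ""})
    if err != nil {
        fmt.Fprintln(os.Stderr, "check failed:", err)
        os.Exit(1)
    }

    fmt.Println("status:", resp.GetStatus())
    if resp.GetStatus() != healthpb.HealthCheckResponse_SERVING {
        os.Exit(1) // non-zero exit mirrors grpc-health-probe behavior for unhealthy nodes
    }
}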

Example 4 — Script that waits for a node to become healthy

Useful in automation scripts after starting a pod or adding a node to the cluster.

#!/usr/bin/env bash
NODE_URL="http://node-1:9090/health"
MAX_RETRIES=30
SLEEP_SEC=5

for i in $(seq 1 $MAX_RETRIES); do
  HTTP_CODE=$(curl -so /dev/null -w "%{http_code}" "$NODE_URL")
  if [ "$HTTP_CODE" = "200" ]; then
    echo "Node is healthy after $((i * SLEEP_SEC))s"
    exit 0
  fi
  echo "Attempt $i/$MAX_RETRIES — got HTTP $HTTP_CODE, waiting ${SLEEP_SEC}s..."
  sleep $SLEEP_SEC
done

echo "Node did not become healthy in time" >&2
exit 1

Expected output (node starts within 20 seconds):

Attempt 1/30 — got HTTP 000, waiting 5s...
Attempt 2/30 — got HTTP 503, waiting 5s...
Attempt 3/30 — got HTTP 503, waiting 5s...
Node is healthy after 15s

Example 5 — Kubernetes Helm values enabling both probes

# values-production.yaml
livenessProbe:
  enabled: true
  initialDelaySeconds: 15
  periodSeconds: 20
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  enabled: true
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

Apply with:

helm upgrade --install my-aistore ./deploy/helm/aistore -f values-production.yaml

Kubernetes will probe the gRPC health service on node.port using the native grpc probe type.


Troubleshooting

Issue: curl /health returns connection refused

Symptom: curl: (7) Failed to connect to localhost port 9090: Connection refused

Likely cause: The HTTP metrics server is disabled. Either MetricsAddr is set to an empty string in the server configuration, or the server has not finished starting.

Fix:

  1. Verify MetricsAddr is set to a non-empty value (e.g., ":9090") in your server configuration or Helm values.
  2. Check server logs for the line Web UI and metrics server listening on :9090. If it is absent, the metrics server did not start.
  3. If the server is still starting up, wait and retry — the listener is registered during Start().

Issue: curl /health returns 503 Service Unavailable with body UNHEALTHY: backend error

Symptom: The HTTP health endpoint returns a 503 status and the body begins with UNHEALTHY:.

Likely cause: The storage backend failed its capacity check. This can happen if the data directory is unmounted, the SPDK socket is unavailable, or the underlying device has an I/O error.

Fix:

  1. Inspect server logs for storage-related errors.
  2. Confirm the data directory (/data/aistore by default) is mounted and writable.
  3. If using the SPDK backend, verify the SPDK RPC socket (SPDKSocket) exists and the SPDK process is running.
  4. Check nabu_backend_healthy and nabu_node_up Prometheus metrics on :9090/metrics for historical trend data.

Issue: grpc-health-probe returns NOT_SERVING

Symptom: healthy: false / service unhealthy (responded with "NOT_SERVING")

Likely cause: The node is in graceful shutdown (it received SIGTERM or Stop() was called), or the gRPC server has not yet registered the health service.

Fix:

  1. Check whether the process is shutting down: look for Shutting down AIStore server... in the logs.
  2. If the node is shutting down unexpectedly, investigate the cause (OOM kill, crashed background goroutine, pod eviction).
  3. If this occurs on startup, increase livenessProbe.initialDelaySeconds to give the server time to register the health service before probes begin.

Issue: Kubernetes liveness probe fails and restarts the pod in a loop

Symptom: Pod status cycles between Running and CrashLoopBackOff; kubectl describe pod shows liveness probe failures.

Likely cause: initialDelaySeconds is too short for your hardware (e.g., slow NVMe initialization or large data directory on restart), or failureThreshold is set too low.

Fix:

  1. Increase livenessProbe.initialDelaySeconds in your Helm values.
  2. Increase livenessProbe.failureThreshold to tolerate brief storage hiccups without restarting.
  3. Check pod logs for backend errors that indicate a genuine fault rather than a slow start.

Issue: HTTP health endpoint is healthy but gRPC calls fail

Symptom: GET /health returns 200 OK, but blob store or cluster API calls return errors.

Likely cause: The HTTP health check only verifies backend storage capacity. A partial failure — such as a broken gRPC listener, a crashed internal service, or a ring-state inconsistency — will not be reflected in the HTTP check.

Fix:

  1. Use grpc-health-probe to check the gRPC surface independently.
  2. Review /metrics for elevated error counters (heartbeat failures, gRPC error rates).
  3. For production Kubernetes deployments, rely on the gRPC probe (configured by the Helm chart) as your primary liveness/readiness signal, and treat the HTTP endpoint as a supplementary check.