Observability
Logging, metrics, health checks
Nabu Store exposes a full observability stack—metrics, health checks, and structured logging—following Kubernetes best practices so you can integrate with the tools and workflows already present in your cluster. This page explains how to configure and consume each signal: Prometheus-compatible metrics served from every node, HTTP health endpoints suitable for Kubernetes liveness and readiness probes, and per-node structured logs that feed standard aggregation pipelines. Understanding these signals lets you monitor cluster capacity and throughput in real time, set up alerting on failure conditions, and diagnose problems before they affect inference workloads.
Before proceeding, make sure you have the following in place:
- Kubernetes 1.25 or later — Nabu Store's observability integrations rely on standard Kubernetes probe APIs and annotation-based service discovery.
- Helm 3.0 or later — required to deploy or upgrade Nabu Store using the provided chart.
- Prometheus Operator or a compatible scrape stack (e.g., kube-prometheus-stack) — the metrics endpoint emits Prometheus exposition format; a Prometheus instance must be configured to scrape it.
- Nabu Store cluster already deployed — see the single-node or multi-node installation guides before continuing.
- kubectl configured to reach your cluster with at least get, list, and port-forward permissions on Nabu Store namespaces.
- curl or any HTTP client available in your shell for manual endpoint testing.
Nabu Store's observability components are built into every node binary—no separate sidecar or agent is required. The steps below enable and verify each signal.
Step 1 — Confirm the metrics address is set
Each Nabu Store node starts an HTTP server on the address specified by MetricsAddr (default :9090). Verify this is set in your deployment's configuration, for example in your Helm values.yaml:
nabustore:
  metricsAddr: ":9090"
If MetricsAddr is left empty, the metrics and health endpoints are disabled for that node.
Step 2 — Annotate the Pod for Prometheus scraping
Add the standard Prometheus scrape annotations to your Pod template so that Prometheus auto-discovers each node:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nabustore
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
Step 3 — Configure Kubernetes liveness and readiness probes
Add HTTP probes to your container spec pointing at the /health endpoint:
containers:
  - name: nabustore
    ports:
      - containerPort: 50051 # gRPC
        name: grpc
      - containerPort: 9090 # metrics + health
        name: http-metrics
    livenessProbe:
      httpGet:
        path: /health
        port: http-metrics
      initialDelaySeconds: 10
      periodSeconds: 15
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /health
        port: http-metrics
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 2
Step 4 — Create a Kubernetes Service exposing the metrics port
apiVersion: v1
kind: Service
metadata:
  name: nabustore-metrics
  labels:
    app: nabustore
spec:
  selector:
    app: nabustore
  ports:
    - name: http-metrics
      port: 9090
      targetPort: http-metrics
  clusterIP: None # headless — Prometheus scrapes each Pod individually
Step 5 — (Optional) Create a ServiceMonitor for Prometheus Operator
If you are using the Prometheus Operator, create a ServiceMonitor instead of relying on annotations:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nabustore
  labels:
    release: prometheus # match your Prometheus Operator selector
spec:
  selector:
    matchLabels:
      app: nabustore
  endpoints:
    - port: http-metrics
      path: /metrics
      interval: 15s
Step 6 — Verify the endpoints are reachable
# Port-forward to the first node
kubectl port-forward pod/nabustore-0 9090:9090
# In a second terminal — health check
curl -s http://localhost:9090/health
# Expected: OK
# Metrics scrape
curl -s http://localhost:9090/metrics | head -40
The table below describes every configuration field that affects observability. All fields belong to the server Config struct and can be set via Helm values or environment-variable injection.
| Field | Default | Valid values | Effect |
|---|---|---|---|
| MetricsAddr | ":9090" | Any host:port string, or "" to disable | Binds the HTTP server that serves /metrics, /health, and the web UI. Set to "" to disable all three. |
| HeartbeatInterval | 5s | Any time.Duration string | Controls how often each node sends heartbeats to peers and calls updateMetrics(). Lower values give fresher metric data but increase intra-cluster traffic. |
| NodeID | Hostname | Non-empty string | Appears as the node label on every emitted metric. Ensure uniqueness across your cluster. |
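For reference, a values.yaml fragment setting all three fields might look like the sketch below. The metricsAddr key matches Step 1; whether the chart exposes the other two fields under the same camelCase convention is an assumption here, so check your chart's values schema:

nabustore:
  metricsAddr: ":9090"       # binds /metrics, /health, and the web UI
  heartbeatInterval: "5s"    # assumed chart key for HeartbeatInterval
  nodeID: ""                 # assumed chart key for NodeID; empty falls back to hostname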
Metric series reference
All series are prefixed aistore_ and carry a node="<NodeID>" label.
Node health
| Metric | Type | Description |
|---|---|---|
| aistore_node_up | Gauge | 1 if the node is healthy, 0 otherwise. Driven by backend reachability. |
| aistore_uptime_seconds | Gauge | Seconds since the node process started. |
| aistore_last_heartbeat_timestamp | Gauge | Unix timestamp of the last successful heartbeat sent to a peer. |
| aistore_backend_healthy | Gauge | 1 if the storage backend is reachable, 0 otherwise. |
| aistore_grpc_connections_active | Gauge | Number of live outbound gRPC connections to peers. |
| aistore_peer_health | Gauge | Per-peer label peer="<nodeID>": 1 = online, 0 = offline. |
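Because aistore_peer_health carries a peer label, you can ask exactly which peer links each node considers down. A minimal PromQL sketch:

# One series per offline peer link, as seen by each reporting node
aistore_peer_health == 0

# Number of offline peers per node (empty result when all peers are up)
count by (node) (aistore_peer_health == 0)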
Storage capacity
| Metric | Type | Description |
|---|---|---|
| aistore_capacity_bytes_total | Gauge | Raw storage capacity of the node's backend in bytes. |
| aistore_capacity_bytes_used | Gauge | Bytes currently consumed by stored blobs. |
| aistore_blob_count | Gauge | Number of blobs indexed on this node. |
| aistore_shard_count | Gauge | Number of erasure-coded shards stored on this node. |
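To estimate when a node will run out of space, the standard predict_linear function can extrapolate usage growth. This sketch flags nodes projected to exceed capacity within 24 hours, based on the last 6 hours of growth:

# Projected usage one day out (86400 s) exceeds total capacity
predict_linear(aistore_capacity_bytes_used[6h], 86400) > aistore_capacity_bytes_total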
Throughput and operations
| Metric | Type | Description |
|---|---|---|
| aistore_bytes_read_total | Counter | Cumulative bytes served by GET operations. |
| aistore_bytes_written_total | Counter | Cumulative bytes received by PUT operations. |
| aistore_operations_total | Counter (labeled) | Per-operation counts with label operation="put\|get\|delete\|stat\|list". |
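Since the byte counters are cumulative, per-node throughput comes from rate(). For example, read and write bandwidth over the last five minutes:

# Read throughput in bytes/second, per node
rate(aistore_bytes_read_total[5m])

# Write throughput in bytes/second, per node
rate(aistore_bytes_written_total[5m])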
Latency histograms
All latency metrics are histograms with the following bucket upper bounds (in seconds): 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, +Inf.
| Metric | Operation measured |
|---|---|
| aistore_put_latency_seconds | PUT blob |
| aistore_get_latency_seconds | GET blob |
| aistore_delete_latency_seconds | DELETE blob |
| aistore_stat_latency_seconds | Stat/head blob |
| aistore_list_latency_seconds | List blobs |
| aistore_ec_encode_latency_seconds | Erasure-code encode |
| aistore_ec_decode_latency_seconds | Erasure-code decode / reconstruct |
| aistore_shard_fetch_latency_seconds | Remote shard fetch |
| aistore_shard_store_latency_seconds | Remote shard store |
| aistore_replicate_latency_seconds | Full replication round-trip |
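Beyond quantiles, each histogram's _sum and _count series give a cheap mean latency (both appear in the scrape output in Example 1 below). A sketch for mean PUT latency over five minutes:

# Mean PUT latency in seconds, per node
rate(aistore_put_latency_seconds_sum[5m])
  /
rate(aistore_put_latency_seconds_count[5m])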
Cluster state
| Metric | Type | Description |
|---|---|---|
| aistore_nodes_total | Gauge | Total nodes known to this node. |
| aistore_nodes_active | Gauge | Nodes currently in Active state (includes self). |
| aistore_nodes_offline | Gauge | Nodes currently in Offline state. |
| aistore_heartbeat_failures_total | Counter | Total failed outbound heartbeat attempts. |
| aistore_ring_version | Gauge | Current consistent-hash ring version. Monotonically increases on topology changes. |
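Because aistore_ring_version only moves on topology changes, the changes() function makes a convenient churn detector. This sketch fires if the ring changed at all in the past hour:

# Non-zero when the consistent-hash ring was rebuilt in the last hour
changes(aistore_ring_version[1h]) > 0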
Replication and erasure coding
| Metric | Type | Description |
|---|---|---|
| aistore_replication_queue_depth | Gauge | Pending replication operations. |
| aistore_replication_lag_seconds | Gauge | Estimated replication lag. |
| aistore_ec_reconstruction_total | Counter | Total EC reconstruction events triggered. |
| aistore_shards_missing_total | Counter | Total missing shards detected. |
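A sustained replication backlog usually precedes visible lag. A hedged starting point for watching these gauges, with thresholds that are assumptions you should tune to your workload:

# Queue depth persistently high (threshold of 100 is an assumed starting point)
aistore_replication_queue_depth > 100

# Replication lag above 30 seconds (again, an assumed threshold)
aistore_replication_lag_seconds > 30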
Errors
| Metric | Type | Labels | Description |
|---|---|---|---|
| aistore_errors_total | Counter | type="storage\|network\|…" | Total errors, labeled by error type. |
| aistore_request_failures_total | Counter | operation="put\|get\|…" | Total failed requests, labeled by operation. |
Polling the metrics endpoint directly
Each node exposes its metrics at GET /metrics on the configured MetricsAddr. The response is plain text in Prometheus exposition format (content type text/plain; version=0.0.4).
curl -s http://<node-host>:9090/metrics
To watch a specific metric while troubleshooting:
curl -s http://<node-host>:9090/metrics | grep aistore_nodes_offline
Health checks from the command line
The /health endpoint performs a live backend check and returns HTTP 200 OK with body OK when healthy, or HTTP 503 Service Unavailable with body UNHEALTHY: backend error when the storage backend cannot be reached:
curl -o /dev/null -w "%{http_code}" http://<node-host>:9090/health
Kubernetes liveness and readiness probes target this same endpoint (see Step 3 above), so pod scheduling decisions are automatically driven by real backend health.
Computing utilization from Prometheus
Once Prometheus is scraping your cluster, use the following PromQL expressions to monitor day-to-day health:
Cluster-wide storage utilization (%)
sum(aistore_capacity_bytes_used) / sum(aistore_capacity_bytes_total) * 100
Any offline nodes
sum(aistore_nodes_offline) > 0
99th-percentile PUT latency (ms) per node
histogram_quantile(0.99, rate(aistore_put_latency_seconds_bucket[5m])) * 1000
Error rate per operation type
rate(aistore_errors_total[5m])
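To turn the raw error rate into a failure ratio, divide failed requests by total operations. Both series carry node and operation labels, so the division matches per operation and per node:

# Fraction of requests failing, per node and operation
rate(aistore_request_failures_total[5m])
  /
rate(aistore_operations_total[5m])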
Alerting recommendations
Create Prometheus alerting rules for these conditions at a minimum:
| Condition | PromQL | Severity |
|---|---|---|
| Node down | aistore_node_up == 0 | Critical |
| Backend unhealthy | aistore_backend_healthy == 0 | Critical |
| Storage > 85% full | aistore_capacity_bytes_used / aistore_capacity_bytes_total > 0.85 | Warning |
| Storage > 95% full | aistore_capacity_bytes_used / aistore_capacity_bytes_total > 0.95 | Critical |
| Offline nodes | aistore_nodes_offline > 0 | Warning |
| High heartbeat failure rate | rate(aistore_heartbeat_failures_total[5m]) > 1 | Warning |
| Missing EC shards rising | increase(aistore_shards_missing_total[10m]) > 0 | Warning |
Web UI
The same HTTP server that serves /metrics and /health also hosts a built-in web UI at the root path (/). The UI displays a live overview of node states, capacity bar charts, operation counters, and per-peer health—useful for ad-hoc inspection without requiring a Grafana dashboard.
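To reach the UI from a workstation, port-forward the metrics port and open the root path in a browser:

kubectl port-forward pod/nabustore-0 9090:9090 &
# then browse to http://localhost:9090/
open http://localhost:9090/   # macOS; use xdg-open on Linux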
Example 1 — Scrape metrics from a running node
kubectl port-forward pod/nabustore-0 9090:9090 &
curl -s http://localhost:9090/metrics
Expected output (truncated)
# HELP aistore_node_up Node health status (1=healthy, 0=unhealthy)
# TYPE aistore_node_up gauge
aistore_node_up{node="nabustore-0"} 1
# HELP aistore_uptime_seconds Seconds since node started
# TYPE aistore_uptime_seconds gauge
aistore_uptime_seconds{node="nabustore-0"} 3842
# HELP aistore_capacity_bytes_total Total storage capacity in bytes
# TYPE aistore_capacity_bytes_total gauge
aistore_capacity_bytes_total{node="nabustore-0"} 107374182400
# HELP aistore_capacity_bytes_used Used storage in bytes
# TYPE aistore_capacity_bytes_used gauge
aistore_capacity_bytes_used{node="nabustore-0"} 21474836480
# HELP aistore_blob_count Number of blobs stored
# TYPE aistore_blob_count gauge
aistore_blob_count{node="nabustore-0"} 4096
# HELP aistore_operations_total Total operations by type
# TYPE aistore_operations_total counter
aistore_operations_total{node="nabustore-0",operation="get"} 18234
aistore_operations_total{node="nabustore-0",operation="put"} 4096
# HELP aistore_put_latency_seconds Put operation latency
# TYPE aistore_put_latency_seconds histogram
aistore_put_latency_seconds_bucket{node="nabustore-0",le="0.001"} 312
aistore_put_latency_seconds_bucket{node="nabustore-0",le="0.005"} 3901
aistore_put_latency_seconds_bucket{node="nabustore-0",le="0.01"} 4080
aistore_put_latency_seconds_bucket{node="nabustore-0",le="+Inf"} 4096
aistore_put_latency_seconds_sum{node="nabustore-0"} 14.832
aistore_put_latency_seconds_count{node="nabustore-0"} 4096
Example 2 — Health probe returning healthy
curl -v http://localhost:9090/health
Expected output
< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
<
OK
Example 3 — Health probe returning unhealthy (backend failure)
When the storage backend is unreachable (e.g., underlying NVMe device failure), the node returns:
curl -v http://localhost:9090/health
Expected output
< HTTP/1.1 503 Service Unavailable
< Content-Type: text/plain; charset=utf-8
<
UNHEALTHY: backend error
Kubernetes will mark the pod NotReady and remove it from service endpoints after the configured failureThreshold is reached.
Example 4 — PromQL: 99th-percentile GET latency over a 5-minute window
histogram_quantile(
  0.99,
  sum by (node, le) (
    rate(aistore_get_latency_seconds_bucket[5m])
  )
)
Expected output (in Prometheus/Grafana)
{node="nabustore-0"} 0.00421 # ≈ 4.2 ms
{node="nabustore-1"} 0.00389 # ≈ 3.9 ms
{node="nabustore-2"} 0.00412 # ≈ 4.1 ms
Example 5 — Prometheus alert rule for an offline node
Save the following to a PrometheusRule resource in the same namespace as your Prometheus Operator deployment:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nabustore-alerts
  labels:
    release: prometheus
spec:
  groups:
    - name: nabustore.node
      interval: 30s
      rules:
        - alert: NabuStoreNodeDown
          expr: aistore_node_up == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Nabu Store node {{ $labels.node }} is unhealthy"
            description: "Node {{ $labels.node }} has reported node_up=0 for more than 1 minute."
        - alert: NabuStoreStorageNearFull
          expr: >
            aistore_capacity_bytes_used / aistore_capacity_bytes_total > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Nabu Store node {{ $labels.node }} storage above 85%"
Apply it
kubectl apply -f nabustore-alerts.yaml
Issue: /metrics endpoint returns connection refused
Symptom: curl http://<node>:9090/metrics fails with Connection refused or times out.
Likely cause: MetricsAddr is set to "" (empty string) in the server configuration, which disables the HTTP server entirely.
Fix: Set MetricsAddr to a non-empty bind address (e.g., ":9090") in your Helm values and roll the StatefulSet:
helm upgrade nabustore ./chart --set nabustore.metricsAddr=":9090"
kubectl rollout restart statefulset/nabustore
Issue: Kubernetes liveness probe marks pod as CrashLoopBackOff even though the node process is running
Symptom: The pod restarts repeatedly; kubectl describe pod shows the liveness probe hitting /health and receiving 503.
Likely cause: The storage backend is returning an error (disk full, NVMe device failure, or filesystem permissions). The /health handler calls backend.Capacity() and returns 503 if it fails, which causes Kubernetes to kill and restart the pod.
Fix:
- Check pod logs for backend errors: kubectl logs nabustore-0
- If the disk is full, free capacity or expand storage before the pod can recover.
- If it is a permissions issue, inspect the data directory: kubectl exec -it nabustore-0 -- ls -la /data/aistore
- If the problem is transient and you need to prevent repeated restarts, temporarily increase failureThreshold or initialDelaySeconds in the probe spec while you remediate (see the sketch after this list).
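As a temporary measure, the probe can be relaxed in place with a JSON patch. This sketch assumes the nabustore container is the first (index 0) in the pod spec; adjust the index to match your StatefulSet:

kubectl patch statefulset nabustore --type='json' -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold",
   "value": 10}
]'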
Issue: Prometheus shows no metrics for Nabu Store pods
Symptom: No aistore_* series appear in Prometheus, even though the /metrics endpoint responds correctly to curl.
Likely cause 1: Scrape annotations are missing or have the wrong port.
Fix: Verify the annotations on the pod: kubectl get pod nabustore-0 -o jsonpath='{.metadata.annotations}'. Ensure prometheus.io/scrape: "true", prometheus.io/port: "9090", and prometheus.io/path: "/metrics" are all present.
Likely cause 2: You are using Prometheus Operator, which ignores pod annotations and requires a ServiceMonitor resource.
Fix: Create a ServiceMonitor as shown in Step 5 above and confirm its release label matches the label selector of your Prometheus custom resource.
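You can confirm whether Prometheus actually picked up the target by querying its targets API. The service name below assumes the prometheus-operated Service created by the Prometheus Operator; substitute your own if it differs:

kubectl port-forward svc/prometheus-operated 9091:9090 &
curl -s http://localhost:9091/api/v1/targets | grep nabustore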
Issue: aistore_nodes_offline is non-zero but all pods show Running
Symptom: The metric reports one or more offline peers, but kubectl get pods shows all nodes in Running state.
Likely cause: Intra-cluster heartbeat traffic is blocked by a NetworkPolicy, or a node's gRPC port (:50051) is not reachable from peers. Node health in the metrics layer is determined by heartbeat success, not pod state.
Fix:
- Check aistore_heartbeat_failures_total on the reporting node; a high and rising count confirms heartbeat loss.
- Verify NetworkPolicy allows pod-to-pod traffic on the gRPC port: kubectl get networkpolicy -n <namespace> (a permissive policy sketch follows this list).
- Test direct connectivity: kubectl exec nabustore-0 -- curl -v http://nabustore-1:50051 (or use grpc_health_probe).
- Check that the SeedNodes configuration matches the actual DNS names or IPs of your peer pods.
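If a NetworkPolicy turns out to be the blocker, here is a minimal sketch of one that permits intra-cluster gRPC between Nabu Store pods, assuming the app: nabustore label used elsewhere on this page:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: nabustore-allow-grpc
spec:
  podSelector:
    matchLabels:
      app: nabustore
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: nabustore
      ports:
        - protocol: TCP
          port: 50051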
Issue: Latency histograms show all values in the +Inf bucket
Symptom: aistore_put_latency_seconds_bucket has zero counts in all finite buckets; everything accumulates in +Inf.
Likely cause: Operations are taking longer than the maximum bucket bound of 60 seconds, indicating severe backend degradation (e.g., storage I/O saturation or a hung syscall).
Fix:
- Check aistore_backend_healthy; if 0, the backend is already reporting an error.
- Inspect node-level disk I/O metrics via your Kubernetes node exporter (node_disk_io_time_seconds_total).
- If using the SPDK backend, verify the SPDK process is responsive on its RPC socket.
- Consider restarting the affected node pod after confirming peer nodes hold sufficient replicas or EC shards to cover the data.
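To quantify how bad the skew is, compare the largest finite bucket against +Inf; the result is the fraction of operations completing within the 60-second bound:

# Fraction of PUTs finishing within 60 s, per node (1.0 = all, 0 = none)
sum by (node) (rate(aistore_put_latency_seconds_bucket{le="60"}[5m]))
  /
sum by (node) (rate(aistore_put_latency_seconds_bucket{le="+Inf"}[5m]))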
