
Observability

Logging, metrics, tracing, health checks


Overview

Nabu Store exposes a full observability stack—metrics, health checks, and structured logging—following Kubernetes best practices so you can integrate with the tools and workflows already present in your cluster. This page explains how to configure and consume each signal: Prometheus-compatible metrics served from every node, HTTP health endpoints suitable for Kubernetes liveness and readiness probes, and per-node structured logs that feed standard aggregation pipelines. Understanding these signals lets you monitor cluster capacity and throughput in real time, set up alerting on failure conditions, and diagnose problems before they affect inference workloads.


Prerequisites

Before proceeding, make sure you have the following in place:

  • Kubernetes 1.25 or later — Nabu Store's observability integrations rely on standard Kubernetes probe APIs and annotation-based service discovery.
  • Helm 3.0 or later — required to deploy or upgrade Nabu Store using the provided chart.
  • Prometheus Operator or a compatible scrape stack (e.g., kube-prometheus-stack) — the metrics endpoint emits Prometheus exposition format; a Prometheus instance must be configured to scrape it.
  • Nabu Store cluster already deployed — see the single-node or multi-node installation guides before continuing.
  • kubectl configured to reach your cluster with at least get, list, and port-forward permissions on Nabu Store namespaces.
  • curl or any HTTP client available in your shell for manual endpoint testing.
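
As a quick sanity check, you can confirm the tooling and permissions from your shell before continuing:

# Verify client and cluster versions, then basic read access
kubectl version
helm version
kubectl auth can-i get pods -n <nabustore-namespace>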

Installation

Nabu Store's observability components are built into every node binary—no separate sidecar or agent is required. The steps below enable and verify each signal.

Step 1 — Confirm the metrics address is set

Each Nabu Store node starts an HTTP server on the address specified by MetricsAddr (default :9090). Verify this is set in your deployment's configuration, for example in your Helm values.yaml:

nabustore:
  metricsAddr: ":9090"

If MetricsAddr is left empty, the metrics and health endpoints are disabled for that node.

Step 2 — Annotate the Pod for Prometheus scraping

Add the standard Prometheus scrape annotations to your Pod template so that Prometheus auto-discovers each node:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nabustore
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"

Step 3 — Configure Kubernetes liveness and readiness probes

Add HTTP probes to your container spec pointing at the /health endpoint:

containers:
  - name: nabustore
    ports:
      - containerPort: 50051   # gRPC
        name: grpc
      - containerPort: 9090    # metrics + health
        name: http-metrics
    livenessProbe:
      httpGet:
        path: /health
        port: http-metrics
      initialDelaySeconds: 10
      periodSeconds: 15
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /health
        port: http-metrics
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 2

Step 4 — Create a Kubernetes Service exposing the metrics port

apiVersion: v1
kind: Service
metadata:
  name: nabustore-metrics
  labels:
    app: nabustore
spec:
  selector:
    app: nabustore
  ports:
    - name: http-metrics
      port: 9090
      targetPort: http-metrics
  clusterIP: None   # headless — Prometheus scrapes each Pod individually

Step 5 — (Optional) Create a ServiceMonitor for Prometheus Operator

If you are using the Prometheus Operator, create a ServiceMonitor instead of relying on annotations:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nabustore
  labels:
    release: prometheus   # match your Prometheus Operator selector
spec:
  selector:
    matchLabels:
      app: nabustore
  endpoints:
    - port: http-metrics
      path: /metrics
      interval: 15s

Step 6 — Verify the endpoints are reachable

# Port-forward to the first node
kubectl port-forward pod/nabustore-0 9090:9090

# In a second terminal — health check
curl -s http://localhost:9090/health
# Expected: OK

# Metrics scrape
curl -s http://localhost:9090/metrics | head -40

Configuration

The table below describes every configuration field that affects observability. All fields belong to the server Config struct and can be set via Helm values or environment-variable injection.

| Field | Default | Valid values | Effect |
|---|---|---|---|
| MetricsAddr | ":9090" | Any host:port string, or "" to disable | Binds the HTTP server that serves /metrics, /health, and the web UI. Set to "" to disable all three. |
| HeartbeatInterval | 5s | Any time.Duration string | Controls how often each node sends heartbeats to peers and calls updateMetrics(). Lower values give fresher metric data but increase intra-cluster traffic. |
| NodeID | Hostname | Non-empty string | Appears as the node label on every emitted metric. Ensure uniqueness across your cluster. |
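
For reference, a values.yaml fragment setting all three fields might look like the sketch below. The metricsAddr key matches the Installation section above; heartbeatInterval and nodeID are assumed key names, so check them against your chart's values schema:

nabustore:
  metricsAddr: ":9090"       # serves /metrics, /health, and the web UI
  heartbeatInterval: "5s"    # assumed key name; any Go time.Duration string
  nodeID: ""                 # assumed key name; empty falls back to the hostname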

Metric series reference

All series are prefixed aistore_ and carry a node="<NodeID>" label.

Node health

| Metric | Type | Description |
|---|---|---|
| aistore_node_up | Gauge | 1 if the node is healthy, 0 otherwise. Driven by backend reachability. |
| aistore_uptime_seconds | Gauge | Seconds since the node process started. |
| aistore_last_heartbeat_timestamp | Gauge | Unix timestamp of the last successful heartbeat sent to a peer. |
| aistore_backend_healthy | Gauge | 1 if the storage backend is reachable, 0 otherwise. |
| aistore_grpc_connections_active | Gauge | Number of live outbound gRPC connections to peers. |
| aistore_peer_health | Gauge | Per-peer label peer="<nodeID>": 1 = online, 0 = offline. |
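
For example, to list the peers that a given node currently considers offline:

aistore_peer_health{node="nabustore-0"} == 0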

Storage capacity

| Metric | Type | Description |
|---|---|---|
| aistore_capacity_bytes_total | Gauge | Raw storage capacity of the node's backend in bytes. |
| aistore_capacity_bytes_used | Gauge | Bytes currently consumed by stored blobs. |
| aistore_blob_count | Gauge | Number of blobs indexed on this node. |
| aistore_shard_count | Gauge | Number of erasure-coded shards stored on this node. |
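
A useful derived quantity is the remaining free capacity per node:

aistore_capacity_bytes_total - aistore_capacity_bytes_used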

Throughput and operations

| Metric | Type | Description |
|---|---|---|
| aistore_bytes_read_total | Counter | Cumulative bytes served by GET operations. |
| aistore_bytes_written_total | Counter | Cumulative bytes received by PUT operations. |
| aistore_operations_total | Counter (labeled) | Cumulative operation count with a per-operation label (e.g., operation="put", operation="get"). |
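
To turn these counters into live throughput, take a per-node rate; for example, read bandwidth in bytes per second over the last five minutes:

rate(aistore_bytes_read_total[5m])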

Latency histograms

All latency metrics are histograms with the following bucket upper bounds (in seconds): 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, +Inf.

| Metric | Operation measured |
|---|---|
| aistore_put_latency_seconds | PUT blob |
| aistore_get_latency_seconds | GET blob |
| aistore_delete_latency_seconds | DELETE blob |
| aistore_stat_latency_seconds | Stat/head blob |
| aistore_list_latency_seconds | List blobs |
| aistore_ec_encode_latency_seconds | Erasure-code encode |
| aistore_ec_decode_latency_seconds | Erasure-code decode / reconstruct |
| aistore_shard_fetch_latency_seconds | Remote shard fetch |
| aistore_shard_store_latency_seconds | Remote shard store |
| aistore_replicate_latency_seconds | Full replication round-trip |
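
Besides the quantile queries shown under Usage, the _sum and _count series that accompany each histogram give a cheap per-node mean latency:

rate(aistore_put_latency_seconds_sum[5m]) / rate(aistore_put_latency_seconds_count[5m])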

Cluster state

| Metric | Type | Description |
|---|---|---|
| aistore_nodes_total | Gauge | Total nodes known to this node. |
| aistore_nodes_active | Gauge | Nodes currently in Active state (includes self). |
| aistore_nodes_offline | Gauge | Nodes currently in Offline state. |
| aistore_heartbeat_failures_total | Counter | Total failed outbound heartbeat attempts. |
| aistore_ring_version | Gauge | Current consistent-hash ring version. Monotonically increases on topology changes. |
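
Because aistore_ring_version only moves on topology changes, a simple way to spot unexpected membership churn is:

changes(aistore_ring_version[1h]) > 0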

Replication and erasure coding

| Metric | Type | Description |
|---|---|---|
| aistore_replication_queue_depth | Gauge | Pending replication operations. |
| aistore_replication_lag_seconds | Gauge | Estimated replication lag. |
| aistore_ec_reconstruction_total | Counter | Total EC reconstruction events triggered. |
| aistore_shards_missing_total | Counter | Total missing shards detected. |
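
A rising reconstruction count usually means shards are being lost and repaired. To surface that over the past hour:

increase(aistore_ec_reconstruction_total[1h]) > 0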

Errors

| Metric | Type | Labels | Description |
|---|---|---|---|
| aistore_errors_total | Counter | type="storage", type="network", … | Cumulative error count, labeled by error category. |
| aistore_request_failures_total | Counter | operation="put", operation="get", … | Cumulative failed requests, labeled by operation. |
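
To break failures down per operation, rate the request-failure counter by its label:

sum by (node, operation) (rate(aistore_request_failures_total[5m]))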

Usage

Polling the metrics endpoint directly

Each node exposes its metrics at GET /metrics on the configured MetricsAddr. The response is plain text in Prometheus exposition format (content type text/plain; version=0.0.4).

curl -s http://<node-host>:9090/metrics

To watch a specific metric while troubleshooting:

curl -s http://<node-host>:9090/metrics | grep aistore_nodes_offline

Health checks from the command line

The /health endpoint performs a live backend check and returns HTTP 200 OK with body OK when healthy, or HTTP 503 Service Unavailable with body UNHEALTHY: backend error when the storage backend cannot be reached:

curl -o /dev/null -w "%{http_code}" http://<node-host>:9090/health

Kubernetes liveness and readiness probes target this same endpoint (see the Installation section), so pod scheduling decisions are automatically driven by real backend health.
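
For an ad-hoc watch during maintenance, poll the endpoint in a loop:

# Print the HTTP status code every 5 seconds
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" http://<node-host>:9090/health
  sleep 5
done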

Computing utilization from Prometheus

Once Prometheus is scraping your cluster, use the following PromQL expressions to monitor day-to-day health:

Cluster-wide storage utilization (%)

sum(aistore_capacity_bytes_used) / sum(aistore_capacity_bytes_total) * 100

Any offline nodes

sum(aistore_nodes_offline) > 0

99th-percentile PUT latency (ms) per node

histogram_quantile(0.99, rate(aistore_put_latency_seconds_bucket[5m])) * 1000

Error rate by error type

sum by (node, type) (rate(aistore_errors_total[5m]))

Alerting recommendations

Create Prometheus alerting rules for these conditions at a minimum:

| Condition | PromQL | Severity |
|---|---|---|
| Node down | aistore_node_up == 0 | Critical |
| Backend unhealthy | aistore_backend_healthy == 0 | Critical |
| Storage > 85% full | aistore_capacity_bytes_used / aistore_capacity_bytes_total > 0.85 | Warning |
| Storage > 95% full | aistore_capacity_bytes_used / aistore_capacity_bytes_total > 0.95 | Critical |
| Offline nodes | aistore_nodes_offline > 0 | Warning |
| High heartbeat failure rate | rate(aistore_heartbeat_failures_total[5m]) > 1 | Warning |
| Missing EC shards rising | increase(aistore_shards_missing_total[10m]) > 0 | Warning |

Web UI

The same HTTP server that serves /metrics and /health also hosts a built-in web UI at the root path (/). The UI displays a live overview of node states, capacity bar charts, operation counters, and per-peer health—useful for ad-hoc inspection without requiring a Grafana dashboard.
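
To reach the UI from your workstation, port-forward a node and open the root path in a browser:

kubectl port-forward pod/nabustore-0 9090:9090
# then browse to http://localhost:9090/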


Examples

Example 1 — Scrape metrics from a running node

kubectl port-forward pod/nabustore-0 9090:9090 &
curl -s http://localhost:9090/metrics

Expected output (truncated)

# HELP aistore_node_up Node health status (1=healthy, 0=unhealthy)
# TYPE aistore_node_up gauge
aistore_node_up{node="nabustore-0"} 1
# HELP aistore_uptime_seconds Seconds since node started
# TYPE aistore_uptime_seconds gauge
aistore_uptime_seconds{node="nabustore-0"} 3842
# HELP aistore_capacity_bytes_total Total storage capacity in bytes
# TYPE aistore_capacity_bytes_total gauge
aistore_capacity_bytes_total{node="nabustore-0"} 107374182400
# HELP aistore_capacity_bytes_used Used storage in bytes
# TYPE aistore_capacity_bytes_used gauge
aistore_capacity_bytes_used{node="nabustore-0"} 21474836480
# HELP aistore_blob_count Number of blobs stored
# TYPE aistore_blob_count gauge
aistore_blob_count{node="nabustore-0"} 4096
# HELP aistore_operations_total Total operations by type
# TYPE aistore_operations_total counter
aistore_operations_total{node="nabustore-0",operation="get"} 18234
aistore_operations_total{node="nabustore-0",operation="put"} 4096
# HELP aistore_put_latency_seconds Put operation latency
# TYPE aistore_put_latency_seconds histogram
aistore_put_latency_seconds_bucket{node="nabustore-0",le="0.001"} 312
aistore_put_latency_seconds_bucket{node="nabustore-0",le="0.005"} 3901
aistore_put_latency_seconds_bucket{node="nabustore-0",le="0.01"} 4080
aistore_put_latency_seconds_bucket{node="nabustore-0",le="+Inf"} 4096
aistore_put_latency_seconds_sum{node="nabustore-0"} 14.832
aistore_put_latency_seconds_count{node="nabustore-0"} 4096

Example 2 — Health probe returning healthy

curl -v http://localhost:9090/health

Expected output

< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
<
OK

Example 3 — Health probe returning unhealthy (backend failure)

When the storage backend is unreachable (e.g., underlying NVMe device failure), the node returns:

curl -v http://localhost:9090/health

Expected output

< HTTP/1.1 503 Service Unavailable
< Content-Type: text/plain; charset=utf-8
<
UNHEALTHY: backend error

Kubernetes will mark the pod NotReady and remove it from service endpoints after the configured failureThreshold is reached.


Example 4 — PromQL: 99th-percentile GET latency over a 5-minute window

histogram_quantile(
  0.99,
  sum by (node, le) (
    rate(aistore_get_latency_seconds_bucket[5m])
  )
)

Expected output (in Prometheus/Grafana)

{node="nabustore-0"}  0.00421   # ≈ 4.2 ms
{node="nabustore-1"}  0.00389   # ≈ 3.9 ms
{node="nabustore-2"}  0.00412   # ≈ 4.1 ms

Example 5 — Prometheus alert rule for an offline node

Save the following to a PrometheusRule resource in the same namespace as your Prometheus Operator deployment:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nabustore-alerts
  labels:
    release: prometheus
spec:
  groups:
    - name: nabustore.node
      interval: 30s
      rules:
        - alert: NabuStoreNodeDown
          expr: aistore_node_up == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Nabu Store node {{ $labels.node }} is unhealthy"
            description: "Node {{ $labels.node }} has reported node_up=0 for more than 1 minute."

        - alert: NabuStoreStorageNearFull
          expr: >
            aistore_capacity_bytes_used / aistore_capacity_bytes_total > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Nabu Store node {{ $labels.node }} storage above 85%"

Apply it

kubectl apply -f nabustore-alerts.yaml

Troubleshooting

Issue: /metrics endpoint returns connection refused

Symptom: curl http://<node>:9090/metrics fails with Connection refused or times out.

Likely cause: MetricsAddr is set to "" (empty string) in the server configuration, which disables the HTTP server entirely.

Fix: Set MetricsAddr to a non-empty bind address (e.g., ":9090") in your Helm values and roll the StatefulSet:

helm upgrade nabustore ./chart --set nabustore.metricsAddr=":9090"
kubectl rollout restart statefulset/nabustore

Issue: Kubernetes liveness probe marks pod as CrashLoopBackOff even though the node process is running

Symptom: The pod restarts repeatedly; kubectl describe pod shows the liveness probe hitting /health and receiving 503.

Likely cause: The storage backend is returning an error (disk full, NVMe device failure, or filesystem permissions). The /health handler calls backend.Capacity() and returns 503 if it fails, which causes Kubernetes to kill and restart the pod.

Fix:

  1. Check pod logs for backend errors: kubectl logs nabustore-0
  2. If the disk is full, free capacity or expand storage before the pod can recover.
  3. If it is a permissions issue: kubectl exec -it nabustore-0 -- ls -la /data/aistore
  4. If the problem is transient and you need to prevent repeated restarts, temporarily increase failureThreshold or initialDelaySeconds in the probe spec while you remediate.
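
For step 4, a temporarily relaxed liveness probe might look like the following; the numbers are illustrative, not recommendations:

livenessProbe:
  httpGet:
    path: /health
    port: http-metrics
  initialDelaySeconds: 60    # extra startup headroom while remediating
  periodSeconds: 15
  failureThreshold: 10       # tolerates ~2.5 minutes of 503s before a restart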

Issue: Prometheus shows no metrics for Nabu Store pods

Symptom: No aistore_* series appear in Prometheus, even though the /metrics endpoint responds correctly to curl.

Likely cause 1: Scrape annotations are missing or have the wrong port.

Fix: Verify the annotations on the pod: kubectl get pod nabustore-0 -o jsonpath='{.metadata.annotations}'. Ensure prometheus.io/scrape: "true", prometheus.io/port: "9090", and prometheus.io/path: "/metrics" are all present.

Likely cause 2: You are using Prometheus Operator, which ignores pod annotations and requires a ServiceMonitor resource.

Fix: Create a ServiceMonitor as shown in the Installation section and confirm its release label matches the label selector of your Prometheus custom resource.
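
To compare the two label sets quickly:

# Labels on the ServiceMonitor
kubectl get servicemonitor nabustore -o jsonpath='{.metadata.labels}'; echo
# Selector on the Prometheus custom resource (resource name varies by install)
kubectl get prometheus -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.serviceMonitorSelector}{"\n"}{end}'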


Issue: aistore_nodes_offline is non-zero but all pods show Running

Symptom: The metric reports one or more offline peers, but kubectl get pods shows all nodes in Running state.

Likely cause: Intra-cluster heartbeat traffic is blocked by a NetworkPolicy, or a node's gRPC port (:50051) is not reachable from peers. Node health in the metrics layer is determined by heartbeat success, not pod state.

Fix:

  1. Check aistore_heartbeat_failures_total on the reporting node—a high and rising count confirms heartbeat loss.
  2. Verify that a NetworkPolicy allows pod-to-pod traffic on the gRPC port (a sample allow policy is sketched after this list): kubectl get networkpolicy -n <namespace>
  3. Test direct connectivity: kubectl exec nabustore-0 -- curl -v http://nabustore-1:50051 (or use grpc_health_probe).
  4. Check that the SeedNodes configuration matches the actual DNS names or IPs of your peer pods.
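
If no policy currently admits peer traffic, the sketch below is a minimal allow rule. It assumes your pods carry the app: nabustore label used earlier in this guide; adjust the selector, namespace, and ports to your deployment:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: nabustore-allow-peers
spec:
  podSelector:
    matchLabels:
      app: nabustore
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: nabustore
      ports:
        - protocol: TCP
          port: 50051   # gRPC + heartbeats
        - protocol: TCP
          port: 9090    # metrics + health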

Issue: Latency histograms show all values in the +Inf bucket

Symptom: aistore_put_latency_seconds_bucket has zero counts in all finite buckets; everything accumulates in +Inf.

Likely cause: Operations are taking longer than the maximum bucket bound of 60 seconds, indicating severe backend degradation (e.g., storage I/O saturation or a hung syscall).

Fix:

  1. Check aistore_backend_healthy — if 0, the backend is already reporting an error.
  2. Inspect node-level disk I/O metrics via your Kubernetes node exporter (node_disk_io_time_seconds_total); see the query after this list.
  3. If using the SPDK backend, verify the SPDK process is responsive on its RPC socket.
  4. Consider restarting the affected node pod after confirming peer nodes hold sufficient replicas or EC shards to cover the data.
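
As referenced in step 2, a node-exporter query that approximates the disk busy-time fraction (values near 1 indicate saturation) is:

rate(node_disk_io_time_seconds_total[5m])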