Observability
Logging, metrics, health checks
Nabu Store exposes a full observability stack—metrics, health checks, and structured logging—following Kubernetes best practices so you can integrate with the tools and workflows already present in your cluster. This page explains how to configure and consume each signal: Prometheus-compatible metrics served from every node, HTTP health endpoints suitable for Kubernetes liveness and readiness probes, and per-node structured logs that feed standard aggregation pipelines. Understanding these signals lets you monitor cluster capacity and throughput in real time, set up alerting on failure conditions, and diagnose problems before they affect inference workloads.
Before proceeding, make sure you have the following in place:
- Kubernetes 1.25 or later — Nabu Store's observability integrations rely on standard Kubernetes probe APIs and annotation-based service discovery.
- Helm 3.0 or later — required to deploy or upgrade Nabu Store using the provided chart.
- Prometheus Operator or a compatible scrape stack (e.g., kube-prometheus-stack) — the metrics endpoint emits Prometheus exposition format; a Prometheus instance must be configured to scrape it.
- Nabu Store cluster already deployed — see the single-node or multi-node installation guides before continuing.
- kubectl configured to reach your cluster with at least get, list, and port-forward permissions on Nabu Store namespaces.
- curl or any HTTP client available in your shell for manual endpoint testing.
Nabu Store's observability components are built into every node binary—no separate sidecar or agent is required. The steps below enable and verify each signal.
Step 1 — Confirm the metrics address is set
Each Nabu Store node starts an HTTP server on the address specified by MetricsAddr (default :9090). Verify this is set in your deployment's configuration, for example in your Helm values.yaml:
nabustore:
  metricsAddr: ":9090"
If MetricsAddr is left empty, the metrics and health endpoints are disabled for that node.
Step 2 — Annotate the Pod for Prometheus scraping
Add the standard Prometheus scrape annotations to your Pod template so that Prometheus auto-discovers each node:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nabustore
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
Step 3 — Configure Kubernetes liveness and readiness probes
Add HTTP probes to your container spec pointing at the /health endpoint:
containers:
  - name: nabustore
    ports:
      - containerPort: 50051 # gRPC
        name: grpc
      - containerPort: 9090 # metrics + health
        name: http-metrics
    livenessProbe:
      httpGet:
        path: /health
        port: http-metrics
      initialDelaySeconds: 10
      periodSeconds: 15
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /health
        port: http-metrics
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 2
Step 4 — Create a Kubernetes Service exposing the metrics port
apiVersion: v1
kind: Service
metadata:
  name: nabustore-metrics
  labels:
    app: nabustore
spec:
  selector:
    app: nabustore
  ports:
    - name: http-metrics
      port: 9090
      targetPort: http-metrics
  clusterIP: None # headless — Prometheus scrapes each Pod individually
Step 5 — (Optional) Create a ServiceMonitor for Prometheus Operator
If you are using the Prometheus Operator, create a ServiceMonitor instead of relying on annotations:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nabustore
  labels:
    release: prometheus # match your Prometheus Operator selector
spec:
  selector:
    matchLabels:
      app: nabustore
  endpoints:
    - port: http-metrics
      path: /metrics
      interval: 15s
Step 6 — Verify the endpoints are reachable
# Port-forward to the first node
kubectl port-forward pod/nabustore-0 9090:9090
# In a second terminal — health check
curl -s http://localhost:9090/health
# Expected: OK
# Metrics scrape
curl -s http://localhost:9090/metrics | head -40
The table below describes every configuration field that affects observability. All fields belong to the server Config struct and can be set via Helm values or environment-variable injection.
| Field | Default | Valid values | Effect |
|---|---|---|---|
| MetricsAddr | ":9090" | Any host:port string, or "" to disable | Binds the HTTP server that serves /metrics, /health, and the web UI. Set to "" to disable all three. |
| HeartbeatInterval | 5s | Any time.Duration string | Controls how often each node sends heartbeats to peers and calls updateMetrics(). Lower values give fresher metric data but increase intra-cluster traffic. |
| NodeID | Hostname | Non-empty string | Appears as the node label on every emitted metric. Ensure uniqueness across your cluster. |
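For reference, a values.yaml fragment setting all three fields might look like the sketch below. The metricsAddr key matches Step 1; whether the chart exposes the other two fields under the same camelCase convention is an assumption here, so check your chart's values schema:

nabustore:
  metricsAddr: ":9090"       # binds /metrics, /health, and the web UI
  heartbeatInterval: "5s"    # assumed chart key for HeartbeatInterval
  nodeID: ""                 # assumed chart key for NodeID; empty falls back to hostname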
Metric series reference
All series are prefixed aistore_ and carry a node="<NodeID>" label.
Node health
| Metric | Type | Description |
|---|---|---|
| aistore_node_up | Gauge | 1 if the node is healthy, 0 otherwise. Driven by backend reachability. |
| aistore_uptime_seconds | Gauge | Seconds since the node process started. |
| aistore_last_heartbeat_timestamp | Gauge | Unix timestamp of the last successful heartbeat sent to a peer. |
| aistore_backend_healthy | Gauge | 1 if the storage backend is reachable, 0 otherwise. |
| aistore_grpc_connections_active | Gauge | Number of live outbound gRPC connections to peers. |
| aistore_peer_health | Gauge | Per-peer label peer="<nodeID>": 1 = online, 0 = offline. |
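Because aistore_peer_health carries a peer label, you can ask exactly which peer links each node considers down. A minimal PromQL sketch:

# One series per offline peer link, as seen by each reporting node
aistore_peer_health == 0

# Number of offline peers per node (empty result when all peers are up)
count by (node) (aistore_peer_health == 0)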
Storage capacity
| Metric | Type | Description |
|---|---|---|
| aistore_capacity_bytes_total | Gauge | Raw storage capacity of the node's backend in bytes. |
| aistore_capacity_bytes_used | Gauge | Bytes currently consumed by stored blobs. |
| aistore_blob_count | Gauge | Number of blobs indexed on this node. |
| aistore_shard_count | Gauge | Number of erasure-coded shards stored on this node. |
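To estimate when a node will run out of space, the standard predict_linear function can extrapolate usage growth. This sketch flags nodes projected to exceed capacity within 24 hours, based on the last 6 hours of growth:

# Projected usage one day out (86400 s) exceeds total capacity
predict_linear(aistore_capacity_bytes_used[6h], 86400) > aistore_capacity_bytes_total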
Throughput and operations
| Metric | Type | Description |
|---|---|---|
| aistore_bytes_read_total | Counter | Cumulative bytes served by GET operations. |
| aistore_bytes_written_total | Counter | Cumulative bytes received by PUT operations. |
| aistore_operations_total | Counter (labeled) | Per-operation counts with label operation="put\|get\|delete\|stat\|list". |
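Since the byte counters are cumulative, per-node throughput comes from rate(). For example, read and write bandwidth over the last five minutes:

# Read throughput in bytes/second, per node
rate(aistore_bytes_read_total[5m])

# Write throughput in bytes/second, per node
rate(aistore_bytes_written_total[5m])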
Latency histograms
All latency metrics are histograms with the following bucket upper bounds (in seconds): 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, +Inf.
| Metric | Operation measured |
|---|---|
| aistore_put_latency_seconds | PUT blob |
| aistore_get_latency_seconds | GET blob |
| aistore_delete_latency_seconds | DELETE blob |
| aistore_stat_latency_seconds | Stat/head blob |
| aistore_list_latency_seconds | List blobs |
| aistore_ec_encode_latency_seconds | Erasure-code encode |
| aistore_ec_decode_latency_seconds | Erasure-code decode / reconstruct |
| aistore_shard_fetch_latency_seconds | Remote shard fetch |
| aistore_shard_store_latency_seconds | Remote shard store |
| aistore_replicate_latency_seconds | Full replication round-trip |
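Beyond quantiles, each histogram's _sum and _count series give a cheap mean latency (both appear in the scrape output in Example 1 below). A sketch for mean PUT latency over five minutes:

# Mean PUT latency in seconds, per node
rate(aistore_put_latency_seconds_sum[5m])
  /
rate(aistore_put_latency_seconds_count[5m])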
Cluster state
| Metric | Type | Description |
|---|---|---|
| aistore_nodes_total | Gauge | Total nodes known to this node. |
| aistore_nodes_active | Gauge | Nodes currently in Active state (includes self). |
| aistore_nodes_offline | Gauge | Nodes currently in Offline state. |
| aistore_heartbeat_failures_total | Counter | Total failed outbound heartbeat attempts. |
| aistore_ring_version | Gauge | Current consistent-hash ring version. Monotonically increases on topology changes. |
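Because aistore_ring_version only moves on topology changes, the changes() function makes a convenient churn detector. This sketch fires if the ring changed at all in the past hour:

# Non-zero when the consistent-hash ring was rebuilt in the last hour
changes(aistore_ring_version[1h]) > 0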
Replication and erasure coding
| Metric | Type | Description |
|---|---|---|
| aistore_replication_queue_depth | Gauge | Pending replication operations. |
| aistore_replication_lag_seconds | Gauge | Estimated replication lag. |
| aistore_ec_reconstruction_total | Counter | Total EC reconstruction events triggered. |
| aistore_shards_missing_total | Counter | Total missing shards detected. |
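A sustained replication backlog usually precedes visible lag. A hedged starting point for watching these gauges, with thresholds that are assumptions you should tune to your workload:

# Queue depth persistently high (threshold of 100 is an assumed starting point)
aistore_replication_queue_depth > 100

# Replication lag above 30 seconds (again, an assumed threshold)
aistore_replication_lag_seconds > 30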
Errors
| Metric | Type | Labels | Description |
|---|---|---|---|
| aistore_errors_total | Counter | type="storage\|network\|…" | Total errors, labeled by error type. |
| aistore_request_failures_total | Counter | operation="put\|get\|…" | Total failed requests, labeled by operation. |
Polling the metrics endpoint directly
Each node exposes its metrics at GET /metrics on the configured MetricsAddr. The response is plain text in Prometheus exposition format (content type text/plain; version=0.0.4).
curl -s http://<node-host>:9090/metrics
To watch a specific metric while troubleshooting:
curl -s http://<node-host>:9090/metrics | grep aistore_nodes_offline
Health checks from the command line
The /health endpoint performs a live backend check and returns HTTP 200 OK with body OK when healthy, or HTTP 503 Service Unavailable with body UNHEALTHY: backend error when the storage backend cannot be reached:
curl -o /dev/null -w "%{http_code}" http://<node-host>:9090/health
Kubernetes liveness and readiness probes target this same endpoint (see Step 3 above), so pod scheduling decisions are automatically driven by real backend health.
Computing utilization from Prometheus
Once Prometheus is scraping your cluster, use the following PromQL expressions to monitor day-to-day health:
Cluster-wide storage utilization (%)
sum(aistore_capacity_bytes_used) / sum(aistore_capacity_bytes_total) * 100
Any offline nodes
sum(aistore_nodes_offline) > 0
99th-percentile PUT latency (ms) per node
histogram_quantile(0.99, rate(aistore_put_latency_seconds_bucket[5m])) * 1000
Error rate per operation type
rate(aistore_errors_total[5m])
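To turn the raw error rate into a failure ratio, divide failed requests by total operations. Both series carry node and operation labels, so the division matches per operation and per node:

# Fraction of requests failing, per node and operation
rate(aistore_request_failures_total[5m])
  /
rate(aistore_operations_total[5m])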
Alerting recommendations
Create Prometheus alerting rules for these conditions at a minimum:
| Condition | PromQL | Severity |
|---|---|---|
| Node down | aistore_node_up == 0 | Critical |
| Backend unhealthy | aistore_backend_healthy == 0 | Critical |
| Storage > 85% full | aistore_capacity_bytes_used / aistore_capacity_bytes_total > 0.85 | Warning |
| Storage > 95% full | aistore_capacity_bytes_used / aistore_capacity_bytes_total > 0.95 | Critical |
| Offline nodes | aistore_nodes_offline > 0 | Warning |
| High heartbeat failure rate | rate(aistore_heartbeat_failures_total[5m]) > 1 | Warning |
| Missing EC shards rising | increase(aistore_shards_missing_total[10m]) > 0 | Warning |
Web UI
The same HTTP server that serves /metrics and /health also hosts a built-in web UI at the root path (/). The UI displays a live overview of node states, capacity bar charts, operation counters, and per-peer health—useful for ad-hoc inspection without requiring a Grafana dashboard.
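To reach the UI from a workstation, port-forward the metrics port and open the root path in a browser:

kubectl port-forward pod/nabustore-0 9090:9090 &
# then browse to http://localhost:9090/
open http://localhost:9090/   # macOS; use xdg-open on Linux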
Example 1 — Scrape metrics from a running node
kubectl port-forward pod/nabustore-0 9090:9090 &
curl -s http://localhost:9090/metrics
Expected output (truncated)
# HELP aistore_node_up Node health status (1=healthy, 0=unhealthy)
# TYPE aistore_node_up gauge
aistore_node_up{node="nabustore-0"} 1
# HELP aistore_uptime_seconds Seconds since node started
# TYPE aistore_uptime_seconds gauge
aistore_uptime_seconds{node="nabustore-0"} 3842
# HELP aistore_capacity_bytes_total Total storage capacity in bytes
# TYPE aistore_capacity_bytes_total gauge
aistore_capacity_bytes_total{node="nabustore-0"} 107374182400
# HELP aistore_capacity_bytes_used Used storage in bytes
# TYPE aistore_capacity_bytes_used gauge
aistore_capacity_bytes_used{node="nabustore-0"} 21474836480
# HELP aistore_blob_count Number of blobs stored
# TYPE aistore_blob_count gauge
aistore_blob_count{node="nabustore-0"} 4096
# HELP aistore_operations_total Total operations by type
# TYPE aistore_operations_total counter
aistore_operations_total{node="nabustore-0",operation="get"} 18234
aistore_operations_total{node="nabustore-0",operation="put"} 4096
# HELP aistore_put_latency_seconds Put operation latency
# TYPE aistore_put_latency_seconds histogram
aistore_put_latency_seconds_bucket{node="nabustore-0",le="0.001"} 312
aistore_put_latency_seconds_bucket{node="nabustore-0",le="0.005"} 3901
aistore_put_latency_seconds_bucket{node="nabustore-0",le="0.01"} 4080
aistore_put_latency_seconds_bucket{node="nabustore-0",le="+Inf"} 4096
aistore_put_latency_seconds_sum{node="nabustore-0"} 14.832
aistore_put_latency_seconds_count{node="nabustore-0"} 4096
Example 2 — Health probe returning healthy
curl -v http://localhost:9090/health
Expected output
< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
<
OK
Example 3 — Health probe returning unhealthy (backend failure)
When the storage backend is unreachable (e.g., underlying NVMe device failure), the node returns:
curl -v http://localhost:9090/health
Expected output
< HTTP/1.1 503 Service Unavailable
< Content-Type: text/plain; charset=utf-8
<
UNHEALTHY: backend error
Kubernetes will mark the pod NotReady and remove it from service endpoints after the configured failureThreshold is reached.
Example 4 — PromQL: 99th-percentile GET latency over a 5-minute window
histogram_quantile(
  0.99,
  sum by (node, le) (
    rate(aistore_get_latency_seconds_bucket[5m])
  )
)
Expected output (in Prometheus/Grafana)
{node="nabustore-0"} 0.00421 # ≈ 4.2 ms
{node="nabustore-1"} 0.00389 # ≈ 3.9 ms
{node="nabustore-2"} 0.00412 # ≈ 4.1 ms
Example 5 — Prometheus alert rule for an offline node
Save the following to a PrometheusRule resource in the same namespace as your Prometheus Operator deployment:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nabustore-alerts
  labels:
    release: prometheus
spec:
  groups:
    - name: nabustore.node
      interval: 30s
      rules:
        - alert: NabuStoreNodeDown
          expr: aistore_node_up == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Nabu Store node {{ $labels.node }} is unhealthy"
            description: "Node {{ $labels.node }} has reported node_up=0 for more than 1 minute."
        - alert: NabuStoreStorageNearFull
          expr: >
            aistore_capacity_bytes_used / aistore_capacity_bytes_total > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Nabu Store node {{ $labels.node }} storage above 85%"
Apply it
kubectl apply -f nabustore-alerts.yaml
Issue: /metrics endpoint returns connection refused
Symptom: curl http://<node>:9090/metrics fails with Connection refused or times out.
Likely cause: MetricsAddr is set to "" (empty string) in the server configuration, which disables the HTTP server entirely.
Fix: Set MetricsAddr to a non-empty bind address (e.g., ":9090") in your Helm values and roll the StatefulSet:
helm upgrade nabustore ./chart --set nabustore.metricsAddr=":9090"
kubectl rollout restart statefulset/nabustore
Issue: Kubernetes liveness probe marks pod as CrashLoopBackOff even though the node process is running
Symptom: The pod restarts repeatedly; kubectl describe pod shows the liveness probe hitting /health and receiving 503.
Likely cause: The storage backend is returning an error (disk full, NVMe device failure, or filesystem permissions). The /health handler calls backend.Capacity() and returns 503 if it fails, which causes Kubernetes to kill and restart the pod.
Fix:
- Check pod logs for backend errors: kubectl logs nabustore-0
- If the disk is full, free capacity or expand storage before the pod can recover.
- If it is a permissions issue, inspect the data directory: kubectl exec -it nabustore-0 -- ls -la /data/aistore
- If the problem is transient and you need to prevent repeated restarts, temporarily increase failureThreshold or initialDelaySeconds in the probe spec while you remediate (see the sketch after this list).
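As a temporary measure, the probe can be relaxed in place with a JSON patch. This sketch assumes the nabustore container is the first (index 0) in the pod spec; adjust the index to match your StatefulSet:

kubectl patch statefulset nabustore --type='json' -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold",
   "value": 10}
]'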
Issue: Prometheus shows no metrics for Nabu Store pods
Symptom: No aistore_* series appear in Prometheus, even though the /metrics endpoint responds correctly to curl.
Likely cause 1: Scrape annotations are missing or have the wrong port.
Fix: Verify the annotations on the pod: kubectl get pod nabustore-0 -o jsonpath='{.metadata.annotations}'. Ensure prometheus.io/scrape: "true", prometheus.io/port: "9090", and prometheus.io/path: "/metrics" are all present.
Likely cause 2: You are using Prometheus Operator, which ignores pod annotations and requires a ServiceMonitor resource.
Fix: Create a ServiceMonitor as shown in Step 5 above and confirm its release label matches the label selector of your Prometheus custom resource.
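You can confirm whether Prometheus actually picked up the target by querying its targets API. The service name below assumes the prometheus-operated Service created by the Prometheus Operator; substitute your own if it differs:

kubectl port-forward svc/prometheus-operated 9091:9090 &
curl -s http://localhost:9091/api/v1/targets | grep nabustore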
Issue: aistore_nodes_offline is non-zero but all pods show Running
Symptom: The metric reports one or more offline peers, but kubectl get pods shows all nodes in Running state.
Likely cause: Intra-cluster heartbeat traffic is blocked by a NetworkPolicy, or a node's gRPC port (:50051) is not reachable from peers. Node health in the metrics layer is determined by heartbeat success, not pod state.
Fix:
- Check aistore_heartbeat_failures_total on the reporting node; a high and rising count confirms heartbeat loss.
- Verify NetworkPolicy allows pod-to-pod traffic on the gRPC port: kubectl get networkpolicy -n <namespace> (a permissive policy sketch follows this list).
- Test direct connectivity: kubectl exec nabustore-0 -- curl -v http://nabustore-1:50051 (or use grpc_health_probe).
- Check that the SeedNodes configuration matches the actual DNS names or IPs of your peer pods.
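If a NetworkPolicy turns out to be the blocker, here is a minimal sketch of one that permits intra-cluster gRPC between Nabu Store pods, assuming the app: nabustore label used elsewhere on this page:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: nabustore-allow-grpc
spec:
  podSelector:
    matchLabels:
      app: nabustore
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: nabustore
      ports:
        - protocol: TCP
          port: 50051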
Issue: Latency histograms show all values in the +Inf bucket
Symptom: aistore_put_latency_seconds_bucket has zero counts in all finite buckets; everything accumulates in +Inf.
Likely cause: Operations are taking longer than the maximum bucket bound of 60 seconds, indicating severe backend degradation (e.g., storage I/O saturation or a hung syscall).
Fix:
- Check aistore_backend_healthy; if 0, the backend is already reporting an error.
- Inspect node-level disk I/O metrics via your Kubernetes node exporter (node_disk_io_time_seconds_total).
- If using the SPDK backend, verify the SPDK process is responsive on its RPC socket.
- Consider restarting the affected node pod after confirming peer nodes hold sufficient replicas or EC shards to cover the data.
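To quantify how bad the skew is, compare the largest finite bucket against +Inf; the result is the fraction of operations completing within the 60-second bound:

# Fraction of PUTs finishing within 60 s, per node (1.0 = all, 0 = none)
sum by (node) (rate(aistore_put_latency_seconds_bucket{le="60"}[5m]))
  /
sum by (node) (rate(aistore_put_latency_seconds_bucket{le="+Inf"}[5m]))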
