Metrics
Prometheus metrics or monitoring endpoints
Nabu Store exposes operational and performance metrics through two complementary endpoints: a Prometheus-compatible scrape endpoint (/metrics) that emits text-format time-series data, and a JSON API endpoint (/api/v1/metrics) that returns structured metrics for dashboards and programmatic consumers. Together, these endpoints give you real-time visibility into node health, storage capacity, throughput, latency distributions, cluster topology, erasure-coding activity, and error rates — everything you need to detect problems early and validate cluster behavior after configuration changes.
Before you begin:
- A running Nabu Store node or cluster (single-node or multi-node)
- Network access to the node's HTTP port (default :8080 unless configured otherwise)
- A valid API token if authentication is enabled (see Authentication and API Requests)
- For Prometheus scraping: Prometheus 2.x or any OpenMetrics-compatible scraper
- For dashboard integration: Grafana 9.x or later (recommended), or any tool that can query the JSON API
- curl or another HTTP client for manual verification
Nabu Store's metrics endpoints are built into every node and require no separate installation. Follow these steps to verify they are reachable and, optionally, wire them into Prometheus.
Step 1 — Verify the Prometheus endpoint is reachable
curl -s http://<node-address>:8080/metrics | head -40
Expected output begins with comment lines followed by metric samples:
# HELP aistore_node_up Node health status (1=healthy, 0=unhealthy)
# TYPE aistore_node_up gauge
aistore_node_up{node="node-1"} 1
# HELP aistore_uptime_seconds Seconds since node started
# TYPE aistore_uptime_seconds gauge
aistore_uptime_seconds{node="node-1"} 3742
...
Step 2 — Verify the JSON metrics endpoint is reachable
curl -s http://<node-address>:8080/api/v1/metrics | python3 -m json.tool
Step 3 — Add a Prometheus scrape job
In your prometheus.yml, add a scrape configuration for each Nabu Store node:
scrape_configs:
  - job_name: 'nabu-store'
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          - '<node-1-address>:8080'
          - '<node-2-address>:8080'
          - '<node-3-address>:8080'
For a Kubernetes deployment, use a ServiceMonitor (Prometheus Operator) instead of static targets:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nabu-store
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: nabu-store
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
Step 4 — Reload Prometheus
curl -X POST http://<prometheus-address>:9090/-/reload
Confirm targets appear in the Prometheus UI under Status → Targets.
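Instead of the UI, you can also confirm the scrape targets programmatically through Prometheus's standard /api/v1/targets endpoint (part of the Prometheus HTTP API, not Nabu Store). A minimal sketch; the helper name and the server address are illustrative:

```python
import json
from urllib.request import urlopen  # used in the commented example below

def summarize_targets(payload: dict) -> list[tuple[str, str]]:
    """Return (scrape URL, health) pairs from a Prometheus /api/v1/targets response."""
    return [
        (t["scrapeUrl"], t["health"])
        for t in payload.get("data", {}).get("activeTargets", [])
    ]

# Example (requires a reachable Prometheus server; the address is a placeholder):
#   with urlopen("http://<prometheus-address>:9090/api/v1/targets") as resp:
#       for url, health in summarize_targets(json.load(resp)):
#           print(f"{url}: {health}")
```

Every Nabu Store node you configured should appear with health "up" once the first scrape completes.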
The metrics endpoints are always enabled; there are no feature flags to turn on. The behavior you can influence falls into two areas:
Scrape interval
Because aistore_uptime_seconds is recomputed on every request to /metrics, you should set your scrape interval no lower than 15s to avoid unnecessary CPU overhead from concurrent scrapes. Prometheus defaults work well.
Latency histogram buckets
All latency histograms use the built-in DefaultLatencyBuckets (in seconds):
| Bucket upper bound | Human-readable |
|---|---|
| 0.0001 | 100 µs |
| 0.0005 | 500 µs |
| 0.001 | 1 ms |
| 0.005 | 5 ms |
| 0.01 | 10 ms |
| 0.025 | 25 ms |
| 0.05 | 50 ms |
| 0.1 | 100 ms |
| 0.25 | 250 ms |
| 0.5 | 500 ms |
| 1 | 1 s |
| 2.5 | 2.5 s |
| 5 | 5 s |
| 10 | 10 s |
| 30 | 30 s |
| 60 | 60 s |
| +Inf | catch-all |
These buckets cover the full latency range from fast NVMe-backed operations (sub-millisecond) to slow degraded-mode reconstructions (tens of seconds). Custom bucket configuration is not exposed through the API.
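To see how cumulative buckets like these turn into percentile estimates, the sketch below reimplements the linear interpolation that PromQL's histogram_quantile() performs, applied to the sample PUT-latency bucket counts from Example 1. This is an illustrative approximation, not Nabu Store code:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets,
    using the same linear interpolation as PromQL's histogram_quantile()."""
    buckets = sorted(buckets)
    total = buckets[-1][1]          # the +Inf bucket holds the total observation count
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound   # quantile falls past the last finite bucket
            # Interpolate within the bucket that contains the target rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Sample cumulative counts from the Example 1 scrape (le="+Inf" is the total).
put_buckets = [(0.0001, 0), (0.001, 4201), (0.01, 118400),
               (0.1, 124500), (float("inf"), 124892)]
print(f"approx p99 PUT latency: {histogram_quantile(0.99, put_buckets):.4f} s")
# -> approx p99 PUT latency: 0.0874 s
```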
Node identity label
Every metric carries a node label whose value is the Node ID assigned when the node joined the cluster. In a multi-node cluster, all nodes expose the same metric names but with different node label values, so you can aggregate or filter in Prometheus without naming collisions.
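As a quick illustration of how the node label disambiguates samples, this sketch parses a few exposition-format lines with a regular expression and groups values by node. It is a simplified parser for flat gauge lines only (no escaping, no full OpenMetrics grammar):

```python
import re

# metric_name{label="value",...} sample_value
SAMPLE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')

def by_node(exposition: str) -> dict[str, dict[str, float]]:
    """Group metric samples by their node label: {node: {metric: value}}."""
    out: dict[str, dict[str, float]] = {}
    for line in exposition.splitlines():
        m = SAMPLE_RE.match(line.strip())
        if not m:
            continue  # skips blank lines and # HELP / # TYPE comments
        name, labels, value = m.groups()
        node = dict(kv.split("=") for kv in labels.split(","))["node"].strip('"')
        out.setdefault(node, {})[name] = float(value)
    return out

text = '''
aistore_node_up{node="node-1"} 1
aistore_node_up{node="node-2"} 0
aistore_blob_count{node="node-1"} 124892
'''
print(by_node(text))
```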
Authentication
If cluster-wide authentication is enabled, include your API token on every request to /metrics and /api/v1/metrics. See Authenticate and Issue API Requests for token acquisition.
Fetching raw Prometheus metrics
Send a GET request to /metrics on any node. The response body is plain text in Prometheus exposition format (content type text/plain; version=0.0.4; charset=utf-8).
curl -s http://<node-address>:8080/metrics
If authentication is required:
curl -s -H "Authorization: Bearer <token>" \
http://<node-address>:8080/metrics
Fetching structured JSON metrics
Send a GET request to /api/v1/metrics. The response is JSON and is always served with Cache-Control: no-cache, so every request returns a fresh snapshot.
curl -s -H "Authorization: Bearer <token>" \
http://<node-address>:8080/api/v1/metrics
Only GET is accepted. Sending any other HTTP method returns 405 Method Not Allowed.
Reading per-node health from the JSON endpoint
The health object in the JSON response is the fastest way to check whether a node considers itself operational:
curl -s http://<node-address>:8080/api/v1/metrics | \
python3 -c "import sys,json; d=json.load(sys.stdin); print(d['health'])"
Checking storage utilisation
curl -s http://<node-address>:8080/api/v1/metrics | \
python3 -c '
import sys, json
d = json.load(sys.stdin)
st = d["storage"]
used = st["usedBytes"]
total = st["totalBytes"]
pct = used / total * 100 if total else 0
print(f"Used: {used:,} / {total:,} bytes ({pct:.1f}%)")
'
Writing a Prometheus alerting rule
Use the Prometheus metric names directly in alert expressions:
groups:
  - name: nabu-store
    rules:
      - alert: NabuStoreNodeDown
        expr: aistore_node_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Nabu Store node {{ $labels.node }} is unhealthy"
      - alert: NabuStoreCapacityHigh
        expr: >
          aistore_capacity_bytes_used / aistore_capacity_bytes_total > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.node }} storage above 85%"
      - alert: NabuStoreNodesOffline
        expr: aistore_nodes_offline > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} node(s) offline in cluster"
Example 1 — Full Prometheus scrape response (excerpt)
Request:
curl -s http://node-1.nabu.local:8080/metrics
Response (partial — all metric families follow the same pattern):
# HELP aistore_node_up Node health status (1=healthy, 0=unhealthy)
# TYPE aistore_node_up gauge
aistore_node_up{node="node-1"} 1
# HELP aistore_uptime_seconds Seconds since node started
# TYPE aistore_uptime_seconds gauge
aistore_uptime_seconds{node="node-1"} 86432
# HELP aistore_last_heartbeat_timestamp Unix timestamp of last successful heartbeat
# TYPE aistore_last_heartbeat_timestamp gauge
aistore_last_heartbeat_timestamp{node="node-1"} 1718000123
# HELP aistore_backend_healthy Storage backend health (1=healthy, 0=unhealthy)
# TYPE aistore_backend_healthy gauge
aistore_backend_healthy{node="node-1"} 1
# HELP aistore_grpc_connections_active Active gRPC connections
# TYPE aistore_grpc_connections_active gauge
aistore_grpc_connections_active{node="node-1"} 4
# HELP aistore_peer_health Per-peer health status (1=online, 0=offline)
# TYPE aistore_peer_health gauge
aistore_peer_health{node="node-1",peer="node-2"} 1
aistore_peer_health{node="node-1",peer="node-3"} 1
# HELP aistore_capacity_bytes_total Total storage capacity in bytes
# TYPE aistore_capacity_bytes_total gauge
aistore_capacity_bytes_total{node="node-1"} 10995116277760
# HELP aistore_capacity_bytes_used Used storage in bytes
# TYPE aistore_capacity_bytes_used gauge
aistore_capacity_bytes_used{node="node-1"} 4123456789000
# HELP aistore_blob_count Number of blobs stored
# TYPE aistore_blob_count gauge
aistore_blob_count{node="node-1"} 124892
# HELP aistore_bytes_read_total Total bytes read
# TYPE aistore_bytes_read_total counter
aistore_bytes_read_total{node="node-1"} 8834267136
# HELP aistore_bytes_written_total Total bytes written
# TYPE aistore_bytes_written_total counter
aistore_bytes_written_total{node="node-1"} 4123456789000
# HELP aistore_operations_total Total operations by type
# TYPE aistore_operations_total counter
aistore_operations_total{node="node-1",operation="put"} 124892
aistore_operations_total{node="node-1",operation="get"} 389201
aistore_operations_total{node="node-1",operation="delete"} 1024
aistore_operations_total{node="node-1",operation="stat"} 45000
aistore_operations_total{node="node-1",operation="list"} 3210
# HELP aistore_put_latency_seconds Put operation latency
# TYPE aistore_put_latency_seconds histogram
aistore_put_latency_seconds_bucket{node="node-1",le="0.0001"} 0
aistore_put_latency_seconds_bucket{node="node-1",le="0.001"} 4201
aistore_put_latency_seconds_bucket{node="node-1",le="0.01"} 118400
aistore_put_latency_seconds_bucket{node="node-1",le="0.1"} 124500
aistore_put_latency_seconds_bucket{node="node-1",le="+Inf"} 124892
aistore_put_latency_seconds_sum{node="node-1"} 312.45
aistore_put_latency_seconds_count{node="node-1"} 124892
# HELP aistore_nodes_total Total nodes in cluster
# TYPE aistore_nodes_total gauge
aistore_nodes_total{node="node-1"} 3
# HELP aistore_nodes_active Active nodes in cluster
# TYPE aistore_nodes_active gauge
aistore_nodes_active{node="node-1"} 3
# HELP aistore_nodes_offline Offline nodes in cluster
# TYPE aistore_nodes_offline gauge
aistore_nodes_offline{node="node-1"} 0
# HELP aistore_replication_queue_depth Current replication queue depth
# TYPE aistore_replication_queue_depth gauge
aistore_replication_queue_depth{node="node-1"} 0
Example 2 — JSON metrics response
Request:
curl -s http://node-1.nabu.local:8080/api/v1/metrics
Response:
{
"nodeId": "node-1",
"uptime": 86432,
"storage": {
"totalBytes": 10995116277760,
"usedBytes": 4123456789000,
"blobCount": 124892,
"shardCount": 374676
},
"throughput": {
"bytesReadTotal": 8834267136,
"bytesWrittenTotal": 4123456789000,
"operations": {
"put": 124892,
"get": 389201,
"delete": 1024,
"stat": 45000,
"list": 3210
}
},
"cluster": {
"nodesTotal": 3,
"nodesActive": 3,
"nodesOffline": 0,
"ringVersion": 7
},
"health": {
"nodeUp": true,
"backendHealthy": true,
"peerHealth": {
"node-2": true,
"node-3": true
}
}
}
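Given a snapshot shaped like the response above, a small helper can reduce it to the two numbers most dashboards care about: capacity utilisation and overall health. A sketch assuming the field names shown in Example 2:

```python
def summarize(snapshot: dict) -> str:
    """One-line summary of a /api/v1/metrics snapshot: capacity % and health."""
    st = snapshot["storage"]
    health = snapshot["health"]
    pct = st["usedBytes"] / st["totalBytes"] * 100 if st["totalBytes"] else 0.0
    ok = health["nodeUp"] and health["backendHealthy"] and all(health["peerHealth"].values())
    return f"{snapshot['nodeId']}: {pct:.1f}% used, {'OK' if ok else 'DEGRADED'}"

# Trimmed copy of the Example 2 response.
snap = {
    "nodeId": "node-1",
    "storage": {"totalBytes": 10995116277760, "usedBytes": 4123456789000},
    "health": {"nodeUp": True, "backendHealthy": True,
               "peerHealth": {"node-2": True, "node-3": True}},
}
print(summarize(snap))  # -> node-1: 37.5% used, OK
```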
Example 3 — Detecting a degraded node with a one-liner
This shell snippet checks each node in a list and prints DEGRADED for any node whose nodeUp is false (wrap it in a loop with sleep to poll continuously):
for NODE in node-1.nabu.local node-2.nabu.local node-3.nabu.local; do
STATUS=$(curl -s http://${NODE}:8080/api/v1/metrics \
| python3 -c "import sys,json; d=json.load(sys.stdin); print('OK' if d['health']['nodeUp'] else 'DEGRADED')")
echo "${NODE}: ${STATUS}"
done
Expected output (healthy cluster):
node-1.nabu.local: OK
node-2.nabu.local: OK
node-3.nabu.local: OK
Expected output (one node down):
node-1.nabu.local: OK
node-2.nabu.local: DEGRADED
node-3.nabu.local: OK
Example 4 — Querying latency percentiles in Prometheus
After Prometheus has scraped several intervals, compute the approximate 99th-percentile GET latency across all nodes:
histogram_quantile(
0.99,
sum by (le) (rate(aistore_get_latency_seconds_bucket[5m]))
)
For per-node breakdown:
histogram_quantile(
0.99,
sum by (node, le) (rate(aistore_get_latency_seconds_bucket[5m]))
)
Issue: /metrics returns 404 Not Found
Symptom: curl http://<node>:8080/metrics returns an HTTP 404 response.
Likely cause: You are connecting to the wrong port, or the path is misspelled.
Fix: Confirm the node's HTTP port in your deployment configuration. The metrics path is exactly /metrics (no trailing slash). Try:
curl -v http://<node>:8080/metrics
and inspect the Location header if a redirect occurs.
Issue: a metrics endpoint returns 405 Method Not Allowed
Symptom: A non-GET request (e.g., POST) to /metrics or /api/v1/metrics returns 405.
Likely cause: Both metrics endpoints accept only GET. Prometheus scrapers and curl without -X use GET by default, so this typically indicates a misconfigured client or middleware.
Fix: Ensure your scraper or client issues GET requests only.
Issue: Prometheus shows aistore_node_up{node="node-2"} 0
Symptom: One or more nodes report 0 for aistore_node_up or aistore_backend_healthy.
Likely cause: The node's storage backend has become unreachable (disk failure, SPDK process crash, CXL device error) or the node process itself is in a fault state.
Fix:
- Check aistore_backend_healthy — if it is also 0, the storage layer is the issue. Consult Diagnose and Recover from a Node Failure.
- Check aistore_peer_health{peer="node-2"} on a healthy peer to see whether the affected node appears offline from the cluster's perspective.
- Review the node's application logs for error messages from the backend subsystem.
Issue: aistore_nodes_offline is greater than zero but the node appears running
Symptom: aistore_nodes_offline reports one or more offline nodes, but the node process responds to HTTP requests.
Likely cause: The node has missed heartbeat deadlines. A node is marked healthy only when its state is active and its lastHeartbeat is within 15 seconds. Network latency, GC pauses, or clock skew can cause transient failures.
Fix:
- Check aistore_last_heartbeat_timestamp on the affected node and compare it with the current Unix time. A gap larger than 15 seconds indicates a missed heartbeat.
- Check aistore_heartbeat_failures_total for a rising count.
- Investigate network connectivity between nodes and verify system clocks are synchronized (NTP/PTP).
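The staleness check above can be scripted. This sketch compares the exported heartbeat timestamp against the current clock using the 15-second deadline described in this section (treat the threshold as deployment-dependent):

```python
import time

HEARTBEAT_DEADLINE_S = 15  # deadline described above; may differ per deployment

def heartbeat_stale(last_heartbeat_ts, now=None):
    """True if the gap since the last heartbeat exceeds the deadline."""
    now = time.time() if now is None else now
    return (now - last_heartbeat_ts) > HEARTBEAT_DEADLINE_S

# e.g. a value scraped from aistore_last_heartbeat_timestamp{node="node-2"}
print(heartbeat_stale(1718000123, now=1718000160))  # 37 s gap -> True
```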
Issue: All latency histograms show counts only in the +Inf bucket
Symptom: Prometheus shows most or all observations in the le="+Inf" bucket, suggesting operations are taking longer than 60 seconds.
Likely cause: Severely degraded storage backend, erasure-coding reconstruction under heavy shard loss, or a hanging peer gRPC connection.
Fix:
- Check aistore_replication_lag_seconds and aistore_replication_queue_depth for a backlog.
- Check aistore_ec_reconstruction_total for a rapidly increasing count, which indicates frequent reconstruction from missing shards.
- Check aistore_shards_missing_total to quantify data availability.
- If the issue is isolated to PUT operations, check aistore_backend_healthy on each peer.
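To quantify how bad the tail is, the sketch below computes the fraction of observations that fell beyond the largest finite bucket bound, using the same cumulative (upper bound, count) pairs the exposition format provides. The counts here are illustrative:

```python
def overflow_fraction(buckets: list[tuple[float, int]]) -> float:
    """Fraction of observations above the largest finite bucket bound."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    last_finite = max(c for b, c in buckets if b != float("inf"))
    return (total - last_finite) / total if total else 0.0

# Healthy node: almost everything lands below the 60 s bound.
healthy = [(0.1, 124500), (60, 124890), (float("inf"), 124892)]
# Degraded node: most observations exceed the 60 s bound.
degraded = [(0.1, 10), (60, 40), (float("inf"), 5000)]
print(f"healthy overflow:  {overflow_fraction(healthy):.4f}")   # -> 0.0000
print(f"degraded overflow: {overflow_fraction(degraded):.4f}")  # -> 0.9920
```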
Issue: JSON endpoint returns 500 Internal Server Error
Symptom: GET /api/v1/metrics returns HTTP 500.
Likely cause: The node's internal metrics collection encountered an error serializing the response (e.g., a nil pointer in cluster state during a topology change).
Fix: Retry the request — topology-change windows are transient. If the 500 persists, check node application logs for JSON encoding errors and verify the cluster has completed any in-progress ring rebalancing (aistore_ring_version should stabilize).
Issue: aistore_peer_health labels are missing for some peers
Symptom: The aistore_peer_health metric does not include a label for every expected peer node.
Likely cause: The PeerHealth gauge vector only emits a label after the node has observed at least one heartbeat from that peer. Newly joined nodes or nodes that have never been seen will not appear.
Fix: After adding a node to the cluster, wait for at least one full heartbeat cycle (typically ≤ 10 seconds) before expecting its peer label to appear. If the label never appears, verify the new node has successfully joined via GET /api/v1/cluster.
