Metrics
Prometheus metrics or monitoring endpoints
Nabu Store exposes operational and performance metrics through two complementary endpoints: a Prometheus-compatible scrape endpoint (/metrics) that emits text-format time-series data, and a JSON API endpoint (/api/v1/metrics) that returns structured metrics for dashboards and programmatic consumers. Together, these endpoints give you real-time visibility into node health, storage capacity, throughput, latency distributions, cluster topology, erasure-coding activity, and error rates — everything you need to detect problems early and validate cluster behavior after configuration changes.
Before you begin:
- A running Nabu Store node or cluster (single-node or multi-node)
- Network access to the node's HTTP port (default :8080 unless configured otherwise)
- A valid API token if authentication is enabled (see Authentication and API Requests)
- For Prometheus scraping: Prometheus 2.x or any OpenMetrics-compatible scraper
- For dashboard integration: Grafana 9.x or later (recommended), or any tool that can query the JSON API
- curl or another HTTP client for manual verification
Nabu Store's metrics endpoints are built into every node and require no separate installation. Follow these steps to verify they are reachable and, optionally, wire them into Prometheus.
Step 1 — Verify the Prometheus endpoint is reachable
curl -s http://<node-address>:8080/metrics | head -40
Expected output begins with comment lines followed by metric samples:
# HELP aistore_node_up Node health status (1=healthy, 0=unhealthy)
# TYPE aistore_node_up gauge
aistore_node_up{node="node-1"} 1
# HELP aistore_uptime_seconds Seconds since node started
# TYPE aistore_uptime_seconds gauge
aistore_uptime_seconds{node="node-1"} 3742
...
Step 2 — Verify the JSON metrics endpoint is reachable
curl -s http://<node-address>:8080/api/v1/metrics | python3 -m json.tool
Step 3 — Add a Prometheus scrape job
In your prometheus.yml, add a scrape configuration for each Nabu Store node:
scrape_configs:
  - job_name: 'nabu-store'
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          - '<node-1-address>:8080'
          - '<node-2-address>:8080'
          - '<node-3-address>:8080'
For a Kubernetes deployment, use a ServiceMonitor (Prometheus Operator) instead of static targets:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nabu-store
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: nabu-store
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
Step 4 — Reload Prometheus
curl -X POST http://<prometheus-address>:9090/-/reload
Confirm targets appear in the Prometheus UI under Status → Targets.
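Instead of the UI, you can also confirm the scrape targets programmatically through Prometheus's standard /api/v1/targets endpoint (part of the Prometheus HTTP API, not Nabu Store). A minimal sketch; the helper name and the server address are illustrative:

```python
import json
from urllib.request import urlopen  # used in the commented example below

def summarize_targets(payload: dict) -> list[tuple[str, str]]:
    """Return (scrape URL, health) pairs from a Prometheus /api/v1/targets response."""
    return [
        (t["scrapeUrl"], t["health"])
        for t in payload.get("data", {}).get("activeTargets", [])
    ]

# Example (requires a reachable Prometheus server; the address is a placeholder):
#   with urlopen("http://<prometheus-address>:9090/api/v1/targets") as resp:
#       for url, health in summarize_targets(json.load(resp)):
#           print(f"{url}: {health}")
```

Every Nabu Store node you configured should appear with health "up" once the first scrape completes.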
The metrics endpoints are always enabled; there are no feature flags to turn on. The behavior you can influence falls into two areas:
Scrape interval
Because aistore_uptime_seconds is recomputed on every request to /metrics, you should set your scrape interval no lower than 15s to avoid unnecessary CPU overhead from concurrent scrapes. Prometheus defaults work well.
Latency histogram buckets
All latency histograms use the built-in DefaultLatencyBuckets (in seconds):
| Bucket upper bound | Human-readable |
|---|---|
| 0.0001 | 100 µs |
| 0.0005 | 500 µs |
| 0.001 | 1 ms |
| 0.005 | 5 ms |
| 0.01 | 10 ms |
| 0.025 | 25 ms |
| 0.05 | 50 ms |
| 0.1 | 100 ms |
| 0.25 | 250 ms |
| 0.5 | 500 ms |
| 1 | 1 s |
| 2.5 | 2.5 s |
| 5 | 5 s |
| 10 | 10 s |
| 30 | 30 s |
| 60 | 60 s |
| +Inf | catch-all |
These buckets cover the full latency range from fast NVMe-backed operations (sub-millisecond) to slow degraded-mode reconstructions (tens of seconds). Custom bucket configuration is not exposed through the API.
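To see how cumulative buckets like these turn into percentile estimates, the sketch below reimplements the linear interpolation that PromQL's histogram_quantile() performs, applied to the sample PUT-latency bucket counts from Example 1. This is an illustrative approximation, not Nabu Store code:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets,
    using the same linear interpolation as PromQL's histogram_quantile()."""
    buckets = sorted(buckets)
    total = buckets[-1][1]          # the +Inf bucket holds the total observation count
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound   # quantile falls past the last finite bucket
            # Interpolate within the bucket that contains the target rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Sample cumulative counts from the Example 1 scrape (le="+Inf" is the total).
put_buckets = [(0.0001, 0), (0.001, 4201), (0.01, 118400),
               (0.1, 124500), (float("inf"), 124892)]
print(f"approx p99 PUT latency: {histogram_quantile(0.99, put_buckets):.4f} s")
# -> approx p99 PUT latency: 0.0874 s
```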
Node identity label
Every metric carries a node label whose value is the Node ID assigned when the node joined the cluster. In a multi-node cluster, all nodes expose the same metric names but with different node label values, so you can aggregate or filter in Prometheus without naming collisions.
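As a quick illustration of how the node label disambiguates samples, this sketch parses a few exposition-format lines with a regular expression and groups values by node. It is a simplified parser for flat gauge lines only (no escaping, no full OpenMetrics grammar):

```python
import re

# metric_name{label="value",...} sample_value
SAMPLE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')

def by_node(exposition: str) -> dict[str, dict[str, float]]:
    """Group metric samples by their node label: {node: {metric: value}}."""
    out: dict[str, dict[str, float]] = {}
    for line in exposition.splitlines():
        m = SAMPLE_RE.match(line.strip())
        if not m:
            continue  # skips blank lines and # HELP / # TYPE comments
        name, labels, value = m.groups()
        node = dict(kv.split("=") for kv in labels.split(","))["node"].strip('"')
        out.setdefault(node, {})[name] = float(value)
    return out

text = '''
aistore_node_up{node="node-1"} 1
aistore_node_up{node="node-2"} 0
aistore_blob_count{node="node-1"} 124892
'''
print(by_node(text))
```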
Authentication
If cluster-wide authentication is enabled, include your API token on every request to /metrics and /api/v1/metrics. See Authenticate and Issue API Requests for token acquisition.
Fetching raw Prometheus metrics
Send a GET request to /metrics on any node. The response body is plain text in Prometheus exposition format (content type text/plain; version=0.0.4; charset=utf-8).
curl -s http://<node-address>:8080/metrics
If authentication is required:
curl -s -H "Authorization: Bearer <token>" \
http://<node-address>:8080/metrics
Fetching structured JSON metrics
Send a GET request to /api/v1/metrics. The response is JSON and is always served with Cache-Control: no-cache, so every request returns a fresh snapshot.
curl -s -H "Authorization: Bearer <token>" \
http://<node-address>:8080/api/v1/metrics
Only GET is accepted. Sending any other HTTP method returns 405 Method Not Allowed.
Reading per-node health from the JSON endpoint
The health object in the JSON response is the fastest way to check whether a node considers itself operational:
curl -s http://<node-address>:8080/api/v1/metrics | \
python3 -c "import sys,json; d=json.load(sys.stdin); print(d['health'])"
Checking storage utilisation
curl -s http://<node-address>:8080/api/v1/metrics | \
python3 -c '
import sys, json
d = json.load(sys.stdin)
st = d["storage"]
used = st["usedBytes"]
total = st["totalBytes"]
pct = used / total * 100 if total else 0
print(f"Used: {used:,} / {total:,} bytes ({pct:.1f}%)")
'
Writing a Prometheus alerting rule
Use the Prometheus metric names directly in alert expressions:
groups:
  - name: nabu-store
    rules:
      - alert: NabuStoreNodeDown
        expr: aistore_node_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Nabu Store node {{ $labels.node }} is unhealthy"
      - alert: NabuStoreCapacityHigh
        expr: >
          aistore_capacity_bytes_used / aistore_capacity_bytes_total > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.node }} storage above 85%"
      - alert: NabuStoreNodesOffline
        expr: aistore_nodes_offline > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} node(s) offline in cluster"
Example 1 — Full Prometheus scrape response (excerpt)
Request:
curl -s http://node-1.nabu.local:8080/metrics
Response (partial — all metric families follow the same pattern):
# HELP aistore_node_up Node health status (1=healthy, 0=unhealthy)
# TYPE aistore_node_up gauge
aistore_node_up{node="node-1"} 1
# HELP aistore_uptime_seconds Seconds since node started
# TYPE aistore_uptime_seconds gauge
aistore_uptime_seconds{node="node-1"} 86432
# HELP aistore_last_heartbeat_timestamp Unix timestamp of last successful heartbeat
# TYPE aistore_last_heartbeat_timestamp gauge
aistore_last_heartbeat_timestamp{node="node-1"} 1718000123
# HELP aistore_backend_healthy Storage backend health (1=healthy, 0=unhealthy)
# TYPE aistore_backend_healthy gauge
aistore_backend_healthy{node="node-1"} 1
# HELP aistore_grpc_connections_active Active gRPC connections
# TYPE aistore_grpc_connections_active gauge
aistore_grpc_connections_active{node="node-1"} 4
# HELP aistore_peer_health Per-peer health status (1=online, 0=offline)
# TYPE aistore_peer_health gauge
aistore_peer_health{node="node-1",peer="node-2"} 1
aistore_peer_health{node="node-1",peer="node-3"} 1
# HELP aistore_capacity_bytes_total Total storage capacity in bytes
# TYPE aistore_capacity_bytes_total gauge
aistore_capacity_bytes_total{node="node-1"} 10995116277760
# HELP aistore_capacity_bytes_used Used storage in bytes
# TYPE aistore_capacity_bytes_used gauge
aistore_capacity_bytes_used{node="node-1"} 4123456789000
# HELP aistore_blob_count Number of blobs stored
# TYPE aistore_blob_count gauge
aistore_blob_count{node="node-1"} 124892
# HELP aistore_bytes_read_total Total bytes read
# TYPE aistore_bytes_read_total counter
aistore_bytes_read_total{node="node-1"} 8834267136
# HELP aistore_bytes_written_total Total bytes written
# TYPE aistore_bytes_written_total counter
aistore_bytes_written_total{node="node-1"} 4123456789000
# HELP aistore_operations_total Total operations by type
# TYPE aistore_operations_total counter
aistore_operations_total{node="node-1",operation="put"} 124892
aistore_operations_total{node="node-1",operation="get"} 389201
aistore_operations_total{node="node-1",operation="delete"} 1024
aistore_operations_total{node="node-1",operation="stat"} 45000
aistore_operations_total{node="node-1",operation="list"} 3210
# HELP aistore_put_latency_seconds Put operation latency
# TYPE aistore_put_latency_seconds histogram
aistore_put_latency_seconds_bucket{node="node-1",le="0.0001"} 0
aistore_put_latency_seconds_bucket{node="node-1",le="0.001"} 4201
aistore_put_latency_seconds_bucket{node="node-1",le="0.01"} 118400
aistore_put_latency_seconds_bucket{node="node-1",le="0.1"} 124500
aistore_put_latency_seconds_bucket{node="node-1",le="+Inf"} 124892
aistore_put_latency_seconds_sum{node="node-1"} 312.45
aistore_put_latency_seconds_count{node="node-1"} 124892
# HELP aistore_nodes_total Total nodes in cluster
# TYPE aistore_nodes_total gauge
aistore_nodes_total{node="node-1"} 3
# HELP aistore_nodes_active Active nodes in cluster
# TYPE aistore_nodes_active gauge
aistore_nodes_active{node="node-1"} 3
# HELP aistore_nodes_offline Offline nodes in cluster
# TYPE aistore_nodes_offline gauge
aistore_nodes_offline{node="node-1"} 0
# HELP aistore_replication_queue_depth Current replication queue depth
# TYPE aistore_replication_queue_depth gauge
aistore_replication_queue_depth{node="node-1"} 0
Example 2 — JSON metrics response
Request:
curl -s http://node-1.nabu.local:8080/api/v1/metrics
Response:
{
"nodeId": "node-1",
"uptime": 86432,
"storage": {
"totalBytes": 10995116277760,
"usedBytes": 4123456789000,
"blobCount": 124892,
"shardCount": 374676
},
"throughput": {
"bytesReadTotal": 8834267136,
"bytesWrittenTotal": 4123456789000,
"operations": {
"put": 124892,
"get": 389201,
"delete": 1024,
"stat": 45000,
"list": 3210
}
},
"cluster": {
"nodesTotal": 3,
"nodesActive": 3,
"nodesOffline": 0,
"ringVersion": 7
},
"health": {
"nodeUp": true,
"backendHealthy": true,
"peerHealth": {
"node-2": true,
"node-3": true
}
}
}
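Given a snapshot shaped like the response above, a small helper can reduce it to the two numbers most dashboards care about: capacity utilisation and overall health. A sketch assuming the field names shown in Example 2:

```python
def summarize(snapshot: dict) -> str:
    """One-line summary of a /api/v1/metrics snapshot: capacity % and health."""
    st = snapshot["storage"]
    health = snapshot["health"]
    pct = st["usedBytes"] / st["totalBytes"] * 100 if st["totalBytes"] else 0.0
    ok = health["nodeUp"] and health["backendHealthy"] and all(health["peerHealth"].values())
    return f"{snapshot['nodeId']}: {pct:.1f}% used, {'OK' if ok else 'DEGRADED'}"

# Trimmed copy of the Example 2 response.
snap = {
    "nodeId": "node-1",
    "storage": {"totalBytes": 10995116277760, "usedBytes": 4123456789000},
    "health": {"nodeUp": True, "backendHealthy": True,
               "peerHealth": {"node-2": True, "node-3": True}},
}
print(summarize(snap))  # -> node-1: 37.5% used, OK
```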
Example 3 — Detecting a degraded node with a one-liner
This shell snippet checks each node in a list and prints DEGRADED for any node whose nodeUp is false (wrap it in a loop with sleep to poll continuously):
for NODE in node-1.nabu.local node-2.nabu.local node-3.nabu.local; do
STATUS=$(curl -s http://${NODE}:8080/api/v1/metrics \
| python3 -c "import sys,json; d=json.load(sys.stdin); print('OK' if d['health']['nodeUp'] else 'DEGRADED')")
echo "${NODE}: ${STATUS}"
done
Expected output (healthy cluster):
node-1.nabu.local: OK
node-2.nabu.local: OK
node-3.nabu.local: OK
Expected output (one node down):
node-1.nabu.local: OK
node-2.nabu.local: DEGRADED
node-3.nabu.local: OK
Example 4 — Querying latency percentiles in Prometheus
After Prometheus has scraped several intervals, compute the approximate 99th-percentile GET latency across all nodes:
histogram_quantile(
0.99,
sum by (le) (rate(aistore_get_latency_seconds_bucket[5m]))
)
For per-node breakdown:
histogram_quantile(
0.99,
sum by (node, le) (rate(aistore_get_latency_seconds_bucket[5m]))
)
Issue: /metrics returns 404 Not Found
Symptom: curl http://<node>:8080/metrics returns an HTTP 404 response.
Likely cause: You are connecting to the wrong port, or the path is misspelled.
Fix: Confirm the node's HTTP port in your deployment configuration. The metrics path is exactly /metrics (no trailing slash). Try:
curl -v http://<node>:8080/metrics
and inspect the Location header if a redirect occurs.
Issue: a metrics endpoint returns 405 Method Not Allowed
Symptom: A non-GET request (e.g., POST) to /metrics or /api/v1/metrics returns 405.
Likely cause: Both metrics endpoints accept only GET. Prometheus scrapers and curl without -X use GET by default, so this typically indicates a misconfigured client or middleware.
Fix: Ensure your scraper or client issues GET requests only.
Issue: Prometheus shows aistore_node_up{node="node-2"} 0
Symptom: One or more nodes report 0 for aistore_node_up or aistore_backend_healthy.
Likely cause: The node's storage backend has become unreachable (disk failure, SPDK process crash, CXL device error) or the node process itself is in a fault state.
Fix:
- Check aistore_backend_healthy — if it is also 0, the storage layer is the issue. Consult Diagnose and Recover from a Node Failure.
- Check aistore_peer_health{peer="node-2"} on a healthy peer to see whether the affected node appears offline from the cluster's perspective.
- Review the node's application logs for error messages from the backend subsystem.
Issue: aistore_nodes_offline is greater than zero but the node appears running
Symptom: aistore_nodes_offline reports one or more offline nodes, but the node process responds to HTTP requests.
Likely cause: The node has missed heartbeat deadlines. A node is marked healthy only when its state is active and its lastHeartbeat is within 15 seconds. Network latency, GC pauses, or clock skew can cause transient failures.
Fix:
- Check aistore_last_heartbeat_timestamp on the affected node and compare it with the current Unix time. A gap larger than 15 seconds indicates a missed heartbeat.
- Check aistore_heartbeat_failures_total for a rising count.
- Investigate network connectivity between nodes and verify system clocks are synchronized (NTP/PTP).
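The staleness check above can be scripted. This sketch compares the exported heartbeat timestamp against the current clock using the 15-second deadline described in this section (treat the threshold as deployment-dependent):

```python
import time

HEARTBEAT_DEADLINE_S = 15  # deadline described above; may differ per deployment

def heartbeat_stale(last_heartbeat_ts, now=None):
    """True if the gap since the last heartbeat exceeds the deadline."""
    now = time.time() if now is None else now
    return (now - last_heartbeat_ts) > HEARTBEAT_DEADLINE_S

# e.g. a value scraped from aistore_last_heartbeat_timestamp{node="node-2"}
print(heartbeat_stale(1718000123, now=1718000160))  # 37 s gap -> True
```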
Issue: All latency histograms show counts only in the +Inf bucket
Symptom: Prometheus shows most or all observations in the le="+Inf" bucket, suggesting operations are taking longer than 60 seconds.
Likely cause: Severely degraded storage backend, erasure-coding reconstruction under heavy shard loss, or a hanging peer gRPC connection.
Fix:
- Check aistore_replication_lag_seconds and aistore_replication_queue_depth for a backlog.
- Check aistore_ec_reconstruction_total for a rapidly increasing count, which indicates frequent reconstruction from missing shards.
- Check aistore_shards_missing_total to quantify data availability.
- If the issue is isolated to PUT operations, check aistore_backend_healthy on each peer.
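To quantify how bad the tail is, the sketch below computes the fraction of observations that fell beyond the largest finite bucket bound, using the same cumulative (upper bound, count) pairs the exposition format provides. The counts here are illustrative:

```python
def overflow_fraction(buckets: list[tuple[float, int]]) -> float:
    """Fraction of observations above the largest finite bucket bound."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    last_finite = max(c for b, c in buckets if b != float("inf"))
    return (total - last_finite) / total if total else 0.0

# Healthy node: almost everything lands below the 60 s bound.
healthy = [(0.1, 124500), (60, 124890), (float("inf"), 124892)]
# Degraded node: most observations exceed the 60 s bound.
degraded = [(0.1, 10), (60, 40), (float("inf"), 5000)]
print(f"healthy overflow:  {overflow_fraction(healthy):.4f}")   # -> 0.0000
print(f"degraded overflow: {overflow_fraction(degraded):.4f}")  # -> 0.9920
```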
Issue: JSON endpoint returns 500 Internal Server Error
Symptom: GET /api/v1/metrics returns HTTP 500.
Likely cause: The node's internal metrics collection encountered an error serializing the response (e.g., a nil pointer in cluster state during a topology change).
Fix: Retry the request — topology-change windows are transient. If the 500 persists, check node application logs for JSON encoding errors and verify the cluster has completed any in-progress ring rebalancing (aistore_ring_version should stabilize).
Issue: aistore_peer_health labels are missing for some peers
Symptom: The aistore_peer_health metric does not include a label for every expected peer node.
Likely cause: The PeerHealth gauge vector only emits a label after the node has observed at least one heartbeat from that peer. Newly joined nodes or nodes that have never been seen will not appear.
Fix: After adding a node to the cluster, wait for at least one full heartbeat cycle (typically ≤ 10 seconds) before expecting its peer label to appear. If the label never appears, verify the new node has successfully joined via GET /api/v1/cluster.
