
Error Codes

HTTP status codes and error response formats


Overview

This page describes how Nabu Store expresses errors over HTTP: the standard status codes you will receive, the JSON envelope every error response uses, and the application-level error codes embedded inside that envelope. Understanding these codes lets you write resilient clients that distinguish retriable network faults from permanent data errors, surface meaningful diagnostics, and avoid silent data loss during blob operations.


Prerequisites

Before using this reference:

  • You have a running Nabu Store cluster (single-node or multi-node).
  • You are issuing HTTP requests to the Nabu Store API gateway (the service that translates HTTP to the internal gRPC transport).
  • You are familiar with standard HTTP semantics (status codes, headers, request bodies).
  • Your client can parse JSON response bodies.
  • Authentication credentials (token or mTLS certificate) are already obtained. See Authenticate and issue API requests programmatically for setup steps.

Configuration

Error response behaviour is controlled by the following server-side settings. These are set at cluster startup and affect every API endpoint.

  • error.include_detail (default: true; valid values: true / false). When true, the detail field in error responses contains a human-readable explanation of the failure. Set to false in high-security environments to avoid leaking internal state to callers.
  • error.include_grpc_code (default: false; valid values: true / false). When true, the raw gRPC status code is included as grpc_code in the error envelope. Useful when clients want to map failures back to the internal RPC layer without consulting server logs.
  • error.max_detail_bytes (default: 512; valid values: any positive integer). Truncates the detail string to this many bytes, preventing very long internal error messages from being forwarded to clients.

Usage

Every failed API request returns a JSON body regardless of the HTTP status code. Parse this body rather than relying on the status code alone, because the code field provides finer-grained information than the HTTP status.

Error response envelope

{
  "code": "NOT_FOUND",
  "message": "blob not found on any node",
  "detail": "blob 3f2a...c1 not found on any node",
  "request_id": "01HZ7K3P9QVTW4XBJD8MG6NR2F"
}
  • code (string): Machine-readable error code. Maps one-to-one to the gRPC status codes used internally.
  • message (string): Short, stable description of the error category. Safe to display in UIs.
  • detail (string): Longer explanation, potentially including internal context. May be empty if error.include_detail is false.
  • request_id (string): Unique identifier for this request. Include it in support tickets and use it for log correlation.
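As a defensive-parsing sketch, a client can read the envelope while tolerating absent fields (the parse_error helper below is illustrative, not part of any official Nabu Store client; detail may be empty when error.include_detail is false):

```python
import json

def parse_error(body: str) -> dict:
    """Parse a Nabu Store error envelope, tolerating absent fields."""
    env = json.loads(body)
    return {
        "code": env.get("code", "UNKNOWN"),       # machine-readable; prefer over HTTP status
        "message": env.get("message", ""),        # safe to display in UIs
        "detail": env.get("detail", ""),          # may be empty if include_detail is false
        "request_id": env.get("request_id", ""),  # for support tickets / log correlation
    }
```

A client would call parse_error on every non-2xx body before deciding how to react.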

HTTP status code to error code mapping

  • 400 Bad Request → INVALID_ARGUMENT: malformed blob ID, missing required field, or invalid policy value.
  • 401 Unauthorized → UNAUTHENTICATED: missing or expired bearer token / mTLS certificate.
  • 403 Forbidden → PERMISSION_DENIED: authenticated principal lacks permission for the requested operation.
  • 404 Not Found → NOT_FOUND: the requested blob does not exist on any node in the cluster.
  • 409 Conflict → ALREADY_EXISTS: a PUT was issued for a blob whose ID is already present (content-addressed deduplication).
  • 412 Precondition Failed → FAILED_PRECONDITION: the cluster does not have enough nodes to satisfy the requested replication or erasure-coding policy.
  • 429 Too Many Requests → RESOURCE_EXHAUSTED: rate limit exceeded, or the cluster is out of storage capacity.
  • 500 Internal Server Error → INTERNAL: unhandled server fault. Check cluster logs and the request_id.
  • 503 Service Unavailable → UNAVAILABLE: the cluster is degraded or a required node is offline. Retrying with backoff is appropriate.
  • 504 Gateway Timeout → DEADLINE_EXCEEDED: the request took longer than the server-side deadline.

Retrying safely

Only retry on 503 UNAVAILABLE and 504 DEADLINE_EXCEEDED. Retrying 400, 404, 409, or 412 without changing the request will always produce the same result. For 500 INTERNAL, check cluster health before retrying.

When retrying, use exponential backoff with jitter. A starting interval of 100 ms, multiplying by 2 each attempt, capped at 30 seconds is a reasonable default for cluster recovery scenarios (for example, after a node failure during blob replication).
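The policy above can be sketched as follows. The full-jitter variant shown (each delay drawn uniformly between zero and the capped exponential bound) is one common choice, and both helper names are ours:

```python
import random

RETRIABLE = {"UNAVAILABLE", "DEADLINE_EXCEEDED"}  # only 503 and 504 map to these

def is_retriable(code: str) -> bool:
    """True only for the error codes this guide says are safe to retry blindly."""
    return code in RETRIABLE

def backoff_delays(base: float = 0.1, factor: float = 2.0,
                   cap: float = 30.0, attempts: int = 10):
    """Yield one delay (seconds) per retry attempt: full-jitter exponential backoff.

    Defaults follow the text: start at 100 ms, double each attempt, cap at 30 s.
    """
    for n in range(attempts):
        yield random.uniform(0.0, min(cap, base * factor ** n))
```

A caller would sleep for each yielded delay between attempts, stopping as soon as a success or a non-retriable code comes back.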

Blob ID validation errors

Blob IDs are 16-byte values transmitted as binary in the internal protocol and as hex-encoded strings in HTTP responses. If you supply an ID that is not exactly 16 bytes after decoding, you receive:

HTTP 400
{
  "code": "INVALID_ARGUMENT",
  "message": "invalid blob ID",
  "detail": "blob ID must be exactly 16 bytes"
}
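The same rule can be applied client-side before a request is sent; the validate_blob_id helper below is a hypothetical sketch mirroring the server's check, not part of any shipped client:

```python
def validate_blob_id(hex_id: str) -> bytes:
    """Decode a hex blob ID and enforce the 16-byte rule before issuing a request.

    bytes.fromhex also raises ValueError for odd-length or non-hex input,
    so malformed strings are rejected before they reach the server.
    """
    raw = bytes.fromhex(hex_id)
    if len(raw) != 16:
        raise ValueError("blob ID must be exactly 16 bytes")
    return raw
```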

Replication policy errors

When you request EC 4+2 (EC42) or EC 8+2 (EC82) but the cluster has fewer active nodes than the total shard count required (6 and 10 respectively), the server rejects the write immediately:

HTTP 412
{
  "code": "FAILED_PRECONDITION",
  "message": "not enough nodes for EC",
  "detail": "not enough nodes for EC (have 4, need 6)"
}

Add more nodes or choose a less demanding policy such as REPLICA3 before retrying.
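The node requirements can be captured in a small lookup so a client picks a feasible policy up front. The table and helper below are illustrative, using the node counts stated above:

```python
# Total nodes each policy needs (data + parity shards for EC, replica count otherwise).
REQUIRED_NODES = {
    "REPLICATION_POLICY_EC82": 10,     # 8 data + 2 parity shards
    "REPLICATION_POLICY_EC42": 6,      # 4 data + 2 parity shards
    "REPLICATION_POLICY_REPLICA3": 3,  # 3 full replicas
}

def feasible_policies(active_nodes: int) -> list[str]:
    """Return the policies the cluster can currently satisfy, most demanding first."""
    return [p for p, need in REQUIRED_NODES.items() if need <= active_nodes]
```

With 4 active nodes, for example, only REPLICATION_POLICY_REPLICA3 remains feasible, which matches the 412 response shown above.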


Examples

Example 1 — Successful PUT (for contrast)

POST /v1/blobs
Authorization: Bearer <token>
Content-Type: application/json

{
  "data": "<base64-encoded payload>",
  "policy": "REPLICATION_POLICY_REPLICA3"
}

Expected response (200 OK):

{
  "id": { "id": "3f2ac1..." },
  "meta": {
    "size": 4096,
    "created_at": 1751234567000000000,
    "policy": "REPLICATION_POLICY_REPLICA3"
  }
}

Example 2 — GET a blob that does not exist

GET /v1/blobs/000000000000000000000000deadbeef
Authorization: Bearer <token>

Expected response (404 Not Found):

{
  "code": "NOT_FOUND",
  "message": "blob not found on any node",
  "detail": "blob 000000000000000000000000deadbeef not found on any node",
  "request_id": "01HZ7K3P9QVTW4XBJD8MG6NR2F"
}

Example 3 — GET with a malformed blob ID

If your application constructs a request (such as a StatRequest) with a blob ID shorter than 16 bytes:

GET /v1/blobs/short
Authorization: Bearer <token>

Expected response (400 Bad Request):

{
  "code": "INVALID_ARGUMENT",
  "message": "invalid blob ID",
  "detail": "blob ID must be exactly 16 bytes",
  "request_id": "01HZ7K4ABCDE1234XYZ"
}

Example 4 — EC policy rejected because the cluster is too small

POST /v1/blobs
Authorization: Bearer <token>
Content-Type: application/json

{
  "data": "<base64-encoded payload>",
  "policy": "REPLICATION_POLICY_EC82"
}

Expected response (412 Precondition Failed) when only 7 nodes are active:

{
  "code": "FAILED_PRECONDITION",
  "message": "not enough nodes for EC",
  "detail": "not enough nodes for EC (have 7, need 10)",
  "request_id": "01HZ7M9QRSTUVWXY0000"
}

Resolve by adding the required nodes or switching the policy to REPLICATION_POLICY_EC42 (requires 6 nodes) or REPLICATION_POLICY_REPLICA3 (requires 3 nodes).


Example 5 — Cluster unavailable during node recovery

GET /v1/blobs/3f2ac1bde44f8a0011223344aabbccdd
Authorization: Bearer <token>

Expected response (503 Service Unavailable) while a node is being recovered:

{
  "code": "UNAVAILABLE",
  "message": "cluster degraded",
  "detail": "insufficient replicas available for blob 3f2ac1...; retry after recovery",
  "request_id": "01HZ7P2MNOPQRSTU1234"
}

This response is safe to retry. Implement exponential backoff and monitor /v1/cluster/state for node health before attempting again.


Example 6 — Too many shard failures during EC write

If more than parity_shards remote StoreShard calls fail during a single PUT:

{
  "code": "INTERNAL",
  "message": "too many shard store failures",
  "detail": "too many shard store failures: 3 (max allowed: 2)",
  "request_id": "01HZ7Q5ZZZZ9999AAAA"
}

Check node health with GET /v1/cluster/state, confirm network connectivity between nodes, and retry after the cluster stabilises.


Troubleshooting

Use the request_id from every error response to correlate with server-side logs (grep <request_id> /var/log/nabu-store/server.log).


Issue: Receiving 404 NOT_FOUND for a blob you just wrote

Symptom: A PUT returned 200 OK with a blob ID, but an immediate GET using that ID returns 404.

Likely cause: Replication is asynchronous. The primary node accepted the write and returned success, but the replica nodes have not yet received the blob. The GET was routed to a replica before replication completed.

Fix: Implement a short retry loop (3–5 attempts, 200 ms apart) on 404 responses that occur within a few seconds of a successful PUT. If the 404 persists beyond 10 seconds, check replication lag with GET /v1/cluster/state and inspect node connectivity.
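The retry loop described in this fix can be sketched like so; the fetch callable stands in for whatever HTTP client you use, and its (status, body) return shape is an assumption for the sketch:

```python
import time

def get_with_read_after_write_retry(fetch, blob_id: str,
                                    attempts: int = 5, delay: float = 0.2):
    """Retry brief 404s that can follow a successful PUT while replication catches up.

    fetch(blob_id) is expected to return an (http_status, body) tuple.
    """
    status, body = fetch(blob_id)
    for _ in range(attempts - 1):
        if status != 404:
            break  # success, or a non-retriable error: stop immediately
        time.sleep(delay)
        status, body = fetch(blob_id)
    return status, body
```

Note this loop is specific to the read-after-write window; a 404 outside that window should still be treated as permanent.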


Issue: Repeated 412 FAILED_PRECONDITION on EC writes after adding nodes

Symptom: You added nodes to reach the required count, but EC writes still return 412.

Likely cause: Newly joined nodes are in NODE_STATE_JOINING state and are not yet eligible ring members. The consistent-hash ring has not been updated on all nodes.

Fix: Poll GET /v1/cluster/state and wait until all target nodes report NODE_STATE_ACTIVE. Ring synchronisation (SyncRing) happens automatically but may take 30–60 seconds after a node join.
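A polling check for this fix might look like the sketch below; the response shape (a top-level nodes list with id and state fields) is an assumption about what GET /v1/cluster/state returns, not a documented schema:

```python
def all_nodes_active(cluster_state: dict, target_ids) -> bool:
    """True once every target node reports NODE_STATE_ACTIVE in the cluster-state response."""
    states = {n["id"]: n["state"] for n in cluster_state.get("nodes", [])}
    return all(states.get(t) == "NODE_STATE_ACTIVE" for t in target_ids)
```

A caller would poll the endpoint every few seconds until this returns True, then retry the EC write.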


Issue: 500 INTERNAL with detail containing "EC encode failed"

Symptom: PUT requests using EC42 or EC82 fail with 500 and the detail mentions EC encoding.

Likely cause: The blob data could not be divided into the required shard structure. This can occur if the data length is zero or if the EC codec plugin is misconfigured.

Fix: Verify that your request body is non-empty and correctly base64-encoded. If the problem persists, check that the EC plugin is registered correctly (see Register a custom backend or EC plugin) and review server startup logs for codec initialisation errors.


Issue: 400 INVALID_ARGUMENT with "invalid blob ID" on IDs copied from a previous response

Symptom: You copy a blob ID from a PutResponse and use it in a GetRequest, but receive 400.

Likely cause: The blob ID field is a 16-byte binary value. If your client serialises it as a UTF-8 string or truncates hex digits, the resulting byte slice will not be exactly 16 bytes.

Fix: Treat blob IDs as opaque byte arrays. When working with the HTTP API, hex-encode the full 16 bytes (32 hex characters). Do not truncate or re-encode the value between requests.


Issue: 401 UNAUTHENTICATED after a token that was working previously

Symptom: Requests that succeeded earlier now return 401.

Likely cause: Your bearer token has expired, or the cluster's signing key was rotated.

Fix: Re-authenticate to obtain a fresh token (see Authenticate and issue API requests programmatically). If you are using short-lived tokens, implement proactive refresh before expiry rather than reacting to 401 responses.


Issue: 503 UNAVAILABLE persists for more than 5 minutes after a node failure

Symptom: After a node goes offline, reads and writes to affected blobs return 503 and do not recover automatically.

Likely cause: The cluster may not have enough surviving replicas or EC shards to reconstruct the data, or the recovery process has stalled.

Fix: Check GET /v1/cluster/state to identify which nodes are NODE_STATE_OFFLINE. Follow the Diagnose and recover from a node failure guide to either restore the failed node or trigger shard reconstruction from surviving replicas. Monitor GET /v1/cluster/metrics for rebalancing progress.