Plant Health Prediction Using a CNN (Convolutional Neural Network)
Guide

Observability

Logging, metrics, tracing, health checks


Overview

This guide explains how to monitor and observe the plant-disease-predictor API in production. You will learn how to access structured logs, query health check endpoints, collect runtime metrics, and trace requests through the prediction pipeline. Effective observability helps you detect failures early, diagnose slow prediction responses, and ensure your integration remains healthy as traffic scales.


Prerequisites

Before following this guide, ensure you have:

  • A running instance of the plant-disease-predictor API (see the Install and Launch guide)
  • API authentication credentials (API key or token) — required for all endpoints
  • curl 7.68+ or any HTTP client capable of sending multipart form data
  • (Optional) A metrics aggregation tool such as Prometheus 2.40+ or Datadog if you intend to scrape the metrics endpoint
  • (Optional) A log aggregation platform such as Elasticsearch, Loki, or Splunk if you intend to ship structured logs
  • Network access to the host and port on which the API is running

Installation

The observability features are built into the plant-disease-predictor server and require no separate package installation. However, if you want to scrape metrics with Prometheus or ship logs to an external system, follow the steps below.

1. Verify the API is running and reachable

curl -i http://<API_HOST>:<API_PORT>/health

A 200 OK response confirms the server is up. If you receive a connection error, confirm the host, port, and firewall rules before continuing.
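
If you script this check, for example in a deployment pipeline, a short polling loop avoids racing the server's startup. A minimal sketch, assuming the host and port shown above and a bash-compatible shell:

#!/usr/bin/env bash
# Poll /health until it returns 200, or give up after 30 attempts.
HOST="${API_HOST:-localhost}"
PORT="${API_PORT:-8080}"
for attempt in $(seq 1 30); do
  status=$(curl -s -o /dev/null -w '%{http_code}' "http://${HOST}:${PORT}/health")
  if [ "$status" = "200" ]; then
    echo "API is up"
    exit 0
  fi
  echo "Attempt ${attempt}: got HTTP ${status}, retrying in 2s..."
  sleep 2
done
echo "API did not become healthy in time" >&2
exit 1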

2. (Optional) Install the Prometheus scraper

If you are using Prometheus, add the following job to your prometheus.yml scrape configuration:

scrape_configs:
  - job_name: 'plant-disease-predictor'
    static_configs:
      - targets: ['<API_HOST>:<API_PORT>']
    metrics_path: /metrics
    scheme: http
    bearer_token: '<YOUR_API_KEY>'

Reload Prometheus after saving (the /-/reload endpoint is only enabled when Prometheus is started with --web.enable-lifecycle):

curl -X POST http://<PROMETHEUS_HOST>:9090/-/reload

3. (Optional) Configure log forwarding

The API writes structured JSON logs to stdout. To forward them to an external system, redirect the server output to a file (or let your container runtime capture it) and point your log shipper at that location. For example, with Filebeat reading from a log directory:

# filebeat.yml (excerpt)
filebeat.inputs:
  - type: log
    paths:
      - /var/log/plant-disease-predictor/*.log

output.elasticsearch:
  hosts: ['<ELASTICSEARCH_HOST>:9200']

Restart Filebeat after saving changes:

sudo systemctl restart filebeat
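
Before waiting for logs to show up in Elasticsearch, you can use Filebeat's built-in diagnostics to confirm the configuration parses and the output is reachable:

# Validate the configuration file
sudo filebeat test config

# Verify connectivity to the configured Elasticsearch output
sudo filebeat test output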

Configuration

Observability behavior is controlled by environment variables set before the API server starts. Each option is described below.

  • LOG_LEVEL (default: info; valid values: debug, info, warn, error)
    Controls the minimum severity of log messages written to stdout. Use debug during development to see per-request details; use warn or error in production to reduce noise.

  • LOG_FORMAT (default: json; valid values: json, text)
    Sets the log output format. json emits structured key-value pairs suitable for log aggregators. text emits human-readable lines useful during local development.

  • METRICS_ENABLED (default: true; valid values: true, false)
    When true, the /metrics endpoint is active and exposes runtime counters. Set to false to disable metric collection entirely, which slightly reduces overhead.

  • TRACING_ENABLED (default: false; valid values: true, false)
    When true, the API emits OpenTelemetry-compatible trace spans for each prediction request, including image preprocessing and CNN inference steps. Requires OTLP_ENDPOINT to be set.

  • OTLP_ENDPOINT (default: empty; valid values: any valid URL)
    The endpoint of your OpenTelemetry collector (e.g., http://otel-collector:4318). Has no effect unless TRACING_ENABLED=true.

  • HEALTH_CHECK_PATH (default: /health; valid values: any valid URL path)
    The URL path for the liveness health check endpoint. Change this if the default conflicts with your reverse proxy routing rules.

Set these variables in your environment before starting the server:

export LOG_LEVEL=info
export LOG_FORMAT=json
export METRICS_ENABLED=true
export TRACING_ENABLED=true
export OTLP_ENDPOINT=http://otel-collector:4318
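
If you run the server in Docker instead, pass the same variables with -e flags. A sketch, assuming the image is named plant-disease-predictor and the API listens on port 8080 (adjust both for your deployment):

docker run -d \
  --name plant-disease-predictor \
  -p 8080:8080 \
  -e LOG_LEVEL=info \
  -e LOG_FORMAT=json \
  -e METRICS_ENABLED=true \
  -e TRACING_ENABLED=true \
  -e OTLP_ENDPOINT=http://otel-collector:4318 \
  plant-disease-predictor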

Usage

Health Checks

Use the health check endpoint to verify that the API server and its underlying CNN model are ready to accept requests. This is especially useful in load balancers and Kubernetes liveness/readiness probes.

curl -i \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  http://<API_HOST>:<API_PORT>/health

A healthy API returns 200 OK with a JSON body. A degraded or unready API returns a non-2xx status so your load balancer can route traffic elsewhere.
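
Beyond the status code, the JSON body reports whether the model itself is loaded (see Example 1 for the full body). A minimal sketch that exits non-zero unless the model is ready, assuming the model_loaded field shown in that example:

# jq -e sets the exit code to non-zero when the expression is false or null
curl -s \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  http://<API_HOST>:<API_PORT>/health \
  | jq -e '.model_loaded == true'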


Structured Logs

Every prediction request generates log entries at multiple stages: image receipt, preprocessing, model inference, and response dispatch. Each log line is a JSON object emitted to stdout. When LOG_LEVEL=debug, you also see per-layer CNN timing data.

To tail logs in real time during development:

# If running the server directly
./plant-disease-predictor-server 2>&1 | jq .

# If running in Docker
docker logs -f plant-disease-predictor | jq .

Pipe through jq to pretty-print and filter. For example, show only error-level entries:

docker logs -f plant-disease-predictor | jq 'select(.level == "error")'
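
Because each entry carries a request_id (see Example 3 for the full field set), you can also reconstruct the lifecycle of a single request across all pipeline stages. A sketch, with <REQUEST_ID> taken from an earlier log entry:

docker logs plant-disease-predictor 2>&1 \
  | jq -c 'select(.request_id == "<REQUEST_ID>")'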

Metrics

Query the /metrics endpoint to retrieve runtime counters and histograms. The response uses the Prometheus text exposition format, which most metrics platforms can ingest directly.

curl -s \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  http://<API_HOST>:<API_PORT>/metrics

Key metrics to monitor:

  • prediction_requests_total (Counter): Total number of prediction requests received
  • prediction_errors_total (Counter): Total number of failed predictions, labeled by error type
  • prediction_duration_seconds (Histogram): End-to-end latency of prediction requests
  • image_preprocessing_duration_seconds (Histogram): Time spent resizing and normalizing the uploaded image
  • model_inference_duration_seconds (Histogram): Time spent running the CNN forward pass
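
Once Prometheus is scraping these metrics, you can query them through its HTTP API; the same PromQL works unchanged in the Prometheus UI or Grafana. Two sketches, using an arbitrary 5-minute window:

# 95th-percentile end-to-end prediction latency over the last 5 minutes
curl -sG http://<PROMETHEUS_HOST>:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(prediction_duration_seconds_bucket[5m])))'

# Fraction of prediction requests that failed over the last 5 minutes
curl -sG http://<PROMETHEUS_HOST>:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(prediction_errors_total[5m])) / sum(rate(prediction_requests_total[5m]))'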

Distributed Tracing

When TRACING_ENABLED=true, each prediction request generates an OpenTelemetry trace with child spans for preprocessing and inference. Use your tracing backend (Jaeger, Zipkin, Tempo, etc.) to visualize end-to-end latency and identify bottlenecks. No changes to your API call are required — tracing is transparent to the caller.
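
If you do not already run a collector, Jaeger's all-in-one image is a quick way to experiment locally, since recent versions accept OTLP directly. A sketch, assuming Docker is available (the COLLECTOR_OTLP_ENABLED flag is redundant on newer Jaeger releases but harmless):

# Jaeger UI on 16686, OTLP/HTTP ingest on 4318
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

# Point the API at the local collector and restart the server
export TRACING_ENABLED=true
export OTLP_ENDPOINT=http://localhost:4318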


Examples

Example 1 — Check API health

Confirm that the server and model are fully ready.

curl -i \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  http://localhost:8080/health

Expected output:

HTTP/1.1 200 OK
Content-Type: application/json

{
  "status": "healthy",
  "model_loaded": true,
  "uptime_seconds": 3842
}

Example 2 — Scrape the metrics endpoint

Retrieve current runtime metrics in Prometheus text format.

curl -s \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  http://localhost:8080/metrics

Expected output (excerpt):

# HELP prediction_requests_total Total number of prediction requests received
# TYPE prediction_requests_total counter
prediction_requests_total 1024

# HELP prediction_errors_total Total number of failed prediction requests
# TYPE prediction_errors_total counter
prediction_errors_total{reason="invalid_image"} 3
prediction_errors_total{reason="model_timeout"} 1

# HELP prediction_duration_seconds End-to-end prediction latency in seconds
# TYPE prediction_duration_seconds histogram
prediction_duration_seconds_bucket{le="0.1"} 210
prediction_duration_seconds_bucket{le="0.5"} 980
prediction_duration_seconds_bucket{le="1.0"} 1020
prediction_duration_seconds_bucket{le="+Inf"} 1024
prediction_duration_seconds_sum 412.7
prediction_duration_seconds_count 1024

Example 3 — Filter error logs with jq

Isolate error-level log entries to quickly identify problems without scrolling through info-level noise.

docker logs plant-disease-predictor 2>&1 \
  | jq -c 'select(.level == "error")'

Expected output (one matching line):

{
  "level": "error",
  "timestamp": "2024-06-15T10:42:01Z",
  "request_id": "a3f9c821-4d12-4b0e-b8e3-1c7d5e2f9a40",
  "message": "Image preprocessing failed: unsupported file format",
  "client_ip": "203.0.113.45",
  "status_code": 422
}

Example 4 — Kubernetes readiness probe

Configure a Kubernetes readiness probe so the pod is only marked ready when the model is loaded.

readinessProbe:
  httpGet:
    path: /health
    port: 8080
    httpHeaders:
      - name: Authorization
        value: Bearer <YOUR_API_KEY>
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3

Expected behavior: Kubernetes marks the pod Ready only after /health returns 200 OK with "model_loaded": true.


Troubleshooting

Issue 1 — /health returns 503 Service Unavailable

Symptom: The health check endpoint responds with HTTP 503 and "model_loaded": false in the body.

Likely cause: The CNN model file has not finished loading, or the model file path is misconfigured and the server cannot locate it.

Fix:

  1. Check the server startup logs for lines containing model to see whether loading succeeded or failed (a command sketch follows this list).
  2. Confirm that the environment variable pointing to the model file (e.g., MODEL_PATH) is set correctly and the file exists at that path.
  3. If the model is still loading, wait 15–30 seconds and retry — large CNN weights can take time to deserialize on first boot.
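
For step 1, a quick way to isolate model-related lines (assuming a Docker deployment and LOG_FORMAT=json) and to confirm the configured path exists:

# Show log entries whose message mentions the model (case-insensitive)
docker logs plant-disease-predictor 2>&1 \
  | jq -c 'select(.message | test("model"; "i"))'

# Confirm the model file is present at the configured path
echo "$MODEL_PATH" && ls -lh "$MODEL_PATH"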

Issue 2 — /metrics returns 401 Unauthorized

Symptom: Prometheus or your curl command receives a 401 response when scraping /metrics.

Likely cause: The Authorization header is missing or the API key is invalid.

Fix:

  1. Confirm your API key is active and has not expired.
  2. Ensure the bearer_token field in your Prometheus scrape config matches the key exactly (no extra whitespace or newline characters).
  3. If you recently rotated your API key, update all scrape configurations and reload Prometheus.

Issue 3 — No trace data appears in your tracing backend

Symptom: TRACING_ENABLED=true is set, but no traces appear in Jaeger, Tempo, or your OTLP-compatible backend.

Likely cause: The OTLP_ENDPOINT value is incorrect, unreachable from the API server, or not set at all.

Fix:

  1. Verify OTLP_ENDPOINT is set to the correct URL, including the port (e.g., http://otel-collector:4318).
  2. From the API server host, confirm network connectivity to the collector, e.g., curl -i "$OTLP_ENDPOINT" (the variable already includes the scheme and port, so do not prefix it again). Any HTTP response, even an error, means the collector is reachable; a connection failure indicates a network problem.
  3. Check the OpenTelemetry collector's own logs for ingest errors.
  4. Restart the API server after correcting environment variables — tracing configuration is read only at startup.

Issue 4 — Log output is not valid JSON

Symptom: Running docker logs ... | jq . produces a parse error from jq.

Likely cause: LOG_FORMAT is set to text instead of json, or multi-line stack traces are breaking the JSON stream.

Fix:

  1. Set LOG_FORMAT=json and restart the server.
  2. If stack traces still break parsing, use jq -R 'try fromjson' to silently skip non-JSON lines:

docker logs plant-disease-predictor 2>&1 \
  | jq -R 'try fromjson'

Issue 5 — prediction_duration_seconds histogram shows unexpectedly high latency

Symptom: The prediction_duration_seconds_bucket values show most requests taking longer than 1 second.

Likely cause: The CNN inference step is CPU-bound and the server host lacks sufficient compute resources, or the uploaded images are very large and preprocessing is the bottleneck.

Fix:

  1. Compare model_inference_duration_seconds and image_preprocessing_duration_seconds to identify which stage is slow (sample queries follow this list).
  2. If preprocessing is the bottleneck, advise API consumers to resize images client-side before uploading (see the Prediction endpoint documentation for recommended dimensions).
  3. If inference is the bottleneck, consider deploying the server on a host with more CPU cores or GPU acceleration.
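
For step 1, the comparison is a pair of queries against the two stage histograms, shown here against the Prometheus HTTP API as in the Metrics section (the 5-minute window is arbitrary):

# 95th-percentile image preprocessing latency
curl -sG http://<PROMETHEUS_HOST>:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(image_preprocessing_duration_seconds_bucket[5m])))'

# 95th-percentile CNN inference latency
curl -sG http://<PROMETHEUS_HOST>:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(model_inference_duration_seconds_bucket[5m])))'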