Health API
Cluster health, replication health, component status
The Health API provides real-time visibility into the operational state of your Site Recovery deployment. It exposes endpoints and Kubernetes-native status fields that report cluster connectivity, DRBD replication health, and the status of individual Site Recovery components such as the Protection Group controller and Failover controller. Platform engineers and SREs use the Health API to validate DR readiness before planned failovers, diagnose replication degradation, and confirm that all components are reconciling correctly after infrastructure changes.
Before querying the Health API, ensure the following are in place:
- A functioning Site Recovery deployment with at least the primary and DR clusters configured (DRBD Operator model: two clusters minimum; LINSTOR model: three clusters required — primary, DR, and quorum)
- `kubectl` access to all relevant clusters with valid kubeconfigs (`~/.kube/config-cluster1`, `~/.kube/config-cluster2`, and `~/.kube/config-quorum` if applicable)
- The Site Recovery CRDs installed and the Protection Group controller running on the primary cluster
- The Failover controller running on the quorum cluster (LINSTOR model) or the designated management cluster (DRBD Operator model)
- `jq` installed locally for parsing JSON output from `kubectl` commands
- Sufficient RBAC permissions to read `protectiongroups`, `failovers`, and related custom resources in the namespaces where your workloads are deployed
Step 1: Confirm the Protection Group controller is running on the primary cluster
kubectl --kubeconfig ~/.kube/config-cluster1 \
get deployment protection-group-controller -n <namespace>
In the expected output, the READY column shows every desired replica as ready (for example, 1/1):
NAME READY UP-TO-DATE AVAILABLE AGE
protection-group-controller 1/1 1 1 3d
Step 2: Confirm the Failover controller is running
For LINSTOR deployments, the Failover controller runs on the quorum cluster:
kubectl --kubeconfig ~/.kube/config-quorum \
get deployment failover-controller -n <namespace>
For DRBD Operator deployments without a quorum cluster, query the designated management cluster:
kubectl --kubeconfig ~/.kube/config-cluster1 \
get deployment failover-controller -n <namespace>
Step 3: Confirm Protection Group CRDs are registered
kubectl --kubeconfig ~/.kube/config-cluster1 \
get crd protectiongroups.siterecovery.trilio.io
NAME CREATED AT
protectiongroups.siterecovery.trilio.io 2025-10-01T00:00:00Z
Step 4: Restart controllers if you have recently applied CRD updates
kubectl --kubeconfig ~/.kube/config-cluster1 \
rollout restart deployment protection-group-controller -n <namespace>
kubectl --kubeconfig ~/.kube/config-quorum \
rollout restart deployment failover-controller -n <namespace>
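Steps 1 and 2 rely on reading the READY column by eye. The following is a minimal sketch of the same check as a shell function, which may be convenient in scripts. The function name `deployment_ready` is hypothetical and not part of Site Recovery; it assumes only the standard Deployment fields `status.readyReplicas` and `spec.replicas`.

```shell
# Succeeds only if the deployment reports all desired replicas as ready.
# Usage: deployment_ready <kubeconfig> <namespace> <deployment>
# (hypothetical helper, not shipped with Site Recovery)
deployment_ready() {
  local kubeconfig="$1" namespace="$2" name="$3" counts ready desired
  # Prints "<readyReplicas>/<replicas>", e.g. "1/1".
  # readyReplicas is empty when no pod is ready, giving e.g. "/1".
  counts=$(kubectl --kubeconfig "$kubeconfig" get deployment "$name" -n "$namespace" \
    -o jsonpath='{.status.readyReplicas}/{.spec.replicas}') || return 1
  ready=${counts%%/*}
  desired=${counts##*/}
  [ -n "$desired" ] && [ "$desired" -gt 0 ] && [ "${ready:-0}" -eq "$desired" ]
}
```

For example, `deployment_ready ~/.kube/config-cluster1 <namespace> protection-group-controller && echo ready` prints `ready` only when the controller is fully available. After a restart in Step 4, `kubectl rollout status deployment/<name> -n <namespace>` can be used to wait for the rollout to finish before re-running the check.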
Health status is reported through fields on the ProtectionGroup and Failover custom resources. Understanding these fields lets you interpret health output correctly and configure alerting or automation against them.
ProtectionGroup status fields
| Field | Type | Possible values | Meaning |
|---|---|---|---|
| status.state | string | Active, Failed | Whether the Protection Group passed validation and is operating normally |
| status.replicationHealth | string | Healthy, Degraded, Unknown | Aggregate DRBD replication health across all protected VMs |
| status.currentState | string | running, stopped, mixed, unknown | Actual VM lifecycle state across all VMs in the group |
| status.protectedVMs[*].replicationStatus | string | Protected, Degraded, Unknown | Per-VM replication status |
| status.protectedVMs[*].volumes[*].replicationState | string | Protected, Degraded | Per-volume (PVC) replication state |
| status.warnings | list of strings | Free text | Non-blocking advisory messages (e.g., missing replicasOnDifferent on a storage class) |
status.currentState is set by the Protection Group controller when it reconciles spec.desiredState. The controller updates this field after it has successfully started or stopped all VMs in the group. A value of mixed means reconciliation is still in progress.
spec.desiredState accepts running or stopped. Changing this field causes the Protection Group controller to start or stop all VMs in the group atomically. The Failover controller drives this field during failover operations — you should not need to set it manually during normal operations.
Failover CR status fields
| Field | Type | Possible values | Meaning |
|---|---|---|---|
| status.state | string | Pending, StoppingOnSource, WaitingForDRBD, StartingOnTarget, Completed, Failed | Current phase of a failover operation |
| status.phase | string | Same as above | Mirrors state for compatibility |
The Failover controller reconciles these fields every 10 seconds. A Completed state confirms that all VMs are running on the target cluster and DRBD replication has been handed over successfully.
You will primarily interact with the Health API by querying Protection Group status with kubectl. The Protection Group controller reconciles health every 60 seconds under normal conditions; allow up to one full reconciliation cycle after making infrastructure changes before treating a stale status as a problem.
Check overall Protection Group health
This is your primary health check before executing a planned failover:
kubectl --kubeconfig ~/.kube/config-cluster1 \
get protectiongroup <pg-name> -n <namespace>
A healthy Protection Group shows Active in the STATE column and Healthy in the HEALTH column:
NAME STATE VMS REPLICATION HEALTH AGE
production-protection-group Active 2 synchronous Healthy 22h
If STATE is Failed or HEALTH is anything other than Healthy, inspect the full status before proceeding with any failover.
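The check above can be turned into a scriptable gate that refuses to proceed unless both fields have their healthy values. This is a sketch under the assumption that `status.state` and `status.replicationHealth` are populated as described in the status field table; the function name `pg_ready_for_failover` and its exit codes are illustrative only.

```shell
# Exit 0 only when the Protection Group is Active and Healthy.
# Usage: pg_ready_for_failover <kubeconfig> <namespace> <pg-name>
# (hypothetical helper; exit codes: 0 ready, 1 not ready, 2 query failed)
pg_ready_for_failover() {
  local kubeconfig="$1" namespace="$2" name="$3" status
  # Prints "<state>/<replicationHealth>", e.g. "Active/Healthy".
  status=$(kubectl --kubeconfig "$kubeconfig" get protectiongroup "$name" -n "$namespace" \
    -o jsonpath='{.status.state}/{.status.replicationHealth}') || return 2
  if [ "$status" = "Active/Healthy" ]; then
    echo "OK: $name is Active and Healthy"
    return 0
  fi
  echo "NOT READY: $name reports state/health '$status'" >&2
  return 1
}
```

A planned failover script could then start with `pg_ready_for_failover ~/.kube/config-cluster1 <namespace> <pg-name> || exit 1`.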
Inspect per-VM and per-volume replication health
kubectl --kubeconfig ~/.kube/config-cluster1 \
get protectiongroup <pg-name> -n <namespace> -o yaml \
| grep -A 50 "protectedVMs:"
This reveals the replicationStatus and replicationState for each VM and each volume individually, letting you pinpoint which VM or disk is degraded without examining every resource separately.
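When a group protects many VMs, scanning YAML becomes tedious. As a sketch, the per-VM status can be reduced to only the VMs that are not `Protected`, using a JSONPath `range` expression and `awk`. The function name `list_unprotected_vms` is hypothetical; the `name` and `replicationStatus` fields are the ones shown under `status.protectedVMs` in this page's example output.

```shell
# Prints "<vm-name>: <replicationStatus>" for every VM that is not Protected.
# Prints nothing when all VMs are Protected.
# Usage: list_unprotected_vms <kubeconfig> <namespace> <pg-name>
# (hypothetical helper, not shipped with Site Recovery)
list_unprotected_vms() {
  local kubeconfig="$1" namespace="$2" name="$3"
  # One line per VM: "<name><TAB><replicationStatus>"
  kubectl --kubeconfig "$kubeconfig" get protectiongroup "$name" -n "$namespace" \
    -o jsonpath='{range .status.protectedVMs[*]}{.name}{"\t"}{.replicationStatus}{"\n"}{end}' \
    | awk -F '\t' 'NF && $2 != "Protected" { print $1 ": " $2 }'
}
```

Empty output means every VM in the group reports `Protected`.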
Monitor a failover in progress
When a failover is running, track its phase through the Failover CR:
kubectl --kubeconfig ~/.kube/config-quorum \
get failover <failover-name> -n <namespace> -o jsonpath='{.status.state}'
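Poll this until it returns Completed. Alternatively, block until that state is reached with kubectl wait. Note that kubectl wait only watches for the value you name, so if the failover ends in Failed, the command keeps waiting until the timeout expires: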
kubectl --kubeconfig ~/.kube/config-quorum \
wait --for=jsonpath='{.status.state}'=Completed \
failover/<failover-name> -n <namespace> --timeout=600s
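Because a failover can also end in `Failed`, a polling loop that treats both `Completed` and `Failed` as terminal states may be preferable in automation. The following is a sketch only: the function name `wait_for_failover`, the default timeout of 600 seconds, and the 10-second poll interval (chosen to match the Failover controller's stated reconcile period) are assumptions.

```shell
# Polls status.state until Completed (exit 0), Failed (exit 1), or timeout (exit 2).
# Usage: wait_for_failover <kubeconfig> <namespace> <failover-name> [timeout-s] [poll-s]
# (hypothetical helper; poll-s must be at least 1 so the loop can time out)
wait_for_failover() {
  local kubeconfig="$1" namespace="$2" name="$3" timeout="${4:-600}" interval="${5:-10}"
  local elapsed=0 state
  while [ "$elapsed" -le "$timeout" ]; do
    # A failed query leaves state empty and is retried on the next cycle.
    state=$(kubectl --kubeconfig "$kubeconfig" get failover "$name" -n "$namespace" \
      -o jsonpath='{.status.state}' 2>/dev/null)
    case "$state" in
      Completed) echo "failover $name completed"; return 0 ;;
      Failed)    echo "failover $name failed" >&2; return 1 ;;
    esac
    echo "failover $name: ${state:-unknown} (${elapsed}s elapsed)"
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out after ${timeout}s waiting for failover $name" >&2
  return 2
}
```

For LINSTOR deployments this would be called as `wait_for_failover ~/.kube/config-quorum <namespace> <failover-name>`.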
Check for advisory warnings
Warnings do not block operations but indicate suboptimal configuration (for example, storage classes that lack explicit cross-site placement rules):
kubectl --kubeconfig ~/.kube/config-cluster1 \
get protectiongroup <pg-name> -n <namespace> \
-o jsonpath='{.status.warnings}' | jq .
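Where `jq` is not available, the warnings list can also be printed one entry per line using a JSONPath `range` expression alone. This is a sketch; the function name `pg_warnings` and the `WARNING:` prefix are illustrative choices, not output produced by Site Recovery.

```shell
# Prints each advisory warning on its own line, or a note when there are none.
# Usage: pg_warnings <kubeconfig> <namespace> <pg-name>
# (hypothetical helper; always exits 0 unless the query itself fails,
# because warnings are non-blocking)
pg_warnings() {
  local kubeconfig="$1" namespace="$2" name="$3" warnings
  warnings=$(kubectl --kubeconfig "$kubeconfig" get protectiongroup "$name" -n "$namespace" \
    -o jsonpath='{range .status.warnings[*]}{@}{"\n"}{end}') || return 2
  if [ -z "$warnings" ]; then
    echo "no warnings for $name"
    return 0
  fi
  printf '%s\n' "$warnings" | sed 's/^/WARNING: /'
}
```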
Example 1: Healthy Protection Group with two protected VMs
kubectl --kubeconfig ~/.kube/config-cluster1 \
get protectiongroup production-protection-group -n default
Expected output:
NAME STATE VMS REPLICATION HEALTH AGE
production-protection-group Active 2 synchronous Healthy 22h
Example 2: Detailed per-VM replication status
kubectl --kubeconfig ~/.kube/config-cluster1 \
get protectiongroup production-protection-group -n default \
-o yaml | grep -A 50 "protectedVMs:"
Expected output:
status:
protectedVMs:
- name: prod-vm-1
namespace: default
replicationStatus: Protected
volumes:
- pvcName: prod-vm-1-disk
replicationState: Protected
resourceName: pvc-8f04b7f3-ab58-46a9-9721-508337d30d61
usingLinstorCSI: true
- name: prod-vm-2
namespace: default
replicationStatus: Protected
volumes:
- pvcName: prod-vm-2-disk
replicationState: Protected
resourceName: pvc-0720b27f-d1cc-4283-97fc-d1c82177ed5e
usingLinstorCSI: true
replicationHealth: Healthy
state: Active
Both VMs show replicationStatus: Protected and both volumes show replicationState: Protected. This is the expected state before executing a planned failover.
Example 3: Failed Protection Group — geo-replication validation error
kubectl --kubeconfig ~/.kube/config-cluster1 \
describe protectiongroup invalid-pg -n default
Expected output (relevant section):
Status:
State: Failed
Conditions:
Type: ValidationFailed
Status: True
Reason: InvalidConfiguration
Message: PVC invalid-vm-disk for VM invalid-vm: Storage class linstor-local
has placementCount=1, minimum 2 required for geo-replication
This tells you exactly which PVC failed validation and why. Resolve the issue by updating the VM to use a storage class with placementCount >= 2 and a LINSTOR CSI provisioner, then re-apply or patch the Protection Group.
Example 4: Querying Protection Group currentState during failover
kubectl --kubeconfig ~/.kube/config-cluster1 \
get protectiongroup production-protection-group -n default \
-o jsonpath='{.status.currentState}'
During failover (source cluster — VMs stopping):
mixed
After VMs have stopped on source:
stopped
After VMs have started on target cluster:
kubectl --kubeconfig ~/.kube/config-cluster2 \
get protectiongroup production-protection-group -n default \
-o jsonpath='{.status.currentState}'
running
Example 5: Check for storage class warnings
kubectl --kubeconfig ~/.kube/config-cluster1 \
get protectiongroup production-protection-group -n default \
-o jsonpath='{.status.warnings}' | jq .
Output when a storage class lacks explicit cross-site placement rules:
[
"PVC prod-vm-1-disk: Storage class linstor-basic has no replicasOnDifferent setting - replicas may be on same site"
]
This warning does not block failover but indicates that DRBD replicas may not be distributed across failure domains. For production DR, update the storage class to include replicasOnDifferent: topology.kubernetes.io/zone.
Use the following patterns to diagnose common Health API problems. Each entry lists the symptom you will observe, its most likely cause, and the steps to resolve it.
Symptom: status.state shows Failed immediately after creating or patching a Protection Group
Likely cause: One or more VMs in the group use PVCs that fail geo-replication validation — either the storage class does not use the LINSTOR CSI provisioner (linstor.csi.linbit.com), or placementCount is less than 2.
Fix:
- Run `kubectl describe protectiongroup <name> -n <namespace>` and read the `Message` field under `Conditions` to identify which PVC failed and why.
- If the provisioner is wrong, the VM must be re-created with a PVC backed by a LINSTOR CSI storage class.
- If `placementCount` is too low, update or replace the storage class to set `placementCount >= 2` and re-provision the PVC.
- Once the PVC is corrected, patch the Protection Group to trigger re-validation.
Symptom: status.replicationHealth shows Degraded for a Protection Group that was previously Healthy
Likely cause: DRBD replication between the primary and DR cluster nodes has encountered an issue — a node may be offline, a network path between cluster nodes may be interrupted, or a DRBD resource may have lost quorum.
Fix:
- Identify which VM is degraded: `kubectl get protectiongroup <name> -n <namespace> -o yaml | grep -A 20 "protectedVMs:"`
- Check the specific volume's `replicationState` to confirm which PVC is affected.
- Verify node connectivity between primary and DR cluster worker nodes (DRBD replication runs directly between these nodes; the quorum cluster does not relay replication traffic).
- Inspect DRBD resource status on the affected nodes using your deployment's LINSTOR or DRBD Operator tooling.
- After the underlying issue is resolved, wait up to 60 seconds for the Protection Group controller to reconcile and update `replicationHealth`.
Symptom: status.currentState is stuck at mixed for more than 60 seconds
Likely cause: The Protection Group controller is still reconciling VM state — one or more VMs may be slow to start or stop, or the controller itself may have encountered an error during reconcile_vm_state().
Fix:
- Check controller logs: `kubectl --kubeconfig ~/.kube/config-cluster1 logs deployment/protection-group-controller -n <namespace>`
- Look for errors related to the specific VM names in the Protection Group.
- Check VM status individually: `kubectl --kubeconfig ~/.kube/config-cluster1 get vm -n <namespace>`
- If a VM is stuck in a transitional state (e.g., `Terminating` or `Pending`), investigate and resolve that VM's issue first; the controller will re-reconcile on the next cycle.
- If the controller pod itself is unhealthy, restart it: `kubectl --kubeconfig ~/.kube/config-cluster1 rollout restart deployment/protection-group-controller -n <namespace>`
Symptom: A Failover CR status.state is stuck at StoppingOnSource or StartingOnTarget
Likely cause: The Failover controller is waiting for the Protection Group currentState on the source or target cluster to transition, but the Protection Group controller has not completed reconciliation. This can also occur if the Failover controller cannot reach the remote cluster's API server.
Fix:
- Check the source cluster's Protection Group `currentState`: `kubectl --kubeconfig ~/.kube/config-cluster1 get protectiongroup <name> -n <namespace> -o jsonpath='{.status.currentState}'`
- If `currentState` is `mixed`, VMs are still transitioning; wait and re-check.
- If `currentState` has not changed after two minutes, check Protection Group controller logs on the source cluster for reconciliation errors.
- Check Failover controller logs on the quorum cluster for connectivity errors to the source or target cluster API server.
- Verify that the kubeconfig secrets or credentials used by the Failover controller to reach both clusters are current and valid.
Symptom: Health status fields are stale — the Protection Group shows Healthy but DRBD is known to be degraded
Likely cause: The Protection Group controller reconciles on a 60-second cycle. If the degradation occurred within the last reconciliation window, the status has not yet been updated.
Fix:
- Wait up to 60 seconds and re-query.
- If the status does not update after two cycles, check whether the Protection Group controller pod is running and not in a crash loop: `kubectl --kubeconfig ~/.kube/config-cluster1 get pod -l app=protection-group-controller -n <namespace>`
- If the controller is running but not reconciling, inspect its logs for errors that might cause it to skip reconciliation for a specific Protection Group.
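A crash-looping pod usually still reports phase `Running`, so the restart count is the more reliable signal. The sketch below flags controller pods that are not `Running` or that have restarted more than a threshold number of times. The function name `controller_restarts` and the default threshold of 3 are assumptions; the label selector is the one used in the steps above, and the check reads only the first container's `restartCount`.

```shell
# Flags controller pods that are not Running or have restarted too often.
# Usage: controller_restarts <kubeconfig> <namespace> [max-restarts]
# (hypothetical helper; exits 0 when all pods look healthy, 1 otherwise)
controller_restarts() {
  local kubeconfig="$1" namespace="$2" max="${3:-3}"
  # One line per pod: "<name><TAB><phase><TAB><restartCount of first container>"
  kubectl --kubeconfig "$kubeconfig" get pod -l app=protection-group-controller -n "$namespace" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
    | awk -F '\t' -v max="$max" '
        BEGIN { bad = 0; seen = 0 }
        NF { seen = 1 }
        NF && ($2 != "Running" || ($3 + 0) > max) {
          print $1 ": phase=" $2 ", restarts=" ($3 + 0)
          bad = 1
        }
        END {
          if (!seen) { print "no controller pods found"; bad = 1 }
          exit bad
        }'
}
```

No output and a zero exit status mean every matching pod is `Running` with at most the allowed number of restarts.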
Symptom: kubectl get protectiongroup returns No resources found
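Likely cause: No Protection Groups have been created yet, or you are querying the wrong cluster or namespace. If the CRDs are not installed at all, kubectl typically reports instead that the server doesn't have the resource type, so check for that error as well.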
Fix:
- Confirm you are querying the primary cluster and the correct namespace: `kubectl --kubeconfig ~/.kube/config-cluster1 get protectiongroup -A`
- Confirm the CRD is registered: `kubectl --kubeconfig ~/.kube/config-cluster1 get crd protectiongroups.siterecovery.trilio.io`
- If the CRD is missing, re-apply the Site Recovery CRD manifests via Ansible (Ansible is responsible for deploying all infrastructure components including CRDs to the primary cluster).
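The first two checks can be combined into one diagnostic that separates "CRD missing" from "no Protection Groups anywhere". This is a sketch; the function name `diagnose_missing_pg` and its messages are hypothetical, and the CRD name is the one used throughout this page.

```shell
# Distinguishes a missing CRD from an empty cluster and lists any groups found.
# Usage: diagnose_missing_pg <kubeconfig>
# (hypothetical helper; exits 0 when groups exist, 1 when none, 2 on query error)
diagnose_missing_pg() {
  local kubeconfig="$1" found
  if ! kubectl --kubeconfig "$kubeconfig" get crd \
      protectiongroups.siterecovery.trilio.io >/dev/null 2>&1; then
    echo "CRD protectiongroups.siterecovery.trilio.io is not registered on this cluster"
    return 1
  fi
  # One line per Protection Group: "<namespace>/<name>"
  found=$(kubectl --kubeconfig "$kubeconfig" get protectiongroup -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}') || return 2
  if [ -z "$found" ]; then
    echo "CRD is registered, but no Protection Groups exist in any namespace"
    return 1
  fi
  echo "Protection Groups found:"
  printf '%s\n' "$found"
}
```

Running it against each kubeconfig in turn also helps confirm which cluster actually holds the Protection Group.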