Site Recovery for Kubernetes Virtual Machines
Guide

Health API

Cluster health, replication health, component status


Overview

The Health API provides real-time visibility into the operational state of your Site Recovery deployment. It exposes endpoints and Kubernetes-native status fields that report cluster connectivity, DRBD replication health, and the status of individual Site Recovery components such as the Protection Group controller and Failover controller. Platform engineers and SREs use the Health API to validate DR readiness before planned failovers, diagnose replication degradation, and confirm that all components are reconciling correctly after infrastructure changes.


Prerequisites

Before querying the Health API, ensure the following are in place:

  • A functioning Site Recovery deployment with at least the primary and DR clusters configured (DRBD Operator model: two clusters minimum; LINSTOR model: three clusters required — primary, DR, and quorum)
  • kubectl access to all relevant clusters with valid kubeconfigs (~/.kube/config-cluster1, ~/.kube/config-cluster2, and ~/.kube/config-quorum if applicable)
  • The Site Recovery CRDs installed and the Protection Group controller running on the primary cluster
  • The Failover controller running on the quorum cluster (LINSTOR model) or the designated management cluster (DRBD Operator model)
  • jq installed locally for parsing JSON output from kubectl commands
  • Sufficient RBAC permissions to read protectiongroups, failovers, and related custom resources in the namespaces where your workloads are deployed
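The tool and kubeconfig prerequisites above can be checked with a small preflight script. The following is a minimal sketch, not part of Site Recovery itself: the function names missing_tools and missing_kubeconfigs are hypothetical, and the kubeconfig paths are the ones this guide uses.

```shell
#!/bin/sh
# Hypothetical preflight helpers; names are illustrative, not product commands.

# Print each required tool that is not found on PATH (one per line).
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Print each kubeconfig path that is not a readable file (one per line).
missing_kubeconfigs() {
  for cfg in "$@"; do
    [ -r "$cfg" ] || echo "$cfg"
  done
}

# Example usage (not run automatically):
#   missing_tools kubectl jq
#   missing_kubeconfigs ~/.kube/config-cluster1 ~/.kube/config-cluster2
#   # RBAC can be spot-checked with the standard kubectl subcommand:
#   kubectl --kubeconfig ~/.kube/config-cluster1 \
#     auth can-i get protectiongroups -n <namespace>
```

Empty output from both helpers means the local prerequisites are satisfied; the RBAC check still has to be run against each cluster.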

Installation

Step 1: Confirm the Protection Group controller is running on the primary cluster

kubectl --kubeconfig ~/.kube/config-cluster1 \
  get deployment protection-group-controller -n <namespace>

Expected output shows READY replicas equal to DESIRED:

NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
protection-group-controller   1/1     1            1           3d
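If you want to script this check rather than read the table, one option is to compare the two halves of the READY column. This is a sketch under the assumption that READY has the usual kubectl ready/desired form; the helper name is_fully_ready is hypothetical.

```shell
#!/bin/sh
# Succeed only when a READY value such as "1/1" has ready == desired
# and desired is greater than zero. Malformed input counts as not ready.
is_fully_ready() {
  ready=${1%%/*}     # text before the slash
  desired=${1##*/}   # text after the slash
  [ "$desired" -gt 0 ] 2>/dev/null && [ "$ready" -eq "$desired" ] 2>/dev/null
}

# Example usage (not run automatically):
#   ready=$(kubectl --kubeconfig ~/.kube/config-cluster1 \
#     get deployment protection-group-controller -n <namespace> \
#     --no-headers | awk '{print $2}')
#   is_fully_ready "$ready" && echo "controller ready"
```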

Step 2: Confirm the Failover controller is running

For LINSTOR deployments, the Failover controller runs on the quorum cluster:

kubectl --kubeconfig ~/.kube/config-quorum \
  get deployment failover-controller -n <namespace>

For DRBD Operator deployments without a quorum cluster, query the designated management cluster:

kubectl --kubeconfig ~/.kube/config-cluster1 \
  get deployment failover-controller -n <namespace>

Step 3: Confirm Protection Group CRDs are registered

kubectl --kubeconfig ~/.kube/config-cluster1 \
  get crd protectiongroups.siterecovery.trilio.io

Expected output:

NAME                                      CREATED AT
protectiongroups.siterecovery.trilio.io   2025-10-01T00:00:00Z

Step 4: Restart controllers if you have recently applied CRD updates

kubectl --kubeconfig ~/.kube/config-cluster1 \
  rollout restart deployment protection-group-controller -n <namespace>

kubectl --kubeconfig ~/.kube/config-quorum \
  rollout restart deployment failover-controller -n <namespace>

Configuration

Health status is reported through fields on the ProtectionGroup and Failover custom resources. Understanding these fields lets you interpret health output correctly and configure alerting or automation against them.

ProtectionGroup status fields

| Field | Type | Possible values | Meaning |
| --- | --- | --- | --- |
| status.state | string | Active, Failed | Whether the Protection Group passed validation and is operating normally |
| status.replicationHealth | string | Healthy, Degraded, Unknown | Aggregate DRBD replication health across all protected VMs |
| status.currentState | string | running, stopped, mixed, unknown | Actual VM lifecycle state across all VMs in the group |
| status.protectedVMs[*].replicationStatus | string | Protected, Degraded, Unknown | Per-VM replication status |
| status.protectedVMs[*].volumes[*].replicationState | string | Protected, Degraded | Per-volume (PVC) replication state |
| status.warnings | list of strings | Free text | Non-blocking advisory messages (e.g., missing replicasOnDifferent on a storage class) |

status.currentState is set by the Protection Group controller when it reconciles spec.desiredState. The controller updates this field after it has successfully started or stopped all VMs in the group. A value of mixed means reconciliation is still in progress.

spec.desiredState accepts running or stopped. Changing this field causes the Protection Group controller to start or stop all VMs in the group atomically. The Failover controller drives this field during failover operations — you should not need to set it manually during normal operations.
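The relationship between spec.desiredState and status.currentState described above can be expressed as a small comparison. This is an illustrative sketch only; the helper name pg_convergence and its output labels are hypothetical, while the field values come from the table above.

```shell
#!/bin/sh
# Classify how far a Protection Group is from its desired VM state.
# $1 = spec.desiredState (running|stopped)
# $2 = status.currentState (running|stopped|mixed|unknown)
pg_convergence() {
  case "$2" in
    mixed)      echo "reconciling" ;;  # controller is still starting/stopping VMs
    unknown|"") echo "unknown" ;;      # no usable status reported
    "$1")       echo "converged" ;;    # current matches desired
    *)          echo "diverged" ;;     # settled, but not on the desired state
  esac
}

# Example usage (not run automatically):
#   desired=$(kubectl --kubeconfig ~/.kube/config-cluster1 get protectiongroup \
#     <pg-name> -n <namespace> -o jsonpath='{.spec.desiredState}')
#   current=$(kubectl --kubeconfig ~/.kube/config-cluster1 get protectiongroup \
#     <pg-name> -n <namespace> -o jsonpath='{.status.currentState}')
#   pg_convergence "$desired" "$current"
```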

Failover CR status fields

| Field | Type | Possible values | Meaning |
| --- | --- | --- | --- |
| status.state | string | Pending, StoppingOnSource, WaitingForDRBD, StartingOnTarget, Completed, Failed | Current phase of a failover operation |
| status.phase | string | Same as above | Mirrors state for compatibility |

The Failover controller reconciles these fields every 10 seconds. A Completed state confirms that all VMs are running on the target cluster and DRBD replication has been handed over successfully.
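For dashboards or log lines it can help to turn the phase name into a rough progress indicator. The sketch below assumes the phases occur in the order they are listed in the table above, which this guide does not state explicitly; the helper name failover_step is hypothetical.

```shell
#!/bin/sh
# Map a Failover status.state value to a coarse progress label.
# Assumption: phases progress in the order listed in the status table.
failover_step() {
  case "$1" in
    Pending)          echo "1/5" ;;
    StoppingOnSource) echo "2/5" ;;
    WaitingForDRBD)   echo "3/5" ;;
    StartingOnTarget) echo "4/5" ;;
    Completed)        echo "5/5" ;;
    Failed)           echo "failed" ;;
    *)                echo "unknown"; return 1 ;;  # value not in the table
  esac
}

# Example usage (not run automatically):
#   failover_step "$(kubectl --kubeconfig ~/.kube/config-quorum get failover \
#     <failover-name> -n <namespace> -o jsonpath='{.status.state}')"
```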


Usage

You will primarily interact with the Health API by querying Protection Group status with kubectl. The Protection Group controller reconciles health every 60 seconds under normal conditions; allow up to one full reconciliation cycle after making infrastructure changes before treating a stale status as a problem.

Check overall Protection Group health

This is your primary health check before executing a planned failover:

kubectl --kubeconfig ~/.kube/config-cluster1 \
  get protectiongroup <pg-name> -n <namespace>

A healthy Protection Group shows Active in the STATE column and Healthy in the HEALTH column (the REPLICATION column reports the replication mode, not health):

NAME                          STATE    VMS   REPLICATION   HEALTH    AGE
production-protection-group   Active   2     synchronous   Healthy   22h

If STATE is Failed or HEALTH is anything other than Healthy, inspect the full status before proceeding with any failover.
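The pre-failover rule above can be turned into a scripted gate. This is a minimal sketch: the helper name pg_ready_for_failover is hypothetical, and the two values are read from the status.state and status.replicationHealth fields documented in the Configuration section.

```shell
#!/bin/sh
# Succeed only when the Protection Group is Active and replication is Healthy.
# $1 = status.state, $2 = status.replicationHealth
pg_ready_for_failover() {
  [ "$1" = "Active" ] && [ "$2" = "Healthy" ]
}

# Example usage (not run automatically):
#   state=$(kubectl --kubeconfig ~/.kube/config-cluster1 get protectiongroup \
#     <pg-name> -n <namespace> -o jsonpath='{.status.state}')
#   health=$(kubectl --kubeconfig ~/.kube/config-cluster1 get protectiongroup \
#     <pg-name> -n <namespace> -o jsonpath='{.status.replicationHealth}')
#   pg_ready_for_failover "$state" "$health" || echo "do not fail over yet"
```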

Inspect per-VM and per-volume replication health

kubectl --kubeconfig ~/.kube/config-cluster1 \
  get protectiongroup <pg-name> -n <namespace> -o yaml \
  | grep -A 50 "protectedVMs:"

This reveals the replicationStatus and replicationState for each VM and each volume individually, letting you pinpoint which VM or disk is degraded without examining every resource separately.
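As an alternative to scanning YAML by eye, the per-VM status can be emitted as tab-separated lines with a kubectl JSONPath range expression and then filtered. The sketch below shows only the filter; the helper name non_protected_vms is hypothetical, and the field paths are the ones listed in the ProtectionGroup status table.

```shell
#!/bin/sh
# Read "vmName<TAB>replicationStatus" lines on stdin and print every VM
# whose status is anything other than Protected.
non_protected_vms() {
  awk -F '\t' 'NF >= 2 && $2 != "Protected" { print $1 ": " $2 }'
}

# Example usage (not run automatically):
#   kubectl --kubeconfig ~/.kube/config-cluster1 \
#     get protectiongroup <pg-name> -n <namespace> \
#     -o jsonpath='{range .status.protectedVMs[*]}{.name}{"\t"}{.replicationStatus}{"\n"}{end}' \
#     | non_protected_vms
```

No output means every VM in the group reports Protected.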

Monitor a failover in progress

When a failover is running, track its phase through the Failover CR:

kubectl --kubeconfig ~/.kube/config-quorum \
  get failover <failover-name> -n <namespace> -o jsonpath='{.status.state}'

Poll this until it returns Completed. Alternatively, let kubectl block until the state is reached or the timeout expires:

kubectl --kubeconfig ~/.kube/config-quorum \
  wait --for=jsonpath='{.status.state}'=Completed \
  failover/<failover-name> -n <namespace> --timeout=600s
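kubectl wait returns only when the target value appears or the timeout expires, so a failover that ends in Failed is not reported until the timeout. A hand-rolled polling loop can stop early on either terminal state. The following is a sketch under that design assumption; poll_failover and get_state are hypothetical names, and the 10-second interval mirrors the Failover controller's reconcile period.

```shell
#!/bin/sh
# Poll a state-producing command until it prints Completed or Failed.
# $1 = max attempts, $2 = seconds between attempts, remaining args = command
# Returns 0 on Completed, 1 on Failed, 2 if attempts run out.
poll_failover() {
  attempts=$1
  interval=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state=$("$@") || state=""   # treat a transient command error as "no state yet"
    case "$state" in
      Completed) echo "$state"; return 0 ;;
      Failed)    echo "$state"; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timeout"
  return 2
}

# Example usage (not run automatically): 60 attempts x 10 s = 600 s
#   get_state() {
#     kubectl --kubeconfig ~/.kube/config-quorum get failover "$1" -n "$2" \
#       -o jsonpath='{.status.state}'
#   }
#   poll_failover 60 10 get_state <failover-name> <namespace>
```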

Check for advisory warnings

Warnings do not block operations but indicate suboptimal configuration (for example, storage classes that lack explicit cross-site placement rules):

kubectl --kubeconfig ~/.kube/config-cluster1 \
  get protectiongroup <pg-name> -n <namespace> \
  -o jsonpath='{.status.warnings}' | jq .

Examples

Example 1: Healthy Protection Group with two protected VMs

kubectl --kubeconfig ~/.kube/config-cluster1 \
  get protectiongroup production-protection-group -n default

Expected output:

NAME                          STATE    VMS   REPLICATION   HEALTH    AGE
production-protection-group   Active   2     synchronous   Healthy   22h

Example 2: Detailed per-VM replication status

kubectl --kubeconfig ~/.kube/config-cluster1 \
  get protectiongroup production-protection-group -n default \
  -o yaml | grep -A 50 "protectedVMs:"

Expected output:

status:
  protectedVMs:
  - name: prod-vm-1
    namespace: default
    replicationStatus: Protected
    volumes:
    - pvcName: prod-vm-1-disk
      replicationState: Protected
      resourceName: pvc-8f04b7f3-ab58-46a9-9721-508337d30d61
      usingLinstorCSI: true
  - name: prod-vm-2
    namespace: default
    replicationStatus: Protected
    volumes:
    - pvcName: prod-vm-2-disk
      replicationState: Protected
      resourceName: pvc-0720b27f-d1cc-4283-97fc-d1c82177ed5e
      usingLinstorCSI: true
  replicationHealth: Healthy
  state: Active

Both VMs show replicationStatus: Protected and both volumes show replicationState: Protected. This is the expected state before executing a planned failover.


Example 3: Failed Protection Group — geo-replication validation error

kubectl --kubeconfig ~/.kube/config-cluster1 \
  describe protectiongroup invalid-pg -n default

Expected output (relevant section):

Status:
  State: Failed
  Conditions:
    Type:    ValidationFailed
    Status:  True
    Reason:  InvalidConfiguration
    Message: PVC invalid-vm-disk for VM invalid-vm: Storage class linstor-local
             has placementCount=1, minimum 2 required for geo-replication

This tells you exactly which PVC failed validation and why. Resolve the issue by updating the VM to use a storage class with placementCount >= 2 and a LINSTOR CSI provisioner, then re-apply or patch the Protection Group.


Example 4: Querying Protection Group currentState during failover

kubectl --kubeconfig ~/.kube/config-cluster1 \
  get protectiongroup production-protection-group -n default \
  -o jsonpath='{.status.currentState}'

During failover (source cluster — VMs stopping):

mixed

After VMs have stopped on source:

stopped

After VMs have started on target cluster:

kubectl --kubeconfig ~/.kube/config-cluster2 \
  get protectiongroup production-protection-group -n default \
  -o jsonpath='{.status.currentState}'

running

Example 5: Check for storage class warnings

kubectl --kubeconfig ~/.kube/config-cluster1 \
  get protectiongroup production-protection-group -n default \
  -o jsonpath='{.status.warnings}' | jq .

Output when a storage class lacks explicit cross-site placement rules:

[
  "PVC prod-vm-1-disk: Storage class linstor-basic has no replicasOnDifferent setting - replicas may be on same site"
]

This warning does not block failover but indicates that DRBD replicas may not be distributed across failure domains. For production DR, update the storage class to include replicasOnDifferent: topology.kubernetes.io/zone.
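A storage class that satisfies both the placementCount and replicasOnDifferent recommendations might look like the manifest printed by the sketch below. Treat it as an assumption-laden illustration: the class name linstor-geo and the helper print_geo_storage_class are hypothetical, the parameter keys follow the spellings used in this guide, and your LINSTOR CSI version may expect differently prefixed keys or additional parameters such as a storage pool.

```shell
#!/bin/sh
# Print a candidate StorageClass manifest for geo-replicated volumes.
# $1 = storage class name (defaults to the hypothetical "linstor-geo").
print_geo_storage_class() {
  cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ${1:-linstor-geo}
provisioner: linstor.csi.linbit.com
parameters:
  # Key spellings follow this guide; verify them against your CSI driver version.
  placementCount: "2"
  replicasOnDifferent: topology.kubernetes.io/zone
EOF
}

# Example usage (not run automatically): review the output before applying it.
#   print_geo_storage_class linstor-geo \
#     | kubectl --kubeconfig ~/.kube/config-cluster1 apply -f -
```

Kubernetes generally does not allow the parameters of an existing StorageClass to be changed in place, so in practice this usually means creating a new class and provisioning new PVCs against it.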


Troubleshooting

Use the following patterns to diagnose common Health API problems. Each entry lists the symptom you will observe, its most likely cause, and the steps to resolve it.


Symptom: status.state shows Failed immediately after creating or patching a Protection Group

Likely cause: One or more VMs in the group use PVCs that fail geo-replication validation — either the storage class does not use the LINSTOR CSI provisioner (linstor.csi.linbit.com), or placementCount is less than 2.

Fix:

  1. Run kubectl describe protectiongroup <name> -n <namespace> and read the Message field under Conditions to identify which PVC failed and why.
  2. If the provisioner is wrong, the VM must be re-created with a PVC backed by a LINSTOR CSI storage class.
  3. If placementCount is too low, update or replace the storage class to set placementCount >= 2 and re-provision the PVC.
  4. Once the PVC is corrected, patch the Protection Group to trigger re-validation.
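When several Protection Groups fail validation at once, extracting the PVC and VM names from the condition message can speed up step 1. The sketch below assumes messages follow the single sample shown in Example 3 ("PVC <pvc> for VM <vm>: ..."); other message formats are not documented here and would produce no output. The helper name failed_pvc_from_message is hypothetical.

```shell
#!/bin/sh
# Read a validation Message on stdin and print "<pvc> <vm>" when the line
# matches the sample format "PVC <pvc> for VM <vm>: ...".
failed_pvc_from_message() {
  sed -n 's/^PVC \([^ ]*\) for VM \([^:]*\):.*/\1 \2/p'
}

# Example usage (not run automatically). The JSONPath below assumes the
# condition is stored under status.conditions with type ValidationFailed:
#   kubectl get protectiongroup <name> -n <namespace> \
#     -o jsonpath='{.status.conditions[?(@.type=="ValidationFailed")].message}' \
#     | failed_pvc_from_message
```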

Symptom: status.replicationHealth shows Degraded for a Protection Group that was previously Healthy

Likely cause: DRBD replication between the primary and DR cluster nodes has encountered an issue — a node may be offline, a network path between cluster nodes may be interrupted, or a DRBD resource may have lost quorum.

Fix:

  1. Identify which VM is degraded: kubectl get protectiongroup <name> -n <namespace> -o yaml | grep -A 20 "protectedVMs:"
  2. Check the specific volume's replicationState to confirm which PVC is affected.
  3. Verify node connectivity between primary and DR cluster worker nodes (DRBD replication runs directly between these nodes; the quorum cluster does not relay replication traffic).
  4. Inspect DRBD resource status on the affected nodes using your deployment's LINSTOR or DRBD Operator tooling.
  5. After the underlying issue is resolved, wait up to 60 seconds for the Protection Group controller to reconcile and update replicationHealth.

Symptom: status.currentState is stuck at mixed for more than 60 seconds

Likely cause: The Protection Group controller is still reconciling VM state — one or more VMs may be slow to start or stop, or the controller itself may have encountered an error during reconcile_vm_state().

Fix:

  1. Check controller logs: kubectl --kubeconfig ~/.kube/config-cluster1 logs deployment/protection-group-controller -n <namespace>
  2. Look for errors related to the specific VM names in the Protection Group.
  3. Check VM status individually: kubectl --kubeconfig ~/.kube/config-cluster1 get vm -n <namespace>
  4. If a VM is stuck in a transitional state (e.g., Terminating or Pending), investigate and resolve that VM's issue first; the controller will re-reconcile on the next cycle.
  5. If the controller pod itself is unhealthy, restart it: kubectl --kubeconfig ~/.kube/config-cluster1 rollout restart deployment/protection-group-controller -n <namespace>

Symptom: A Failover CR status.state is stuck at StoppingOnSource or StartingOnTarget

Likely cause: The Failover controller is waiting for the Protection Group currentState on the source or target cluster to transition, but the Protection Group controller has not completed reconciliation. This can also occur if the Failover controller cannot reach the remote cluster's API server.

Fix:

  1. Check the source cluster's Protection Group currentState: kubectl --kubeconfig ~/.kube/config-cluster1 get protectiongroup <name> -n <namespace> -o jsonpath='{.status.currentState}'
  2. If currentState is mixed, VMs are still transitioning — wait and re-check.
  3. If currentState has not changed after two minutes, check Protection Group controller logs on the source cluster for reconciliation errors.
  4. Check Failover controller logs on the quorum cluster for connectivity errors to the source or target cluster API server.
  5. Verify that the kubeconfig secrets or credentials used by the Failover controller to reach both clusters are current and valid.

Symptom: Health status fields are stale — the Protection Group shows Healthy but DRBD is known to be degraded

Likely cause: The Protection Group controller reconciles on a 60-second cycle. If the degradation occurred within the last reconciliation window, the status has not yet been updated.

Fix:

  1. Wait up to 60 seconds and re-query.
  2. If the status does not update after two cycles, check whether the Protection Group controller pod is running and not in a crash loop: kubectl --kubeconfig ~/.kube/config-cluster1 get pod -l app=protection-group-controller -n <namespace>
  3. If the controller is running but not reconciling, inspect its logs for errors that might cause it to skip reconciliation for a specific Protection Group.
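A crash-looping controller usually shows up as a climbing container restart count. The sketch below filters pod names by restart count; the helper name pods_restarting and the default threshold of 3 are illustrative choices, not product defaults, and the label selector is the one used in step 2.

```shell
#!/bin/sh
# Read "podName<TAB>restartCount" lines on stdin and print pods whose
# restart count exceeds the threshold ($1, default 3).
pods_restarting() {
  awk -F '\t' -v max="${1:-3}" \
    'NF >= 2 && $2 + 0 > max + 0 { print $1 " (" $2 " restarts)" }'
}

# Example usage (not run automatically):
#   kubectl --kubeconfig ~/.kube/config-cluster1 \
#     get pod -l app=protection-group-controller -n <namespace> \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
#     | pods_restarting 3
```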

Symptom: kubectl get protectiongroup returns No resources found

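Likely cause: No Protection Groups have been created yet, you are querying the wrong cluster or namespace, or the CRDs are not installed. When the CRD is missing, kubectl usually reports that the server doesn't have the resource type rather than No resources found, so read the exact message.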

Fix:

  1. Confirm you are querying the primary cluster and the correct namespace: kubectl --kubeconfig ~/.kube/config-cluster1 get protectiongroup -A
  2. Confirm the CRD is registered: kubectl --kubeconfig ~/.kube/config-cluster1 get crd protectiongroups.siterecovery.trilio.io
  3. If the CRD is missing, re-apply the Site Recovery CRD manifests via Ansible (Ansible is responsible for deploying all infrastructure components including CRDs to the primary cluster).
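The exact kubectl message helps separate an empty namespace from a missing CRD before you work through the steps above. The following sketch classifies the captured output; the helper name classify_get_output and its labels are hypothetical, while the two matched phrases are standard kubectl messages.

```shell
#!/bin/sh
# Classify the combined stdout/stderr of "kubectl get protectiongroup".
# $1 = captured output text
classify_get_output() {
  case "$1" in
    *"doesn't have a resource type"*) echo "crd-missing" ;;  # CRD not registered
    *"No resources found"*)           echo "no-objects" ;;   # CRD present, nothing listed
    "")                               echo "empty" ;;
    *)                                echo "other" ;;        # objects listed or a different error
  esac
}

# Example usage (not run automatically):
#   out=$(kubectl --kubeconfig ~/.kube/config-cluster1 \
#     get protectiongroup -A 2>&1)
#   classify_get_output "$out"
```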