Guide

Deployments Page

Creating and managing DR deployments

Overview

This page covers how to create and manage disaster recovery deployments using Site Recovery's multi-tenant architecture. A deployment is an isolated DR setup that pairs a primary cluster with a DR cluster and runs a dedicated failover controller on the quorum cluster. Understanding deployments is essential because all Protection Group operations, failover workflows, and multi-tenant isolation are scoped to a deployment. You will learn how to create single or multiple deployments, verify they are healthy, manage their lifecycle, and operate them safely when running concurrent failover operations.

Prerequisites

Before creating a deployment, ensure you have the following in place:

- Site Recovery infrastructure already deployed via Ansible to your primary, DR, and quorum clusters (LINSTOR model: three clusters required; DRBD Operator model: two clusters minimum, quorum optional but recommended)
- Site Manager UI deployed via Helm to the quorum cluster
- kubectl configured and able to reach all clusters
- pgctl CLI available and on your $PATH (this is the primary tool for deployment management and Protection Group operations)
- kubeconfig files for each cluster, accessible from the machine where you run deployment scripts:
  - ~/.kube/config-quorum: quorum cluster
  - ~/.kube/config-cluster1: primary cluster
  - ~/.kube/config-cluster2: DR cluster
- Quorum cluster access: all deployment control-plane resources (failover controllers, secrets, namespaces) are created on the quorum cluster, so you must have sufficient RBAC permissions to create namespaces, ClusterRoles, ClusterRoleBindings, Secrets, ConfigMaps, and Deployments on it
- Source code repository (k8s-protector) checked out locally, including scripts/, cmds/pgctl, and src/
- src/requirements.txt present in the repository (contains Python dependencies: kopf, kubernetes, pyyaml, flask, flask-cors, requests)

Installation

Each deployment creates an isolated namespace (dr-<name>) on the quorum cluster, populates it with a failover controller, kubeconfig secrets, and RBAC resources, and links it to a specific primary/DR cluster pair. The quorum cluster hosts only management-plane components — it does not run application workloads and does not participate in storage replication, which runs directly between primary and DR cluster nodes.

Option A: Deploy a single deployment

Step 1. Export your kubeconfig paths as environment variables for convenience:

export KUBECONFIG_QUORUM=~/.kube/config-quorum
export KUBECONFIG_CLUSTER1=~/.kube/config-cluster1
export KUBECONFIG_CLUSTER2=~/.kube/config-cluster2

Step 2. Run the deployment script with a meaningful name (for example, prod):

./scripts/deploy-multi-tenant-quorum.sh \
  --deployment prod \
  --primary ~/.kube/config-cluster1 \
  --dr ~/.kube/config-cluster2

The script validates the kubeconfig files, creates the dr-prod namespace on the quorum cluster, stores the cluster kubeconfigs as Secrets, provisions RBAC (ClusterRole and ClusterRoleBinding scoped per deployment to avoid conflicts), deploys the failover controller code via ConfigMaps, and waits for the controller pod to become ready.

Step 3. Verify the deployment is running:

kubectl --kubeconfig ~/.kube/config-quorum get all -n dr-prod

Option B: Deploy multiple deployments from a configuration file

This is the recommended approach when managing several environments (production, staging, development) from the same quorum cluster.

Step 1. Create a deployments.yaml file:

deployments:
  - name: prod
    primary: ~/.kube/config-cluster1
    dr: ~/.kube/config-cluster2

  - name: staging
    primary: ~/.kube/config-staging-cluster1
    dr: ~/.kube/config-staging-cluster2

Step 2. Deploy all environments at once:

./scripts/deploy-multi-tenant-quorum.sh --config deployments.yaml

Each deployment is created sequentially. If one fails, the script reports the error and continues with the remaining deployments.
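
The continue-on-error behavior described above can be modeled in a few lines of Python. This is a minimal sketch, not the script's actual code: `deploy_all` and the `deploy_one` callback are hypothetical names, and the real work is done by deploy-multi-tenant-quorum.sh.

```python
from typing import Callable, Dict, List, Tuple


def deploy_all(
    deployments: List[Dict[str, str]],
    deploy_one: Callable[[Dict[str, str]], None],
) -> Tuple[List[str], Dict[str, str]]:
    """Deploy entries one at a time; record failures and keep going.

    deploy_one is a hypothetical callback that raises on failure.
    Returns (names that succeeded, {name: error message} for failures).
    """
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for entry in deployments:
        name = entry["name"]
        try:
            deploy_one(entry)
        except Exception as exc:  # report the error, then move on
            failed[name] = str(exc)
        else:
            succeeded.append(name)
    return succeeded, failed
```

Because a failure does not stop the batch, review the reported errors after the run rather than assuming every listed deployment now exists.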

Option C: Interactive management menu

For day-to-day operations you can use the interactive menu, which wraps all deployment lifecycle actions:

./scripts/quorum-deployments.sh

Select option 2 (Deploy Multi-Tenant Quorum) or option 3 (Add New DR Deployment) from the menu.

Resources created per deployment

After a successful deployment, the following resources exist on the quorum cluster:

Namespace:            dr-<deployment-name>
ServiceAccount:       failover-controller
ClusterRole:          failover-controller-<deployment-name>
ClusterRoleBinding:   failover-controller-<deployment-name>
Secret:               cluster1-kubeconfig
Secret:               cluster2-kubeconfig
ConfigMap:            failover-controller-code
ConfigMap:            orchestration-library
Deployment:           failover-controller

Shared cluster-level resources (CRDs for FailoverRequest and ProtectionGroup, and LINSTOR access bindings) are installed once and used by all deployments.
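
The naming scheme above is mechanical, so the expected resource names for any deployment can be derived from its name. The sketch below is illustrative only: `resource_names` is a hypothetical helper, and the validity check reflects the general Kubernetes rule that a namespace name is a DNS label of at most 63 characters, not a check the deployment script is known to perform.

```python
import re

# Kubernetes namespace names must be RFC 1123 DNS labels (max 63 chars).
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def resource_names(deployment: str) -> dict:
    """Return the per-deployment resource names listed in this guide."""
    namespace = f"dr-{deployment}"
    if len(namespace) > 63 or not _DNS_LABEL.fullmatch(namespace):
        raise ValueError(f"{namespace!r} is not a valid namespace name")
    return {
        "namespace": namespace,
        "service_account": "failover-controller",
        "cluster_role": f"failover-controller-{deployment}",
        "cluster_role_binding": f"failover-controller-{deployment}",
        "secrets": ["cluster1-kubeconfig", "cluster2-kubeconfig"],
        "config_maps": ["failover-controller-code", "orchestration-library"],
        "deployment": "failover-controller",
    }
```

A list like this is handy as a checklist when verifying a new deployment or confirming that a removed one left nothing behind.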

Configuration

The following options control how a deployment is created and how the failover controller behaves at runtime.

Deployment script flags

Flag	Required	Description
`--deployment <name>`	Yes (single mode)	The deployment name. Used as the namespace suffix (`dr-<name>`) and as part of all RBAC resource names. Use short, lowercase identifiers: `prod`, `staging`, `dev`.
`--primary <path>`	Yes (single mode)	Path to the kubeconfig for the primary cluster.
`--dr <path>`	Yes (single mode)	Path to the kubeconfig for the DR cluster.
`--config <path>`	Yes (batch mode)	Path to a `deployments.yaml` file listing multiple deployments. Mutually exclusive with `--deployment`.

deployments.yaml structure

deployments:
  - name: <string>          # Deployment name; becomes namespace dr-<name>
    primary: <path>         # Kubeconfig path for primary cluster
    dr: <path>              # Kubeconfig path for DR cluster
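
Before running a batch deployment it can help to sanity-check the file. The following sketch validates an already-parsed deployments.yaml (for example, the result of `yaml.safe_load` from the pyyaml package listed in src/requirements.txt). `validate_deployments` is a hypothetical helper, and the duplicate-name check is an assumption based on each name mapping to its own namespace.

```python
import os
from typing import Dict, List

REQUIRED_KEYS = ("name", "primary", "dr")


def validate_deployments(config: Dict) -> List[str]:
    """Return a list of problems found in a parsed deployments.yaml."""
    problems: List[str] = []
    entries = config.get("deployments")
    if not isinstance(entries, list) or not entries:
        return ["top-level 'deployments' must be a non-empty list"]
    seen = set()
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: expected a mapping")
            continue
        missing = [key for key in REQUIRED_KEYS if not entry.get(key)]
        if missing:
            problems.append(f"entry {index}: missing {', '.join(missing)}")
            continue
        if entry["name"] in seen:
            problems.append(f"entry {index}: duplicate name {entry['name']!r}")
        seen.add(entry["name"])
        for key in ("primary", "dr"):
            # Expand a leading ~ the way the shell would for a bare path.
            path = os.path.expanduser(entry[key])
            if not os.path.isfile(path):
                problems.append(f"{entry['name']}: {key} kubeconfig not found: {path}")
    return problems
```

An empty list means the file has the expected shape and every referenced kubeconfig exists on the local machine; it does not prove the credentials are valid.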

Failover controller resource limits

The controller deployment does not enforce resource limits by default. For production deployments, set limits appropriate to the number of VMs and the degree of parallelism you expect:

resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"   # Increase for high VM counts
    cpu: "500m"       # Increase for high parallelism

After the initial deployment, apply overrides on the quorum cluster with kubectl --kubeconfig ~/.kube/config-quorum edit deployment/failover-controller -n dr-<name>.

Controller replicas

By default the controller runs as a single replica. For production environments where quorum cluster availability is a concern, scale to two replicas:

kubectl --kubeconfig ~/.kube/config-quorum scale deployment/failover-controller \
  --replicas=2 -n dr-prod

Failover safety settings

By default, the failover controller enforces two safety mechanisms at runtime:

Setting	Default	Effect
Per-PG failover lock duration	5 minutes	A Kubernetes Lease (`failover-lock-<pg-name>`) is held for up to 5 minutes. If the controller crashes mid-failover, the lock auto-expires, allowing a retry.
Quorum taint safety check	Enabled	Before removing node-level DRBD quorum taints on the target cluster, the controller checks whether any other Protection Group in the same namespace has `currentState: running`. If conflicts exist, taint removal is aborted to avoid disrupting those VMs.

Both checks can be bypassed at invocation time using the --force flag or the force: true field in a FailoverRequest CR. See the Usage section for when this is appropriate.
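
The auto-expiry rule for the failover lock can be expressed as a small check against a Lease's `renewTime` and `leaseDurationSeconds`, which are standard fields of the Kubernetes coordination.k8s.io/v1 Lease spec. This is a sketch under assumptions: `lock_expired` is a hypothetical helper, and it assumes the controller records the 5-minute duration in `leaseDurationSeconds`, which this guide does not confirm.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_LOCK_SECONDS = 300  # the 5-minute default described above


def lock_expired(
    renew_time: datetime,
    lease_duration_seconds: int = DEFAULT_LOCK_SECONDS,
    now: Optional[datetime] = None,
) -> bool:
    """Return True once the lease has outlived its duration."""
    now = now or datetime.now(timezone.utc)
    return now >= renew_time + timedelta(seconds=lease_duration_seconds)
```

A lock younger than its duration should normally be left alone; removing it by hand is only appropriate when the holder is known to be dead.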

Usage

Once a deployment is running, you interact with it primarily through pgctl for Protection Group and failover operations, and through kubectl for monitoring and lifecycle management.

Verify connectivity after deployment

Always confirm that the failover controller can reach both clusters before creating Protection Groups:

./cmds/pgctl config test

List all active deployments

kubectl --kubeconfig ~/.kube/config-quorum get ns -l dr-deployment=true

Check controller health across all deployments

kubectl --kubeconfig ~/.kube/config-quorum get deploy -A -l app=failover-controller

Trigger a failover for a Protection Group

Specify the Protection Group name and target cluster. The failover controller in the appropriate deployment namespace handles the request:

./cmds/pgctl failover my-protection-group --to-cluster2

How the failover controller executes a failover

1. Acquires a per-PG Kubernetes Lease lock to block concurrent operations on the same Protection Group.
2. Stops VMs on the source cluster by setting the Protection Group desiredState to stopped.
3. Checks whether any other Protection Group in the namespace has currentState: running on the target cluster. If conflicts exist, the operation aborts.
4. Removes DRBD quorum taints on the target cluster nodes (safe because step 3 passed).
5. Sets the Protection Group desiredState to running on the target cluster.
6. Monitors VM startup, re-applying taint removal as needed during the startup window.
7. Releases the lock.
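
The sequence above can be sketched as follows. This is an illustrative model, not the controller's source: `run_failover`, `FailoverAborted`, and every method on the `ops` object are hypothetical stand-ins for the real lock, Protection Group, and taint operations. Releasing the lock on abort is an assumption, and `force` here only skips the conflict check; how --force interacts with the lock is not modeled.

```python
from typing import List, Protocol


class FailoverAborted(Exception):
    """Raised when the safety check finds conflicting Protection Groups."""


class Ops(Protocol):
    # Hypothetical interface; each method stands in for a real cluster call.
    def acquire_lock(self, pg: str) -> None: ...
    def release_lock(self, pg: str) -> None: ...
    def set_desired_state(self, pg: str, cluster: str, state: str) -> None: ...
    def running_pgs_on(self, cluster: str, exclude: str) -> List[str]: ...
    def remove_quorum_taints(self, cluster: str) -> None: ...
    def wait_for_vms(self, pg: str, cluster: str) -> None: ...


def run_failover(ops: Ops, pg: str, source: str, target: str, force: bool = False) -> None:
    ops.acquire_lock(pg)                                    # step 1
    try:
        ops.set_desired_state(pg, source, "stopped")        # step 2
        conflicts = ops.running_pgs_on(target, exclude=pg)  # step 3
        if conflicts and not force:
            raise FailoverAborted(f"running on {target}: {', '.join(conflicts)}")
        ops.remove_quorum_taints(target)                    # step 4
        ops.set_desired_state(pg, target, "running")        # step 5
        ops.wait_for_vms(pg, target)                        # step 6
    finally:
        ops.release_lock(pg)                                # step 7 (also on abort)
```

Note that, following the documented order, the source VMs are already stopped by the time the safety check can abort the operation.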

Running concurrent failovers safely

Because DRBD quorum taints are node-level (not per-PVC or per-Protection Group), removing taints for one Protection Group affects all Protection Groups on the same cluster. To avoid disrupting running VMs:

- Fail over one Protection Group at a time to any given target cluster.
- Wait for the first failover to complete before starting a second one to the same target.
- Rely on the safety check as a backstop: it blocks a second concurrent failover automatically and reports which Protection Groups conflict.
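
The conflict rule, as this guide describes it, amounts to a filter over the Protection Groups in a namespace. The sketch below is a hypothetical helper (`conflicting_pgs`) operating on Protection Group objects as plain dictionaries, for example the `items` of `kubectl get protectiongroups -o json`. It is not the controller's code, and it models only the documented condition (another Protection Group in the same namespace with `currentState: running`).

```python
from typing import Dict, List


def conflicting_pgs(items: List[Dict], namespace: str, failing_over: str) -> List[str]:
    """Names of other Protection Groups in the namespace that are running."""
    return [
        item["metadata"]["name"]
        for item in items
        if item["metadata"].get("namespace") == namespace
        and item["metadata"]["name"] != failing_over
        and item.get("status", {}).get("currentState") == "running"
    ]
```

An empty result corresponds to the case where taint removal is considered safe; any names returned are the Protection Groups a second failover would have to wait for.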

If you must proceed in an emergency and understand the risk of temporarily disrupting other Protection Groups:

# Shell script override
./intelligent-pg-failover.sh my-protection-group --to-cluster2 --force

# FailoverRequest CR override
apiVersion: siterecovery.trilio.io/v1alpha1
kind: FailoverRequest
metadata:
  name: <failover-name>
  namespace: dr-<name>
spec:
  protectionGroup: my-protection-group
  targetCluster: cluster2
  force: true

Warning: --force bypasses all safety checks. Only use it in coordinated maintenance windows or genuine emergency DR scenarios. Coordinate with other operators before using it.

Monitor controller logs for a deployment

kubectl --kubeconfig ~/.kube/config-quorum logs -f \
  deployment/failover-controller -n dr-prod

Filter for safety-related events:

kubectl --kubeconfig ~/.kube/config-quorum logs -f \
  deployment/failover-controller -n dr-prod | grep -E "SAFETY|UNSAFE|lock"

Check Protection Group states across all namespaces

kubectl get protectiongroups -A -o custom-columns=\
NAME:.metadata.name,\
NAMESPACE:.metadata.namespace,\
STATE:.status.currentState,\
DESIRED:.spec.desiredState
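
To spot Protection Groups whose observed state has not caught up with the requested one, the same two fields can be compared programmatically. The sketch below parses the output of `kubectl get protectiongroups -A -o json`; `out_of_sync` is a hypothetical helper, and the field paths are the ones used in the custom-columns command above.

```python
import json
from typing import List, Tuple


def out_of_sync(kubectl_json: str) -> List[Tuple[str, str, str, str]]:
    """Return (namespace, name, currentState, desiredState) for mismatches."""
    mismatches = []
    for item in json.loads(kubectl_json).get("items", []):
        meta = item.get("metadata", {})
        current = item.get("status", {}).get("currentState", "<none>")
        desired = item.get("spec", {}).get("desiredState", "<none>")
        if current != desired:
            mismatches.append(
                (meta.get("namespace", ""), meta.get("name", ""), current, desired)
            )
    return mismatches
```

A mismatch during a failover is expected while VMs are stopping or starting; one that persists is a reason to look at the controller logs.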

Remove a deployment

Removing a deployment deletes its namespace and all namespace-scoped resources, and also removes the ClusterRole and ClusterRoleBinding to avoid leaving orphaned cluster-level resources:

# Interactive menu
./scripts/quorum-deployments.sh
# Select: 4) Remove DR Deployment

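# Or directly (all three commands are required; deleting only the namespace leaves the cluster-level RBAC behind)
kubectl --kubeconfig ~/.kube/config-quorum delete namespace dr-staging
kubectl --kubeconfig ~/.kube/config-quorum delete clusterrolebinding failover-controller-staging
kubectl --kubeconfig ~/.kube/config-quorum delete clusterrole failover-controller-staging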

Using the interactive menu is preferred because it handles all three deletion steps in order.
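
For scripted cleanup outside the menu, the three deletions can be generated from the deployment name. This sketch only builds the kubectl argument lists; `removal_commands` is a hypothetical helper, and the namespace-first order is an assumption taken from the order in which this guide lists the commands.

```python
from typing import List


def removal_commands(deployment: str, kubeconfig: str) -> List[List[str]]:
    """Build the kubectl commands that fully remove one deployment."""
    base = ["kubectl", "--kubeconfig", kubeconfig, "delete"]
    rbac_name = f"failover-controller-{deployment}"
    return [
        base + ["namespace", f"dr-{deployment}"],
        base + ["clusterrolebinding", rbac_name],
        base + ["clusterrole", rbac_name],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Pass an absolute kubeconfig path, because `~` is not expanded when no shell is involved.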

Back up and restore deployment state

# Backup all deployments on the quorum cluster
./scripts/backup-quorum-state.sh

# Backup a specific deployment manually
kubectl --kubeconfig ~/.kube/config-quorum get all,secrets,configmaps \
  -n dr-prod -o yaml > backup-prod-$(date +%Y%m%d).yaml

# Restore from backup
./scripts/restore-quorum-state.sh /path/to/backup.tar.gz

Examples

Example 1: Deploy a production environment

Create a deployment named prod linking a primary and DR cluster:

./scripts/deploy-multi-tenant-quorum.sh \
  --deployment prod \
  --primary ~/.kube/config-cluster1 \
  --dr ~/.kube/config-cluster2

Expected output (abbreviated):

[INFO] Validating kubeconfig: ~/.kube/config-cluster1 ... OK
[INFO] Validating kubeconfig: ~/.kube/config-cluster2 ... OK
[INFO] Creating namespace dr-prod on quorum cluster
[INFO] Creating secrets: cluster1-kubeconfig, cluster2-kubeconfig
[INFO] Creating ClusterRole: failover-controller-prod
[INFO] Creating ClusterRoleBinding: failover-controller-prod
[INFO] Creating ConfigMaps: failover-controller-code, orchestration-library
[INFO] Deploying failover-controller
[INFO] Waiting for controller readiness...
[INFO] Deployment prod is ready.

Example 2: Deploy multiple environments from a config file

# deployments.yaml
deployments:
  - name: prod
    primary: ~/.kube/config-cluster1
    dr: ~/.kube/config-cluster2

  - name: staging
    primary: ~/.kube/config-staging-cluster1
    dr: ~/.kube/config-staging-cluster2

./scripts/deploy-multi-tenant-quorum.sh --config deployments.yaml

Expected result: Two isolated namespaces (dr-prod, dr-staging) created on the quorum cluster, each with its own failover controller and credentials. Neither deployment can interfere with the other.

Example 3: Verify a deployment's health

kubectl --kubeconfig ~/.kube/config-quorum get all -n dr-prod

Expected output:

NAME                                      READY   STATUS    RESTARTS   AGE
pod/failover-controller-7d9f8b6c4-xk2pq  1/1     Running   0          5m

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/failover-controller  1/1     1            1           5m

Example 4: Trigger a failover and watch progress

# In terminal 1: start the failover
./cmds/pgctl failover production-vm-group --to-cluster2

# In terminal 2: watch controller logs
kubectl --kubeconfig ~/.kube/config-quorum logs -f \
  deployment/failover-controller -n dr-prod

Expected log output (normal path):

INFO: Acquired failover lock for production-vm-group
INFO: Stopping VMs on cluster1 via Protection Group desiredState
INFO: Safe to remove taints: No other Protection Groups have running VMs on cluster2
INFO: Removing DRBD quorum taints on cluster2
INFO: Starting VMs on cluster2 via Protection Group desiredState
INFO: All VMs running on cluster2
INFO: Released failover lock for production-vm-group

Example 5: Observe a blocked concurrent failover

Attempting to fail over a second Protection Group to the same target cluster while the first is still starting up:

# Terminal 1 (already running)
./cmds/pgctl failover pg-a --to-cluster2

# Terminal 2 (attempted while pg-a VMs are starting on cluster2)
./cmds/pgctl failover pg-b --to-cluster2

Expected error output for Terminal 2:

⚠ SAFETY CHECK: Other Protection Groups have running VMs on cluster2:
    - pg-a
UNSAFE to remove taints on cluster2 - other Protection Groups have running VMs
This could cause scheduling issues for those VMs
To override this safety check, use --force flag (NOT RECOMMENDED)

Wait for pg-a failover to complete, then retry pg-b.

Example 6: Check active failover locks

If you suspect a failover is stuck or a lock was not released:

kubectl --kubeconfig ~/.kube/config-quorum get leases -n dr-prod | grep failover-lock

Expected output when a lock is held:

NAME                                    HOLDER                        AGE
failover-lock-production-vm-group       failover-controller-48291     2m

Locks auto-expire after 5 minutes. If the controller crashed and the lock has not expired, you can delete it manually:

kubectl --kubeconfig ~/.kube/config-quorum delete lease failover-lock-production-vm-group -n dr-prod

Troubleshooting

Controller pod not starting

Symptom: kubectl get pods -n dr-prod shows the failover-controller pod in Pending, CrashLoopBackOff, or ImagePullBackOff.

Likely cause: The container image cannot be pulled, a required ConfigMap is missing, or the init container failed to set up the Python library structure.

Fix:

# Check pod events
kubectl --kubeconfig ~/.kube/config-quorum get events -n dr-prod \
  --sort-by='.lastTimestamp'

# Describe the pod for detailed error
kubectl --kubeconfig ~/.kube/config-quorum describe pod -n dr-prod \
  -l app=failover-controller

# Check controller logs (if pod started at all)
kubectl --kubeconfig ~/.kube/config-quorum logs -n dr-prod \
  deployment/failover-controller --tail=100

# Verify ConfigMaps exist
kubectl --kubeconfig ~/.kube/config-quorum get configmaps -n dr-prod

If ConfigMaps are missing, re-run the deployment script. The script is idempotent and will recreate missing resources.

Controller cannot reach primary or DR cluster

Symptom: Controller logs show connection refused or authentication errors when attempting to contact cluster1 or cluster2.

Likely cause: The kubeconfig Secrets were created with incorrect or expired credentials, or the cluster API endpoints are not reachable from the quorum cluster network.

Fix:

# Inspect the stored kubeconfig
kubectl --kubeconfig ~/.kube/config-quorum get secret cluster1-kubeconfig \
  -n dr-prod -o jsonpath='{.data.kubeconfig}' | base64 -d | head -10

# Test connectivity from inside the controller pod
kubectl --kubeconfig ~/.kube/config-quorum exec -n dr-prod \
  deployment/failover-controller -- \
  kubectl --kubeconfig /kubeconfigs/cluster1/kubeconfig cluster-info

# If the kubeconfig is stale, re-create the secret
# (use $HOME here: the shell does not expand ~ in the middle of an argument)
kubectl --kubeconfig ~/.kube/config-quorum delete secret cluster1-kubeconfig -n dr-prod
kubectl --kubeconfig ~/.kube/config-quorum create secret generic cluster1-kubeconfig \
  --from-file=kubeconfig="$HOME/.kube/config-cluster1" -n dr-prod

Failover blocked by safety check

Symptom: Failover aborts with a message similar to: UNSAFE to remove taints on cluster2 - other Protection Groups have running VMs.

Likely cause: Another Protection Group in the same namespace already has currentState: running on the target cluster. Because DRBD quorum taints are node-level, removing them would affect those running VMs.

Fix: Check which Protection Groups are running on the target cluster, then wait for them to stabilize before retrying:

kubectl get protectiongroups -n dr-prod -o custom-columns=\
NAME:.metadata.name,STATE:.status.currentState

Once no conflicting Protection Groups are in running state on the target cluster, retry the failover. Only use --force if you have explicitly coordinated with other operators and accept the risk of temporary VM disruption.

Failover is stuck / FailoverRequest not progressing

Symptom: The pgctl failover command hangs, or a FailoverRequest CR remains in a pending state for an extended period.

Likely cause: A failover lock Lease from a previous crashed operation has not yet expired, or the FailoverRequest CR is in an error state.

Fix:

# Check FailoverRequest status
kubectl --kubeconfig ~/.kube/config-quorum get failoverrequest -n dr-prod

# Describe the stuck request
kubectl --kubeconfig ~/.kube/config-quorum describe failoverrequest \
  <failover-name> -n dr-prod

# Check for active failover locks
kubectl --kubeconfig ~/.kube/config-quorum get leases -n dr-prod | grep failover-lock

# If a lock is held by a dead process (compare AGE with the 5-minute expiry), delete it
kubectl --kubeconfig ~/.kube/config-quorum delete lease failover-lock-<pg-name> -n dr-prod

# Delete the stuck FailoverRequest so it can be resubmitted
kubectl --kubeconfig ~/.kube/config-quorum delete failoverrequest \
  <failover-name> -n dr-prod

Orphaned ClusterRole or ClusterRoleBinding after namespace deletion

Symptom: After manually deleting a deployment namespace, kubectl get clusterrole and kubectl get clusterrolebinding still show failover-controller-<name> entries.

Likely cause: The namespace was deleted directly without using the management script, which also handles cluster-level resource cleanup.

Fix:

kubectl --kubeconfig ~/.kube/config-quorum \
  delete clusterrolebinding failover-controller-<deployment-name>
kubectl --kubeconfig ~/.kube/config-quorum \
  delete clusterrole failover-controller-<deployment-name>

To prevent this in the future, always remove deployments through the interactive menu (option 4), or run both deletion commands alongside the namespace deletion.

FailoverRequest CRD not found

Symptom: kubectl get failoverrequest returns error: the server doesn't have a resource type "failoverrequest".

Likely cause: The CRD was not installed as part of the Ansible infrastructure deployment, or the quorum cluster was not included in the Ansible run.

Fix: Verify CRD presence and re-run the Ansible infrastructure playbook targeting the quorum cluster if it is missing:

kubectl --kubeconfig ~/.kube/config-quorum \
  get crd failoverrequests.siterecovery.trilio.io