Failover

Overview

Failover switches protected workloads from the primary site to the DR site. Initiate failover when a primary site failure is confirmed and recovery at the primary site is not possible within the RTO window.

Failover is a significant operation. Confirm that the primary site is genuinely unavailable before proceeding. An unnecessary failover requires a full failback cycle to restore normal operations.

Prerequisites

A protection plan with replication status ACTIVE
Confirmation that the primary site is unavailable, cross-referenced with XIMP monitoring
DR site confirmed healthy (navigate to Disaster Recovery → Sites)

Failover Procedure

Dashboard

Confirm primary site status

Navigate to Project → Disaster Recovery → Sites and verify the primary site health indicator shows Unreachable or Failed. Cross-reference with the XIMP monitoring portal for independent confirmation.

Do not rely on a single monitoring source. A network partition may make the primary site appear unreachable from the DR site while it is actually still operational. Verify from multiple vantage points before proceeding.
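The multi-source rule above can be sketched as a small decision helper. This is an illustration only, not part of the product: the function name, the vantage-point labels, and the "Healthy" value are assumptions, and only "Unreachable" and "Failed" come from the dashboard indicator described above.

```python
from typing import Mapping

# Health values treated as "down". "Unreachable" and "Failed" mirror the
# dashboard indicator; any other value (e.g. "Healthy") is an assumption.
DOWN_STATES = {"Unreachable", "Failed"}


def primary_confirmed_down(observations: Mapping[str, str],
                           min_sources: int = 2) -> bool:
    """Return True only if at least `min_sources` vantage points reported
    and every one of them sees the primary site as down."""
    if len(observations) < min_sources:
        # A single source cannot rule out a network partition.
        return False
    return all(state in DOWN_STATES for state in observations.values())


# Hypothetical vantage points: the DR site view and the XIMP portal view.
observations = {"dr_site": "Unreachable", "ximp_portal": "Healthy"}
print(primary_confirmed_down(observations))  # False
```

Requiring unanimous agreement is a deliberately conservative choice: one healthy reading is enough to suspect a partition and hold off on failover.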

Initiate failover

Navigate to Project → Disaster Recovery → Protection Plans, select the affected plan, and click Failover. In the failover dialog, choose one of the following options, then confirm.

Option	Description
Latest Recovery Point	Use the most recent replicated snapshot
Specific Recovery Point	Select a point-in-time snapshot from the recovery point list
Test Mode	Bring up workloads in isolation without cutting over production traffic

Selecting Latest Recovery Point uses data from the last successful replication cycle. Any writes to the primary site since that cycle will be lost permanently. Review the current replication lag before confirming.
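To make the data-loss exposure concrete, the window between the last successful replication cycle and the failure can be computed as below. This is a minimal sketch: the timestamps are made up, and `max_acceptable_loss` is a hypothetical tolerance you would set yourself, not a value read from the product.

```python
from datetime import datetime, timedelta, timezone


def data_loss_window(last_replication: datetime,
                     failure_time: datetime) -> timedelta:
    """Writes made in this window exist only on the primary site and
    would be lost when failing over to the latest recovery point."""
    if failure_time <= last_replication:
        return timedelta(0)
    return failure_time - last_replication


def loss_is_acceptable(window: timedelta,
                       max_acceptable_loss: timedelta) -> bool:
    """Compare the exposure against a tolerance chosen by the operator."""
    return window <= max_acceptable_loss


# Illustrative timestamps only.
last_ok = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
failed_at = datetime(2024, 1, 1, 12, 7, tzinfo=timezone.utc)

window = data_loss_window(last_ok, failed_at)
print(window)  # 0:07:00
print(loss_is_acceptable(window, timedelta(minutes=5)))  # False
```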

Monitor recovery progress

The DR Runbook executes automatically in the configured priority order. Track progress under Disaster Recovery → Failover Status, where each resource shows one of the following statuses:

Status	Meaning
Pending	Waiting for dependencies to recover first
Recovering	Instance starting on DR site
Validated	Recovery script confirmed service is available
Failed	Recovery step encountered an error — review event log
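If the per-resource statuses are copied out of the Failover Status view, a short helper can summarize them. The sketch below assumes the statuses arrive as a plain mapping of resource name to status string; how they are exported is not covered here, and the resource names are invented.

```python
from collections import Counter
from typing import Dict, Mapping

# The four statuses listed in the table above.
KNOWN_STATUSES = ("Pending", "Recovering", "Validated", "Failed")


def summarize_failover(statuses: Mapping[str, str]) -> Dict[str, object]:
    """Count resources per status, list the failed ones, and report
    whether every resource has reached Validated."""
    unknown = set(statuses.values()) - set(KNOWN_STATUSES)
    if unknown:
        raise ValueError(f"unrecognized status values: {sorted(unknown)}")
    counts = Counter(statuses.values())
    return {
        "counts": {s: counts.get(s, 0) for s in KNOWN_STATUSES},
        "failed": sorted(n for n, s in statuses.items() if s == "Failed"),
        # An empty mapping is not treated as a completed failover.
        "complete": bool(statuses)
        and all(s == "Validated" for s in statuses.values()),
    }


# Hypothetical resources.
report = summarize_failover(
    {"db": "Validated", "app": "Recovering", "web": "Pending", "cache": "Failed"}
)
print(report["failed"])    # ['cache']
print(report["complete"])  # False
```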

Verify workloads on DR site

Confirm application-level availability by accessing services through the DR site endpoints. Update DNS or load balancer configurations to route traffic to the DR site.

When these checks pass, the protected workloads are running on the DR site and serving traffic.

Post-Failover Checklist

After failover completes, perform these steps:

Validate application services

Run application-level health checks against the DR site endpoints. Verify databases are consistent, application tiers are connected, and external services can reach the DR site.
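An application-level check of this kind might look like the following sketch, which uses only the Python standard library. The endpoint URLs are placeholders, and treating any 2xx response as healthy is an assumption; real services often need deeper checks, such as a query against the database.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, connection failures, timeouts, and bad URLs.
        return False


def failing_endpoints(urls: Iterable[str],
                      probe: Callable[[str], bool] = http_probe) -> List[str]:
    """Return the endpoints whose probe did not succeed, in input order."""
    return [url for url in urls if not probe(url)]


if __name__ == "__main__":
    # Placeholder DR endpoints; replace with your own.
    dr_endpoints = [
        "https://app.dr.example.com/healthz",
        "https://api.dr.example.com/healthz",
    ]
    for url in failing_endpoints(dr_endpoints):
        print(f"FAILED: {url}")
```

The probe is passed in as a parameter so the same loop can run a different check per service, or a stub during a rehearsal.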

Update DNS and load balancers

Route production traffic to DR site IP addresses. Update:

External DNS A/CNAME records
Load balancer pools and health checks
Any hardcoded IP references in application configuration
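Hardcoded references to primary-site addresses are easy to miss. One way to find them is to scan configuration text for IPv4 addresses that fall inside the primary site's network range. The sketch below is an assumption-laden illustration: the CIDR range and the configuration contents are invented (taken from the documentation address blocks), and a simple pattern match like this can flag dotted version strings by mistake.

```python
import ipaddress
import re
from typing import List, Tuple

# Loose IPv4 pattern; candidates are validated with ipaddress below.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_primary_ips(config_text: str,
                     primary_cidr: str) -> List[Tuple[int, str]]:
    """Return (line number, address) for each IPv4 address in the text
    that belongs to the primary site's network range."""
    network = ipaddress.ip_network(primary_cidr)
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        for candidate in IPV4_RE.findall(line):
            try:
                addr = ipaddress.ip_address(candidate)
            except ValueError:
                continue  # matched the pattern but is not valid, e.g. 999.1.1.1
            if addr in network:
                hits.append((lineno, candidate))
    return hits


# Invented example: 192.0.2.0/24 stands in for the primary site range.
config = "db_host = 192.0.2.15\ncache = 198.51.100.4:6379\nname = app\n"
print(find_primary_ips(config, "192.0.2.0/24"))  # [(1, '192.0.2.15')]
```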

Notify stakeholders

Communicate the failover event and DR site endpoints to:

Operations and on-call teams
Business stakeholders and affected service owners
Partners or customers if external connectivity has changed

Begin planning failback

Once the primary site issue is resolved, plan the failback operation. See Failback for the full procedure.

Next Steps

Failback

Return workloads to the primary site after it has been restored

Protection Plans

Review and update protection plans after the failover event

Troubleshooting

Diagnose failover stuck states and recovery script failures

XDR Admin — DR Automation

Configure automatic failover triggers to reduce response time (administrator)
