
Overview

Failover switches protected workloads from the primary site to the DR site. Initiate failover when a primary site failure is confirmed and recovery at the primary site is not possible within the RTO window.
Failover is a significant operation. Confirm that the primary site is genuinely unavailable before proceeding. An unnecessary failover requires a full failback cycle to restore normal operations.
Prerequisites
  • An active protection plan in ACTIVE replication status
  • Confirmation that the primary site is unavailable, cross-referenced with XIMP monitoring
  • DR site confirmed healthy (navigate to Project → Disaster Recovery → Sites)
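
The prerequisites can be treated as a simple pre-flight gate. The following is a minimal sketch, not part of the product: the `Preflight` record and `unmet_prerequisites` function are hypothetical names, and the values are assumed to be entered by an operator after checking the console and XIMP monitoring.

```python
from dataclasses import dataclass


@dataclass
class Preflight:
    """Operator-supplied answers to the three prerequisites (hypothetical record)."""
    replication_status: str        # status shown on the protection plan, e.g. "ACTIVE"
    primary_confirmed_down: bool   # confirmed in the console AND in XIMP monitoring
    dr_site_healthy: bool          # DR site health as shown under Sites


def unmet_prerequisites(p: Preflight) -> list[str]:
    """Return a human-readable list of prerequisites that are not satisfied."""
    problems = []
    if p.replication_status != "ACTIVE":
        problems.append(f"replication status is {p.replication_status}, expected ACTIVE")
    if not p.primary_confirmed_down:
        problems.append("primary site outage is not independently confirmed")
    if not p.dr_site_healthy:
        problems.append("DR site is not healthy")
    return problems
```

An empty list means all three prerequisites hold; any entry is a reason to stop before initiating failover.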

Failover Procedure

Confirm primary site status

Navigate to Project → Disaster Recovery → Sites and verify the primary site health indicator shows Unreachable or Failed. Cross-reference with the XIMP monitoring portal for independent confirmation.
Do not rely on a single monitoring source. A network partition may make the primary site appear unreachable from the DR site while it is actually still operational. Verify from multiple vantage points before proceeding.
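
The multi-source rule can be expressed as a small decision function. This is an illustrative sketch only: the vantage-point names and the `min_sources` threshold are assumptions, and the reachability results are expected to come from whatever checks your team already runs.

```python
def primary_confirmed_down(observations: dict[str, bool], min_sources: int = 2) -> bool:
    """Decide whether the primary site outage is confirmed.

    observations maps a vantage-point name (hypothetical, e.g. "dr_console",
    "ximp_portal") to True if that source can still reach the primary site.
    The outage counts as confirmed only when at least `min_sources` sources
    reported and none of them can reach the primary site.
    """
    if len(observations) < min_sources:
        return False  # a single source cannot rule out a network partition
    return not any(observations.values())
```

For example, `primary_confirmed_down({"dr_console": False, "ximp_portal": True})` returns `False`, because one source still reaches the primary site, which is the pattern a network partition produces.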

Initiate failover

Navigate to Project → Disaster Recovery → Protection Plans, select the affected plan, and click Failover. In the failover dialog, choose one of the following options, then confirm.
| Option | Description |
| --- | --- |
| Latest Recovery Point | Use the most recent replicated snapshot |
| Specific Recovery Point | Select a point-in-time snapshot from the recovery point list |
| Test Mode | Bring up workloads in isolation without cutting over production traffic |
Selecting Latest Recovery Point uses data from the last successful replication cycle. Any writes to the primary site since that cycle will be lost permanently. Review the current replication lag before confirming.
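
The data-loss exposure is simply the time elapsed since the last successful replication cycle. The sketch below shows the arithmetic; the function names and the `max_acceptable_loss` threshold are assumptions for illustration, and the timestamp of the last cycle is assumed to be read manually from the protection plan.

```python
from datetime import datetime, timedelta, timezone


def data_loss_window(last_replication: datetime, now: datetime) -> timedelta:
    """Writes made during this window exist only on the primary site."""
    return now - last_replication


def loss_is_acceptable(last_replication: datetime, now: datetime,
                       max_acceptable_loss: timedelta) -> bool:
    """Compare the exposure against a threshold chosen by the operator (assumed)."""
    return data_loss_window(last_replication, now) <= max_acceptable_loss


# Example: last successful cycle finished at 12:00 UTC, failover considered at 12:07 UTC.
last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 12, 7, tzinfo=timezone.utc)
window = data_loss_window(last, now)  # seven minutes of writes would be lost
```

If the window exceeds what the business can tolerate, consider whether a Specific Recovery Point or further recovery attempts at the primary site are a better choice before confirming.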

Monitor recovery progress

The DR Runbook executes automatically in the configured priority order. Track progress in Disaster Recovery → Failover Status. Each resource shows:
| Status | Meaning |
| --- | --- |
| Pending | Waiting for dependencies to recover first |
| Recovering | Instance starting on DR site |
| Validated | Recovery script confirmed service is available |
| Failed | Recovery step encountered an error; review the event log |
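
When many resources are recovering at once, it can help to roll the per-resource statuses up into a single overall state. The following is a rough sketch: the `overall_state` function and its return labels are hypothetical, and the status strings are assumed to be copied from the Failover Status view.

```python
def overall_state(statuses: dict[str, str]) -> str:
    """Roll per-resource statuses up into one overall state.

    statuses maps a resource name to one of the documented statuses:
    "Pending", "Recovering", "Validated", or "Failed".
    """
    values = list(statuses.values())
    if not values:
        return "no resources"
    if "Failed" in values:
        return "needs attention"   # review the event log for the failed step
    if all(v == "Validated" for v in values):
        return "complete"
    return "in progress"           # something is still Pending or Recovering
```

A single Failed resource takes precedence over everything else, since resources that depend on it may remain Pending until the error is resolved.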

Verify workloads on DR site

Confirm application-level availability by accessing services through the DR site endpoints. Update DNS or load balancer configurations to route traffic to the DR site.
Expected result: protected workloads are running on the DR site and serving traffic.

Post-Failover Checklist

After failover completes, perform these steps:

Validate application services

Run application-level health checks against the DR site endpoints. Verify databases are consistent, application tiers are connected, and external services can reach the DR site.
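
A basic reachability sweep is a reasonable first pass before deeper application checks. The sketch below only tests whether a TCP connection can be opened; it does not verify database consistency or application behavior. The function names are hypothetical, and the endpoint list is assumed to be supplied by you.

```python
import socket
from typing import Callable, Iterable


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_endpoints(
    endpoints: Iterable[tuple[str, int]],
    probe: Callable[[str, int], bool] = tcp_reachable,
) -> list[tuple[str, int]]:
    """Return the endpoints for which the probe failed.

    `probe` is injectable so an application-specific check (for example an
    HTTP health request) can replace the plain TCP connect.
    """
    return [(host, port) for host, port in endpoints if not probe(host, port)]
```

An empty result means every listed DR endpoint accepted a connection; follow up with application-specific checks for the services that matter most.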

Update DNS and load balancers

Route production traffic to DR site IP addresses. Update:
  • External DNS A/CNAME records
  • Load balancer pools and health checks
  • Any hardcoded IP references in application configuration
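
Hardcoded references to primary-site addresses are easy to miss. One way to find them is to scan configuration text for IPv4 addresses that fall inside the primary site's address range. This is a minimal sketch under stated assumptions: the CIDR range and the configuration content are placeholders, `find_primary_ips` is a hypothetical helper, and only IPv4 literals are detected.

```python
import ipaddress
import re

# Loose match for dotted-quad candidates; validity is checked by ipaddress below.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_primary_ips(config_text: str, primary_cidr: str) -> list[tuple[int, str]]:
    """Return (line number, address) for each IPv4 literal inside primary_cidr."""
    network = ipaddress.ip_network(primary_cidr)
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        for candidate in IPV4_PATTERN.findall(line):
            try:
                address = ipaddress.ip_address(candidate)
            except ValueError:
                continue  # looked like an address but is not valid, e.g. 999.1.1.1
            if address in network:
                hits.append((lineno, candidate))
    return hits


# Example with placeholder values: 10.1.0.0/16 stands in for the primary site range.
sample_config = "db_host = 10.1.0.15\napi = https://api.example.com\ncache = 10.2.0.7:6379\n"
hits = find_primary_ips(sample_config, "10.1.0.0/16")
```

Each hit identifies a line that still points at the primary site and needs to be updated to a DR site address or, preferably, a DNS name.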

Notify stakeholders

Communicate the failover event and DR site endpoints to:
  • Operations and on-call teams
  • Business stakeholders and affected service owners
  • Partners or customers if external connectivity has changed

Begin planning failback

Once the primary site issue is resolved, plan the failback operation. See Failback for the full procedure.

Next Steps

Failback

Return workloads to the primary site after it has been restored

Protection Plans

Review and update protection plans after the failover event

Troubleshooting

Diagnose failover stuck states and recovery script failures

XDR Admin — DR Automation

Configure automatic failover triggers to reduce response time (administrator)