Overview
DR automation reduces manual intervention during failover by orchestrating the recovery sequence, updating external systems, and validating service health automatically. Well-tested automation is the difference between a 30-minute RTO and a 3-hour one.
Prerequisites
- Recovery plans created and in ACTIVE replication status
- Recovery runbook scripts tested against DR test instances before production use
- Automation hook scripts stored in a version-controlled location accessible from the DR site
Automatic Failover Triggers
Configure health checks that trigger failover automatically when the primary site is unreachable, without requiring manual operator intervention.
In the dashboard, navigate to Disaster Recovery → Recovery Plans → [Plan] → Automatic Triggers:
| Setting | Description |
|---|---|
| Health Check URL | Primary site endpoint to poll (e.g., load balancer VIP) |
| Check Interval | How often to poll (default: 30 seconds) |
| Failure Threshold | Consecutive failures before triggering (default: 3) |
| Secondary Check | Optional second endpoint — both must fail before triggering |
| Notification Channel | XIMP alert channel to notify when automatic failover begins |
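The way Failure Threshold and Secondary Check combine can be sketched as below. This is an illustration only, not XDR's implementation: `should_trigger` and its input shape are hypothetical, and it assumes a polling round counts as failed only when the primary check fails and, if a secondary check is configured, that check fails too.

```python
from typing import Iterable, Optional, Tuple


def should_trigger(
    rounds: Iterable[Tuple[bool, Optional[bool]]],
    failure_threshold: int = 3,
) -> bool:
    """Return True once `failure_threshold` consecutive rounds have failed.

    Each round is (primary_ok, secondary_ok). secondary_ok is None when no
    secondary check is configured.
    """
    consecutive = 0
    for primary_ok, secondary_ok in rounds:
        # With a secondary check, both endpoints must fail for the round to count.
        round_failed = (not primary_ok) and (secondary_ok is None or not secondary_ok)
        # Any healthy round resets the streak.
        consecutive = consecutive + 1 if round_failed else 0
        if consecutive >= failure_threshold:
            return True
    return False
```

Under this model, with the default 30-second interval and a threshold of 3, failover would begin roughly 60 to 90 seconds after the primary stops responding, ignoring probe timeouts.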
Recovery Runbook Scripts
Runbook scripts add application-level coordination to the automated recovery sequence. Scripts run before or after each resource group recovers.
Script Environment
Scripts receive the following environment variables from the XDR runtime:
| Variable | Value |
|---|---|
| XDR_PLAN_NAME | Name of the recovery plan |
| XDR_SITE | Name of the recovery site (DR site) |
| XDR_GROUP_NAME | Name of the current resource group |
| XDR_RESOURCE_IDS | Space-separated list of recovered resource IDs |
| XDR_EVENT_ID | Unique identifier for this failover event |
| XDR_RECOVERY_POINT | Timestamp of the recovery point used |
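A hook written in Python might collect these variables once at startup, as in the sketch below. `HookContext` and `load_context` are hypothetical names, not part of XDR. The recovery point is kept as a string because its timestamp format is not specified here.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class HookContext:
    plan_name: str
    site: str
    group_name: str
    resource_ids: List[str]
    event_id: str
    recovery_point: str  # left unparsed: timestamp format is an unknown here


def load_context(env: Mapping[str, str] = os.environ) -> HookContext:
    """Build a HookContext from XDR_* variables; raises KeyError if one is missing."""
    return HookContext(
        plan_name=env["XDR_PLAN_NAME"],
        site=env["XDR_SITE"],
        group_name=env["XDR_GROUP_NAME"],
        # Space-separated list; split() also copes with an empty value.
        resource_ids=env["XDR_RESOURCE_IDS"].split(),
        event_id=env["XDR_EVENT_ID"],
        recovery_point=env["XDR_RECOVERY_POINT"],
    )
```

Failing fast on a missing variable keeps a misconfigured hook from running with partial context.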
Script Requirements
- Must exit 0 for success — any non-zero exit halts recovery and triggers an alert
- Timeout: 300 seconds (configurable per hook)
- Must be idempotent — scripts may be retried after transient failures
- Output is captured in the runbook log — keep output concise and meaningful
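One way to meet the exit-code and idempotency requirements is a small wrapper that records completion per failover event. The sketch below is an assumption-laden example: `run_once` is a hypothetical helper, the marker-file approach and its directory are illustrative choices, and it assumes the event ID is safe to use in a file name.

```python
import sys
import tempfile
from pathlib import Path
from typing import Callable


def run_once(
    event_id: str,
    step: str,
    action: Callable[[], None],
    state_dir: Path = Path(tempfile.gettempdir()) / "xdr-hooks",  # hypothetical location
) -> int:
    """Run `action` at most once per (event, step) and return a process exit code."""
    state_dir.mkdir(parents=True, exist_ok=True)
    marker = state_dir / f"{event_id}.{step}.done"
    if marker.exists():
        # A retry after a transient failure elsewhere becomes a no-op.
        print(f"{step}: already completed for event {event_id}, skipping")
        return 0
    try:
        action()
    except Exception as exc:
        # A non-zero exit halts recovery and raises an alert, so report briefly.
        print(f"{step}: failed: {exc}", file=sys.stderr)
        return 1
    marker.touch()
    print(f"{step}: ok")
    return 0
```

A hook would end with `sys.exit(run_once(event_id, "update-dns", action))`, using the value of XDR_EVENT_ID as `event_id`. The marker is written only after the action succeeds, so a failed attempt is retried in full.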
Example Scripts
DNS update after recovery
post-recover: update internal DNS
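A post-recover DNS hook could be sketched as follows, assuming the internal zone accepts dynamic updates through the standard `nsupdate` tool. The function name, the zone, and the mapping from hostnames to DR addresses are hypothetical; how recovered resource IDs map to hostnames is site-specific.

```python
from typing import Mapping


def build_nsupdate_batch(records: Mapping[str, str], zone: str, ttl: int = 60) -> str:
    """Render nsupdate commands that point each hostname at its DR address."""
    lines = [f"zone {zone}"]
    for host, address in sorted(records.items()):
        # Delete-then-add leaves the same final record set when the hook is retried.
        lines.append(f"update delete {host} A")
        lines.append(f"update add {host} {ttl} A {address}")
    lines.append("send")
    return "\n".join(lines) + "\n"
```

The resulting text can be passed to `nsupdate` on standard input, for example with `subprocess.run(["nsupdate", "-k", keyfile], input=batch, text=True, check=True)`, where `keyfile` is a TSIG key path you supply. A short TTL keeps clients from caching the primary-site address for long.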
Service registry update
post-recover: update service registry
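A service registry hook might look like the sketch below. The registry here is a hypothetical HTTP service: the URL layout and JSON fields are invented for illustration and need to be replaced with your registry's actual API.

```python
import json
import urllib.request
from typing import List


def build_registry_request(
    base_url: str,
    service: str,
    site: str,
    resource_ids: List[str],
    event_id: str,
) -> urllib.request.Request:
    """Build a PUT that replaces the service's instance list with the DR instances."""
    body = json.dumps(
        {"site": site, "instances": sorted(resource_ids), "event_id": event_id}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/services/{service}",  # hypothetical endpoint
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Sending it is a single call, `urllib.request.urlopen(request, timeout=10)`. A full-replacement PUT is used so that a retried hook writes the same state rather than appending duplicates.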
Pre-failover notification
pre-failover: notify on-call team
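Because any non-zero exit halts recovery, a notification hook may be better written as best-effort so that an unreachable paging system does not block failover. The sketch below takes that approach as a design assumption. `send` is a hypothetical callable wrapping whatever paging or chat integration the team uses, and the function names are illustrative.

```python
import sys
from typing import Callable, Mapping


def format_notice(env: Mapping[str, str]) -> str:
    """Compose a one-line notice from the XDR_* variables."""
    return (
        f"Failover starting: plan={env.get('XDR_PLAN_NAME', '?')} "
        f"site={env.get('XDR_SITE', '?')} "
        f"event={env.get('XDR_EVENT_ID', '?')}"
    )


def notify_best_effort(send: Callable[[str], None], message: str) -> int:
    """Try to send the notice; always return exit code 0."""
    try:
        send(message)
    except Exception as exc:
        # Logged to the runbook log, but deliberately not fatal.
        print(f"notification failed (continuing): {exc}", file=sys.stderr)
    return 0
```

If the on-call team must acknowledge before failover proceeds, return a non-zero code on failure instead.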
Attaching Scripts to Resource Groups
In the dashboard, navigate to Disaster Recovery → Recovery Plans → [Plan] → [Group] → Hooks and click Add Hook. Paste the script content or reference a stored script path.
Replication Scheduling
For asynchronous replication, configure replication windows to control when replication traffic runs and how much bandwidth it consumes. Navigate to Disaster Recovery → Recovery Plans → [Plan] → Replication Schedule:
| Setting | Description |
|---|---|
| Mode | Continuous — replicate in real time; Scheduled — replicate in defined windows |
| Schedule | Cron expression for scheduled replication windows |
| Bandwidth Cap | Limit throughput during peak production hours |
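When sizing a scheduled window, it helps to estimate how long the changed data takes to transfer under the cap. The helper below is simple arithmetic, not an XDR function. It assumes the cap is expressed in megabits per second (the unit is not stated here) and ignores protocol overhead and compression.

```python
def transfer_seconds(changed_bytes: int, cap_mbit_per_s: float) -> float:
    """Time to ship `changed_bytes` at a bandwidth cap given in megabits per second."""
    if cap_mbit_per_s <= 0:
        raise ValueError("bandwidth cap must be positive")
    # bytes -> bits, then divide by bits per second
    return changed_bytes * 8 / (cap_mbit_per_s * 1_000_000)


# 45 GB of changed data at a 100 Mbit/s cap takes one hour.
one_hour = transfer_seconds(45_000_000_000, 100)  # 3600.0
```

If a window is shorter than this estimate, replication falls behind and the available recovery point ages.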
Testing Automation
Always validate runbook scripts with a DR test before relying on them in a real failover. Navigate to Disaster Recovery → Protection Plans → [Plan] and click Test Failover to exercise runbook scripts in an isolated environment. After the test completes, review the runbook execution log in Disaster Recovery → Test Sessions → [Session] → Runbook Log.
Next Steps
Recovery Plans
Configure resource groups and health checks
Monitoring
Alert on automation failures and replication lag
XDR User Guide — DR Testing
Run DR tests to validate runbook scripts and measure RTO
Troubleshooting
Diagnose runbook script failures and unexpected failovers