Documentation Index
Fetch the complete documentation index at: https://docs.xloud.tech/llms.txt
Use this file to discover all available pages before exploring further.
Overview
DR automation reduces manual intervention during failover by orchestrating the recovery sequence, updating external systems, and validating service health automatically. Well-tested automation is the difference between a 30-minute RTO and a 3-hour one.Prerequisites
- Recovery plans created and in
ACTIVEreplication status - Recovery runbook scripts tested against DR test instances before production use
- Automation hook scripts stored in a version-controlled location accessible from DR site
Automatic Failover Triggers
Configure health checks that trigger failover automatically when the primary site is unreachable, without requiring manual operator intervention.- Dashboard
- CLI
Navigate to Disaster Recovery → Recovery Plans → [Plan] → Automatic Triggers:
| Setting | Description |
|---|---|
| Health Check URL | Primary site endpoint to poll (e.g., load balancer VIP) |
| Check Interval | How often to poll (default: 30 seconds) |
| Failure Threshold | Consecutive failures before triggering (default: 3) |
| Secondary Check | Optional second endpoint — both must fail before triggering |
| Notification Channel | XIMP alert channel to notify when automatic failover begins |
Recovery Runbook Scripts
Runbook scripts add application-level coordination to the automated recovery sequence. Scripts run before or after each resource group recovers.Script Environment
Scripts receive the following environment variables from the XDR runtime:| Variable | Value |
|---|---|
XDR_PLAN_NAME | Name of the recovery plan |
XDR_SITE | Name of the recovery site (DR site) |
XDR_GROUP_NAME | Name of the current resource group |
XDR_RESOURCE_IDS | Space-separated list of recovered resource IDs |
XDR_EVENT_ID | Unique identifier for this failover event |
XDR_RECOVERY_POINT | Timestamp of the recovery point used |
Script Requirements
- Must exit 0 for success — any non-zero exit halts recovery and triggers an alert
- Timeout: 300 seconds (configurable per hook)
- Must be idempotent — scripts may be retried after transient failures
- Output is captured in the runbook log — keep output concise and meaningful
Example Scripts
DNS update after recovery
DNS update after recovery
post-recover: update internal DNS
Service registry update
Service registry update
post-recover: update service registry
Pre-failover notification
Pre-failover notification
pre-failover: notify on-call team
Attaching Scripts to Resource Groups
- Dashboard
- CLI
Navigate to Disaster Recovery → Recovery Plans → [Plan] → [Group] → Hooks
and click Add Hook. Paste the script content or reference a stored script path.
Replication Scheduling
For asynchronous replication, configure replication windows to control when replication traffic runs and how much bandwidth it consumes. Navigate to Disaster Recovery → Recovery Plans → [Plan] → Replication Schedule:| Setting | Description |
|---|---|
| Mode | Continuous — replicate in real time; Scheduled — replicate in defined windows |
| Schedule | Cron expression for scheduled replication windows |
| Bandwidth Cap | Limit throughput during peak production hours |
Testing Automation
Always validate runbook scripts with a DR test before relying on them in a real failover. Navigate to Disaster Recovery → Protection Plans → [Plan] and click Test Failover to exercise runbook scripts in an isolated environment. After the test completes, review the runbook execution log in Disaster Recovery → Test Sessions → [Session] → Runbook Log.Next Steps
Recovery Plans
Configure resource groups and health checks
Monitoring
Alert on automation failures and replication lag
XDR User Guide — DR Testing
Run DR tests to validate runbook scripts and measure RTO
Troubleshooting
Diagnose runbook script failures and unexpected failovers