DR Automation

Overview

DR automation reduces manual intervention during failover by orchestrating the recovery sequence, updating external systems, and validating service health automatically. Well-tested automation is the difference between a 30-minute RTO and a 3-hour one.

Prerequisites

Recovery plans created and in ACTIVE replication status
Recovery runbook scripts tested against DR test instances before production use
Automation hook scripts stored in a version-controlled location accessible from DR site

Automatic Failover Triggers

Configure health checks that trigger failover automatically when the primary site is unreachable, without requiring manual operator intervention.

Dashboard
CLI

Navigate to Disaster Recovery → Recovery Plans → [Plan] → Automatic Triggers:

Setting	Description
Health Check URL	Primary site endpoint to poll (e.g., load balancer VIP)
Check Interval	How often to poll (default: 30 seconds)
Failure Threshold	Consecutive failures before triggering (default: 3)
Secondary Check	Optional second endpoint — both must fail before triggering
Notification Channel	XIMP alert channel to notify when automatic failover begins

Automatic failover bypasses the manual confirmation step. Enable only for workloads where downtime cost exceeds the risk of an unnecessary failover. Configure the failure threshold high enough to avoid triggering on transient network blips.

Recovery Runbook Scripts

Runbook scripts add application-level coordination to the automated recovery sequence. Scripts run before or after each resource group recovers.

Script Environment

Scripts receive the following environment variables from the XDR runtime:

Variable	Value
`XDR_PLAN_NAME`	Name of the recovery plan
`XDR_SITE`	Name of the recovery site (DR site)
`XDR_GROUP_NAME`	Name of the current resource group
`XDR_RESOURCE_IDS`	Space-separated list of recovered resource IDs
`XDR_EVENT_ID`	Unique identifier for this failover event
`XDR_RECOVERY_POINT`	Timestamp of the recovery point used

Script Requirements

Must exit 0 for success — any non-zero exit halts recovery and triggers an alert
Timeout: 300 seconds (configurable per hook)
Must be idempotent — scripts may be retried after transient failures
Output is captured in the runbook log — keep output concise and meaningful

Example Scripts

DNS update after recovery

post-recover: update internal DNS

#!/bin/bash
# Updates internal DNS records to point to recovered DR instances
# Runs as: Post-Recover hook on databases group
# XDR injects environment variables at runtime

set -euo pipefail

# XDR_RESOURCE_IDS and XDR_SITE are provided by the XDR runtime
# Resource IPs are resolved from the recovery metadata
for id in $XDR_RESOURCE_IDS; do
  # Parse recovery metadata for IP and hostname
  ip=$(echo "$XDR_RECOVERY_METADATA" | jq -r ".resources[\"$id\"].ip")
  hostname=$(echo "$XDR_RECOVERY_METADATA" | jq -r ".resources[\"$id\"].hostname")

  nsupdate -k /etc/xavs/dns.key <<EOF
server 10.0.1.71
update delete ${hostname}.internal A
update add ${hostname}.internal 60 A ${ip}
send
EOF

  echo "Updated DNS: ${hostname}.internal -> ${ip}"
done

echo "DNS update complete for group: $XDR_GROUP_NAME"

Service registry update

post-recover: update service registry

#!/bin/bash
# Registers recovered instances in the internal service registry
# Runs as: Post-Recover hook on app-servers group
# XDR injects environment variables at runtime

set -euo pipefail

REGISTRY_URL="http://service-registry.internal:8500"

# XDR_RESOURCE_IDS and XDR_RECOVERY_METADATA are provided by the XDR runtime
for id in $XDR_RESOURCE_IDS; do
  ip=$(echo "$XDR_RECOVERY_METADATA" | jq -r ".resources[\"$id\"].ip")
  service=$(echo "$XDR_RECOVERY_METADATA" | jq -r ".resources[\"$id\"].tags[\"service-name\"]")

  curl -sf -X PUT "${REGISTRY_URL}/v1/agent/service/register" \
    -H "Content-Type: application/json" \
    -d "{\"ID\": \"${id}\", \"Name\": \"${service}\", \"Address\": \"${ip}\"}"

  echo "Registered: $service at $ip"
done

Pre-failover notification

pre-failover: notify on-call team

#!/bin/bash
# Sends alert to on-call channel before failover begins
# Runs as: Pre-Failover hook on all groups

WEBHOOK_URL="https://hooks.<your-domain>/alert"

curl -sf -X POST "$WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d "{
    \"event\": \"DR_FAILOVER_STARTING\",
    \"plan\": \"$XDR_PLAN_NAME\",
    \"site\": \"$XDR_SITE\",
    \"event_id\": \"$XDR_EVENT_ID\",
    \"recovery_point\": \"$XDR_RECOVERY_POINT\"
  }"

echo "On-call notified for failover event $XDR_EVENT_ID"

Attaching Scripts to Resource Groups

Dashboard
CLI

Navigate to Disaster Recovery → Recovery Plans → [Plan] → [Group] → Hooks and click Add Hook. Paste the script content or reference a stored script path.

Replication Scheduling

For asynchronous replication, configure replication windows to control when replication traffic runs and how much bandwidth it consumes. Navigate to Disaster Recovery → Recovery Plans → [Plan] → Replication Schedule:

Setting	Description
Mode	`Continuous` — replicate in real time; `Scheduled` — replicate in defined windows
Schedule	Cron expression for scheduled replication windows
Bandwidth Cap	Limit throughput during peak production hours

Configure these settings directly in the replication schedule panel for each recovery plan.

Scheduled replication increases RPO to the interval between replication windows. Use continuous replication for any workload with an RPO shorter than the replication interval.

Testing Automation

Always validate runbook scripts with a DR test before relying on them in a real failover. Navigate to Disaster Recovery → Protection Plans → [Plan] and click Test Failover to exercise runbook scripts in an isolated environment. After the test completes, review the runbook execution log in Disaster Recovery → Test Sessions → [Session] → Runbook Log.

After any change to a runbook script, run a DR test immediately to confirm the updated script completes successfully. A script that fails silently during a test will halt recovery during an actual disaster.

Next Steps

Recovery Plans

Configure resource groups and health checks

Monitoring

Alert on automation failures and replication lag

XDR User Guide — DR Testing

Run DR tests to validate runbook scripts and measure RTO

Troubleshooting

Diagnose runbook script failures and unexpected failovers

Core Services

Other Services

DR Automation

Overview

Automatic Failover Triggers

Recovery Runbook Scripts

Script Environment

Script Requirements

Example Scripts

Attaching Scripts to Resource Groups

Replication Scheduling

Testing Automation

Next Steps

Recovery Plans

Monitoring

XDR User Guide — DR Testing

Troubleshooting

Core Services

Other Services

Documentation Index

​Overview

​Automatic Failover Triggers

​Recovery Runbook Scripts

​Script Environment

​Script Requirements

​Example Scripts

​Attaching Scripts to Resource Groups

​Replication Scheduling

​Testing Automation

​Next Steps

Recovery Plans

Monitoring

XDR User Guide — DR Testing

Troubleshooting

Overview

Automatic Failover Triggers

Recovery Runbook Scripts

Script Environment

Script Requirements

Example Scripts

Attaching Scripts to Resource Groups

Replication Scheduling

Testing Automation

Next Steps