Skip to main content

Overview

Continuous monitoring of XDR replication is critical — a silent replication failure discovered only during an actual disaster event can mean data loss far beyond the configured RPO. Integrate XDR with XIMP to surface replication lag, site health, and DR readiness metrics before they become incidents.
Prerequisites
  • XIMP deployed and agents active on both primary and DR sites
  • XDR controller API accessible from the XIMP collector
  • Protection plans in ACTIVE status

Key Metrics

MetricDescriptionAlert Threshold
xdr_replication_lag_secondsCurrent replication lag per plan> RPO target
xdr_site_healthSite availability statusUNREACHABLE
xdr_plan_statusPlan replication stateNot ACTIVE
xdr_last_test_age_daysDays since last DR test for a plan> 90 days
xdr_recovery_point_countNumber of available recovery points< minimum configured
xdr_replication_throughput_bytesReplication throughput per linkN/A (trending)
xdr_sync_progress_percentInitial sync progress (0–100)< 100 after 48h
xdr_link_latency_msRound-trip latency between sites> site-specific threshold

XIMP Dashboard

XDR includes a pre-built XIMP dashboard showing all protection plans and their current replication status. Navigate to Monitoring → Dashboards → XDR Overview. The dashboard provides:
  • Per-plan replication lag (sparkline, 24h history)
  • Site health indicator for all registered sites
  • RPO compliance percentage per plan (7d rolling)
  • Last DR test date and outcome per plan
  • Active failover events (if any)

Alert Rules

Configure XIMP alert rules to notify operations teams before replication problems impact RPO compliance.
Alert before lag actually exceeds the RPO target — early warning allows investigation before data loss risk is realized.Navigate to Monitoring → Alerting → Alert Rules → Create Rule:
Name: XDR Replication Lag Warning
Condition: xdr_replication_lag_seconds > (xdr_rpo_target_seconds * 0.75)
For: 5 minutes
Severity: Warning
Channel: ops-alerts
Name: XDR Replication Lag Critical
Condition: xdr_replication_lag_seconds > xdr_rpo_target_seconds
For: 2 minutes
Severity: Critical
Channel: ops-pagerduty
Alert immediately if a protection plan transitions out of ACTIVE status — this means replication has stopped and the DR site data is not being updated.
Name: XDR Plan Not Active
Condition: xdr_plan_status != "ACTIVE"
For: 1 minute
Severity: Critical
Channel: ops-pagerduty
Alert when a site becomes unreachable — could indicate the primary site has failed or network connectivity between sites has been lost.
Name: XDR Site Unreachable
Condition: xdr_site_health == "UNREACHABLE"
For: 2 minutes
Severity: Critical
Channel: ops-pagerduty
Alert when a protection plan has not been tested within the configured interval. DR plans that are never tested cannot be relied upon during an actual disaster.
Name: XDR Test Overdue
Condition: xdr_last_test_age_days > 90
For: immediate
Severity: Warning
Channel: ops-alerts
Navigate to Monitoring → Alerting → Alert Rules and create a rule sourcing the xdr_last_test_age_days metric to ensure tests are not missed.

Diagnostic Views

All diagnostic information is available through the XDR Dashboard:
DiagnosticDashboard Location
Replication lag across all plansDisaster Recovery → Protection Plans — lag column in the plan list
Replication health historyDisaster Recovery → Protection Plans → [Plan] — replication lag sparkline (24h history)
Site health statusDisaster Recovery → Sites — health indicator per site
Link throughput statisticsDisaster Recovery → Sites → Replication Links → [Link] — throughput and latency metrics

Replication Health Thresholds

Use these thresholds when configuring XIMP alert rules:
MetricHealthyWarningCritical
Replication lag< 50% of RPO target50–100% of RPO target> RPO target
Plan statusACTIVEDEGRADEDFAILED / STOPPED
Site healthCONNECTEDDEGRADEDUNREACHABLE
Last recovery point age< RPO targetRPO target to 2× RPO> 2× RPO target
Sync progress (initial)IncreasingStalled > 30 minNo progress > 2h

Log Collection

XDR agent and controller logs are forwarded to XIMP log analytics automatically when agents are deployed via XDeploy. Query logs in Monitoring → Log Explorer:
Log SourceQuery Pattern
XDR controllerservice: xdr-controller
XDR agent (primary)service: xdr-agent AND site: primary-dc1
XDR agent (DR)service: xdr-agent AND site: dr-site-a
Failover eventsservice: xdr-controller AND event_type: failover
Runbook scriptsservice: xdr-runbook AND plan: prod-database-dr

Next Steps

XIMP Admin Guide — Alert Channels

Configure notification channels for XDR alerts

Compliance

Generate RPO/RTO compliance reports from monitoring history

DR Automation

Configure automatic failover on site health alerts

Troubleshooting

Diagnose replication lag and plan health issues