Overview
Xloud Instance HA delivers zero-touch recovery for protected compute workloads. When a compute host becomes unreachable, the service detects the fault, identifies all protected instances on that host, and automatically evacuates them to healthy nodes without manual intervention. This page explains the end-to-end detection and recovery flow.
Prerequisites
- An active Xloud account
- Instance HA enabled on your platform by an administrator
- At least one failover segment configured with registered compute hosts
Core Components
Host Monitor
Continuously polls compute hosts using IPMI out-of-band management or SSH.
Declares a host unreachable when it fails to respond within the configured timeout.
Notification Engine
Receives fault signals from monitors, deduplicates events, and routes them to
the Recovery Engine as structured notifications.
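The deduplication step can be pictured with a minimal sketch. It assumes that two fault signals for the same host and fault type within a fixed time window count as one event. The class name, field names, and the 60-second window are hypothetical and are not taken from the Xloud implementation.

```python
from dataclasses import dataclass, field


@dataclass
class NotificationEngine:
    """Illustrative deduplicating router (hypothetical, not the Xloud API)."""

    window_seconds: float = 60.0  # assumed deduplication window
    _last_forwarded: dict = field(default_factory=dict)
    forwarded: list = field(default_factory=list)

    def receive(self, host: str, fault_type: str, timestamp: float) -> bool:
        """Forward the signal unless an identical one was forwarded recently.

        Returns True if a structured notification was forwarded,
        False if the signal was dropped as a duplicate.
        """
        key = (host, fault_type)
        last = self._last_forwarded.get(key)
        if last is not None and timestamp - last < self.window_seconds:
            return False  # duplicate inside the window: drop it
        self._last_forwarded[key] = timestamp
        # Structured notification handed to the Recovery Engine.
        self.forwarded.append(
            {"host": host, "type": fault_type, "time": timestamp}
        )
        return True
```

In this sketch the window is measured from the last forwarded notification, so a host that keeps failing produces one notification per window rather than one per poll.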
Recovery Engine
The decision-making core. Identifies protected instances on the failed host and
determines the evacuation target based on the segment’s recovery method.
Compute API
Executes the evacuation. Instances are restarted on the selected healthy host
using the same image, volume, and network configuration.
Recovery Flow
The recovery process starts within seconds of fault detection. No human action is required for instances enrolled in an active protection segment.
Detection Methods
IPMI (Out-of-Band)
The preferred detection method. The host monitor connects to the server’s IPMI
interface — which operates independently of the host OS — to verify whether the
physical node is powered and responsive.

IPMI detection is more reliable than SSH because it does not depend on the host
network stack or OS. A host that has kernel-panicked or lost all network interfaces
is still detectable via IPMI.
| Advantage | Disadvantage |
|---|---|
| Works even when OS is unresponsive | Requires IPMI hardware and network access |
| Detects power failures | Requires IPMI credentials per host |
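As an illustration of an out-of-band power check, the sketch below shells out to the `ipmitool` CLI and parses its `chassis power status` output. Whether the Xloud host monitor uses `ipmitool` is an assumption. The function names are hypothetical, and the password is expected in the `IPMI_PASSWORD` environment variable, which is what the `-E` flag reads.

```python
import subprocess


def parse_power_status(output: str) -> bool:
    """Return True if ipmitool reported the chassis as powered on.

    ipmitool prints a line such as "Chassis Power is on".
    """
    return output.strip().lower().endswith("is on")


def chassis_powered_on(bmc_host: str, user: str, timeout: float = 10.0) -> bool:
    """Query the BMC over the network for the chassis power state.

    Raises subprocess.TimeoutExpired or subprocess.CalledProcessError when
    the BMC cannot be reached; callers should treat that separately from
    a clean "powered off" answer.
    """
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E",
         "chassis", "power", "status"],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return parse_power_status(result.stdout)
```

Because the query goes to the management controller rather than the host OS, it still returns an answer when the OS has crashed, which matches the first advantage in the table above.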
SSH (In-Band)
The SSH monitor attempts a TCP connection to the host on port 22. Use this method
when IPMI hardware is unavailable.

SSH monitoring is susceptible to false positives caused by SSH service restarts,
temporary network partitions, or high host load. The monitor implements a configurable
retry interval to reduce spurious alerts.
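A minimal sketch of this in-band check: attempt a TCP connection to port 22 and declare the host down only if every retry fails. The function names, retry count, interval, and timeout defaults are hypothetical and do not reflect the actual Xloud settings.

```python
import socket
import time


def ssh_port_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def host_is_down(host: str, port: int = 22, retries: int = 3,
                 retry_interval: float = 5.0, timeout: float = 3.0) -> bool:
    """Declare the host down only if all attempts fail.

    Waiting retry_interval seconds between attempts filters out short
    outages such as an SSH service restart.
    """
    for attempt in range(retries):
        if ssh_port_reachable(host, port, timeout):
            return False
        if attempt < retries - 1:
            time.sleep(retry_interval)
    return True
```

A longer retry interval lowers the false-positive rate at the cost of slower detection, so the two values are usually tuned together.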
| Advantage | Disadvantage |
|---|---|
| No special hardware required | Dependent on host network and OS |
| Easy to deploy | May miss physical hardware failures |
Recovery Methods
Each failover segment uses one of three recovery methods. Your administrator selects the method when creating the segment.
| Method | Behaviour | Best Suited For |
|---|---|---|
| auto | Evacuate to any healthy host in the segment | General workloads |
| reserved_host | Evacuate only to pre-designated standby hosts | SLA-critical workloads |
| rh_priority | Prefer reserved hosts, fall back to any host | Mixed environments |
The recovery method is configured per segment by your administrator through XDeploy.
Contact your administrator to find out which method applies to your protection segment.
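The three methods differ only in which pool of hosts is eligible as an evacuation target. The sketch below shows one way to express that rule. The `Host` type and `pick_target` function are hypothetical, and returning the first eligible host is a placeholder for whatever scheduling the platform actually performs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Host:
    name: str
    healthy: bool
    reserved: bool = False  # pre-designated standby host


def pick_target(method: str, hosts: Sequence[Host],
                failed: str) -> Optional[Host]:
    """Return an evacuation target in the segment, or None if none qualifies."""
    candidates = [h for h in hosts if h.healthy and h.name != failed]
    reserved = [h for h in candidates if h.reserved]

    if method == "auto":
        pool = candidates              # any healthy host
    elif method == "reserved_host":
        pool = reserved                # standby hosts only
    elif method == "rh_priority":
        pool = reserved or candidates  # standby first, then any host
    else:
        raise ValueError(f"unknown recovery method: {method}")

    # Placeholder choice: a real scheduler would weigh capacity and load.
    return pool[0] if pool else None
```

With `reserved_host`, the function returns `None` when no standby host is healthy, which reflects the trade-off in the table: stricter placement guarantees, but recovery depends on standby capacity being available.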
Instance Lifecycle During Recovery
The instance ID, name, attached volumes, and network configuration are preserved
across the recovery. Only the physical host changes.
Next Steps
Protection Segments
View and understand the protection segments available for your instances.
Recovery Workflows
Understand recovery methods and how the engine selects evacuation targets.
Monitoring Status
Track active and historical recovery events in the Dashboard and CLI.
Instance HA Overview
Return to the Instance HA service overview page.