Documentation Index
Fetch the complete documentation index at: https://docs.xloud.tech/llms.txt
Use this file to discover all available pages before exploring further.
Overview
Xloud Instance HA delivers zero-touch recovery for protected compute workloads. When a compute host becomes unreachable, the service detects the fault, identifies all protected instances on that host, and automatically evacuates them to healthy nodes — without any manual intervention. This page explains the end-to-end detection and recovery flow.Prerequisites
- An active Xloud account
- Instance HA enabled on your platform by an administrator
- At least one failover segment configured with registered compute hosts
Core Components
Host Monitor
Continuously polls compute hosts using IPMI out-of-band management or SSH.
Declares a host unreachable when it fails to respond within the configured timeout.
Notification Engine
Receives fault signals from monitors, deduplicates events, and routes them to
the Recovery Engine as structured notifications.
Recovery Engine
The decision-making core. Identifies protected instances on the failed host and
determines the evacuation target based on the segment’s recovery method.
Compute API
Executes the evacuation. Instances are restarted on the selected healthy host
using the same image, volume, and network configuration.
Recovery Flow
The recovery process starts within seconds of fault detection. No human action is required for instances enrolled in an active protection segment.Detection Methods
IPMI (Out-of-Band)
IPMI (Out-of-Band)
The preferred detection method. The host monitor connects to the server’s IPMI
interface — which operates independently of the host OS — to verify whether the
physical node is powered and responsive.IPMI detection is more reliable than SSH because it does not depend on the host
network stack or OS. A host that has kernel-panicked or lost all network interfaces
is still detectable via IPMI.
| Advantage | Disadvantage |
|---|---|
| Works even when OS is unresponsive | Requires IPMI hardware and network access |
| Detects power failures | Requires IPMI credentials per host |
SSH (In-Band)
SSH (In-Band)
The SSH monitor attempts a TCP connection to the host on port 22. Use this method
when IPMI hardware is unavailable.SSH monitoring is susceptible to false positives caused by SSH service restarts,
temporary network partitions, or high host load. The monitor implements a configurable
retry interval to reduce spurious alerts.
| Advantage | Disadvantage |
|---|---|
| No special hardware required | Dependent on host network and OS |
| Easy to deploy | May miss physical hardware failures |
Pacemaker
Pacemaker
Uses Pacemaker cluster monitoring to detect node failures. Pacemaker tracks cluster
membership and triggers a notification when a node is fenced or goes offline.This method integrates with existing Pacemaker/Corosync clusters and leverages
STONITH fencing for reliable failure detection.
| Advantage | Disadvantage |
|---|---|
| Integrates with existing cluster infrastructure | Requires Pacemaker/Corosync setup |
| Reliable fencing-based detection | More complex configuration |
Recovery Methods
Each failover segment uses one of four recovery methods. The method is selected when creating the segment.| Method | Behaviour | Best Suited For |
|---|---|---|
auto | Evacuate to any healthy host in the segment | General workloads |
auto_priority | Evacuate using priority-based host selection | Workloads with preferred targets |
reserved_host | Evacuate only to pre-designated standby hosts | SLA-critical workloads |
rh_priority | Prefer reserved hosts, fall back to any host | Mixed environments |
The recovery method is configured per segment by your administrator. Contact your
administrator to understand which method applies to your protection segment. Your
administrator can configure this through XDeploy.
Notification Types
Instance HA generates different notification types depending on the source of the failure:| Type | Color | Description |
|---|---|---|
| COMPUTE_HOST | Red | Physical compute host failure detected by the host monitor |
| VM | Orange | Individual VM failure detected by the instance monitor |
| PROCESS | Blue | Service process failure (e.g., nova-compute crash) |
| pacemaker | Purple | Failure detected by Pacemaker cluster monitoring |
Instance Lifecycle During Recovery
The instanceID, name, attached volumes, and network configuration are preserved
across the recovery. Only the physical host changes.
Monitoring in the Dashboard
The Xloud Dashboard provides dedicated pages for monitoring Instance HA:| Page | Path | Purpose |
|---|---|---|
| Segments | Instance HA > Segments | View and manage failover segments and hosts |
| Hosts | Instance HA > Hosts | View all registered hosts across all segments |
| Notifications | Instance HA > Notifications | Track recovery events with real-time progress |
| VM Moves | Instance HA > VM Moves | View all VM evacuations across all notifications |
Next Steps
Protection Segments
Create segments, add hosts, and configure recovery methods
Recovery Workflows
Understand recovery methods and how the engine selects evacuation targets
Monitoring Status
Track active and historical recovery events in the Dashboard
Instance HA Overview
Return to the Instance HA service overview page