Overview

Xloud Instance HA delivers zero-touch recovery for protected compute workloads. When a compute host becomes unreachable, the service detects the fault, identifies all protected instances on that host, and automatically evacuates them to healthy nodes without manual intervention. This page explains the end-to-end detection and recovery flow.

Prerequisites
  • An active Xloud account
  • Instance HA enabled on your platform by an administrator
  • At least one failover segment configured with registered compute hosts

Core Components

Host Monitor

Continuously polls compute hosts using IPMI out-of-band management or SSH. Declares a host unreachable when it fails to respond within the configured timeout.

Notification Engine

Receives fault signals from monitors, deduplicates events, and routes them to the Recovery Engine as structured notifications.
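
The deduplication step can be pictured with a small sketch. This is not the Xloud implementation: the `FaultSignal` and `Deduplicator` names, the `(host, kind)` key, and the 60-second window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultSignal:
    host: str         # name of the affected compute host
    kind: str         # e.g. "COMPUTE_HOST", "VM", "PROCESS"
    timestamp: float  # seconds since the epoch


class Deduplicator:
    """Drop repeat signals for the same (host, kind) inside a time window.

    Hypothetical helper: the window length and the key are assumptions.
    """

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._last_accepted: dict[tuple[str, str], float] = {}

    def accept(self, signal: FaultSignal) -> bool:
        """Return True if the signal should be forwarded, False if it is a duplicate."""
        key = (signal.host, signal.kind)
        last = self._last_accepted.get(key)
        if last is not None and signal.timestamp - last < self.window_seconds:
            return False  # same fault reported again within the window
        self._last_accepted[key] = signal.timestamp
        return True
```

Only signals for which `accept` returns `True` would be forwarded to the Recovery Engine, so two monitors reporting the same host failure a few seconds apart produce a single notification.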

Recovery Engine

The decision-making core. Identifies protected instances on the failed host and determines the evacuation target based on the segment’s recovery method.

Compute API

Executes the evacuation. Instances are restarted on the selected healthy host using the same image, volume, and network configuration.

Recovery Flow

The recovery process starts within seconds of fault detection. No human action is required for instances enrolled in an active protection segment.

Detection Methods

IPMI (Out-of-Band)

The preferred detection method. The host monitor connects to the server’s IPMI interface, which operates independently of the host OS, to verify whether the physical node is powered and responsive.

IPMI detection is more reliable than SSH because it does not depend on the host network stack or OS. A host that has kernel-panicked or lost all network interfaces is still detectable via IPMI.

| Advantage | Disadvantage |
| --- | --- |
| Works even when OS is unresponsive | Requires IPMI hardware and network access |
| Detects power failures | Requires IPMI credentials per host |
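
To make the out-of-band check concrete, here is a minimal sketch that queries a host's power state with the standard `ipmitool` CLI. Whether the Xloud host monitor uses `ipmitool` is not stated on this page, so treat the tool choice, the function names, and the timeout value as assumptions.

```python
import os
import subprocess


def build_ipmi_command(bmc_address: str, username: str) -> list[str]:
    """Build an ipmitool call; -E reads the password from IPMI_PASSWORD."""
    return [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_address, "-U", username, "-E",
        "chassis", "power", "status",
    ]


def parse_power_status(output: str) -> str:
    """Map ipmitool output such as 'Chassis Power is on' to 'on', 'off', or 'unknown'."""
    text = output.strip().lower()
    if text.endswith("is on"):
        return "on"
    if text.endswith("is off"):
        return "off"
    return "unknown"


def check_power(bmc_address: str, username: str, password: str,
                timeout_seconds: float = 10.0) -> str:
    """Query the BMC and return 'on', 'off', or 'unknown' (hypothetical helper)."""
    env = dict(os.environ, IPMI_PASSWORD=password)
    try:
        result = subprocess.run(
            build_ipmi_command(bmc_address, username),
            capture_output=True, text=True,
            timeout=timeout_seconds, env=env, check=False,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return "unknown"  # BMC did not answer in time, or ipmitool is not installed
    if result.returncode != 0:
        return "unknown"
    return parse_power_status(result.stdout)
```

Because the request goes to the BMC rather than the host OS, a result of `"off"` or `"unknown"` is still obtainable when the operating system itself has stopped responding.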

SSH

The SSH monitor attempts a TCP connection to the host on port 22. Use this method when IPMI hardware is unavailable.

SSH monitoring is susceptible to false positives caused by SSH service restarts, temporary network partitions, or high host load. The monitor implements a configurable retry interval to reduce spurious alerts.

| Advantage | Disadvantage |
| --- | --- |
| No special hardware required | Dependent on host network and OS |
| Easy to deploy | May miss physical hardware failures |
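
The TCP probe and its retry behaviour can be sketched as follows. This is an illustration only: the function names, the number of attempts, and the default intervals are assumptions, not Xloud configuration values.

```python
import socket
import time


def tcp_port_open(host: str, port: int = 22, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def host_unreachable(host: str, port: int = 22, attempts: int = 3,
                     retry_interval_seconds: float = 5.0,
                     timeout_seconds: float = 3.0) -> bool:
    """Declare the host unreachable only if every attempt fails."""
    for attempt in range(attempts):
        if tcp_port_open(host, port, timeout_seconds):
            return False  # one successful connection clears the alarm
        if attempt < attempts - 1:
            time.sleep(retry_interval_seconds)  # wait before retrying
    return True
```

Requiring several consecutive failures is what keeps a brief SSH service restart from being reported as a host failure, at the cost of a longer detection time.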

Pacemaker

Uses Pacemaker cluster monitoring to detect node failures. Pacemaker tracks cluster membership and triggers a notification when a node is fenced or goes offline.

This method integrates with existing Pacemaker/Corosync clusters and leverages STONITH fencing for reliable failure detection.

| Advantage | Disadvantage |
| --- | --- |
| Integrates with existing cluster infrastructure | Requires Pacemaker/Corosync setup |
| Reliable fencing-based detection | More complex configuration |

Recovery Methods

Each failover segment uses one of four recovery methods. The method is selected when creating the segment.

| Method | Behaviour | Best Suited For |
| --- | --- | --- |
| auto | Evacuate to any healthy host in the segment | General workloads |
| auto_priority | Evacuate using priority-based host selection | Workloads with preferred targets |
| reserved_host | Evacuate only to pre-designated standby hosts | SLA-critical workloads |
| rh_priority | Prefer reserved hosts, fall back to any host | Mixed environments |

The recovery method is configured per segment by your administrator through XDeploy. Contact your administrator to find out which method applies to your protection segment.
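
The four recovery methods can be read as four ways of narrowing the list of candidate hosts. The sketch below is a simplified model, not the Recovery Engine's actual logic: the `Host` type, its `priority` field, and the rule that `auto_priority` picks the highest priority value are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Host:
    name: str
    healthy: bool = True
    reserved: bool = False  # pre-designated standby host
    priority: int = 0       # hypothetical field; higher means preferred


def select_target(method: str, hosts: list[Host], failed_host: str) -> Optional[Host]:
    """Pick an evacuation target for the given recovery method, or None if no host fits."""
    candidates = [h for h in hosts if h.healthy and h.name != failed_host]
    reserved = [h for h in candidates if h.reserved]

    if method == "auto":
        pool = candidates                  # any healthy host in the segment
    elif method == "auto_priority":
        # Assumption: "priority-based" means highest priority value first.
        pool = sorted(candidates, key=lambda h: h.priority, reverse=True)
    elif method == "reserved_host":
        pool = reserved                    # standby hosts only
    elif method == "rh_priority":
        pool = reserved or candidates      # prefer standby, else any healthy host
    else:
        raise ValueError(f"unknown recovery method: {method}")

    return pool[0] if pool else None
```

The practical difference shows up when no standby host is available: under this model `reserved_host` returns no target and the instance stays down, while `rh_priority` falls back to the remaining healthy hosts.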

Notification Types

Instance HA generates different notification types depending on the source of the failure:

| Type | Color | Description |
| --- | --- | --- |
| COMPUTE_HOST | Red | Physical compute host failure detected by the host monitor |
| VM | Orange | Individual VM failure detected by the instance monitor |
| PROCESS | Blue | Service process failure (e.g., nova-compute crash) |
| pacemaker | Purple | Failure detected by Pacemaker cluster monitoring |

Instance Lifecycle During Recovery

The instance ID, name, attached volumes, and network configuration are preserved across the recovery. Only the physical host changes.

Monitoring in the Dashboard

The Xloud Dashboard provides dedicated pages for monitoring Instance HA:

| Page | Path | Purpose |
| --- | --- | --- |
| Segments | Instance HA > Segments | View and manage failover segments and hosts |
| Hosts | Instance HA > Hosts | View all registered hosts across all segments |
| Notifications | Instance HA > Notifications | Track recovery events with real-time progress |
| VM Moves | Instance HA > VM Moves | View all VM evacuations across all notifications |

The Notifications detail page includes a Recovery Progress tab that shows real-time VM evacuation status with auto-refresh every 5 seconds during active recovery. See Monitoring Status for details.

Next Steps

  • Protection Segments: Create segments, add hosts, and configure recovery methods
  • Recovery Workflows: Understand recovery methods and how the engine selects evacuation targets
  • Monitoring Status: Track active and historical recovery events in the Dashboard
  • Instance HA Overview: Return to the Instance HA service overview page