
Overview

Xloud Instance HA delivers zero-touch recovery for protected compute workloads. When a compute host becomes unreachable, the service detects the fault, identifies all protected instances on that host, and evacuates them to healthy nodes without manual intervention. This page explains the end-to-end detection and recovery flow.

Prerequisites
  • An active Xloud account
  • Instance HA enabled on your platform by an administrator
  • At least one failover segment configured with registered compute hosts

Core Components

Host Monitor

Continuously polls compute hosts using IPMI out-of-band management or SSH. Declares a host unreachable when it fails to respond within the configured timeout.

Notification Engine

Receives fault signals from monitors, deduplicates events, and routes them to the Recovery Engine as structured notifications.
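
The deduplication step can be pictured with a minimal sketch. It assumes that repeat signals for the same host and event type inside a fixed time window count as duplicates; the `Deduplicator` class, the 60-second window, and the event names are hypothetical and are not taken from the Xloud implementation.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Deduplicator:
    """Suppress repeat fault signals for the same host inside a time window."""

    window_seconds: float = 60.0  # hypothetical default, not an Xloud setting
    _last_seen: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def accept(self, host: str, event_type: str, now: Optional[float] = None) -> bool:
        """Return True if the event should be forwarded, False if it is a duplicate."""
        now = time.monotonic() if now is None else now
        key = (host, event_type)
        last = self._last_seen.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # same host and event type seen recently: drop it
        self._last_seen[key] = now  # remember when this event was last forwarded
        return True
```

Only forwarded events reset the window here, so a host that keeps failing is reported again once the window expires.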

Recovery Engine

The decision-making core. Identifies protected instances on the failed host and determines the evacuation target based on the segment’s recovery method.

Compute API

Executes the evacuation. Instances are restarted on the selected healthy host using the same image, volume, and network configuration.

Recovery Flow

The recovery process starts within seconds of fault detection. No human action is required for instances enrolled in an active protection segment.

Detection Methods

IPMI (Out-of-Band)

The preferred detection method. The host monitor connects to the server’s IPMI interface, which operates independently of the host OS, to verify whether the physical node is powered and responsive.

IPMI detection is more reliable than SSH because it does not depend on the host network stack or OS. A host that has kernel-panicked or lost all network interfaces is still detectable via IPMI.

Advantage                          | Disadvantage
Works even when OS is unresponsive | Requires IPMI hardware and network access
Detects power failures             | Requires IPMI credentials per host
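
As a rough illustration of an out-of-band power check, the sketch below shells out to the `ipmitool` CLI. This is an assumption for illustration only: the page does not say which IPMI client the host monitor uses. The function names, the `lanplus` interface choice, and the timeout are hypothetical, and the password is expected in the `IPMI_PASSWORD` environment variable (read by `ipmitool -E`).

```python
import subprocess
from typing import Optional


def parse_power_status(output: str) -> bool:
    """Return True if ipmitool output such as 'Chassis Power is on' reports power on."""
    return output.strip().lower().endswith("is on")


def ipmi_power_is_on(bmc_host: str, user: str, timeout: float = 5.0) -> Optional[bool]:
    """Query chassis power over IPMI; return None if the BMC cannot be queried."""
    cmd = [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_host, "-U", user,
        "-E",  # take the password from the IPMI_PASSWORD environment variable
        "chassis", "power", "status",
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, check=True
        )
    except (subprocess.SubprocessError, OSError):
        # Timeout, non-zero exit status, or ipmitool not installed.
        return None
    return parse_power_status(result.stdout)
```

Returning `None` separately from `False` keeps "the BMC did not answer" distinct from "the node is powered off", which a monitor would likely treat differently.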

SSH

The SSH monitor attempts a TCP connection to the host on port 22. Use this method when IPMI hardware is unavailable.

SSH monitoring is susceptible to false positives caused by SSH service restarts, temporary network partitions, or high host load. The monitor implements a configurable retry interval to reduce spurious alerts.

Advantage                    | Disadvantage
No special hardware required | Dependent on host network and OS
Easy to deploy               | May miss physical hardware failures
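
The TCP probe and retry behaviour described above can be sketched as follows. The function names, the attempt count, and the interval values are hypothetical; the page only states that the monitor connects to port 22 and uses a configurable retry interval.

```python
import socket
import time


def port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def host_unreachable(host, attempts=3, retry_interval=5.0,
                     probe=port_open, sleep=time.sleep) -> bool:
    """Declare a host unreachable only after every probe attempt has failed."""
    for attempt in range(attempts):
        if probe(host):
            return False  # one successful probe is enough to clear the host
        if attempt < attempts - 1:
            sleep(retry_interval)  # wait before retrying to ride out brief blips
    return True
```

Requiring several consecutive failures is one simple way to reduce false positives from a short SSH restart or a momentary network partition, at the cost of slower detection.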

Recovery Methods

Each failover segment uses one of three recovery methods. Your administrator selects the method when creating the segment.

Method        | Behaviour                                     | Best Suited For
auto          | Evacuate to any healthy host in the segment   | General workloads
reserved_host | Evacuate only to pre-designated standby hosts | SLA-critical workloads
rh_priority   | Prefer reserved hosts, fall back to any host  | Mixed environments

The recovery method is configured per segment by your administrator through XDeploy. Contact your administrator to find out which method applies to your protection segment.
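
The three methods can be compared with a small selection sketch. It assumes that "any healthy host" under `auto` includes reserved hosts and that the first eligible host is chosen; the `Host` type and `select_target` function are hypothetical, and the real engine may rank candidates differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Host:
    name: str
    healthy: bool
    reserved: bool = False  # True for pre-designated standby hosts


def select_target(method: str, hosts: List[Host], failed: str) -> Optional[Host]:
    """Pick an evacuation target in a segment, or None if no host qualifies."""
    candidates = [h for h in hosts if h.healthy and h.name != failed]
    reserved = [h for h in candidates if h.reserved]
    if method == "auto":
        pool = candidates              # any healthy host in the segment
    elif method == "reserved_host":
        pool = reserved                # standby hosts only
    elif method == "rh_priority":
        pool = reserved or candidates  # prefer standby, fall back to any host
    else:
        raise ValueError(f"unknown recovery method: {method}")
    return pool[0] if pool else None
```

With `reserved_host`, a segment whose standby hosts are all unhealthy yields no target, which is the trade-off against `rh_priority`.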

Instance Lifecycle During Recovery

The instance ID, name, attached volumes, and network configuration are preserved across the recovery. Only the physical host changes.
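
A small data-model sketch makes the invariant concrete: evacuation changes the host field and nothing else. The `Instance` type and `evacuate` function are hypothetical illustrations, not the Compute API.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Instance:
    instance_id: str
    name: str
    volumes: Tuple[str, ...]
    networks: Tuple[str, ...]
    host: str


def evacuate(instance: Instance, target_host: str) -> Instance:
    """Return a copy of the instance record that differs only in its host."""
    return replace(instance, host=target_host)
```

For example, `evacuate(Instance("id-1", "web-01", ("vol-a",), ("net-a",), "compute-01"), "compute-02")` returns a record with the same ID, name, volumes, and networks, now placed on `compute-02`.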

Next Steps

Protection Segments

View and understand the protection segments available for your instances.

Recovery Workflows

Understand recovery methods and how the engine selects evacuation targets.

Monitoring Status

Track active and historical recovery events in the Dashboard and CLI.

Instance HA Overview

Return to the Instance HA service overview page.