
Overview

Xloud Instance HA is a fault-detection and automated recovery service deployed alongside the compute cluster. Its architecture separates detection (monitors), event routing (notification engine), decision-making (recovery engine), and execution (Compute API calls) into independently scalable components. Understanding this separation helps administrators plan deployments, diagnose failures, and tune recovery behaviour.
This guide requires administrator privileges. Changes to the Instance HA deployment affect all active recovery workflows cluster-wide.

Component Diagram


Components

Host Monitor

Polls each registered compute host at a configurable interval using IPMI or SSH. Declares a host unreachable after a configurable number of consecutive failures and emits a COMPUTE_HOST fault notification.
Deployed as: masakari-hostmonitor service on the controller node.
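
The consecutive-failure rule can be illustrated with a minimal Python sketch. This is not the Host Monitor's actual code: the HostState class, the record_probe function, the failure_threshold parameter, and the notification fields are hypothetical names chosen for the example, and the probe results are simulated rather than obtained over IPMI or SSH.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HostState:
    """Tracks consecutive failed probes for one compute host (hypothetical type)."""
    name: str
    consecutive_failures: int = 0
    reported: bool = False  # True once a fault has been emitted for this outage


def record_probe(state: HostState, reachable: bool,
                 failure_threshold: int) -> Optional[dict]:
    """Update the counter after one probe.

    Returns a notification dict the first time the failure streak reaches
    the threshold, otherwise None. The dict layout is an assumption.
    """
    if reachable:
        # A successful probe resets the streak and re-arms reporting.
        state.consecutive_failures = 0
        state.reported = False
        return None
    state.consecutive_failures += 1
    if state.consecutive_failures >= failure_threshold and not state.reported:
        state.reported = True
        return {"type": "COMPUTE_HOST", "hostname": state.name}
    return None


# Simulated probe results: one blip, a recovery, then four failures in a row.
host = HostState("compute-01")
emitted: List[dict] = []
for reachable in [False, True, False, False, False, False]:
    notification = record_probe(host, reachable, failure_threshold=3)
    if notification is not None:
        emitted.append(notification)
```

In this run the isolated first failure does not trigger anything because the next probe succeeds, and the later streak produces exactly one notification even though the host keeps failing afterwards.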

Instance Monitor

Monitors running instances for guest-level heartbeat failures, independent of the host state. Emits COMPUTE_INSTANCE fault notifications when a guest stops responding.
Deployed as: masakari-instancemonitor service on each compute host.

Notification Engine

Receives raw fault signals from monitors, deduplicates events within a configurable window, and routes structured notifications to the Recovery Engine via the message bus.
The default driver is NovaNotificationDriver, which also listens to the Xloud Compute message bus for host and instance failure events.
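
Windowed deduplication can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the engine's implementation: the Deduplicator class, its accept method, and the choice to key events by fault type and source are hypothetical, and the window here is measured from the last accepted event rather than sliding with every duplicate.

```python
from typing import Dict, Tuple


class Deduplicator:
    """Drops repeat fault signals for the same source inside a time window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        # Maps (fault type, source name) to the time of the last accepted event.
        self._last_accepted: Dict[Tuple[str, str], float] = {}

    def accept(self, fault_type: str, source: str, timestamp: float) -> bool:
        """Return True if the event should be forwarded, False if it is a duplicate."""
        key = (fault_type, source)
        last = self._last_accepted.get(key)
        if last is not None and timestamp - last < self.window_seconds:
            return False  # same fault from the same source, still inside the window
        self._last_accepted[key] = timestamp
        return True


dedup = Deduplicator(window_seconds=60)
# Five signals for the same host at t = 0, 10, 59, 60 and 61 seconds.
results = [dedup.accept("COMPUTE_HOST", "compute-01", t) for t in (0, 10, 59, 60, 61)]
# A signal for a different host is tracked independently.
other_host = dedup.accept("COMPUTE_HOST", "compute-02", 10)
```

Only the signals at 0 and 60 seconds are forwarded for compute-01; the rest fall inside a 60-second window that started at an accepted event.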

Recovery Engine

The central decision-making component. On receiving a notification, it:
  1. Queries the Instance HA database to identify the affected segment
  2. Retrieves all protected instances on the failed host
  3. Applies the segment's recovery method to select evacuation targets
  4. Invokes the Compute API to initiate evacuation

Deployed as: masakari-engine service on the controller node.
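
The four steps can be sketched as a single Python function over in-memory data. Everything named here is an assumption made for illustration: the Host and Segment classes, the "auto" and "reserved_host" recovery method values, the round-robin placement, and the evacuate callback that stands in for the Compute API call. The real engine reads from the Instance HA database and may choose targets differently.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Host:
    name: str
    reserved: bool = False
    instances: List[str] = field(default_factory=list)  # protected instance IDs


@dataclass
class Segment:
    name: str
    recovery_method: str  # assumed values: "auto" or "reserved_host"
    hosts: Dict[str, Host] = field(default_factory=dict)


def handle_host_failure(segments: List[Segment], failed_host: str,
                        evacuate: Callable[[str, str], None]) -> List[Tuple[str, str]]:
    """Return the (instance, target host) moves requested for a failed host."""
    # Step 1: identify the segment that contains the failed host.
    segment = next((s for s in segments if failed_host in s.hosts), None)
    if segment is None:
        return []
    # Step 2: collect the protected instances on that host.
    instances = list(segment.hosts[failed_host].instances)
    # Step 3: apply the recovery method to pick candidate targets.
    survivors = [h for h in segment.hosts.values() if h.name != failed_host]
    if segment.recovery_method == "reserved_host":
        targets = [h.name for h in survivors if h.reserved]
    else:
        targets = [h.name for h in survivors if not h.reserved]
    if not targets:
        return []
    # Step 4: request evacuation, spreading instances round-robin (an assumption).
    moves = []
    for index, instance_id in enumerate(instances):
        target = targets[index % len(targets)]
        evacuate(instance_id, target)
        moves.append((instance_id, target))
    return moves


segment = Segment("seg-a", "reserved_host", hosts={
    "compute-01": Host("compute-01", instances=["vm-1", "vm-2"]),
    "compute-02": Host("compute-02"),
    "compute-03": Host("compute-03", reserved=True),
})
calls: List[Tuple[str, str]] = []
moves = handle_host_failure([segment], "compute-01",
                            lambda instance, target: calls.append((instance, target)))
```

With the "reserved_host" method, both instances are sent to compute-03, the only reserved host in the example segment. A host that belongs to no segment yields no moves.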

Database

Stores all segment definitions, host registrations, reserved host flags, and notification history. Backed by the platform database (MySQL/MariaDB).
Schema includes the segments, hosts, notifications, and vm_moves tables.

Deployment Topology

Controller Nodes
masakari-api: REST API for segment and host management
masakari-engine: Recovery Engine and Notification Engine
masakari-hostmonitor: Host Monitor daemon

Compute Nodes
masakari-instancemonitor: Instance Monitor daemon, one per compute host

In XDeploy-managed deployments, all Instance HA components are deployed as Docker containers. Configuration files are managed via the /etc/xavs/instance-ha/ overlay directory.

Integration with Xloud Services

| Service | Integration | Purpose |
|---|---|---|
| Xloud Compute | Evacuation API (/os-evacuate) | Executes instance migrations to healthy hosts |
| Xloud Identity | Service account authentication | Authenticates Instance HA API calls |
| AMQP Message Bus | NovaNotificationDriver subscription | Receives host/instance failure events from the Compute message bus |
| Xloud Distributed Storage | Shared instance disk backend | Required for live evacuation; instances on local disk cannot be moved |

Data Flow: Host Failure to Recovery


High Availability for Instance HA

To avoid a single point of failure in the recovery infrastructure:

Active/Passive API

Deploy multiple masakari-api instances behind the load balancer. The API is stateless; all state is held in the database.

Engine Leader Election

Run masakari-engine on two controller nodes. The engine uses Tooz-based distributed locking to elect a leader, so only one engine processes notifications at a time.
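
The lock-based election pattern can be shown with a small Python sketch. It deliberately does not use Tooz: a local threading.Lock stands in for the distributed lock so the example runs in a single process, and the Engine class with its try_become_leader, resign, and process methods is hypothetical. It only demonstrates the idea that whichever engine holds the lock is the one that processes notifications.

```python
import threading
from typing import List


class Engine:
    """One engine instance competing for a shared leader lock (hypothetical)."""

    def __init__(self, name: str, leader_lock: threading.Lock) -> None:
        self.name = name
        self._lock = leader_lock  # stand-in for a distributed lock
        self.is_leader = False
        self.processed: List[str] = []

    def try_become_leader(self) -> bool:
        """Attempt a non-blocking acquire; the holder of the lock is the leader."""
        if not self.is_leader:
            self.is_leader = self._lock.acquire(blocking=False)
        return self.is_leader

    def resign(self) -> None:
        """Give up leadership, for example on shutdown."""
        if self.is_leader:
            self._lock.release()
            self.is_leader = False

    def process(self, notification: str) -> bool:
        """Handle a notification only while holding the leader lock."""
        if not self.try_become_leader():
            return False  # another engine is the leader; stay on standby
        self.processed.append(notification)
        return True


shared_lock = threading.Lock()
engine_a = Engine("controller-1", shared_lock)
engine_b = Engine("controller-2", shared_lock)

engine_a.process("n1")  # engine_a acquires the lock and handles n1
engine_b.process("n1")  # engine_b cannot acquire the lock and skips n1
engine_a.resign()       # simulate the leader going away
engine_b.process("n2")  # engine_b acquires the lock and handles n2
```

In a real deployment the lock lives in an external coordination backend, so leadership also has to be released when the holder crashes; this sketch models only the explicit hand-over.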

Next Steps

Failover Segments

Create and manage failover segments and register compute hosts.

Host Monitors

Configure IPMI and SSH host monitors for your compute nodes.

Engine Configuration

Tune recovery engine timing, retry intervals, and instance failure behaviour.

Security

Configure RBAC policies and credential management for the Instance HA service.