Instance HA Architecture

Overview

Xloud Instance HA is a fault-detection and automated recovery service deployed alongside the compute cluster. Its architecture separates detection (monitors), event routing (notification engine), decision-making (recovery engine), and execution (Compute API calls) into independently scalable components. Understanding this separation helps administrators plan deployments, diagnose failures, and tune recovery behaviour.

This guide requires administrator privileges. Changes to the Instance HA deployment affect all active recovery workflows cluster-wide.

Component Diagram

Components

Host Monitor

Polls each registered compute host at a configurable interval using IPMI or SSH. Declares a host unreachable after a configurable number of consecutive failures and emits a COMPUTE_HOST fault notification.

Deployed as: masakari-hostmonitor service on the controller node.
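The consecutive-failure rule can be sketched as follows. This is a minimal illustration, not the masakari-hostmonitor implementation: the class name, the `failure_threshold` parameter, its value of 3, and the notification field names are all assumptions. Only the COMPUTE_HOST notification type comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class HostFailureTracker:
    """Counts consecutive failed polls per host (hypothetical helper)."""

    failure_threshold: int = 3  # assumed value, not a documented default
    _failures: Dict[str, int] = field(default_factory=dict)
    _down: Set[str] = field(default_factory=set)

    def record_poll(self, host: str, reachable: bool) -> bool:
        """Record one poll result; return True when a fault should be emitted."""
        if reachable:
            # A successful poll resets the streak and clears the down state.
            self._failures[host] = 0
            self._down.discard(host)
            return False
        self._failures[host] = self._failures.get(host, 0) + 1
        if self._failures[host] >= self.failure_threshold and host not in self._down:
            # Emit once per outage, not on every later failed poll.
            self._down.add(host)
            return True
        return False


def build_notification(host: str) -> dict:
    # Field names are assumptions made for this sketch.
    return {"type": "COMPUTE_HOST", "hostname": host}


tracker = HostFailureTracker(failure_threshold=3)
polls = (False, False, True, False, False, False, False)
results = [tracker.record_poll("compute-01", ok) for ok in polls]
```

In this sketch a single successful poll resets the counter, so only an unbroken run of failures reaches the threshold, and the fault is raised once per outage.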

Instance Monitor

Monitors running instances for guest-level heartbeat failures, independent of the host state. Emits COMPUTE_INSTANCE fault notifications when a guest stops responding.

Deployed as: masakari-instancemonitor service on each compute host.

Notification Engine

Receives raw fault signals from monitors, deduplicates events within a configurable window, and routes structured notifications to the Recovery Engine via the message bus.

The default driver is NovaNotificationDriver, which also listens to the Xloud Compute message bus for host and instance failure events.
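Window-based deduplication might look like the sketch below. The class, its `window_seconds` parameter, and the choice of (event type, source) as the duplicate key are assumptions for illustration; the actual keying and window semantics of the Notification Engine are not specified here.

```python
from typing import Dict, Tuple


class NotificationDeduplicator:
    """Drops repeat events for the same source inside a time window (hypothetical)."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds  # assumed value, not a documented default
        self._last_forwarded: Dict[Tuple[str, str], float] = {}

    def should_forward(self, event_type: str, source: str, timestamp: float) -> bool:
        key = (event_type, source)
        last = self._last_forwarded.get(key)
        if last is not None and timestamp - last < self.window_seconds:
            # Duplicate inside the window: suppress it and keep the original timestamp.
            return False
        self._last_forwarded[key] = timestamp
        return True


dedup = NotificationDeduplicator(window_seconds=60.0)
first = dedup.should_forward("COMPUTE_HOST", "compute-01", 0.0)
repeat = dedup.should_forward("COMPUTE_HOST", "compute-01", 30.0)
other_host = dedup.should_forward("COMPUTE_HOST", "compute-02", 30.0)
after_window = dedup.should_forward("COMPUTE_HOST", "compute-01", 61.0)
```

The window here is measured from the last forwarded event, so suppressed duplicates do not extend it. A sliding window that restarts on every duplicate is an equally plausible design.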

Recovery Engine

The central decision-making component. On receiving a notification, it:

1. Queries the Instance HA database to identify the affected segment
2. Retrieves all protected instances on the failed host
3. Applies the segment’s recovery method to select evacuation targets
4. Invokes the Compute API to initiate evacuation

Deployed as: masakari-engine service on the controller node.
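The four steps above can be sketched with in-memory data. Everything in this example is hypothetical: the `Host` and `Instance` types, the method names `auto` and `reserved_host`, and the first-candidate selection rule are assumptions, and the final Compute API call is replaced by returning a plan of (instance, target) pairs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Host:
    name: str
    segment: str
    reserved: bool = False
    healthy: bool = True


@dataclass
class Instance:
    uuid: str
    host: str
    protected: bool = True


def select_target(hosts: List[Host], failed: Host, method: str) -> Optional[Host]:
    """Pick an evacuation target inside the failed host's segment."""
    candidates = [
        h for h in hosts
        if h.segment == failed.segment and h.healthy and h.name != failed.name
    ]
    if method == "reserved_host":
        candidates = [h for h in candidates if h.reserved]
    else:  # assumed "auto" behaviour: any healthy, non-reserved host
        candidates = [h for h in candidates if not h.reserved]
    return candidates[0] if candidates else None


def plan_recovery(
    notification: dict,
    hosts: List[Host],
    instances: List[Instance],
    segment_methods: Dict[str, str],
) -> List[Tuple[str, str]]:
    by_name = {h.name: h for h in hosts}
    failed = by_name[notification["hostname"]]           # 1. identify the segment
    method = segment_methods[failed.segment]
    affected = [i for i in instances
                if i.host == failed.name and i.protected]  # 2. protected instances
    target = select_target(hosts, failed, method)         # 3. apply recovery method
    if target is None:
        return []
    # 4. a real engine would call the Compute API here; this sketch returns the plan
    return [(i.uuid, target.name) for i in affected]


hosts = [Host("c1", "seg-a"), Host("c2", "seg-a"),
         Host("c3", "seg-a", reserved=True), Host("c4", "seg-b")]
instances = [Instance("vm-1", "c1"), Instance("vm-2", "c1", protected=False),
             Instance("vm-3", "c2")]
plan = plan_recovery({"type": "COMPUTE_HOST", "hostname": "c1"},
                     hosts, instances, {"seg-a": "auto", "seg-b": "auto"})
```

Unprotected instances and hosts outside the failed host's segment are ignored, which mirrors the segment-scoped behaviour described above.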

Instance HA Database

Stores all segment definitions, host registrations, reserved host flags, and notification history. Backed by the platform database (MySQL/MariaDB).

Schema includes: segments, hosts, notifications, vm_moves tables.

Deployment Topology

Controller Nodes

masakari-api: REST API for segment and host management

masakari-engine: Recovery Engine and Notification Engine

masakari-hostmonitor: Host Monitor daemon

Compute Nodes

masakari-instancemonitor: Instance Monitor daemon

In XDeploy-managed deployments, all Instance HA components are deployed as Docker containers. Configuration files are managed via the /etc/xavs/instance-ha/ overlay directory.

Integration with Xloud Services

| Service | Integration | Purpose |
|---|---|---|
| Xloud Compute | Evacuation API (`/os-evacuate`) | Executes instance migrations to healthy hosts |
| Xloud Identity | Service account authentication | Authenticates Instance HA API calls |
| AMQP Message Bus | `NovaNotificationDriver` subscription | Receives host/instance failure events from the Compute message bus |
| Xloud Distributed Storage | Shared instance disk backend | Required for live evacuation; instances on local disk cannot be moved |

Data Flow: Host Failure to Recovery

High Availability for Instance HA

To avoid a single point of failure in the recovery infrastructure:

Active/Passive API

Deploy multiple masakari-api instances behind the load balancer. The API is stateless — all state is in the database.

Engine Leader Election

Run masakari-engine on two controller nodes. The engine uses Tooz-based distributed locking to elect a leader — only one engine processes notifications at a time.
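The leader-only processing pattern can be illustrated with a non-blocking lock. This sketch uses `threading.Lock` from the standard library as a stand-in for a Tooz distributed lock, so it only models two engines inside one process. The `Engine` class and its methods are hypothetical and do not reflect the masakari-engine code.

```python
import threading
from typing import List


class Engine:
    """Processes notifications only while it holds the shared lock (hypothetical)."""

    def __init__(self, name: str, lock: threading.Lock):
        self.name = name
        self._lock = lock  # stand-in for a distributed lock shared by all engines
        self.is_leader = False
        self.processed: List[str] = []

    def try_become_leader(self) -> bool:
        if not self.is_leader:
            # Non-blocking attempt: a standby engine returns immediately.
            self.is_leader = self._lock.acquire(blocking=False)
        return self.is_leader

    def handle(self, notification: str) -> bool:
        if not self.try_become_leader():
            return False  # standby: leave the notification to the leader
        self.processed.append(notification)
        return True

    def stop(self) -> None:
        if self.is_leader:
            self._lock.release()
            self.is_leader = False


shared_lock = threading.Lock()
engine_a = Engine("controller-1", shared_lock)
engine_b = Engine("controller-2", shared_lock)

a_handled = engine_a.handle("notification-1")   # controller-1 takes the lock
b_handled = engine_b.handle("notification-1")   # controller-2 stays on standby
engine_a.stop()                                  # leader shuts down
b_takes_over = engine_b.handle("notification-2")
```

A real distributed lock also has to cope with a leader that crashes without releasing it, typically through lock expiry or heartbeats. This in-process sketch does not model that case.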

Next Steps

Failover Segments

Create and manage failover segments and register compute hosts.

Host Monitors

Configure IPMI and SSH host monitors for your compute nodes.

Engine Configuration

Tune recovery engine timing, retry intervals, and instance failure behaviour.

Security

Configure RBAC policies and credential management for the Instance HA service.
