Overview
Xloud Instance HA is a fault-detection and automated recovery service deployed alongside the compute cluster. Its architecture separates detection (monitors), event routing (notification engine), decision-making (recovery engine), and execution (Compute API calls) into independently scalable components. Understanding this separation helps administrators plan deployments, diagnose failures, and tune recovery behaviour.
Component Diagram
Components
Host Monitor
Polls each registered compute host at a configurable interval using IPMI or SSH. Declares a host unreachable after a configurable number of consecutive failures and emits a COMPUTE_HOST fault notification.
Deployed as: masakari-hostmonitor service on the controller node.
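The consecutive-failure rule can be illustrated with a minimal sketch. This is not the service's implementation: the class name `HostFailureTracker`, the threshold of 3, and the fields of the notification dictionary are assumptions made for illustration, and the actual IPMI or SSH probe is replaced by a boolean poll result.

```python
from dataclasses import dataclass, field


@dataclass
class HostFailureTracker:
    """Counts consecutive failed polls per host (hypothetical helper)."""

    failure_threshold: int = 3  # assumed value; the real threshold is configurable
    _failures: dict = field(default_factory=dict)

    def record_poll(self, host: str, reachable: bool):
        """Return a notification dict when the threshold is first reached, else None."""
        if reachable:
            # Any successful poll resets the streak.
            self._failures[host] = 0
            return None
        count = self._failures.get(host, 0) + 1
        self._failures[host] = count
        if count == self.failure_threshold:
            # Field names below are illustrative, not the real payload schema.
            return {"type": "COMPUTE_HOST", "hostname": host, "consecutive_failures": count}
        return None


tracker = HostFailureTracker(failure_threshold=3)
polls = [False, False, True, False, False, False]
events = [tracker.record_poll("compute-01", ok) for ok in polls]
```

In this run only the sixth poll produces a notification: the successful third poll resets the counter, so the host must fail three more times in a row before it is declared unreachable.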
Instance Monitor
Monitors running instances for guest-level heartbeat failures, independent of the host state. Emits COMPUTE_INSTANCE fault notifications when a guest stops responding.
Deployed as: masakari-instancemonitor service on each compute host.
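As a rough illustration of guest heartbeat checking, the sketch below flags instances whose last heartbeat is older than a timeout. The function name `find_unresponsive`, the 30-second timeout, and the notification fields are assumptions; how the real monitor collects heartbeats is not described here.

```python
def find_unresponsive(last_heartbeat, now, timeout_s=30.0):
    """Return illustrative COMPUTE_INSTANCE notifications for silent guests.

    last_heartbeat maps an instance ID to the timestamp (in seconds) of the
    most recent heartbeat received from that guest.
    """
    return [
        {"type": "COMPUTE_INSTANCE", "instance_id": iid, "silent_for_s": now - seen}
        for iid, seen in sorted(last_heartbeat.items())
        if now - seen > timeout_s
    ]


heartbeats = {"vm-a": 100.0, "vm-b": 65.0}
stale = find_unresponsive(heartbeats, now=110.0, timeout_s=30.0)
```

Here `vm-a` was heard from 10 seconds ago and is left alone, while `vm-b` has been silent for 45 seconds and is reported. The check never consults host state, which matches the independence described above.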
Notification Engine
Receives raw fault signals from monitors, deduplicates events within a configurable window, and routes structured notifications to the Recovery Engine via the message bus.
The default driver is NovaNotificationDriver, which also listens to the Xloud Compute message bus for host and instance failure events.
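Deduplication within a window might look like the following sketch. The `Deduplicator` class, the 60-second window, and the choice of (event type, source) as the deduplication key are assumptions for illustration; the real engine's keying and window semantics may differ.

```python
class Deduplicator:
    """Suppresses repeat events for the same source inside a time window (sketch)."""

    def __init__(self, window_s=60.0):
        self.window_s = window_s  # assumed default; the real window is configurable
        self._last_forwarded = {}  # (event_type, source) -> time of last forwarded event

    def should_forward(self, event_type, source, timestamp):
        key = (event_type, source)
        last = self._last_forwarded.get(key)
        if last is not None and timestamp - last < self.window_s:
            return False  # duplicate inside the window: drop it
        self._last_forwarded[key] = timestamp
        return True


dedup = Deduplicator(window_s=60.0)
decisions = [
    dedup.should_forward("COMPUTE_HOST", "compute-01", 0.0),   # first report
    dedup.should_forward("COMPUTE_HOST", "compute-01", 20.0),  # repeat within window
    dedup.should_forward("COMPUTE_HOST", "compute-02", 25.0),  # different host
    dedup.should_forward("COMPUTE_HOST", "compute-01", 61.0),  # window has elapsed
]
```

The window in this sketch is measured from the last forwarded event rather than the last seen one, so a host that keeps failing is re-reported once per window instead of being silenced indefinitely.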
Recovery Engine
The central decision-making component. On receiving a notification, it:
- Queries the Instance HA database to identify the affected segment
- Retrieves all protected instances on the failed host
- Applies the segment’s recovery method to select evacuation targets
- Invokes the Compute API to initiate evacuation
Deployed as: masakari-engine service on the controller node.
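The four steps listed above can be sketched as a planning function. Everything here is a simplified assumption: the `Host` record, the recovery method names `"reserved_host"` and `"auto"`, and the round-robin placement are illustrative, and the final Compute API call is represented only by the returned plan rather than a real request.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    segment: str
    reserved: bool = False
    failed: bool = False


def plan_evacuation(failed_host, hosts, instances_by_host, protected, recovery_method):
    """Return (instance, target_host) pairs for the protected instances on failed_host."""
    # Step 1: identify the segment the failed host belongs to.
    segment = {h.name: h for h in hosts}[failed_host].segment
    # Step 2: collect the protected instances that were running on it.
    victims = [i for i in instances_by_host.get(failed_host, []) if i in protected]
    # Step 3: apply the recovery method to choose candidate targets.
    healthy = [h for h in hosts
               if h.segment == segment and h.name != failed_host and not h.failed]
    want_reserved = recovery_method == "reserved_host"
    targets = [h for h in healthy if h.reserved == want_reserved]
    if victims and not targets:
        raise RuntimeError(f"no evacuation target available in segment {segment}")
    # Step 4: the real engine would invoke the Compute API per pair; we only plan.
    return [(inst, targets[n % len(targets)].name) for n, inst in enumerate(victims)]


hosts = [Host("c1", "seg-a"), Host("c2", "seg-a"),
         Host("c3", "seg-a", reserved=True), Host("c4", "seg-b")]
running = {"c1": ["vm-1", "vm-2", "vm-3"]}
plan = plan_evacuation("c1", hosts, running, {"vm-1", "vm-3"}, "reserved_host")
```

With the reserved-host method, both protected instances are planned onto `c3`, the only reserved host in the segment. The unprotected `vm-2` is ignored, and `c4` is never considered because it belongs to a different segment.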
Instance HA Database
Stores all segment definitions, host registrations, reserved host flags, and notification history. Backed by the platform database (MySQL/MariaDB).
Schema includes: segments, hosts, notifications, vm_moves tables.
Deployment Topology
Controller Nodes
masakari-api: REST API for segment and host management
masakari-engine: Recovery Engine and Notification Engine
masakari-hostmonitor: Host Monitor daemon
Compute Nodes
masakari-instancemonitor: Instance Monitor daemon (one per compute host)
In XDeploy-managed deployments, all Instance HA components are deployed as Docker containers. Configuration files are managed via the /etc/xavs/instance-ha/ overlay directory.
Integration with Xloud Services
| Service | Integration | Purpose |
|---|---|---|
| Xloud Compute | Evacuation API (/os-evacuate) | Executes instance migrations to healthy hosts |
| Xloud Identity | Service account authentication | Authenticates Instance HA API calls |
| AMQP Message Bus | NovaNotificationDriver subscription | Receives host/instance failure events from the Compute message bus |
| Xloud Distributed Storage | Shared instance disk backend | Required for live evacuation — local disk instances cannot be moved |
Data Flow: Host Failure to Recovery
High Availability for Instance HA
To avoid a single point of failure in the recovery infrastructure:
Active/Passive API
Deploy multiple masakari-api instances behind the load balancer. The API is stateless: all state is in the database.
Engine Leader Election
Run masakari-engine on two controller nodes. The engine uses Tooz-based distributed locking to elect a leader, so only one engine processes notifications at a time.
Next Steps
Failover Segments
Create and manage failover segments and register compute hosts.
Host Monitors
Configure IPMI and SSH host monitors for your compute nodes.
Engine Configuration
Tune recovery engine timing, retry intervals, and instance failure behaviour.
Security
Configure RBAC policies and credential management for the Instance HA service.