
Overview

Host monitors are the detection layer of Instance HA. They continuously poll registered compute hosts and emit fault notifications when a host becomes unreachable. Xloud Instance HA supports two monitor types: IPMI (out-of-band, recommended for production) and SSH (in-band, for environments without IPMI access). This page covers configuration for both types and the timing parameters that control detection sensitivity.
Misconfigured monitor credentials or unreachable IPMI endpoints make detection unreliable. The monitor may report a healthy host as failed, which can trigger an unnecessary evacuation, or it may fail to flag a host that has genuinely failed. Validate all monitor connections before enabling production workloads.

Monitor Types

IPMI (Recommended)

Uses out-of-band management hardware. Detects failures even when the host OS, kernel, or all network interfaces are completely unresponsive.

SSH

Attempts an SSH connection to the host. Simpler to set up, but depends on the host network stack: a network outage and a host crash both leave the SSH port unreachable, so the monitor cannot tell them apart.

IPMI Host Monitor

The IPMI monitor uses the host’s control_attributes JSON field, set when registering the host in a segment.

Configure IPMI Credentials

When adding a host to a segment, set the Control Attributes field to a JSON object with the IPMI endpoint:
IPMI control attributes
{
  "host": "192.168.10.11",
  "username": "admin",
  "password": "ipmi-password"
}
| Key | Description |
|---|---|
| host | IPMI management IP address |
| username | IPMI user with chassis status read permissions |
| password | IPMI user password |
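Hand-writing this JSON invites quoting mistakes, particularly when a password contains double quotes or backslashes. The following is a minimal sketch, assuming `jq` is installed on the controller node; the helper name `ipmi_control_attrs` is hypothetical. It builds the object with correct escaping so the result can be pasted into the Control Attributes field.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the IPMI control_attributes JSON for one host.
# jq --arg escapes each value, so passwords containing quotes or
# backslashes still yield valid JSON. Assumes jq is installed.
ipmi_control_attrs() {
  local ipmi_host="$1" ipmi_user="$2" ipmi_pass="$3"
  jq -c -n \
    --arg host "$ipmi_host" \
    --arg username "$ipmi_user" \
    --arg password "$ipmi_pass" \
    '{host: $host, username: $username, password: $password}'
}

# Example:
#   ipmi_control_attrs 192.168.10.11 admin 'ipmi-password'
```

Building the object with `--arg` rather than string concatenation keeps the escaping concern inside `jq`, which already implements JSON string encoding.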

Validate IPMI Connectivity

Before registering hosts, validate IPMI access from the controller node:
Test IPMI connectivity
ipmitool -I lanplus \
  -H <ipmi-ip> \
  -U <username> \
  -P <password> \
  chassis status
Expected output: a line reading "System Power : on" (ipmitool pads the label with spaces before the colon) confirms that IPMI access is working.
Ensure UDP port 623 is permitted between the Instance HA controller and all IPMI management interfaces. IPMI uses RMCP+ protocol over UDP 623 by default.
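When several hosts need validating, the same check can be looped. This is a sketch under stated assumptions: `ipmitool` is installed on the controller, every BMC in the list shares one IPMI account, and the function names and BMC addresses are hypothetical. It uses the ipmitool `-E` option, which reads the password from the `IPMI_PASSWORD` environment variable so it does not appear in the process list.

```shell
#!/usr/bin/env bash
# Sketch: verify IPMI access to several BMCs before registering hosts.
# Assumes ipmitool is installed and all BMCs share one IPMI account.

# Succeed if `ipmitool chassis status` output (on stdin) reports power on.
ipmi_power_is_on() {
  grep -Eq '^System Power[[:space:]]*:[[:space:]]*on[[:space:]]*$'
}

# Query one BMC over RMCP+ (lanplus). -E reads the password from the
# IPMI_PASSWORD environment variable instead of the command line.
check_ipmi_host() {
  local bmc="$1" user="$2"
  if ipmitool -I lanplus -H "$bmc" -U "$user" -E chassis status 2>/dev/null \
      | ipmi_power_is_on; then
    echo "OK   $bmc"
  else
    # Unreachable BMC, bad credentials, or a powered-off chassis.
    echo "FAIL $bmc"
    return 1
  fi
}

# Example (hypothetical BMC addresses):
#   export IPMI_PASSWORD='ipmi-password'
#   for bmc in 192.168.10.11 192.168.10.12; do check_ipmi_host "$bmc" admin; done
```

A FAIL line does not distinguish a network or credential problem from a chassis that is genuinely off; rerun `ipmitool ... chassis status` by hand for that host to see which.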

SSH Host Monitor

The SSH monitor connects to port 22 on each host and authenticates with the SSH key configured for the Instance HA service account. Password authentication is not used.

Deploy SSH Keys

Locate the service key

The Instance HA host monitor generates an SSH key pair at service startup. Locate the public key on the controller node:
Find service public key
cat /etc/xavs/instance-ha/id_rsa.pub

Deploy to compute hosts

Append the public key to the authorized_keys file of the user the monitor connects as (typically root or a dedicated service account):
Deploy key to compute host
ssh-copy-id -i /etc/xavs/instance-ha/id_rsa.pub root@<compute-host>

Register with SSH attributes

When registering the host in the segment, use the IP address only — no credentials:
Register host for SSH monitoring
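# Arguments are positional: <name> <type> <control_attributes> <segment_id>.
# <compute-host> should match the host name known to the Compute service.
openstack segment host create \
  <compute-host> \
  COMPUTE \
  '{"host": "10.0.1.72"}' \
  <segment-uuid>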

Validate connectivity

Test SSH from controller
ssh -i /etc/xavs/instance-ha/id_rsa root@<compute-host> hostname
The command prints the compute host's hostname without a password prompt.
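To run the same check against every compute host, a loop like the one below can help. It is a minimal sketch, assuming the monitor connects as root (as in the command above) and that the key lives at the path shown earlier; the function name and host names are hypothetical. `BatchMode=yes` makes ssh fail instead of prompting, so a missing key shows up as FAIL rather than stalling the loop.

```shell
#!/usr/bin/env bash
# Sketch: confirm key-based SSH from the controller to each compute host.
# Assumes login as root and the service key path used on this page.
IHA_SSH_KEY="${IHA_SSH_KEY:-/etc/xavs/instance-ha/id_rsa}"

check_ssh_host() {
  local host="$1" remote_name
  # BatchMode=yes: never prompt for a password; fail instead.
  # ConnectTimeout=5: do not hang on hosts that drop packets.
  if remote_name=$(ssh -i "$IHA_SSH_KEY" -o BatchMode=yes -o ConnectTimeout=5 \
      "root@$host" hostname 2>/dev/null); then
    echo "OK   $host ($remote_name)"
  else
    echo "FAIL $host"
    return 1
  fi
}

# Example (hypothetical host names):
#   for h in compute-01 compute-02; do check_ssh_host "$h"; done
```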

Timing Parameters

Adjust monitoring sensitivity through the parameters below:
| Section | Parameter | Default | Description |
|---|---|---|---|
| [DEFAULT] | wait_period_after_service_update | 180 | Seconds to wait after a host enters maintenance before triggering recovery; prevents false alarms during planned restarts |
| [DEFAULT] | long_rpc_timeout | 300 | Maximum seconds to wait for a Compute RPC call to complete before declaring it failed |
| [host_failure] | host_failure_recovery_interval | 17 | Seconds between recovery retry attempts when the first evacuation attempt fails |
| [host_failure] | ignore_lease_seconds | 0 | Seconds after host boot to suppress fault notifications; set to 60 to 120 to avoid startup noise |

Open Advanced Configuration

In XDeploy, navigate to Advanced Configuration. In the Service Tree, select masakari.

Edit timing parameters

Select or create instance-ha.conf in the Code Editor. Add or modify the timing parameters:
Host monitor timing in XDeploy Advanced Configuration
[DEFAULT]
wait_period_after_service_update = 180
long_rpc_timeout = 300

[host_failure]
host_failure_recovery_interval = 17
ignore_lease_seconds = 60
Click Save Current File.

Apply changes

Navigate to Operations and run a reconfigure action. The host monitor service restarts automatically with the updated parameters.
Host monitor is running with the new timing configuration.
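To spot-check that the reconfigure rendered the values you saved, you can read them back from the configuration file on the node. The helper below is a hypothetical sketch; the location of the rendered instance-ha.conf depends on the deployment, so the path in the example is a placeholder. It prints the value of one key within one INI section.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of KEY inside [SECTION] of an INI file.
# Usage: ini_get FILE SECTION KEY
ini_get() {
  local file="$1" section="$2" key="$3"
  awk -v sec="[$section]" -v key="$key" '
    # Section header: remember whether it is the section we want.
    /^[ \t]*\[/ { hdr = $0; gsub(/[ \t]/, "", hdr); in_sec = (hdr == sec); next }
    # key = value line inside the wanted section.
    in_sec && index($0, "=") > 0 {
      k = substr($0, 1, index($0, "=") - 1); gsub(/[ \t]/, "", k)
      if (k == key) {
        v = substr($0, index($0, "=") + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", v)
        print v
        exit
      }
    }
  ' "$file"
}

# Example (placeholder path):
#   ini_get /path/to/instance-ha.conf host_failure ignore_lease_seconds
```

An empty result means the key is absent from that section, so the service falls back to its default.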

Monitor Health Check

Verify the host monitor is running and detecting hosts correctly:
Check monitor service status
docker ps --filter name=masakari_hostmonitor
View monitor logs
docker logs -f masakari_hostmonitor
Look for log entries confirming successful polls:
INFO masakari.hostmonitor: Host compute-01 is ALIVE
INFO masakari.hostmonitor: Host compute-02 is ALIVE
A repeated UNREACHABLE log for a running host indicates a credential or network configuration issue — not a genuine host failure.
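Scrolling through `docker logs -f` makes repeated failures easy to miss. The sketch below counts UNREACHABLE entries per host and prints the hosts at or above a threshold. It assumes UNREACHABLE lines follow the same "Host <name> is <STATE>" shape as the ALIVE lines shown above; that format, the threshold, and the function name are assumptions to adjust for your logs.

```shell
#!/usr/bin/env bash
# Sketch: count UNREACHABLE entries per host in host monitor log output.
# Assumes lines of the form "... Host <name> is UNREACHABLE".
# Prints "<host> <count>" for hosts with at least THRESHOLD entries.
summarize_unreachable() {
  local threshold="${1:-3}"
  awk -v t="$threshold" '
    / is UNREACHABLE/ {
      for (i = 1; i < NF; i++) if ($i == "Host") { n[$(i + 1)]++; break }
    }
    END { for (h in n) if (n[h] >= t) print h, n[h] }
  ' | sort
}

# Example: hosts with 3 or more UNREACHABLE entries in the last hour.
#   docker logs --since 1h masakari_hostmonitor 2>&1 | summarize_unreachable 3
```

Cross-check any host this reports against its actual state: if it is running, treat the entries as a credential or network configuration issue, as described above.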

Next Steps

Instance Monitors

Configure guest-level instance heartbeat monitoring for per-VM fault detection.

Failover Segments

Register and manage compute hosts within protection segments.

Engine Configuration

Tune recovery engine timing and retry parameters.

Security

Secure IPMI credentials and restrict access to Instance HA APIs.