Instance HA Admin Troubleshooting

Notification received but no recovery triggered

Cause: The segment is disabled, the failed host is not registered in any segment, or the engine is not running.

Resolution:

Check engine status

docker ps --filter name=masakari_engine
docker logs masakari_engine 2>&1 | tail -30

Check segment and host registration

openstack segment list
openstack segment host list <segment-uuid>

If the segment is disabled, re-enable it:

Re-enable segment

openstack segment update --enabled True <segment-uuid>

If the host is missing from the segment, register it:

openstack segment host create \
  --type COMPUTE \
  --control_attributes '{"host": "compute-01"}' \
  <segment-uuid>
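The registration check can also be scripted. The following is a minimal sketch, not part of the product: `host_in_segment` is a hypothetical helper name, and it assumes the host list accepts the standard `-f value -c name` output options and exposes a `name` column.

```shell
# Hypothetical helper: return 0 if the host name given as $1 appears
# in a list of host names read from stdin (one name per line).
host_in_segment() {
  # -x matches the whole line, -F treats the name as a fixed string,
  # -q suppresses output so only the exit status is used.
  grep -qxF -- "$1"
}
```

For example, `openstack segment host list <segment-uuid> -f value -c name | host_in_segment compute-01 || echo "compute-01 is not registered"` prints a message only when the host is missing.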

Recovery fails with capacity error

Cause: No healthy host in the segment has sufficient vCPU or memory to accept the evacuated instances.

Resolution:

Check host capacity

openstack host list --service compute
openstack host show <hostname>

Check per-host utilization

openstack hypervisor list --long

Add capacity to the existing hosts or register additional hosts in the segment. For reserved_host segments, verify that the reserved host has sufficient headroom:

Check reserved host utilization

openstack host show <reserved-hostname>
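To compare headroom across hosts at a glance, the per-host figures can be reduced to free vCPUs and free memory. This is a rough sketch under stated assumptions: `hypervisor_headroom` is a hypothetical helper, and it expects five whitespace-separated fields per line (hostname, vCPUs used, vCPUs, memory MB used, memory MB). Newer Compute API versions may no longer report these usage columns, and the raw difference ignores any overcommit ratios the scheduler applies.

```shell
# Hypothetical helper: print "<hostname> <free_vcpus> <free_mem_mb>" per host.
# Input (stdin), one line per host:
#   <hostname> <vcpus_used> <vcpus> <mem_used_mb> <mem_mb>
hypervisor_headroom() {
  # Lines with fewer than five fields are skipped.
  awk 'NF >= 5 { printf "%s %d %d\n", $1, $3 - $2, $5 - $4 }'
}
```

One possible invocation, assuming the column names below exist in your release: `openstack hypervisor list --long -f value -c "Hypervisor Hostname" -c "vCPUs Used" -c "vCPUs" -c "Memory MB Used" -c "Memory MB" | hypervisor_headroom`.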

Instances stuck in UNKNOWN state after evacuation

Cause: The compute database still associates the instance with the failed host. The evacuation may have only partially completed.

Resolution:

Force instance state to active

openstack server set --state active <instance-uuid>

If the instance remains stuck after state reset, manually evacuate:

Manual evacuation

openstack server evacuate <instance-uuid> --host <healthy-host>
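When several instances are affected, it can help to list which ones on the failed host are not yet active before resetting or evacuating them one by one. A minimal sketch, assuming input lines of the form `<instance-id> <status>`; `not_active_ids` is a hypothetical helper name.

```shell
# Hypothetical helper: read "<instance-id> <status>" lines from stdin and
# print the IDs of instances whose status is anything other than ACTIVE.
not_active_ids() {
  awk '$2 != "ACTIVE" { print $1 }'
}
```

For example: `openstack server list --all-projects --host <failed-host> -f value -c ID -c Status | not_active_ids`. Review the resulting list before acting on it.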

Host monitor not detecting failures

Cause: IPMI or SSH credentials are incorrect, the monitor cannot reach the management network, or a firewall is blocking the monitoring port.

Resolution:

Check host monitor logs

docker logs -f masakari_hostmonitor

Test IPMI connectivity manually:

Test IPMI connection

ipmitool -I lanplus \
  -H <ipmi-ip> \
  -U <username> \
  -P <password> \
  chassis status

Test SSH connectivity:

Test SSH connection

ssh -i /etc/xavs/instance-ha/id_rsa root@<compute-host> hostname

Confirm firewall rules permit UDP 623 (IPMI) and TCP 22 (SSH) from the Instance HA controller to all monitored hosts.
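The TCP 22 rule can be spot-checked from the controller without logging in. The sketch below is an assumption-laden example rather than a supported tool: `tcp_port_open` is a hypothetical helper that relies on bash's `/dev/tcp` feature and the coreutils `timeout` command. It cannot verify UDP 623; the `ipmitool` test above remains the practical check for IPMI.

```shell
# Hypothetical helper: return 0 if a TCP connection to <host> <port>
# succeeds within 3 seconds, non-zero otherwise.
# Usage: tcp_port_open <host> <port>
tcp_port_open() {
  # Inside "bash -c", $0 is the host and $1 is the port.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

For example, `tcp_port_open <compute-host> 22 || echo "SSH port unreachable"` reports a blocked or closed port.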

Engine fails to start

Cause: Database connectivity failure, Identity authentication error, or a configuration file syntax error.

Resolution:

Check engine startup logs

docker logs masakari_engine 2>&1 | grep -E "ERROR|CRITICAL"

Common log patterns and their resolution:

Log Message	Cause	Fix
`OperationalError: (pymysql)`	Database unreachable	Check DB connection string and service status
`Unauthorized: The request you have made requires authentication`	Invalid Identity credentials	Verify the service account credentials; reset the password if they are wrong
`ConfigFileNotFound`	Missing config file	Verify `/etc/xavs/instance-ha/instance-ha.conf` exists
`ImportError: No module named`	Missing Python dependency	Reinstall the Instance HA container image
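The table lookup can be automated for long logs. This is an illustrative sketch only: `classify_engine_errors` is a hypothetical helper, and its patterns are slightly loosened versions of the log messages in the table so that small variations in the surrounding text still match.

```shell
# Hypothetical helper: read engine log lines from stdin and print each
# cause from the table above that has at least one matching line.
classify_engine_errors() {
  awk '
    /OperationalError: \(pymysql/ { seen["Database unreachable"] = 1 }
    /Unauthorized: The request you have made requires authentication/ {
      seen["Invalid Identity credentials"] = 1
    }
    /ConfigFileNotFound/ { seen["Missing config file"] = 1 }
    /No module named/    { seen["Missing Python dependency"] = 1 }
    END { for (cause in seen) print cause }
  ' | sort   # sort for a stable output order
}
```

For example: `docker logs masakari_engine 2>&1 | classify_engine_errors`. An empty result means none of the listed patterns occurred, not that the log is free of errors.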

Notifications permanently stuck in running state

Cause: The recovery workflow has stalled. Either the engine is waiting for a Compute RPC call that never completes, or the Taskflow state machine is stuck.

Resolution:

Check engine logs for stalled workflows

docker logs masakari_engine 2>&1 | grep -E "stuck|timeout|waiting"

If a notification has been running for more than 15 minutes, manually reset it:

Reset stalled notification

openstack notification update \
  --status error \
  <notification-uuid>
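The 15-minute threshold can be checked mechanically before a reset. A minimal sketch, assuming GNU `date` (for the `-d` option) and a UTC timestamp copied from the notification; `older_than_15_min` is a hypothetical helper, and which notification field holds the relevant timestamp depends on your deployment.

```shell
# Hypothetical helper: return 0 if <timestamp> is more than 15 minutes
# (900 seconds) before <now_epoch>; <now_epoch> defaults to the current time.
# Usage: older_than_15_min <timestamp> [now_epoch]
older_than_15_min() {
  local started now
  # Parse the timestamp as UTC; return 2 if it cannot be parsed.
  started=$(date -u -d "$1" +%s) || return 2
  now=${2:-$(date -u +%s)}
  [ $(( now - started )) -gt 900 ]
}
```

For example, `older_than_15_min '2024-01-01T00:00:00' && echo "candidate for reset"` prints the message for any timestamp more than 15 minutes in the past.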

Then run a manual evacuation for any instances that were not recovered:

Manual evacuation

openstack server evacuate <instance-uuid>

Restart the engine after resolving the root cause:

Restart engine

docker restart masakari_engine

Engine Configuration

Tune engine timing parameters to reduce false positives and improve recovery speed.

Host Monitors

Validate and reconfigure IPMI and SSH monitor connectivity.

Failover Segments

Review segment configuration and host registration.

User Troubleshooting

Guide for project users experiencing individual instance recovery failures.
