
Overview

This guide covers administrator-level troubleshooting for Xloud Instance HA — from service startup failures to notification processing issues and capacity-related recovery failures. For user-facing issues such as individual instance recovery failures, see the Instance HA User Troubleshooting guide.
Several diagnostic commands in this guide inspect live recovery state. Run them on the controller node and avoid interfering with in-progress recovery workflows.

Common Issues

Recovery is not triggered after a host failure

Cause: The failover segment may be disabled, the failed host may not be registered in any segment, or the engine may not be running.

Resolution:
Check engine status
docker ps --filter name=masakari_engine
docker logs masakari_engine | tail -30
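The engine status check above can be scripted so other diagnostics bail out early when the container is down. A minimal sketch; the canned file stands in for live `docker ps` output, and the container name `masakari_engine` is taken from this guide:

```shell
# Minimal sketch: decide whether the engine container is up before digging
# deeper. Canned output stands in for the live command:
#   docker ps --filter name=masakari_engine --format '{{.Names}} {{.Status}}'
cat > engine_ps.txt <<'EOF'
masakari_engine Up 3 days
EOF

if grep -q '^masakari_engine Up' engine_ps.txt; then
  engine_state=running
else
  engine_state=down
fi
echo "engine: $engine_state"
```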
Check segment and host registration
openstack segment list
openstack segment host list <segment-uuid>
If the segment is disabled, re-enable it:
Re-enable segment
openstack segment update --enabled True <segment-uuid>
If the host is missing from the segment, register it:
Register host
openstack segment host create \
  --type COMPUTE \
  --control_attributes '{"host": "compute-01"}' \
  <segment-uuid>
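To confirm the registration took effect, check the segment's host list for the failed host. A minimal sketch; the canned file stands in for live output along the lines of `openstack segment host list <segment-uuid> -f value -c name` (exact column names may vary by release):

```shell
# Minimal sketch: check whether a host is registered in the segment.
# Canned data stands in for the live segment host list output.
cat > segment_hosts.txt <<'EOF'
compute-01
compute-02
EOF

host_registered() {
  grep -qx "$1" segment_hosts.txt
}

if host_registered compute-03; then
  echo "compute-03 is registered"
else
  echo "compute-03 is missing; recovery for it will not trigger"
fi
```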
Evacuation fails due to insufficient capacity

Cause: No healthy host in the segment has sufficient vCPU or memory to accept the evacuated instances.

Resolution:
Check host capacity
openstack host list --service compute
openstack host show <hostname>
Check per-host utilization
openstack hypervisor list --long
Add compute capacity or register additional hosts in the segment. For segments that use the reserved_host recovery method, verify that the reserved host has sufficient headroom:
Check reserved host utilization
openstack host show <reserved-hostname>
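Whether the segment as a whole has room for the failed host's instances can be estimated by summing free vCPU and memory across its healthy hosts. A minimal sketch; the canned file stands in for live output such as `openstack hypervisor list --long -f value -c 'Hypervisor Hostname' -c 'vCPUs Used' -c vCPUs -c 'Memory MB Used' -c 'Memory MB'` (column names are an assumption about your client version):

```shell
# Minimal sketch: sum free vCPU and memory headroom across segment hosts.
# Canned data stands in for live hypervisor utilization output; columns are
# hostname, vCPUs used, vCPUs total, memory MB used, memory MB total.
cat > hypervisors.txt <<'EOF'
compute-01 28 32 112640 131072
compute-02 10 32 40960 131072
EOF

headroom=$(awk '{
  free_vcpu += $3 - $2          # total vCPUs minus used vCPUs
  free_mb   += $5 - $4          # total memory minus used memory
} END {
  printf "free vCPUs: %d, free RAM MB: %d", free_vcpu, free_mb
}' hypervisors.txt)
echo "$headroom"
```

Compare the totals against the vCPU and memory footprint of the instances on the failed host before triggering recovery.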
Instance stuck in error state on the failed host

Cause: The compute database still associates the instance with the failed host; the evacuation may have completed only partially.

Resolution:
Force instance state to active
openstack server set --state active <instance-uuid>
If the instance remains stuck after state reset, manually evacuate:
Manual evacuation
openstack server evacuate <instance-uuid> --host <healthy-host>
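When several instances are stranded, the evacuation can be planned for all of them at once. A minimal, dry-run-guarded sketch; the canned file stands in for live output such as `openstack server list --all-projects --host "$failed_host" -f value -c ID`:

```shell
# Minimal sketch: plan re-evacuation of every instance still bound to the
# failed host. DRY_RUN=1 only prints the plan; canned data stands in for
# the live server list output.
DRY_RUN=1
failed_host=compute-01
cat > stranded.txt <<'EOF'
11111111-aaaa
22222222-bbbb
EOF

plan=""
while read -r uuid; do
  if [ "$DRY_RUN" = 1 ]; then
    plan="$plan would evacuate $uuid;"
  else
    openstack server evacuate "$uuid"
  fi
done < stranded.txt
echo "$plan"
```

Review the printed plan, then set DRY_RUN=0 to perform the evacuations.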
Host failures are not detected

Cause: IPMI or SSH credentials are incorrect, the monitor cannot reach the management network, or a firewall is blocking the monitoring port.

Resolution:
Check host monitor logs
docker logs -f masakari_hostmonitor
Test IPMI connectivity manually:
Test IPMI connection
ipmitool -I lanplus \
  -H <ipmi-ip> \
  -U <username> \
  -P <password> \
  chassis status
Test SSH connectivity:
Test SSH connection
ssh -i /etc/xavs/instance-ha/id_rsa root@<compute-host> hostname
Confirm firewall rules permit UDP 623 (IPMI) and TCP 22 (SSH) from the Instance HA controller to all monitored hosts.
Engine fails to start

Cause: Database connectivity failure, an Identity authentication error, or a configuration file syntax error.

Resolution:
Check engine startup logs
docker logs masakari_engine | grep -E "ERROR|CRITICAL"
Common log patterns and their resolution:
Log Message | Cause | Fix
OperationalError: (pymysql) | Database unreachable | Check the DB connection string and service status
Unauthorized: The request you have made requires authentication | Invalid Identity credentials | Rotate the service account password
ConfigFileNotFound | Missing config file | Verify /etc/xavs/instance-ha/instance-ha.conf exists
ImportError: No module named | Missing Python dependency | Reinstall the Instance HA container image
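The log patterns above can be matched mechanically when triaging many log lines. A minimal sketch; the function name and category labels are illustrative, not part of the product:

```shell
# Minimal sketch: classify an engine error line into one of the likely
# root causes listed in the table above.
classify_engine_error() {
  case "$1" in
    *'OperationalError: (pymysql)'*)  echo "database-unreachable" ;;
    *'Unauthorized:'*)                echo "identity-credentials" ;;
    *'ConfigFileNotFound'*)           echo "missing-config-file" ;;
    *'ImportError: No module named'*) echo "missing-dependency" ;;
    *)                                echo "unknown" ;;
  esac
}

# Typical use against live logs:
#   docker logs masakari_engine 2>&1 | grep -E "ERROR|CRITICAL" \
#     | while read -r line; do classify_engine_error "$line"; done
classify_engine_error "2024-05-01 ERROR OperationalError: (pymysql) ..."
```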
Notification stuck in running status

Cause: The recovery workflow has stalled: the engine is waiting for a Compute RPC call that never completes, or the Taskflow state machine is stuck.

Resolution:
Check engine logs for stalled workflows
docker logs masakari_engine | grep -E "stuck|timeout|waiting"
If a notification has been running for more than 15 minutes, manually reset it:
Reset stalled notification
openstack notification update \
  --status error \
  <notification-uuid>
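Finding which notifications have exceeded the 15-minute threshold can be scripted. A minimal sketch using GNU date; the canned file and fixed clock stand in for live output such as `openstack notification list -f value -c notification_uuid -c status -c created_at` (column names and timestamp format are assumptions about your client version):

```shell
# Minimal sketch: list notifications stuck in "running" beyond a threshold.
# Canned data and a fixed "now" stand in for the live notification list.
now=$(date -u -d '2024-05-01T12:00:00' +%s)   # fixed clock for the sample
threshold=900                                  # 15 minutes, in seconds
cat > notifications.txt <<'EOF'
aaaa-1111 running 2024-05-01T11:30:00
bbbb-2222 finished 2024-05-01T11:58:00
cccc-3333 running 2024-05-01T11:55:00
EOF

stalled=""
while read -r uuid status created; do
  if [ "$status" = running ]; then
    age=$(( now - $(date -u -d "$created" +%s) ))
    if [ "$age" -gt "$threshold" ]; then
      stalled="$stalled$uuid(${age}s) "
    fi
  fi
done < notifications.txt
echo "stalled: $stalled"
```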
Then run a manual evacuation for any instances that were not recovered:
Manual evacuation
openstack server evacuate <instance-uuid>
Restart the engine after resolving the root cause:
Restart engine
docker restart masakari_engine

Diagnostic Commands Reference

Check all Instance HA service container statuses
docker ps --filter name=masakari
View engine logs (last 100 lines)
docker logs --tail 100 masakari_engine
List all notifications with status
openstack notification list -f table -c uuid -c hostname -c type -c status
Show full notification payload
openstack notification show <notification-uuid> -f json
Check Compute service status for all hosts
openstack compute service list --service nova-compute
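The notification list above can be rolled up by status to spot a backlog at a glance. A minimal sketch; the canned file stands in for live output such as `openstack notification list -f value -c notification_uuid -c type -c status` (column names are an assumption about your client version):

```shell
# Minimal sketch: count notifications by status; a growing "running" or
# "error" count points at the stalled-workflow and engine issues above.
cat > notif_status.txt <<'EOF'
aaaa COMPUTE_HOST finished
bbbb COMPUTE_HOST error
cccc VM running
dddd COMPUTE_HOST error
EOF

rollup=$(awk '{ n[$3]++ } END { for (s in n) printf "%s=%d\n", s, n[s] }' notif_status.txt | sort)
echo "$rollup"
```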

Next Steps

Engine Configuration

Tune engine timing parameters to reduce false positives and improve recovery speed.

Host Monitors

Validate and reconfigure IPMI and SSH monitor connectivity.

Failover Segments

Review segment configuration and host registration.

User Troubleshooting

Guide for project users experiencing individual instance recovery failures.