> ## Documentation Index > Fetch the complete documentation index at: https://docs.xloud.tech/llms.txt > Use this file to discover all available pages before exploring further. # Instance HA Admin Troubleshooting > Diagnose Instance HA platform issues — engine failures, monitor connectivity, and notification delivery problems. ## Overview This guide covers administrator-level troubleshooting for Xloud Instance HA — from service startup failures to notification processing issues and capacity-related recovery failures. For user-facing issues such as individual instance recovery failures, see the [Instance HA User Troubleshooting](/services/instance-ha/user-guide/troubleshooting) guide. Several diagnostic commands in this guide inspect live recovery state. Run them on the controller node and avoid interfering with in-progress recovery workflows. *** ## Common Issues **Cause**: The segment may be disabled, or the failed host is not registered in any segment. Also occurs if the engine is not running. **Resolution**: ```bash title="Check engine status" theme={null} docker ps --filter name=masakari_engine docker logs masakari_engine | tail -30 ``` ```bash title="Check segment and host registration" theme={null} openstack segment list openstack segment host list ``` If the segment is disabled, re-enable it: ```bash title="Re-enable segment" theme={null} openstack segment update --enabled True ``` If the host is missing from the segment, register it: ```bash title="Register host" theme={null} openstack segment host create \ --type COMPUTE \ --control_attributes '{"host": "compute-01"}' \ ``` **Cause**: No healthy host in the segment has sufficient vCPU or memory to accept the evacuated instances. **Resolution**: ```bash title="Check host capacity" theme={null} openstack host list --service compute openstack host show ``` ```bash title="Check per-host utilization" theme={null} openstack hypervisor list --long ``` Add compute capacity or add additional hosts to the segment. For `reserved_host` segments, verify the reserved host has sufficient headroom: ```bash title="Check reserved host utilization" theme={null} openstack host show ``` **Cause**: The compute database still associates the instance with the failed host. The evacuation may have been partially completed. **Resolution**: ```bash title="Force instance state to active" theme={null} openstack server set --state active ``` If the instance remains stuck after state reset, manually evacuate: ```bash title="Manual evacuation" theme={null} openstack server evacuate --host ``` **Cause**: IPMI or SSH credentials are incorrect, the monitor cannot reach the management network, or a firewall is blocking the monitoring port. **Resolution**: ```bash title="Check host monitor logs" theme={null} docker logs -f masakari_hostmonitor ``` Test IPMI connectivity manually: ```bash title="Test IPMI connection" theme={null} ipmitool -I lanplus \ -H \ -U \ -P \ chassis status ``` Test SSH connectivity: ```bash title="Test SSH connection" theme={null} ssh -i /etc/xavs/instance-ha/id_rsa root@ hostname ``` Confirm firewall rules permit UDP 623 (IPMI) and TCP 22 (SSH) from the Instance HA controller to all monitored hosts. **Cause**: Database connectivity failure, Identity authentication error, or a configuration file syntax error. **Resolution**: ```bash title="Check engine startup logs" theme={null} docker logs masakari_engine | grep -E "ERROR|CRITICAL" ``` Common log patterns and their resolution: | Log Message | Cause | Fix | | ----------------------------------------------------------------- | ---------------------------- | ------------------------------------------------------ | | `OperationalError: (pymysql)` | Database unreachable | Check DB connection string and service status | | `Unauthorized: The request you have made requires authentication` | Invalid Identity credentials | Rotate service account password | | `ConfigFileNotFound` | Missing config file | Verify `/etc/xavs/instance-ha/instance-ha.conf` exists | | `ImportError: No module named` | Missing Python dependency | Reinstall the Instance HA container image | **Cause**: The recovery workflow has stalled — the engine is waiting for a Compute RPC call that never completes, or the Taskflow state machine is stuck. **Resolution**: ```bash title="Check engine logs for stalled workflows" theme={null} docker logs masakari_engine | grep -E "stuck|timeout|waiting" ``` If a notification has been `running` for more than 15 minutes, manually reset it: ```bash title="Reset stalled notification" theme={null} openstack notification update \ --status error \ ``` Then run a manual evacuation for any instances that were not recovered: ```bash title="Manual evacuation" theme={null} openstack server evacuate ``` Restart the engine after resolving the root cause: ```bash title="Restart engine" theme={null} docker restart masakari_engine ``` *** ## Diagnostic Commands Reference ```bash title="Check all Instance HA service container statuses" theme={null} docker ps --filter name=masakari ``` ```bash title="View engine logs (last 100 lines)" theme={null} docker logs --tail 100 masakari_engine ``` ```bash title="List all notifications with status" theme={null} openstack notification list -f table -c uuid -c hostname -c type -c status ``` ```bash title="Show full notification payload" theme={null} openstack notification show -f json ``` ```bash title="Check Compute service status for all hosts" theme={null} openstack compute service list --service nova-compute ``` *** ## Next Steps Tune engine timing parameters to reduce false positives and improve recovery speed. Validate and reconfigure IPMI and SSH monitor connectivity. Review segment configuration and host registration. Guide for project users experiencing individual instance recovery failures.