Overview
This guide covers administrator-level troubleshooting for Xloud Instance HA — from service startup failures to notification processing issues and capacity-related recovery failures. For user-facing issues such as individual instance recovery failures, see the Instance HA User Troubleshooting guide.Common Issues
Notification received but no recovery triggered
Notification received but no recovery triggered
Cause: The segment may be disabled, or the failed host is not registered in
any segment. Also occurs if the engine is not running.Resolution:If the segment is disabled, re-enable it:If the host is missing from the segment, register it:
Check engine status
Check segment and host registration
Re-enable segment
Register host
Recovery fails with capacity error
Recovery fails with capacity error
Cause: No healthy host in the segment has sufficient vCPU or memory to
accept the evacuated instances.Resolution:Add compute capacity or add additional hosts to the segment. For
Check host capacity
Check per-host utilization
reserved_host
segments, verify the reserved host has sufficient headroom:Check reserved host utilization
Instances stuck in UNKNOWN state after evacuation
Instances stuck in UNKNOWN state after evacuation
Cause: The compute database still associates the instance with the failed host.
The evacuation may have been partially completed.Resolution:If the instance remains stuck after state reset, manually evacuate:
Force instance state to active
Manual evacuation
Host monitor not detecting failures
Host monitor not detecting failures
Cause: IPMI or SSH credentials are incorrect, the monitor cannot reach the
management network, or a firewall is blocking the monitoring port.Resolution:Test IPMI connectivity manually:Test SSH connectivity:Confirm firewall rules permit UDP 623 (IPMI) and TCP 22 (SSH) from the
Instance HA controller to all monitored hosts.
Check host monitor logs
Test IPMI connection
Test SSH connection
Engine fails to start
Engine fails to start
Cause: Database connectivity failure, Identity authentication error, or a
configuration file syntax error.Resolution:Common log patterns and their resolution:
Check engine startup logs
| Log Message | Cause | Fix |
|---|---|---|
OperationalError: (pymysql) | Database unreachable | Check DB connection string and service status |
Unauthorized: The request you have made requires authentication | Invalid Identity credentials | Rotate service account password |
ConfigFileNotFound | Missing config file | Verify /etc/xavs/instance-ha/instance-ha.conf exists |
ImportError: No module named | Missing Python dependency | Reinstall the Instance HA container image |
Notifications permanently stuck in running state
Notifications permanently stuck in running state
Cause: The recovery workflow has stalled — the engine is waiting for a Compute
RPC call that never completes, or the Taskflow state machine is stuck.Resolution:If a notification has been Then run a manual evacuation for any instances that were not recovered:Restart the engine after resolving the root cause:
Check engine logs for stalled workflows
running for more than 15 minutes, manually reset it:Reset stalled notification
Manual evacuation
Restart engine
Diagnostic Commands Reference
Check all Instance HA service container statuses
View engine logs (last 100 lines)
List all notifications with status
Show full notification payload
Check Compute service status for all hosts
Next Steps
Engine Configuration
Tune engine timing parameters to reduce false positives and improve recovery speed.
Host Monitors
Validate and reconfigure IPMI and SSH monitor connectivity.
Failover Segments
Review segment configuration and host registration.
User Troubleshooting
Guide for project users experiencing individual instance recovery failures.