Overview
This page covers common Instance HA issues encountered by project users: instances that did not recover, notifications stuck in error or running states, and protection settings that are not visible in the Dashboard. For platform-level issues such as monitor failures or engine misconfiguration, refer to the Instance HA Admin Troubleshooting guide.
Prerequisites
- Project access to the Xloud Dashboard or CLI
- Knowledge of the affected instance IDs and the compute host involved
Common Issues
Instance did not recover after host failure
Cause: Instance HA protection may not be enabled for the instance, the compute host may not be registered in any segment, or the segment may be disabled.
Resolution: Confirm that the segment is enabled and that the host is registered. Check the segment status, then list the hosts in the segment.
If the failed host is missing from the segment, contact your administrator to register it. Your administrator can configure this through XDeploy. If the segment is disabled (enabled: False), your administrator must re-enable it.
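The checks in this section can be read as a small decision procedure. The following is a minimal sketch of that logic, not part of any Xloud tooling: the function name `diagnose_no_recovery` and the dictionary keys `name`, `enabled`, and `hosts` are assumptions chosen for illustration, standing in for whatever the segment status and host list show in your environment.

```python
def diagnose_no_recovery(failed_host, segments):
    """Return a likely reason why an instance on failed_host was not recovered.

    segments is a list of dicts with hypothetical keys:
      "name" (str), "enabled" (bool), "hosts" (list of host names).
    """
    # Find every segment in which the failed host is registered.
    containing = [s for s in segments if failed_host in s["hosts"]]
    if not containing:
        return "host not registered in any segment: ask your administrator to register it"
    # The host is registered, but recovery only runs for enabled segments.
    if not any(s["enabled"] for s in containing):
        return "segment disabled: ask your administrator to re-enable it"
    return "segment and host look correct: check that protection is enabled for the instance"


# Example data, invented for illustration.
segments = [
    {"name": "segment-a", "enabled": False, "hosts": ["compute-01", "compute-02"]},
    {"name": "segment-b", "enabled": True, "hosts": ["compute-03"]},
]
print(diagnose_no_recovery("compute-01", segments))
print(diagnose_no_recovery("compute-09", segments))
```

The order of the checks mirrors the resolution text: a missing host registration is reported before a disabled segment, because a host that is in no segment cannot be affected by the segment's enabled flag.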
Notification status is error
Cause: Automatic recovery failed. Common causes are insufficient capacity on the remaining hosts, an instance stuck in ERROR state that the engine cannot recover, or a network issue during evacuation.
Resolution: Show the notification detail and review the payload for specific failure information. Then check the instance status.
If the instance is in ERROR state, attempt a manual reset and evacuation: reset the instance state, then manually evacuate the instance.
Contact your administrator if the instance cannot be recovered through the above steps. Your administrator can configure this through XDeploy.
Notification stuck in running state
Cause: The recovery engine is waiting for the target host to accept the instance, a workflow step has timed out, or the target host is under load.
Resolution: Check how long the notification has been in running state by showing the notification timestamps.
If the notification has been running for more than 10 minutes, contact your administrator. They can inspect the Instance HA engine logs and reset the workflow if it has genuinely stalled. Your administrator can configure this through XDeploy.
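The 10-minute rule can be applied mechanically once you have the notification's status and its last-updated timestamp. The sketch below assumes the timestamp is available as an ISO 8601 string with a UTC offset; the names `is_stuck`, `status`, and `updated_at` are illustrative assumptions, not confirmed field names from the Xloud CLI or Dashboard.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the guidance above: escalate after 10 minutes in running state.
STUCK_AFTER = timedelta(minutes=10)


def is_stuck(status, updated_at, now=None):
    """Return True if a notification has been running longer than STUCK_AFTER.

    status and updated_at are hypothetical field names; updated_at is an
    ISO 8601 timestamp string that includes a UTC offset.
    """
    if status != "running":
        return False
    now = now or datetime.now(timezone.utc)
    last_update = datetime.fromisoformat(updated_at)
    return now - last_update > STUCK_AFTER


# Fixed "now" so the example is reproducible.
now = datetime(2024, 1, 1, 12, 15, tzinfo=timezone.utc)
print(is_stuck("running", "2024-01-01T12:00:00+00:00", now))  # 15 minutes in running
print(is_stuck("running", "2024-01-01T12:10:00+00:00", now))  # 5 minutes in running
```

A result of True only indicates that it is time to contact your administrator; it does not prove that the workflow has stalled.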
Protection segment not visible in Dashboard
Cause: No failover segment has been created for your environment, or the segments that exist have not been made accessible to your project.
Resolution: Contact your Xloud administrator and request that a failover segment be created and that the compute hosts used by your project be registered. Your administrator can configure this through XDeploy. See the Instance HA Admin Guide (Failover Segments) for the configuration steps.
Instance shows UNKNOWN status after host failure
Cause: The compute service cannot confirm the instance state because the host is unreachable. This is expected immediately after a host fault is detected.
Resolution: Wait for the recovery workflow to complete. The instance transitions through UNKNOWN → MIGRATING → BUILD → ACTIVE as recovery proceeds.
If the instance remains UNKNOWN for more than 5 minutes without a recovery notification being created, verify that the failed host is registered in an enabled segment and that the host monitor can reach the Instance HA notification endpoint.
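If you record the instance status while waiting, the expected sequence and the 5-minute rule can be checked with a few lines of code. This is a sketch under stated assumptions: the helper names `follows_recovery_order` and `needs_investigation` are hypothetical, and the status samples are assumed to be collected by you, for example by polling the instance status at intervals.

```python
from datetime import timedelta

# Expected order of instance statuses during recovery, as listed above.
RECOVERY_ORDER = ["UNKNOWN", "MIGRATING", "BUILD", "ACTIVE"]
UNKNOWN_GRACE = timedelta(minutes=5)


def follows_recovery_order(observed):
    """Return True if every observed status is expected and none moves backwards."""
    if any(status not in RECOVERY_ORDER for status in observed):
        return False
    positions = [RECOVERY_ORDER.index(status) for status in observed]
    return positions == sorted(positions)


def needs_investigation(status, time_in_status, notification_exists):
    """Apply the 5-minute rule: UNKNOWN for too long with no recovery notification."""
    return (
        status == "UNKNOWN"
        and time_in_status > UNKNOWN_GRACE
        and not notification_exists
    )


print(follows_recovery_order(["UNKNOWN", "UNKNOWN", "MIGRATING", "BUILD", "ACTIVE"]))
print(needs_investigation("UNKNOWN", timedelta(minutes=6), notification_exists=False))
```

Repeated samples of the same status are allowed, since polling will usually see each state more than once; any status outside the expected list, such as ERROR, makes the sequence check fail.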
Instance restarted on wrong host
Cause: The auto recovery method placed the instance on any available host rather than a specific preferred host. This is expected behaviour for the auto method.
Resolution: If your workloads require guaranteed placement on specific hosts, ask your administrator to configure a segment with the reserved_host or rh_priority recovery method and designate the preferred host as a reserved standby.
Manual Recovery Procedure
If automatic recovery fails, use the following procedure to restore service manually.
Evacuate to a healthy host
Evacuate the instance to a specific host by passing --host. Omit --host to let the scheduler choose any available host in the segment.
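The difference between the two evacuation variants is only whether --host is passed. The sketch below builds the command line without executing it. It assumes an OpenStack-compatible CLI in which `openstack server evacuate` accepts an optional `--host` flag; this page does not show the exact command, so confirm the command name for your environment before running anything. The function name and the instance and host values are hypothetical.

```python
def build_evacuate_command(instance_id, target_host=None):
    """Build an evacuate command line as a list of arguments; nothing is executed."""
    # ASSUMPTION: "openstack server evacuate" is the CLI entry point in this environment.
    command = ["openstack", "server", "evacuate"]
    if target_host is not None:
        # Pin the instance to a specific host.
        command += ["--host", target_host]
    # Without --host, the scheduler chooses any available host in the segment.
    command.append(instance_id)
    return command


# Placeholder values for illustration only.
print(" ".join(build_evacuate_command("my-instance", target_host="compute-02")))
print(" ".join(build_evacuate_command("my-instance")))
```

Building the arguments as a list rather than a single string keeps the instance ID and host name as separate arguments, which avoids quoting problems if the command is later passed to a process launcher.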
Next Steps
Monitoring Status
Track live and historical recovery notifications.
Recovery Workflows
Understand recovery methods and workflow stages.
Instance HA Admin Troubleshooting
Administrator-level troubleshooting for monitors, engine failures, and capacity issues.
Compute User Guide
Manage and recover instances manually using core compute operations.