Instance HA Troubleshooting — User Guide

Instance did not recover after host failure

Cause: Instance HA protection may not be enabled for the instance, the compute host may not be registered in any segment, or the segment may be disabled.

Resolution: Confirm the segment is enabled and the host is registered:

Check segment status

openstack segment list

List hosts in segment

openstack segment host list <segment-uuid>

If the failed host is missing from the segment, contact your administrator to register it. Your administrator can configure this through XDeploy. If the segment is disabled (enabled: False), your administrator must re-enable it.
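The two checks above can be combined into one helper that reports which segment, if any, contains a given host. This is a minimal sketch, assuming a Bash shell and that the uuid and name output columns are available in your version of the CLI plugin. The function name find_host_segment and the host name compute-03 are illustrative and not part of the product.

```shell
#!/usr/bin/env bash
# Sketch: report which segment (if any) contains a given compute host.
# Assumption: the "uuid" and "name" output columns exist in your CLI version.

find_host_segment() {
  local host="$1" segment
  # Iterate over every segment UUID visible to the current project.
  for segment in $(openstack segment list -f value -c uuid); do
    # grep -Fxq: fixed-string, whole-line, quiet match on the host name.
    if openstack segment host list "$segment" -f value -c name \
        | grep -Fxq -- "$host"; then
      echo "$segment"
      return 0
    fi
  done
  echo "host $host is not registered in any visible segment" >&2
  return 1
}

# Example (hypothetical host name):
#   find_host_segment compute-03
```

A non-zero exit status means the host was not found in any segment you can see, which is the case to raise with your administrator.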

After the root cause is resolved, manually evacuate the instance to restore service:

Manual evacuation

openstack server evacuate <instance-id> --host <healthy-host>

Notification status is error

Cause: Automatic recovery failed. Common causes include insufficient capacity on the remaining hosts, an instance stuck in ERROR state that the engine cannot recover, or a network issue during evacuation.

Resolution:

Show notification detail

openstack notification show <notification-uuid>

Review the payload for specific failure information. Then check instance state:

Show instance status

openstack server show <instance-id> -f value -c status

If the instance is in ERROR state, attempt a manual reset and evacuation:

Reset instance state

openstack server set --state active <instance-id>

Manually evacuate

openstack server evacuate <instance-id> --host <healthy-host>

Contact your administrator if the instance cannot be recovered through the above steps. Your administrator can configure this through XDeploy.
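The status check, conditional reset, and evacuation above can be chained so that the reset runs only when the instance is actually in ERROR. This is a minimal sketch, assuming a Bash shell. The function name recover_instance is illustrative, and resetting the instance state usually requires administrator privileges, so the reset step may be rejected for ordinary users.

```shell
#!/usr/bin/env bash
# Sketch: reset an instance only if it is in ERROR, then evacuate it.

recover_instance() {
  local instance="$1" target_host="$2" status
  status=$(openstack server show "$instance" -f value -c status) || return 1
  echo "current status: $status" >&2

  if [ "$status" = "ERROR" ]; then
    # Assumption: your role is allowed to reset the state; this step is
    # normally restricted to administrators.
    openstack server set --state active "$instance" || return 1
  fi

  # Evacuate to the host chosen by the caller.
  openstack server evacuate "$instance" --host "$target_host"
}

# Example (placeholders as in the commands above):
#   recover_instance <instance-id> <healthy-host>
```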

Notification stuck in running state

Cause: The recovery engine is waiting for the target host to accept the instance, a workflow step has timed out, or the target host is under load.

Resolution: Check how long the notification has been in running state:

Show notification timestamps

openstack notification show <notification-uuid> \
  -f value -c generated_time -c status

If the notification has been running for more than 10 minutes, contact your administrator. They can inspect the Instance HA engine logs and reset the workflow if it has genuinely stalled. Your administrator can configure this through XDeploy.
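To compare the notification's generated_time against the 10-minute threshold without doing the arithmetic by hand, a helper along these lines may be useful. This is a sketch only. It assumes GNU date (the -d option is not available in BSD or macOS date) and that the timestamp is expressed in UTC. The function name minutes_since is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: minutes elapsed since a notification's generated_time.
# Assumptions: GNU date is installed and the timestamp is in UTC.

minutes_since() {
  local ts="$1" now="${2:-$(date -u +%s)}" start
  ts="${ts%%.*}"   # drop fractional seconds, if present
  ts="${ts/T/ }"   # "2024-01-01T00:00:00" -> "2024-01-01 00:00:00"
  start=$(date -u -d "$ts" +%s) || return 1
  echo $(( (now - start) / 60 ))
}

# Example:
#   generated=$(openstack notification show <notification-uuid> \
#     -f value -c generated_time)
#   [ "$(minutes_since "$generated")" -gt 10 ] && echo "running for over 10 minutes"
```

The optional second argument is the current time as an epoch value; it exists mainly so the function can be checked against a fixed clock.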

Protection segment not visible in Dashboard

Cause: No failover segment has been created for your environment, or the segments that exist have not been made accessible to your project.

Resolution: Contact your Xloud administrator and request that a failover segment be created and that the compute hosts used by your project be registered in it. Your administrator can configure this through XDeploy. See Failover Segments in the Instance HA Admin Guide for the configuration steps.

Instance shows UNKNOWN status after host failure

Cause: The compute service cannot confirm the instance state because the host is unreachable. This is expected immediately after a host fault is detected.

Resolution: Wait for the recovery workflow to complete. The instance transitions from UNKNOWN → MIGRATING → BUILD → ACTIVE as recovery proceeds.

If the instance remains UNKNOWN for more than 5 minutes without a recovery notification being created, verify that the failed host is registered in an enabled segment and that the host monitor can reach the Instance HA notification endpoint.
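Waiting for the recovery workflow can be done with a simple polling loop rather than repeated manual checks. This is a minimal sketch, assuming a Bash shell. The function name wait_for_active, the default 300-second timeout (matching the 5-minute guideline above), and the 10-second interval are illustrative choices, not product defaults.

```shell
#!/usr/bin/env bash
# Sketch: poll an instance until it reaches ACTIVE or a timeout expires.
# Returns 0 on ACTIVE, 2 on ERROR, 1 if the timeout expires first.

wait_for_active() {
  local instance="$1" timeout="${2:-300}" interval="${3:-10}"
  local waited=0 status
  while [ "$waited" -le "$timeout" ]; do
    status=$(openstack server show "$instance" -f value -c status)
    echo "status after ${waited}s: $status" >&2
    case "$status" in
      ACTIVE) return 0 ;;
      ERROR)  return 2 ;;   # recovery failed; stop waiting
    esac
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  return 1   # still not ACTIVE when the timeout expired
}

# Example:
#   wait_for_active <instance-id> 300 10 || echo "instance did not recover in time"
```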

Instance restarted on wrong host

Cause: The auto recovery method placed the instance on any available host rather than a specific preferred host. This is expected behaviour for the auto method.

Resolution: If your workloads require guaranteed placement on specific hosts, ask your administrator to configure a segment with the reserved_host or rh_priority recovery method and to designate the preferred host as a reserved standby.
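Before raising a request, it can help to confirm which recovery method the segment actually uses. This is a short sketch, assuming a Bash shell and that your project is allowed to view the segment. The function name segment_uses_auto is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: succeed (exit 0) if a segment uses the "auto" recovery method.

segment_uses_auto() {
  local method
  method=$(openstack segment show "$1" -f value -c recovery_method) || return 2
  echo "recovery_method: $method" >&2
  [ "$method" = "auto" ]
}

# Example:
#   segment_uses_auto <segment-uuid> && echo "placement on any available host is expected"
```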

Next Steps

Monitoring Status

Track live and historical recovery notifications.

Recovery Workflows

Understand recovery methods and workflow stages.

Instance HA Admin Troubleshooting

Administrator-level troubleshooting for monitors, engine failures, and capacity issues.

Compute User Guide

Manage and recover instances manually using core compute operations.
