Overview

A recovery workflow is the ordered sequence of actions the Instance HA engine takes after a host failure notification is received. The workflow covers instance evacuation, restart on a healthy host, and post-recovery status reporting. Understanding the workflow helps you interpret recovery notifications and set appropriate expectations for recovery time.
Prerequisites
  • Instance HA protection enabled on your instances
  • At least one failover segment configured with registered hosts

Recovery Workflow Stages


Recovery Methods in Detail

auto — Evacuate to Any Host

The auto method selects the healthiest available host in the segment based on current vCPU and memory availability. Instances are distributed across multiple target hosts if no single host has sufficient capacity for all evacuees.
Characteristics:
  • No pre-reserved capacity required
  • Recovery succeeds as long as aggregate free capacity in the segment is sufficient
  • Most flexible option for mixed workloads
Risk: Recovery may fail if all remaining hosts are near capacity when the fault occurs. Maintain a minimum headroom of 20–30% unused capacity across the segment.
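
The capacity-based selection described above can be sketched in a few lines of Python. This is an illustration only, not the engine's actual algorithm: the `Host` and `Instance` types, the `place_instances` function, and the ranking rule (most free memory first, then most free vCPUs) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Host:
    name: str
    free_vcpus: int
    free_ram_mb: int


@dataclass
class Instance:
    name: str
    vcpus: int
    ram_mb: int


def place_instances(instances: List[Instance], hosts: List[Host]) -> Optional[Dict[str, str]]:
    """Greedy sketch: put each instance on the fitting host with the most free capacity.

    Returns a mapping of instance name -> host name, or None when some
    instance does not fit anywhere (the case the Risk note warns about).
    """
    # Work on copies so the caller's host objects are not modified.
    pool = [Host(h.name, h.free_vcpus, h.free_ram_mb) for h in hosts]
    placement: Dict[str, str] = {}
    # Place the largest instances first; they are the hardest to fit.
    for inst in sorted(instances, key=lambda i: (i.ram_mb, i.vcpus), reverse=True):
        candidates = [
            h for h in pool
            if h.free_vcpus >= inst.vcpus and h.free_ram_mb >= inst.ram_mb
        ]
        if not candidates:
            return None
        target = max(candidates, key=lambda h: (h.free_ram_mb, h.free_vcpus))
        target.free_vcpus -= inst.vcpus
        target.free_ram_mb -= inst.ram_mb
        placement[inst.name] = target.name
    return placement


# Hypothetical segment with two surviving hosts and three evacuees.
hosts = [Host("host-b", 8, 16384), Host("host-c", 6, 12288)]
evacuees = [
    Instance("web-1", 4, 8192),
    Instance("web-2", 2, 4096),
    Instance("db-1", 4, 8192),
]
print(place_instances(evacuees, hosts))
```

In this sketch the evacuees end up spread over both hosts because neither host alone can hold all three, and the function returns `None` as soon as one instance cannot be placed anywhere.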

In the reserved-host method, one or more hosts in the segment are designated as reserved standby nodes. These hosts remain idle until a failover event occurs, which guarantees capacity for recovery.
Characteristics:
  • Guaranteed recovery capacity regardless of current cluster load
  • Reserved hosts do not accept regular instance scheduling
  • Higher infrastructure cost (idle nodes consume resources)
Best for: Mission-critical applications, financial systems, and workloads with strict RTO requirements.

In the combined method, the engine attempts recovery to reserved hosts first. If reserved hosts are full, it falls back to the auto behaviour and selects any available host in the segment.
Characteristics:
  • Balances guaranteed capacity for high-priority workloads with flexibility
  • Works well in mixed segments that contain both critical and standard workloads
  • Requires at least one reserved host in the segment
Best for: Environments with heterogeneous workloads where some instances need guaranteed failover and others can tolerate best-effort recovery.
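
The reserved-first fallback can be sketched as a two-tier lookup. The function names, the dictionaries of free memory per host, and the memory-only capacity check are assumptions chosen to keep the example short; the real engine also considers vCPUs and may rank hosts differently.

```python
from typing import Dict, Optional, Tuple


def pick_host(needed_mb: int, free_mb_by_host: Dict[str, int]) -> Optional[str]:
    """Return the host with the most free memory that can fit the instance."""
    fitting = {host: free for host, free in free_mb_by_host.items() if free >= needed_mb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


def choose_target(
    needed_mb: int,
    reserved: Dict[str, int],
    regular: Dict[str, int],
) -> Optional[Tuple[str, str]]:
    """Try reserved hosts first, then fall back to any regular host."""
    host = pick_host(needed_mb, reserved)
    if host is not None:
        return host, "reserved"
    host = pick_host(needed_mb, regular)
    if host is not None:
        return host, "auto"
    return None  # no capacity anywhere in the segment


# Hypothetical segment: one standby node and two regular hosts.
reserved_hosts = {"standby-1": 4096}
regular_hosts = {"host-b": 16384, "host-c": 8192}
print(choose_target(2048, reserved_hosts, regular_hosts))
print(choose_target(8192, reserved_hosts, regular_hosts))
```

The second call illustrates the fallback: the instance is too large for the standby node, so the sketch selects a regular host instead.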

Instance State During Recovery

| Phase | Instance Status | Description |
|---|---|---|
| Normal operation | ACTIVE | Instance running on original host |
| Fault detected | UNKNOWN | Host unreachable; compute service cannot confirm instance state |
| Evacuation in progress | MIGRATING | Instance being moved to target host |
| Restarting | BUILD | Instance starting up on target host |
| Recovery complete | ACTIVE | Instance fully operational on new host |
| Recovery failed | ERROR | Manual intervention required |

Instances in ERROR or SHUTOFF state at the time of the host failure may not be automatically recovered, depending on your administrator’s configuration. Verify your administrator has enabled the recover_ignoring_error_instances setting if needed.
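
If you record the statuses an instance passes through, the phases listed above can be turned into a simple summary. The sketch below assumes you have collected a chronological list of status strings yourself; the function name and the returned labels are hypothetical.

```python
from typing import List

# Statuses from the phase list that indicate recovery is under way.
IN_RECOVERY = {"UNKNOWN", "MIGRATING", "BUILD"}


def recovery_outcome(observed: List[str]) -> str:
    """Summarise a chronological list of observed instance statuses."""
    if not observed:
        return "no data"
    last = observed[-1]
    if last == "ERROR":
        return "failed: manual intervention required"
    if last in IN_RECOVERY:
        return "in progress"
    if last == "ACTIVE":
        # ACTIVE after a recovery status means the instance came back up.
        if any(status in IN_RECOVERY for status in observed[:-1]):
            return "recovered"
        return "normal operation"
    return "unrecognised status: " + last


print(recovery_outcome(["ACTIVE", "UNKNOWN", "MIGRATING", "BUILD", "ACTIVE"]))
```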

Recovery Time Expectations

Recovery time depends on several factors:
| Factor | Typical Impact |
|---|---|
| Host monitor detection timeout | 30–120 seconds to declare host unreachable |
| Instance count on failed host | Each instance adds 30–120 seconds to total recovery time |
| Instance disk size (shared storage) | Minimal; shared storage volumes are reattached, not copied |
| Target host boot overhead | Constant per instance; determined by instance flavor and image |

Use shared storage (Xloud Distributed Storage) for all protected instances. Instances backed by local ephemeral disk cannot be evacuated and will be lost on host failure.
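
For a rough planning figure, the detection timeout and the per-instance cost quoted above can be added together. The sketch below assumes the per-instance cost accumulates one instance after another, which is how the figures are stated; actual times depend on your environment, and the function name is hypothetical.

```python
from typing import Tuple


def estimate_recovery_seconds(
    instance_count: int,
    detection_range: Tuple[int, int] = (30, 120),
    per_instance_range: Tuple[int, int] = (30, 120),
) -> Tuple[int, int]:
    """Return a (best case, worst case) estimate in seconds."""
    if instance_count < 0:
        raise ValueError("instance_count must not be negative")
    low = detection_range[0] + instance_count * per_instance_range[0]
    high = detection_range[1] + instance_count * per_instance_range[1]
    return low, high


low, high = estimate_recovery_seconds(5)
print(f"5 instances: between {low} and {high} seconds")
```

With five instances on the failed host, this gives 30 + 5 × 30 = 180 seconds at best and 120 + 5 × 120 = 720 seconds at worst.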

Notification Status Reference

Every recovery event creates a notification record. The notification status field tracks progress through the workflow.
| Status | Meaning |
|---|---|
| new | Fault notification received; recovery not yet started |
| running | Recovery workflow in progress |
| finished | All instances recovered successfully |
| error | Recovery failed for one or more instances |
| ignored | Notification was de-duplicated or the segment was disabled |
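
A script that waits for a recovery to complete can poll the status field until it leaves the new and running states. The sketch below is a generic polling loop under stated assumptions: `fetch_status` is a hypothetical callable you supply (this page does not describe an API for retrieving notifications), and finished, error, and ignored are treated as final statuses.

```python
import time
from typing import Callable

# Assumed to be final: no further workflow progress is expected after these.
TERMINAL_STATUSES = {"finished", "error", "ignored"}


def wait_for_terminal_status(
    fetch_status: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until a final status is returned or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"notification still '{status}' after {timeout_s} seconds")
        sleep(interval_s)


# Simulated status sequence standing in for a real lookup.
simulated = iter(["new", "running", "running", "finished"])
print(wait_for_terminal_status(lambda: next(simulated), sleep=lambda _s: None))
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.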

Viewing Recovery Notifications

Navigate to Project → Compute → Instance HA → Notifications. Each notification shows the affected host, failure type, and current recovery status.
Filter by Status: error to quickly identify recoveries that require follow-up.
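
The same filter can be applied offline if you keep your own record of notifications. The sketch below assumes each record is a dictionary with `host`, `failure_type`, and `status` keys; those field names and the sample values are hypothetical and simply mirror the columns described above.

```python
from typing import Dict, List


def failed_recoveries(notifications: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the notifications whose recovery ended in error."""
    return [n for n in notifications if n.get("status") == "error"]


# Hypothetical records for illustration.
records = [
    {"host": "host-a", "failure_type": "host failure", "status": "finished"},
    {"host": "host-d", "failure_type": "host failure", "status": "error"},
    {"host": "host-e", "failure_type": "host failure", "status": "running"},
]

for record in failed_recoveries(records):
    print(record["host"], "needs follow-up")
```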

Next Steps

  • Monitoring Status: Monitor live and historical recovery events in detail.
  • Troubleshooting: Resolve stuck or failed recovery workflows.
  • Protection Segments: Review segment configuration and verify your instance is enrolled.
  • Instance HA Admin Guide: Configure recovery methods and reserved hosts for your segments.