Overview
A recovery workflow is the ordered sequence of actions the Instance HA engine takes after a host failure notification is received. The workflow covers instance evacuation, restart on a healthy host, and post-recovery status reporting. Understanding the workflow helps you interpret recovery notifications and set appropriate expectations for recovery time.
Prerequisites
- Instance HA protection enabled on your instances
- At least one failover segment configured with registered hosts
Recovery Workflow Stages
Recovery Methods in Detail
auto — Evacuate to Any Host
The auto method selects the healthiest available host in the segment based on current vCPU and memory availability. Instances are distributed across multiple target hosts if no single host has sufficient capacity for all evacuees.
Characteristics:
- No pre-reserved capacity required
- Recovery succeeds as long as aggregate free capacity in the segment is sufficient
- Most flexible option for mixed workloads
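The placement behaviour described above can be illustrated with a minimal greedy sketch. It assumes that "healthiest" means the host with the most free memory (ties broken by free vCPUs) and that the largest instances are placed first. The Host and Instance classes and the place_auto function are hypothetical illustrations, not the Instance HA engine's actual API or algorithm.

```python
from dataclasses import dataclass


# Hypothetical models for illustration only; not the engine's real data types.
@dataclass
class Host:
    name: str
    free_vcpus: int
    free_memory_mb: int


@dataclass
class Instance:
    name: str
    vcpus: int
    memory_mb: int


def fits(host: Host, inst: Instance) -> bool:
    """True if the host has enough free vCPUs and memory for the instance."""
    return host.free_vcpus >= inst.vcpus and host.free_memory_mb >= inst.memory_mb


def place_auto(instances: list[Instance], hosts: list[Host]) -> dict[str, str]:
    """Greedily assign each evacuee to the host with the most free capacity.

    Returns {instance name: target host name}. Raises RuntimeError when no
    host in the segment can fit an instance. Host capacity is updated in place.
    """
    placement: dict[str, str] = {}
    # Assumption: place the largest instances first so they are less likely to be stranded.
    for inst in sorted(instances, key=lambda i: (i.memory_mb, i.vcpus), reverse=True):
        candidates = [h for h in hosts if fits(h, inst)]
        if not candidates:
            raise RuntimeError(f"no host in the segment can fit {inst.name}")
        # Assumption: "healthiest" = most free memory, then most free vCPUs.
        target = max(candidates, key=lambda h: (h.free_memory_mb, h.free_vcpus))
        target.free_vcpus -= inst.vcpus
        target.free_memory_mb -= inst.memory_mb
        placement[inst.name] = target.name
    return placement


hosts = [Host("compute-2", 8, 16384), Host("compute-3", 4, 8192)]
evacuees = [Instance("web-1", 4, 8192), Instance("web-2", 4, 8192), Instance("db-1", 2, 4096)]
print(place_auto(evacuees, hosts))
# {'web-1': 'compute-2', 'web-2': 'compute-2', 'db-1': 'compute-3'}
```

In this example no single host can take all three evacuees, so they end up spread across two hosts, and recovery only fails when the segment's aggregate free capacity runs out.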
reserved_host — Dedicated Standby
One or more hosts in the segment are designated as reserved standby nodes. These hosts remain idle until a failover event occurs, ensuring guaranteed capacity for recovery.
Characteristics:
- Guaranteed recovery capacity regardless of current cluster load
- Reserved hosts do not accept regular instance scheduling
- Higher infrastructure cost (idle nodes consume resources)
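The two rules above (reserved hosts are excluded from regular scheduling, and recovery targets only reserved hosts) can be sketched as follows. The reserved flag, the helper names, and the "first reserved host that fits" selection order are assumptions made for illustration; they do not describe how the engine actually chooses among several reserved hosts.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical host model; "reserved" marks a designated standby node.
@dataclass
class Host:
    name: str
    reserved: bool
    free_vcpus: int
    free_memory_mb: int


def regular_scheduling_pool(hosts: list[Host]) -> list[Host]:
    """Hosts eligible for normal (non-recovery) scheduling: reserved hosts are excluded."""
    return [h for h in hosts if not h.reserved]


def pick_reserved_target(hosts: list[Host], vcpus: int, memory_mb: int) -> Optional[str]:
    """Claim capacity on the first reserved host that can fit the evacuee.

    Returns the host name, or None if no reserved host has room.
    Non-reserved hosts are never considered under this method.
    """
    for h in hosts:
        if h.reserved and h.free_vcpus >= vcpus and h.free_memory_mb >= memory_mb:
            h.free_vcpus -= vcpus
            h.free_memory_mb -= memory_mb
            return h.name
    return None


segment = [
    Host("compute-1", reserved=False, free_vcpus=16, free_memory_mb=32768),
    Host("standby-1", reserved=True, free_vcpus=16, free_memory_mb=32768),
]
print([h.name for h in regular_scheduling_pool(segment)])  # ['compute-1']
print(pick_reserved_target(segment, vcpus=4, memory_mb=8192))  # standby-1
```

Because only reserved hosts are eligible, the guaranteed capacity is exactly what the standby nodes hold. Once they are full, this method has nowhere else to place instances.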
rh_priority — Prefer Reserved, Fall Back
The engine first attempts to recover instances onto reserved hosts. If the reserved hosts lack sufficient capacity, it falls back to the auto behaviour and selects any available host in the segment.
Characteristics:
- Balances guaranteed capacity for high-priority workloads with flexibility
- Works well in mixed segments that contain both critical and standard workloads
- Requires at least one reserved host in the segment
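The prefer-then-fall-back logic above, together with the requirement for at least one reserved host, can be sketched as a single selection function. The pick_rh_priority name and the "most free memory, then vCPUs" ranking are hypothetical assumptions for illustration, not the engine's documented scoring.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical host model for illustration only.
@dataclass
class Host:
    name: str
    reserved: bool
    free_vcpus: int
    free_memory_mb: int


def pick_rh_priority(hosts: list[Host], vcpus: int, memory_mb: int) -> Optional[str]:
    """Prefer a reserved host; fall back to any host in the segment if none fits."""
    if not any(h.reserved for h in hosts):
        raise ValueError("rh_priority requires at least one reserved host in the segment")
    fitting = [h for h in hosts if h.free_vcpus >= vcpus and h.free_memory_mb >= memory_mb]
    if not fitting:
        return None  # the segment has no capacity left for this instance
    reserved = [h for h in fitting if h.reserved]
    # Use reserved hosts when any can fit; otherwise fall back to the auto-style pool.
    pool = reserved or fitting
    # Assumption: rank candidates by most free memory, then most free vCPUs.
    target = max(pool, key=lambda h: (h.free_memory_mb, h.free_vcpus))
    target.free_vcpus -= vcpus
    target.free_memory_mb -= memory_mb
    return target.name


segment = [
    Host("compute-1", reserved=False, free_vcpus=16, free_memory_mb=32768),
    Host("standby-1", reserved=True, free_vcpus=4, free_memory_mb=8192),
]
print(pick_rh_priority(segment, 4, 8192))  # standby-1 (reserved host first)
print(pick_rh_priority(segment, 4, 8192))  # compute-1 (standby-1 is now full)
```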
Instance State During Recovery
| Phase | Instance Status | Description |
|---|---|---|
| Normal operation | ACTIVE | Instance running on original host |
| Fault detected | UNKNOWN | Host unreachable; compute service cannot confirm instance state |
| Evacuation in progress | MIGRATING | Instance being moved to target host |
| Restarting | BUILD | Instance starting up on target host |
| Recovery complete | ACTIVE | Instance fully operational on new host |
| Recovery failed | ERROR | Manual intervention required |
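One way to read the table is as a sequence of phases, each with the instance status you would see during it. The sketch below encodes that reading. The phase labels (normal, fault_detected, and so on) are hypothetical names. The set of allowed transitions is also an assumption, in particular that any in-flight recovery phase can end in ERROR. It is a reading aid, not a description of the engine's internal state machine.

```python
# Hypothetical phase labels -> instance status shown during that phase (from the table).
PHASE_STATUS = {
    "normal": "ACTIVE",
    "fault_detected": "UNKNOWN",
    "evacuating": "MIGRATING",
    "restarting": "BUILD",
    "recovered": "ACTIVE",
    "failed": "ERROR",
}

# Assumed forward-only transitions; any in-flight recovery phase may fail.
TRANSITIONS = {
    "normal": {"fault_detected"},
    "fault_detected": {"evacuating", "failed"},
    "evacuating": {"restarting", "failed"},
    "restarting": {"recovered", "failed"},
    "recovered": set(),
    "failed": set(),
}


def replay(phases: list[str]) -> list[str]:
    """Validate a sequence of phases and return the status seen at each step."""
    unknown = [p for p in phases if p not in PHASE_STATUS]
    if unknown:
        raise ValueError(f"unknown phases: {unknown}")
    for current, nxt in zip(phases, phases[1:]):
        if nxt not in TRANSITIONS[current]:
            raise ValueError(f"unexpected transition {current} -> {nxt}")
    return [PHASE_STATUS[p] for p in phases]


print(replay(["normal", "fault_detected", "evacuating", "restarting", "recovered"]))
# ['ACTIVE', 'UNKNOWN', 'MIGRATING', 'BUILD', 'ACTIVE']
```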
Recovery Time Expectations
Recovery time depends on several factors:
| Factor | Typical Impact |
|---|---|
| Host monitor detection timeout | 30–120 seconds to declare host unreachable |
| Instance count on failed host | Each instance adds 30–120 seconds to total recovery time |
| Instance disk size (shared storage) | Minimal — shared storage volumes are reattached, not copied |
| Target host boot overhead | Constant per instance — determined by instance flavor and image |
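The figures in the table can be combined into a rough best-case and worst-case estimate. This sketch assumes that instances are recovered one after another, so the per-instance time scales linearly with instance count. It also treats boot overhead as a hypothetical extra constant per instance (boot_overhead_s, default 0), because the table does not say whether the 30–120 second per-instance figure already includes it. Disk size is left out because shared-storage volumes are reattached, not copied.

```python
def recovery_window(instance_count: int,
                    detection_s: tuple[int, int] = (30, 120),
                    per_instance_s: tuple[int, int] = (30, 120),
                    boot_overhead_s: int = 0) -> tuple[int, int]:
    """Return a (best, worst) recovery time estimate in seconds.

    Assumes sequential per-instance recovery. boot_overhead_s is a hypothetical
    per-instance constant for flavor/image boot time (0 = already included).
    """
    if instance_count < 0:
        raise ValueError("instance_count must be non-negative")
    best = detection_s[0] + instance_count * (per_instance_s[0] + boot_overhead_s)
    worst = detection_s[1] + instance_count * (per_instance_s[1] + boot_overhead_s)
    return best, worst


best, worst = recovery_window(10)
print(f"{best / 60:.1f}-{worst / 60:.1f} minutes")  # 5.5-22.0 minutes
```

Under these assumptions, a host running 10 protected instances takes roughly 5.5 to 22 minutes to recover fully. The instance count dominates the total once it grows beyond a few instances.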
Notification Status Reference
Every recovery event creates a notification record. The record's status field tracks its progress through the workflow.
| Status | Meaning |
|---|---|
| new | Fault notification received; recovery not yet started |
| running | Recovery workflow in progress |
| finished | All instances recovered successfully |
| error | Recovery failed for one or more instances |
| ignored | Notification was deduplicated or the segment was disabled |
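If you poll notifications from a script, it helps to know which status values are final. The sketch below models the table as an enum and derives the final status from per-instance results, following the finished and error rows. NotificationStatus, is_terminal, and final_status are hypothetical names. Treating ignored as terminal is an assumption, based on the fact that a deduplicated notification or a disabled segment triggers no further work.

```python
from enum import Enum


# Hypothetical model of the status values listed in the table.
class NotificationStatus(str, Enum):
    NEW = "new"            # received; recovery not started
    RUNNING = "running"    # workflow in progress
    FINISHED = "finished"  # every instance recovered
    ERROR = "error"        # at least one instance failed
    IGNORED = "ignored"    # deduplicated, or segment disabled


# Assumption: no further progress is expected once a notification reaches one of these.
TERMINAL = {NotificationStatus.FINISHED, NotificationStatus.ERROR, NotificationStatus.IGNORED}


def is_terminal(status: str) -> bool:
    """Raises ValueError for a status string not listed in the table."""
    return NotificationStatus(status) in TERMINAL


def final_status(instance_recovered: dict[str, bool]) -> NotificationStatus:
    """finished only if every instance recovered; otherwise error."""
    if all(instance_recovered.values()):
        return NotificationStatus.FINISHED
    return NotificationStatus.ERROR


print(is_terminal("running"))  # False
print(final_status({"web-1": True, "db-1": False}).value)  # error
```

A single failed instance is enough to mark the whole notification as error, which is why an error status does not necessarily mean that every instance on the host failed to recover.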
Viewing Recovery Notifications
Navigate to Project → Compute → Instance HA → Notifications. Each notification
shows the affected host, failure type, and current recovery status.
Next Steps
- Monitoring Status: Monitor live and historical recovery events in detail.
- Troubleshooting: Resolve stuck or failed recovery workflows.
- Protection Segments: Review segment configuration and verify your instance is enrolled.
- Instance HA Admin Guide: Configure recovery methods and reserved hosts for your segments.