Overview

A recovery workflow is the ordered sequence of actions the Instance HA engine takes after a host failure notification is received. The workflow covers instance evacuation, restart on a healthy host, and post-recovery status reporting. The Dashboard provides real-time tracking of each VM evacuation through the Recovery Progress tab and a consolidated VM Moves page.

Prerequisites

  • Instance HA protection enabled on your instances
  • At least one failover segment configured with registered hosts

Recovery Workflow Stages


Recovery Progress — Real-Time Tracking

While a recovery is in progress, the Recovery Progress tab on the notification detail page tracks each individual VM evacuation in real time.

1. Open the notification detail

Navigate to Instance HA > Notifications. Click a notification UUID to open the detail page.

2. View the Recovery Progress tab

Click the Recovery Progress tab. This tab shows:

Summary card at the top:

| Field | Description |
| --- | --- |
| Notification Status | Current status as a colored tag |
| Total VMs | Total number of VMs being evacuated |
| Succeeded | Count of successfully recovered VMs (green) |
| Failed | Count of failed evacuations (red, if any) |
| Progress | Circular progress indicator showing completion |

VM Evacuations table below the summary:

| Column | Description |
| --- | --- |
| VM Name | Instance name (falls back to UUID if no name) |
| Source Host | The failed compute host |
| Destination Host | Target recovery host (“Pending” if not yet assigned) |
| Type | Evacuation type (typically evacuation) |
| Status | Evacuation status with icon |
| Start Time | When the evacuation started |
| End Time | When the evacuation completed |
| Message | Error message if the evacuation failed (red text) |

VM evacuation status values:

| Status | Color | Icon | Meaning |
| --- | --- | --- | --- |
| Pending | Grey | Clock | Evacuation queued, not yet started |
| Running | Blue | Loading spinner | Evacuation in progress |
| Succeeded | Green | Check circle | VM successfully recovered |
| Failed | Red | Close circle | Evacuation failed; manual intervention needed |

When a notification is in Running status, the Recovery Progress tab auto-refreshes every 5 seconds and shows an “Auto-refreshing every 5s” indicator, so you can watch evacuations complete in real time.
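
The summary card values can be derived from the per-VM statuses in the evacuations table. The following is a minimal sketch of that arithmetic, not the Dashboard's implementation: the record layout and the `summarize` helper are hypothetical, and it assumes that both Succeeded and Failed evacuations count as completed for the progress figure.

```python
from collections import Counter

# Hypothetical evacuation records; the field names are assumptions,
# not the Dashboard's actual data model.
evacuations = [
    {"vm": "web-01", "status": "Succeeded"},
    {"vm": "web-02", "status": "Running"},
    {"vm": "db-01", "status": "Failed"},
    {"vm": "cache-01", "status": "Pending"},
]


def summarize(evacuations):
    """Derive summary-card style counts from per-VM evacuation statuses."""
    counts = Counter(e["status"] for e in evacuations)
    total = len(evacuations)
    # Assumption: an evacuation is "complete" once it has succeeded or failed.
    done = counts["Succeeded"] + counts["Failed"]
    percent = round(100 * done / total) if total else 0
    return {
        "total": total,
        "succeeded": counts["Succeeded"],
        "failed": counts["Failed"],
        "progress_percent": percent,
    }


summary = summarize(evacuations)
print(summary)
```

With the sample records above, two of four evacuations have finished, so the progress figure is 50 percent.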

VM Moves — Consolidated View

The VM Moves page provides a single view of all VM evacuations across all notifications, making it easy to review recovery history.
1. Navigate to VM Moves

Navigate to Instance HA > VM Moves in the sidebar.

2. Review the VM moves list

The page displays all VM evacuations from recent notifications (up to the last 50 notifications), sorted by start time.

| Column | Description |
| --- | --- |
| VM Name | Instance name (falls back to UUID) |
| Instance ID | VM UUID (copyable, truncated display) |
| Notification | Parent notification UUID (copyable, truncated display) |
| Source Host | The failed compute host |
| Destination Host | Target recovery host, or - if pending |
| Type | Evacuation type (typically evacuation) |
| Status | Colored tag with icon (Succeeded/Failed/Running/Pending) |
| Start Time | When the evacuation started (default sort, descending) |
| End Time | When the evacuation completed, or - |
| Message | Error message if failed (red text) |

Use the Refresh button to reload the latest data.
The VM Moves page is read-only — it provides a consolidated view for monitoring and auditing. No actions are available on individual VM moves.
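
The consolidated view amounts to flattening the VM moves of the most recent notifications and sorting them by start time, newest first. The sketch below illustrates that idea with plain dictionaries; the field names (`uuid`, `created_at`, `vm_moves`, `start_time`) and the `consolidate_moves` helper are hypothetical and do not describe the actual API.

```python
MAX_NOTIFICATIONS = 50  # the page covers up to the last 50 notifications


def consolidate_moves(notifications):
    """Flatten VM moves from the most recent notifications, newest move first."""
    recent = sorted(notifications, key=lambda n: n["created_at"], reverse=True)
    moves = [
        {**move, "notification": n["uuid"]}  # keep a link to the parent notification
        for n in recent[:MAX_NOTIFICATIONS]
        for move in n["vm_moves"]
    ]
    return sorted(moves, key=lambda m: m["start_time"], reverse=True)


# Hypothetical sample data; ISO 8601 strings in the same format sort chronologically.
notifications = [
    {"uuid": "notif-a", "created_at": "2024-05-01T10:00:00Z", "vm_moves": [
        {"vm": "web-01", "start_time": "2024-05-01T10:00:20Z", "status": "Succeeded"},
        {"vm": "db-01", "start_time": "2024-05-01T10:01:05Z", "status": "Failed"},
    ]},
    {"uuid": "notif-b", "created_at": "2024-05-02T08:30:00Z", "vm_moves": [
        {"vm": "cache-01", "start_time": "2024-05-02T08:30:15Z", "status": "Running"},
    ]},
]

moves = consolidate_moves(notifications)
for m in moves:
    print(m["start_time"], m["vm"], m["notification"], m["status"])
```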

Recovery Methods in Detail

auto — Evacuate to Any Host

The auto method selects the healthiest available host in the segment based on current vCPU and memory availability. Instances are distributed across multiple target hosts if no single host has sufficient capacity for all evacuees.

Characteristics:
  • No pre-reserved capacity required
  • Recovery succeeds as long as aggregate free capacity in the segment is sufficient
  • Most flexible option for mixed workloads
Risk: Recovery may fail if all remaining hosts are near capacity when the fault occurs. Maintain a minimum headroom of 20-30% unused capacity across the segment.
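
To make the capacity reasoning concrete, here is a minimal sketch of a greedy placement that spreads evacuees across hosts by free vCPU and memory. It is an illustration under stated assumptions, not the engine's actual scheduler: the `Host` and `Instance` types, the `place` function, and the "most free RAM wins" rule are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass
class Host:
    name: str
    free_vcpus: int
    free_ram_mb: int


@dataclass
class Instance:
    name: str
    vcpus: int
    ram_mb: int


def place(instances, hosts):
    """Greedy placement: each instance goes to the fitting host with the most free RAM."""
    hosts = [replace(h) for h in hosts]  # work on copies, leave the caller's data intact
    placement = {}
    # Place the largest instances first so they are not crowded out by small ones.
    for inst in sorted(instances, key=lambda i: i.ram_mb, reverse=True):
        candidates = [
            h for h in hosts
            if h.free_vcpus >= inst.vcpus and h.free_ram_mb >= inst.ram_mb
        ]
        if not candidates:
            placement[inst.name] = None  # no capacity: this evacuation would fail
            continue
        target = max(candidates, key=lambda h: (h.free_ram_mb, h.free_vcpus))
        target.free_vcpus -= inst.vcpus
        target.free_ram_mb -= inst.ram_mb
        placement[inst.name] = target.name
    return placement


hosts = [Host("compute-02", 8, 16384), Host("compute-03", 4, 6144)]
instances = [
    Instance("db-01", 4, 8192),
    Instance("web-01", 2, 4096),
    Instance("web-02", 2, 4096),
    Instance("big-01", 8, 32768),
]
placement = place(instances, hosts)
print(placement)
```

In this example `big-01` cannot be placed because no remaining host has 32 GB free, which is the failure mode the headroom recommendation is meant to prevent.
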

Similar to auto, but uses priority-based host selection. The engine evaluates hosts based on configured priority attributes and selects the highest-priority available host for each evacuation.

Characteristics:
  • Allows administrators to influence target host selection
  • Still best-effort — no guaranteed standby capacity
  • Useful when certain hosts are preferred targets

One or more hosts in the segment are designated as reserved standby nodes. These hosts remain idle until a failover event occurs, ensuring guaranteed capacity for recovery.

Characteristics:
  • Guaranteed recovery capacity regardless of current cluster load
  • Reserved hosts do not accept regular instance scheduling
  • Higher infrastructure cost (idle nodes consume resources)
Best for: Mission-critical applications, financial systems, and workloads with strict RTO requirements.

The engine attempts recovery to reserved hosts first. If reserved hosts are full, it falls back to the auto behavior and selects any available host in the segment.

Characteristics:
  • Balances guaranteed capacity for high-priority workloads with flexibility
  • Works well in mixed segments that contain both critical and standard workloads
  • Requires at least one reserved host in the segment
Best for: Environments with heterogeneous workloads where some instances need guaranteed failover and others can tolerate best-effort recovery.
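
The reserved-first behavior with fallback can be sketched as a two-pool lookup. This is a simplified illustration, not the engine's logic: it considers free memory only, and the `pick_target` function and the host names are hypothetical.

```python
def pick_target(instance_ram_mb, reserved_hosts, regular_hosts):
    """Try reserved hosts first; fall back to any regular host with enough free RAM.

    Both pools map a host name to its free RAM in MB (a deliberate simplification).
    Returns (pool_name, host_name), or (None, None) if nothing fits.
    """
    for pool_name, pool in (("reserved", reserved_hosts), ("auto", regular_hosts)):
        fitting = {name: free for name, free in pool.items() if free >= instance_ram_mb}
        if fitting:
            # Within a pool, prefer the host with the most free RAM.
            return pool_name, max(fitting, key=fitting.get)
    return None, None


reserved = {"standby-01": 4096}
regular = {"compute-02": 16384, "compute-03": 8192}

small = pick_target(2048, reserved, regular)   # fits on the reserved host
large = pick_target(8192, reserved, regular)   # reserved host too small, falls back
huge = pick_target(65536, reserved, regular)   # fits nowhere
print(small, large, huge)
```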

Instance State During Recovery

| Phase | Instance Status | Description |
| --- | --- | --- |
| Normal operation | ACTIVE | Instance running on original host |
| Fault detected | UNKNOWN | Host unreachable; compute service cannot confirm instance state |
| Evacuation in progress | MIGRATING | Instance being moved to target host |
| Restarting | BUILD | Instance starting up on target host |
| Recovery complete | ACTIVE | Instance fully operational on new host |
| Recovery failed | ERROR | Manual intervention required |

Instances in ERROR or SHUTOFF state at the time of the host failure may not be automatically recovered, depending on your administrator’s configuration.

Notification Status Reference

Every recovery event creates a notification record. The notification status field tracks progress through the workflow.

| Status | Color | Meaning |
| --- | --- | --- |
| new | Blue | Fault notification received; recovery not yet started |
| running | Orange | Recovery workflow in progress |
| finished | Green | All instances recovered successfully |
| error | Red | Recovery encountered errors |
| failed | Red | Recovery failed completely |
| ignored | Grey | Notification was de-duplicated or the segment was disabled |
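
A script that waits for a recovery to finish can treat these statuses as a small state classification. The sketch below polls until a notification leaves the `new` and `running` states. It assumes that `finished`, `error`, `failed`, and `ignored` are all final, which the table suggests but does not state outright; `wait_for_notification` and the `fetch_status` callable are hypothetical stand-ins for whatever client you use.

```python
import time

# Assumption: these statuses do not change again once reached.
TERMINAL_STATUSES = {"finished", "error", "failed", "ignored"}
NEEDS_ATTENTION = {"error", "failed"}


def wait_for_notification(fetch_status, interval_s=5.0, timeout_s=600.0, sleep=time.sleep):
    """Poll fetch_status() until a terminal status is returned or the timeout elapses."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"notification still '{status}' after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s


# Simulated status sequence standing in for real API calls.
statuses = iter(["new", "running", "running", "finished"])
final = wait_for_notification(lambda: next(statuses), sleep=lambda s: None)
print(final, "needs attention:", final in NEEDS_ATTENTION)
```

The 5-second default interval mirrors the Dashboard's auto-refresh cadence; the `sleep` parameter is injectable so the loop can be exercised without real delays.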

Recovery Time Expectations

Recovery time depends on several factors:

| Factor | Typical Impact |
| --- | --- |
| Host monitor detection timeout | 30-120 seconds to declare host unreachable |
| Instance count on failed host | Each instance adds 30-120 seconds to total recovery time |
| Instance disk size (shared storage) | Minimal: shared storage volumes are reattached, not copied |
| Target host boot overhead | Constant per instance; determined by instance flavor and image |

Use shared storage (Xloud Distributed Storage) for all protected instances. Instances backed by local ephemeral disk cannot be evacuated and will be lost on host failure.
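
The ranges in the table can be combined into a rough recovery-time estimate. The sketch below simply adds the detection timeout to the per-instance cost multiplied by the instance count; the `estimate_recovery_seconds` helper is hypothetical, and actual times depend on your environment and configuration.

```python
def estimate_recovery_seconds(instance_count, detection=(30, 120), per_instance=(30, 120)):
    """Return a (best_case, worst_case) recovery-time estimate in seconds.

    Uses the typical ranges from the table: detection timeout plus a
    per-instance cost for every instance on the failed host.
    """
    if instance_count < 0:
        raise ValueError("instance_count must be non-negative")
    best = detection[0] + instance_count * per_instance[0]
    worst = detection[1] + instance_count * per_instance[1]
    return best, worst


best, worst = estimate_recovery_seconds(10)
print(f"{best / 60:.1f} to {worst / 60:.1f} minutes")
```

For a failed host carrying 10 instances this gives 330 to 1,320 seconds, or roughly 5.5 to 22 minutes.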

Next Steps

Monitoring Status

View notifications, hosts, and VM moves in the Dashboard

Troubleshooting

Resolve stuck or failed recovery workflows

Protection Segments

Create segments and manage host registrations

Instance HA Admin Guide

Configure recovery policies, monitors, and engine settings