Instance High Availability

Protect your workloads from compute host failures with automatic detection and recovery. Xloud Instance HA continuously monitors compute nodes and instances, triggering evacuation and restart workflows the moment a fault is detected — without manual intervention.

XAVS — Advanced Virtualization Solutions

Product details on xloud.tech

User Guide

Understand protection segments, instance protection policies, and how to monitor recovery workflows for your running workloads.

Admin Guide

Configure failover segments, host and instance monitors, notification drivers, and integrate Instance HA with your compute cluster.

CLI Reference

Complete command reference for managing failover segments, hosts, and recovery notifications using the openstack CLI.

Compute Service

Xloud Compute provides the hypervisor layer that Instance HA monitors and manages during host failover events.

Key Capabilities

Host Failure Detection

IPMI and SSH-based monitors detect unreachable hosts in seconds and immediately trigger evacuation of all protected instances.

Automatic Instance Recovery

Failed instances are automatically restarted on healthy hosts within the same protection segment, respecting affinity rules.

Reserved Host Failover

Designate standby compute hosts that remain idle until a failover event occurs — guaranteeing resource availability for recovery.

Protection Segments

Group hosts and instances into logical fault domains. Each segment has its own recovery policy, monitors, and notification targets.

Notification Drivers

Integrate with IPMI, SSH, and custom notification sources to receive precise fault signals from infrastructure monitoring tools.

Audit Trail

Every recovery event is logged with timestamps, affected instances, and resolution outcomes — fully queryable via the Dashboard and CLI.

How It Works

Platform Resilience

Xloud-Developed — These resilience capabilities are developed by Xloud and ship with XAVS.

VM High Availability

Automatic instance restart on host failure. Configurable per-instance priority. Failover segments for per-group recovery policies. Requires XAVS.

Power Recovery Automation

9-phase automated recovery playbook. Target recovery time: 7-13 minutes. Sequential service startup with health gates between each phase. Requires XAVS.

Container Self-Healing

Three-tier autoheal daemon with dependency-aware restart ordering. Circuit breaker pattern prevents restart loops. Exponential backoff. Requires XAVS.

Proactive Monitoring

Pre-configured alert rules across 13 groups covering storage, database, message queue, compute, networking, containers, APIs, system resources, disk, memory, security, and capacity. Predictive alerts for capacity forecasting. Requires XAVS.

Network Resilience

L3 high availability and DHCP high availability with sub-3-second failover. Automatic ARP gratuitous announcements for fast VIP convergence. Requires XAVS.

Rolling Upgrades with Rollback

Per-service container upgrades with 2-10 second swap time. Canary deployment (first node only). Image tag rollback mechanism. Previous images cached locally. Requires XAVS.

Related Services

Xloud Compute

The hypervisor layer monitored and managed by Instance HA

Resource Optimizer

Rebalances workloads after recovery to restore cluster efficiency

Xloud Block Storage

Persistent volumes that survive host failover when using shared storage