Overview

The Instance HA engine processes fault notifications and orchestrates the recovery workflow. Its timing and behaviour parameters determine how quickly recovery begins, how many retries are attempted, and how edge-case scenarios (instances in ERROR state, short-lived faults) are handled. This page documents all key configuration parameters and their production recommendations.
Engine configuration changes require a service restart. The engine will not process new notifications during the restart window. Schedule configuration changes during low-risk periods.

Configuration File Location

/etc/xavs/instance-ha/instance-ha.conf: primary configuration file
Apply changes by restarting the engine container:
Restart Instance HA engine
docker restart masakari_engine

Core Parameters

[DEFAULT] Section

| Parameter | Default | Description |
|---|---|---|
| host | <hostname> | Service identifier used for distributed locking |
| long_rpc_timeout | 300 | Maximum seconds to wait for a Compute RPC call |
| wait_period_after_service_update | 180 | Seconds to wait after a host service update before triggering recovery; prevents false alarms from planned restarts |
| notification_service_endpoint | | External webhook endpoint for incoming notifications |

[host_failure] Section

| Parameter | Default | Description |
|---|---|---|
| host_failure_recovery_interval | 17 | Seconds between recovery retry attempts |
| ignore_lease_seconds | 0 | Seconds after host boot to suppress failure notifications |
| evacuate_all_instances | True | Evacuate all instances from the failed host, not just those with HA protection |

[instance_failure] Section

| Parameter | Default | Description |
|---|---|---|
| recover_ignoring_error_instances | False | Attempt recovery for instances already in ERROR state |
| recover_instance_failure_method | auto | Recovery method for instance-level faults |
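
The tables above can be cross-checked mechanically. The following is a minimal sketch, assuming the file is plain INI that Python's `configparser` can read. The `DOCUMENTED_DEFAULTS` mapping and the `effective_settings` helper are hypothetical names introduced for illustration and are not part of Instance HA. The helper overlays the values found in a config on the documented defaults and rejects values of the wrong type.

```python
import configparser

# Defaults copied from the parameter tables. host and
# notification_service_endpoint are omitted: neither has a fixed default.
DOCUMENTED_DEFAULTS = {
    "DEFAULT": {
        "long_rpc_timeout": 300,
        "wait_period_after_service_update": 180,
    },
    "host_failure": {
        "host_failure_recovery_interval": 17,
        "ignore_lease_seconds": 0,
        "evacuate_all_instances": True,
    },
    "instance_failure": {
        "recover_ignoring_error_instances": False,
        "recover_instance_failure_method": "auto",
    },
}


def effective_settings(conf_text):
    """Return the documented defaults overlaid with values from conf_text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    result = {}
    for section, defaults in DOCUMENTED_DEFAULTS.items():
        result[section] = {}
        # configparser always exposes DEFAULT, even when the file omits it.
        present = section == "DEFAULT" or parser.has_section(section)
        for key, default in defaults.items():
            if not present or key not in parser[section]:
                result[section][key] = default
            elif isinstance(default, bool):  # check bool before int
                result[section][key] = parser.getboolean(section, key)
            elif isinstance(default, int):
                value = parser.getint(section, key)
                if value < 0:
                    raise ValueError(f"{section}.{key} must be >= 0")
                result[section][key] = value
            else:
                result[section][key] = parser.get(section, key)
    return result


# Hypothetical override file that changes only two timing values.
SAMPLE = """
[DEFAULT]
wait_period_after_service_update = 300

[host_failure]
ignore_lease_seconds = 120
"""

print(effective_settings(SAMPLE)["host_failure"])
```

A non-numeric value for an integer parameter, or anything other than a recognised boolean for a True/False parameter, raises `ValueError`, so a typo can be caught before the file is saved.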

Example Production Configuration

Enable Host HA

In XDeploy, navigate to Configuration → Advance Features and toggle Enable Host HA to Yes. Click Save Configuration.
Database connection strings ([database]) and Xloud Identity credentials ([keystone_authtoken]) are auto-managed by XDeploy. Do not edit these sections manually — they are generated during deployment and kept in sync with the cluster identity service automatically.

Customize engine parameters (optional)

To tune timing or behaviour parameters beyond the defaults, open Advanced Configuration in XDeploy. In the Service Tree, select masakari and open (or create) instance-ha.conf. Edit the parameters in the Code Editor:
Engine parameters in XDeploy Advanced Configuration
[DEFAULT]
host = controller-01
long_rpc_timeout = 300
wait_period_after_service_update = 180

[host_failure]
host_failure_recovery_interval = 17
ignore_lease_seconds = 60
evacuate_all_instances = True

[instance_failure]
recover_ignoring_error_instances = False
recover_instance_failure_method = auto
Click Save Current File.

Apply changes

Navigate to Operations and run a reconfigure action to apply the updated engine configuration.
The engine restarts with the new parameters. Check the container logs to confirm that it started without configuration errors.

Timing Tuning Guidance

Reduce false positives from planned restarts

Increase wait_period_after_service_update and ignore_lease_seconds to prevent recovery from triggering during planned host reboots:
Recommended for environments with frequent planned maintenance
[DEFAULT]
wait_period_after_service_update = 300

[host_failure]
ignore_lease_seconds = 120
This adds up to 2 minutes of tolerance for hosts coming back online after a reboot before Instance HA declares them permanently failed.
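
To make the tolerance arithmetic concrete, the sketch below models `ignore_lease_seconds` as described in the [host_failure] table: a failure notification is suppressed if it arrives within that many seconds of the host booting. The function name and the timestamps are hypothetical. This illustrates the documented semantics and is not the engine's implementation.

```python
from datetime import datetime, timedelta


def is_suppressed(host_boot_time, notification_time, ignore_lease_seconds):
    """True if the notification falls inside the post-boot suppression window."""
    window = timedelta(seconds=ignore_lease_seconds)
    elapsed = notification_time - host_boot_time
    return timedelta(0) <= elapsed < window


boot = datetime(2024, 1, 1, 12, 0, 0)  # hypothetical boot timestamp
early = boot + timedelta(seconds=90)   # notification 90 s after boot
late = boot + timedelta(seconds=150)   # notification 150 s after boot

print(is_suppressed(boot, early, 120))  # True: inside the 120 s window
print(is_suppressed(boot, late, 120))   # False: outside the window
print(is_suppressed(boot, early, 0))    # False: the default of 0 suppresses nothing
```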
Reduce retry intervals for faster recovery at the cost of increased sensitivity to transient network partitions:
Faster recovery (higher false-positive risk)
[host_failure]
host_failure_recovery_interval = 10
ignore_lease_seconds = 30
Reducing intervals increases the risk of unnecessary evacuations during brief network interruptions. Use conservative values in shared or multi-tenant clusters.
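
The trade-off can be put in numbers. The sketch below assumes retries are spaced exactly `host_failure_recovery_interval` seconds apart and uses a purely hypothetical budget of 5 attempts, since this page does not specify the retry count. It lists when each retry would start relative to the first attempt.

```python
def retry_offsets(interval_seconds, attempts):
    """Seconds after the first attempt at which each retry would start."""
    return [interval_seconds * n for n in range(1, attempts)]


ATTEMPTS = 5  # hypothetical; not a documented Instance HA value

for interval in (17, 10):
    offsets = retry_offsets(interval, ATTEMPTS)
    print(f"interval={interval}s retries at {offsets}, last at {offsets[-1]}s")
```

Under these assumptions the final retry starts 68 seconds after the first attempt with the default interval of 17, and 40 seconds after it with an interval of 10.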
For environments where instances frequently enter ERROR state due to transient issues, enable recovery for error-state instances:
Recover ERROR-state instances
[instance_failure]
recover_ignoring_error_instances = True
This setting is disabled by default because attempting to evacuate an instance that is in ERROR due to a configuration issue (rather than a host failure) may repeatedly fail and generate noise in the notification log.

Verify Engine Configuration

View active configuration
docker exec masakari_engine \
  python3 -c "from masakari.conf import CONF; CONF(['--config-file', '/etc/masakari/masakari.conf']); print(CONF.long_rpc_timeout)"
Check engine logs for configuration errors
docker logs masakari_engine 2>&1 | grep -E "ERROR|WARNING|ConfigFileNotFound"
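
Where the log output needs further processing, the same filter can be applied in Python. This is only a sketch: `find_config_problems` is a hypothetical helper, and the sample lines are invented for illustration rather than taken from real engine output.

```python
import re

# Same pattern as the grep command above.
PROBLEM_PATTERN = re.compile(r"ERROR|WARNING|ConfigFileNotFound")


def find_config_problems(log_text):
    """Return the log lines that match the pattern, in their original order."""
    return [line for line in log_text.splitlines() if PROBLEM_PATTERN.search(line)]


# Invented sample lines, not real engine output.
SAMPLE_LOG = """\
INFO starting engine
WARNING option value out of expected range
INFO engine ready
ERROR failed to load configuration
"""

for line in find_config_problems(SAMPLE_LOG):
    print(line)
```

To run it against a live container, the text could be captured with `subprocess.run(["docker", "logs", "masakari_engine"], capture_output=True, text=True)` and the combined `stdout` and `stderr` passed to the helper.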

Validation

Navigate to Admin → Compute → Instance HA → Notifications after a test event. Verify that recovery workflow timing aligns with the configured parameters.

Next Steps

Recovery Methods

Configure how instances are evacuated after fault detection.

Security

Manage service credentials and RBAC policies for Instance HA.

Troubleshooting

Diagnose engine startup failures and notification processing issues.

Notification Drivers

Configure the notification driver that feeds fault events to the engine.