Engine Configuration

Overview

The Instance HA engine processes fault notifications and orchestrates the recovery workflow. Its timing and behaviour parameters determine how quickly recovery begins, how many retries are attempted, and how edge-case scenarios (instances in ERROR state, short-lived faults) are handled. This page documents all key configuration parameters and their production recommendations.

Engine configuration changes require a service restart. The engine will not process new notifications during the restart window. Schedule configuration changes during low-risk periods.

Configuration File Location

/etc/xavs/instance-ha/instance-ha.conf (primary configuration file)

Apply changes by restarting the engine container:

Restart Instance HA engine

docker restart masakari_engine

Core Parameters

[DEFAULT] Section

Parameter	Default	Description
`host`	`<hostname>`	Service identifier used for distributed locking
`long_rpc_timeout`	`300`	Max seconds to wait for a Compute RPC call
`wait_period_after_service_update`	`180`	Seconds to wait after a host service update before triggering recovery — prevents false alarms from planned restarts
`notification_service_endpoint`	—	External webhook endpoint for incoming notifications

[host_failure] Section

Parameter	Default	Description
`host_failure_recovery_interval`	`17`	Seconds between recovery retry attempts
`ignore_lease_seconds`	`0`	Seconds after host boot to suppress failure notifications
`evacuate_all_instances`	`True`	Evacuate all instances from failed host, not just those with HA protection

[instance_failure] Section

Parameter	Default	Description
`recover_ignoring_error_instances`	`False`	Attempt recovery for instances already in `ERROR` state
`recover_instance_failure_method`	`auto`	Recovery method for instance-level faults
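To see which settings in a given file depart from the defaults listed in the tables above, a small check along these lines may help. It is a minimal sketch, not part of Instance HA: the function name `find_overrides` and the `DOCUMENTED_DEFAULTS` mapping are illustrative, and it uses Python's standard `configparser` rather than the engine's own configuration loader, so values are compared as plain strings.

```python
import configparser

# Defaults copied from the parameter tables on this page.
# Keys are (section, option); values are compared as strings.
DOCUMENTED_DEFAULTS = {
    ("DEFAULT", "long_rpc_timeout"): "300",
    ("DEFAULT", "wait_period_after_service_update"): "180",
    ("host_failure", "host_failure_recovery_interval"): "17",
    ("host_failure", "ignore_lease_seconds"): "0",
    ("host_failure", "evacuate_all_instances"): "True",
    ("instance_failure", "recover_ignoring_error_instances"): "False",
    ("instance_failure", "recover_instance_failure_method"): "auto",
}


def find_overrides(conf_text):
    """Return {(section, option): value} for options that differ from the documented defaults."""
    # Treat [DEFAULT] as an ordinary section so it can be queried by name,
    # and disable interpolation so values containing '%' are read literally.
    parser = configparser.ConfigParser(default_section="__unused__", interpolation=None)
    parser.read_string(conf_text)
    overrides = {}
    for (section, option), default in DOCUMENTED_DEFAULTS.items():
        if parser.has_option(section, option):
            value = parser.get(section, option).strip()
            if value != default:
                overrides[(section, option)] = value
    return overrides


sample = """
[DEFAULT]
long_rpc_timeout = 300

[host_failure]
ignore_lease_seconds = 60
"""
print(find_overrides(sample))  # {('host_failure', 'ignore_lease_seconds'): '60'}
```

Because the comparison is textual, `true` and `True` would be reported as different; normalise the values first if that distinction does not matter in your environment.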

Example Production Configuration

XDeploy

Enable Host HA

In XDeploy, navigate to Configuration → Advance Features and toggle Enable Host HA to Yes. Click Save Configuration.

Database connection strings ([database]) and Xloud Identity credentials ([keystone_authtoken]) are auto-managed by XDeploy. Do not edit these sections manually — they are generated during deployment and kept in sync with the cluster identity service automatically.

Customize engine parameters (optional)

To tune timing or behaviour parameters beyond the defaults, open Advanced Configuration in XDeploy. In the Service Tree, select masakari and open (or create) instance-ha.conf. Edit the parameters in the Code Editor:

Engine parameters in XDeploy Advanced Configuration

[DEFAULT]
host = controller-01
long_rpc_timeout = 300
wait_period_after_service_update = 180

[host_failure]
host_failure_recovery_interval = 17
ignore_lease_seconds = 60
evacuate_all_instances = True

[instance_failure]
recover_ignoring_error_instances = False
recover_instance_failure_method = auto

Click Save Current File.

Apply changes

Navigate to Operations and run a reconfigure action to apply the updated engine configuration.

The engine restarts with the new parameters. Check the container logs to confirm that it started without configuration errors.

CLI

Edit the configuration file directly and restart the engine container:

/etc/xavs/instance-ha/instance-ha.conf

[DEFAULT]
host = controller-01
long_rpc_timeout = 300
wait_period_after_service_update = 180

[host_failure]
host_failure_recovery_interval = 17
ignore_lease_seconds = 60
evacuate_all_instances = True

[instance_failure]
recover_ignoring_error_instances = False
recover_instance_failure_method = auto

[database]
connection = mysql+pymysql://masakari:<database-password>@10.0.1.70/masakari

[keystone_authtoken]
auth_url = http://10.0.1.70:5000/v3
project_name = service
username = masakari
password = <service-account-password>

Restart the engine

docker restart masakari_engine

The [database] and [keystone_authtoken] sections are generated by xavs-ansible during deployment. Edit them only if you are managing the configuration entirely through CLI without XDeploy.
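Because the engine stops processing notifications while it restarts, it can be worth checking a hand-edited file for obvious type mistakes before restarting. The following is a minimal sketch under stated assumptions: `validate_engine_conf`, `INT_OPTIONS`, and `BOOL_OPTIONS` are illustrative names, the option types are inferred from the defaults documented on this page, and the check uses the standard `configparser` module rather than the engine's own validation.

```python
import configparser

# Option types inferred from the documented defaults (an assumption of this sketch).
INT_OPTIONS = {
    "DEFAULT": ["long_rpc_timeout", "wait_period_after_service_update"],
    "host_failure": ["host_failure_recovery_interval", "ignore_lease_seconds"],
}
BOOL_OPTIONS = {
    "host_failure": ["evacuate_all_instances"],
    "instance_failure": ["recover_ignoring_error_instances"],
}


def validate_engine_conf(conf_text):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    # interpolation=None keeps passwords containing '%' from raising errors.
    parser = configparser.ConfigParser(default_section="__unused__", interpolation=None)
    parser.read_string(conf_text)
    problems = []
    for section, options in INT_OPTIONS.items():
        for option in options:
            if not parser.has_option(section, option):
                continue  # unset options fall back to the engine defaults
            raw = parser.get(section, option)
            try:
                if int(raw) < 0:
                    problems.append(f"[{section}] {option} must not be negative: {raw!r}")
            except ValueError:
                problems.append(f"[{section}] {option} is not an integer: {raw!r}")
    for section, options in BOOL_OPTIONS.items():
        for option in options:
            if not parser.has_option(section, option):
                continue
            try:
                parser.getboolean(section, option)
            except ValueError:
                raw = parser.get(section, option)
                problems.append(f"[{section}] {option} is not a boolean: {raw!r}")
    return problems


bad = "[host_failure]\nignore_lease_seconds = sixty\nevacuate_all_instances = True\n"
for problem in validate_engine_conf(bad):
    print(problem)
```

To check the real file, read `/etc/xavs/instance-ha/instance-ha.conf` into a string and pass it to the function; restart the engine only when the returned list is empty.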

Timing Tuning Guidance

Reduce false positives from planned restarts

Increase wait_period_after_service_update and ignore_lease_seconds to prevent recovery from triggering during planned host reboots:

Recommended for environments with frequent planned maintenance

[DEFAULT]
wait_period_after_service_update = 300

[host_failure]
ignore_lease_seconds = 120

This adds up to 2 minutes of tolerance for hosts coming back online after a reboot before Instance HA declares them permanently failed.
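The effect of `ignore_lease_seconds` can be pictured as a grace window that starts at host boot. The sketch below only illustrates the behaviour described in the parameter table; the function name is hypothetical, and whether the engine treats the boundary second as inside or outside the window is an assumption here.

```python
def failure_notification_suppressed(seconds_since_boot, ignore_lease_seconds):
    """Return True while the host is still inside the post-boot grace window."""
    # Assumption: the window is half-open, so a notification arriving exactly
    # at ignore_lease_seconds is no longer suppressed.
    return seconds_since_boot < ignore_lease_seconds


# With ignore_lease_seconds = 120, a host that has been up for 45 seconds is
# still inside the window; one that has been up for 300 seconds is not.
for uptime in (45, 119, 120, 300):
    print(uptime, failure_notification_suppressed(uptime, ignore_lease_seconds=120))
```

With the default of `0` the window is empty, so no notification is suppressed on the basis of host uptime.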

Faster recovery for latency-sensitive workloads

Reduce retry intervals for faster recovery at the cost of increased sensitivity to transient network partitions:

Faster recovery (higher false-positive risk)

[host_failure]
host_failure_recovery_interval = 10
ignore_lease_seconds = 30

Reducing intervals increases the risk of unnecessary evacuations during brief network interruptions. Use conservative values in shared or multi-tenant clusters.
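To get a feel for the trade-off, it can help to lay out when successive recovery attempts would start for two interval values. This is a rough sketch: it assumes attempts are evenly spaced by `host_failure_recovery_interval` with the first at time zero, and the attempt count of 4 is an arbitrary illustration, not a documented engine setting.

```python
def retry_schedule(interval_seconds, attempts):
    """Return the offset in seconds at which each attempt would start, assuming fixed spacing."""
    return [n * interval_seconds for n in range(attempts)]


print(retry_schedule(17, 4))  # default interval: [0, 17, 34, 51]
print(retry_schedule(10, 4))  # reduced interval: [0, 10, 20, 30]
```

Under these assumptions the fourth attempt starts 21 seconds earlier with the reduced interval, which is also 21 seconds less time for a transient network partition to heal before that attempt runs.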

Enable recovery for instances in ERROR state

For environments where instances frequently enter ERROR state due to transient issues, enable recovery for error-state instances:

Recover ERROR-state instances

[instance_failure]
recover_ignoring_error_instances = True

This setting is disabled by default because attempting to evacuate an instance that is in ERROR due to a configuration issue (rather than a host failure) may repeatedly fail and generate noise in the notification log.
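The interaction between `recover_ignoring_error_instances` and `evacuate_all_instances` can be illustrated with a simple filter. This sketch models the behaviour as described in the parameter tables and is not the engine's implementation; the `Instance` class, its `ha_protected` field, and `select_for_recovery` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    state: str          # for example "ACTIVE" or "ERROR"
    ha_protected: bool  # hypothetical marker for instances with HA protection


def select_for_recovery(instances, evacuate_all_instances=True,
                        recover_ignoring_error_instances=False):
    """Return the names of instances that recovery would be attempted for."""
    selected = []
    for inst in instances:
        # When evacuate_all_instances is off, only HA-protected instances qualify.
        if not evacuate_all_instances and not inst.ha_protected:
            continue
        # ERROR-state instances are skipped unless recovery for them is enabled.
        if inst.state == "ERROR" and not recover_ignoring_error_instances:
            continue
        selected.append(inst.name)
    return selected


fleet = [
    Instance("web-1", "ACTIVE", ha_protected=True),
    Instance("db-1", "ERROR", ha_protected=True),
    Instance("batch-1", "ACTIVE", ha_protected=False),
]
print(select_for_recovery(fleet))  # ['web-1', 'batch-1']
print(select_for_recovery(fleet, recover_ignoring_error_instances=True))  # ['web-1', 'db-1', 'batch-1']
```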

Verify Engine Configuration

View active configuration

docker exec masakari_engine \
  python3 -c "from masakari.conf import CONF; CONF(['--config-file', '/etc/masakari/masakari.conf']); print(CONF.long_rpc_timeout)"

Check engine logs for configuration errors

docker logs masakari_engine 2>&1 | grep -E "ERROR|WARNING|ConfigFileNotFound"

Validation

Dashboard

Navigate to Instance-HA > Notifications (admin view) after a test event. Verify that recovery workflow timing aligns with the configured parameters.

CLI

Check engine service status

docker ps --filter name=masakari_engine

Confirm engine is processing notifications

docker logs --tail 50 masakari_engine

The engine is healthy when the logs show recent activity and no configuration-related error messages.

Next Steps

Recovery Methods

Configure how instances are evacuated after fault detection.

Security

Manage service credentials and RBAC policies for Instance HA.

Troubleshooting

Diagnose engine startup failures and notification processing issues.

Notification Drivers

Configure the notification driver that feeds fault events to the engine.
