> ## Documentation Index > Fetch the complete documentation index at: https://docs.xloud.tech/llms.txt > Use this file to discover all available pages before exploring further. # Engine Configuration > Tune the Xloud Instance HA recovery engine — configure detection timeouts, retry intervals, instance failure behaviour, and service endpoints. ## Overview The Instance HA engine processes fault notifications and orchestrates the recovery workflow. Its timing and behaviour parameters determine how quickly recovery begins, how many retries are attempted, and how edge-case scenarios (instances in ERROR state, short-lived faults) are handled. This page documents all key configuration parameters and their production recommendations. Engine configuration changes require a service restart. The engine will not process new notifications during the restart window. Schedule configuration changes during low-risk periods. *** ## Configuration File Location Apply changes by restarting the engine container: ```bash title="Restart Instance HA engine" theme={null} docker restart masakari_engine ``` *** ## Core Parameters ### DEFAULT Section | Parameter | Default | Description | | ---------------------------------- | ------------ | -------------------------------------------------------------------------------------------------------------------- | | `host` | `` | Service identifier used for distributed locking | | `long_rpc_timeout` | `300` | Max seconds to wait for a Compute RPC call | | `wait_period_after_service_update` | `180` | Seconds to wait after a host service update before triggering recovery — prevents false alarms from planned restarts | | `notification_service_endpoint` | — | External webhook endpoint for incoming notifications | ### \[host\_failure] Section | Parameter | Default | Description | | -------------------------------- | ------- | -------------------------------------------------------------------------- | | `host_failure_recovery_interval` | `17` | Seconds between recovery retry attempts | | `ignore_lease_seconds` | `0` | Seconds after host boot to suppress failure notifications | | `evacuate_all_instances` | `True` | Evacuate all instances from failed host, not just those with HA protection | ### \[instance\_failure] Section | Parameter | Default | Description | | ---------------------------------- | ------- | ------------------------------------------------------- | | `recover_ignoring_error_instances` | `False` | Attempt recovery for instances already in `ERROR` state | | `recover_instance_failure_method` | `auto` | Recovery method for instance-level faults | *** ## Example Production Configuration In XDeploy, navigate to **Configuration → Advance Features** and toggle **Enable Host HA** to **Yes**. Click **Save Configuration**. Database connection strings (`[database]`) and Xloud Identity credentials (`[keystone_authtoken]`) are **auto-managed** by XDeploy. Do not edit these sections manually -- they are generated during deployment and kept in sync with the cluster identity service automatically. To tune timing or behaviour parameters beyond the defaults, open **Advanced Configuration** in XDeploy. In the **Service Tree**, select **masakari** and open (or create) `instance-ha.conf`. Edit the parameters in the Code Editor: ```ini title="Engine parameters in XDeploy Advanced Configuration" theme={null} [DEFAULT] host = controller-01 long_rpc_timeout = 300 wait_period_after_service_update = 180 [host_failure] host_failure_recovery_interval = 17 ignore_lease_seconds = 60 evacuate_all_instances = True [instance_failure] recover_ignoring_error_instances = False recover_instance_failure_method = auto ``` Click **Save Current File**. Navigate to **Operations** and run a **reconfigure** action to apply the updated engine configuration. Engine restarts with the new parameters. Verify via container logs. Edit the configuration file directly and restart the engine container: ```ini title="/etc/xavs/instance-ha/instance-ha.conf" theme={null} [DEFAULT] host = controller-01 long_rpc_timeout = 300 wait_period_after_service_update = 180 [host_failure] host_failure_recovery_interval = 17 ignore_lease_seconds = 60 evacuate_all_instances = True [instance_failure] recover_ignoring_error_instances = False recover_instance_failure_method = auto [database] connection = mysql+pymysql://masakari:password@10.0.1.70/masakari [keystone_authtoken] auth_url = http://10.0.1.70:5000/v3 project_name = service username = masakari password = ``` ```bash title="Restart the engine" theme={null} docker restart masakari_engine ``` The `[database]` and `[keystone_authtoken]` sections are generated by xavs-ansible during deployment. Edit them only if you are managing the configuration entirely through CLI without XDeploy. *** ## Timing Tuning Guidance Increase `wait_period_after_service_update` and `ignore_lease_seconds` to prevent recovery from triggering during planned host reboots: ```ini title="Recommended for environments with frequent planned maintenance" theme={null} [DEFAULT] wait_period_after_service_update = 300 [host_failure] ignore_lease_seconds = 120 ``` This adds up to 2 minutes of tolerance for hosts coming back online after a reboot before Instance HA declares them permanently failed. Reduce retry intervals for faster recovery at the cost of increased sensitivity to transient network partitions: ```ini title="Faster recovery (higher false-positive risk)" theme={null} [host_failure] host_failure_recovery_interval = 10 ignore_lease_seconds = 30 ``` Reducing intervals increases the risk of unnecessary evacuations during brief network interruptions. Use conservative values in shared or multi-tenant clusters. For environments where instances frequently enter `ERROR` state due to transient issues, enable recovery for error-state instances: ```ini title="Recover ERROR-state instances" theme={null} [instance_failure] recover_ignoring_error_instances = True ``` This setting is disabled by default because attempting to evacuate an instance that is in `ERROR` due to a configuration issue (rather than a host failure) may repeatedly fail and generate noise in the notification log. *** ## Verify Engine Configuration ```bash title="View active configuration" theme={null} docker exec masakari_engine \ python3 -c "from masakari.conf import CONF; CONF(['--config-file', '/etc/masakari/masakari.conf']); print(CONF.long_rpc_timeout)" ``` ```bash title="Check engine logs for configuration errors" theme={null} docker logs masakari_engine | grep -E "ERROR|WARNING|ConfigFileNotFound" ``` *** ## Validation Navigate to **Instance-HA > Notifications (admin view)** after a test event. Verify that recovery workflow timing aligns with the configured parameters. ```bash title="Check engine service status" theme={null} docker ps --filter name=masakari_engine ``` ```bash title="Confirm engine is processing notifications" theme={null} docker logs masakari_engine | tail -50 ``` Engine shows active log output and no configuration-related error messages. *** ## Next Steps Configure how instances are evacuated after fault detection. Manage service credentials and RBAC policies for Instance HA. Diagnose engine startup failures and notification processing issues. Configure the notification driver that feeds fault events to the engine.