Overview
This guide covers the most common failure modes in the Resource Optimizer: audits that fail to complete, action plans that stall during execution, live migration errors from the Compute API, and data source connectivity issues. Each section includes log locations, diagnostic commands, and remediation steps.
Quick Diagnostic Reference
Check all Resource Optimizer container health
Check container status
Containers that should report (healthy): watcher_api, watcher_decision_engine, watcher_applier
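The container status check can also be scripted. The following is a minimal sketch, not the product's own tooling: it assumes the input is the text produced by `docker ps --format '{{.Names}}\t{{.Status}}'` and that Docker reports health as a `(healthy)` marker in the status column. The function name `unhealthy_containers` is hypothetical.

```python
# Container names taken from this guide.
EXPECTED = ("watcher_api", "watcher_decision_engine", "watcher_applier")


def unhealthy_containers(ps_output, expected=EXPECTED):
    """Return the expected containers that are missing or not (healthy).

    ps_output is assumed to hold one "name<TAB>status" pair per line.
    """
    status_by_name = {}
    for line in ps_output.splitlines():
        name, sep, status = line.partition("\t")
        if sep:  # skip lines that do not contain a tab separator
            status_by_name[name.strip()] = status.strip()
    # A container that is absent from the output counts as unhealthy.
    return [
        name for name in expected
        if "(healthy)" not in status_by_name.get(name, "")
    ]
```

An empty list means all three containers are running and healthy; any returned name is the container to inspect first.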
Check for recent errors in all containers
Audit state glossary
| State | Meaning |
|---|---|
| PENDING | Queued, waiting for Decision Engine |
| ONGOING | Decision Engine is running the strategy |
| SUCCEEDED | Audit complete; action plan generated |
| FAILED | Audit failed; check Decision Engine logs |
| CANCELLED | Manually cancelled by an operator |
Action plan state glossary
| State | Meaning |
|---|---|
| RECOMMENDED | Awaiting operator approval |
| PENDING | Approved, waiting for Applier |
| ONGOING | Applier is executing actions |
| SUCCEEDED | All actions completed successfully |
| FAILED | One or more actions failed |
| CANCELLED | Expired or manually cancelled |
Audit Failures
Audit Stuck in PENDING
The audit is queued but the Decision Engine has not picked it up.
Check Decision Engine is running
Possible causes:
- Decision Engine container is stopped or unhealthy
- RabbitMQ messaging connection is broken
- All Decision Engine workers are busy with another audit
Restart Decision Engine
Audit Fails with Strategy Error
Show audit details
Check Decision Engine logs for strategy errors
| Symptom | Cause | Fix |
|---|---|---|
| NoDataFound | Data source not configured or unreachable | Configure Prometheus or Telemetry (see Data Sources) |
| InsufficientData | Not enough metric history | Wait 2–4 hours for Telemetry to accumulate history |
| StrategyNotFound | Custom strategy not registered | Reinstall the strategy package and restart Decision Engine |
| NoCandidateFound | All hosts are above the utilization threshold | Adjust strategy parameters (see Strategy Configuration) |
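When the Decision Engine log is long, matching it against the symptoms in the table can be automated. The sketch below is illustrative only: it assumes the symptom names appear verbatim in the log text, and the names `STRATEGY_ERRORS` and `diagnose` are hypothetical.

```python
import re

# Symptom -> (cause, fix), copied from the table above.
STRATEGY_ERRORS = {
    "NoDataFound": (
        "Data source not configured or unreachable",
        "Configure Prometheus or Telemetry",
    ),
    "InsufficientData": (
        "Not enough metric history",
        "Wait 2-4 hours for Telemetry to accumulate history",
    ),
    "StrategyNotFound": (
        "Custom strategy not registered",
        "Reinstall the strategy package and restart Decision Engine",
    ),
    "NoCandidateFound": (
        "All hosts are above the utilization threshold",
        "Adjust strategy parameters",
    ),
}

# Match any known symptom as a whole word.
_PATTERN = re.compile(r"\b(" + "|".join(STRATEGY_ERRORS) + r")\b")


def diagnose(log_text):
    """Return (symptom, cause, fix) for each known symptom, in first-seen order."""
    seen = []
    for match in _PATTERN.finditer(log_text):
        symptom = match.group(1)
        if symptom not in seen:
            seen.append(symptom)
    return [(symptom, *STRATEGY_ERRORS[symptom]) for symptom in seen]
```

Feed it the captured Decision Engine log text; each returned tuple pairs a symptom with its likely cause and fix.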
Audit Succeeds but Generates Empty Action Plan
The audit completed successfully but no migrations were recommended. This is expected behaviour when:
- All hosts are within the target utilization range (no consolidation needed)
- All instances are already on their optimal host
- The cluster is fully balanced for the selected goal
Check current utilization
Action Plan Execution Failures
Action Plan Stuck in PENDING
The plan was approved but the Applier has not started execution.
Check Applier logs
Check action plan details
Possible causes:
- Applier container is stopped
- Plan has expired (exceeded action_plan_expiry)
- Taskflow workflow database is locked
Restart Applier
Live Migration Action Fails
The Applier attempted a migration but Xloud Compute rejected it.
Check action-level failure details
Check Applier logs for migration errors
Instance has no shared storage
CPU feature mismatch
Error:
MigrationPreCheckError: Guest requires CPU feature not present on destination.
Compute hosts have different CPU feature sets and no common baseline is configured.
Fix: Set a common CPU model in nova.conf on all compute hosts (see Compute Integration for details):
nova.conf (CPU compatibility)
Destination host has insufficient capacity
Error:
NoValidHost: No valid host was found.
The destination host identified during the audit no longer has sufficient vCPU or memory available (cluster state changed between audit and execution).
Fix: Run a new audit to generate a fresh plan reflecting current cluster state. Lower action_plan_expiry to prevent stale plans from executing:
watcher.conf
Source or destination host is disabled
Error:
HTTPBadRequest: Cannot live migrate to disabled host.
A host was disabled between audit completion and plan execution.
Fix: Re-enable the host or run a new audit with current host availability.
Re-enable a host
Data Source Issues
Prometheus Not Reachable
Strategies that require Prometheus (outlet_temperature, saving_energy) fail with NoDataFound.
Test Prometheus connectivity from Decision Engine
A reachable Prometheus returns "status": "success" with results.
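The connectivity test can be sketched in Python using the standard Prometheus HTTP API (`/api/v1/query`), which returns a JSON body with a `status` field and a `data.result` list. This is an illustration under assumptions, not the product's own check: the base URL is a placeholder you must supply, the query `up` is just a convenient default, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def query_succeeded(payload):
    """True if a parsed Prometheus query response reports success with results."""
    if payload.get("status") != "success":
        return False
    # An empty result list means Prometheus answered but has no matching series.
    return bool(payload.get("data", {}).get("result"))


def check_prometheus(base_url, query="up", timeout=5):
    """Run an instant query against base_url and validate the response.

    base_url is a placeholder, for example "http://prometheus.example:9090".
    Raises urllib.error.URLError if the endpoint is unreachable.
    """
    params = urllib.parse.urlencode({"query": query})
    url = base_url.rstrip("/") + "/api/v1/query?" + params
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return query_succeeded(json.load(response))
```

Run `check_prometheus` from the same network context as the Decision Engine, so that a connection error reflects what the strategy itself would see.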
Check Prometheus section in watcher.conf
Telemetry Metrics Missing
Strategies that require Telemetry (workload_stabilization, noisy_neighbor) fail with InsufficientData or generate no recommendations.
Check Telemetry collector is configured
Verify metrics exist in Telemetry
If no metrics are returned, check that the ceilometer compute agent is enabled.
Authentication Failures
401 Unauthorized in Applier Logs
The Applier service account credentials are invalid or expired.
Check Applier authentication errors
Test the service account token
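One way to test the credentials is to request a token directly. The sketch below assumes the identity endpoint is a Keystone v3 API (suggested by the `[keystone_authtoken]` section name) and uses its standard password authentication request. The URL, user, project, and domain values are placeholders, and the function names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def token_request_body(username, password, project,
                       user_domain="Default", project_domain="Default"):
    """Build a Keystone v3 password-auth request body scoped to a project."""
    return {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": username,
                        "domain": {"name": user_domain},
                        "password": password,
                    }
                },
            },
            "scope": {
                "project": {"name": project, "domain": {"name": project_domain}}
            },
        }
    }


def credentials_valid(auth_url, username, password, project):
    """Return True if a token is issued, False on 401 Unauthorized.

    auth_url is a placeholder such as "http://keystone.example:5000/v3".
    """
    request = urllib.request.Request(
        auth_url.rstrip("/") + "/auth/tokens",
        data=json.dumps(token_request_body(username, password, project)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            # Keystone returns the issued token in the X-Subject-Token header.
            return response.headers.get("X-Subject-Token") is not None
    except urllib.error.HTTPError as error:
        if error.code == 401:
            return False
        raise  # other HTTP errors point to a different problem
```

Use the same values that the Optimizer services read from watcher.conf, so the result reflects what the Applier experiences.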
If the token request fails, the credentials in watcher.conf are incorrect.
Update the [keystone_authtoken] section and restart all Optimizer containers:
Restart after credential update
Log Locations
| Component | Log Location |
|---|---|
| API | docker logs watcher_api |
| Decision Engine | docker logs watcher_decision_engine |
| Applier | docker logs watcher_applier |
| Full log files | /var/log/kolla/watcher/ (on controller host) |
Search all Optimizer logs for errors
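A minimal sketch of such a search over the log files listed in the table, written in Python. It assumes the files end in `.log` and that the log level appears as a standalone word on each line; both are assumptions about the log layout, and the function names are hypothetical.

```python
from pathlib import Path

LOG_DIR = Path("/var/log/kolla/watcher")  # location from the table above
LEVELS = ("ERROR", "CRITICAL")


def error_lines(lines, levels=LEVELS):
    """Yield lines that contain one of the levels as a whitespace-separated token."""
    for line in lines:
        tokens = line.split()
        if any(level in tokens for level in levels):
            yield line.rstrip("\n")


def scan_log_dir(log_dir=LOG_DIR):
    """Map each *.log file name under log_dir to its matching lines."""
    results = {}
    for path in sorted(Path(log_dir).glob("*.log")):
        # errors="replace" keeps the scan going if a file has invalid bytes.
        with path.open(errors="replace") as handle:
            results[path.name] = list(error_lines(handle))
    return results
```

Run `scan_log_dir()` on the controller host; files that map to an empty list contain no matching lines.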
Next Steps
Strategy Configuration
Adjust thresholds and parameters when audits generate no recommendations.
Compute Integration
Verify shared storage and CPU compatibility for live migration.
Data Sources
Diagnose Prometheus and Telemetry connectivity failures.
Security
Resolve service account authentication failures.