Overview

This guide covers the most common failure modes in the Resource Optimizer: audits that fail to complete, action plans that stall during execution, live migration errors from the Compute API, and data source connectivity issues. Each section includes log locations, diagnostic commands, and remediation steps.

Quick Diagnostic Reference

Check all Resource Optimizer container health

Check container status
docker ps --filter name=watcher \
  --format "table {{.Names}}\t{{.Status}}"
All three containers must show (healthy):
  • watcher_api
  • watcher_decision_engine
  • watcher_applier
Check for recent errors in all containers
for c in watcher_api watcher_decision_engine watcher_applier; do
  echo "=== $c ==="; docker logs --tail 20 "$c" 2>&1 | grep -E "ERROR|CRITICAL"
done
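The health check above can also be scripted so that only problem containers are reported. The following is a minimal sketch, not part of the product tooling: the helper name `unhealthy_containers` is hypothetical, and it assumes tab-separated `docker ps` output without the `table` header.

```shell
# Hypothetical helper: reads "name<TAB>status" lines on stdin and prints the
# name of every container whose status does not contain "(healthy)".
# Stopped, starting, and unhealthy containers are all reported.
unhealthy_containers() {
  awk -F '\t' '$2 !~ /\(healthy\)/ { print $1 }'
}

# Usage (on the controller host; -a includes stopped containers):
#   docker ps -a --filter name=watcher \
#     --format '{{.Names}}\t{{.Status}}' | unhealthy_containers
```

An empty result means every matching container reports (healthy). A container that is missing entirely from `docker ps -a` will not appear, so compare against the three expected names if in doubt.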
Audit states:
State | Meaning
PENDING | Queued, waiting for Decision Engine
ONGOING | Decision Engine is running the strategy
SUCCEEDED | Audit complete; action plan generated
FAILED | Audit failed; check Decision Engine logs
CANCELLED | Manually cancelled by an operator
Action plan states:
State | Meaning
RECOMMENDED | Awaiting operator approval
PENDING | Approved, waiting for Applier
ONGOING | Applier is executing actions
SUCCEEDED | All actions completed successfully
FAILED | One or more actions failed
CANCELLED | Expired or manually cancelled

Audit Failures

Audit Stuck in PENDING

The audit is queued but the Decision Engine has not picked it up.
Check Decision Engine is running
docker ps --filter name=watcher_decision_engine
docker logs watcher_decision_engine --tail 50
Common causes:
  • Decision Engine container is stopped or unhealthy
  • RabbitMQ messaging connection is broken
  • All Decision Engine workers are busy with another audit
Restart Decision Engine
docker restart watcher_decision_engine
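After a restart, it can be useful to confirm that the audit actually leaves PENDING rather than checking by hand. The sketch below polls any state-reporting command. The function name `wait_while_pending` and its arguments are hypothetical, and the usage line assumes `watcher audit show` prints the bare state when called with `-f value -c state`, as in the examples in this guide.

```shell
# Hypothetical helper: runs the given command repeatedly until it prints
# something other than PENDING, or until the attempts run out.
# Arguments: <attempts> <interval-seconds> <command...>
# Prints the last observed state; returns 0 on a state change, 1 on timeout.
wait_while_pending() {
  attempts=$1
  interval=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state=$("$@")
    if [ "$state" != "PENDING" ]; then
      echo "$state"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "PENDING"
  return 1
}

# Usage (polls every 10 seconds, up to 30 times):
#   wait_while_pending 30 10 watcher audit show <audit-uuid> -f value -c state
```

If the function times out while the Decision Engine container is healthy, the remaining causes listed above (messaging connection, busy workers) are the next things to check.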

Audit Fails with Strategy Error

Show audit details
watcher audit show <audit-uuid> \
  -f value -c state -c scope
Check Decision Engine logs for strategy errors
docker logs watcher_decision_engine 2>&1 \
  | grep -A 5 "ERROR.*strategy\|exception in.*strategy"
Common causes and fixes:
Symptom | Cause | Fix
NoDataFound | Data source not configured or unreachable | Configure Prometheus or Telemetry (see Data Sources)
InsufficientData | Not enough metric history | Wait 2–4 hours for Telemetry to accumulate history
StrategyNotFound | Custom strategy not registered | Reinstall the strategy package and restart Decision Engine
NoCandidateFound | All hosts are above the utilization threshold | Adjust strategy parameters (see Strategy Configuration)
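The symptom table above can be applied mechanically to a log excerpt. This is a rough sketch that matches the four symptom names as plain text. The helper name `strategy_error_hints` is hypothetical, and the hints simply restate the table.

```shell
# Hypothetical helper: reads Decision Engine log text on stdin and prints one
# remediation hint for each known symptom name that appears at least once.
strategy_error_hints() {
  awk '
    /NoDataFound/      { found["nodata"] = 1 }
    /InsufficientData/ { found["insufficient"] = 1 }
    /StrategyNotFound/ { found["nostrategy"] = 1 }
    /NoCandidateFound/ { found["nocandidate"] = 1 }
    END {
      if (found["nodata"])       print "NoDataFound: configure Prometheus or Telemetry (see Data Sources)"
      if (found["insufficient"]) print "InsufficientData: wait 2 to 4 hours for Telemetry to accumulate history"
      if (found["nostrategy"])   print "StrategyNotFound: reinstall the strategy package and restart Decision Engine"
      if (found["nocandidate"])  print "NoCandidateFound: adjust strategy parameters (see Strategy Configuration)"
    }
  '
}

# Usage:
#   docker logs watcher_decision_engine 2>&1 | strategy_error_hints
```

No output means none of the four symptoms were found, in which case the full traceback in the log is the better starting point.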

Audit Succeeds but Generates Empty Action Plan

The audit completed successfully but no migrations were recommended. This is expected behaviour when:
  • All hosts are within the target utilization range (no consolidation needed)
  • All instances are already on their optimal host
  • The cluster is fully balanced for the selected goal
Check current utilization
openstack hypervisor list --long \
  -f table -c Hostname -c "vCPUs Used" -c "Memory MB Used"
If the cluster appears underutilized but no actions were generated, lower the strategy threshold parameters (see Strategy Configuration).
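To judge whether the cluster is underutilized, the raw vCPU counts can be converted to percentages. The following is a sketch under stated assumptions: the helper name `underutilized_hosts` is hypothetical, the input is one "hostname used total" line per host, and the column names in the usage comment ("Hypervisor Hostname", "vCPUs Used", "vCPUs") are assumptions that may differ by client version or Compute API microversion.

```shell
# Hypothetical helper: reads "hostname vcpus_used vcpus_total" lines on stdin
# and prints each host whose vCPU utilization is below the given percentage.
# Hosts reporting zero total vCPUs are skipped to avoid division by zero.
underutilized_hosts() {
  threshold=$1
  awk -v t="$threshold" '
    $3 > 0 {
      pct = 100 * $2 / $3
      if (pct < t) printf "%s %.0f%%\n", $1, pct
    }
  '
}

# Usage (column names are assumptions; check the output of --long first):
#   openstack hypervisor list --long -f value \
#     -c "Hypervisor Hostname" -c "vCPUs Used" -c "vCPUs" \
#     | underutilized_hosts 30
```

Hosts listed here while the audit still returns an empty plan suggest that the strategy thresholds, not the cluster state, are preventing recommendations.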

Action Plan Execution Failures

Action Plan Stuck in PENDING

The plan was approved but the Applier has not started execution.
Check Applier logs
docker logs watcher_applier --tail 50
Check action plan details
watcher actionplan show <plan-uuid>
Common causes:
  • Applier container is stopped
  • Plan has expired (exceeded action_plan_expiry)
  • Taskflow workflow database is locked
Restart Applier
docker restart watcher_applier
If the plan is expired, create a new audit to generate a fresh plan.
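Whether a plan has passed its expiry can be checked with simple arithmetic on its creation time. This is a minimal sketch: the function name `plan_is_expired` is hypothetical, it assumes action_plan_expiry is expressed in hours, and the usage comment assumes GNU `date` and a `created_at` column in the `watcher actionplan show` output.

```shell
# Hypothetical helper: succeeds (exit 0) if a plan created at <created-epoch>
# is older than <expiry-hours>. The optional third argument overrides "now"
# (seconds since the epoch), which is mainly useful for testing.
plan_is_expired() {
  created_epoch=$1
  expiry_hours=$2
  now_epoch=${3:-$(date +%s)}
  [ $((now_epoch - created_epoch)) -gt $((expiry_hours * 3600)) ]
}

# Usage (GNU date; the created_at column name is an assumption):
#   created=$(date -d "$(watcher actionplan show <plan-uuid> \
#     -f value -c created_at)" +%s)
#   plan_is_expired "$created" 6 && echo "Plan expired: run a new audit"
```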

Live Migration Action Fails

The Applier attempted a migration but Xloud Compute rejected it.
Check action-level failure details
watcher action list \
  --action-plan <plan-uuid> \
  -f table -c uuid -c action_type -c state -c description
Check Applier logs for migration errors
docker logs watcher_applier 2>&1 \
  | grep -A 10 "ERROR.*migrate\|MigrationError"
Common migration errors:
Error: LiveMigrationWithOldNovaNotSupported, or the migration times out.
Cause: The instance disk is backed by local ephemeral storage and cannot be live-migrated.
Fix: Verify the instance is volume-backed before running optimization:
Check instance storage
openstack server show <instance-id> \
  -f value -c "os-extended-volumes:volumes_attached"
Instances with no attached volumes must be excluded from optimization scope or migrated to volume-backed equivalents by the project owner.
Error: MigrationPreCheckError: Guest requires CPU feature not present on destination.
Cause: Compute hosts have different CPU feature sets and no common baseline is configured.
Fix: Set a common CPU model in nova.conf on all compute hosts:
nova.conf — CPU compatibility
[libvirt]
cpu_mode = custom
cpu_model = Cascadelake-Server-noTSX
See Compute Integration for details.
Error: NoValidHost: No valid host was found.
Cause: The destination host identified during the audit no longer has sufficient vCPU or memory available (cluster state changed between audit and execution).
Fix: Run a new audit to generate a fresh plan reflecting current cluster state. Lower action_plan_expiry to prevent stale plans from executing:
watcher.conf
[DEFAULT]
action_plan_expiry = 6
Error: HTTPBadRequest: Cannot live migrate to disabled host.
Cause: A host was disabled between audit completion and plan execution.
Fix: Re-enable the host or run a new audit with current host availability.
Re-enable a host
openstack compute service set --enable <hostname> nova-compute
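Before approving a plan, it may be worth listing any compute hosts that are currently disabled, since they cannot be migration targets. The sketch below filters "host status" pairs. The helper name `disabled_compute_hosts` is hypothetical, and the usage comment assumes the `Host` and `Status` columns of `openstack compute service list`.

```shell
# Hypothetical helper: reads "hostname status" lines on stdin and prints the
# hostname of every entry whose status is exactly "disabled".
disabled_compute_hosts() {
  awk '$2 == "disabled" { print $1 }'
}

# Usage:
#   openstack compute service list --service nova-compute \
#     -f value -c Host -c Status | disabled_compute_hosts
```

Any host printed here that also appears as a destination in the action plan will cause the migration to be rejected.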

Data Source Issues

Prometheus Not Reachable

Strategies that require Prometheus (outlet_temperature, saving_energy) fail with NoDataFound.
Test Prometheus connectivity from Decision Engine
docker exec watcher_decision_engine \
  curl -s "http://10.0.1.71:9291/api/v1/query?query=up" | python3 -m json.tool
Expected: "status": "success" with results.
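The expected response can also be checked in a script instead of by eye. This sketch uses a rough text match rather than a full JSON parse. The function name `prom_query_ok` is hypothetical, and it assumes the standard Prometheus query response shape with `status` and `data.result` fields.

```shell
# Hypothetical helper: reads a Prometheus query response on stdin.
# Succeeds only if the status is "success" and the result array is not empty.
prom_query_ok() {
  body=$(cat)
  printf '%s' "$body" \
    | grep -q '"status"[[:space:]]*:[[:space:]]*"success"' || return 1
  # An empty result array means the query ran but matched no series.
  if printf '%s' "$body" \
    | grep -q '"result"[[:space:]]*:[[:space:]]*\[[[:space:]]*]'; then
    return 1
  fi
  return 0
}

# Usage:
#   docker exec watcher_decision_engine \
#     curl -s "http://10.0.1.71:9291/api/v1/query?query=up" \
#     | prom_query_ok && echo "Prometheus OK" || echo "Prometheus check failed"
```

A failure with an empty body usually points to connectivity, while a success status with no results points to missing scrape targets.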
Check Prometheus section in watcher.conf
grep -A 5 "\[prometheus_client\]" /etc/xavs/watcher/watcher.conf

Telemetry Metrics Missing

Strategies that require Telemetry (workload_stabilization, noisy_neighbor) fail with InsufficientData or generate no recommendations.
Check Telemetry collector is configured
grep -A 5 "\[ceilometer_client\]" /etc/xavs/watcher/watcher.conf
Verify metrics exist in Telemetry
openstack metric resource list --type instance | head -5
If no instance resources are listed, the Telemetry service is not collecting metrics. Verify Xloud Telemetry is deployed and the ceilometer compute agent is enabled.

Authentication Failures

401 Unauthorized in Applier Logs

The Applier service account credentials are invalid or expired.
Check Applier authentication errors
docker logs watcher_applier 2>&1 | grep "401\|Unauthorized\|keystoneauth"
Test the service account token
openstack --os-username watcher-service \
  --os-password "<service-account-password>" \
  --os-project-name service \
  token issue
If the token issue command fails, the service account credentials in watcher.conf are incorrect. Update the [keystone_authtoken] section and restart all Optimizer containers:
Restart after credential update
docker restart watcher_api watcher_decision_engine watcher_applier

Log Locations

Component | Log Location
API | docker logs watcher_api
Decision Engine | docker logs watcher_decision_engine
Applier | docker logs watcher_applier
Full log files | /var/log/kolla/watcher/ (on controller host)
Search all Optimizer logs for errors
grep -r "ERROR\|CRITICAL" /var/log/kolla/watcher/ \
  | tail -50
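When the combined search returns a lot of output, a per-file count shows which component is producing the errors. This is a minimal sketch: the function name `error_counts` is hypothetical, and it assumes the log files in the directory end in `.log`.

```shell
# Hypothetical helper: prints "<count> <file>" for each *.log file in the
# given directory (default /var/log/kolla/watcher), counting lines that
# contain ERROR or CRITICAL, with the noisiest file first.
error_counts() {
  log_dir=${1:-/var/log/kolla/watcher}
  for f in "$log_dir"/*.log; do
    [ -f "$f" ] || continue
    printf '%s %s\n' "$(grep -c -E 'ERROR|CRITICAL' "$f")" "$(basename "$f")"
  done | sort -rn
}

# Usage (on the controller host):
#   error_counts
#   error_counts /var/log/kolla/watcher
```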

Next Steps

Strategy Configuration

Adjust thresholds and parameters when audits generate no recommendations.

Compute Integration

Verify shared storage and CPU compatibility for live migration.

Data Sources

Diagnose Prometheus and Telemetry connectivity failures.

Security

Resolve service account authentication failures.