Overview
Admin-level Orchestration issues differ from user-facing stack failures. They typically involve the service itself: engine workers not starting, the API becoming unreachable, trust or stack domain misconfiguration, or a resource plugin failing to load. Use the service log files and openstack orchestration service list as primary diagnostic tools.
Diagnostic Reference
Engine workers not processing stacks
Symptoms: Stacks remain in CREATE_IN_PROGRESS indefinitely. No events appear in openstack stack event list. openstack orchestration service list shows engine workers as down.
Diagnosis:
Check service status
Check engine container logs (XDeploy/XAVS deployment)
Check message queue connectivity
Common causes and resolutions:
| Cause | Resolution |
|---|---|
| Engine container exited on startup | Check docker logs heat_engine for the error. Common causes: database connection failure, misconfigured heat.conf |
| RabbitMQ connection refused | Verify RabbitMQ is running: docker ps \| grep rabbit. Check transport_url in engine configuration |
| Database migration not applied | Run docker exec heat_engine heat-manage db_sync to apply pending migrations |
| Stack domain not configured | Check stack_domain_admin and stack_domain_admin_password in heat.conf |
Restart the engine:
Restart engine container
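The diagnosis steps and the restart above can be sketched as a few shell helpers. This is a minimal sketch, not official tooling: the function names are illustrative, admin credentials are assumed to be loaded in the environment, and the container name heat_engine is taken from the table above.

```shell
# Sketch only: helper names are illustrative, not part of XDeploy.
# Assumes admin credentials are already loaded in the environment.

# Step 1: list engine workers and their status (look for "down").
check_engine_status() {
    openstack orchestration service list
}

# Step 2: keep only error-level lines from the recent engine container log.
recent_engine_errors() {
    docker logs --tail 200 heat_engine 2>&1 | grep -E 'ERROR|CRITICAL'
}

# Step 3: confirm a RabbitMQ container is running on this host.
check_rabbit_running() {
    docker ps | grep rabbit
}

# Restart the engine container once the underlying cause is fixed.
restart_engine() {
    docker restart heat_engine
}
```

After a restart, running check_engine_status again should show the workers reporting as up if the underlying cause was resolved.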
API returning 500 or refusing connections
Symptoms: Dashboard shows Orchestration as unavailable. CLI commands return 503 Service Unavailable or connection refused on port 8004.
Diagnosis:
Check API container status
Test API endpoint directly
Check HAProxy backend health
Common causes and resolutions:
| Cause | Resolution |
|---|---|
| API container not running | docker start heat_api |
| Keystone endpoint not registered | Verify: openstack endpoint list \| grep orchestration |
| SSL certificate expired (if TLS enabled) | Renew certificate and restart API container |
| HAProxy backend marked DOWN | Check network connectivity between HAProxy and the API container; restart the API |
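The first two diagnosis steps above can be sketched in shell as follows. The helper names are illustrative, and the URL passed to probe_api is whatever address your API listens on (port 8004 per the symptoms above). Plain HTTP is shown; a TLS-enabled deployment would need an https URL and a trusted certificate. The HAProxy check is left out because the stats interface location depends on the deployment.

```shell
# Sketch only: helper names are illustrative, not part of XDeploy.

# Step 1: show whether the API container is running.
check_api_container() {
    docker ps --filter name=heat_api
}

# Step 2: probe the API directly and print only the HTTP status code.
# $1 is the full API URL; curl reports 000 when it cannot connect.
probe_api() {
    curl -s -o /dev/null -w '%{http_code}\n' "$1"
}

# Map a status code to a rough diagnosis category.
classify_status() {
    case "$1" in
        000) echo "unreachable" ;;
        5??) echo "server-error" ;;
        *)   echo "responding" ;;
    esac
}
```

For example, classify_status "$(probe_api http://192.0.2.10:8004/)" separates a refused connection from a 5xx answer; the address here is a documentation placeholder.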
Stack domain user creation fails
Symptoms: Stacks containing WaitCondition or auto-scaling resources fail with errors mentioning StackDomainUser or TrustActionMismatch. Users cannot create stacks that require credentials delegation.
Diagnosis:
Verify stack domain exists
Verify stack domain admin user
Test stack domain admin credentials
Common causes and resolutions:
| Cause | Resolution |
|---|---|
| heat domain does not exist | Re-run xavs-ansible deploy -t heat to recreate the domain |
| Stack domain admin password incorrect | Update heat_domain_admin_password in passwords.yml and redeploy |
| stack_domain_admin setting missing from heat.conf | Verify XDeploy configuration and redeploy |
| Xloud Identity service unreachable from engine | Check network connectivity between the engine container and the Identity service on port 5000 |
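The three diagnosis steps above can be sketched in shell as follows. The domain name heat comes from the table above. The user name heat_domain_admin is an assumption and should match the stack_domain_admin value in heat.conf. The helper names are illustrative, and the set of environment variables to clear may differ depending on how your credentials are loaded.

```shell
# Sketch only: helper names are illustrative, not part of XDeploy.
# "heat" is the domain named in the table above; the admin user name is an
# assumption and should match stack_domain_admin in heat.conf.
STACK_DOMAIN="${STACK_DOMAIN:-heat}"
STACK_DOMAIN_ADMIN="${STACK_DOMAIN_ADMIN:-heat_domain_admin}"

# Step 1: verify the stack domain exists (run with admin credentials).
verify_stack_domain() {
    openstack domain show "$STACK_DOMAIN"
}

# Step 2: verify the stack domain admin user exists in that domain.
verify_stack_domain_admin() {
    openstack user show --domain "$STACK_DOMAIN" "$STACK_DOMAIN_ADMIN"
}

# Step 3: request a domain-scoped token as the stack domain admin.
# $1 is the password. Runs in a subshell so the caller's environment is kept.
test_stack_domain_credentials() {
    (
        # Project scope and domain scope cannot be combined in one request.
        unset OS_PROJECT_NAME OS_PROJECT_ID OS_PROJECT_DOMAIN_NAME OS_PROJECT_DOMAIN_ID
        export OS_USERNAME="$STACK_DOMAIN_ADMIN"
        export OS_PASSWORD="$1"
        export OS_USER_DOMAIN_NAME="$STACK_DOMAIN"
        export OS_DOMAIN_NAME="$STACK_DOMAIN"
        openstack token issue
    )
}
```

A failed token request in step 3, with steps 1 and 2 succeeding, points at the password row in the table above.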
Resource plugin fails to load or raises errors
Symptoms: Specific resource types consistently fail with InvalidTemplateVersion or ResourceTypeUnavailable. The engine log shows import errors.
Diagnosis:
List available resource types
Show resource type schema
Check engine log for plugin errors
Common causes and resolutions:
| Cause | Resolution |
|---|---|
| Dependent service not enabled | Some resource types require specific services. Xloud::Networking::FloatingIP requires networking; verify the service is enabled |
| Plugin version mismatch after upgrade | Restart the engine after upgrades: docker restart heat_engine |
| Custom plugin missing | If using custom resource plugins, verify the plugin file exists in the engine’s plugin directory and has correct permissions |
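The diagnosis steps above can be sketched as shell helpers. The helper names are illustrative, admin credentials are assumed to be loaded, and the log patterns searched for are a guess at typical Python import failures rather than an exhaustive list.

```shell
# Sketch only: helper names are illustrative, not part of XDeploy.
# Assumes admin credentials are already loaded in the environment.

# Step 1: check whether a resource type is registered with the engine.
# $1 is the resource type name, e.g. Xloud::Networking::FloatingIP.
resource_type_available() {
    openstack orchestration resource type list -f value | grep -F "$1"
}

# Step 2: show the schema the engine has loaded for a resource type.
show_resource_type() {
    openstack orchestration resource type show "$1"
}

# Step 3: look for plugin load failures in the engine container log.
plugin_load_errors() {
    docker logs heat_engine 2>&1 | grep -E 'ImportError|ModuleNotFoundError'
}
```

If resource_type_available prints nothing for a type that should exist, the dependent-service and custom-plugin rows in the table above are the first things to check.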
Large stacks time out or fail under load
Symptoms: Stacks with many resources (100+) frequently time out or take much longer than expected. Engine workers appear idle despite stacks being queued.
Diagnosis:
Check engine worker count
Check message queue depth
Resolutions:
| Action | Setting |
|---|---|
| Increase engine workers | heat_engine_workers: 8 (or higher) |
| Increase RPC timeout | heat_rpc_response_timeout: 300 |
| Increase database pool | heat_db_max_pool_size: 20 |
| Verify convergence mode is on | heat_convergence_engine: true |
Apply changes by updating globals and redeploying:
Redeploy with new settings
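The worker-count and queue-depth checks, plus the redeploy, can be sketched in shell as follows. The helper names are illustrative, and the RabbitMQ container name rabbitmq is an assumption. The redeploy command is the one this page uses for recreating the heat domain; whether it is also the right invocation after a globals change should be confirmed against your XDeploy procedure.

```shell
# Sketch only: helper names are illustrative, not part of XDeploy.
# Assumes admin credentials are loaded and that RabbitMQ runs in a
# container named "rabbitmq" (an assumption; adjust to your deployment).

# Count engine workers that report status "up".
count_up_workers() {
    openstack orchestration service list -f value -c Status | grep -c '^up$'
}

# Print queue names and message counts from RabbitMQ.
list_queue_depths() {
    docker exec rabbitmq rabbitmqctl list_queues name messages
}

# Read list_queues output on stdin; print queues holding at least $1 messages.
# Header and banner lines are skipped because their second field is not a number.
busy_queues() {
    awk -v min="$1" '$2 ~ /^[0-9]+$/ && $2 + 0 >= min + 0 { print $1, $2 }'
}

# Redeploy the Orchestration service after editing globals.
redeploy_heat() {
    xavs-ansible deploy -t heat
}
```

For example, list_queue_depths | busy_queues 100 lists only queues with a backlog of 100 or more messages; the threshold is arbitrary and only serves to separate idle queues from backed-up ones.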
Log Locations
| Service | Log Path |
|---|---|
| Orchestration Engine | docker logs heat_engine or /var/log/kolla/heat/heat-engine.log |
| Orchestration API | docker logs heat_api or /var/log/kolla/heat/heat-api.log |
| CloudFormation-compatible API | docker logs heat_api_cfn or /var/log/kolla/heat/heat-api-cfn.log |
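As a quick first pass over the file-based logs listed above, the following sketch counts error-level lines per log file. The helper name is illustrative, and matching on the literal string " ERROR " assumes the usual log line layout where the level appears as a separate word.

```shell
# Sketch only: helper name is illustrative, not part of XDeploy.
# Directory holding the Orchestration log files (path from the table above).
HEAT_LOG_DIR="${HEAT_LOG_DIR:-/var/log/kolla/heat}"

# Print "<file>: <count>" for each service log, counting lines that
# contain the word ERROR surrounded by spaces.
count_log_errors() {
    for name in heat-engine heat-api heat-api-cfn; do
        file="$HEAT_LOG_DIR/$name.log"
        if [ -r "$file" ]; then
            echo "$name.log: $(grep -c ' ERROR ' "$file")"
        else
            echo "$name.log: not readable"
        fi
    done
}
```

A log with a high count is a reasonable place to start reading; the counts themselves say nothing about the cause.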
Next Steps
Configuration: Review and update service configuration through XDeploy.
Scaling the Service: Add engine workers to resolve throughput and timeout issues.
Security: Diagnose stack domain and trust authorization problems.
User Troubleshooting: Stack-level diagnostics for CREATE_FAILED and template errors.