# Operator Troubleshooting

> Operator-side diagnostics for XMS: platform health, worker availability, and service-level recovery.

## Overview

This page covers the operator-facing failure modes: the ones you investigate when multiple users report broken migrations or the Migration panel itself is unresponsive. For end-user workflow failures (single-job symptoms), see the [User Troubleshooting](/services/migration/user-guide/troubleshooting) page instead.

***

## Quick Health Check

Work through these checks in order when an operator incident is reported:

1. Confirm the Xloud Identity service is reachable and responding to token validation. XMS cannot accept any API call if identity is down.
2. Confirm block storage and compute are healthy in the platform health dashboard. A migration cannot complete without both.
3. Call the XMS API from an operator workstation or CLI. A non-responding API is the single most visible symptom and usually points at control plane or upstream identity issues.
4. Check the XMS control plane for worker availability. If zero workers are available, new jobs queue indefinitely.
5. Confirm the XMS deployment can still reach every registered source endpoint. A network path change on the operator side is a common cause of widespread job failures.

***

## Common Operator Incidents

**Symptom**: Users submit migrations and they never leave **Queued**.

**Cause**: No workers are available, either because workers are down or because existing jobs are holding every worker slot.

**Fix**:

* Confirm worker health: are any workers in a failed state?
* Check in-flight job count.
If the job count equals the worker count, this is expected: jobs are waiting their turn.
* If workers are down, restart the worker pool and monitor the orchestrator.

**Symptom**: Every registered source reports **Disconnected** in the Dashboard at the same time.

**Cause**: A network change on the XMS side broke the outbound path to all sources, or the platform's outbound DNS stopped resolving.

**Fix**:

* Confirm DNS resolution from XMS for the source hostnames.
* Confirm outbound TCP/443 reachability from XMS to the source endpoints.
* Check for any new egress policy on the operator-managed firewall.

**Symptom**: Multiple jobs fail during the write volume phase even though block storage is reported healthy.

**Cause**: The block storage API is responsive, but the target back-end is running out of capacity or a specific volume type has hit its quota.

**Fix**:

* Review block storage capacity and volume type quota in the platform health dashboard.
* Confirm the target project has volume footprint headroom.
* Rebalance the campaign to a different volume type while capacity is provisioned.

**Symptom**: Dashboard users report the Migration panel is slow to load or returns a timeout.

**Cause**: The XMS control plane is under load, the job state store is slow, or the upstream identity service is slow to validate tokens.

**Fix**:

* Measure API latency from the operator side.
* Check the job state store health.
* Check identity service latency. A slow identity service impacts every Xloud UI, not just XMS.

**Symptom**: Discovery runs take significantly longer than previous runs, or return partial results.

**Cause**: The source environment is under load, or changed block tracking (CBT) state on the source is being rebuilt after a host or datastore change.

**Fix**:

* Coordinate with the source environment owner, who may be doing maintenance.
* Re-run discovery during a lower-load window on the source side.

***

## Operator Diagnostics to Collect

When escalating an incident to Xloud support, collect:

* Platform health dashboard screenshot at the time of the incident
* XMS worker availability snapshot
* A representative failed job ID and its full event stream
* Source environment type, version, and the specific endpoint affected
* Any platform-side changes in the window before the incident started
* Audit log entries for source and job operations in the incident window

Attach all of the above to the incident ticket before escalating.

***

## Recovery Operations

### Cancel a Stuck Job

Jobs that are stuck in a phase can be cancelled from the Dashboard or CLI. Cancellation stops the phase cleanly, releases any disk transport session and worker slot, and marks the job as **Cancelled**. Source and target state are left unchanged, so the user can re-submit or clean up manually.

### Retry After a Transient Failure

Failed jobs are not automatically retried. If a job failed due to a transient cause (network blip, momentary source unavailability), the user can re-submit it from the Dashboard. For warm migrations that failed after the full sync completed, re-submission restarts from scratch, so coordinate with the user on whether a cold fallback is faster.

### Scale Worker Pool

If worker availability is the bottleneck, the operator can scale the XMS worker pool. The exact steps depend on your deployment; typically it is a configuration change followed by a controlled restart of the control plane.

***

## Next Steps

* Single-job failure modes and fixes
* Size XMS to avoid capacity-driven incidents
* Component map and workload data flow
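
***

## Example: Scripting the Source Reachability Checks

The fix for the incident where every source reports **Disconnected** asks for a DNS check and an outbound TCP/443 check from the XMS side. The sketch below shows one way to script both steps with the Python standard library. It is not an XMS tool: `check_endpoint`, `summarize`, and `ReachabilityResult` are hypothetical names introduced for illustration, and the script is assumed to run on a host that shares the XMS deployment's outbound network path.

```python
import socket
from dataclasses import dataclass


@dataclass
class ReachabilityResult:
    host: str
    dns_ok: bool
    tcp_ok: bool
    detail: str


def check_endpoint(host, port=443, timeout=3.0,
                   resolve=socket.getaddrinfo,
                   connect=socket.create_connection):
    """Resolve `host`, then attempt a TCP connection to host:port.

    `resolve` and `connect` default to the real socket calls; they are
    parameters only so the logic can be exercised without a network.
    """
    try:
        resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Name resolution failed, so the TCP step is not attempted.
        return ReachabilityResult(host, False, False, f"DNS failed: {exc}")
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError as exc:
        # Covers refused connections, timeouts, and unreachable networks.
        return ReachabilityResult(host, True, False, f"TCP connect failed: {exc}")
    conn.close()
    return ReachabilityResult(host, True, True, "ok")


def summarize(results):
    """Map per-source results onto the causes named for this incident."""
    if not results:
        return "no sources checked"
    if all(not r.dns_ok for r in results):
        return "all sources fail DNS: suspect platform outbound DNS"
    if all(not r.tcp_ok for r in results):
        return "all sources fail TCP: suspect egress policy or network path"
    if all(r.tcp_ok for r in results):
        return "all sources reachable"
    failed = sum(1 for r in results if not r.tcp_ok)
    return f"partial failure: {failed} of {len(results)} sources unreachable"
```

To use it, call `check_endpoint` once per registered source hostname and pass the collected results to `summarize`. A result where every source fails the same step points at an XMS-side cause, while a partial failure suggests the problem is specific to individual sources.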