

Overview

This page covers the operator-facing failure modes — the ones you investigate when multiple users report broken migrations or the Migration panel itself is unresponsive. For end-user workflow failures (single-job symptoms), see the User Troubleshooting page instead.

Quick Health Check

Work through these checks in order when an operator incident is reported:

Platform identity

Confirm the Xloud Identity service is reachable and responding to token validation. XMS cannot accept any API call if identity is down.
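
If you want to script this check from an operator workstation, the sketch below sends a token-validation request to the identity service. The base URL, the /v1/tokens/validate path, and the environment variable names are illustrative assumptions, not documented Xloud values; substitute your deployment's own endpoint and credentials.

  # Hypothetical probe: confirm the identity service answers token validation.
  # The URL and path are placeholders; adjust to match your deployment.
  import os
  import urllib.request

  IDENTITY_URL = os.environ.get("XLOUD_IDENTITY_URL", "https://identity.example.internal")
  TOKEN = os.environ["XLOUD_TOKEN"]

  req = urllib.request.Request(
      f"{IDENTITY_URL}/v1/tokens/validate",
      headers={"Authorization": f"Bearer {TOKEN}"},
  )
  try:
      with urllib.request.urlopen(req, timeout=5) as resp:
          print("identity service reachable, HTTP", resp.status)
  except Exception as exc:
      print("identity check failed:", exc)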

Platform block storage and compute

Confirm block storage and compute are healthy in the platform health dashboard. A migration cannot complete without both.

XMS API responsiveness

Call the XMS API from an operator workstation or CLI. A non-responding API is the single most visible symptom and usually points at control plane or upstream identity issues.
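
A scripted version of this check could look like the following. The base URL and /health path are assumptions for illustration, and authentication is omitted for brevity; use the endpoint and credentials your deployment actually exposes.

  # Hypothetical check: call the XMS API and report status and round-trip time.
  # XMS_API_URL and the /health path are placeholders for your deployment's values.
  import os
  import time
  import urllib.request

  XMS_API_URL = os.environ.get("XMS_API_URL", "https://xms.example.internal")

  start = time.monotonic()
  try:
      with urllib.request.urlopen(f"{XMS_API_URL}/health", timeout=10) as resp:
          elapsed = time.monotonic() - start
          print(f"XMS API responded: HTTP {resp.status} in {elapsed:.2f}s")
  except Exception as exc:
      print("XMS API did not respond:", exc)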

Worker availability

Check the XMS control plane for worker availability. If zero workers are available, new jobs queue indefinitely.
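
If your control plane exposes worker and queue information over the API, a comparison along these lines shows whether jobs are queuing because every slot is occupied or because no workers are registered at all. The paths, query parameter, and JSON fields are illustrative assumptions, not a documented XMS schema, and authentication is omitted for brevity.

  # Hypothetical worker-availability check against the XMS control plane API.
  # The endpoint paths and response fields are placeholders.
  import json
  import os
  import urllib.request

  XMS_API_URL = os.environ.get("XMS_API_URL", "https://xms.example.internal")

  def get_json(path):
      with urllib.request.urlopen(f"{XMS_API_URL}{path}", timeout=10) as resp:
          return json.load(resp)

  workers = get_json("/v1/workers")
  queued_jobs = get_json("/v1/jobs?state=queued")

  available = sum(1 for w in workers if w.get("status") == "available")
  print(f"workers available: {available} of {len(workers)}")
  print(f"queued jobs: {len(queued_jobs)}")
  if available == 0:
      print("no workers available; new jobs will queue indefinitely")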

Source reachability from XMS

Confirm the XMS deployment can still reach every registered source endpoint. A network path change on the operator side is a common cause of widespread job failures.

Common Operator Incidents

Symptom: Users submit migrations and they never leave Queued.
Cause: No workers are available, either because workers are down or because an existing set of jobs is holding every worker slot.
Fix:
  • Confirm worker health — are any workers in a failed state?
  • Check in-flight job count. If the job count equals the worker count, this is expected — jobs are waiting their turn
  • If workers are down, restart the worker pool and monitor the orchestrator

Symptom: Every registered source reports Disconnected in the Dashboard at the same time.
Cause: A network change on the XMS side broke the outbound path to all sources, or the platform’s outbound DNS stopped resolving.
Fix (see the reachability sketch after this list):
  • Confirm DNS resolution from XMS for the source hostnames
  • Confirm outbound TCP/443 reachability from XMS to the source endpoints
  • Check for any new egress policy on the operator-managed firewall
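
A minimal sketch covering the first two checks, run from the XMS deployment, might look like this. The hostnames are placeholders; substitute the source endpoints registered in your Dashboard.

  # Check DNS resolution and outbound TCP/443 for each registered source endpoint.
  # The hostnames below are examples only.
  import socket

  SOURCE_HOSTS = ["vcenter-a.example.internal", "vcenter-b.example.internal"]

  for host in SOURCE_HOSTS:
      try:
          addr = socket.getaddrinfo(host, 443)[0][4][0]
      except socket.gaierror as exc:
          print(f"{host}: DNS resolution failed ({exc})")
          continue
      try:
          with socket.create_connection((host, 443), timeout=5):
              print(f"{host}: resolves to {addr}, TCP/443 reachable")
      except OSError as exc:
          print(f"{host}: resolves to {addr}, TCP/443 unreachable ({exc})")
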
Symptom: Multiple jobs fail during the write volume phase even though block storage is reported healthy.
Cause: The block storage API is responsive but the target back-end is running out of capacity, or a specific volume type has hit its quota.
Fix:
  • Review block storage capacity and volume type quota in the platform health dashboard
  • Confirm the target project has volume footprint headroom
  • Rebalance the campaign to a different volume type while capacity is provisioned

Symptom: Dashboard users report the Migration panel is slow to load or returns a timeout.
Cause: The XMS control plane is under load, the job state store is slow, or the upstream identity service is slow to validate tokens.
Fix (see the latency sketch after this list):
  • Measure API latency from the operator side
  • Check the job state store health
  • Check identity service latency — a slow identity service impacts every Xloud UI, not just XMS
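
To put numbers on "slow", sample latency from the operator side against both the XMS API and the identity service. The endpoint URLs below are placeholders, and authentication is omitted for brevity.

  # Hypothetical latency sampling of the XMS API and the identity service.
  # Endpoint URLs are placeholders; substitute your deployment's values.
  import statistics
  import time
  import urllib.request

  ENDPOINTS = {
      "XMS API": "https://xms.example.internal/health",
      "identity": "https://identity.example.internal/health",
  }

  for name, url in ENDPOINTS.items():
      samples = []
      for _ in range(5):
          start = time.monotonic()
          try:
              urllib.request.urlopen(url, timeout=10).close()
          except Exception as exc:
              print(f"{name}: request failed ({exc})")
              break
          samples.append(time.monotonic() - start)
      if samples:
          print(f"{name}: median {statistics.median(samples):.2f}s, max {max(samples):.2f}s over {len(samples)} samples")
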
Symptom: Discovery runs take significantly longer than previous runs, or return partial results.
Cause: The source environment is under load, or CBT state on the source is being rebuilt after a host or datastore change.
Fix:
  • Coordinate with the source environment owner — they may be doing maintenance
  • Re-run discovery during a lower-load window on the source side

Operator Diagnostics to Collect

When escalating an incident to Xloud support, collect:
  • Platform health dashboard screenshot at the time of the incident
  • XMS worker availability snapshot
  • A representative failed job ID and its full event stream (see the collection sketch below)
  • Source environment type, version, and the specific endpoint affected
  • Any platform-side changes in the window before the incident started
  • Audit log entries for source and job operations in the incident window
Attach all of the above to the incident ticket before escalating.
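
If you prefer to script the event-stream capture, a sketch along these lines saves it to a file you can attach to the ticket. The /v1/jobs/<id>/events path is an illustrative assumption, not a documented XMS endpoint, and authentication is omitted for brevity.

  # Hypothetical collection script: save a failed job's event stream to a file.
  import os
  import sys
  import urllib.request

  XMS_API_URL = os.environ.get("XMS_API_URL", "https://xms.example.internal")
  job_id = sys.argv[1]

  with urllib.request.urlopen(f"{XMS_API_URL}/v1/jobs/{job_id}/events", timeout=30) as resp:
      events = resp.read()

  out_path = f"job-{job_id}-events.json"
  with open(out_path, "wb") as fh:
      fh.write(events)
  print(f"wrote {out_path} ({len(events)} bytes)")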

Recovery Operations

Cancel a Stuck Job

Jobs that are stuck in a phase can be cancelled from the Dashboard or CLI. Cancellation stops the phase cleanly, releases any disk transport session and worker slot, and marks the job as Cancelled. Source and target state are left unchanged — the user can re-submit or clean up manually.
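
If you cancel from a script rather than the Dashboard or CLI, the call might look like the sketch below. The cancel path and POST method are assumptions about the XMS API, not documented behavior; prefer the CLI command your deployment documents.

  # Hypothetical cancellation call against the XMS API.
  # The /v1/jobs/<id>/cancel path is an illustrative placeholder.
  import os
  import sys
  import urllib.request

  XMS_API_URL = os.environ.get("XMS_API_URL", "https://xms.example.internal")
  job_id = sys.argv[1]

  req = urllib.request.Request(f"{XMS_API_URL}/v1/jobs/{job_id}/cancel", method="POST")
  with urllib.request.urlopen(req, timeout=30) as resp:
      print(f"cancel requested for job {job_id}: HTTP {resp.status}")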

Retry After a Transient Failure

Failed jobs are not automatically retried. If a job failed due to a transient cause (network blip, momentary source unavailability), the user can re-submit it from the Dashboard. For warm migrations that failed after the full sync completed, re-submission restarts from scratch — coordinate with the user on whether a cold fallback is faster.

Scale Worker Pool

If worker availability is the bottleneck, the operator can scale the XMS worker pool. The exact steps depend on your deployment — typically it is a configuration change followed by a controlled restart of the control plane.

Next Steps

User Troubleshooting

Single-job failure modes and fixes

Capacity Planning

Size XMS to avoid capacity-driven incidents

Architecture

Component map and workload data flow