# Troubleshooting

> Diagnose and resolve XSDS cluster-level issues: HEALTH_WARN states, OSD failures, slow requests, PG inconsistencies, and capacity emergencies.

## Overview

This page covers cluster-level troubleshooting procedures for XSDS administrators. For tenant-facing storage issues (stuck volumes, snapshot failures), see the [XSDS User Guide Troubleshooting](/services/sds/user-guide/troubleshooting) page.

**Administrator Access Required**: The procedures on this page require the `admin` role. Contact your Xloud administrator if you do not have sufficient permissions.

**Prerequisites**

* Administrator credentials with the `admin` role
* SSH access to a cluster management node

***

## Diagnostic Reference

Before investigating specific issues, collect the full health report:

```bash title="Full health report" theme={null}
ceph health detail
```

```bash title="Cluster status overview" theme={null}
ceph status
```

```bash title="Recent cluster log events" theme={null}
ceph log last 100
```

***

## Common Issues

### Pool has too few placement groups

**Cause**: A pool has fewer placement groups (PGs) than recommended for its current data volume. Under-provisioned PGs cause I/O imbalance across OSDs.

**Diagnosis**:

```bash title="Check PG autoscale status" theme={null}
ceph osd pool autoscale-status
```

**Resolution**: Enable PG autoscaling so the cluster manages PG counts automatically:

```bash title="Enable autoscaler on a pool" theme={null}
ceph osd pool set <pool-name> pg_autoscale_mode on
```

Set `pg_autoscale_mode` to `on` for all pools as a default configuration. The autoscaler prevents both under- and over-provisioned PG counts as pool data volumes change.

### OSDs down

**Cause**: A physical disk or node has failed, causing one or more OSDs to go down.
**Diagnosis**:

```bash title="Identify failed OSDs" theme={null}
ceph osd tree | grep down
```

**Resolution**:

1. Identify the failed OSD and its host from `ceph osd tree`.
2. Mark the OSD `out` to begin data recovery on the remaining OSDs:

   ```bash title="Mark OSD out" theme={null}
   ceph osd out <osd-id>
   ```
3. Monitor recovery: `watch ceph status`
4. After recovery completes (status shows `HEALTH_OK`), replace the failed hardware.
5. Redeploy the OSD through XDeploy on the replacement device.

Do not remove an OSD before the cluster has finished recovering data. Removing an OSD during recovery on an already-degraded cluster risks data loss.

### Slow requests

**Cause**: Blocked OSD operations, high CPU load on OSD nodes, or network congestion on the storage network.

**Diagnosis**:

```bash title="Check for blocked operations" theme={null}
ceph health detail | grep -i slow
```

```bash title="View OSD performance" theme={null}
ceph osd perf
```

**Resolution**:

| Root Cause                            | Resolution                                                      |
| ------------------------------------- | --------------------------------------------------------------- |
| HDD fragmentation                     | Schedule `ceph osd scrub <osd-id>` during a low-traffic period  |
| CPU throttling on OSD nodes           | Review CPU allocation in XDeploy                                |
| Rebalancing competing with client I/O | Throttle recovery I/O (see below)                               |
| Full disks on some OSDs               | Add capacity or rebalance via CRUSH weight                      |

```bash title="Throttle recovery I/O" theme={null}
# Limit concurrent backfill and recovery operations per OSD.
# On releases that use the mClock scheduler, these limits may be
# managed by the scheduler profile instead.
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
```

### Inconsistent placement groups

**Cause**: Scrubbing detected a data inconsistency between OSD replicas. This may indicate a hardware error (bad disk sector, bit flip).

**Diagnosis**:

```bash title="Find inconsistent PGs" theme={null}
ceph health detail | grep inconsistent
```

```bash title="Get PG details" theme={null}
ceph pg <pg-id> query
```

**Resolution**:

```bash title="Repair inconsistent PG" theme={null}
ceph pg repair <pg-id>
```

PG repair selects the primary OSD's copy as authoritative and overwrites inconsistent replicas.
If the primary copy is also corrupt, the repair can propagate that corruption. Verify application-level data integrity after the repair.

### Cluster or pool near full

**Cause**: Pool or cluster capacity has reached a critical threshold, causing the cluster to throttle or reject writes.

**Diagnosis**:

```bash title="Check cluster capacity" theme={null}
ceph df
```

```bash title="Identify full OSDs" theme={null}
ceph osd df | sort -k9 -rn | head -10
```

**Resolution**:

* **Immediate**: Raise the `full_ratio` temporarily so the cluster accepts writes while expansion is underway:

  ```bash title="Temporarily raise full ratio (emergency only)" theme={null}
  ceph osd set-full-ratio 0.97
  ```
* **Short-term**: Delete unnecessary data, snapshots, or expired objects.
* **Long-term**: Add new OSD nodes through XDeploy (see [Capacity Planning](/services/sds/admin-guide/capacity-planning)).

Operating above 90% utilization with the `full_ratio` raised is an emergency measure only. Data loss can occur if any additional OSD fails while the cluster is in this state.

### RGW service unavailable

**Cause**: The RGW service is down or misconfigured, or the backing data pool is unavailable.

**Diagnosis**:

```bash title="Check RGW service status" theme={null}
ceph orch ps | grep rgw
```

```bash title="Check RGW logs" theme={null}
ceph log last 50 | grep rgw
```

**Resolution**:

```bash title="Restart RGW service" theme={null}
ceph orch restart rgw.<service-name>
```

If RGW remains down after the restart, check the data pool health:

```bash title="Check RGW data pool" theme={null}
ceph osd pool ls detail | grep rgw
ceph health detail | grep <pool-name>
```

***

## When to Contact Support

Contact [support@xloud.tech](mailto:support@xloud.tech) if:

* `HEALTH_ERR` persists after initial investigation
* Data appears inaccessible or corrupt after OSD recovery
* PG repair does not resolve the inconsistency
* The cluster is full and expansion cannot be provisioned immediately

When opening a support ticket, include the output of:

```bash title="Information to include with support ticket" theme={null}
ceph health detail
ceph status
ceph osd tree
ceph log last 200
```

***

## Next Steps

* Routine operational procedures: adding OSDs, maintenance mode
* Set up proactive alerts to catch issues before they become critical
* [Capacity Planning](/services/sds/admin-guide/capacity-planning): prevent capacity emergencies with proactive expansion planning
* [XSDS User Guide Troubleshooting](/services/sds/user-guide/troubleshooting): tenant-facing storage issues such as stuck volumes and snapshot failures
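
The four commands listed under "When to Contact Support" can be captured in one pass before opening a ticket. The following is a minimal sketch under stated assumptions, not an XSDS tool: the function name `collect_support_bundle`, the `CEPH_CMD` override, and the output file naming are all made up for illustration.

```shell
# Sketch only: CEPH_CMD and collect_support_bundle are illustrative names,
# not part of XSDS or Ceph.
CEPH_CMD="${CEPH_CMD:-ceph}"

collect_support_bundle() {
  # Default to a timestamped directory when no path is given.
  out_dir="${1:-xsds-support-$(date +%Y%m%d-%H%M%S)}"
  mkdir -p "$out_dir" || return 1

  # One entry per command from the support checklist.
  for args in "health detail" "status" "osd tree" "log last 200"; do
    # "health detail" -> "health_detail"
    name=$(printf '%s' "$args" | tr ' ' '_')
    # $args is left unquoted on purpose so it splits into separate arguments.
    # A failing command does not stop the loop; its error text is kept in the file.
    if ! $CEPH_CMD $args > "$out_dir/$name.txt" 2>&1; then
      echo "warning: '$CEPH_CMD $args' failed, see $out_dir/$name.txt" >&2
    fi
  done

  # Print the bundle location so callers can archive or attach it.
  printf '%s\n' "$out_dir"
}
```

On a management node, sourcing this snippet and running `collect_support_bundle /tmp/xsds-ticket` would write one text file per command into that directory, ready to attach to the ticket.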