Overview

This page covers cluster-level troubleshooting procedures for XSDS administrators. For tenant-facing storage issues (stuck volumes, snapshot failures), see the XSDS User Guide Troubleshooting page.
Administrator Access Required — This operation requires the admin role. Contact your Xloud administrator if you do not have sufficient permissions.
Prerequisites
  • Administrator credentials with the admin role
  • SSH access to a cluster management node

Diagnostic Reference

Before investigating specific issues, collect the full health report:
Full health report
ceph health detail
Cluster status overview
ceph status
Recent cluster log events
ceph log last 100
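The three diagnostic commands above can be run in one pass. A minimal POSIX-shell sketch, shown in dry-run form (the `health_report_plan` helper is illustrative, not an XSDS tool; it echoes each command under a banner rather than executing it — on a management node, change `echo "ceph $cmd"` to `ceph $cmd`):

```shell
# Print each diagnostic command with a banner. Dry-run sketch: the commands
# are echoed, not executed, so the plan can be reviewed first.
health_report_plan() {
  for cmd in "health detail" "status" "log last 100"; do
    echo "===== ceph $cmd ====="
    echo "ceph $cmd"
  done
}
health_report_plan
```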

Common Issues

Too Few Placement Groups

Cause: A pool has fewer placement groups (PGs) than recommended for its current data volume. Under-provisioned PGs cause I/O imbalance across OSDs.

Diagnosis:
Check PG autoscale status
ceph osd pool autoscale-status
Resolution: Enable PG auto-scaling to let the cluster manage PG counts automatically:
Enable autoscaler on a pool
ceph osd pool set <POOL_NAME> pg_autoscale_mode on
Enable pg_autoscale_mode on all pools as the default configuration. The autoscaler prevents both under- and over-provisioned PG counts as pool data volumes change.
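To apply the autoscaler across every pool at once, a loop over `ceph osd pool ls` works. A hedged sketch (the `enable_autoscaler` helper is illustrative; it echoes the commands so they can be reviewed before running — on a live cluster, pipe in `ceph osd pool ls` and drop the `echo`):

```shell
# Illustrative helper: read pool names on stdin and emit the autoscaler
# command for each one. Dry run: commands are echoed, not executed.
enable_autoscaler() {
  while read -r pool; do
    echo "ceph osd pool set $pool pg_autoscale_mode on"
  done
}

# Demo with two sample pool names; on a cluster:
#   ceph osd pool ls | enable_autoscaler
printf 'rbd\nrgw.data\n' | enable_autoscaler
```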
OSD Down After Disk or Node Failure

Cause: A physical disk or node has failed, causing one or more OSDs to go down.

Diagnosis:
Identify failed OSDs
ceph osd tree | grep down
Resolution:
  1. Identify the failed OSD and its host from ceph osd tree
  2. Mark the OSD out to begin data recovery on remaining OSDs:
    Mark OSD out
    ceph osd out <OSD_ID>
    
  3. Monitor recovery: watch ceph status
  4. After recovery completes (status shows HEALTH_OK), replace the failed hardware
  5. Redeploy the OSD through XDeploy on the replacement device
Do not remove an OSD before the cluster has finished recovering data. Removing an OSD during recovery on an already-degraded cluster risks data loss.
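Step 3 (monitoring recovery) can be scripted rather than watched by hand. A minimal sketch, assuming `ceph health` eventually reports HEALTH_OK; the health command is passed in as a parameter so the loop can be exercised without a cluster:

```shell
# Poll a health command until it reports HEALTH_OK, then announce completion.
# $1 is the command to run (left unquoted so multi-word commands work);
# on a real cluster use: wait_for_health_ok "ceph health"
wait_for_health_ok() {
  while :; do
    state=$($1)
    [ "$state" = "HEALTH_OK" ] && break
    sleep 30   # poll interval; tune to taste
  done
  echo "recovery complete: $state"
}

# Demo with a stub that reports healthy immediately.
healthy_stub() { echo HEALTH_OK; }
wait_for_health_ok healthy_stub
```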
Slow or Blocked Operations

Cause: Blocked OSD operations, high CPU load on OSD nodes, or network congestion on the storage network.

Diagnosis:
Check for blocked operations
ceph health detail | grep -i slow
View OSD performance
ceph osd perf
Resolution:
Match the root cause to its resolution:
  • HDD fragmentation: Schedule ceph osd scrub <OSD_ID> during a low-traffic period
  • CPU throttling on OSD nodes: Review CPU allocation in XDeploy
  • Rebalancing competing with client I/O: Throttle recovery I/O (see below)
  • Full disks on some OSDs: Add capacity or rebalance via CRUSH weight
Throttle recovery I/O
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph osd set-backfillfull-ratio 0.85
Inconsistent Placement Groups

Cause: Data inconsistency detected between OSD replicas during scrubbing. This may indicate a hardware error (bad disk sector, bit flip).

Diagnosis:
Find inconsistent PGs
ceph health detail | grep inconsistent
Get PG details
ceph pg <PG_ID> query
Resolution:
Repair inconsistent PG
ceph pg repair <PG_ID>
PG repair selects the primary OSD's copy as authoritative and overwrites inconsistent replicas. If the primary copy is itself corrupt, repair can propagate the corruption. Verify application-level data integrity after repair.
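When several PGs report inconsistent, the repair can be looped. A hedged sketch (the `repair_inconsistent` helper is illustrative; it parses PG IDs out of `ceph health detail`-style lines and echoes the repair commands for review rather than executing them):

```shell
# Extract PG IDs (pool number, dot, hex suffix) from health-detail-style
# lines on stdin and emit one `ceph pg repair` per unique PG. Dry run:
# commands are echoed, not executed.
repair_inconsistent() {
  grep -o 'pg [0-9]*\.[0-9a-f]*' | awk '{ print $2 }' | sort -u |
  while read -r pg; do
    echo "ceph pg repair $pg"
  done
}

# Demo on two sample lines; on a cluster:
#   ceph health detail | repair_inconsistent
printf 'pg 3.1a is active+clean+inconsistent\npg 3.2b is active+clean+inconsistent\n' | repair_inconsistent
```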
Cluster Full or Near Full

Cause: Pool or cluster capacity has reached a critical threshold, causing the cluster to throttle or reject writes.

Diagnosis:
Check cluster capacity
ceph df
Identify full OSDs (sorted by the %USE column; the -k9 field number may differ across Ceph versions)
ceph osd df | sort -k9 -rn | head -10
Resolution:
  • Immediate: Raise the full_ratio temporarily to allow the cluster to accept writes while expansion is underway:
    Temporarily raise full ratio (emergency only)
    ceph osd set-full-ratio 0.97
    
  • Short-term: Delete unnecessary data, snapshots, or expired objects
  • Long-term: Add new OSD nodes through XDeploy (see Capacity Planning)
Operating above 90% utilization with the full_ratio raised is an emergency measure only. Data loss can occur if any additional OSD failures happen while the cluster is in this state.
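The full-OSD check above reduces to a threshold filter. A minimal sketch on illustrative sample data (the OSD names and percentages below are made up; in practice you would feed it the name and %USE columns extracted from `ceph osd df`):

```shell
# Flag OSDs whose utilization exceeds a threshold. Sample input is two
# columns: OSD name and percent used; real input would come from `ceph osd df`.
threshold=85
sample='osd.0 91.2
osd.1 72.5
osd.2 88.9'
echo "$sample" | awk -v t="$threshold" '$2+0 > t+0 { print $1 " " $2 "% over threshold" }'
```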
Object Gateway (RGW) Not Responding

Cause: The RGW service is down, misconfigured, or the backing data pool is unavailable.

Diagnosis:
Check RGW service status
ceph orch ps | grep rgw
Check RGW logs
ceph log last 50 | grep rgw
Resolution:
Restart RGW service
ceph orch restart rgw.<SERVICE_ID>
If RGW remains down after restart, check the data pool health:
Check RGW data pool
ceph osd pool ls detail | grep rgw
ceph health detail | grep <RGW_POOL>

When to Contact Support

Contact support@xloud.tech if:
  • HEALTH_ERR persists after initial investigation
  • Data appears inaccessible or corrupt after OSD recovery
  • PG repair does not resolve the inconsistency
  • Cluster is full and expansion cannot be provisioned immediately
When opening a support ticket, include:
Information to include with support ticket
ceph health detail
ceph status
ceph osd tree
ceph log last 200
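The four outputs above can be bundled into a single archive for attachment to the ticket. A minimal sketch, using a `run` indirection so the script can be exercised without a cluster (on a management node, replace the stub body with the real ceph CLI as noted in the comment):

```shell
# Collect the four standard diagnostics into a directory and tar it up.
# `run` is a stub here for illustration; on a cluster replace its body
# with:  ceph "$@"
run() { echo "stub output for: ceph $*"; }

dir="xsds-ticket-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$dir"
run health detail > "$dir/health_detail.txt"
run status        > "$dir/status.txt"
run osd tree      > "$dir/osd_tree.txt"
run log last 200  > "$dir/log_last_200.txt"
tar czf "$dir.tar.gz" "$dir"
echo "created $dir.tar.gz"
```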

Next Steps

Cluster Management

Routine operational procedures — adding OSDs, maintenance mode

Monitoring

Set up proactive alerts to catch issues before they become critical

Capacity Planning

Prevent capacity emergencies with proactive expansion planning

User Guide Troubleshooting

Tenant-facing storage issues — stuck volumes, snapshot failures