## Overview

This page covers cluster-level troubleshooting procedures for XSDS administrators. For tenant-facing storage issues (stuck volumes, snapshot failures), see the XSDS User Guide Troubleshooting page.

## Prerequisites
- Administrator credentials with the `admin` role
- SSH access to a cluster management node
## Diagnostic Reference
Before investigating specific issues, collect a full health report, a cluster status overview, and the recent cluster log events.
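These three checks map to standard Ceph CLI commands; a sketch, assuming XSDS exposes the stock `ceph` client (the line count passed to `ceph log last` is illustrative):

```shell
# Full health report, with detail for each active health check
ceph health detail

# Cluster status overview: monitors, OSDs, PG states, client I/O
ceph status

# Recent cluster log events (last 50 entries)
ceph log last 50
```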
## Common Issues
### HEALTH_WARN: too few PGs

**Cause:** A pool has fewer placement groups than recommended for its current data volume. Under-provisioned PGs cause I/O imbalance across OSDs.

**Diagnosis:** Check the PG autoscale status to see the current and recommended PG counts for each pool.

**Resolution:** Enable the PG autoscaler on the affected pool to let the cluster manage PG counts automatically.
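A minimal sketch of both steps with the stock Ceph CLI (the pool name `tenant-data` is an illustrative placeholder):

```shell
# Show current vs. recommended PG counts for every pool
ceph osd pool autoscale-status

# Enable the PG autoscaler on the under-provisioned pool
ceph osd pool set tenant-data pg_autoscale_mode on
```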
### OSD down after hardware failure

**Cause:** A physical disk or node has failed, causing one or more OSDs to go down.

**Diagnosis:** Identify the failed OSDs and the hosts they run on.

**Resolution:**

1. Identify the failed OSD and its host from `ceph osd tree`
2. Mark the OSD `out` to begin data recovery on the remaining OSDs
3. Monitor recovery: `watch ceph status`
4. After recovery completes (status shows `HEALTH_OK`), replace the failed hardware
5. Redeploy the OSD through XDeploy on the replacement device
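Steps 1-3 can be sketched with the stock Ceph CLI (`osd.12` is an illustrative OSD ID):

```shell
# Step 1: locate failed OSDs and their hosts ("down" entries)
ceph osd tree

# Step 2: mark the failed OSD out so its data re-replicates
ceph osd out osd.12

# Step 3: watch recovery progress until the cluster reports HEALTH_OK
watch ceph status
```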
### Slow requests / high latency

**Cause:** Blocked OSD operations, high CPU load on OSD nodes, or network congestion on the storage network.

**Diagnosis:** Check for blocked operations, then review per-OSD performance to find slow OSDs.

**Resolution:**
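Both diagnostic checks map to stock Ceph CLI commands:

```shell
# Blocked or slow operations appear as SLOW_OPS entries here
ceph health detail

# Per-OSD commit/apply latency; outliers point at slow disks or hosts
ceph osd perf
```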
| Root Cause | Resolution |
|---|---|
| HDD fragmentation | Schedule `ceph osd scrub <OSD_ID>` during a low-traffic period |
| CPU throttling on OSD nodes | Review CPU allocation in XDeploy |
| Rebalancing competing with client I/O | Throttle recovery I/O (see below) |
| Full disks on some OSDs | Add capacity or rebalance via CRUSH weight |
Recovery and backfill traffic can be throttled so that rebalancing stops competing with client I/O.
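A sketch of the usual throttling knobs, assuming a recent Ceph release with centralized configuration (the values shown are illustrative; tune them for your workload):

```shell
# Reduce concurrent backfill operations per OSD
ceph config set osd osd_max_backfills 1

# Reduce concurrent active recovery operations per OSD
ceph config set osd osd_recovery_max_active 1

# Remove the overrides once rebalancing finishes
ceph config rm osd osd_max_backfills
ceph config rm osd osd_recovery_max_active
```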
### PG inconsistent

**Cause:** Data inconsistency detected between OSD replicas during scrubbing. This may indicate a hardware error (bad disk sector, bit flip).

**Diagnosis:** Find the inconsistent PGs, then inspect the affected objects for details.

**Resolution:** Repair the inconsistent PG.
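A sketch with the stock Ceph tooling (`<PG_ID>` is a placeholder such as `4.2f`; the pool name is illustrative):

```shell
# Find PGs flagged inconsistent
ceph health detail

# List inconsistent PGs in a pool, then inspect the affected objects
rados list-inconsistent-pg tenant-data
rados list-inconsistent-obj <PG_ID> --format=json-pretty

# Trigger a repair of the inconsistent PG
ceph pg repair <PG_ID>
```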
### Pool near full / cluster full

**Cause:** Pool or cluster capacity has reached a critical threshold, causing the cluster to throttle or reject writes.

**Diagnosis:** Check overall cluster capacity and identify which OSDs are near or at full.
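Both capacity checks map to stock Ceph CLI commands:

```shell
# Overall and per-pool capacity usage
ceph df

# Per-OSD utilization; look for OSDs at or near the full ratio
ceph osd df
```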
**Resolution:**

- **Immediate:** Temporarily raise the `full_ratio` to allow the cluster to accept writes while expansion is underway (emergency only)
- **Short-term:** Delete unnecessary data, snapshots, or expired objects
- **Long-term:** Add new OSD nodes through XDeploy (see Capacity Planning)
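A sketch of the emergency `full_ratio` change (0.97 is illustrative; the stock Ceph default is 0.95):

```shell
# EMERGENCY ONLY: raise the full ratio so writes resume during expansion
ceph osd set-full-ratio 0.97

# Restore the default once new capacity is in service
ceph osd set-full-ratio 0.95
```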
**Warning:** Operating above 90% utilization with the `full_ratio` raised is an emergency measure only. Data loss can occur if any additional OSD failures happen while the cluster is in this state.

### Object gateway (S3) returning errors
**Cause:** The RGW service is down, misconfigured, or the backing data pool is unavailable.

**Diagnosis:** Check the RGW service status and its logs.

**Resolution:** Restart the RGW service. If RGW remains down after the restart, check the health of its data pool.
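How these checks look depends on how XDeploy runs RGW; the sketch below assumes a cephadm-style orchestrator and standard RGW naming, so the daemon, unit, service, and pool names are all illustrative:

```shell
# Check RGW daemon status
ceph orch ps --daemon-type rgw

# Inspect recent RGW logs on the gateway host
journalctl -u ceph-radosgw@rgw.gateway1 -n 100

# Restart the RGW service
ceph orch restart rgw.default

# If RGW stays down, check the backing data pool
ceph osd pool stats default.rgw.buckets.data
```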
## When to Contact Support

Contact support@xloud.tech if:

- `HEALTH_ERR` persists after initial investigation
- Data appears inaccessible or corrupt after OSD recovery
- PG repair does not resolve the inconsistency
- The cluster is full and expansion cannot be provisioned immediately
Include diagnostic output from the cluster with the support ticket.
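A sketch of a capture that covers the basics (the filenames are illustrative):

```shell
# Snapshot of cluster state to attach to the ticket
ceph status > cluster-status.txt
ceph health detail > health-detail.txt
ceph report > ceph-report.json
ceph log last 200 > cluster-log.txt
```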
## Next Steps

- **Cluster Management**: routine operational procedures, such as adding OSDs and maintenance mode
- **Monitoring**: set up proactive alerts to catch issues before they become critical
- **Capacity Planning**: prevent capacity emergencies with proactive expansion planning
- **User Guide Troubleshooting**: tenant-facing storage issues, such as stuck volumes and snapshot failures