Overview

This page covers cluster-level troubleshooting procedures for XSDS administrators. For tenant-facing storage issues (stuck volumes, snapshot failures), see the XSDS User Guide Troubleshooting page.
Administrator Access Required — This operation requires the admin role. Contact your Xloud administrator if you do not have sufficient permissions.
Prerequisites
  • Administrator credentials with the admin role
  • SSH access to a cluster management node

Diagnostic Reference

Before investigating specific issues, collect the full health report:
Full health report
ceph health detail
Cluster status overview
ceph status
Recent cluster log events
ceph log last 100
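The three diagnostic commands above can be run in one pass. A minimal POSIX-shell sketch, shown in dry-run form (the `health_report_plan` helper is illustrative, not an XSDS tool; it echoes each command under a banner rather than executing it — on a management node, change `echo "ceph $cmd"` to `ceph $cmd`):

```shell
# Print each diagnostic command with a banner. Dry-run sketch: the commands
# are echoed, not executed, so the plan can be reviewed first.
health_report_plan() {
  for cmd in "health detail" "status" "log last 100"; do
    echo "===== ceph $cmd ====="
    echo "ceph $cmd"
  done
}
health_report_plan
```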

Common Issues

Too Few Placement Groups

Cause: A pool has fewer placement groups (PGs) than recommended for its current data volume. Under-provisioned PGs cause I/O imbalance across OSDs.

Diagnosis:
Check PG autoscale status
ceph osd pool autoscale-status
Resolution: Enable PG auto-scaling to let the cluster manage PG counts automatically:
Enable autoscaler on a pool
ceph osd pool set <POOL_NAME> pg_autoscale_mode on
Enable pg_autoscale_mode on all pools as the default configuration. The autoscaler prevents both under- and over-provisioned PG counts as pool data volumes change.
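To apply the autoscaler across every pool at once, a loop over `ceph osd pool ls` works. A hedged sketch (the `enable_autoscaler` helper is illustrative; it echoes the commands so they can be reviewed before running — on a live cluster, pipe in `ceph osd pool ls` and drop the `echo`):

```shell
# Illustrative helper: read pool names on stdin and emit the autoscaler
# command for each one. Dry run: commands are echoed, not executed.
enable_autoscaler() {
  while read -r pool; do
    echo "ceph osd pool set $pool pg_autoscale_mode on"
  done
}

# Demo with two sample pool names; on a cluster:
#   ceph osd pool ls | enable_autoscaler
printf 'rbd\nrgw.data\n' | enable_autoscaler
```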
OSD Down After Disk or Node Failure

Cause: A physical disk or node has failed, causing one or more OSDs to go down.

Diagnosis:
Identify failed OSDs
ceph osd tree | grep down
Resolution:
  1. Identify the failed OSD and its host from ceph osd tree
  2. Mark the OSD out to begin data recovery on remaining OSDs:
    Mark OSD out
    ceph osd out <OSD_ID>
    
  3. Monitor recovery: watch ceph status
  4. After recovery completes (status shows HEALTH_OK), replace the failed hardware
  5. Redeploy the OSD through XDeploy on the replacement device
Do not remove an OSD before the cluster has finished recovering data. Removing an OSD during recovery on an already-degraded cluster risks data loss.
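Step 3 (monitoring recovery) can be scripted rather than watched by hand. A minimal sketch, assuming `ceph health` eventually reports HEALTH_OK; the health command is passed in as a parameter so the loop can be exercised without a cluster:

```shell
# Poll a health command until it reports HEALTH_OK, then announce completion.
# $1 is the command to run (left unquoted so multi-word commands work);
# on a real cluster use: wait_for_health_ok "ceph health"
wait_for_health_ok() {
  while :; do
    state=$($1)
    [ "$state" = "HEALTH_OK" ] && break
    sleep 30   # poll interval; tune to taste
  done
  echo "recovery complete: $state"
}

# Demo with a stub that reports healthy immediately.
healthy_stub() { echo HEALTH_OK; }
wait_for_health_ok healthy_stub
```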
Slow or Blocked Operations

Cause: Blocked OSD operations, high CPU load on OSD nodes, or network congestion on the storage network.

Diagnosis:
Check for blocked operations
ceph health detail | grep -i slow
View OSD performance
ceph osd perf
Resolution:
Match the root cause to its resolution:
  • HDD fragmentation: Schedule ceph osd scrub <OSD_ID> during a low-traffic period
  • CPU throttling on OSD nodes: Review CPU allocation in XDeploy
  • Rebalancing competing with client I/O: Throttle recovery I/O (see below)
  • Full disks on some OSDs: Add capacity or rebalance via CRUSH weight
Throttle recovery I/O
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph osd set-backfillfull-ratio 0.85
Inconsistent Placement Groups

Cause: Data inconsistency detected between OSD replicas during scrubbing. This may indicate a hardware error (bad disk sector, bit flip).

Diagnosis:
Find inconsistent PGs
ceph health detail | grep inconsistent
Get PG details
ceph pg <PG_ID> query
Resolution:
Repair inconsistent PG
ceph pg repair <PG_ID>
PG repair selects the primary OSD's copy as authoritative and overwrites inconsistent replicas. If the primary copy is itself corrupt, repair can propagate the corruption. Verify application-level data integrity after repair.
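When several PGs report inconsistent, the repair can be looped. A hedged sketch (the `repair_inconsistent` helper is illustrative; it parses PG IDs out of `ceph health detail`-style lines and echoes the repair commands for review rather than executing them):

```shell
# Extract PG IDs (pool number, dot, hex suffix) from health-detail-style
# lines on stdin and emit one `ceph pg repair` per unique PG. Dry run:
# commands are echoed, not executed.
repair_inconsistent() {
  grep -o 'pg [0-9]*\.[0-9a-f]*' | awk '{ print $2 }' | sort -u |
  while read -r pg; do
    echo "ceph pg repair $pg"
  done
}

# Demo on two sample lines; on a cluster:
#   ceph health detail | repair_inconsistent
printf 'pg 3.1a is active+clean+inconsistent\npg 3.2b is active+clean+inconsistent\n' | repair_inconsistent
```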
Cluster Full or Near Full

Cause: Pool or cluster capacity has reached a critical threshold, causing the cluster to throttle or reject writes.

Diagnosis:
Check cluster capacity
ceph df
Identify full OSDs (sorted by the %USE column; the -k9 field number may differ across Ceph versions)
ceph osd df | sort -k9 -rn | head -10
Resolution:
  • Immediate: Raise the full_ratio temporarily to allow the cluster to accept writes while expansion is underway:
    Temporarily raise full ratio (emergency only)
    ceph osd set-full-ratio 0.97
    
  • Short-term: Delete unnecessary data, snapshots, or expired objects
  • Long-term: Add new OSD nodes through XDeploy (see Capacity Planning)
Operating above 90% utilization with the full_ratio raised is an emergency measure only. Data loss can occur if any additional OSD failures happen while the cluster is in this state.
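The full-OSD check above reduces to a threshold filter. A minimal sketch on illustrative sample data (the OSD names and percentages below are made up; in practice you would feed it the name and %USE columns extracted from `ceph osd df`):

```shell
# Flag OSDs whose utilization exceeds a threshold. Sample input is two
# columns: OSD name and percent used; real input would come from `ceph osd df`.
threshold=85
sample='osd.0 91.2
osd.1 72.5
osd.2 88.9'
echo "$sample" | awk -v t="$threshold" '$2+0 > t+0 { print $1 " " $2 "% over threshold" }'
```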
Object Gateway (RGW) Not Responding

Cause: The RGW service is down, misconfigured, or the backing data pool is unavailable.

Diagnosis:
Check RGW service status
ceph orch ps | grep rgw
Check RGW logs
ceph log last 50 | grep rgw
Resolution:
Restart RGW service
ceph orch restart rgw.<SERVICE_ID>
If RGW remains down after restart, check the data pool health:
Check RGW data pool
ceph osd pool ls detail | grep rgw
ceph health detail | grep <RGW_POOL>

When to Contact Support

Contact support@xloud.tech if:
  • HEALTH_ERR persists after initial investigation
  • Data appears inaccessible or corrupt after OSD recovery
  • PG repair does not resolve the inconsistency
  • Cluster is full and expansion cannot be provisioned immediately
When opening a support ticket, include:
Information to include with support ticket
ceph health detail
ceph status
ceph osd tree
ceph log last 200
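The four outputs above can be bundled into a single archive for attachment to the ticket. A minimal sketch, using a `run` indirection so the script can be exercised without a cluster (on a management node, replace the stub body with the real ceph CLI as noted in the comment):

```shell
# Collect the four standard diagnostics into a directory and tar it up.
# `run` is a stub here for illustration; on a cluster replace its body
# with:  ceph "$@"
run() { echo "stub output for: ceph $*"; }

dir="xsds-ticket-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$dir"
run health detail > "$dir/health_detail.txt"
run status        > "$dir/status.txt"
run osd tree      > "$dir/osd_tree.txt"
run log last 200  > "$dir/log_last_200.txt"
tar czf "$dir.tar.gz" "$dir"
echo "created $dir.tar.gz"
```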

Next Steps

Cluster Management

Routine operational procedures — adding OSDs, maintenance mode

Monitoring

Set up proactive alerts to catch issues before they become critical

Capacity Planning

Prevent capacity emergencies with proactive expansion planning

User Guide Troubleshooting

Tenant-facing storage issues — stuck volumes, snapshot failures