
Overview

This guide covers platform-level object storage issues that require administrator access. For user-facing issues such as 403 access errors or upload timeouts, see the Object Storage Troubleshooting guide.
Administrator Access Required — This operation requires the admin role. Contact your Xloud administrator if you do not have sufficient permissions.

Diagnostic Checklist

Overall cluster health
xavs-storage-recon --all
Disk usage across all nodes
xavs-storage-recon --diskusage
Replication status
xavs-storage-recon --replication
Ring file consistency
xavs-storage-recon --md5
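The four checks above can be run in sequence with a small wrapper. The `xavs-storage-recon` flags are taken directly from the checklist; the `run_checks` helper and its `DRY_RUN` toggle are conveniences sketched here, not part of the tool:

```shell
#!/bin/sh
# run_checks: run the four diagnostic checks in order. With DRY_RUN=1 the
# commands are only printed, useful for review on a production cluster.
run_checks() {
  for flag in --all --diskusage --replication --md5; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      printf 'xavs-storage-recon %s\n' "$flag"
    else
      xavs-storage-recon "$flag" \
        || printf 'check %s reported errors\n' "$flag" >&2
    fi
  done
}

DRY_RUN=1 run_checks   # print the plan; drop DRY_RUN=1 to execute
```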

Platform Issues

507 Insufficient Storage on object PUT

Cause: One or more storage nodes targeted by the ring have insufficient free space to accept the write.
Diagnosis:
Check disk usage per node
xavs-storage-recon --diskusage
Resolution:
  • Identify nodes or drives above 85% utilization
  • Expand storage capacity by adding new drives (see Ring Management)
  • Alternatively, rebalance the ring to shift weight toward nodes with available space:
    Adjust weight on high-capacity node
    xavs-ring-builder object.builder set_weight <device-id> <reduced-weight>
    xavs-ring-builder object.builder rebalance
    xavs-ring-builder object.builder write_ring
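To spot drives above the 85% threshold mechanically, a small awk filter over standard `df -P` output works on any storage node. The `flag_full` helper and the threshold default are illustrative, not part of the platform tooling:

```shell
# flag_full: read `df -P`-style output on stdin and print mount points
# whose Use% (column 5 in POSIX `df -P` output) exceeds a threshold.
flag_full() {
  threshold="${1:-85}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5; sub(/%/, "", use)        # strip the trailing % sign
    if (use + 0 > t) print $6, use "%"
  }'
}

# Example (run on a storage node): df -P | flag_full 85
```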
    
Slow object reads and writes

Cause: Replication traffic competing with foreground I/O, or a storage node with degraded drives experiencing high read latency.
Diagnosis:
Check replication load
xavs-storage-recon --replication --verbose
If a specific node shows high replication_time, inspect that node’s disk I/O:
Check disk I/O on storage node (SSH required)
iostat -x 1 5
Resolution:
  • Consider throttling the replicator with --concurrency 1 during peak hours
  • If a specific drive is degraded, reduce its ring weight to shift load away
  • Replace drives showing high latency or recurring I/O errors in dmesg
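The iostat output above can be filtered to surface only saturated devices. The `busy_devices` helper is a sketch that assumes %util is the last column, which holds for recent sysstat releases; check the header line on your version:

```shell
# busy_devices: read `iostat -x` output on stdin and print devices whose
# %util exceeds a threshold. Assumes %util is the LAST column of each
# device line (true for recent sysstat versions).
busy_devices() {
  threshold="${1:-90}"
  awk -v t="$threshold" '/^[a-z]/ { if ($NF + 0 > t) print $1, $NF "%" }'
}

# Example (on the storage node): iostat -x 1 5 | busy_devices 90
```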

Ring file mismatch after rebalance

Cause: The updated ring file was not distributed to all nodes after a rebalance.
Diagnosis:
Check MD5 of ring files on all nodes
xavs-storage-recon --md5
Resolution: Nodes with mismatched MD5 hashes have stale ring files. Redistribute the ring to affected nodes:
Copy ring files to an affected node
scp /etc/xavs-object-storage/*.ring.gz <node-ip>:/etc/xavs-object-storage/
Restart the object-server and replicator on the affected node after distribution.
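Before and after redistribution, each node's ring files can be verified locally against the builder host using standard `md5sum -c`. The `stale_rings` helper, the `RING_DIR` override, and the reference-file workflow are illustrative:

```shell
# stale_rings: check ring files in RING_DIR against a reference checksum
# file generated on the builder host with:
#   md5sum *.ring.gz > rings.md5
# Prints a "FAILED" line for every stale file; prints nothing if all match.
stale_rings() {
  ref="$1"
  ( cd "${RING_DIR:-/etc/xavs-object-storage}" \
      && md5sum -c --quiet "$ref" 2>&1 ) || true
}
```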

Objects quarantined by the auditor

Cause: Drive failure, bit rot, or network errors during replication causing data corruption detected by the auditor.
Diagnosis:
Check quarantine counts
xavs-storage-recon --quarantined --verbose
Check drive health (SSH to node)
smartctl -a /dev/<device>
dmesg | grep -i "error\|fail\|ata"
Resolution:
  1. If the drive is failing, set its ring weight to 0 and rebalance to drain data
  2. Replace the physical drive
  3. Add the replacement drive to the ring and rebalance
  4. The replicator will restore the quarantined objects from healthy replicas
Do not simply delete quarantined objects — they may be the only remaining copy if other replicas are also corrupted. Always verify healthy replicas exist before any quarantine cleanup.
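Steps 1 and 3 reuse the ring-builder workflow shown earlier for the 507 case. A sketch that prints the drain commands for review before execution; the helper name and the device-id argument are placeholders, not part of the CLI:

```shell
# drain_device_plan: print the xavs-ring-builder commands that drain a
# failing device by setting its weight to 0, rebalancing, and writing the
# ring. Printing rather than executing lets the operator review first.
drain_device_plan() {
  dev="$1"
  printf 'xavs-ring-builder object.builder set_weight %s 0\n' "$dev"
  printf 'xavs-ring-builder object.builder rebalance\n'
  printf 'xavs-ring-builder object.builder write_ring\n'
}
```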

Proxy cannot reach storage node quorum

Cause: The proxy-server cannot reach a quorum of storage nodes for an operation.
Diagnosis:
Check proxy container status
docker ps --filter name=swift-proxy
docker logs swift-proxy --tail 50
Verify storage nodes are reachable from proxy
xavs-storage-recon --all | grep -i "error\|fail"
Resolution:
  • Verify storage node containers are running: docker ps --filter name=swift
  • Check network connectivity from proxy hosts to storage nodes on ports 6200, 6201, 6202
  • If nodes are degraded, the proxy will still serve reads from available replicas but writes require the configured replica quorum
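The replica quorum mentioned above is, in Swift-style replicated policies, a simple majority of the replica count; verify this assumption against your policy configuration:

```shell
# write_quorum: majority quorum for a replicated policy,
# computed as floor(replicas / 2) + 1.
write_quorum() {
  echo $(( $1 / 2 + 1 ))
}

write_quorum 3   # a 3-replica policy needs 2 successful writes
```

With three replicas a write therefore succeeds even with one storage node down, which is why degraded clusters often serve writes normally until a second node fails.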

Log Locations

Component          Log command
Proxy server       docker logs swift-proxy
Object server      docker logs swift-object
Container server   docker logs swift-container
Account server     docker logs swift-account
Replicator         docker logs swift-object-replicator

Next Steps

Object Storage Troubleshooting (User)

User-facing issues — 403 errors, upload timeouts, versioning failures

Ring Management

Add capacity and redistribute data after failures

Replication

Monitor and restore data durability

Monitoring

Proactively catch issues before they become outages