Overview
Effective monitoring of the object storage cluster ensures early detection of capacity constraints, performance degradation, and data integrity issues. This guide covers the key metrics and commands for ongoing operational visibility.
Cluster Capacity
Storage capacity across all nodes
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Node capacity used | 70% | 85% | Plan capacity expansion |
| Single drive capacity | 80% | 90% | Add drives or rebalance ring |
| Cluster-wide free space | < 20% | < 10% | Immediate expansion required |
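The thresholds in the table above can be encoded as a small classifier. The following is a minimal sketch, not part of the product: the metric keys (`node_used_pct`, `drive_used_pct`) and function names are hypothetical, and it assumes that a usage value exactly at a threshold already counts as breaching it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float   # percent
    critical: float  # percent
    action: str


# Values copied from the capacity table; the keys are illustrative names only.
USED_THRESHOLDS = {
    "node_used_pct": Threshold(70.0, 85.0, "Plan capacity expansion"),
    "drive_used_pct": Threshold(80.0, 90.0, "Add drives or rebalance ring"),
}
FREE_THRESHOLD = Threshold(20.0, 10.0, "Immediate expansion required")


def classify_used(metric: str, used_pct: float) -> str:
    """Classify a 'percent used' metric (higher is worse).

    Assumption: a value equal to the threshold is treated as a breach.
    """
    t = USED_THRESHOLDS[metric]
    if used_pct >= t.critical:
        return "critical"
    if used_pct >= t.warning:
        return "warning"
    return "ok"


def classify_cluster_free(free_pct: float) -> str:
    """Classify cluster-wide free space (lower is worse, strict '<' as in the table)."""
    if free_pct < FREE_THRESHOLD.critical:
        return "critical"
    if free_pct < FREE_THRESHOLD.warning:
        return "warning"
    return "ok"
```

For example, `classify_used("node_used_pct", 72.0)` falls in the warning band, while `classify_cluster_free(25.0)` is still healthy.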
Proxy Metrics
The proxy-server exposes metrics on the recon middleware endpoint:
- Proxy load
- Proxy memory
- Proxy async pending updates
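The three checks above can be polled over HTTP. The sketch below assumes the recon middleware follows the OpenStack Swift convention of serving JSON under `/recon/load`, `/recon/mem`, and `/recon/async`; confirm the paths against your deployment. The host name and port in the usage note are placeholders, and `recon_url` and `fetch_recon` are hypothetical helper names.

```python
import json
from urllib.request import urlopen

# Assumed recon paths (OpenStack Swift recon middleware convention).
RECON_PATHS = {
    "load": "/recon/load",
    "memory": "/recon/mem",
    "async_pending": "/recon/async",
}


def recon_url(host: str, port: int, check: str) -> str:
    """Build the recon URL for one of the checks listed in RECON_PATHS."""
    return f"http://{host}:{port}{RECON_PATHS[check]}"


def fetch_recon(host: str, port: int, check: str, timeout: float = 5.0) -> dict:
    """Fetch one recon check and decode its JSON body.

    Raises urllib.error.URLError if the endpoint is unreachable.
    """
    with urlopen(recon_url(host, port, check), timeout=timeout) as resp:
        return json.load(resp)
```

A call such as `fetch_recon("proxy-01.example", 8080, "load")` would return the decoded JSON document for the load check; the exact keys in the response depend on the recon version in use.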
Replication Health
- Check replication status across all nodes
- Check for quarantined (corrupted) objects
- Verify ring file consistency across nodes
| Condition | Severity | Response |
|---|---|---|
| replication_time > 300s | Warning | Investigate slow nodes |
| replication_last > 600s | Critical | Check replicator daemon status |
| Quarantine count increasing | Critical | Check drive health, replace failed drives |
| Ring file MD5 mismatch between nodes | Critical | Redistribute ring files immediately |
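The alert conditions in the table can be evaluated together once the values have been collected. This is a minimal sketch under stated assumptions: `replication_time_s` is the duration of the last pass in seconds, `replication_last_epoch` is the Unix timestamp of the last completed pass (so the 600s rule is read as "more than 600 seconds ago"), and `ring_md5_by_node` maps each node name to the MD5 of its ring file. The function and parameter names are hypothetical.

```python
import time
from typing import Dict, List, Optional, Tuple


def replication_alerts(
    replication_time_s: float,
    replication_last_epoch: float,
    quarantined_now: int,
    quarantined_before: int,
    ring_md5_by_node: Dict[str, str],
    now: Optional[float] = None,
) -> List[Tuple[str, str]]:
    """Return (severity, response) pairs for every condition that fires."""
    now = time.time() if now is None else now
    alerts: List[Tuple[str, str]] = []

    # Last replication pass took longer than 300 seconds.
    if replication_time_s > 300:
        alerts.append(("warning", "Investigate slow nodes"))

    # Assumption: the 600s rule refers to the age of the last completed pass.
    if now - replication_last_epoch > 600:
        alerts.append(("critical", "Check replicator daemon status"))

    # Quarantine count grew since the previous sample.
    if quarantined_now > quarantined_before:
        alerts.append(("critical", "Check drive health, replace failed drives"))

    # More than one distinct ring checksum means the nodes disagree.
    if len(set(ring_md5_by_node.values())) > 1:
        alerts.append(("critical", "Redistribute ring files immediately"))

    return alerts
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to replay recorded samples when tuning the thresholds.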
Integration with XIMP
For continuous monitoring, connect the object storage recon endpoint to XIMP (Xloud Infrastructure Monitoring Platform):
Prometheus scrape config for object storage
Next Steps
- Replication: Deep dive into replication health and quarantine management
- Ring Management: Expand capacity by adding drives and rebalancing rings
- Admin Troubleshooting: Respond to monitoring alerts and diagnose failures
- Quotas: Set limits to prevent individual projects from consuming all capacity