
Overview

Effective monitoring of the object storage cluster ensures early detection of capacity constraints, performance degradation, and data integrity issues. This guide covers the key metrics and commands for ongoing operational visibility.
Administrator Access Required: the operations in this guide require the admin role. Contact your Xloud administrator if you do not have sufficient permissions.

Cluster Capacity

Storage capacity across all nodes
xavs-storage-recon --diskusage --verbose
Capacity thresholds:
| Metric | Warning | Critical | Action |
| --- | --- | --- | --- |
| Node capacity used | 70% | 85% | Plan capacity expansion |
| Single drive capacity | 80% | 90% | Add drives or rebalance ring |
| Cluster-wide free space | < 20% | < 10% | Immediate expansion required |
When any storage node exceeds 85% capacity, the ring rebalancer may be unable to place new replicas, causing 507 Insufficient Storage errors for writes. Plan capacity expansion before reaching 70% utilization.
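As a minimal sketch, the thresholds in the table above could be encoded as a small classifier for use in a custom check script. The function and constant names are hypothetical and not part of any Xloud tooling. The sketch assumes the usage percentages have already been extracted from the disk usage report, and that reaching a "used" threshold exactly counts as crossing it.

```python
# Illustrative only: these names are hypothetical, not part of xavs-storage-recon.
# (warning, critical) thresholds in percent used, copied from the capacity table.
USED_THRESHOLDS = {
    "node": (70.0, 85.0),
    "drive": (80.0, 90.0),
}
# Cluster-wide free space alerts when free space drops BELOW these values.
CLUSTER_FREE_THRESHOLDS = (20.0, 10.0)


def classify_used(kind: str, used_percent: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a node or drive usage figure."""
    warning, critical = USED_THRESHOLDS[kind]
    if used_percent >= critical:  # assumption: reaching the threshold triggers it
        return "critical"
    if used_percent >= warning:
        return "warning"
    return "ok"


def classify_cluster_free(free_percent: float) -> str:
    """Return 'ok', 'warning', or 'critical' for cluster-wide free space."""
    warning, critical = CLUSTER_FREE_THRESHOLDS
    if free_percent < critical:
        return "critical"
    if free_percent < warning:
        return "warning"
    return "ok"
```

For example, `classify_used("node", 72.0)` falls in the warning band, which is the point at which the table recommends planning expansion.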

Proxy Metrics

The proxy-server exposes metrics on the recon middleware endpoint:
Check proxy load
curl -s http://<proxy-node-ip>:6000/recon/load
Check proxy memory
curl -s http://<proxy-node-ip>:6000/recon/mem
Check proxy async pending updates
curl -s http://<proxy-node-ip>:6000/recon/async
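The three curl checks above can also be polled from a script. The following is a rough sketch using only the Python standard library. The helper names are hypothetical, and it assumes each endpoint returns a JSON object, which should be confirmed against the actual responses in your deployment.

```python
import json
from urllib.request import urlopen

RECON_PORT = 6000  # port used in the curl commands above
RECON_CHECKS = ("load", "mem", "async")  # the three paths shown above


def recon_url(host: str, check: str, port: int = RECON_PORT) -> str:
    """Build the recon URL for one check on one node."""
    if check not in RECON_CHECKS:
        raise ValueError(f"unsupported recon check: {check!r}")
    return f"http://{host}:{port}/recon/{check}"


def fetch_recon(host: str, check: str, timeout: float = 5.0) -> dict:
    """GET one recon endpoint and decode the body (assumed to be JSON)."""
    with urlopen(recon_url(host, check), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def poll_node(host: str) -> dict:
    """Collect load, memory, and async-pending data for a single node."""
    return {check: fetch_recon(host, check) for check in RECON_CHECKS}
```

Calling `poll_node` with a proxy node address returns one dictionary per check, keyed by `load`, `mem`, and `async`. Restricting `check` to a fixed tuple keeps a typo from silently querying an unrelated path.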

Replication Health

Replication status across all nodes
xavs-storage-recon --replication
Check for quarantined (corrupted) objects
xavs-storage-recon --quarantined
Verify ring file consistency across nodes
xavs-storage-recon --md5
Replication health alerts:
| Condition | Severity | Response |
| --- | --- | --- |
| replication_time > 300s | Warning | Investigate slow nodes |
| replication_last > 600s | Critical | Check replicator daemon status |
| Quarantine count increasing | Critical | Check drive health, replace failed drives |
| MD5 mismatch | Critical | Redistribute ring files immediately |
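The alert conditions above can be sketched as a single evaluation function. The function and parameter names are hypothetical. The sketch assumes that "replication_last > 600s" means more than 600 seconds have elapsed since the last completed replication pass, and that "increasing" means the quarantine count is higher than in the previous sample.

```python
def evaluate_replication(
    replication_time_s: float,
    seconds_since_last_pass: float,
    quarantine_count: int,
    previous_quarantine_count: int,
    ring_md5_consistent: bool,
) -> list:
    """Return (severity, response) pairs for every alert condition that fires."""
    alerts = []
    if replication_time_s > 300:
        alerts.append(("warning", "Investigate slow nodes"))
    # Assumption: replication_last is interpreted as the age of the last pass.
    if seconds_since_last_pass > 600:
        alerts.append(("critical", "Check replicator daemon status"))
    # Assumption: "increasing" is judged against the previous sample only.
    if quarantine_count > previous_quarantine_count:
        alerts.append(("critical", "Check drive health, replace failed drives"))
    if not ring_md5_consistent:
        alerts.append(("critical", "Redistribute ring files immediately"))
    return alerts
```

A healthy sample returns an empty list. Comparing against a single previous sample is the simplest reading of "increasing"; a production check might prefer a rate over a longer window.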

Integration with XIMP

For continuous monitoring, connect the object storage recon endpoint to XIMP (Xloud Infrastructure Monitoring Platform):
Prometheus scrape config for object storage
scrape_configs:
  - job_name: 'xavs-object-storage-recon'
    static_configs:
      - targets: ['<proxy-node-1>:6000', '<proxy-node-2>:6000']
    metrics_path: '/recon/metrics'
Configure alerting rules in XIMP for the critical thresholds above. Route notifications to the on-call team so that 507 Insufficient Storage errors and quarantine count spikes receive a prompt response.

Next Steps

Replication: Deep-dive into replication health and quarantine management

Ring Management: Expand capacity by adding drives and rebalancing rings

Admin Troubleshooting: Respond to monitoring alerts and diagnose failures

Quotas: Set limits to prevent individual projects from consuming all capacity