Object Storage Monitoring

Overview

Effective monitoring of the object storage cluster ensures early detection of capacity constraints, performance degradation, and data integrity issues. This guide covers the key metrics and commands for ongoing operational visibility.

Administrator Access Required: The commands in this guide require the admin role. Contact your Xloud administrator if you do not have sufficient permissions.

Cluster Capacity

Storage capacity across all nodes

xavs-storage-recon --diskusage --verbose

Capacity thresholds:

Metric	Warning	Critical	Action
Node capacity used	70%	85%	Plan capacity expansion
Single drive capacity	80%	90%	Add drives or rebalance ring
Cluster-wide free space	< 20%	< 10%	Immediate expansion required

When any storage node exceeds 85% capacity, the ring rebalancer may be unable to place new replicas, causing 507 Insufficient Storage errors for writes. Plan capacity expansion before reaching 70% utilization.
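The thresholds in the table can be encoded as a small classifier for use in a custom check. This is a minimal sketch, not part of any Xloud tooling: the function names are hypothetical, and it assumes that a value exactly at a threshold already counts as breaching it.

```python
# Thresholds copied from the capacity table above (percent of capacity used).
NODE_WARNING, NODE_CRITICAL = 70.0, 85.0
DRIVE_WARNING, DRIVE_CRITICAL = 80.0, 90.0


def classify_usage(percent_used: float, warning: float, critical: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a utilization percentage.

    Assumption: reaching a threshold exactly is treated as a breach.
    """
    if percent_used >= critical:
        return "critical"
    if percent_used >= warning:
        return "warning"
    return "ok"


def classify_free_space(percent_free: float) -> str:
    """Cluster-wide free space: warning below 20%, critical below 10%."""
    if percent_free < 10.0:
        return "critical"
    if percent_free < 20.0:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(classify_usage(72.0, NODE_WARNING, NODE_CRITICAL))
    print(classify_usage(91.0, DRIVE_WARNING, DRIVE_CRITICAL))
    print(classify_free_space(15.0))
```

The percentages would have to be extracted from the disk usage output first; that parsing step is left out because the output format is not shown in this guide.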

Disk usage summary per node

xavs-storage-recon --diskusage

The output shows each node's total capacity, used bytes, and percentage utilized. Look for outliers: a node significantly above the cluster average indicates uneven data distribution, which may require ring weight adjustments.
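One way to make "significantly above the cluster average" concrete is to flag nodes whose utilization exceeds the mean by a fixed margin. The sketch below assumes the per-node percentages have already been collected into a dictionary; the function name, the node names, and the 10-point margin are illustrative choices, not values from this guide.

```python
from statistics import mean


def find_outliers(usage_by_node: dict[str, float], margin: float = 10.0) -> list[str]:
    """Return nodes whose percent-used exceeds the cluster mean by more than `margin` points."""
    if not usage_by_node:
        return []
    average = mean(usage_by_node.values())
    return sorted(
        node for node, used in usage_by_node.items() if used - average > margin
    )


if __name__ == "__main__":
    # Hypothetical sample data: percent of capacity used per node.
    sample = {"node-a": 52.0, "node-b": 55.0, "node-c": 78.0, "node-d": 51.0}
    print(find_outliers(sample))  # → ['node-c'] (mean is 59.0, node-c is 19 points above)
```

A fixed margin is easy to reason about on small clusters; on larger ones a margin based on standard deviation may suit better.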

Proxy Metrics

The proxy-server exposes metrics on the recon middleware endpoint:

Check proxy load

curl -s http://<proxy-node-ip>:6000/recon/load

Check proxy memory

curl -s http://<proxy-node-ip>:6000/recon/mem

Check proxy async pending updates

curl -s http://<proxy-node-ip>:6000/recon/async
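The three curl calls above can be wrapped in a short script for periodic polling. This is a sketch under stated assumptions: it presumes the recon endpoints return JSON bodies (not confirmed in this guide), and the helper names `recon_url` and `fetch_recon` are hypothetical.

```python
import json
from urllib.request import urlopen

# The three recon paths queried by the curl commands above.
RECON_PATHS = ("load", "mem", "async")


def recon_url(host: str, path: str, port: int = 6000) -> str:
    """Build a recon URL such as http://10.0.0.5:6000/recon/load."""
    return f"http://{host}:{port}/recon/{path}"


def fetch_recon(host: str, timeout: float = 5.0) -> dict:
    """Fetch every recon path from one node.

    Assumption: each response body is JSON. Network and decode errors
    are left to propagate so the caller can decide how to alert.
    """
    results = {}
    for path in RECON_PATHS:
        with urlopen(recon_url(host, path), timeout=timeout) as response:
            results[path] = json.loads(response.read().decode("utf-8"))
    return results


if __name__ == "__main__":
    # Replace the placeholder address with a real proxy node before running.
    print(recon_url("10.0.0.5", "load"))
```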

Monitor these proxy-level metrics:

Metric	Description	Alert Threshold
Request rate	Requests per second per proxy node	Baseline + 3× standard deviation
Error rate	4xx and 5xx responses as % of total	5xx > 5% of total requests
GET latency	p95 response time for object reads	> 500 ms
PUT latency	p95 response time for object writes	> 1000 ms
Async pending	Container/account updates queued	> 1000 pending
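The request-rate, error-rate, and latency thresholds in the table are simple statistics over collected samples. The sketch below shows one way to compute them with the Python standard library; the function names are hypothetical, and it assumes at least two samples are available for each calculation.

```python
from statistics import mean, quantiles, stdev


def request_rate_threshold(baseline_samples: list[float]) -> float:
    """Alert threshold: baseline mean plus three sample standard deviations."""
    return mean(baseline_samples) + 3 * stdev(baseline_samples)


def p95(latencies_ms: list[float]) -> float:
    """95th percentile latency.

    quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    """
    return quantiles(latencies_ms, n=100)[94]


def error_rate_percent(count_5xx: int, total_requests: int) -> float:
    """5xx responses as a percentage of all requests (0.0 when idle)."""
    if total_requests == 0:
        return 0.0
    return 100.0 * count_5xx / total_requests


def get_latency_alert(latencies_ms: list[float]) -> bool:
    """True when p95 GET latency exceeds the 500 ms threshold from the table."""
    return p95(latencies_ms) > 500.0
```

The baseline for the request-rate threshold should come from a window of normal traffic; recomputing it from data that already contains the spike would raise the threshold and hide the anomaly.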

Replication Health

Replication status across all nodes

xavs-storage-recon --replication

Check for quarantined (corrupted) objects

xavs-storage-recon --quarantined

Verify ring file consistency across nodes

xavs-storage-recon --md5

Replication health alerts:

Condition	Severity	Response
`replication_time` > 300s	Warning	Investigate slow nodes
`replication_last` > 600s	Critical	Check replicator daemon status
Quarantine count increasing	Critical	Check drive health, replace failed drives
MD5 mismatch	Critical	Redistribute ring files immediately
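The alert table maps directly onto a rule function. This is an illustrative sketch only: it assumes `replication_last` has already been converted to the number of seconds since the last completed replication pass, and the function and parameter names are hypothetical.

```python
def replication_alerts(
    replication_time_s: float,
    seconds_since_last: float,
    quarantined_now: int,
    quarantined_before: int,
    md5_consistent: bool,
) -> list[tuple[str, str]]:
    """Return (severity, response) pairs for every condition in the alert table that fires."""
    alerts = []
    if replication_time_s > 300:
        alerts.append(("warning", "Investigate slow nodes"))
    # Assumption: seconds_since_last is the age of the last completed pass.
    if seconds_since_last > 600:
        alerts.append(("critical", "Check replicator daemon status"))
    if quarantined_now > quarantined_before:
        alerts.append(("critical", "Check drive health, replace failed drives"))
    if not md5_consistent:
        alerts.append(("critical", "Redistribute ring files immediately"))
    return alerts


if __name__ == "__main__":
    # A slow but otherwise healthy node produces a single warning.
    print(replication_alerts(400, 30, 0, 0, True))
```

Comparing the quarantine count against the previous reading, rather than against zero, matches the "count increasing" wording in the table.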

Integration with XIMP

For continuous monitoring, connect the object storage recon endpoint to XIMP (Xloud Infrastructure Monitoring Platform):

Prometheus scrape config for object storage

scrape_configs:
  - job_name: 'xavs-object-storage-recon'
    static_configs:
      - targets: ['<proxy-node-1>:6000', '<proxy-node-2>:6000']
    metrics_path: '/recon/metrics'

Configure alerting rules in XIMP for the critical thresholds above. Set notification channels for the on-call team to respond to 507 storage errors and quarantine count spikes promptly.

Next Steps

Replication

Deep-dive into replication health and quarantine management

Ring Management

Expand capacity by adding drives and rebalancing rings

Admin Troubleshooting

Respond to monitoring alerts and diagnose failures

Quotas

Set limits to prevent individual projects from consuming all capacity
