Overview
Day-to-day cluster management involves monitoring health status, responding to warnings and errors, managing the OSD lifecycle, and performing maintenance operations. This page covers the most common operational tasks for XSDS administrators.
Prerequisites
- Administrator credentials with the admin role
- SSH access to a cluster node running the management CLI
- Access to XDeploy (https://connect.<your-domain>)
Monitoring Cluster Health
Cluster health can be checked from the dashboard or the CLI.
Dashboard: navigate to XDeploy → Storage → Cluster Health for a graphical overview of cluster status, OSD counts, capacity utilization, and active alerts.
| Health State | Meaning | Action Required |
|---|---|---|
| HEALTH_OK | All components healthy, data fully replicated | None |
| HEALTH_WARN | Non-critical issue detected | Investigate and resolve |
| HEALTH_ERR | Critical issue; data may be at risk | Immediate attention required |
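The health states in the table can also be read programmatically. The following is a minimal sketch, assuming XSDS exposes the standard Ceph CLI (this page refers to ceph osd tree) and that the JSON printed by `ceph status --format json` carries a `health` object with `status` and `checks` fields, as upstream Ceph does. The function name `summarize_health` and the sample document are hypothetical and hand-written, not real cluster output.

```python
import json

# Recommended action per health state, copied from the table on this page.
ACTIONS = {
    "HEALTH_OK": "None",
    "HEALTH_WARN": "Investigate and resolve",
    "HEALTH_ERR": "Immediate attention required",
}


def summarize_health(status_json: str) -> dict:
    """Return the health state, the matching action, and active check messages."""
    health = json.loads(status_json).get("health", {})
    state = health.get("status", "UNKNOWN")
    checks = {
        name: check.get("summary", {}).get("message", "")
        for name, check in health.get("checks", {}).items()
    }
    return {
        "state": state,
        "action": ACTIONS.get(state, "Unknown state; investigate"),
        "checks": checks,
    }


# Hand-written sample in the assumed shape of `ceph status --format json`.
sample = json.dumps({
    "health": {
        "status": "HEALTH_WARN",
        "checks": {
            "OSD_DOWN": {
                "severity": "HEALTH_WARN",
                "summary": {"message": "1 osds down"},
            }
        },
    }
})
print(summarize_health(sample))
```

On a cluster node, the input string would typically come from running the status command and capturing its standard output.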
Service Operations
Common cluster management operations can be performed through XDeploy or the CLI, grouped into service management and OSD operations.
Service management tasks:
- List all cluster services
- View service placement
- Redeploy a specific service
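The three service tasks above might map to orchestrator commands as sketched below. This assumes XSDS is managed through the upstream Ceph orchestrator (`ceph orch`); that is an assumption, not something this page confirms. The helper names and the service name `mgr` are illustrative only. The helpers build argument lists without executing anything, so they are safe to try off-cluster.

```python
import shlex
from typing import List


def list_services_cmd() -> List[str]:
    """Command that lists all services known to the orchestrator."""
    return ["ceph", "orch", "ls"]


def service_placement_cmd() -> List[str]:
    """Command that lists daemons together with the hosts they run on."""
    return ["ceph", "orch", "ps"]


def redeploy_service_cmd(service_name: str) -> List[str]:
    """Command that redeploys one service; rejects empty or option-like names."""
    if not service_name or service_name.startswith("-"):
        raise ValueError(f"invalid service name: {service_name!r}")
    return ["ceph", "orch", "redeploy", service_name]


# Print the commands instead of running them ("mgr" is an example name).
for cmd in (list_services_cmd(), service_placement_cmd(), redeploy_service_cmd("mgr")):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a node that has the management CLI installed.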
OSD Lifecycle Management
Adding a new OSD
Deploy new OSDs through XDeploy:
- Navigate to XDeploy → Storage → OSDs → Add OSD
- Select the target host and available device
- XDeploy provisions and integrates the OSD into the cluster
After the OSD is added, the cluster begins rebalancing data automatically.
An OSD can also be added on a specific host and device from the CLI.
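As a rough illustration of the CLI route, the sketch below builds the upstream Ceph orchestrator command for creating an OSD on a given host and device. It assumes XSDS accepts `ceph orch daemon add osd <host>:<device>`; the host name `storage-node-01`, the device `/dev/sdb`, and the helper name `add_osd_cmd` are hypothetical.

```python
import shlex
from typing import List


def add_osd_cmd(host: str, device: str) -> List[str]:
    """Build the command that creates an OSD on `device` of `host`."""
    if not host or ":" in host:
        raise ValueError(f"invalid host name: {host!r}")
    if not device.startswith("/dev/"):
        raise ValueError(f"device must be an absolute /dev path: {device!r}")
    # The orchestrator expects the target as a single host:device argument.
    return ["ceph", "orch", "daemon", "add", "osd", f"{host}:{device}"]


# Example values only; substitute a real host and an unused device.
print(shlex.join(add_osd_cmd("storage-node-01", "/dev/sdb")))
```

Validating the arguments before building the command keeps a mistyped device path from reaching the cluster.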
Removing a failed OSD
- Mark the OSD out to trigger data recovery
- Watch recovery progress
- Once recovery has completed (HEALTH_OK with no active recovery), remove the OSD from the cluster
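The removal sequence can be sketched as follows, assuming the standard Ceph commands `ceph osd out` and `ceph osd purge` are available on XSDS and that `ceph status --format json` reports placement group states under `pgmap.pgs_by_state`, as upstream Ceph does. The helper names, the OSD id 7, and the sample status document are hypothetical. The "no active recovery" test is a heuristic based on state names, not an official check.

```python
import json
from typing import List

# State-name fragments that indicate data is still moving or under-replicated.
BUSY_TOKENS = ("recover", "backfill", "degraded", "undersized", "remapped", "peering")


def mark_out_cmd(osd_id: int) -> List[str]:
    """Command that marks the OSD out so its data is re-replicated elsewhere."""
    return ["ceph", "osd", "out", str(osd_id)]


def recovery_complete(status_json: str) -> bool:
    """True when health is HEALTH_OK and no PG state suggests ongoing recovery."""
    status = json.loads(status_json)
    if status.get("health", {}).get("status") != "HEALTH_OK":
        return False
    states = status.get("pgmap", {}).get("pgs_by_state", [])
    return not any(
        token in entry.get("state_name", "")
        for entry in states
        for token in BUSY_TOKENS
    )


def purge_cmd(osd_id: int) -> List[str]:
    """Command that removes the OSD from the cluster (destructive)."""
    return ["ceph", "osd", "purge", str(osd_id), "--yes-i-really-mean-it"]


# Hand-written sample status; real output contains many more fields.
sample = json.dumps({
    "health": {"status": "HEALTH_OK"},
    "pgmap": {"pgs_by_state": [{"state_name": "active+clean", "count": 128}]},
})
print(mark_out_cmd(7))
if recovery_complete(sample):
    print(purge_cmd(7))
```

Gating the purge on `recovery_complete` mirrors the rule above: the OSD is removed only after the cluster reports HEALTH_OK with no recovery in flight.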
Replacing a failed disk
After physically replacing the failed disk, redeploy the OSD through XDeploy:
- Navigate to XDeploy → Storage → OSDs → Replace OSD
- Select the host and the new device
- XDeploy provisions the replacement OSD and the cluster begins rebalancing
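One way to confirm that a replacement OSD has joined the cluster is to inspect the JSON form of the OSD tree. The sketch below assumes `ceph osd tree --format json` returns a `nodes` list whose OSD entries carry `status` and `reweight` fields, as in upstream Ceph, and treats a reweight above zero as "in". The function name `osd_up_and_in` and the sample tree are hypothetical.

```python
import json


def osd_up_and_in(tree_json: str, osd_id: int) -> bool:
    """True when the given OSD exists in the tree, is up, and has nonzero reweight."""
    for node in json.loads(tree_json).get("nodes", []):
        if node.get("type") == "osd" and node.get("id") == osd_id:
            return node.get("status") == "up" and node.get("reweight", 0) > 0
    return False  # OSD not present in the tree at all


# Hand-written sample tree with one healthy OSD and one that is down and out.
sample_tree = json.dumps({
    "nodes": [
        {"id": -1, "name": "default", "type": "root"},
        {"id": 0, "name": "osd.0", "type": "osd", "status": "up", "reweight": 1.0},
        {"id": 1, "name": "osd.1", "type": "osd", "status": "down", "reweight": 0},
    ]
})
print(osd_up_and_in(sample_tree, 0))
print(osd_up_and_in(sample_tree, 1))
```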
When the replacement succeeds, the new OSD shows as up and in in the output of ceph osd tree, and the cluster returns to HEALTH_OK.
Maintenance Mode
Before performing maintenance on a storage node (firmware updates, hardware replacement, OS maintenance), set the cluster to maintenance mode so that the node's temporary absence does not trigger unnecessary data recovery:
- Enable maintenance mode for a node
- Disable maintenance mode once the work is complete
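The enable and disable steps pair naturally, so a sketch can wrap them in a context manager that always leaves maintenance mode, even if the work in between fails. This assumes XSDS supports the upstream orchestrator commands `ceph orch host maintenance enter` and `ceph orch host maintenance exit`; the page does not state which command XSDS actually uses. The names `maintenance_mode` and `run`, and the host `storage-node-01`, are hypothetical.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, List


def run(cmd: List[str]) -> None:
    """Default runner: execute the command and raise if it fails."""
    subprocess.run(cmd, check=True)


@contextmanager
def maintenance_mode(host: str, runner: Callable[[List[str]], None] = run):
    """Enter maintenance mode for `host`, and always exit it afterwards."""
    runner(["ceph", "orch", "host", "maintenance", "enter", host])
    try:
        yield
    finally:
        runner(["ceph", "orch", "host", "maintenance", "exit", host])


# Dry run: record the commands in a list instead of executing them.
issued: List[List[str]] = []
with maintenance_mode("storage-node-01", runner=issued.append):
    pass  # firmware update, hardware swap, or OS maintenance would go here
for cmd in issued:
    print(" ".join(cmd))
```

Passing a custom `runner` keeps the example testable off-cluster; omitting it runs the commands for real.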
Next Steps
- Pool Management: Create and configure storage pools for different workload types
- CRUSH Maps: Manage failure domains and device class routing
- Capacity Planning: Monitor utilization and plan OSD additions before capacity is exhausted
- Troubleshooting: Diagnose HEALTH_WARN states, OSD failures, and slow request issues