
Overview

Day-to-day cluster management involves monitoring health status, responding to warnings and errors, managing OSD lifecycle, and performing maintenance operations. This page covers the most common operational tasks for XSDS administrators.
Administrator Access Required — This operation requires the admin role. Contact your Xloud administrator if you do not have sufficient permissions.
Prerequisites
  • Administrator credentials with the admin role
  • SSH access to a cluster node running the management CLI
  • Access to XDeploy (https://connect.<your-domain>)

Monitoring Cluster Health

Navigate to XDeploy → Storage → Cluster Health for a graphical overview of cluster status, OSD counts, capacity utilization, and active alerts.
Health State | Meaning                                       | Action Required
HEALTH_OK    | All components healthy, data fully replicated | None
HEALTH_WARN  | Non-critical issue detected                   | Investigate and resolve
HEALTH_ERR   | Critical issue, data may be at risk           | Immediate attention required
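For scripted monitoring, the health state can be read from the `ceph` CLI directly. The sketch below is illustrative, assuming the `ceph` binary is on PATH; the `health_action` helper name and the action keywords it prints are not part of the product, they simply mirror the table above.

```shell
# Hypothetical helper: map the cluster health state from `ceph health`
# to the action column of the table above. Assumes `ceph` is on PATH.
health_action() {
  case "$(ceph health | awk '{print $1}')" in
    HEALTH_OK)   echo "none" ;;          # fully healthy, nothing to do
    HEALTH_WARN) echo "investigate" ;;   # non-critical, resolve soon
    HEALTH_ERR)  echo "escalate" ;;      # data may be at risk
    *)           echo "unknown" ;;       # unexpected output
  esac
}
```

A probe like this can run from cron or a monitoring hook; `ceph health detail` gives the underlying reasons when the state is not HEALTH_OK.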

Service Operations

Common cluster management operations performed through XDeploy or the CLI.
List all cluster services
ceph orch ls
View service placement
ceph orch ps
Redeploy a specific service
ceph orch redeploy <SERVICE_TYPE>.<SERVICE_ID>
Always verify cluster health is HEALTH_OK before adding or removing OSDs. Operations on an already-degraded cluster can cause data unavailability.
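This check can be enforced in scripts with a small pre-flight gate. The following is a minimal sketch, assuming the `ceph` CLI is on PATH; the `require_health_ok` function name is an example, not a product command.

```shell
# Hypothetical pre-flight gate: refuse to proceed with OSD changes
# unless the cluster currently reports HEALTH_OK.
require_health_ok() {
  local state
  state="$(ceph health | awk '{print $1}')"
  if [ "$state" != "HEALTH_OK" ]; then
    echo "Refusing to continue: cluster reports $state" >&2
    return 1
  fi
}
# Usage: require_health_ok && ceph orch daemon add osd <HOSTNAME>:/dev/nvme1n1
```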

OSD Lifecycle Management

Adding a new OSD

Deploy new OSDs through XDeploy:
  1. Navigate to XDeploy → Storage → OSDs → Add OSD
  2. Select the target host and available device
  3. XDeploy provisions and integrates the OSD into the cluster
Alternatively via CLI:
Add OSD on specific host and device
ceph orch daemon add osd <HOSTNAME>:/dev/nvme1n1
After adding, the cluster begins re-balancing data automatically.
Add OSDs in batches rather than one at a time. Adding multiple OSDs at once triggers a single rebalancing wave, which completes faster than the repeated rebalancing caused by sequential additions.
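A batch addition can be scripted as a loop over the documented `ceph orch daemon add osd` command. The helper below is a sketch; the function name, host name, and device paths are examples to adapt to your inventory.

```shell
# Hypothetical batch helper: issue one add per device, back to back,
# so the cluster settles in a single rebalancing wave.
add_osds_batch() {
  local host="$1"; shift
  local dev
  for dev in "$@"; do
    ceph orch daemon add osd "${host}:${dev}"
  done
}
# Usage: add_osds_batch storage-node-01 /dev/nvme1n1 /dev/nvme2n1
```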

Removing a failed OSD

Mark OSD out to trigger data recovery
ceph osd out <OSD_ID>
Monitor recovery progress:
Watch recovery progress
watch ceph status
Once recovery completes (HEALTH_OK with no active recovery), remove the OSD:
Remove the OSD from the cluster
ceph osd purge <OSD_ID> --yes-i-really-mean-it
Do not remove an OSD before recovery is complete. Removing an OSD during active recovery on a degraded cluster risks data loss.
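To make the wait explicit in automation, recovery completion can be polled before the purge. This is a sketch, assuming the `ceph` CLI is on PATH; the function name and the attempts/interval defaults are illustrative.

```shell
# Hypothetical recovery gate: poll `ceph health` until HEALTH_OK,
# then allow the purge to proceed. Gives up after `attempts` polls.
wait_for_health_ok() {
  local attempts="${1:-120}" interval="${2:-30}"
  while [ "$attempts" -gt 0 ]; do
    [ "$(ceph health | awk '{print $1}')" = "HEALTH_OK" ] && return 0
    sleep "$interval"
    attempts=$((attempts - 1))
  done
  return 1   # recovery did not finish within the polling window
}
# Usage: wait_for_health_ok && ceph osd purge <OSD_ID> --yes-i-really-mean-it
```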

Replacing a failed disk

After physically replacing the failed disk, redeploy the OSD through XDeploy:
  1. Navigate to XDeploy → Storage → OSDs → Replace OSD
  2. Select the host and the new device
  3. XDeploy provisions the replacement OSD and the cluster begins rebalancing
The new OSD appears in ceph osd tree and the cluster returns to HEALTH_OK.

Maintenance Mode

Before performing maintenance on a storage node (firmware updates, hardware replacement, OS maintenance), set the cluster to maintenance mode to prevent false recovery triggers:
Enable maintenance mode for a node
ceph osd set noout
ceph osd set norebalance
Perform your maintenance, then restore normal operation:
Disable maintenance mode
ceph osd unset noout
ceph osd unset norebalance
Do not leave noout and norebalance flags set for extended periods. If an OSD failure occurs while noout is set, the cluster will not re-replicate data to compensate, increasing the risk of data loss.
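One way to avoid leaving the flags set is to wrap the maintenance action so the flags are cleared even if the action fails. The wrapper below is a sketch, assuming the `ceph` CLI is on PATH; the function name and the usage command are examples.

```shell
# Hypothetical wrapper: set the maintenance flags, run the given
# command, then always clear the flags and report the command's status.
with_maintenance_mode() {
  ceph osd set noout
  ceph osd set norebalance
  "$@"
  local rc=$?
  ceph osd unset noout
  ceph osd unset norebalance
  return "$rc"
}
# Usage: with_maintenance_mode apt-get -y upgrade
```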

Next Steps

Pool Management

Create and configure storage pools for different workload types

CRUSH Maps

Manage failure domains and device class routing

Capacity Planning

Monitor utilization and plan OSD additions before capacity is exhausted

Troubleshooting

Diagnose HEALTH_WARN states, OSD failures, and slow request issues