
Overview

Day-to-day cluster management involves monitoring health status, responding to warnings and errors, managing OSD lifecycle, and performing maintenance operations. This page covers the most common operational tasks for XSDS administrators.
Administrator Access Required — This operation requires the admin role. Contact your Xloud administrator if you do not have sufficient permissions.
Prerequisites
  • Administrator credentials with the admin role
  • SSH access to a cluster node running the management CLI
  • Access to XDeploy (https://connect.<your-domain>)

Monitoring Cluster Health

Navigate to XDeploy → Storage → Cluster Health for a graphical overview of cluster status, OSD counts, capacity utilization, and active alerts.
Health State | Meaning                                       | Action Required
HEALTH_OK    | All components healthy, data fully replicated | None
HEALTH_WARN  | Non-critical issue detected                   | Investigate and resolve
HEALTH_ERR   | Critical issue, data may be at risk           | Immediate attention required
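For scripted monitoring, the health state can be read from the `ceph` CLI directly. The sketch below is illustrative, assuming the `ceph` binary is on PATH; the `health_action` helper name and the action keywords it prints are not part of the product, they simply mirror the table above.

```shell
# Hypothetical helper: map the cluster health state from `ceph health`
# to the action column of the table above. Assumes `ceph` is on PATH.
health_action() {
  case "$(ceph health | awk '{print $1}')" in
    HEALTH_OK)   echo "none" ;;          # fully healthy, nothing to do
    HEALTH_WARN) echo "investigate" ;;   # non-critical, resolve soon
    HEALTH_ERR)  echo "escalate" ;;      # data may be at risk
    *)           echo "unknown" ;;       # unexpected output
  esac
}
```

A probe like this can run from cron or a monitoring hook; `ceph health detail` gives the underlying reasons when the state is not HEALTH_OK.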

Service Operations

Common cluster management operations performed through XDeploy or the CLI.
List all cluster services
ceph orch ls
View service placement
ceph orch ps
Redeploy a specific service
ceph orch redeploy <SERVICE_TYPE>.<SERVICE_ID>
Always verify cluster health is HEALTH_OK before adding or removing OSDs. Operations on an already-degraded cluster can cause data unavailability.
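This check can be enforced in scripts with a small pre-flight gate. The following is a minimal sketch, assuming the `ceph` CLI is on PATH; the `require_health_ok` function name is an example, not a product command.

```shell
# Hypothetical pre-flight gate: refuse to proceed with OSD changes
# unless the cluster currently reports HEALTH_OK.
require_health_ok() {
  local state
  state="$(ceph health | awk '{print $1}')"
  if [ "$state" != "HEALTH_OK" ]; then
    echo "Refusing to continue: cluster reports $state" >&2
    return 1
  fi
}
# Usage: require_health_ok && ceph orch daemon add osd <HOSTNAME>:/dev/nvme1n1
```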

OSD Lifecycle Management

Adding a new OSD

Deploy new OSDs through XDeploy:
  1. Navigate to XDeploy → Storage → OSDs → Add OSD
  2. Select the target host and available device
  3. XDeploy provisions and integrates the OSD into the cluster
Alternatively via CLI:
Add OSD on specific host and device
ceph orch daemon add osd <HOSTNAME>:/dev/nvme1n1
After adding, the cluster begins re-balancing data automatically.
Add OSDs in batches rather than one at a time. Adding multiple OSDs at once triggers a single rebalancing wave, which completes faster than the repeated rebalancing caused by sequential additions.
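A batch addition can be scripted as a loop over the documented `ceph orch daemon add osd` command. The helper below is a sketch; the function name, host name, and device paths are examples to adapt to your inventory.

```shell
# Hypothetical batch helper: issue one add per device, back to back,
# so the cluster settles in a single rebalancing wave.
add_osds_batch() {
  local host="$1"; shift
  local dev
  for dev in "$@"; do
    ceph orch daemon add osd "${host}:${dev}"
  done
}
# Usage: add_osds_batch storage-node-01 /dev/nvme1n1 /dev/nvme2n1
```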

Removing a failed OSD

Mark OSD out to trigger data recovery
ceph osd out <OSD_ID>
Monitor recovery progress:
Watch recovery progress
watch ceph status
Once recovery completes (HEALTH_OK with no active recovery), remove the OSD:
Remove the OSD from the cluster
ceph osd purge <OSD_ID> --yes-i-really-mean-it
Do not remove an OSD before recovery is complete. Removing an OSD during active recovery on a degraded cluster risks data loss.
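To make the wait explicit in automation, recovery completion can be polled before the purge. This is a sketch, assuming the `ceph` CLI is on PATH; the function name and the attempts/interval defaults are illustrative.

```shell
# Hypothetical recovery gate: poll `ceph health` until HEALTH_OK,
# then allow the purge to proceed. Gives up after `attempts` polls.
wait_for_health_ok() {
  local attempts="${1:-120}" interval="${2:-30}"
  while [ "$attempts" -gt 0 ]; do
    [ "$(ceph health | awk '{print $1}')" = "HEALTH_OK" ] && return 0
    sleep "$interval"
    attempts=$((attempts - 1))
  done
  return 1   # recovery did not finish within the polling window
}
# Usage: wait_for_health_ok && ceph osd purge <OSD_ID> --yes-i-really-mean-it
```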

Replacing a failed disk

After physically replacing the failed disk, redeploy the OSD through XDeploy:
  1. Navigate to XDeploy → Storage → OSDs → Replace OSD
  2. Select the host and the new device
  3. XDeploy provisions the replacement OSD and the cluster begins rebalancing
The new OSD appears in ceph osd tree and the cluster returns to HEALTH_OK.

Maintenance Mode

Before performing maintenance on a storage node (firmware updates, hardware replacement, OS maintenance), set the cluster to maintenance mode to prevent false recovery triggers:
Enable maintenance mode for a node
ceph osd set noout
ceph osd set norebalance
Perform your maintenance, then restore normal operation:
Disable maintenance mode
ceph osd unset noout
ceph osd unset norebalance
Do not leave noout and norebalance flags set for extended periods. If an OSD failure occurs while noout is set, the cluster will not re-replicate data to compensate, increasing the risk of data loss.
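One way to avoid leaving the flags set is to wrap the maintenance action so the flags are cleared even if the action fails. The wrapper below is a sketch, assuming the `ceph` CLI is on PATH; the function name and the usage command are examples.

```shell
# Hypothetical wrapper: set the maintenance flags, run the given
# command, then always clear the flags and report the command's status.
with_maintenance_mode() {
  ceph osd set noout
  ceph osd set norebalance
  "$@"
  local rc=$?
  ceph osd unset noout
  ceph osd unset norebalance
  return "$rc"
}
# Usage: with_maintenance_mode apt-get -y upgrade
```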

Next Steps

Pool Management

Create and configure storage pools for different workload types

CRUSH Maps

Manage failure domains and device class routing

Capacity Planning

Monitor utilization and plan OSD additions before capacity is exhausted

Troubleshooting

Diagnose HEALTH_WARN states, OSD failures, and slow request issues