# Cluster Management

> Monitor XSDS cluster health, manage service placement, add and remove OSDs, and perform day-to-day operational procedures.

## Overview

Day-to-day cluster management involves monitoring health status, responding to warnings and errors, managing the OSD lifecycle, and performing maintenance operations. This page covers the most common operational tasks for XSDS administrators.

**Administrator Access Required:** The operations on this page require the `admin` role. Contact your Xloud administrator if you do not have sufficient permissions.

**Prerequisites**

* Administrator credentials with the `admin` role
* SSH access to a cluster node running the management CLI
* Access to **XDeploy** (`https://connect.`)

***

## Monitoring Cluster Health

Navigate to **XDeploy → Storage → Cluster Health** for a graphical overview of cluster status, OSD counts, capacity utilization, and active alerts.

| Health State  | Meaning                                       | Action Required              |
| ------------- | --------------------------------------------- | ---------------------------- |
| `HEALTH_OK`   | All components healthy, data fully replicated | None                         |
| `HEALTH_WARN` | Non-critical issue detected                   | Investigate and resolve      |
| `HEALTH_ERR`  | Critical issue; data may be at risk           | Immediate attention required |

From the CLI, check the overall status first, then the detailed health report:

```bash title="Cluster health summary" theme={null}
ceph status
```

```bash title="Detailed health report" theme={null}
ceph health detail
```

Review each warning entry. Common warnings are documented in the [Troubleshooting](/services/sds/admin-guide/troubleshooting) guide.

Next, check OSD placement, status, and utilization:

```bash title="OSD tree — placement and status" theme={null}
ceph osd tree
```

```bash title="OSD utilization" theme={null}
ceph osd df tree
```

In a healthy cluster, all OSDs show `up` and `in` status. OSDs showing `down` or `out` require investigation.
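As a rough sketch of how the health states listed earlier in this section map to operator actions, the script below reads the first token of `ceph health` output and prints the matching action. The helper name `health_action` and its action labels are illustrative assumptions, not part of the ceph CLI or XDeploy. The cluster is only queried when the `ceph` command is present on the node.

```bash
#!/usr/bin/env bash
# health_action is a hypothetical helper (not a ceph or XDeploy command).
# It maps a health state string to the action from the health state table.
health_action() {
  case "$1" in
    HEALTH_OK*)   echo "none" ;;         # nothing to do
    HEALTH_WARN*) echo "investigate" ;;  # non-critical, investigate and resolve
    HEALTH_ERR*)  echo "immediate" ;;    # critical, act now
    *)            echo "unknown" ;;      # empty or unexpected output
  esac
}

# Query the cluster only when the ceph CLI is available on this node.
if command -v ceph >/dev/null 2>&1; then
  # `ceph health` prints the state as the first token of its first line.
  state="$(ceph health 2>/dev/null | awk 'NR==1 {print $1}')"
  echo "state=${state:-unavailable} action=$(health_action "$state")"
fi
```

One possible use is to run this from a scheduler or monitoring agent and alert on any action other than `none`; that wiring is deliberately left out here.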
***

## Service Operations

Common cluster management operations can be performed through XDeploy or the CLI. Commands that target a specific service, host and device, or OSD take its identifier as an argument.

```bash title="List all cluster services" theme={null}
ceph orch ls
```

```bash title="View service placement" theme={null}
ceph orch ps
```

```bash title="Redeploy a specific service" theme={null}
ceph orch redeploy .
```

Always verify that cluster health is `HEALTH_OK` before adding or removing OSDs. Operations on an already-degraded cluster can cause data unavailability.

```bash title="Add an OSD on a new device" theme={null}
ceph orch daemon add osd :
```

```bash title="Mark an OSD out (begin data evacuation)" theme={null}
ceph osd out 
```

```bash title="Remove an OSD after data evacuation" theme={null}
ceph orch osd rm 
```

```bash title="View OSD details" theme={null}
ceph osd dump | grep "^osd\."
```

***

## OSD Lifecycle Management

Deploy new OSDs through XDeploy:

1. Navigate to **XDeploy → Storage → OSDs → Add OSD**
2. Select the target host and available device
3. XDeploy provisions and integrates the OSD into the cluster

Alternatively, via the CLI:

```bash title="Add OSD on specific host and device" theme={null}
ceph orch daemon add osd :/dev/nvme1n1
```

After the OSD is added, the cluster begins rebalancing data automatically.

Add OSDs in batches rather than one at a time. Adding multiple OSDs simultaneously reduces the number of rebalancing cycles, so the cluster settles faster than with sequential additions.

To remove an OSD, first mark it out so the cluster moves its data to the remaining OSDs:

```bash title="Mark OSD out to trigger data recovery" theme={null}
ceph osd out 
```

Monitor recovery progress:

```bash title="Watch recovery progress" theme={null}
watch ceph status
```

Once recovery completes (`HEALTH_OK` with no active recovery), remove the OSD:

```bash title="Remove the OSD from the cluster" theme={null}
ceph osd purge --yes-i-really-mean-it
```

Do not remove an OSD before recovery is complete. Removing an OSD during active recovery on a degraded cluster risks data loss.

To replace a failed OSD, physically replace the failed disk, then redeploy the OSD through XDeploy:

1. Navigate to **XDeploy → Storage → OSDs → Replace OSD**
2. Select the host and the new device
3. XDeploy provisions the replacement OSD and the cluster begins rebalancing

When the replacement succeeds, the new OSD shows `up` and `in` in `ceph osd tree`, and the cluster returns to `HEALTH_OK`.

***

## Maintenance Mode

Before performing maintenance on a storage node (firmware updates, hardware replacement, OS maintenance), set the `noout` and `norebalance` flags. These flags apply cluster-wide and keep the cluster from starting recovery or rebalancing while the node's OSDs are temporarily down:

```bash title="Enable maintenance mode for a node" theme={null}
ceph osd set noout
ceph osd set norebalance
```

Perform your maintenance, then restore normal operation:

```bash title="Disable maintenance mode" theme={null}
ceph osd unset noout
ceph osd unset norebalance
```

Do not leave the `noout` and `norebalance` flags set for extended periods. If an OSD fails while `noout` is set, the cluster will not re-replicate data to compensate, increasing the risk of data loss.

***

## Next Steps

* Create and configure storage pools for different workload types
* Manage failure domains and device class routing
* Monitor utilization and plan OSD additions before capacity is exhausted
* Diagnose `HEALTH_WARN` states, OSD failures, and slow request issues