# Capacity Planning

> Monitor XSDS cluster utilization, maintain safe capacity headroom, and plan storage expansion before capacity constraints impact performance or availability.

## Overview

Maintaining adequate free capacity in an XSDS cluster is critical for both performance
and data safety. At high utilization, the cluster may lack the free space needed to
complete recovery after OSD failures, and I/O performance degrades significantly. This
page covers capacity thresholds, utilization monitoring, capacity calculations, and
expansion procedures.

<Warning>
  **Administrator Access Required** — This operation requires the `admin` role. Contact your
  Xloud administrator if you do not have sufficient permissions.
</Warning>

<Note>
  **Prerequisites**

  * Administrator credentials with the `admin` role
  * SSH access to a cluster management node
  * Access to **XDeploy** (`https://connect.<your-domain>`) for node provisioning
</Note>

***

## Capacity Thresholds

| Utilization | Status    | Action Required                           |
| ----------- | --------- | ----------------------------------------- |
| \< 60%      | Healthy   | Monitor routinely                         |
| 60–70%      | Watch     | Begin planning expansion                  |
| 70–80%      | Warning   | Initiate expansion — order hardware       |
| 80–85%      | Critical  | Accelerate expansion — immediate action   |
| > 85%       | Emergency | Risk of degraded I/O and recovery failure |

<Warning>
  Above 85% utilization, the cluster may refuse writes and cannot complete data
  recovery after OSD failures. Maintain a minimum of 30% free capacity headroom.
</Warning>
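
The thresholds above can be encoded in a small helper for scripts or scheduled checks. The following is a minimal sketch: the function name `capacity_status` is hypothetical, and it assumes that a value exactly on a boundary (60, 70, 80) belongs to the higher band, which the table leaves unspecified.

```bash
# Hypothetical helper: map a utilization percentage to the Status column above.
# Assumption: values exactly at 60, 70, or 80 fall into the higher band;
# 85 is still "Critical" because the table marks only > 85% as "Emergency".
capacity_status() {
  awk -v u="$1" 'BEGIN {
    if      (u < 60)  print "Healthy"
    else if (u < 70)  print "Watch"
    else if (u < 80)  print "Warning"
    else if (u <= 85) print "Critical"
    else              print "Emergency"
  }'
}
```

For example, `capacity_status 72.5` prints `Warning`.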

***

## Monitoring Utilization

<Tabs>
  <Tab title="Dashboard" icon="gauge">
    Navigate to **XDeploy → Storage → Capacity** for a graphical capacity overview
    showing per-pool and cluster-wide utilization with trend projections.
  </Tab>

  <Tab title="CLI" icon="terminal">
    ```bash title="Cluster-wide capacity summary" theme={null}
    ceph df
    ```

    ```bash title="Per-pool capacity" theme={null}
    ceph df detail
    ```

    ```bash title="Per-OSD utilization" theme={null}
    ceph osd df tree
    ```

    ```bash title="PG autoscale status" theme={null}
    ceph osd pool autoscale-status
    ```

    Key metrics to monitor:

    * **Used %**: Alert at 70%, act at 80%
    * **PG distribution**: Imbalanced PGs cause some OSDs to bear disproportionate load
    * **Recovery I/O**: Active recovery competes with client I/O — schedule OSD additions
      during low-traffic windows where possible
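
    The alert and action levels for **Used %** can also be checked from a script. This is a minimal sketch that assumes you already have the raw total and used byte counts (for example, extracted from the output of `ceph df`); the function name `check_used_pct` and its exit codes are hypothetical conventions, not part of XSDS.

    ```bash
    # Hypothetical helper: print used % and return
    #   0 = below 70%, 1 = alert (>= 70%), 2 = act (>= 80%), 3 = invalid input.
    # Arguments: total raw bytes, used raw bytes.
    check_used_pct() {
      awk -v total="$1" -v used="$2" 'BEGIN {
        if (total <= 0) { print "invalid total"; exit 3 }
        pct = used * 100 / total
        printf "%.1f%%\n", pct
        if (pct >= 80) exit 2
        if (pct >= 70) exit 1
        exit 0
      }'
    }
    ```

    The non-zero exit codes make the helper easy to wire into a cron job or an external check that treats `1` as a warning and `2` as critical.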
  </Tab>
</Tabs>

***

## Capacity Calculations

<AccordionGroup>
  <Accordion title="Replicated pools" icon="copy">
    For a pool with replication factor `n`, usable capacity = raw capacity / `n`.

    | Raw Capacity | Replication Factor | Usable Capacity |
    | ------------ | ------------------ | --------------- |
    | 100 TB       | 3 (default)        | \~33 TB         |
    | 100 TB       | 2                  | \~50 TB         |

    Account for the 30% headroom recommendation:

    * 100 TB raw, factor 3 = \~33 TB usable
    * 30% headroom = \~10 TB reserved
    * Effective usable = \~23 TB
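
    The same arithmetic can be scripted for other raw capacities and replication factors. This is a minimal sketch; the function name `replicated_effective_tb` is hypothetical, and the 30% default simply mirrors the headroom recommendation on this page.

    ```bash
    # Hypothetical helper: usable capacity of a replicated pool after reserving headroom.
    # Arguments: raw capacity (TB), replication factor, headroom fraction (default 0.30).
    replicated_effective_tb() {
      awk -v raw="$1" -v n="$2" -v h="${3:-0.30}" 'BEGIN {
        usable = raw / n
        printf "usable=%.1f reserved=%.1f effective=%.1f\n", usable, usable * h, usable * (1 - h)
      }'
    }
    ```

    Running `replicated_effective_tb 100 3` prints `usable=33.3 reserved=10.0 effective=23.3`, matching the rounded figures above.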
  </Accordion>

  <Accordion title="Erasure-coded pools" icon="code">
    For an erasure code profile `k+m`, usable capacity = raw capacity × `k/(k+m)`.

    | Profile | Overhead | Usable from 100 TB |
    | ------- | -------- | ------------------ |
    | 4+2     | 1.5×     | \~67 TB            |
    | 6+2     | 1.33×    | \~75 TB            |
    | 8+3     | 1.375×   | \~73 TB            |
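
    To evaluate a profile that is not in the table, the formula can be wrapped in a short helper. This is a minimal sketch; the function name `ec_usable_tb` is hypothetical, and the result does not include the headroom reservation.

    ```bash
    # Hypothetical helper: overhead and usable capacity for an erasure-code profile k+m.
    # Arguments: raw capacity (TB), k (data chunks), m (coding chunks).
    ec_usable_tb() {
      awk -v raw="$1" -v k="$2" -v m="$3" 'BEGIN {
        printf "overhead=%.3fx usable=%.1f\n", (k + m) / k, raw * k / (k + m)
      }'
    }
    ```

    For example, `ec_usable_tb 100 8 3` prints `overhead=1.375x usable=72.7`.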
  </Accordion>

  <Accordion title="Snapshot space" icon="camera">
    Snapshots consume incremental capacity proportional to the change rate after the
    snapshot is taken. As a rough upper bound, a volume with 10% daily churn and one
    snapshot per day accumulates approximately 10% of its size in snapshot data for
    each day of retention.

    Factor snapshot retention into capacity planning. For 7-day retention on a 10 TB
    pool with 10% daily churn: 10 TB × 10% × 7 days = approximately 7 TB of additional
    snapshot space.
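
    The estimate can be reproduced for other sizes and retention periods. This is a minimal sketch under a simple linear model (one snapshot per day, no overlap between the blocks changed on different days), so it tends to overestimate; the function name `snapshot_space_tb` is hypothetical.

    ```bash
    # Hypothetical helper: rough upper-bound estimate of snapshot space.
    # Arguments: volume or pool size (TB), daily churn fraction, retention in days.
    # Assumption: one snapshot per day, and each day changes a disjoint set of blocks.
    snapshot_space_tb() {
      awk -v size="$1" -v churn="$2" -v days="$3" 'BEGIN {
        printf "%.1f\n", size * churn * days
      }'
    }
    ```

    Running `snapshot_space_tb 10 0.10 7` prints `7.0`.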
  </Accordion>
</AccordionGroup>

***

## Expanding the Cluster

<Steps titleSize="h3">
  <Step title="Deploy new OSD node via XDeploy" icon="server">
    Navigate to **XDeploy → Infrastructure → Nodes → Add Node** and register the
    new storage node. XDeploy configures the OS, installs storage packages, and
    joins the node to the cluster.

    <Tip>
      Add at least 3 OSDs per expansion batch to ensure balanced data distribution
      across the cluster. Adding a single OSD may cause temporary imbalance.
    </Tip>
  </Step>

  <Step title="Verify OSD integration" icon="circle-check">
    ```bash title="Confirm new OSDs are up and in" theme={null}
    ceph osd tree
    ```

    New OSDs should show `up` and `in`. The cluster begins rebalancing data
    automatically once OSDs are registered.
  </Step>

  <Step title="Monitor rebalancing" icon="refresh-cw">
    ```bash title="Watch recovery progress" theme={null}
    watch ceph status
    ```

    Rebalancing completes when `ceph status` shows `HEALTH_OK` with no active
    recovery operations. Rebalancing speed depends on cluster size and network
    bandwidth.
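
    For unattended expansions, the wait can be scripted instead of watched. This is a minimal sketch that polls `ceph health` until it reports `HEALTH_OK`; the function name `wait_for_health_ok` and its default timeout and interval are hypothetical. It checks only the health string, so confirm with `ceph status` that no recovery or backfill is still in progress.

    ```bash
    # Hypothetical helper: poll `ceph health` until it reports HEALTH_OK or a timeout expires.
    # Arguments: timeout in seconds (default 3600), poll interval in seconds (default 30).
    wait_for_health_ok() {
      local timeout="${1:-3600}"
      local interval="${2:-30}"
      local elapsed=0
      while [ "$elapsed" -lt "$timeout" ]; do
        case "$(ceph health 2>/dev/null)" in
          HEALTH_OK*) echo "HEALTH_OK after ${elapsed}s"; return 0 ;;
        esac
        sleep "$interval"
        elapsed=$((elapsed + interval))
      done
      echo "timed out after ${timeout}s" >&2
      return 1
    }
    ```

    A call such as `wait_for_health_ok 7200 60` waits up to two hours, checking once a minute, and returns a non-zero status on timeout so that a calling script can stop.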

    <Note>
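      Recovery I/O competes with client I/O. If client performance is impacted during
      rebalancing, throttle recovery by limiting concurrent backfill and recovery
      operations per OSD:

      ```bash title="Throttle recovery I/O" theme={null}
      ceph config set osd osd_max_backfills 1
      ceph config set osd osd_recovery_max_active 1
      ```

      On releases that use the mClock scheduler, these limits may only take effect
      when `osd_mclock_override_recovery_settings` is enabled.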
    </Note>

    <Check>Cluster returns to `HEALTH_OK` with data distributed across all OSDs including new ones.</Check>
  </Step>
</Steps>

***

## Capacity Trend Monitoring

Configure XIMP alerts to proactively notify administrators before capacity reaches
critical thresholds:

| Alert             | Threshold       | XIMP Metric                   |
| ----------------- | --------------- | ----------------------------- |
| Capacity Warning  | Pool used > 70% | `xloud_storage_pool_used_pct` |
| Capacity Critical | Pool used > 80% | `xloud_storage_pool_used_pct` |
| OSD Near Full     | OSD used > 85%  | `xloud_storage_osd_used_pct`  |

Navigate to **Monitoring → Alerting → Alert Rules** in the XIMP portal and create
rules sourcing from the `xloud_storage` metric namespace.

***

## Next Steps

<CardGroup cols={2}>
  <Card title="Cluster Management" href="/services/sds/admin-guide/cluster-management" color="#197560">
    Add OSDs and manage cluster health during expansion
  </Card>

  <Card title="Monitoring" href="/services/sds/admin-guide/monitoring" color="#197560">
    Configure XIMP alerts for capacity and health thresholds
  </Card>

  <Card title="Storage Tiers" href="/services/sds/admin-guide/storage-tiers" color="#197560">
    Add new tiers when expanding with different device classes
  </Card>

  <Card title="Troubleshooting" href="/services/sds/admin-guide/troubleshooting" color="#197560">
    Diagnose capacity-related HEALTH\_WARN states
  </Card>
</CardGroup>
