# Object Storage Architecture

> Understand the Xloud Object Storage cluster topology: proxy tier, storage tier, consistent hash ring mechanics, S3 API middleware, and data flow for all.

## Overview

Xloud Object Storage uses a three-tier, fully distributed architecture. Proxy nodes handle all API requests and authentication. Storage nodes persist object data on local drives. The consistent hash ring maps every object to its target storage locations without a central metadata server. This design eliminates single points of failure and enables horizontal scaling of each tier independently.

**Prerequisites**

* Familiarity with Xloud Object Storage [storage policies](/services/object-storage/storage-policies)
* Admin access to review cluster topology

***

## Cluster Topology

```mermaid theme={null}
graph TD
    Client["API Client\n(Swift / S3)"] --> ProxyLB["Load Balancer :443"]
    ProxyLB --> P1["Proxy Node 1\nproxy-server"]
    ProxyLB --> P2["Proxy Node 2\nproxy-server"]
    P1 --> AuthM["Keystone Auth Middleware"]
    P1 --> S3MW["S3 API Middleware"]
    P1 --> AR["Account Ring"]
    P1 --> CR["Container Ring"]
    P1 --> OR["Object Ring(s)"]
    AR --> SN1["Storage Node 1\naccount-server\ncontainer-server\nobject-server"]
    CR --> SN1
    OR --> SN1
    OR --> SN2["Storage Node 2\nobject-server"]
    OR --> SN3["Storage Node 3\nobject-server"]
    SN1 <-->|Replication| SN2
    SN2 <-->|Replication| SN3
    SN3 <-->|Replication| SN1

    subgraph "Proxy Tier (Stateless)"
        P1
        P2
        AuthM
        S3MW
    end
    subgraph "Ring Layer"
        AR
        CR
        OR
    end
    subgraph "Storage Tier"
        SN1
        SN2
        SN3
    end
```

Proxy nodes are fully stateless: they hold only ring files (updated via ring distribution). All persistent state lives on storage nodes. Adding proxy nodes scales API throughput without touching the storage tier.
***

## Component Descriptions

The proxy server is the single entry point for all client requests (Swift and S3 API). It performs:

* Token validation via Keystone auth middleware
* Ring lookups to identify target storage nodes for each request
* Parallel writes to all replica nodes for PUT operations
* Read fan-out and quorum resolution for GET operations
* Transparent S3 API translation via `s3api` middleware

Proxy nodes never store object data. They are horizontally scalable and stateless.

The account server manages project-level metadata:

* Tracks all containers belonging to a project
* Stores account-level statistics (bytes used, object count, container count)
* Enforces quota limits in conjunction with the proxy
* Served from the account ring, with one partition per account

The container server manages container-level metadata:

* Lists all objects within a container (object listings)
* Stores container-level statistics and custom metadata headers
* Container records are replicated across the container ring
* Object listings are eventually consistent; updates propagate asynchronously via the updater

The object server handles the actual object data:

* Stores objects on local XFS or ext4 filesystems
* Each object is stored as a file at a path derived from its MD5 hash
* Handles PUT, GET, DELETE, HEAD, and COPY operations
* Writes metadata (content-type, custom headers) as extended file attributes
* Generates a unique transaction ID for every operation

The replication engine runs continuously on every storage node to maintain the configured replica count:

* **Object replicator**: Compares local partition hashes with remote nodes; pushes missing objects via rsync or direct HTTP
* **Container replicator**: Synchronizes container database records across ring replicas
* **Account replicator**: Synchronizes account database records
* Replication is partition-based: the ring divides the hash space into partitions, and each partition is replicated to its primary and handoff nodes

Additional background services maintain cluster health:

| Service           | Function                                                                                                      |
| ----------------- | ------------------------------------------------------------------------------------------------------------- |
| **Auditor**       | Reads every stored object and verifies checksum integrity. Quarantines corrupted objects.                     |
| **Updater**       | Processes failed container and account update queues asynchronously. Resolves eventually-consistent listings. |
| **Expirer**       | Deletes objects that have reached their `X-Delete-At` or `X-Delete-After` expiry timestamp.                   |
| **Reconstructor** | EC-specific: reconstructs missing or corrupted EC fragments from surviving shards.                            |

The `s3api` middleware translates S3-format requests into Swift internal requests transparently:

* Mounted in the proxy pipeline before the auth middleware
* Translates S3 bucket operations to Swift container operations
* Translates S3 object operations to Swift object operations
* Handles S3 authentication (HMAC-SHA256 signature v4)
* Translates S3 ACLs to Swift ACL headers
* Supports multipart upload via Swift dynamic large objects
* Supports object versioning via Swift versioning middleware

S3 and Swift APIs share the same underlying storage. An object uploaded via the S3 API is immediately accessible via the Swift API using the same account/container/object path structure, and vice versa.

***

## Consistent Hash Ring

The consistent hash ring is the core distribution mechanism. It determines which storage nodes hold each object without any central directory server.
```mermaid theme={null}
graph LR
    Object["Object Name\n(MD5 hash)"] --> Part["Partition\n(high bits of hash)"]
    Part --> Node1["Primary Node"]
    Part --> Node2["Replica Node 1"]
    Part --> Node3["Replica Node 2"]
    Ring["Ring File\n(.ring.gz)"] -->|"lookup(partition)"| Node1
    Ring -->|"lookup(partition)"| Node2
    Ring -->|"lookup(partition)"| Node3
```

Ring mechanics:

| Concept             | Description                                                             |
| ------------------- | ----------------------------------------------------------------------- |
| **Partition power** | `2^partition_power` partitions in the ring (typically 2^18 = 262,144)   |
| **Partition**       | A slice of the hash space. Every object maps to exactly one partition.  |
| **Device**          | A physical drive with assigned weight (capacity proportion)             |
| **Weight**          | Determines what fraction of partitions a device receives                |
| **Replica count**   | How many distinct devices hold each partition's data                    |
| **Zone**            | Fault domain grouping: the ring enforces that replicas land in distinct zones |
| **Region**          | Geographic grouping for geo-redundant deployments across data centers   |

Higher partition power means more partitions and finer-grained data distribution, but larger ring files. Use a partition power of 18 (`2^18` partitions) for clusters up to \~200 storage nodes. Use a partition power of 20 (`2^20` partitions) for very large clusters.
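The lookup in the diagram above can be sketched in a few lines of Python. This is an illustrative sketch, not the product's implementation: `partition_for`, `devices_for`, and the `replica2part2dev` table layout are hypothetical names, and any salting of the path that a production ring may apply before hashing is omitted.

```python
import hashlib
import struct

PART_POWER = 18  # 2**18 = 262,144 partitions, matching the table above


def partition_for(path: str, part_power: int = PART_POWER) -> int:
    """Map an object path to a partition using the high bits of its MD5 hash."""
    digest = hashlib.md5(path.encode("utf-8")).digest()
    # Read the first 4 bytes as a big-endian unsigned integer,
    # then keep only the top `part_power` bits.
    top32 = struct.unpack_from(">I", digest)[0]
    return top32 >> (32 - part_power)


def devices_for(partition: int, replica2part2dev: list[list[int]]) -> list[int]:
    """Return one device id per replica for the given partition.

    `replica2part2dev[r][p]` is assumed to hold the device id that stores
    replica `r` of partition `p` (a hypothetical in-memory ring table).
    """
    return [row[partition] for row in replica2part2dev]


if __name__ == "__main__":
    # Toy ring: partition power 2 (4 partitions), 3 replicas, 4 devices.
    toy_ring = [
        [0, 1, 2, 3],
        [1, 2, 3, 0],
        [2, 3, 0, 1],
    ]
    part = partition_for("/AUTH_proj/container/object", part_power=2)
    print(part, devices_for(part, toy_ring))
```

Because the partition is computed from the object path alone, every proxy holding the same ring file resolves the same devices without consulting a central directory.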
***

## Object Request Flow

```mermaid theme={null}
sequenceDiagram
    participant Client
    participant LB as Load Balancer
    participant Proxy as Proxy Server
    participant Ring as Object Ring
    participant S1 as Storage Node 1
    participant S2 as Storage Node 2
    participant S3 as Storage Node 3

    Client->>LB: PUT /v1/AUTH_proj/container/object
    LB->>Proxy: Forward request
    Proxy->>Proxy: Keystone token validation
    Proxy->>Ring: lookup(MD5(object path))
    Ring-->>Proxy: Primary: S1, Replicas: S2, S3
    Proxy->>S1: PUT object (replica 1, parallel)
    Proxy->>S2: PUT object (replica 2, parallel)
    Proxy->>S3: PUT object (replica 3, parallel)
    S1-->>Proxy: 201 Created
    S2-->>Proxy: 201 Created
    S3-->>Proxy: 201 Created
    Proxy->>Proxy: Quorum check (2 of 3 required)
    Proxy-->>Client: 201 Created
```

The proxy writes to all replicas in parallel. It returns success to the client once a write quorum (default: `(replicas // 2) + 1`) confirms the write.

```mermaid theme={null}
sequenceDiagram
    participant Client
    participant Proxy as Proxy Server
    participant Ring as Object Ring
    participant S1 as Storage Node 1
    participant S2 as Storage Node 2

    Client->>Proxy: GET /v1/AUTH_proj/container/object
    Proxy->>Proxy: Keystone token validation
    Proxy->>Ring: lookup(MD5(object path))
    Ring-->>Proxy: Primary: S1, Replicas: S2, S3
    Proxy->>S1: GET object (primary)
    S1-->>Proxy: 200 OK + data stream
    Proxy-->>Client: 200 OK + data stream
    Note over Proxy,S2: S2/S3 not contacted on successful primary read
```

Reads go to the primary node first. If the primary is unreachable, the proxy automatically falls back to replica nodes, transparently to the client.
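The quorum rule and the read fallback described above can be sketched as follows. This is a minimal illustration under stated assumptions: `quorum_size`, `write_succeeded`, and `read_with_fallback` are hypothetical helper names, the `fetch` callable stands in for whatever the proxy uses to contact a storage node, and an unreachable node is modelled as a `ConnectionError`.

```python
from typing import Callable, Iterable


def quorum_size(replicas: int) -> int:
    """Write quorum as stated in the text: (replicas // 2) + 1."""
    return (replicas // 2) + 1


def write_succeeded(statuses: list[int], replicas: int) -> bool:
    """True when at least a quorum of storage nodes returned a 2xx status."""
    ok = sum(1 for status in statuses if 200 <= status < 300)
    return ok >= quorum_size(replicas)


def read_with_fallback(nodes: Iterable[str], fetch: Callable[[str], bytes]) -> bytes:
    """Try the primary first, then each replica, returning the first body read."""
    for node in nodes:
        try:
            return fetch(node)
        except ConnectionError:
            continue  # node unreachable: fall back to the next one
    raise ConnectionError("no replica reachable")


if __name__ == "__main__":
    print(quorum_size(3))                        # quorum for a 3-replica policy
    print(write_succeeded([201, 201, 503], 3))   # two of three nodes confirmed
```

With three replicas the quorum is two, which matches the "2 of 3 required" step in the PUT diagram.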
```mermaid theme={null}
sequenceDiagram
    participant Client as S3 Client
    participant Proxy as Proxy + s3api middleware
    participant Keystone
    participant Storage as Storage Nodes

    Client->>Proxy: PUT /bucket/object (S3 signature v4)
    Proxy->>Proxy: s3api: validate HMAC signature
    Proxy->>Keystone: Exchange S3 creds for Swift token
    Keystone-->>Proxy: Swift auth token
    Proxy->>Proxy: Translate S3 → Swift request
    Proxy->>Storage: Swift PUT /v1/AUTH_proj/bucket/object
    Storage-->>Proxy: 201 Created
    Proxy->>Proxy: Translate Swift → S3 response
    Proxy-->>Client: 200 OK (S3 format)
```

The S3 middleware translates the entire request and response lifecycle. The underlying storage operation is identical to a native Swift write.

***

## Replication Zones and Fault Domains

Storage nodes are grouped into zones for fault domain separation. The ring builder enforces replica placement across distinct zones.

| Zone   | Typical Mapping           | Failure Isolated        |
| ------ | ------------------------- | ----------------------- |
| Zone 1 | Rack 1 / PDU A / Switch A | Any single rack failure |
| Zone 2 | Rack 2 / PDU B / Switch B | Any single rack failure |
| Zone 3 | Rack 3 / PDU C / Switch C | Any single rack failure |

A 3-replica policy distributes one replica per zone. A full zone failure (rack down, switch failure) results in zero data loss, and read/write operations continue using the two surviving zones.

For geo-redundant deployments, configure regions in addition to zones. Each region hosts a complete replica set. Cross-region replication introduces higher write latency, so design policies accordingly.

***

## Capacity Planning

* 3-replica policy: usable capacity = raw capacity ÷ 3. For 60 TB raw storage across 10 nodes, usable capacity = 20 TB.
* 8+4 EC policy: usable capacity = raw capacity × (8 ÷ 12), or about 66.7% of raw. For 60 TB raw storage, usable capacity = 40 TB, twice the usable capacity of a 3-replica policy.

Minimum node counts:

* Replication (3×): minimum 3 nodes in 3 distinct zones.
* EC 8+4: minimum 12 nodes.
* EC 4+2: minimum 6 nodes.
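The capacity arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and the minimum-node rule for erasure coding (data plus parity fragments) is inferred from the 8+4 and 4+2 figures given in this section rather than from a documented formula.

```python
def usable_replicated(raw_tb: float, replicas: int) -> float:
    """Usable capacity under replication: raw capacity divided by replica count."""
    return raw_tb / replicas


def usable_ec(raw_tb: float, data: int, parity: int) -> float:
    """Usable capacity under erasure coding: raw * data / (data + parity)."""
    return raw_tb * data / (data + parity)


def min_nodes_ec(data: int, parity: int) -> int:
    """Assumed rule: one node per fragment, so data + parity nodes."""
    return data + parity


if __name__ == "__main__":
    raw = 60.0  # TB of raw storage, as in the example above
    print(usable_replicated(raw, 3))  # 3-replica policy
    print(usable_ec(raw, 8, 4))       # 8+4 EC policy
    print(min_nodes_ec(8, 4), min_nodes_ec(4, 2))
```

For the 60 TB example this gives 20 TB under 3-replica and 40 TB under 8+4 EC, which is the source of the twofold difference noted above.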
Adding nodes triggers a ring rebalance. Set `min_part_hours` (minimum 1 hour) to limit partition moves per cycle and prevent rebalance storms during rapid node additions.

***

## Next Steps

* Configure replication, EC, and multi-tier storage policies
* Add drives, adjust weights, and distribute updated rings
* Monitor replication health and manage quarantined objects
* Track cluster capacity and proxy request metrics