
Overview

Xloud Object Storage uses a three-tier, fully distributed architecture. Proxy nodes handle all API requests and authentication. Storage nodes persist object data on local drives. The consistent hash ring maps every object to its target storage locations without a central metadata server. This design eliminates single points of failure and enables horizontal scaling of each tier independently.
Prerequisites
  • Familiarity with Xloud Object Storage's storage policies
  • Admin access to review cluster topology

Cluster Topology

Proxy nodes are fully stateless — they hold only ring files (updated via ring distribution). All persistent state lives on storage nodes. Adding proxy nodes scales API throughput without touching the storage tier.

Component Descriptions

Proxy Server

The proxy server is the single entry point for all client requests (Swift and S3 API). It performs:
  • Token validation via Keystone auth middleware
  • Ring lookups to identify target storage nodes for each request
  • Parallel writes to all replica nodes for PUT operations
  • Read fan-out and quorum resolution for GET operations
  • Transparent S3 API translation via s3api middleware
Proxy nodes never store object data. They are horizontally scalable and stateless.
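The PUT behavior above can be pictured as a quorum-gated fan-out. The sketch below is illustrative only, not Xloud's proxy code; `write_fn` is a hypothetical stand-in for an HTTP PUT to one storage node.

```python
from concurrent.futures import ThreadPoolExecutor

def put_object(nodes, write_fn, quorum):
    """Fan a PUT out to every replica node in parallel.

    write_fn(node) -> bool stands in for an HTTP PUT to one storage
    node; the proxy reports success once a quorum of nodes confirm.
    """
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        confirmed = sum(1 for ok in pool.map(write_fn, nodes) if ok)
    return confirmed >= quorum
```

Because all writes run in parallel, one slow or failed replica does not block the response as long as a quorum of nodes succeed.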

Account Server

The account server manages project-level metadata:
  • Tracks all containers belonging to a project
  • Stores account-level statistics (bytes used, object count, container count)
  • Enforces quota limits in conjunction with the proxy
  • Served from the account ring — one partition per account

Container Server

The container server manages container-level metadata:
  • Lists all objects within a container (object listings)
  • Stores container-level statistics and custom metadata headers
  • Container records are replicated across the container ring
  • Object listings are eventually consistent — updates propagate asynchronously via the updater

Object Server

The object server handles the actual object data:
  • Stores objects on local XFS or ext4 filesystems
  • Each object stored as a file at a path derived from its MD5 hash
  • Handles PUT, GET, DELETE, HEAD, and COPY operations
  • Writes metadata (content-type, custom headers) as extended file attributes
  • Generates a unique transaction ID for every operation

Replication Engine

The replication engine runs continuously on every storage node to maintain the configured replica count:
  • Object replicator: Compares local partition hashes with remote nodes; pushes missing objects via rsync or direct HTTP
  • Container replicator: Synchronizes container database records across ring replicas
  • Account replicator: Synchronizes account database records
  • Replication is partition-based — the ring divides the hash space into partitions, and each partition’s primary and handoff nodes are replicated to
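The hash-comparison step in the object replicator can be sketched as follows. The per-partition digests are assumed to be precomputed hashes of each partition's contents; the function name is hypothetical, not Xloud's internal API.

```python
def partitions_needing_sync(local_hashes, remote_hashes):
    """Compare per-partition content hashes against one remote node.

    Any partition whose hash differs, or is missing remotely, must be
    pushed, mirroring the object replicator's rsync/HTTP push step.
    """
    return sorted(
        part for part, digest in local_hashes.items()
        if remote_hashes.get(part) != digest
    )
```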

Background Services

Additional background services maintain cluster health:
| Service | Function |
| --- | --- |
| Auditor | Reads every stored object and verifies checksum integrity. Quarantines corrupted objects. |
| Updater | Processes failed container and account update queues asynchronously. Resolves eventually consistent listings. |
| Expirer | Deletes objects that have reached their X-Delete-At or X-Delete-After expiry timestamp. |
| Reconstructor | EC-specific: reconstructs missing or corrupted EC fragments from surviving shards. |
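The Expirer's decision reduces to a timestamp comparison. A minimal sketch, assuming expiry metadata is exposed as an X-Delete-At header holding a Unix timestamp:

```python
import time

def is_expired(headers, now=None):
    """Return True once an object's X-Delete-At timestamp has passed."""
    now = time.time() if now is None else now
    delete_at = headers.get("X-Delete-At")
    return delete_at is not None and float(delete_at) <= now
```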

S3 API Middleware

The s3api middleware translates S3-format requests into Swift internal requests transparently:
  • Mounted in the proxy pipeline before the auth middleware
  • Translates S3 bucket operations to Swift container operations
  • Translates S3 object operations to Swift object operations
  • Handles S3 authentication (HMAC-SHA256 signature v4)
  • Translates S3 ACLs to Swift ACL headers
  • Supports multipart upload via Swift dynamic large objects
  • Supports object versioning via Swift versioning middleware
S3 and Swift APIs share the same underlying storage. An object uploaded via S3 API is immediately accessible via the Swift API using the same account/container/object path structure, and vice versa.
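The shared namespace can be illustrated as a pure path translation. The /v1/{account} URL prefix below is an assumption matching Swift-style paths, not a documented Xloud constant.

```python
def s3_to_swift_path(account, bucket, key):
    """An S3 bucket maps to a Swift container; the key maps to the
    object name, so both APIs address the same stored object."""
    return f"/v1/{account}/{bucket}/{key}"
```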

Consistent Hash Ring

The consistent hash ring is the core distribution mechanism. It determines which storage nodes hold each object without any central directory server. Ring mechanics:
| Concept | Description |
| --- | --- |
| Partition power | The ring contains 2^partition_power partitions (typically 2^18 = 262,144) |
| Partition | A slice of the hash space. Every object maps to exactly one partition. |
| Device | A physical drive with an assigned weight (capacity proportion) |
| Weight | Determines what fraction of partitions a device receives |
| Replica count | How many distinct devices hold each partition's data |
| Zone | Fault-domain grouping; the ring enforces that replicas land in distinct zones |
| Region | Geographic grouping for geo-redundant deployments across data centers |
Higher partition power means more partitions and finer-grained data distribution, but larger ring files. Use 2^18 for clusters up to ~200 storage nodes. Use 2^20 for very large clusters.
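The object-to-partition mapping can be sketched Swift-style: hash the object path, then keep the top partition_power bits as the partition number. The hash-suffix value and helper name below are illustrative assumptions, not Xloud internals.

```python
import hashlib

PART_POWER = 18                 # 2**18 = 262,144 partitions
PART_SHIFT = 32 - PART_POWER    # keep the top PART_POWER bits of 32

def object_partition(account, container, obj, hash_suffix=b"changeme"):
    """Map an object path to its ring partition deterministically."""
    path = f"/{account}/{container}/{obj}".encode()
    digest = hashlib.md5(path + hash_suffix).digest()
    return int.from_bytes(digest[:4], "big") >> PART_SHIFT
```

Because the mapping is a pure function of the path, any proxy node can compute it locally, which is what removes the need for a central metadata server.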

Object Request Flow

The proxy writes to all replicas in parallel. It returns success to the client once a write quorum (default: (replicas // 2) + 1) confirms the write.
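The quorum default transcribes directly into code:

```python
def write_quorum(replicas):
    """Minimum confirmed writes before the proxy returns success."""
    return replicas // 2 + 1
```

For a 3-replica policy the quorum is 2, so a single slow or failed storage node never blocks a write.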

Replication Zones and Fault Domains

Storage nodes are grouped into zones for fault domain separation. The ring builder enforces replica placement across distinct zones.
| Zone | Typical Mapping | Failure Isolated |
| --- | --- | --- |
| Zone 1 | Rack 1 / PDU A / Switch A | Any single rack failure |
| Zone 2 | Rack 2 / PDU B / Switch B | Any single rack failure |
| Zone 3 | Rack 3 / PDU C / Switch C | Any single rack failure |
A 3-replica policy distributes one replica per zone. A full zone failure (rack down, switch failure) results in zero data loss and read/write operations continue using the two surviving zones.
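The ring builder's zone constraint boils down to a distinctness check over the zones chosen for one partition's replicas. This is an illustrative invariant, not the builder's actual code:

```python
def zone_placement_ok(replica_zones):
    """True when every replica sits in a distinct zone, so a single
    zone failure can take out at most one copy of any partition."""
    return len(set(replica_zones)) == len(replica_zones)
```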
For geo-redundant deployments, configure regions in addition to zones. Each region hosts a complete replica set. Cross-region replication introduces higher write latency — design policies accordingly.

Capacity Planning

Replication Overhead

3-replica policy: usable capacity = raw capacity ÷ 3. For 60 TB raw storage across 10 nodes, usable capacity = 20 TB.

EC Overhead

8+4 EC policy: each object is split into 8 data fragments plus 4 parity fragments, so usable capacity = raw capacity × (8 ÷ 12) ≈ 66.7%. For 60 TB raw storage, usable capacity = 40 TB, twice what a 3-replica policy yields from the same hardware.
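Both overhead figures follow from one formula; replication is the special case of 1 data fragment plus (replicas − 1) parity-like copies:

```python
def usable_tb(raw_tb, data, parity):
    """Usable capacity under an EC (or replication) scheme.

    An 8+4 EC policy keeps 8 of every 12 fragments as data; a
    3-replica policy is the special case data=1, parity=2.
    """
    return raw_tb * data / (data + parity)
```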

Minimum Nodes per Policy

Replication (3×): minimum 3 nodes in 3 distinct zones. EC 8+4: minimum 12 nodes. EC 4+2: minimum 6 nodes.

Ring Rebalancing

Adding nodes triggers a ring rebalance. Set min_part_hours (minimum 1 hour) so that no single partition moves more than once within that window; this caps the data movement per rebalance cycle and prevents rebalance storms during rapid node additions.
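The throttle can be pictured as a per-partition cooldown. The helper below is hypothetical, mirroring only the min_part_hours rule described above:

```python
def may_move(last_moved_epoch, now_epoch, min_part_hours=1):
    """A partition may move again only after min_part_hours elapse."""
    return now_epoch - last_moved_epoch >= min_part_hours * 3600
```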

Next Steps

Storage Policies

Configure replication, EC, and multi-tier storage policies

Ring Management

Add drives, adjust weights, and distribute updated rings

Replication

Monitor replication health and manage quarantined objects

Monitoring

Track cluster capacity and proxy request metrics