
Overview

Xloud Object Storage uses a three-tier, fully distributed architecture. Proxy nodes handle all API requests and authentication. Storage nodes persist object data on local drives. The consistent hash ring maps every object to its target storage locations without a central metadata server. This design eliminates single points of failure and enables horizontal scaling of each tier independently.
Prerequisites
  • Familiarity with Xloud Object Storage's storage policies
  • Admin access to review cluster topology

Cluster Topology

Proxy nodes are fully stateless — they hold only ring files (updated via ring distribution). All persistent state lives on storage nodes. Adding proxy nodes scales API throughput without touching the storage tier.

Component Descriptions

Proxy Server

The proxy server is the single entry point for all client requests (Swift and S3 API). It performs:
  • Token validation via Keystone auth middleware
  • Ring lookups to identify target storage nodes for each request
  • Parallel writes to all replica nodes for PUT operations
  • Read fan-out and quorum resolution for GET operations
  • Transparent S3 API translation via s3api middleware
Proxy nodes never store object data. They are horizontally scalable and stateless.
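The PUT behavior above can be pictured as a quorum-gated fan-out. The sketch below is illustrative only, not Xloud's proxy code; `write_fn` is a hypothetical stand-in for an HTTP PUT to one storage node.

```python
from concurrent.futures import ThreadPoolExecutor

def put_object(nodes, write_fn, quorum):
    """Fan a PUT out to every replica node in parallel.

    write_fn(node) -> bool stands in for an HTTP PUT to one storage
    node; the proxy reports success once a quorum of nodes confirm.
    """
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        confirmed = sum(1 for ok in pool.map(write_fn, nodes) if ok)
    return confirmed >= quorum
```

Because all writes run in parallel, one slow or failed replica does not block the response as long as a quorum of nodes succeed.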

Account Server

The account server manages project-level metadata:
  • Tracks all containers belonging to a project
  • Stores account-level statistics (bytes used, object count, container count)
  • Enforces quota limits in conjunction with the proxy
  • Served from the account ring — one partition per account

Container Server

The container server manages container-level metadata:
  • Lists all objects within a container (object listings)
  • Stores container-level statistics and custom metadata headers
  • Container records are replicated across the container ring
  • Object listings are eventually consistent — updates propagate asynchronously via the updater

Object Server

The object server handles the actual object data:
  • Stores objects on local XFS or ext4 filesystems
  • Each object stored as a file at a path derived from its MD5 hash
  • Handles PUT, GET, DELETE, HEAD, and COPY operations
  • Writes metadata (content-type, custom headers) as extended file attributes
  • Generates a unique transaction ID for every operation

Replication Engine

The replication engine runs continuously on every storage node to maintain the configured replica count:
  • Object replicator: Compares local partition hashes with remote nodes; pushes missing objects via rsync or direct HTTP
  • Container replicator: Synchronizes container database records across ring replicas
  • Account replicator: Synchronizes account database records
  • Replication is partition-based — the ring divides the hash space into partitions, and each partition’s primary and handoff nodes are replicated to
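The hash-comparison step in the object replicator can be sketched as follows. The per-partition digests are assumed to be precomputed hashes of each partition's contents; the function name is hypothetical, not Xloud's internal API.

```python
def partitions_needing_sync(local_hashes, remote_hashes):
    """Compare per-partition content hashes against one remote node.

    Any partition whose hash differs, or is missing remotely, must be
    pushed, mirroring the object replicator's rsync/HTTP push step.
    """
    return sorted(
        part for part, digest in local_hashes.items()
        if remote_hashes.get(part) != digest
    )
```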

Background Services

Additional background services maintain cluster health:
| Service | Function |
| --- | --- |
| Auditor | Reads every stored object and verifies checksum integrity. Quarantines corrupted objects. |
| Updater | Processes failed container and account update queues asynchronously. Resolves eventually consistent listings. |
| Expirer | Deletes objects that have reached their X-Delete-At or X-Delete-After expiry timestamp. |
| Reconstructor | EC-specific: reconstructs missing or corrupted EC fragments from surviving shards. |
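The Expirer's decision reduces to a timestamp comparison. A minimal sketch, assuming expiry metadata is exposed as an X-Delete-At header holding a Unix timestamp:

```python
import time

def is_expired(headers, now=None):
    """Return True once an object's X-Delete-At timestamp has passed."""
    now = time.time() if now is None else now
    delete_at = headers.get("X-Delete-At")
    return delete_at is not None and float(delete_at) <= now
```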

S3 API Middleware

The s3api middleware translates S3-format requests into Swift internal requests transparently:
  • Mounted in the proxy pipeline before the auth middleware
  • Translates S3 bucket operations to Swift container operations
  • Translates S3 object operations to Swift object operations
  • Handles S3 authentication (HMAC-SHA256 signature v4)
  • Translates S3 ACLs to Swift ACL headers
  • Supports multipart upload via Swift dynamic large objects
  • Supports object versioning via Swift versioning middleware
S3 and Swift APIs share the same underlying storage. An object uploaded via S3 API is immediately accessible via the Swift API using the same account/container/object path structure, and vice versa.
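The shared namespace can be illustrated as a pure path translation. The /v1/{account} URL prefix below is an assumption matching Swift-style paths, not a documented Xloud constant.

```python
def s3_to_swift_path(account, bucket, key):
    """An S3 bucket maps to a Swift container; the key maps to the
    object name, so both APIs address the same stored object."""
    return f"/v1/{account}/{bucket}/{key}"
```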

Consistent Hash Ring

The consistent hash ring is the core distribution mechanism. It determines which storage nodes hold each object without any central directory server. Ring mechanics:
| Concept | Description |
| --- | --- |
| Partition power | The ring contains 2^partition_power partitions (typically 2^18 = 262,144) |
| Partition | A slice of the hash space. Every object maps to exactly one partition. |
| Device | A physical drive with an assigned weight (capacity proportion) |
| Weight | Determines what fraction of partitions a device receives |
| Replica count | How many distinct devices hold each partition's data |
| Zone | Fault-domain grouping; the ring enforces that replicas land in distinct zones |
| Region | Geographic grouping for geo-redundant deployments across data centers |
Higher partition power means more partitions and finer-grained data distribution, but larger ring files. Use 2^18 for clusters up to ~200 storage nodes. Use 2^20 for very large clusters.
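The object-to-partition mapping can be sketched Swift-style: hash the object path, then keep the top partition_power bits as the partition number. The hash-suffix value and helper name below are illustrative assumptions, not Xloud internals.

```python
import hashlib

PART_POWER = 18                 # 2**18 = 262,144 partitions
PART_SHIFT = 32 - PART_POWER    # keep the top PART_POWER bits of 32

def object_partition(account, container, obj, hash_suffix=b"changeme"):
    """Map an object path to its ring partition deterministically."""
    path = f"/{account}/{container}/{obj}".encode()
    digest = hashlib.md5(path + hash_suffix).digest()
    return int.from_bytes(digest[:4], "big") >> PART_SHIFT
```

Because the mapping is a pure function of the path, any proxy node can compute it locally, which is what removes the need for a central metadata server.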

Object Request Flow

The proxy writes to all replicas in parallel. It returns success to the client once a write quorum (default: (replicas // 2) + 1) confirms the write.
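The quorum default transcribes directly into code:

```python
def write_quorum(replicas):
    """Minimum confirmed writes before the proxy returns success."""
    return replicas // 2 + 1
```

For a 3-replica policy the quorum is 2, so a single slow or failed storage node never blocks a write.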

Replication Zones and Fault Domains

Storage nodes are grouped into zones for fault domain separation. The ring builder enforces replica placement across distinct zones.
| Zone | Typical Mapping | Failure Isolated |
| --- | --- | --- |
| Zone 1 | Rack 1 / PDU A / Switch A | Any single rack failure |
| Zone 2 | Rack 2 / PDU B / Switch B | Any single rack failure |
| Zone 3 | Rack 3 / PDU C / Switch C | Any single rack failure |
A 3-replica policy distributes one replica per zone. A full zone failure (rack down, switch failure) results in zero data loss and read/write operations continue using the two surviving zones.
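The ring builder's zone constraint boils down to a distinctness check over the zones chosen for one partition's replicas. This is an illustrative invariant, not the builder's actual code:

```python
def zone_placement_ok(replica_zones):
    """True when every replica sits in a distinct zone, so a single
    zone failure can take out at most one copy of any partition."""
    return len(set(replica_zones)) == len(replica_zones)
```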
For geo-redundant deployments, configure regions in addition to zones. Each region hosts a complete replica set. Cross-region replication introduces higher write latency — design policies accordingly.

Capacity Planning

Replication Overhead

3-replica policy: usable capacity = raw capacity ÷ 3. For 60 TB raw storage across 10 nodes, usable capacity = 20 TB.

EC Overhead

8+4 EC policy: each object is split into 8 data fragments plus 4 parity fragments, so usable capacity = raw capacity × (8 ÷ 12) ≈ 66.7%. For 60 TB raw storage, usable capacity = 40 TB, twice what a 3-replica policy yields from the same hardware.
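Both overhead figures follow from one formula; replication is the special case of 1 data fragment plus (replicas − 1) parity-like copies:

```python
def usable_tb(raw_tb, data, parity):
    """Usable capacity under an EC (or replication) scheme.

    An 8+4 EC policy keeps 8 of every 12 fragments as data; a
    3-replica policy is the special case data=1, parity=2.
    """
    return raw_tb * data / (data + parity)
```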

Minimum Nodes per Policy

Replication (3×): minimum 3 nodes in 3 distinct zones. EC 8+4: minimum 12 nodes. EC 4+2: minimum 6 nodes.

Ring Rebalancing

Adding nodes triggers a ring rebalance. Set min_part_hours (minimum 1 hour) so that no single partition moves more than once within that window; this caps the data movement per rebalance cycle and prevents rebalance storms during rapid node additions.
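The throttle can be pictured as a per-partition cooldown. The helper below is hypothetical, mirroring only the min_part_hours rule described above:

```python
def may_move(last_moved_epoch, now_epoch, min_part_hours=1):
    """A partition may move again only after min_part_hours elapse."""
    return now_epoch - last_moved_epoch >= min_part_hours * 3600
```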

Next Steps

Storage Policies

Configure replication, EC, and multi-tier storage policies

Ring Management

Add drives, adjust weights, and distribute updated rings

Replication

Monitor replication health and manage quarantined objects

Monitoring

Track cluster capacity and proxy request metrics