Overview

This guide covers service-level troubleshooting for Xloud Block Storage administrators. It addresses issues with the volume service, scheduler, backend connectivity, and data operations that are not visible to or resolvable by end users.
Administrator Access Required — This operation requires the admin role. Contact your Xloud administrator if you do not have sufficient permissions.
Before troubleshooting
  • Authenticate with admin credentials: source admin-openrc.sh
  • Access service logs via XDeploy for detailed error messages
  • For critical production issues, contact Xloud Support with the affected volume IDs and log excerpts

Service Health Checks

Run these commands first to establish the overall service state:
Check all volume service states
openstack volume service list
List all backend pools and capacity
openstack volume backend pool list --long
Check API endpoint health
openstack volume list --all-projects --limit 1
All services should show state = up and status = enabled. Any service showing down requires immediate investigation.
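The check above can be scripted. A minimal sketch, assuming admin credentials are already sourced (admin-openrc.sh), that flags any service not reporting enabled/up:

```shell
# Sketch: flag any volume service that is not enabled/up.
# flag_down reads "binary host zone status state" lines, as produced by
# `openstack volume service list -f value`.
flag_down() {
  while read -r binary host zone status state _; do
    if [ "$state" != "up" ] || [ "$status" != "enabled" ]; then
      echo "ATTENTION: $binary on $host is $status/$state"
    fi
  done
  return 0
}

# Against a live cloud:
#   openstack volume service list -f value -c Binary -c Host -c Zone -c Status -c State | flag_down
```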

Volume Service Issues

Symptom: openstack volume service list shows one or more services with state down.
Cause: The volume service container has stopped, lost message queue connectivity, or the storage backend driver failed to initialize.
Resolution:
  1. Access the affected node through XDeploy and check the volume service container:
    Check container status (on storage node)
    docker ps | grep cinder
    
  2. Review container logs for initialization errors: Access logs via XDeploy → Logs → cinder-volume on the affected node.
  3. Common causes:
    • Message queue (RabbitMQ) connectivity lost — check network and RabbitMQ status
    • Database connection failure — verify MariaDB/Galera cluster health
    • Backend driver error (keyring file missing, pool name wrong) — review driver-specific log entries
  4. After resolving the root cause, restart the volume service via XDeploy.
Run openstack volume service list 60 seconds after restarting to confirm the service re-registers with the scheduler.
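The post-restart check can be automated with a small retry helper. This is a sketch; the host name in the usage comment is a placeholder:

```shell
# Sketch: retry a check command until it succeeds or attempts run out.
retry_until() {
  # retry_until ATTEMPTS DELAY CMD... -> returns 0 as soon as CMD succeeds
  local attempts="$1" delay="$2" i
  shift 2
  for i in $(seq 1 "$attempts"); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# Against a live cloud, wait up to 2 minutes for the restarted service on
# a given node (host name is a placeholder) to report "up":
#   retry_until 12 10 sh -c \
#     'openstack volume service list -f value -c Host -c State | grep -q "storage-node-1 up"'
```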
Symptom: Volume creation fails with “No valid host was found” or scheduler filter messages appear in the logs.
Cause: All backends were eliminated by the scheduler filters. Common causes:
  • All backends are at capacity
  • The requested volume type’s volume_backend_name does not match any active backend
  • The requested availability zone has no active backend
Resolution:
Verify backend capacity
openstack volume backend pool list --long
Verify volume type extra specs
openstack volume type show <type-name> -c extra_specs
Confirm that volume_backend_name in the type’s extra specs matches the name column in the backend pool list.
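The cross-check can be scripted. A sketch that tests whether a type's volume_backend_name appears among the pool names, assuming pools follow the usual host@backend#pool naming:

```shell
# Sketch: report whether BACKEND_NAME matches any pool name read from stdin.
# Pool names typically look like host@backend#pool; the backend segment
# sits between '@' and '#'.
backend_in_pools() {
  local wanted="$1" pool backend
  while read -r pool; do
    backend="${pool#*@}"      # strip leading "host@"
    backend="${backend%%#*}"  # strip trailing "#pool"
    if [ "$backend" = "$wanted" ]; then
      echo "match: $pool"
      return 0
    fi
  done
  return 1
}

# Against a live cloud (backend name is a placeholder):
#   openstack volume backend pool list -f value | backend_in_pools rbd-1
```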

Backend Connectivity Issues

Symptom: openstack volume backend pool list shows free_capacity_gb = 0 or the backend pool does not appear in the list.
Cause: The volume service cannot connect to the storage cluster to query capacity.
Resolution:
  1. Verify storage cluster health from the storage administration interface.
  2. Verify the authentication keyring file is present on the volume service node:
    Check keyring file (on storage node)
    ls -la /etc/ceph/ceph.client.xloud-volume.keyring
    
  3. Verify the pool name matches the configured backend:
    List storage pools
    # Run on a node with storage admin access
    # (command varies by storage backend)
    
  4. Restart the volume service via XDeploy after resolving connectivity issues.
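Step 2 can be turned into a quick preflight check. A sketch, using the keyring path from the example above:

```shell
# Sketch: verify that a keyring file exists, is non-empty, and is readable.
keyring_ok() {
  local f="$1"
  [ -s "$f" ] && [ -r "$f" ]
}

KEYRING=/etc/ceph/ceph.client.xloud-volume.keyring
if keyring_ok "$KEYRING"; then
  echo "keyring present: $KEYRING"
else
  echo "keyring missing, empty, or unreadable: $KEYRING" >&2
fi
```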
Symptom: Volumes reach available status but fail when attaching to instances. The error typically references connection initialization or an iSCSI/RBD target.
Cause: The compute node cannot connect to the storage backend to initialize the volume attachment. Common causes:
  • Missing storage client package on the compute node
  • Authentication keyring not present on the compute node
  • Network routing between compute and storage nodes is blocked
Resolution:
  1. Verify the storage client package is installed on the compute node (e.g., librbd-dev, ceph-common, or iSCSI initiator packages)
  2. Verify the keyring file is present on the compute node
  3. Test connectivity from the compute node to the storage cluster monitors
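Step 3 can be done with a plain TCP probe. A sketch assuming a Ceph backend, where monitors listen on ports 3300 and 6789 by default; the monitor hostname is a placeholder:

```shell
# Sketch: probe TCP reachability using bash's /dev/tcp (no extra tools needed).
check_port() {
  # check_port HOST PORT -> 0 if a TCP connection can be opened
  timeout 3 bash -c ">/dev/tcp/$1/$2" 2>/dev/null
}

# From the compute node (hostname and ports are assumptions for Ceph):
for port in 3300 6789; do
  if check_port ceph-mon-1.xloud.invalid "$port"; then
    echo "port $port reachable"
  else
    echo "port $port NOT reachable"
  fi
done
```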

Data Operation Issues

Symptom: Volume status remains migrating for more than 30 minutes with no completion.
Resolution:
Check migration status
openstack volume show <volume-id> -c migration_status -c status
Check volume service logs on both source and destination nodes via XDeploy for migration-related errors.
If permanently stuck and data integrity has been verified:
Reset volume state (admin only)
openstack volume set --state available <volume-id>
Resetting the state does not undo a partial migration. Verify data integrity on both source and destination backends before resetting. Consult your storage backend documentation for checking partial migration state.
Symptom: A snapshot remains in deleting state for an extended period.
Cause: The backend could not complete the deletion — typically because dependent volumes still reference the snapshot, or the storage cluster is degraded.
Resolution:
  1. Check for volumes created from the snapshot:
    List dependent volumes
    openstack volume list --all-projects
    
    Look for volumes whose snapshot_id matches the stuck snapshot's ID (shown by openstack volume show <volume-id> -c snapshot_id). Note that source_volid identifies volumes cloned from another volume, not volumes created from a snapshot.
  2. Delete dependent volumes first, then retry the snapshot deletion.
  3. If the storage cluster is degraded, restore cluster health before retrying.
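Step 1 can be scripted. A sketch that filters volumes by their snapshot_id field; the snapshot ID in the usage comment is a placeholder:

```shell
# Sketch: print volumes whose snapshot_id matches a given snapshot.
# Reads "volume-id snapshot-id" pairs on stdin; "None" means no parent snapshot.
dependents_of() {
  local snap="$1" vol src
  while read -r vol src; do
    if [ "$src" = "$snap" ]; then
      echo "$vol"
    fi
  done
  return 0
}

# Against a live cloud (explicit but slow on large deployments):
#   for vol in $(openstack volume list --all-projects -f value -c ID); do
#     echo "$vol $(openstack volume show "$vol" -f value -c snapshot_id)"
#   done | dependents_of <snapshot-id>
```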
Symptom: Encrypted volume attachment fails with a key management or dm-crypt error.
Diagnosis:
  1. Verify the Key Management service is running and accessible from the compute node:
    Check key manager connectivity
    openstack secret list
    
  2. Confirm the compute service on the affected node can reach the Key Management service API (network path, port 9311).
  3. Check compute service logs on the affected node via XDeploy for messages containing barbican, secret, or crypt.
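The network check in step 2 can be done with a TCP probe from the compute node. A sketch, with the key manager hostname as a placeholder and port 9311 taken from the step above:

```shell
# Sketch: probe the Key Management API port from the compute node.
# The hostname is a placeholder; 9311 is the port noted above.
KM_HOST=key-manager.xloud.invalid
if timeout 3 bash -c ">/dev/tcp/$KM_HOST/9311" 2>/dev/null; then
  km_status=reachable
else
  km_status=unreachable
fi
echo "key manager API: $km_status"
```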
Encryption key loss means the volume data is permanently inaccessible. Ensure the Key Management service is in a high-availability configuration and has a database backup before enabling volume encryption in production.

Recovering Orphaned Volumes

Volumes can become orphaned (in-use with no valid attachment) when compute instances are force-deleted without detaching their volumes first:
Find orphaned volumes (in-use with no valid instance)
openstack volume list --all-projects --status in-use
For each result, verify the attached instance still exists:
Check attached instance
openstack server show <instance-id>
If the instance no longer exists, reset the volume state:
Reset orphaned volume to available
openstack volume set --state available <volume-id>
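The sequence above can be scripted. A sketch in which server_exists is assumed to be defined elsewhere; against a live cloud it might simply wrap openstack server show:

```shell
# Sketch: given "volume-id server-id" pairs on stdin, print the volumes
# whose server no longer exists. server_exists is assumed; a live version
# might be:  server_exists() { openstack server show "$1" >/dev/null 2>&1; }
orphaned_volumes() {
  local vol srv
  while read -r vol srv; do
    if ! server_exists "$srv"; then
      echo "$vol"
    fi
  done
  return 0
}

# Each volume printed would then be reset:
#   openstack volume set --state available <volume-id>
```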

Diagnostic Commands Reference

Command                                                Purpose
openstack volume service list                          Check all service states
openstack volume backend pool list --long              Verify backend capacity
openstack volume list --all-projects --status error    Find volumes in error state
openstack volume snapshot list --all-projects          Audit all snapshots
openstack quota list --detail                          Check quota usage across projects
openstack volume show <id> -c migration_status         Check migration state
openstack volume set --state <state> <id>              Force-reset volume state (admin)

Next Steps

User Troubleshooting

Common issues from the user perspective

Storage Backends

Review backend configuration and connectivity requirements

Architecture

Understand service components to narrow down failure domains

Contact Support

Open a support ticket for unresolved production issues