Troubleshooting

Common Issues

High cardinality causing metric store performance issues

Cause: Metric labels with unbounded values (e.g., request IDs, user IDs, or ephemeral container names) create millions of unique metric series, degrading query performance and consuming excessive storage.

Diagnosis:

List highest-cardinality metric series

ximp metric cardinality top --limit 20

Resolution: Drop or relabel high-cardinality labels in the scrape configuration:

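Relabel config: drop high-cardinality label

relabel_configs:
  # Assumes Prometheus-style relabeling, where `drop` discards whole series
  # and `labeldrop` removes only the labels whose name matches `regex`.
  - action: labeldrop
    regex: request_id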

Apply via:

Apply relabel configuration

ximp target update <TARGET_ID> --relabel-file relabel.yaml

Dropping a label is irreversible for the data it affects: metrics ingested after the change do not carry the label, and its values cannot be restored later. Consider using labelmap to replace high-cardinality values with aggregate labels instead of dropping them entirely.
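
To illustrate why one unbounded label dominates cardinality, the following minimal Python sketch counts distinct values per label and compares the number of unique series with and without the worst label. It is not part of XIMP; the sample data and the helper names `label_cardinality` and `unique_series` are hypothetical.

```python
from collections import defaultdict


def label_cardinality(series):
    """Return {label name: count of distinct values} across all label sets."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(seen) for name, seen in values.items()}


def unique_series(series, drop=None):
    """Count distinct label sets, optionally ignoring one label."""
    return len({
        tuple(sorted((k, v) for k, v in labels.items() if k != drop))
        for labels in series
    })


# Hypothetical label sets for one metric; request_id is unbounded.
series = [
    {"method": "GET", "status": "200", "request_id": f"req-{i}"}
    for i in range(1000)
]
series.append({"method": "POST", "status": "500", "request_id": "req-x"})

counts = label_cardinality(series)
worst = max(counts, key=counts.get)
print(worst, counts[worst])
print(unique_series(series), unique_series(series, drop=worst))
```

With this sample data, removing `request_id` collapses 1001 series into 2, which is the effect a drop or relabel rule aims for.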

Log ingestion backlog

Cause: Log volume exceeds the collector’s processing capacity, causing a write backlog and delayed delivery to the search index.

Diagnosis:

Check ingestion queue depth

ximp log ingest-status

Resolution:

1. Reduce log verbosity on high-volume services (run them at INFO or WARNING rather than DEBUG):
   Example: disable debug logging for Nova
   ```
   docker exec nova_api crudini --set /etc/nova/nova.conf DEFAULT debug false
   ```
2. Increase the log collector worker count in the XIMP configuration: navigate to Monitor Center > Logging (Collector Settings, admin view).
3. Add a second log collector node through XDeploy for horizontal scaling.

A single service logging at DEBUG level can generate more log volume than 100 services at INFO. Identify the top log emitters with `ximp log stats top-emitters --last 1h`.
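
The resolutions above work on either side of the same balance: the backlog only shrinks while the collector processes events faster than they arrive. The Python sketch below shows that arithmetic. All rates and the queue depth are hypothetical, and `drain_time_seconds` is an illustrative helper, not an XIMP command.

```python
def drain_time_seconds(queue_depth, ingest_rate, process_rate):
    """Seconds until the backlog clears, or None if it keeps growing."""
    net_rate = process_rate - ingest_rate  # events removed from the queue per second
    if net_rate <= 0:
        return None
    return queue_depth / net_rate


# Hypothetical numbers: 1.2M queued events, collector handles 10k events/s.
backlog = 1_200_000
growing = drain_time_seconds(backlog, ingest_rate=12_000, process_rate=10_000)
quieter = drain_time_seconds(backlog, ingest_rate=8_000, process_rate=10_000)
scaled = drain_time_seconds(backlog, ingest_rate=8_000, process_rate=20_000)
print(growing, quieter, scaled)
```

In this example, lowering verbosity turns a growing backlog into one that clears in 600 seconds, and doubling collector capacity on top of that brings it down to 100 seconds.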

Scrape target in DOWN state

Cause: The scrape target is unreachable because of a firewall block, a stopped service, or an authentication failure.

Diagnosis:

Check specific target health

ximp target health --target <URL> --verbose

Common causes:

Symptom	Cause	Resolution
Connection refused	Service not running on target port	Verify service is running; check port
Timeout	Firewall blocking	Add inbound rule for XIMP collector IP
401 Unauthorized	Invalid auth credentials	Update auth config in target definition
503 Service Unavailable	Service overloaded	Review service health; reduce scrape frequency
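
The table above can be read as a simple lookup from probe outcome to likely cause. The Python sketch below encodes that lookup for use in a custom health-check script. It is an assumption about how you might automate triage, not a description of what `ximp target health` does internally, and the `diagnose` helper is hypothetical.

```python
def diagnose(error=None, status=None):
    """Map a probe outcome to the (cause, resolution) pair from the table."""
    if isinstance(error, ConnectionRefusedError):
        return ("Service not running on target port",
                "Verify service is running; check port")
    if isinstance(error, TimeoutError):
        return ("Firewall blocking",
                "Add inbound rule for XIMP collector IP")
    if status == 401:
        return ("Invalid auth credentials",
                "Update auth config in target definition")
    if status == 503:
        return ("Service overloaded",
                "Review service health; reduce scrape frequency")
    return None  # outcome not covered by the table


# Example: a probe that failed with a refused connection.
cause, resolution = diagnose(error=ConnectionRefusedError())
print(f"{cause} -> {resolution}")
```

Outcomes that do not match a row return `None`, so unexpected failures are surfaced for manual review rather than misclassified.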

XIMP metric store disk full

Cause: Metric volume has exceeded the allocated storage for the metric store. This can occur from high cardinality, insufficient retention management, or unexpected metric bursts.

Diagnosis:

Check metric store disk usage

ximp storage status

Resolution (in order of preference):

1. Reduce raw metric retention to free space immediately:
   Reduce raw retention to 15 days (emergency)
   ```
   ximp retention set --type metrics-raw --duration 15d
   ```
2. Identify and drop high-cardinality series (see "High cardinality causing metric store performance issues").
3. Expand storage on the metric store node through XDeploy.
4. Add a second metric store node for horizontal capacity.
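
Before shortening retention, it can help to estimate how much space the change is likely to reclaim. The Python sketch below does this under a deliberately simple assumption of steady daily ingestion. The ingest volume and the 30-day starting retention are hypothetical, and the actual space freed depends on when the metric store purges expired data.

```python
def steady_state_gb(daily_ingest_gb, retention_days):
    """Approximate disk usage once retention is fully populated."""
    return daily_ingest_gb * retention_days


def freed_gb(daily_ingest_gb, current_days, new_days):
    """Approximate space reclaimed by shortening retention."""
    return max(0, current_days - new_days) * daily_ingest_gb


# Hypothetical: 20 GB of raw metrics per day, retention cut from 30 to 15 days.
before = steady_state_gb(20, 30)
after = steady_state_gb(20, 15)
reclaimed = freed_gb(20, current_days=30, new_days=15)
print(before, after, reclaimed)
```

If the estimate shows that retention alone cannot bring usage under the allocated capacity, the remaining steps (dropping series or adding storage) are needed as well.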

Dashboard shows 'No data' for a metric

Cause: The scrape target is down, the agent is offline, or the metric name has changed after a software update.

Diagnosis:

1. Check target health: `ximp target health --target <URL>`
2. Verify the agent is active: `ximp agent list --node <HOSTNAME>`
3. Search for the metric by prefix to find renamed metrics:
   Search metrics by prefix
   ```
   ximp metric search --prefix xloud_compute_cpu
   ```

If the metric was renamed in a recent software update, update dashboard queries and alert rules to use the new metric name.
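
When a prefix search returns many names, ranking them by similarity to the old name can narrow down the likely replacement. The Python sketch below uses the standard library's `difflib` for this. The metric names are hypothetical examples, and the input list is assumed to come from the output of a prefix search.

```python
import difflib


def rename_candidates(old_name, current_names, prefix, n=3):
    """Return up to n current metric names that resemble old_name."""
    pool = [name for name in current_names if name.startswith(prefix)]
    return difflib.get_close_matches(old_name, pool, n=n, cutoff=0.6)


# Hypothetical metric names after an update.
current = [
    "xloud_compute_cpu_utilization",
    "xloud_compute_cpu_seconds_total",
    "xloud_compute_memory_usage",
    "xloud_network_rx_bytes",
]
matches = rename_candidates("xloud_compute_cpu_usage", current, "xloud_compute_cpu")
print(matches)
```

Treat the result as a shortlist only: confirm the actual rename in the release notes before updating dashboard queries and alert rules.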

Diagnostics Reference

Issue	Diagnostic Command
Cardinality	`ximp metric cardinality top --limit 20`
Log backlog	`ximp log ingest-status`
Target DOWN	`ximp target health --target <URL> --verbose`
Storage usage	`ximp storage status`
Agent offline	`ximp agent list --status offline`
Top log emitters	`ximp log stats top-emitters --last 1h`

Next Steps

Agent Configuration

Review and fix agent configuration that may be causing issues

Retention Policies

Adjust retention settings to address storage pressure

Metric Endpoints

Review and fix scrape target configurations

User Guide Troubleshooting

User-facing issues — alerts not firing, log delays
