Overview

This page covers administrator-level XIMP troubleshooting. For user-facing issues such as alert delivery failures and missing metrics on dashboards, see the XIMP User Guide Troubleshooting page.
Administrator Access Required: The procedures on this page require the admin role. Contact your Xloud administrator if you do not have sufficient permissions.
Prerequisites
  • Administrator credentials with the admin role
  • Access to XIMP CLI and management interfaces

Common Issues

High metric cardinality
Cause: Metric labels with unbounded values (e.g., request IDs, user IDs, or ephemeral container names) can create millions of unique metric series, degrading query performance and consuming excessive storage.
Diagnosis:
List highest-cardinality metric series
ximp metric cardinality top --limit 20
Resolution: Drop or relabel high-cardinality labels in the scrape configuration:
Relabel config — drop high-cardinality label
relabel_configs:
  - source_labels: [request_id]
    action: drop
Apply via:
Apply relabel configuration
ximp target update <TARGET_ID> --relabel-file relabel.yaml
Dropping a label is irreversible for historical data. The label will be absent from future ingested metrics. Consider using labelmap to replace high-cardinality values with aggregate labels instead of dropping them entirely.
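As a rough illustration of why unbounded labels are costly, the following Python sketch counts unique series in a made-up set of samples before and after one label is removed. The sample data, the label names, and the count_series helper are assumptions for this example and are not part of XIMP. The number of series is taken to be the number of distinct label combinations.

```python
from typing import Iterable, Mapping


def count_series(samples: Iterable[Mapping[str, str]],
                 drop: frozenset = frozenset()) -> int:
    """Count distinct label sets, ignoring label names listed in `drop`."""
    unique = set()
    for labels in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        unique.add(kept)
    return len(unique)


# Hypothetical workload: 2 services x 3 status codes, 1000 requests each,
# with every request carrying its own request_id label value.
samples = []
for svc in ("api", "worker"):
    for status in ("200", "404", "500"):
        for _ in range(1000):
            samples.append({
                "service": svc,
                "status": status,
                "request_id": f"req-{len(samples)}",
            })

before = count_series(samples)
after = count_series(samples, drop=frozenset({"request_id"}))
print(f"series with request_id: {before}, without: {after}")
```

In this toy data the per-request label alone accounts for the difference between thousands of series and a handful, which is the pattern the cardinality report is meant to surface.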
Log ingestion backlog
Cause: Log volume exceeds the collector’s processing capacity, causing a write backlog and delayed delivery to the search index.
Diagnosis:
Check ingestion queue depth
ximp log ingest-status
Resolution:
  • Reduce log verbosity on high-volume services (for example, disable DEBUG logging):
    Example: reduce Nova log level
    docker exec nova_api crudini --set /etc/nova/nova.conf DEFAULT debug false
    
  • Increase log collector worker count in the XIMP configuration: Navigate to Monitoring → Administration → Collector Settings → Log Workers
  • Add a second log collector node through XDeploy for horizontal scaling
A single high-verbosity service at DEBUG level can generate more log volume than 100 services at INFO. Identify the top log emitters: ximp log stats top-emitters --last 1h
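The relationship between log volume, collector capacity, and backlog can be sketched with simple arithmetic. The Python below is an illustration only: the function names and the rates are hypothetical and do not come from XIMP. It assumes steady ingest and processing rates measured in log lines per second.

```python
def backlog_after(seconds: float, ingest_rate: float, process_rate: float,
                  initial: float = 0.0) -> float:
    """Queue depth in log lines after `seconds`; never drops below zero."""
    return max(0.0, initial + (ingest_rate - process_rate) * seconds)


def drain_time(backlog: float, ingest_rate: float, process_rate: float) -> float:
    """Seconds needed to clear `backlog`; infinite if processing cannot keep up."""
    if backlog <= 0:
        return 0.0
    if process_rate <= ingest_rate:
        return float("inf")
    return backlog / (process_rate - ingest_rate)


# Hypothetical figures: the collector handles 10,000 lines/s while
# services emit 12,000 lines/s for ten minutes.
queued = backlog_after(600, ingest_rate=12_000, process_rate=10_000)

# After verbosity is reduced, services emit 6,000 lines/s.
recovery = drain_time(queued, ingest_rate=6_000, process_rate=10_000)
print(f"backlog: {queued:.0f} lines, drains in {recovery:.0f} s")
```

The same functions show why adding workers or a second collector helps: raising process_rate widens the gap that drains the queue.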
Scrape target DOWN
Cause: The scrape target is unreachable because of firewall blocking, a stopped service, or an authentication failure.
Diagnosis:
Check specific target health
ximp target health --target <URL> --verbose
Common causes:
Symptom | Cause | Resolution
Connection refused | Service not running on target port | Verify service is running; check port
Timeout | Firewall blocking | Add inbound rule for XIMP collector IP
401 Unauthorized | Invalid auth credentials | Update auth config in target definition
503 Service Unavailable | Service overloaded | Review service health; reduce scrape frequency
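When the XIMP CLI is not at hand, the symptoms in the table can be reproduced by probing the endpoint directly from the collector host. The following Python sketch uses only the standard library. The classify and probe functions are hypothetical helpers written for this page, not XIMP tooling, and the symptom strings simply mirror the table.

```python
import socket
import urllib.error
import urllib.request


def classify(exc: BaseException) -> str:
    """Map a probe failure onto the symptom names used in the table."""
    # HTTPError must be checked first because it subclasses URLError.
    if isinstance(exc, urllib.error.HTTPError):
        if exc.code == 401:
            return "401 Unauthorized"
        if exc.code == 503:
            return "503 Service Unavailable"
        return f"HTTP {exc.code}"
    # URLError wraps the underlying socket error in `.reason`.
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    if isinstance(reason, ConnectionRefusedError):
        return "Connection refused"
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "Timeout"
    return "Unknown"


def probe(url: str, timeout: float = 5.0) -> str:
    """Fetch `url` once and report either success or the matching symptom."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"OK ({resp.status})"
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return classify(exc)
```

Calling probe with the scrape URL of the failing target from the collector node yields one of the symptom strings, which can then be looked up in the table.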
Metric store storage full
Cause: Metric volume has exceeded the storage allocated to the metric store. This can result from high cardinality, insufficient retention management, or unexpected metric bursts.
Diagnosis:
Check metric store disk usage
ximp storage status
Resolution (in order of preference):
  1. Reduce raw metric retention to free space immediately:
    Reduce raw retention to 15 days (emergency)
    ximp retention set --type metrics-raw --duration 15d
    
  2. Identify and drop high-cardinality series (see the cardinality diagnosis and relabel steps above)
  3. Expand storage on the metric store node through XDeploy
  4. Add a second metric store node for horizontal capacity
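To gauge how much each option might help, a back-of-the-envelope estimate of raw metric storage is often enough. The Python sketch below is illustrative only: the formula is a simplification, and the series count, scrape interval, starting retention, and bytes-per-sample figure are all hypothetical values, not XIMP defaults.

```python
def estimated_bytes(series: int, scrape_interval_s: int,
                    retention_days: int, bytes_per_sample: float) -> float:
    """Rough raw-metric footprint: series x samples per day x days x sample size."""
    samples_per_day = 86_400 / scrape_interval_s
    return series * samples_per_day * retention_days * bytes_per_sample


# Hypothetical deployment: 1,000,000 series scraped every 30 s,
# assuming about 2 bytes per stored sample after compression.
at_30_days = estimated_bytes(1_000_000, 30, 30, 2.0)
at_15_days = estimated_bytes(1_000_000, 30, 15, 2.0)

# Same retention, but with cardinality cut to a tenth of the series.
fewer_series = estimated_bytes(100_000, 30, 30, 2.0)

for label, size in [("30 days", at_30_days), ("15 days", at_15_days),
                    ("30 days, 10x fewer series", fewer_series)]:
    print(f"{label}: {size / 1e9:.1f} GB")
```

Under these assumptions, halving retention halves the footprint, while removing high-cardinality series reduces it in proportion to the series removed. This is why the two steps are listed before adding hardware.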
Missing metrics
Cause: The scrape target is down, the agent is offline, or the metric name has changed after a software update.
Diagnosis:
  1. Check target health: ximp target health --target <URL>
  2. Verify agent is active: ximp agent list --node <HOSTNAME>
  3. Search for the metric by prefix to find renamed metrics:
    Search metrics by prefix
    ximp metric search --prefix xloud_compute_cpu
    
If the metric was renamed in a recent software update, update dashboard queries and alert rules to use the new metric name.
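The prefix search in step 3 amounts to a simple filter over the known metric names. The Python sketch below shows that logic on a hypothetical list of names. The metric names and both helper functions are invented for illustration and do not reflect actual XIMP metrics or output.

```python
from typing import Iterable, List


def find_by_prefix(names: Iterable[str], prefix: str) -> List[str]:
    """Return all metric names starting with `prefix`, sorted."""
    return sorted(n for n in names if n.startswith(prefix))


def check_metric(names: Iterable[str], expected: str, prefix: str) -> str:
    """Report whether `expected` exists, or list same-prefix rename candidates."""
    names = list(names)
    if expected in names:
        return f"{expected} is present"
    candidates = find_by_prefix(names, prefix)
    if candidates:
        return f"{expected} missing; possible renames: {', '.join(candidates)}"
    return f"{expected} missing; no metrics share prefix {prefix!r}"


# Hypothetical metric names after a software update.
current = [
    "xloud_compute_cpu_utilization",
    "xloud_compute_memory_usage",
    "xloud_storage_iops",
]

print(check_metric(current, "xloud_compute_cpu_usage", "xloud_compute_cpu"))
```

A dashboard query that still references the old name would need to be updated to whichever candidate the search returns.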

Diagnostics Reference

Issue | Diagnostic Command
Cardinality | ximp metric cardinality top --limit 20
Log backlog | ximp log ingest-status
Target DOWN | ximp target health --verbose
Storage usage | ximp storage status
Agent offline | ximp agent list --status offline
Top log emitters | ximp log stats top-emitters --last 1h

Next Steps

Agent Configuration

Review and fix agent configuration that may be causing issues

Retention Policies

Adjust retention settings to address storage pressure

Metric Endpoints

Review and fix scrape target configurations

User Guide Troubleshooting

User-facing issues — alerts not firing, log delays