Overview

Prometheus is the primary metrics backend for Xloud environments. It collects time-series metrics from compute nodes, storage clusters, and deployed services through scrape targets and service discovery. Alertmanager receives rule evaluation results from Prometheus and routes alert notifications — including webhook signals that trigger Xloud Orchestration auto-scaling policies. Prometheus replaces legacy telemetry stacks (Ceilometer/Aodh) for metric collection and alarm-based scaling in Xloud deployments.
Prerequisites
  • Prometheus 2.40 or later deployed (included in XIMP monitoring stack)
  • Alertmanager 0.25 or later
  • Node exporter deployed on all instances to be monitored
  • Network access from the Prometheus host to scrape targets on port 9100 (node exporter)

Architecture

Node exporters on the instances and the Ceph exporter expose metrics over HTTP; Prometheus scrapes them on its configured interval, evaluates alert rules, and forwards firing alerts to Alertmanager, which routes notifications to email receivers and to Xloud Orchestration scaling webhooks.
Prometheus Configuration

Base Configuration

prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
  external_labels:
    environment: "production"
    region: "RegionOne"

rule_files:
  - "/etc/prometheus/rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager:9093"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node_exporter"
    static_configs:
      - targets:
          - "10.0.1.71:9100"
          - "10.0.1.72:9100"
          - "10.0.1.75:9100"

  - job_name: "ceph"
    static_configs:
      - targets: ["10.0.1.71:9095"]
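
The authoritative way to validate this file is promtool (promtool check config /etc/prometheus/prometheus.yml). As a lightweight pre-check, a stdlib-only sketch like the following can catch malformed static targets before a reload; the target list mirrors the static_configs above:

```python
# Sketch: sanity-check static scrape targets (ip:port) before reloading
# Prometheus. This is a convenience pre-check only; promtool remains the
# authoritative validator for the full configuration.
import ipaddress

def check_target(target: str) -> bool:
    """Return True if target looks like a valid ip:port pair."""
    host, sep, port = target.rpartition(":")
    if not sep or not port.isdigit() or not 0 < int(port) < 65536:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True

# Targets from the static_configs above.
targets = ["10.0.1.71:9100", "10.0.1.72:9100", "10.0.1.75:9100", "10.0.1.71:9095"]
assert all(check_target(t) for t in targets)
```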

Service Discovery via Xloud API

Use the openstack_sd_configs scrape configuration to automatically discover instances by project and assign labels from instance metadata:
scrape-config-discovery.yml
scrape_configs:
  - job_name: "xloud_instances"
    openstack_sd_configs:
      - identity_endpoint: "https://api.<your-domain>:5000/v3"
        username: "prometheus"
        password: "{{ OS_PASSWORD }}"
        domain_name: Default
        project_name: monitoring
        region: RegionOne
        role: instance
        port: 9100
        tls_config:
          insecure_skip_verify: false

    relabel_configs:
      - source_labels: [__meta_openstack_instance_name]
        target_label: instance
      - source_labels: [__meta_openstack_project_id]
        target_label: project
      - source_labels: [__meta_openstack_tag_role]
        target_label: role
      - source_labels: [__meta_openstack_instance_status]
        regex: ACTIVE
        action: keep
Tag instances with role=web, role=app, or role=db via instance metadata to enable role-based Prometheus label filtering and targeted alert rule evaluation.
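
The relabel rules above can be pictured as a mapping from discovery meta labels to target labels. The following sketch models that mapping in Python; the meta label values are illustrative examples, and the real transformation is applied by Prometheus itself during service discovery:

```python
# Sketch of the relabel_configs semantics for the xloud_instances job:
# rename meta labels to instance/project/role, and keep only targets whose
# instance status is ACTIVE. Example values are made up for illustration.
from typing import Optional

def relabel(meta: dict) -> Optional[dict]:
    """Apply the keep/rename rules from the xloud_instances job."""
    # action: keep — drop targets whose status does not match ACTIVE
    if meta.get("__meta_openstack_instance_status") != "ACTIVE":
        return None
    return {
        "instance": meta.get("__meta_openstack_instance_name", ""),
        "project": meta.get("__meta_openstack_project_id", ""),
        "role": meta.get("__meta_openstack_tag_role", ""),
    }

active = relabel({
    "__meta_openstack_instance_name": "web-1",
    "__meta_openstack_project_id": "a1b2c3",
    "__meta_openstack_tag_role": "web",
    "__meta_openstack_instance_status": "ACTIVE",
})
assert active == {"instance": "web-1", "project": "a1b2c3", "role": "web"}
assert relabel({"__meta_openstack_instance_status": "SHUTOFF"}) is None
```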

Alert Rules

Infrastructure Alert Rules

/etc/prometheus/rules/infrastructure.yml
groups:
  - name: infrastructure
    interval: 30s
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is unreachable"
          description: "Prometheus has not received a scrape response for 2 minutes."

      - alert: HighCpuUsage
        expr: >
          100 - (avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[2m])
          ) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}% — above 80% threshold."

      - alert: LowCpuUsage
        expr: >
          100 - (avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[10m])
          ) * 100) < 20
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "Low CPU on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}% — below 20% threshold."

      - alert: HighMemoryUsage
        expr: >
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.1f\" }}%."

      - alert: DiskSpaceLow
        expr: >
          (node_filesystem_avail_bytes{mountpoint="/"} /
           node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Root filesystem has {{ $value | printf \"%.1f\" }}% space remaining."
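
The CPU expressions above rely on rate() over the cumulative idle-seconds counter: the per-second rate of the idle counter is the idle fraction, so 100 minus idle percent is utilization. A numeric sketch, with made-up sample values:

```python
# Numeric sketch of the HighCpuUsage expression:
#   100 - rate(node_cpu_seconds_total{mode="idle"}[w]) * 100
# The idle counter accumulates seconds spent idle, so its per-second
# rate over the window is the idle fraction (0..1).

def cpu_utilization_pct(idle_t0: float, idle_t1: float, window_s: float) -> float:
    """Utilization percent from two idle-counter samples over a window."""
    idle_fraction = (idle_t1 - idle_t0) / window_s  # per-second rate, 0..1
    return 100.0 - idle_fraction * 100.0

# Counter grew 24 idle seconds over a 120 s window -> 20% idle, 80% busy.
util = cpu_utilization_pct(1000.0, 1024.0, 120.0)
assert abs(util - 80.0) < 1e-9
```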

Auto-Scaling Alert Rules

Wire these alert rules into Alertmanager webhook receivers to drive Xloud Orchestration scaling policies:
/etc/prometheus/rules/autoscaling.yml
groups:
  - name: autoscaling
    rules:
      - alert: ScaleOutWeb
        expr: >
          avg(1 - rate(node_cpu_seconds_total{mode="idle",role="web"}[2m])) > 0.80
        for: 2m
        labels:
          severity: warning
          action: scale_out
          tier: web
        annotations:
          summary: "Web tier CPU high — scale out"

      - alert: ScaleInWeb
        expr: >
          avg(1 - rate(node_cpu_seconds_total{mode="idle",role="web"}[10m])) < 0.20
        for: 10m
        labels:
          severity: info
          action: scale_in
          tier: web
        annotations:
          summary: "Web tier CPU low — scale in"
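
The asymmetric thresholds and "for:" durations above (scale out on a sustained 2-minute spike, scale in only after 10 quiet minutes) are what prevent flapping. A simplified model of the "for:" gating, with illustrative sample values:

```python
# Simplified model of the "for:" clause: the expression must hold at every
# evaluation across the pending window before the alert fires, so a brief
# CPU spike never triggers a scale-out. Sample series are made up.

def fires(samples: list, threshold: float, for_evals: int) -> bool:
    """True if the last for_evals samples all exceed the threshold."""
    recent = samples[-for_evals:]
    return len(recent) == for_evals and all(s > threshold for s in recent)

# One-off spike: does not fire.
assert not fires([0.3, 0.9, 0.4, 0.5], 0.80, for_evals=2)
# Sustained load: fires.
assert fires([0.4, 0.85, 0.9, 0.95], 0.80, for_evals=2)
```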

Alertmanager Configuration

alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: "default"
  group_by: ["alertname", "tier"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - action = "scale_out"
        - tier = "web"
      receiver: "web-scale-out"
      repeat_interval: 2m

    - matchers:
        - action = "scale_in"
        - tier = "web"
      receiver: "web-scale-in"
      repeat_interval: 12m

    - matchers:
        - severity = "critical"
      receiver: "ops-critical"

receivers:
  - name: "default"
    email_configs:
      - to: "ops@example.com"
        from: "alertmanager@xloud.tech"
        smarthost: "smtp.xloud.tech:587"

  - name: "web-scale-out"
    webhook_configs:
      - url: "<scale_out_url from orchestration stack output>"
        send_resolved: false

  - name: "web-scale-in"
    webhook_configs:
      - url: "<scale_in_url from orchestration stack output>"
        send_resolved: false

  - name: "ops-critical"
    email_configs:
      - to: "oncall@example.com"
        from: "alertmanager@xloud.tech"
        smarthost: "smtp.xloud.tech:587"

inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ["instance"]
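
When a scaling route fires, Alertmanager POSTs a JSON body to the webhook URL. The sketch below parses an example body shaped like the documented webhook payload (version "4") the way an orchestration-side receiver might; the label values are examples from the rules above, not live data:

```python
# Sketch of handling the JSON body Alertmanager POSTs to the scaling
# webhooks. Field names follow the documented Alertmanager webhook payload
# (version "4"); the values here are illustrative examples.
import json

payload = json.loads("""
{
  "version": "4",
  "status": "firing",
  "receiver": "web-scale-out",
  "groupLabels": {"alertname": "ScaleOutWeb", "tier": "web"},
  "commonLabels": {"action": "scale_out", "severity": "warning", "tier": "web"},
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "ScaleOutWeb", "action": "scale_out", "tier": "web"}}
  ]
}
""")

def scaling_action(body: dict):
    """Return (action, tier) for a firing scaling notification, else None."""
    if body.get("status") != "firing":
        return None
    labels = body.get("commonLabels", {})
    if labels.get("action") in ("scale_out", "scale_in"):
        return labels["action"], labels.get("tier")
    return None

assert scaling_action(payload) == ("scale_out", "web")
assert scaling_action({"status": "resolved"}) is None
```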

Useful Queries

  • up: check which scrape targets are reachable
  • rate(node_cpu_seconds_total{mode!="idle"}[5m]): CPU utilization per core
  • node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes: memory availability ratio
  • node_filesystem_avail_bytes{mountpoint="/"}: root filesystem free bytes
  • rate(node_network_receive_bytes_total[5m]): network ingress rate
  • node_load1: 1-minute load average
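
These queries can also be run against the Prometheus HTTP API (GET /api/v1/query?query=up). The sketch below parses a canned response in the documented vector result format, picking out down targets; the samples are examples, not live data:

```python
# Sketch of reading an instant-query result from the Prometheus HTTP API.
# The response below is a canned example of the documented vector result
# format for the query "up"; a real client would fetch it over HTTP.
import json

response = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "instance": "10.0.1.71:9100", "job": "node_exporter"},
       "value": [1700000000.0, "1"]},
      {"metric": {"__name__": "up", "instance": "10.0.1.72:9100", "job": "node_exporter"},
       "value": [1700000000.0, "0"]}
    ]
  }
}
""")

down = [s["metric"]["instance"]
        for s in response["data"]["result"]
        if s["value"][1] == "0"]  # sample values arrive as strings
assert down == ["10.0.1.72:9100"]
```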

Validation

Navigate to http://<prometheus-host>:9090:
  1. Open Status → Targets — all scrape targets show UP state
  2. Open Alerts — configured rules appear with their evaluation state
  3. Run a query: enter up in the expression bar and click Execute
Expected result: every configured target reports state UP with no scrape errors.

Next Steps

Grafana Dashboards

Build operational dashboards using Prometheus as a data source

Auto-Scaling

Wire Alertmanager webhooks into Orchestration scaling policy signal URLs

Wazuh SIEM

Complement Prometheus metrics with Wazuh security event monitoring

XIMP Monitoring

Explore the built-in XIMP monitoring stack that includes Prometheus