Overview

Prometheus is the primary metrics backend for Xloud environments. It collects time-series metrics from compute nodes, storage clusters, and deployed services through scrape targets and service discovery. Alertmanager receives rule evaluation results from Prometheus and routes alert notifications — including webhook signals that trigger Xloud Orchestration auto-scaling policies. Prometheus replaces legacy telemetry stacks (Ceilometer/Aodh) for metric collection and alarm-based scaling in Xloud deployments.
Prerequisites
  • Prometheus 2.40 or later deployed (included in XIMP monitoring stack)
  • Alertmanager 0.25 or later
  • Node exporter deployed on all instances to be monitored
  • Network access from the Prometheus host to scrape targets on port 9100 (node exporter)

Architecture

Node exporters on the instances and the Ceph exporter expose metrics over HTTP; Prometheus scrapes them on its configured interval, evaluates alert rules, and forwards firing alerts to Alertmanager, which routes notifications to email receivers and to Xloud Orchestration scaling webhooks.
Prometheus Configuration

Base Configuration

prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
  external_labels:
    environment: "production"
    region: "RegionOne"

rule_files:
  - "/etc/prometheus/rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager:9093"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node_exporter"
    static_configs:
      - targets:
          - "10.0.1.71:9100"
          - "10.0.1.72:9100"
          - "10.0.1.75:9100"

  - job_name: "ceph"
    static_configs:
      - targets: ["10.0.1.71:9095"]
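
The authoritative way to validate this file is promtool (promtool check config /etc/prometheus/prometheus.yml). As a lightweight pre-check, a stdlib-only sketch like the following can catch malformed static targets before a reload; the target list mirrors the static_configs above:

```python
# Sketch: sanity-check static scrape targets (ip:port) before reloading
# Prometheus. This is a convenience pre-check only; promtool remains the
# authoritative validator for the full configuration.
import ipaddress

def check_target(target: str) -> bool:
    """Return True if target looks like a valid ip:port pair."""
    host, sep, port = target.rpartition(":")
    if not sep or not port.isdigit() or not 0 < int(port) < 65536:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True

# Targets from the static_configs above.
targets = ["10.0.1.71:9100", "10.0.1.72:9100", "10.0.1.75:9100", "10.0.1.71:9095"]
assert all(check_target(t) for t in targets)
```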

Service Discovery via Xloud API

Use the openstack_sd_configs scrape configuration to automatically discover instances by project and assign labels from instance metadata:
scrape-config-discovery.yml
scrape_configs:
  - job_name: "xloud_instances"
    openstack_sd_configs:
      - identity_endpoint: "https://api.<your-domain>:5000/v3"
        username: "prometheus"
        password: "{{ OS_PASSWORD }}"
        domain_name: Default
        project_name: monitoring
        region: RegionOne
        role: instance
        port: 9100
        tls_config:
          insecure_skip_verify: false

    relabel_configs:
      - source_labels: [__meta_openstack_instance_name]
        target_label: instance
      - source_labels: [__meta_openstack_project_id]
        target_label: project
      - source_labels: [__meta_openstack_tag_role]
        target_label: role
      - source_labels: [__meta_openstack_instance_status]
        regex: ACTIVE
        action: keep
Tag instances with role=web, role=app, or role=db via instance metadata to enable role-based Prometheus label filtering and targeted alert rule evaluation.
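
The relabel rules above can be pictured as a mapping from discovery meta labels to target labels. The following sketch models that mapping in Python; the meta label values are illustrative examples, and the real transformation is applied by Prometheus itself during service discovery:

```python
# Sketch of the relabel_configs semantics for the xloud_instances job:
# rename meta labels to instance/project/role, and keep only targets whose
# instance status is ACTIVE. Example values are made up for illustration.
from typing import Optional

def relabel(meta: dict) -> Optional[dict]:
    """Apply the keep/rename rules from the xloud_instances job."""
    # action: keep — drop targets whose status does not match ACTIVE
    if meta.get("__meta_openstack_instance_status") != "ACTIVE":
        return None
    return {
        "instance": meta.get("__meta_openstack_instance_name", ""),
        "project": meta.get("__meta_openstack_project_id", ""),
        "role": meta.get("__meta_openstack_tag_role", ""),
    }

active = relabel({
    "__meta_openstack_instance_name": "web-1",
    "__meta_openstack_project_id": "a1b2c3",
    "__meta_openstack_tag_role": "web",
    "__meta_openstack_instance_status": "ACTIVE",
})
assert active == {"instance": "web-1", "project": "a1b2c3", "role": "web"}
assert relabel({"__meta_openstack_instance_status": "SHUTOFF"}) is None
```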

Alert Rules

Infrastructure Alert Rules

/etc/prometheus/rules/infrastructure.yml
groups:
  - name: infrastructure
    interval: 30s
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is unreachable"
          description: "Prometheus has not received a scrape response for 2 minutes."

      - alert: HighCpuUsage
        expr: >
          100 - (avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[2m])
          ) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}% — above 80% threshold."

      - alert: LowCpuUsage
        expr: >
          100 - (avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[10m])
          ) * 100) < 20
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "Low CPU on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}% — below 20% threshold."

      - alert: HighMemoryUsage
        expr: >
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.1f\" }}%."

      - alert: DiskSpaceLow
        expr: >
          (node_filesystem_avail_bytes{mountpoint="/"} /
           node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Root filesystem has {{ $value | printf \"%.1f\" }}% space remaining."
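
The CPU expressions above rely on rate() over the cumulative idle-seconds counter: the per-second rate of the idle counter is the idle fraction, so 100 minus idle percent is utilization. A numeric sketch, with made-up sample values:

```python
# Numeric sketch of the HighCpuUsage expression:
#   100 - rate(node_cpu_seconds_total{mode="idle"}[w]) * 100
# The idle counter accumulates seconds spent idle, so its per-second
# rate over the window is the idle fraction (0..1).

def cpu_utilization_pct(idle_t0: float, idle_t1: float, window_s: float) -> float:
    """Utilization percent from two idle-counter samples over a window."""
    idle_fraction = (idle_t1 - idle_t0) / window_s  # per-second rate, 0..1
    return 100.0 - idle_fraction * 100.0

# Counter grew 24 idle seconds over a 120 s window -> 20% idle, 80% busy.
util = cpu_utilization_pct(1000.0, 1024.0, 120.0)
assert abs(util - 80.0) < 1e-9
```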

Auto-Scaling Alert Rules

Wire these alert rules into Alertmanager webhook receivers to drive Xloud Orchestration scaling policies:
/etc/prometheus/rules/autoscaling.yml
groups:
  - name: autoscaling
    rules:
      - alert: ScaleOutWeb
        expr: >
          avg(1 - rate(node_cpu_seconds_total{mode="idle",role="web"}[2m])) > 0.80
        for: 2m
        labels:
          severity: warning
          action: scale_out
          tier: web
        annotations:
          summary: "Web tier CPU high — scale out"

      - alert: ScaleInWeb
        expr: >
          avg(1 - rate(node_cpu_seconds_total{mode="idle",role="web"}[10m])) < 0.20
        for: 10m
        labels:
          severity: info
          action: scale_in
          tier: web
        annotations:
          summary: "Web tier CPU low — scale in"
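
The asymmetric thresholds and "for:" durations above (scale out on a sustained 2-minute spike, scale in only after 10 quiet minutes) are what prevent flapping. A simplified model of the "for:" gating, with illustrative sample values:

```python
# Simplified model of the "for:" clause: the expression must hold at every
# evaluation across the pending window before the alert fires, so a brief
# CPU spike never triggers a scale-out. Sample series are made up.

def fires(samples: list, threshold: float, for_evals: int) -> bool:
    """True if the last for_evals samples all exceed the threshold."""
    recent = samples[-for_evals:]
    return len(recent) == for_evals and all(s > threshold for s in recent)

# One-off spike: does not fire.
assert not fires([0.3, 0.9, 0.4, 0.5], 0.80, for_evals=2)
# Sustained load: fires.
assert fires([0.4, 0.85, 0.9, 0.95], 0.80, for_evals=2)
```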

Alertmanager Configuration

alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: "default"
  group_by: ["alertname", "tier"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - action = "scale_out"
        - tier = "web"
      receiver: "web-scale-out"
      repeat_interval: 2m

    - matchers:
        - action = "scale_in"
        - tier = "web"
      receiver: "web-scale-in"
      repeat_interval: 12m

    - matchers:
        - severity = "critical"
      receiver: "ops-critical"

receivers:
  - name: "default"
    email_configs:
      - to: "ops@example.com"
        from: "alertmanager@xloud.tech"
        smarthost: "smtp.xloud.tech:587"

  - name: "web-scale-out"
    webhook_configs:
      - url: "<scale_out_url from orchestration stack output>"
        send_resolved: false

  - name: "web-scale-in"
    webhook_configs:
      - url: "<scale_in_url from orchestration stack output>"
        send_resolved: false

  - name: "ops-critical"
    email_configs:
      - to: "oncall@example.com"
        from: "alertmanager@xloud.tech"
        smarthost: "smtp.xloud.tech:587"

inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ["instance"]
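
When a scaling route fires, Alertmanager POSTs a JSON body to the webhook URL. The sketch below parses an example body shaped like the documented webhook payload (version "4") the way an orchestration-side receiver might; the label values are examples from the rules above, not live data:

```python
# Sketch of handling the JSON body Alertmanager POSTs to the scaling
# webhooks. Field names follow the documented Alertmanager webhook payload
# (version "4"); the values here are illustrative examples.
import json

payload = json.loads("""
{
  "version": "4",
  "status": "firing",
  "receiver": "web-scale-out",
  "groupLabels": {"alertname": "ScaleOutWeb", "tier": "web"},
  "commonLabels": {"action": "scale_out", "severity": "warning", "tier": "web"},
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "ScaleOutWeb", "action": "scale_out", "tier": "web"}}
  ]
}
""")

def scaling_action(body: dict):
    """Return (action, tier) for a firing scaling notification, else None."""
    if body.get("status") != "firing":
        return None
    labels = body.get("commonLabels", {})
    if labels.get("action") in ("scale_out", "scale_in"):
        return labels["action"], labels.get("tier")
    return None

assert scaling_action(payload) == ("scale_out", "web")
assert scaling_action({"status": "resolved"}) is None
```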

Useful Queries

  • up: check which scrape targets are reachable
  • rate(node_cpu_seconds_total{mode!="idle"}[5m]): CPU utilization per core
  • node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes: memory availability ratio
  • node_filesystem_avail_bytes{mountpoint="/"}: root filesystem free bytes
  • rate(node_network_receive_bytes_total[5m]): network ingress rate
  • node_load1: 1-minute load average
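
These queries can also be run against the Prometheus HTTP API (GET /api/v1/query?query=up). The sketch below parses a canned response in the documented vector result format, picking out down targets; the samples are examples, not live data:

```python
# Sketch of reading an instant-query result from the Prometheus HTTP API.
# The response below is a canned example of the documented vector result
# format for the query "up"; a real client would fetch it over HTTP.
import json

response = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "instance": "10.0.1.71:9100", "job": "node_exporter"},
       "value": [1700000000.0, "1"]},
      {"metric": {"__name__": "up", "instance": "10.0.1.72:9100", "job": "node_exporter"},
       "value": [1700000000.0, "0"]}
    ]
  }
}
""")

down = [s["metric"]["instance"]
        for s in response["data"]["result"]
        if s["value"][1] == "0"]  # sample values arrive as strings
assert down == ["10.0.1.72:9100"]
```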

Validation

Navigate to http://<prometheus-host>:9090:
  1. Open Status → Targets — all scrape targets show UP state
  2. Open Alerts — configured rules appear with their evaluation state
  3. Run a query: enter up in the expression bar and click Execute
Expected result: every configured target reports state UP with no scrape errors.

Next Steps

Grafana Dashboards

Build operational dashboards using Prometheus as a data source

Auto-Scaling

Wire Alertmanager webhooks into Orchestration scaling policy signal URLs

Wazuh SIEM

Complement Prometheus metrics with Wazuh security event monitoring

XIMP Monitoring

Explore the built-in XIMP monitoring stack that includes Prometheus