Prometheus

Overview

Prometheus is an open-source monitoring and alerting system that collects and stores metrics as time-series data. Originally developed at SoundCloud, it has become the de facto standard for monitoring cloud-native infrastructure.

Philosophy

"Monitoring and alerting made reliable - Collectez, stockez et alertez sur vos métriques système et applicatives."

Key advantages

Pull-based architecture

  • Scraping: active collection of metrics from targets
  • Service discovery: automatic discovery of scrape targets
  • Robustness: no push dependency on the application side
  • Scalability: decentralized architecture

Powerful data model

  • Time series: timestamped samples
  • Labels: multiple dimensions per metric
  • PromQL: an expressive query language
  • Aggregations: real-time computation
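
These ideas can be made concrete in a few lines. Below is an illustrative Python sketch (not Prometheus internals) of the data model: a series is identified by a metric name plus a label set, and a query selects series by label matching:

```python
# A sample is (timestamp, value); a series is identified by name + labels.
series = {
    ("http_requests_total", frozenset({("method", "GET"), ("status", "200")})): [
        (1700000000, 1230.0),
        (1700000015, 1234.0),
    ],
    ("http_requests_total", frozenset({("method", "POST"), ("status", "500")})): [
        (1700000000, 3.0),
    ],
}

def select(name, **matchers):
    """Return all series matching the metric name and label equalities."""
    out = {}
    for (metric, labels), samples in series.items():
        if metric == name and all((k, v) in labels for k, v in matchers.items()):
            out[(metric, labels)] = samples
    return out

# Label matching narrows the selection from two series to one
assert len(select("http_requests_total")) == 2
assert len(select("http_requests_total", method="GET")) == 1
```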

Rich ecosystem

  • Exporters: 100+ community exporters
  • Grafana: advanced visualization
  • Alertmanager: alert management and routing
  • Federation: multi-cluster monitoring

Architecture

Main components

  • Prometheus Server: collection and storage
  • Exporters: metric exposition
  • Pushgateway: metrics for batch/short-lived jobs
  • Alertmanager: alert management
  • Web UI: query interface

Basic configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'application'
    static_configs:
      - targets: ['app1:8080', 'app2:8080']
    metrics_path: '/metrics'
    scrape_interval: 10s

Metrics and types

Metric types

# Counter - a value that only ever increases
http_requests_total{method="GET", status="200"} 1234

# Gauge - a value that can go up or down
memory_usage_bytes{instance="server1"} 1073741824

# Histogram - distribution of values in cumulative buckets
http_request_duration_seconds_bucket{le="0.1"} 100
http_request_duration_seconds_bucket{le="0.5"} 200
http_request_duration_seconds_bucket{le="+Inf"} 250
http_request_duration_seconds_sum 45.2
http_request_duration_seconds_count 250

# Summary - precomputed quantiles
response_time_seconds{quantile="0.95"} 0.234
response_time_seconds_sum 123.45
response_time_seconds_count 500
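
`histogram_quantile()` later estimates quantiles from these cumulative buckets by linear interpolation inside the bucket containing the target rank (the `+Inf` bucket always equals the total count). An illustrative Python sketch of that estimation, not Prometheus's actual implementation:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (le, count) buckets.

    Buckets must be sorted by upper bound and end with float('inf'),
    mirroring Prometheus's linear interpolation within a bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # cannot interpolate into the +Inf bucket
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

# Buckets from the histogram example above: le=0.1 -> 100, le=0.5 -> 200, +Inf -> 250
buckets = [(0.1, 100), (0.5, 200), (float("inf"), 250)]
print(histogram_quantile(0.5, buckets))  # 0.2: rank 125 falls 25% into the 0.1-0.5 bucket
```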

Integrations

With Laravel

// Prometheus exporter for Laravel (promphp/prometheus_client_php)
use Prometheus\CollectorRegistry;
use Prometheus\Storage\Redis;

class PrometheusMetrics
{
    private $registry;

    public function __construct()
    {
        $this->registry = new CollectorRegistry(new Redis());
    }

    public function incrementRequestCounter(string $method, string $status)
    {
        $counter = $this->registry->getOrRegisterCounter(
            'laravel',
            'http_requests_total',
            'Total HTTP requests',
            ['method', 'status', 'route']
        );

        $counter->incBy(1, [$method, $status, request()->route()?->getName() ?? 'unknown']);
    }

    public function recordResponseTime(float $duration)
    {
        $histogram = $this->registry->getOrRegisterHistogram(
            'laravel',
            'http_request_duration_seconds',
            'HTTP request duration',
            ['route'],
            [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0]
        );

        $histogram->observe($duration, [request()->route()?->getName() ?? 'unknown']);
    }
}

// Middleware that records metrics for every request
class PrometheusMiddleware
{
    public function handle($request, Closure $next)
    {
        $start = microtime(true);
        $response = $next($request);
        $duration = microtime(true) - $start;

        app(PrometheusMetrics::class)->incrementRequestCounter(
            $request->method(),
            $response->getStatusCode()
        );

        app(PrometheusMetrics::class)->recordResponseTime($duration);

        return $response;
    }
}

With Nomad

# Nomad scrape configuration via Consul service discovery
scrape_configs:
  - job_name: 'nomad-servers'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: ['nomad']
        tags: ['server']
    relabel_configs:
      - source_labels: [__meta_consul_address]
        target_label: __address__
        replacement: '${1}:4646'
    metrics_path: /v1/metrics
    params:
      format: ['prometheus']

  - job_name: 'nomad-clients'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: ['nomad-client']
    relabel_configs:
      - source_labels: [__meta_consul_address]
        target_label: __address__
        replacement: '${1}:4646'
    metrics_path: /v1/metrics
    params:
      format: ['prometheus']

With Kubernetes

# ServiceMonitor for the Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: laravel-app
  labels:
    app: laravel-app
spec:
  selector:
    matchLabels:
      app: laravel-app
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s

---
# Service exposing the metrics port
apiVersion: v1
kind: Service
metadata:
  name: laravel-app-metrics
  labels:
    app: laravel-app
spec:
  ports:
  - port: 9090
    name: metrics
    targetPort: 9090
  selector:
    app: laravel-app

With GitLab CI

# .gitlab-ci.yml - pipeline monitoring
stages:
  - test
  - deploy
  - monitor

monitor_deployment:
  stage: monitor
  script:
    - |
      # Push deployment metrics to the Pushgateway
      curl -X POST http://pushgateway:9091/metrics/job/gitlab-deploy/instance/$CI_JOB_ID \
        --data-binary @- <<EOF
      deployment_duration_seconds{job="$CI_JOB_NAME",pipeline="$CI_PIPELINE_ID"} $(( $(date +%s) - $(date -d "$CI_JOB_STARTED_AT" +%s) ))
      deployment_status{job="$CI_JOB_NAME",pipeline="$CI_PIPELINE_ID",status="success"} 1
      EOF
  after_script:
    - echo "Metrics pushed to Prometheus"

PromQL queries

Basic queries

# HTTP request rate per second
rate(http_requests_total[5m])

# Average memory usage per instance
avg by (instance) (memory_usage_bytes)

# Top 5 slowest endpoints (average latency over 5 minutes)
topk(5, sum by (route) (rate(http_request_duration_seconds_sum[5m]))
  / sum by (route) (rate(http_request_duration_seconds_count[5m])))

# Percentage of 5xx errors
sum(rate(http_requests_total{status=~"5.."}[5m])) / 
sum(rate(http_requests_total[5m])) * 100

# Linear prediction over the next 4 hours
predict_linear(disk_usage_bytes[1h], 4*3600)
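
`rate()` works by summing the increases between consecutive samples in the window, treating any decrease as a counter reset that restarts from zero, then dividing by the elapsed time (the real implementation additionally extrapolates to the window boundaries). A simplified Python sketch:

```python
def rate(samples):
    """Per-second rate of a counter over (timestamp, value) samples,
    counting from zero again whenever the counter decreases (a reset)."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur  # reset: count from 0
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# 100 requests over 50 s, with a counter reset between t=25 and t=30
samples = [(0, 0), (25, 60), (30, 10), (50, 40)]
print(rate(samples))  # 2.0 req/s: (60 + 10 + 30) / 50
```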

Advanced queries

# SLI - 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Anomaly detection: value deviates more than 2 standard deviations from the daily mean
abs(memory_usage_bytes - avg_over_time(memory_usage_bytes[1d])) > 
2 * stddev_over_time(memory_usage_bytes[1d])

# Correlating CPU usage and latency (joined on instance)
increase(cpu_usage_seconds_total[5m]) * on(instance) 
group_left histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Alerting

Alert rules

# alerts.yml
groups:
- name: application.rules
  rules:
  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

  - alert: HighLatency
    expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High latency detected"
      description: "95th percentile latency is {{ $value }}s"

  - alert: ServiceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Service {{ $labels.instance }} is down"
      description: "Service has been down for more than 1 minute"

- name: infrastructure.rules
  rules:
  - alert: HighCPUUsage
    expr: 100 - (avg(irate(cpu_usage_idle[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage"
      description: "CPU usage is above 80% for 5 minutes"

  - alert: LowDiskSpace
    expr: (disk_free_bytes / disk_total_bytes) * 100 < 10
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space on {{ $labels.instance }}"
      description: "Only {{ $value | humanizePercentage }} disk space remaining"

Popular exporters

Node Exporter

# Installation
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
sudo cp node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/

# systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
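
Once started, node_exporter serves plain-text metrics on port 9100 (`curl localhost:9100/metrics`). A simplified Python sketch of parsing that text exposition format into (name, labels, value) tuples, ignoring `HELP`/`TYPE` comments and edge cases such as escaped label values and timestamps:

```python
import re

LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')

def parse_exposition(text):
    """Parse Prometheus text exposition into (name, labels dict, float) tuples."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        m = LINE.match(line)
        if not m:
            continue
        labels = {}
        if m.group("labels"):
            for pair in m.group("labels").split(","):
                key, _, val = pair.partition("=")
                labels[key.strip()] = val.strip().strip('"')
        yield m.group("name"), labels, float(m.group("value"))

sample = """# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""
print(list(parse_exposition(sample)))
```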

Custom application

# Python exporter using the prometheus_client library
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import random

# Metric definitions
REQUEST_COUNT = Counter('app_requests_total', 'Total requests', ['method', 'endpoint'])
REQUEST_LATENCY = Histogram('app_request_duration_seconds', 'Request latency')
ACTIVE_USERS = Gauge('app_active_users', 'Number of active users')

def process_request(method, endpoint):
    REQUEST_COUNT.labels(method=method, endpoint=endpoint).inc()

    with REQUEST_LATENCY.time():
        # Simulate request processing
        time.sleep(random.uniform(0.1, 0.5))

    ACTIVE_USERS.set(random.randint(50, 200))

if __name__ == '__main__':
    start_http_server(8000)
    print("Metrics server started on port 8000")

    while True:
        process_request('GET', '/api/users')
        time.sleep(1)

Deployment with Ansible

- name: Deploy Prometheus
  hosts: monitoring
  become: yes

  tasks:
    - name: Create prometheus user
      user:
        name: prometheus
        system: yes
        shell: /bin/false
        home: /var/lib/prometheus

    - name: Download and install Prometheus
      unarchive:
        src: "https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz"
        dest: /tmp
        remote_src: yes
        creates: /tmp/prometheus-2.45.0.linux-amd64

    - name: Copy Prometheus binary
      copy:
        src: /tmp/prometheus-2.45.0.linux-amd64/prometheus
        dest: /usr/local/bin/prometheus
        mode: '0755'
        owner: prometheus
        group: prometheus
        remote_src: yes

    - name: Create configuration directory
      file:
        path: /etc/prometheus
        state: directory
        owner: prometheus
        group: prometheus

    - name: Deploy Prometheus configuration
      template:
        src: prometheus.yml.j2
        dest: /etc/prometheus/prometheus.yml
        owner: prometheus
        group: prometheus
      notify: restart prometheus

    - name: Create systemd service
      template:
        src: prometheus.service.j2
        dest: /etc/systemd/system/prometheus.service
      notify:
        - reload systemd
        - restart prometheus

  handlers:
    - name: reload systemd
      systemd:
        daemon_reload: yes

    - name: restart prometheus
      systemd:
        name: prometheus
        state: restarted
        enabled: yes

Storage and retention

Retention configuration

# Advanced configuration (prometheus.yml)
global:
  scrape_interval: 15s
  external_labels:
    cluster: 'production'
    replica: 'A'

# Data retention is configured with command-line flags, not in prometheus.yml:
#   --storage.tsdb.retention.time=90d
#   --storage.tsdb.retention.size=50GB

# Remote write for long-term storage
remote_write:
  - url: "https://cortex.example.com/api/prom/push"
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/cortex_password

Performance optimization

High-performance configuration

# Optimizations
global:
  scrape_interval: 30s  # a longer interval reduces load
  evaluation_interval: 30s

# Per-job tuning
scrape_configs:
  - job_name: 'high-cardinality-app'
    scrape_interval: 60s  # scrape high-cardinality targets less often
    sample_limit: 50000   # cap the number of samples per scrape

# Relabeling to reduce cardinality (belongs under a scrape_config)
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_version]
    regex: '(\d+\.\d+)\..*'
    target_label: version
    replacement: '${1}'  # keep only major.minor
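
A replace-style relabel rule joins the source label values, matches the anchored regex against the result, and writes the expanded replacement into the target label, leaving the labels untouched when the regex does not match. An illustrative Python sketch using `re` in place of RE2 (`${1}` corresponds to capture group 1):

```python
import re

def relabel(labels, source_labels, regex, target_label, replacement):
    """Apply a Prometheus-style 'replace' relabel rule to a label dict."""
    value = ";".join(labels.get(l, "") for l in source_labels)  # default separator
    m = re.fullmatch(regex, value)  # Prometheus anchors the regex
    if m is None:
        return labels  # no match: rule is a no-op
    out = dict(labels)
    out[target_label] = m.expand(replacement.replace("${1}", r"\1"))
    return out

labels = {"__meta_kubernetes_pod_label_version": "2.45.0"}
result = relabel(labels, ["__meta_kubernetes_pod_label_version"],
                 r"(\d+\.\d+)\..*", "version", "${1}")
print(result["version"])  # 2.45
```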

Resources