Prometheus
Overview
Prometheus is an open-source monitoring and alerting system that collects and stores metrics as time-series data. Originally developed at SoundCloud and now a graduated CNCF project, it has become a de facto standard for monitoring cloud-native infrastructure.
Philosophy
"Monitoring and alerting made reliable: collect, store, and alert on your system and application metrics."
Key advantages
Pull-based architecture
- Scraping: Prometheus actively pulls metrics from its targets
- Service discovery: automatic target discovery (static files, Consul, Kubernetes; see the sketch below)
- Robustness: applications only expose an HTTP endpoint, with no dependency on the monitoring stack
- Scalability: scraping can be sharded and federated across servers
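As a sketch of the pull model, the excerpt below combines a static target with file-based service discovery, where Prometheus re-reads the target file without a restart (the file path and job names are illustrative assumptions):
# prometheus.yml (excerpt) - static target plus file-based service discovery
scrape_configs:
  - job_name: 'static-demo'
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'file-sd-demo'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets.json'   # hypothetical path, re-read on change
        refresh_interval: 1m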
Powerful data model
- Time series: timestamped samples identified by a metric name and labels
- Labels: multiple dimensions per series
- PromQL: expressive query language
- Aggregations: real-time computations over series (see the recording-rule sketch below)
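As an illustration of labels plus aggregation, a recording rule can precompute a per-job request rate; a minimal sketch using standard PromQL (metric and rule names are assumptions):
# rules.yml (excerpt) - precompute an aggregation over labelled series
groups:
  - name: example.rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))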
Rich ecosystem
- Exporters: 100+ community exporters
- Grafana: advanced visualization
- Alertmanager: alert deduplication, routing, and notification
- Federation: multi-cluster monitoring (see the sketch below)
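Federation lets a global Prometheus scrape pre-aggregated series from downstream servers through the /federate endpoint; a minimal sketch, with hostnames and match expressions as placeholders:
# Global Prometheus pulling aggregated series from two downstream servers
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ['prometheus-eu:9090', 'prometheus-us:9090']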
Architecture
Main components
- Prometheus Server: scraping and storage
- Exporters: metric exposition
- Pushgateway: metrics from batch and short-lived jobs
- Alertmanager: alert routing and notification
- Web UI: query and graphing interface
Basic configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'application'
    static_configs:
      - targets: ['app1:8080', 'app2:8080']
    metrics_path: '/metrics'
    scrape_interval: 10s
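Before reloading Prometheus, the configuration and rule files can be validated with promtool, which ships with Prometheus (the paths assume the layout above):
# Validate the main configuration and the rule files it references
promtool check config /etc/prometheus/prometheus.yml
promtool check rules /etc/prometheus/alerts.yml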
Metrics and types
Metric types
# Counter - a value that only ever increases
http_requests_total{method="GET", status="200"} 1234

# Gauge - a value that can go up or down
memory_usage_bytes{instance="server1"} 1073741824

# Histogram - distribution of observed values in buckets (the +Inf bucket equals the count)
http_request_duration_seconds_bucket{le="0.1"} 100
http_request_duration_seconds_bucket{le="0.5"} 200
http_request_duration_seconds_bucket{le="+Inf"} 250
http_request_duration_seconds_sum 45.2
http_request_duration_seconds_count 250

# Summary - precomputed quantiles
response_time_seconds{quantile="0.95"} 0.234
response_time_seconds_sum 123.45
response_time_seconds_count 500
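On a real /metrics endpoint, each metric family is normally preceded by # HELP and # TYPE lines in the text exposition format, which is how the type is conveyed to Prometheus, for example:
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1234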
Integrations
With Laravel
// Prometheus exporter for Laravel
use Prometheus\CollectorRegistry;
use Prometheus\Storage\Redis;

class PrometheusMetrics
{
    private $registry;

    public function __construct()
    {
        $this->registry = new CollectorRegistry(new Redis());
    }

    public function incrementRequestCounter(string $method, string $status)
    {
        $counter = $this->registry->getOrRegisterCounter(
            'laravel',
            'http_requests_total',
            'Total HTTP requests',
            ['method', 'status', 'route']
        );

        // Fall back to 'unknown' when the request did not match a named route
        $route = request()->route()?->getName() ?? 'unknown';
        $counter->incBy(1, [$method, $status, $route]);
    }

    public function recordResponseTime(float $duration)
    {
        $histogram = $this->registry->getOrRegisterHistogram(
            'laravel',
            'http_request_duration_seconds',
            'HTTP request duration',
            ['route'],
            [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0]
        );

        $histogram->observe($duration, [request()->route()?->getName() ?? 'unknown']);
    }
}

// Middleware for automatic request metrics
class PrometheusMiddleware
{
    public function handle($request, Closure $next)
    {
        $start = microtime(true);
        $response = $next($request);
        $duration = microtime(true) - $start;

        app(PrometheusMetrics::class)->incrementRequestCounter(
            $request->method(),
            (string) $response->getStatusCode()
        );
        app(PrometheusMetrics::class)->recordResponseTime($duration);

        return $response;
    }
}
With Nomad
# Scraping Nomad servers and clients via Consul service discovery
scrape_configs:
  - job_name: 'nomad-servers'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: ['nomad']
        tags: ['server']
    relabel_configs:
      # Rewrite the target to the node address on the Nomad HTTP port (4646)
      - source_labels: [__meta_consul_address]
        target_label: __address__
        replacement: '${1}:4646'
    metrics_path: /v1/metrics
    params:
      format: ['prometheus']

  - job_name: 'nomad-clients'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: ['nomad-client']
    relabel_configs:
      - source_labels: [__meta_consul_address]
        target_label: __address__
        replacement: '${1}:4646'
    metrics_path: /v1/metrics
    params:
      format: ['prometheus']
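Note that the Nomad agents themselves must be configured to serve Prometheus-format metrics on /v1/metrics; a minimal sketch of the agent telemetry stanza (values are illustrative):
# Nomad agent configuration (HCL) - expose Prometheus-format metrics
telemetry {
  collection_interval        = "10s"
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}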
With Kubernetes
# ServiceMonitor for the Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: laravel-app
  labels:
    app: laravel-app
spec:
  selector:
    matchLabels:
      app: laravel-app
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
---
# Service exposing the metrics port
apiVersion: v1
kind: Service
metadata:
  name: laravel-app-metrics
  labels:
    app: laravel-app
spec:
  ports:
    - port: 9090
      name: metrics
      targetPort: 9090
  selector:
    app: laravel-app
With GitLab CI
# .gitlab-ci.yml - pipeline monitoring via the Pushgateway
stages:
  - test
  - deploy
  - monitor

monitor_deployment:
  stage: monitor
  script:
    - |
      # Push deployment metrics; the duration is derived from CI_JOB_STARTED_AT,
      # since CI_JOB_FINISHED_AT is not available while the job is still running (GNU date assumed)
      DURATION=$(( $(date +%s) - $(date -d "$CI_JOB_STARTED_AT" +%s) ))
      curl -X POST http://pushgateway:9091/metrics/job/gitlab-deploy/instance/$CI_JOB_ID \
        --data-binary @- <<EOF
      deployment_duration_seconds{job="$CI_JOB_NAME",pipeline="$CI_PIPELINE_ID"} $DURATION
      deployment_status{job="$CI_JOB_NAME",pipeline="$CI_PIPELINE_ID",status="success"} 1
      EOF
  after_script:
    - echo "Metrics pushed to the Pushgateway"
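Metrics pushed to the Pushgateway persist until they are overwritten or deleted, so a cleanup step is often added once the data is no longer relevant; for example (same job/instance grouping as above):
# Remove the pushed group once it is no longer needed
curl -X DELETE http://pushgateway:9091/metrics/job/gitlab-deploy/instance/$CI_JOB_ID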
PromQL queries
Basic queries
# HTTP request rate per second
rate(http_requests_total[5m])

# Average memory usage per instance
avg by (instance) (memory_usage_bytes)

# Top 5 slowest routes
topk(5, avg by (route) (http_request_duration_seconds))

# Percentage of 5xx errors
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

# Linear prediction of disk usage 4 hours ahead
predict_linear(disk_usage_bytes[1h], 4*3600)
Advanced queries
# SLI - 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Simple anomaly detection: more than 2 standard deviations away from the daily average
abs(memory_usage_bytes - avg_over_time(memory_usage_bytes[1d])) >
2 * stddev_over_time(memory_usage_bytes[1d])

# Combine CPU usage with p95 latency per instance
increase(cpu_usage_seconds_total[5m]) * on(instance)
group_left histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Alerting
Alert rules
# alerts.yml
groups:
  - name: application.rules
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is {{ $value }}s"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "Service has been down for more than 1 minute"

  - name: infrastructure.rules
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg(irate(cpu_usage_idle[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage is above 80% for 5 minutes"

      - alert: LowDiskSpace
        expr: (disk_free_bytes / disk_total_bytes) * 100 < 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanize }}% disk space remaining"
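These rules only fire alerts; grouping, routing, and notification are handled by Alertmanager. A minimal alertmanager.yml sketch, with the Slack webhook and channel as placeholders:
# alertmanager.yml - group alerts and route critical ones to Slack
route:
  receiver: 'default'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'

receivers:
  - name: 'default'
  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'   # placeholder webhook
        channel: '#alerts'
        send_resolved: true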
Popular exporters
Node Exporter
# Installation
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
sudo cp node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/
# systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
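The unit still has to be enabled and started (and the prometheus user created if it does not exist yet); for example:
# Create the service user if needed, then enable and start the exporter
sudo useradd --system --no-create-home --shell /bin/false prometheus
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
# Quick check of the endpoint
curl -s http://localhost:9100/metrics | head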
Custom application
# Python exporter using prometheus_client
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import random

# Metrics
REQUEST_COUNT = Counter('app_requests_total', 'Total requests', ['method', 'endpoint'])
REQUEST_LATENCY = Histogram('app_request_duration_seconds', 'Request latency')
ACTIVE_USERS = Gauge('app_active_users', 'Number of active users')

def process_request(method, endpoint):
    REQUEST_COUNT.labels(method=method, endpoint=endpoint).inc()

    with REQUEST_LATENCY.time():
        # Simulated processing
        time.sleep(random.uniform(0.1, 0.5))

    ACTIVE_USERS.set(random.randint(50, 200))

if __name__ == '__main__':
    start_http_server(8000)
    print("Metrics server started on port 8000")

    while True:
        process_request('GET', '/api/users')
        time.sleep(1)
Deployment with Ansible
- name: Deploy Prometheus
  hosts: monitoring
  become: yes

  tasks:
    - name: Create prometheus user
      user:
        name: prometheus
        system: yes
        shell: /bin/false
        home: /var/lib/prometheus

    - name: Download and install Prometheus
      unarchive:
        src: "https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz"
        dest: /tmp
        remote_src: yes
        creates: /tmp/prometheus-2.45.0.linux-amd64

    - name: Copy Prometheus binary
      copy:
        src: /tmp/prometheus-2.45.0.linux-amd64/prometheus
        dest: /usr/local/bin/prometheus
        mode: '0755'
        owner: prometheus
        group: prometheus
        remote_src: yes

    - name: Create configuration directory
      file:
        path: /etc/prometheus
        state: directory
        owner: prometheus
        group: prometheus

    - name: Deploy Prometheus configuration
      template:
        src: prometheus.yml.j2
        dest: /etc/prometheus/prometheus.yml
        owner: prometheus
        group: prometheus
      notify: restart prometheus

    - name: Create systemd service
      template:
        src: prometheus.service.j2
        dest: /etc/systemd/system/prometheus.service
      notify:
        - reload systemd
        - restart prometheus

  handlers:
    - name: reload systemd
      systemd:
        daemon_reload: yes

    - name: restart prometheus
      systemd:
        name: prometheus
        state: restarted
        enabled: yes
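The prometheus.service.j2 template is not shown in the playbook; a sketch of what it could contain, using standard Prometheus flags and the paths from the playbook:
# prometheus.service.j2 (sketch)
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=90d

[Install]
WantedBy=multi-user.target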
Storage and retention
Retention configuration
# Advanced configuration (prometheus.yml)
global:
  scrape_interval: 15s
  external_labels:
    cluster: 'production'
    replica: 'A'

# Remote write for long-term storage
remote_write:
  - url: "https://cortex.example.com/api/prom/push"
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/cortex_password

# Data retention is configured via command-line flags, not in prometheus.yml:
#   prometheus --storage.tsdb.retention.time=90d --storage.tsdb.retention.size=50GB
Performance optimization
High-performance configuration
# Optimizations
global:
  scrape_interval: 30s        # Increase the global interval
  evaluation_interval: 30s

# Per-job tuning
scrape_configs:
  - job_name: 'high-cardinality-app'
    scrape_interval: 60s      # Scrape high-cardinality targets less often
    sample_limit: 50000       # Limit samples per scrape

    # Relabeling to reduce cardinality
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_version]
        regex: '(\d+\.\d+)\..*'
        target_label: version
        replacement: '${1}'   # Keep only major.minor
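Target relabeling only rewrites target labels; to drop whole series at ingestion time, metric_relabel_configs apply after the scrape. A short sketch, with a hypothetical metric-name pattern:
# Drop unwanted series after the scrape, before storage
scrape_configs:
  - job_name: 'high-cardinality-app'
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'debug_.*'     # hypothetical metric name prefix
        action: drop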
Resources
- Documentation: prometheus.io
- Exporters: prometheus.io/docs/instrumenting/exporters
- Best practices: prometheus.io/docs/practices