Prometheus Study Notes

Compiled from the official Prometheus documentation, with reference to Prometheus 3.5 and Alertmanager 0.27.

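I. Introduction

1.1 Overview

What is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted it, and the project has a very active developer and user community. In 2016 Prometheus joined the Cloud Native Computing Foundation (CNCF) as its second hosted project, after Kubernetes.

Core ideas:

Prometheus collects and stores its metrics as time series data
Each metric value is stored together with the timestamp at which it was recorded
Optional key-value pairs called labels are stored alongside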

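Key features

Multi-dimensional data model: time series are identified by a metric name and key-value pairs (labels)
PromQL, a flexible query language that leverages this dimensionality
No reliance on distributed storage: single server nodes are autonomous
Pull model over HTTP for collecting time series
Pushing time series is supported through an intermediary gateway (Pushgateway) for short-lived jobs
Targets are discovered via service discovery or static configuration
Multiple modes of graphing and dashboarding support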

What are metrics?

指标是数值测量的通俗术语。时间序列是指随时间变化的记录。

应用场景示例：

Web 服务器: 请求时间
数据库: 活跃连接数、活跃查询数
应用性能: 当请求数量高时，应用可能变慢，通过请求计数指标可以确定原因并增加服务器数量

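Components

The Prometheus ecosystem consists of multiple components, many of which are optional:

Component	Description
Prometheus server	The main server, which scrapes and stores time series data
Client libraries	Libraries for instrumenting application code
Pushgateway	A push gateway for supporting short-lived jobs
Exporters	Special-purpose exporters for services such as HAProxy, StatsD, and Graphite
Alertmanager	Handles alerts
Support tools	Various supporting tools

Most Prometheus components are written in Go, which makes them easy to build and deploy as static binaries.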

Architecture diagram

graph TD
    A[Prometheus Server] -->|scrape| B[Jobs/Exporters]
    C[Short-lived Jobs] -->|push at exit| D[Pushgateway]
    A -->|scrape| D
    A --> E[TSDB local storage]
    A -->|push alerts| F[Alertmanager]
    F -->|notify| G[Email/PagerDuty/etc]
    H[Grafana] -->|PromQL over HTTP API| A
    I[API Clients] -->|PromQL over HTTP API| A
    J[Service Discovery] -.discover.-> A
    K[Static Config] -.configure.-> A

工作流程：

Prometheus 从被监控的任务中抓取指标，可以直接抓取，也可以通过推送网关抓取短生命周期的任务
将所有抓取的样本存储在本地
对数据运行规则以聚合和记录新的时间序列，或生成告警
使用 Grafana 或其他 API 消费者可视化收集的数据

When does it fit?

✅ 适合使用 Prometheus 的场景：

记录纯数值时间序列数据
面向机器的监控
高度动态的面向服务的架构监控
微服务环境（多维数据收集和查询是其特别优势）
需要高可靠性的场景（每个 Prometheus 服务器是独立的，不依赖网络存储或其他远程服务）

❌ 不适合使用 Prometheus 的场景：

需要 100% 准确性的场景（如按请求计费）
Prometheus 更注重可靠性而非绝对准确性
对于需要详细和完整数据的计费系统，建议使用其他系统进行数据收集和分析

1.2 First Steps

本节将指导你完成 Prometheus 的安装、配置和监控第一个资源的过程。

Downloading Prometheus

下载适合你平台的最新版本 Prometheus
解压文件：

tar xvfz prometheus-*.tar.gz
cd prometheus-*

Prometheus 服务器是一个名为 prometheus 的单一二进制文件（Windows 上为 prometheus.exe）
查看帮助信息：

./prometheus --help

Configuring Prometheus

Prometheus 使用 YAML 格式进行配置。下载包中包含一个示例配置文件 prometheus.yml。

配置文件示例：

global:
  scrape_interval:     15s    # 全局抓取间隔
  evaluation_interval: 15s    # 规则评估间隔

rule_files:
  # - "first.rules"
  # - "second.rules"

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

配置文件三大块：

配置块	说明
global	Prometheus 服务器的全局配置
rule_files	指定 Prometheus 服务器要加载的规则文件位置
scrape_configs	控制 Prometheus 监控哪些资源

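Configuration details:

global block
scrape_interval: how often Prometheus scrapes targets (15 seconds in this example; the built-in default is 1 minute)
evaluation_interval: how often Prometheus evaluates rules
rule_files block
Specifies the location of the rule files to load
Prometheus uses rules to create new time series and to generate alerts
scrape_configs block
Defines the targets to monitor
The default configuration contains a single job named prometheus, which scrapes the time series exposed by the Prometheus server itself
Default scrape URL: http://localhost:9090/metrics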

Starting Prometheus

使用配置文件启动 Prometheus：

./prometheus --config.file=prometheus.yml

启动后，可以访问以下地址：

状态页面: http://localhost:9090
指标端点: http://localhost:9090/metrics
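The metrics endpoint returns plain text in the Prometheus exposition format, one sample per line. As a rough illustration of what Prometheus reads on each scrape, the sketch below parses a few such lines with the Python standard library. It is a simplified reader, not the real parser: it ignores timestamps, does not unescape label values, and the `EXAMPLE` text is made-up sample input rather than captured output.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: name="value" (escaped quotes tolerated).
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse text-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # line shape not covered by this simplified reader
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical scrape body used only for illustration.
EXAMPLE = """\
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 42
promhttp_metric_handler_requests_total{code="500"} 0
"""

for sample in parse_metrics(EXAMPLE):
    print(sample)
```

Each tuple corresponds to one time series sample: the metric name, its label set, and a float64 value.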

Using the expression browser

访问表达式浏览器: http://localhost:9090/graph
选择 "Graph" 标签页中的 "Table" 视图

查询示例：

# 查询 Prometheus 服务器处理的 /metrics 请求总数
promhttp_metric_handler_requests_total

# 仅查询 HTTP 状态码 200 的请求
promhttp_metric_handler_requests_total{code="200"}

# 计算返回的时间序列数量
count(promhttp_metric_handler_requests_total)

Using the graphing interface

访问 http://localhost:9090/graph 并使用 "Graph" 标签页。

示例：绘制每秒 HTTP 请求速率

rate(promhttp_metric_handler_requests_total{code="200"}[1m])
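To make the idea behind `rate()` concrete, here is a minimal sketch of a per-second rate over a window of counter samples, including the counter-reset handling that makes counters safe across restarts. It is deliberately simplified: the real PromQL `rate()` also extrapolates to the window boundaries, which this sketch does not attempt. The function name and the sample data are illustrative assumptions.

```python
def simple_rate(samples):
    """Per-second increase of a counter over the sampled window.

    samples: list of (timestamp_seconds, value) pairs ordered by time.
    A drop in value is treated as a counter reset (restart from zero).
    """
    if len(samples) < 2:
        return None  # at least two points are needed to compute a rate
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            increase += value  # counter restarted; count from zero
        else:
            increase += value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical 1-minute window scraped every 15 s, with a reset before t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(window))
```

The reset between the third and fourth sample contributes only the post-restart value, so the total increase is 100 over 60 seconds.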

Monitoring other targets

要更好地了解 Prometheus 的功能，推荐探索其他导出器的文档，例如：

使用 Node Exporter 监控 Linux 或 macOS 主机指标

1.3 Comparison to Alternatives

Prometheus vs. Graphite

范围（Scope）

Graphite: 被动的时间序列数据库，具有查询语言和图形功能，其他关注点由外部组件处理
Prometheus: 完整的监控和趋势系统，包括内置的主动抓取、存储、查询、图形和告警功能

数据模型（Data Model）

特性	Graphite	Prometheus
指标命名	点分隔的组件（隐式编码维度）	显式的键值对标签
示例	`stats.api-server.tracks.post.500 -> 93`	`api_server_http_requests_total{method="POST",handler="/tracks",status="500",instance="sample1"} -> 34`
维度支持	隐式	显式，易于过滤、分组和匹配

存储（Storage）

Graphite: 使用 Whisper 格式在本地磁盘存储（RRD 风格数据库），样本需按固定间隔到达
Prometheus: 每个时间序列一个本地文件，允许以任意间隔存储样本，新样本简单追加，可长期保留旧数据

总结

✅ Prometheus 优势: 更丰富的数据模型和查询语言，更易于运行和集成
✅ Graphite 优势: 如果需要集群化解决方案和长期历史数据存储

Prometheus vs. InfluxDB

范围（Scope）

需要将 Kapacitor 与 InfluxDB 一起考虑，以解决与 Prometheus + Alertmanager 相同的问题空间
InfluxDB 提供连续查询（continuous queries），相当于 Prometheus 的记录规则
Kapacitor 的范围类似于 Prometheus 记录规则、告警规则和 Alertmanager 的通知功能

数据模型/存储

特性	InfluxDB	Prometheus
标签	Tags（第一级）+ Fields（第二级）	Labels
时间戳精度	纳秒级	毫秒级
数据类型	float64, int64, bool, string	float64（有限的字符串支持）
存储方式	LSM 树变体 + WAL，按时间分片	每个时间序列一个仅追加文件

架构（Architecture）

Prometheus: 服务器彼此独立运行，仅依赖本地存储
InfluxDB 开源版: 类似 Prometheus
InfluxDB 商业版: 分布式存储集群，存储和查询由多个节点处理

优劣对比

InfluxDB 更优：

✅ 事件日志场景
✅ 商业版提供集群化支持，更适合长期数据存储
✅ 副本之间的最终一致性视图

Prometheus 更优：

✅ 主要用于指标监控
✅ 更强大的查询语言、告警和通知功能
✅ 更高的图形和告警可用性和正常运行时间

Prometheus vs. OpenTSDB

范围（Scope）

与 Graphite 的范围差异相同

数据模型（Data Model）

OpenTSDB 的数据模型几乎与 Prometheus 相同
时间序列由一组任意键值对标识（OpenTSDB 的 tags = Prometheus 的 labels）
差异:
Prometheus 允许标签值中使用任意字符，OpenTSDB 限制更多
OpenTSDB 缺少完整的查询语言，仅允许通过 API 进行简单聚合和数学运算

存储（Storage）

OpenTSDB 基于 Hadoop 和 HBase 实现
易于水平扩展，但需要接受运行 Hadoop/HBase 集群的复杂性
Prometheus 初期运行更简单，但超过单节点容量后需要显式分片

总结

✅ Prometheus 优势: 更丰富的查询语言，能处理更高基数的指标，是完整监控系统的一部分
✅ OpenTSDB 优势: 如果已经运行 Hadoop 并且重视长期存储

Prometheus vs. Nagios

范围（Scope）

Nagios: 起源于 1990 年代，主要基于脚本退出码进行告警（"checks"）
支持单个告警的静默，但无分组、路由或去重功能

数据模型（Data Model）

Nagios 是基于主机的
每个主机可以有一个或多个服务，每个服务可以执行一次检查
没有标签概念或查询语言

存储（Storage）

Nagios 本身没有存储，只存储当前检查状态
可以通过插件存储数据用于可视化

架构（Architecture）

Nagios 服务器是独立的
所有检查配置通过文件完成

总结

✅ Nagios 适用: 小型和/或静态系统的基本监控，黑盒探测足够的场景
✅ Prometheus 适用: 白盒监控，或具有动态/云环境的场景

Prometheus vs. Sensu

范围（Scope）

Sensu: 开源监控和可观测性管道，专注于处理和告警可观测性数据流（事件流）
提供事件过滤、聚合、转换和处理的可扩展框架
Sensu 的事件处理能力类似于 Prometheus 告警规则和 Alertmanager 的范围

数据模型（Data Model）

Sensu 事件通过实体名称、事件名称和可选的键值元数据（"labels" 或 "annotations"）进行标识
事件有效载荷可能包含一个或多个指标点（JSON 对象：name, tags, timestamp, value）

存储（Storage）

Sensu 将当前和最近的事件状态信息及实时清单数据存储在嵌入式数据库（etcd）或外部 RDBMS（PostgreSQL）中

架构（Architecture）

Sensu 部署的所有组件都可以集群化以实现高可用性和提高事件处理吞吐量

优劣对比

Sensu 更优：

✅ 收集和处理混合可观测性数据（包括指标和/或事件）
✅ 整合多个监控工具，需要支持指标和 Nagios 风格插件或检查脚本
✅ 更强大的事件处理平台

Prometheus 更优：

✅ 主要收集和评估指标
✅ 监控同质 Kubernetes 基础设施（100% K8s 工作负载，Prometheus 提供更好的 K8s 集成）
✅ 更强大的查询语言，内置历史数据分析支持

Summary

通过以上内容，我们了解了：

Prometheus 的核心概念：时间序列数据、指标、标签
如何快速开始：下载、配置、启动和基本查询
与其他监控系统的对比：帮助选择合适的监控方案

接下来将深入学习 Prometheus 的核心概念、服务器配置、查询语言和最佳实践。

II. Concepts

2.1 Data Model

Time series data

Prometheus 从根本上将所有数据存储为时间序列：属于同一指标和同一组标签维度的带时间戳的值流。

除了存储的时间序列外，Prometheus 还可以根据查询结果生成临时的派生时间序列。

Metric names and labels

每个时间序列都通过其指标名称和可选的键值对（标签）唯一标识。

指标名称（Metric Names）

规则	说明
命名规范	应指定被测量系统的一般特性（例如 `http_requests_total` - 收到的 HTTP 请求总数）
字符支持	可以使用任何 UTF-8 字符
推荐格式	应匹配正则表达式 `[a-zA-Z_:][a-zA-Z0-9_:]*` 以获得最佳体验和兼容性
保留字符	冒号（`:`）保留用于用户定义的记录规则，导出器或直接埋点不应使用

指标标签（Metric Labels）

标签让你能够捕获同一指标名称的不同实例。例如：所有使用 POST 方法访问 /api/tracks 处理器的 HTTP 请求。这就是 Prometheus 的"多维数据模型"。

查询语言允许基于这些维度进行过滤和聚合。

规则	说明
字符支持	标签名可以使用任何 UTF-8 字符
保留前缀	以 `__`（两个下划线）开头的标签名必须保留给 Prometheus 内部使用
推荐格式	应匹配正则表达式 `[a-zA-Z_][a-zA-Z0-9_]*` 以获得最佳兼容性
标签值	可以包含任何 UTF-8 字符
空值处理	具有空标签值的标签被视为等同于不存在的标签

⚠️ 注意: Prometheus v3.0.0 中添加了对指标和标签名称的 UTF-8 支持。建议使用推荐的字符集以获得最佳兼容性。

Samples

样本构成实际的时间序列数据。每个样本包括：

组成部分	说明
值	float64 或原生直方图值
时间戳	毫秒精度的时间戳

Notation

给定指标名称和一组标签，时间序列通常使用以下表示法标识：

<metric name>{<label name>="<label value>", ...}

示例：

# 标准表示法
api_http_requests_total{method="POST", handler="/messages"}

# UTF-8 字符需要引号
{"特殊指标名", label="value"}

# 使用 __name__ 标签（内部表示）
{__name__="api_http_requests_total", method="POST", handler="/messages"}

💡 这与 OpenTSDB 使用的表示法相同。
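The notation can be reproduced with a few lines of code. The sketch below renders a metric name and label set in `<metric name>{<label name>="<label value>", ...}` form and drops labels with empty values, since those are treated as absent. The helper name `series_id` is an assumption for illustration, and the sketch does not escape quotes or backslashes inside label values.

```python
def series_id(name, labels):
    """Render a series as <metric name>{<label>="<value>", ...}."""
    # A label with an empty value is equivalent to a missing label, so drop it.
    pairs = sorted((k, v) for k, v in labels.items() if v != "")
    if not pairs:
        return name
    body = ", ".join(f'{k}="{v}"' for k, v in pairs)
    return f"{name}{{{body}}}"


print(series_id("api_http_requests_total",
                {"method": "POST", "handler": "/messages"}))
```

Sorting the label pairs gives every label set a single canonical string, so two series compare equal exactly when their metric name and non-empty labels match.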

2.2 Metric Types

Prometheus 客户端库提供四种核心指标类型。这些类型目前仅在客户端库中区分（以支持特定类型的使用 API）和在传输协议中区分。Prometheus 服务器尚未使用类型信息，而是将所有数据展平为无类型时间序列。

graph TD
    A[Prometheus 指标类型] --> B[Counter 计数器]
    A --> C[Gauge 仪表]
    A --> D[Histogram 直方图]
    A --> E[Summary 摘要]

    B --> B1[单调递增]
    B --> B2[可重置为0]

    C --> C1[可增可减]
    C --> C2[适用于温度/内存等]

    D --> D1[观测值分桶]
    D --> D2[提供总和与计数]
    D --> D3[可计算分位数]

    E --> E1[观测值分位数]
    E --> E2[滑动时间窗口]
    E --> E3[提供总和与计数]

Counter

Counter 是一个累积指标，表示单个单调递增的计数器，其值只能增加或在重启时重置为零。

适用场景：

✅ 请求服务数
✅ 已完成的任务数
✅ 错误数

❌ 不要使用 Counter:

不要用于可能减少的值（如当前运行的进程数，应使用 Gauge）

示例：

# HTTP 请求总数
http_requests_total{method="POST", endpoint="/api/users"}

Gauge

Gauge 是表示单个数值的指标，该值可以任意上下波动。

适用场景：

✅ 温度测量
✅ 当前内存使用量
✅ 可以上下波动的"计数"（如并发请求数）

示例：

# 当前内存使用量（字节）
memory_usage_bytes{instance="localhost:9090"}

# 当前并发请求数
http_concurrent_requests{endpoint="/api"}

Histogram

Histogram 对观测值（通常是请求持续时间或响应大小等）进行采样，并将其计入可配置的桶中。它还提供所有观测值的总和。

暴露的时间序列：

对于基础指标名称为 <basename> 的直方图，会在抓取时暴露多个时间序列：

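Time series	Description
`<basename>_bucket{le="<upper inclusive bound>"}`	Cumulative counters for the observation buckets
`<basename>_sum`	Total sum of all observed values
`<basename>_count`	Count of observed events (identical to `<basename>_bucket{le="+Inf"}`)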

使用场景：

✅ 使用 histogram_quantile() 函数计算直方图的分位数
✅ 计算 Apdex 分数
✅ 适合聚合多个实例的直方图

示例：

# 原始直方图指标
http_request_duration_seconds_bucket{le="0.1"} 1000
http_request_duration_seconds_bucket{le="0.5"} 1500
http_request_duration_seconds_bucket{le="1.0"} 1800
http_request_duration_seconds_bucket{le="+Inf"} 2000
http_request_duration_seconds_sum 850
http_request_duration_seconds_count 2000

# 查询 95 分位数
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

📌 Prometheus v2.40+: 实验性支持原生直方图（Native Histograms），只需一个时间序列，包含动态数量的桶，提供更高分辨率且成本更低。

📌 Prometheus v3.0+: 经典直方图的 le 标签值在摄取期间规范化，遵循 OpenMetrics 规范数字格式。
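To show how a quantile is estimated from cumulative buckets, the sketch below applies linear interpolation inside the bucket that contains the requested rank, which is the general approach `histogram_quantile()` takes for classic histograms. It works on raw cumulative counts from the example above rather than on `rate()` results, assumes the lowest bucket starts at 0, and skips the edge cases the real function handles (NaN inputs, negative bounds, non-monotonic buckets). The function name is an assumption.

```python
import math


def classic_histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Cumulative counts taken from the example histogram above.
BUCKETS = [(0.1, 1000), (0.5, 1500), (1.0, 1800), (math.inf, 2000)]
print(classic_histogram_quantile(0.7, BUCKETS))
print(classic_histogram_quantile(0.95, BUCKETS))
```

With this data the 95th percentile lands in the `+Inf` bucket, so the estimate is capped at 1.0, the largest finite bound. This is why the bucket layout should extend past the latencies you care about.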

Summary

Summary 与 Histogram 类似，也对观测值进行采样（通常是请求持续时间和响应大小）。它提供观测值的总计数和总和，并可以计算滑动时间窗口上的可配置分位数。

暴露的时间序列：

对于基础指标名称为 <basename> 的摘要，会在抓取时暴露多个时间序列：

时间序列	说明
`<basename>{quantile="<φ>"}`	观测事件的流式 φ-分位数（0 ≤ φ ≤ 1）
`<basename>_sum`	所有观测值的总和
`<basename>_count`	已观测到的事件计数

示例：

# 摘要指标
http_request_duration_seconds{quantile="0.5"} 0.05
http_request_duration_seconds{quantile="0.9"} 0.2
http_request_duration_seconds{quantile="0.99"} 0.5
http_request_duration_seconds_sum 850
http_request_duration_seconds_count 2000

📌 Prometheus v3.0+: 分位数标签值在摄取期间规范化，遵循 OpenMetrics 规范数字格式。

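Histogram vs. Summary:

Aspect	Histogram	Summary
Quantile calculation	Server side (PromQL)	Client side (in the application)
Aggregation	Buckets can be aggregated across instances	Precomputed quantiles cannot be aggregated (`_sum` and `_count` still can)
Accuracy	Limited by the bucket layout	Error is bounded by the configured quantile objectives
Client cost	Low (only bucket counters are incremented)	Higher (streaming quantile calculation)
Best for	Aggregation is needed, or the quantiles are not known in advance	The required quantiles are known up front and are observed per instance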

2.3 Jobs and Instances

Basic concepts

在 Prometheus 术语中：

Instance（实例）: 可以抓取的端点，通常对应单个进程
Job（任务）: 具有相同目的的实例集合（例如为了可扩展性或可靠性而复制的进程）

示例：一个有 4 个副本实例的 API 服务器任务

job: api-server
  instance 1: 1.2.3.4:5670
  instance 2: 1.2.3.4:5671
  instance 3: 5.6.7.8:5670
  instance 4: 5.6.7.8:5671

架构关系图：

graph LR
    A[Job: api-server] --> B[Instance: 1.2.3.4:5670]
    A --> C[Instance: 1.2.3.4:5671]
    A --> D[Instance: 5.6.7.8:5670]
    A --> E[Instance: 5.6.7.8:5671]

    F[Job: database] --> G[Instance: 10.0.0.1:3306]
    F --> H[Instance: 10.0.0.2:3306]

Automatically generated labels and time series

当 Prometheus 抓取目标时，会自动附加一些标签到抓取的时间序列，用于标识被抓取的目标：

标签	说明
job	目标所属的已配置任务名称
instance	被抓取目标 URL 的 `<host>:<port>` 部分

💡 如果这些标签已存在于抓取的数据中，行为取决于 honor_labels 配置选项。
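The effect of `honor_labels` on a label conflict can be sketched as a small merge function. With `honor_labels: false` the scraped value is kept under an `exported_` prefix and the server-side value wins; with `honor_labels: true` the scraped value wins. The function and the sample label sets are assumptions for illustration, not Prometheus source code.

```python
def attach_target_labels(scraped, target, honor_labels=False):
    """Merge server-side target labels into one scraped sample's labels."""
    merged = dict(scraped)
    for name, value in target.items():
        if name in scraped and scraped[name] != "":
            if honor_labels:
                continue  # keep the scraped value, ignore the target label
            # Conflict: preserve the scraped value as exported_<label>.
            merged["exported_" + name] = scraped[name]
        merged[name] = value
    return merged


# Hypothetical sample pushed through a Pushgateway with its own job label.
scraped = {"job": "batch", "code": "200"}
target = {"job": "pushgateway", "instance": "pgw:9091"}
print(attach_target_labels(scraped, target))
print(attach_target_labels(scraped, target, honor_labels=True))
```

This is why `honor_labels: true` is commonly used when scraping a Pushgateway or a federation endpoint, where the labels in the scraped data describe the original source.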

Automatically generated metrics

对于每个实例抓取，Prometheus 会在以下时间序列中存储样本：

核心指标：

指标	说明
`up{job="<job-name>", instance="<instance-id>"}`	实例是否健康：`1` = 可达，`0` = 抓取失败
`scrape_duration_seconds{job="<job-name>", instance="<instance-id>"}`	抓取持续时间（秒）
`scrape_samples_post_metric_relabeling{job="<job-name>", instance="<instance-id>"}`	应用指标重新标记后剩余的样本数
`scrape_samples_scraped{job="<job-name>", instance="<instance-id>"}`	目标暴露的样本数
`scrape_series_added{job="<job-name>", instance="<instance-id>"}`	此次抓取中新增的大约序列数（v2.10+）

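Extended metrics (require the extra-scrape-metrics feature flag):

Metric	Description
`scrape_timeout_seconds{job="<job-name>", instance="<instance-id>"}`	The `scrape_timeout` configured for the target
`scrape_sample_limit{job="<job-name>", instance="<instance-id>"}`	The `sample_limit` configured for the target (0 = no limit)
`scrape_body_size_bytes{job="<job-name>", instance="<instance-id>"}`	Uncompressed size of the most recent scrape response if it succeeded; -1 if the scrape failed because `body_size_limit` was exceeded, 0 for other failures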

实用查询示例：

# 检查所有实例的健康状态
up

# 查询所有不健康的实例
up == 0

# 查询特定任务的所有实例
up{job="api-server"}

# 计算任务的健康实例数
sum(up{job="api-server"})

# 查询抓取时间超过 1 秒的实例
scrape_duration_seconds > 1

💡 up 时间序列对于实例可用性监控非常有用！

Summary

通过本章节，我们深入了解了 Prometheus 的核心概念：

数据模型: 时间序列如何通过指标名称和标签唯一标识
指标类型: Counter、Gauge、Histogram、Summary 的区别和使用场景
任务与实例: Job 和 Instance 的关系，以及 Prometheus 自动生成的监控指标

这些概念是理解和使用 Prometheus 的基础，为后续学习服务器配置、查询语言和最佳实践奠定了坚实基础。

III. Server

3.1 Getting Started

This guide is a "Hello World"-style tutorial that shows how to install, configure, and use a simple Prometheus instance.

Downloading and running Prometheus

下载适合你平台的最新版本 Prometheus
解压并运行：

tar xvfz prometheus-*.tar.gz
cd prometheus-*

Configuring Prometheus to monitor itself

Prometheus 通过抓取指标 HTTP 端点从目标收集指标。由于 Prometheus 以相同方式暴露自身数据，它也可以抓取和监控自己的健康状况。

将以下基本配置保存为 prometheus.yml：

global:
  scrape_interval: 15s  # 默认每 15 秒抓取目标一次

  # 与外部系统通信时附加这些标签
  # (federation, remote storage, Alertmanager)
  external_labels:
    monitor: 'codelab-monitor'

# 包含一个要抓取的端点的抓取配置
# 这里是 Prometheus 自身
scrape_configs:
  # job 名称作为标签 `job=<job_name>` 添加到从此配置抓取的任何时间序列
  - job_name: 'prometheus'

    # 覆盖全局默认值，每 5 秒抓取此任务的目标
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

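Configuration notes:

Setting	Description
`global.scrape_interval`	Global scrape interval; set to 15 seconds here (the built-in default is 1 minute)
`global.external_labels`	Labels attached when communicating with external systems
`scrape_configs`	List of scrape configurations
`job_name`	Job name, added to scraped series as the `job` label
`scrape_interval`	Per-job scrape interval (overrides the global setting)
`static_configs.targets`	Statically configured list of targets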

Starting Prometheus

# 启动 Prometheus
# 默认情况下，Prometheus 将数据库存储在 ./data (标志 --storage.tsdb.path)
./prometheus --config.file=prometheus.yml

启动后访问：

状态页面: http://localhost:9090
指标端点: http://localhost:9090/metrics

Using the expression browser

访问 http://localhost:9090/graph 并选择 "Graph" 标签页中的 "Table" 视图。

查询示例：

# 查询目标抓取间隔
prometheus_target_interval_length_seconds

# 仅查询 99 分位数延迟
prometheus_target_interval_length_seconds{quantile="0.99"}

# 计算返回的时间序列数量
count(prometheus_target_interval_length_seconds)

Using the graphing interface

访问 http://localhost:9090/graph 并使用 "Graph" 标签页。

示例：绘制每秒创建的块速率

rate(prometheus_tsdb_head_chunks_created_total[1m])

Starting up some sample targets

使用 Node Exporter 作为示例目标：

tar -xzvf node_exporter-*.*.tar.gz
cd node_exporter-*.*

# 在不同终端启动 3 个示例目标
./node_exporter --web.listen-address 127.0.0.1:8080
./node_exporter --web.listen-address 127.0.0.1:8081
./node_exporter --web.listen-address 127.0.0.1:8082

现在有三个目标监听：

http://localhost:8080/metrics
http://localhost:8081/metrics
http://localhost:8082/metrics

Configuring Prometheus to monitor the sample targets

将所有三个端点分组到一个名为 node 的任务中。我们将前两个端点标记为生产目标，第三个表示金丝雀实例。

在 prometheus.yml 的 scrape_configs 部分添加：

scrape_configs:
  - job_name: 'node'

    # 每 5 秒抓取一次
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'canary'

访问表达式浏览器，验证是否有 node_cpu_seconds_total 等指标。

Configuring recording rules for aggregation

为了提高效率，Prometheus 可以通过配置的记录规则将表达式预录制到新的持久时间序列中。

示例：记录 CPU 使用率的 5 分钟平均值

创建 prometheus.rules.yml：

groups:
- name: cpu-node
  rules:
  - record: job_instance_mode:node_cpu_seconds:avg_rate5m
    expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))

更新 prometheus.yml 以加载规则：

global:
  scrape_interval: 15s
  evaluation_interval: 15s  # 每 15 秒评估规则

  external_labels:
    monitor: 'codelab-monitor'

rule_files:
  - 'prometheus.rules.yml'

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'
      - targets: ['localhost:8082']
        labels:
          group: 'canary'

重启 Prometheus 并验证新指标 job_instance_mode:node_cpu_seconds:avg_rate5m 是否可用。

Reloading configuration

无需重启进程即可重新加载配置（使用 SIGHUP 信号）：

# Linux 系统
kill -s SIGHUP <PID>

Shutting down gracefully

建议使用信号进行干净关闭：

# Linux 系统
kill -s SIGTERM <PID>
# 或
kill -s SIGINT <PID>
# 或在终端按 Control-C

3.2 Installation

Prometheus can be installed in several ways, each suited to different use cases.

Installation options at a glance

graph TD
    A[Prometheus 安装方式] --> B[预编译二进制文件]
    A --> C[源码编译]
    A --> D[Docker 镜像]

    B --> B1[下载官方二进制]
    B --> B2[直接运行]

    C --> C1[使用 Makefile]
    C --> C2[自定义构建]

    D --> D1[Docker Hub]
    D --> D2[Quay.io]
    D --> D3[容器化部署]

Option 1: Precompiled binaries

优点： 简单快速，适合快速测试和开发环境

访问官方下载页面
选择适合你平台的版本
下载并解压
直接运行 prometheus 二进制文件

tar xvfz prometheus-*.tar.gz
cd prometheus-*
./prometheus --config.file=prometheus.yml

Option 2: Building from source

优点： 可自定义构建选项，获取最新功能

查看相应仓库中的 Makefile 目标进行构建。

# 通常流程
git clone https://github.com/prometheus/prometheus.git
cd prometheus
make build

Option 3: Using Docker

优点： 容器化部署，易于管理和扩展

所有 Prometheus 服务都提供 Docker 镜像，可从 Quay.io 或 Docker Hub 获取。

基本运行：

# 使用示例配置启动 Prometheus
docker run -p 9090:9090 prom/prometheus

这将使用示例配置启动 Prometheus 并在端口 9090 上暴露。

重要提示：

Prometheus 镜像使用卷来存储实际指标
生产部署强烈建议使用命名卷来简化 Prometheus 升级时的数据管理

Advanced Docker configuration

1. 设置命令行参数

Docker 镜像以多个默认命令行参数启动（参见 Dockerfile）。

如果要添加额外的命令行参数，需要重新添加默认参数，因为它们会被覆盖。

2. 挂载配置文件（Bind-mount）

方法 A：挂载 prometheus.yml 文件

docker run \
    -p 9090:9090 \
    -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus

方法 B：挂载包含 prometheus.yml 的目录

docker run \
    -p 9090:9090 \
    -v /path/to/config:/etc/prometheus \
    prom/prometheus

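3. Persisting Prometheus data

Prometheus stores its data in the /prometheus directory inside the container. Without a named volume or bind mount, that data lives in an anonymous volume tied to the container, so it is left behind whenever the container is removed and recreated (for example during an upgrade).

To keep the data, give the container persistent storage: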

# 创建持久卷
docker volume create prometheus-data

# 使用持久存储启动 Prometheus
docker run \
    -p 9090:9090 \
    -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
    -v prometheus-data:/prometheus \
    prom/prometheus

存储位置：

路径	说明
`/etc/prometheus/`	配置文件目录
`/prometheus/`	数据存储目录（TSDB）

4. 自定义镜像

将配置烘焙到镜像中，避免在主机上管理文件。适合配置相对静态且跨环境一致的场景。

创建自定义镜像：

创建新目录，包含 prometheus.yml 和 Dockerfile：

FROM prom/prometheus
ADD prometheus.yml /etc/prometheus/

构建并运行：

docker build -t my-prometheus .
docker run -p 9090:9090 my-prometheus

高级选项：

启动时动态渲染配置（使用工具）
使用守护进程定期更新配置

Complete Docker deployment example

# 1. Create a named volume for the TSDB data
docker volume create prometheus-data

# 2. 创建配置文件
cat > /tmp/prometheus.yml <<EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
EOF

# 3. 启动 Prometheus
docker run -d \
    --name prometheus \
    -p 9090:9090 \
    -v /tmp/prometheus.yml:/etc/prometheus/prometheus.yml \
    -v prometheus-data:/prometheus \
    prom/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/prometheus

# 4. 查看日志
docker logs -f prometheus

# 5. 访问 Web UI
# http://localhost:9090

Comparison of installation options

方式	优点	缺点	适用场景
预编译二进制	简单快速、无依赖	需手动管理更新	开发、测试、小规模部署
源码编译	可定制、最新功能	构建复杂、依赖多	需要特定功能、深度定制
Docker	容器化、易于管理、可扩展	需要 Docker 环境	生产环境、云原生架构、K8s

Summary

通过本章节，我们学习了：

Getting Started:
Prometheus 的基本配置和启动
监控自身和外部目标
使用表达式浏览器和图形界面
配置记录规则进行数据预聚合
Installation:
三种主要安装方式及其适用场景
Docker 部署的详细配置和最佳实践
持久化存储和自定义镜像的使用

这些内容为后续深入学习 Prometheus 的配置、查询和告警功能打下了坚实基础。

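3.3 Configuration

Prometheus is configured through command-line flags and a configuration file. The flags set immutable system parameters (such as storage locations and the amount of data to retain), while the configuration file defines scrape jobs, their instances, and which rule files to load.

Configuration file basics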

配置文件格式: YAML

查看所有命令行标志:

./prometheus -h

指定配置文件:

./prometheus --config.file=prometheus.yml

运行时重新加载配置:

# 方法 1: 发送 SIGHUP 信号
kill -s SIGHUP <PID>

# 方法 2: HTTP POST (需启用 --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

💡 如果新配置格式不正确,更改不会被应用。重新加载也会重新加载规则文件。
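The HTTP reload can also be triggered from a script. The sketch below builds the same `POST /-/reload` request as the `curl` command above using only the Python standard library. It assumes the server listens on `http://localhost:9090` and was started with `--web.enable-lifecycle`; the helper names are illustrative.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build (but do not send) the POST request for the reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_config(base_url="http://localhost:9090", timeout=10):
    """Send the reload request and return the HTTP status code.

    Raises urllib.error.HTTPError if the server rejects the request,
    for example when the lifecycle API is not enabled.
    """
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `reload_config()` after writing a new configuration file has the same effect as sending SIGHUP, but it also works when the script cannot signal the process directly, such as from another container.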

Configuration file structure

graph TD
    A[prometheus.yml] --> B[global 全局配置]
    A --> C[runtime 运行时配置]
    A --> D[rule_files 规则文件]
    A --> E[scrape_configs 抓取配置]
    A --> F[alerting 告警配置]
    A --> G[remote_write 远程写入]
    A --> H[remote_read 远程读取]
    A --> I[storage 存储配置]

    E --> E1[static_configs 静态配置]
    E --> E2[服务发现]
    E2 --> E2A[kubernetes_sd_configs]
    E2 --> E2B[consul_sd_configs]
    E2 --> E2C[ec2_sd_configs]
    E2 --> E2D[dns_sd_configs]
    E2 --> E2E[其他20+种服务发现]

Generic placeholders

占位符	说明	示例
`<boolean>`	布尔值	`true` 或 `false`
`<duration>`	持续时间	`1d`, `1h30m`, `5m`, `10s`
`<filename>`	当前工作目录中的有效路径	`prometheus.yml`
`<float>`	浮点数	`0.5`, `1.25`
`<host>`	主机名或 IP + 可选端口号	`localhost:9090`
`<int>`	整数值	`100`, `5000`
`<labelname>`	标签名 (正则:`[a-zA-Z_][a-zA-Z0-9_]*`)	`job`, `instance`
`<labelvalue>`	Unicode 字符串	任何 UTF-8 字符
`<path>`	有效的 URL 路径	`/metrics`
`<scheme>`	协议方案	`http` 或 `https`
`<secret>`	机密字符串(如密码)	`password123`
`<string>`	常规字符串	任意字符串
`<size>`	字节大小(需要单位)	`512MB`, `1GB`
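The `<duration>` placeholder combines integer-and-unit pairs in descending unit order, as in `1h30m`. The sketch below converts such a string to seconds. It assumes the unit set `y`, `w`, `d`, `h`, `m`, `s`, `ms`, with a year counted as 365 days and a week as 7 days; the function name is an assumption and the parser is a simplified stand-in for the one Prometheus uses.

```python
import re

# Units must appear largest first, each at most once.
DURATION_RE = re.compile(
    r"(?:(\d+)y)?(?:(\d+)w)?(?:(\d+)d)?(?:(\d+)h)?"
    r"(?:(\d+)m)?(?:(\d+)s)?(?:(\d+)ms)?"
)
# Milliseconds per unit, in the same order as the regex groups.
UNITS_MS = [31_536_000_000, 604_800_000, 86_400_000, 3_600_000, 60_000, 1_000, 1]


def parse_duration(text):
    """Convert a duration string such as '1h30m' to seconds."""
    if text == "0":
        return 0.0
    match = DURATION_RE.fullmatch(text)
    if not text or match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    total_ms = sum(int(group) * unit
                   for group, unit in zip(match.groups(), UNITS_MS)
                   if group is not None)
    return total_ms / 1000


for example in ("1d", "1h30m", "5m", "10s"):
    print(example, parse_duration(example))
```

A check like this is handy in tooling that generates configuration, for example to verify that a job's `scrape_timeout` does not exceed its `scrape_interval` before the file is written.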

Global configuration

全局配置在所有其他配置上下文中有效,并作为其他配置节的默认值。

global:
  # 抓取目标的默认频率
  scrape_interval: 15s  # 默认 1m

  # 抓取请求超时时间 (不能大于 scrape_interval)
  scrape_timeout: 10s  # 默认 10s

  # 规则评估频率
  evaluation_interval: 15s  # 默认 1m

  # 规则评估时间戳偏移(确保底层指标已收到)
  rule_query_offset: 0s  # 默认 0s

  # 与外部系统通信时添加的标签
  # (federation, remote storage, Alertmanager)
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

  # PromQL 查询日志文件
  query_log_file: '/var/log/prometheus/query.log'

  # 抓取失败日志文件
  scrape_failure_log_file: '/var/log/prometheus/scrape_failures.log'

关键全局配置项:

配置项	说明	默认值
`scrape_interval`	抓取目标的频率	`1m`
`scrape_timeout`	抓取超时时间	`10s`
`evaluation_interval`	规则评估频率	`1m`
`body_size_limit`	响应体大小限制	`0` (无限制)
`sample_limit`	每次抓取样本数限制	`0` (无限制)
`label_limit`	每个样本标签数限制	`0` (无限制)
`target_limit`	每个抓取配置的目标数限制	`0` (无限制)
`metric_name_validation_scheme`	指标名称验证方案	`utf8`

Runtime configuration

runtime:
  # 配置 Go 垃圾回收器 GOGC 参数
  # 降低此数字会增加 CPU 使用率
  gogc: 75  # 默认 75

Scrape configs

抓取配置指定一组目标及其抓取参数。一般情况下,一个抓取配置指定一个任务。

基本结构:

scrape_configs:
  - job_name: 'prometheus'  # 任务名称(必需,唯一)

    # 抓取间隔(覆盖全局设置)
    scrape_interval: 5s

    # 抓取超时
    scrape_timeout: 5s

    # 指标路径
    metrics_path: /metrics

    # 协议方案
    scheme: http

    # honor_labels 控制标签冲突处理
    # true: 保留抓取数据中的标签,忽略服务端标签
    # false: 重命名冲突标签为 exported_<label>
    honor_labels: false

    # honor_timestamps 控制是否使用目标的时间戳
    # true: 使用目标提供的时间戳
    # false: 忽略目标时间戳
    honor_timestamps: true

    # 静态配置的目标列表
    static_configs:
      - targets: ['localhost:9090']
        labels:
          env: 'production'

完整配置示例:

scrape_configs:
  # 监控 Prometheus 自身
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  # 监控 Node Exporter
  - job_name: 'node'
    scrape_interval: 10s
    static_configs:
      - targets: 
          - 'node1:9100'
          - 'node2:9100'
        labels:
          group: 'production'
      - targets: ['node3:9100']
        labels:
          group: 'testing'

  # 使用 Kubernetes 服务发现
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app

抓取配置关键选项:

选项	说明	默认值
`job_name`	任务名称(分配给抓取指标的标签)	必需
`scrape_interval`	抓取频率	继承 global
`scrape_timeout`	抓取超时	继承 global
`metrics_path`	指标路径	`/metrics`
`scheme`	协议	`http`
`honor_labels`	是否保留原始标签	`false`
`honor_timestamps`	是否使用目标时间戳	`true`
`enable_compression`	是否请求压缩响应	`true`

HTTP config

HTTP 配置允许配置 HTTP 请求的认证和 TLS 设置。

scrape_configs:
  - job_name: 'secure-app'
    # Basic 认证
    basic_auth:
      username: 'admin'
      password: 'secret123'

    # 或使用文件
    # basic_auth:
    #   username_file: /path/to/username
    #   password_file: /path/to/password

    # 或使用 Bearer Token
    # authorization:
    #   type: Bearer
    #   credentials: 'mytoken123'
    #   # 或从文件读取
    #   # credentials_file: /path/to/token

    # TLS 配置
    tls_config:
      # CA 证书
      ca_file: /path/to/ca.crt
      # 客户端证书
      cert_file: /path/to/client.crt
      key_file: /path/to/client.key
      # 跳过证书验证(不建议)
      insecure_skip_verify: false
      # 最小 TLS 版本
      min_version: TLS12

    # 代理设置
    proxy_url: http://proxy.example.com:8080

    # 自定义 HTTP 头
    http_headers:
      X-Custom-Header:
        values: ['custom-value']

    static_configs:
      - targets: ['app.example.com:443']

    scheme: https

HTTP 认证方式对比:

认证方式	适用场景	示例
Basic Auth	简单的用户名/密码认证	内部服务、开发环境
Bearer Token	API Token 认证	Kubernetes, 云服务
OAuth 2.0	复杂的授权流程	需要刷新 token 的场景
TLS Client Cert	mTLS 双向认证	高安全要求的服务

Service discovery

Prometheus 支持 25+ 种服务发现机制,可以自动发现监控目标。

支持的服务发现类型:

graph LR
    A[服务发现] --> B[云平台]
    A --> C[容器编排]
    A --> D[服务注册]
    A --> E[其他]

    B --> B1[AWS EC2]
    B --> B2[Azure]
    B --> B3[GCE]
    B --> B4[DigitalOcean]
    B --> B5[Lightsail]
    B --> B6[Linode]
    B --> B7[Hetzner]

    C --> C1[Kubernetes]
    C --> C2[Docker]
    C --> C3[Docker Swarm]
    C --> C4[Nomad]
    C --> C5[Marathon]

    D --> D1[Consul]
    D --> D2[Eureka]
    D --> D3[Zookeeper]
    D --> D4[DNS]

    E --> E1[HTTP]
    E --> E2[File]
    E --> E3[OpenStack]

常用服务发现配置示例:

1. Kubernetes 服务发现

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
            - production

    relabel_configs:
      # 仅抓取带有 prometheus.io/scrape=true 注解的 Pod
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

      # 使用自定义路径
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

      # 使用自定义端口
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

2. Consul 服务发现

scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'localhost:8500'
        datacenter: 'dc1'
        services: ['web', 'api', 'database']
        tags: ['prometheus']

    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service

3. DNS 服务发现

scrape_configs:
  - job_name: 'dns-discovery'
    dns_sd_configs:
      - names:
          - 'tasks.myservice.example.com'
        type: 'A'
        port: 9100
        # refresh_interval belongs to the dns_sd_configs entry, not to the job
        refresh_interval: 30s

4. EC2 服务发现

scrape_configs:
  - job_name: 'ec2-nodes'
    ec2_sd_configs:
      - region: us-east-1
        port: 9100
        filters:
          - name: tag:Environment
            values: ['production']
          - name: instance-state-name
            values: ['running']

    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name

5. File 服务发现 (最灵活)

scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
          - '/etc/prometheus/targets/*.yaml'
        refresh_interval: 30s

targets.json 示例:

[
  {
    "targets": ["host1:9100", "host2:9100"],
    "labels": {
      "env": "production",
      "team": "backend"
    }
  },
  {
    "targets": ["host3:9100"],
    "labels": {
      "env": "staging"
    }
  }
]
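File-based discovery is flexible because any tool can produce the target files. A minimal sketch of such a generator follows: it writes target groups as JSON and swaps the file into place with a rename, so Prometheus never reads a half-written file. The function name, the `.tmp` suffix, and the sample groups are assumptions; the sample data mirrors the targets.json shown above.

```python
import json
import os
import tempfile


def write_targets(path, groups):
    """Atomically write target groups to a file_sd JSON file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory; the .tmp suffix keeps
    # it from matching a *.json glob in file_sd_configs.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.replace(tmp_path, path)  # rename so readers never see a partial file


# Hypothetical inventory, same shape as the targets.json example.
EXAMPLE_GROUPS = [
    {"targets": ["host1:9100", "host2:9100"],
     "labels": {"env": "production", "team": "backend"}},
    {"targets": ["host3:9100"], "labels": {"env": "staging"}},
]
```

A cron job or inventory hook could call `write_targets("/etc/prometheus/targets/nodes.json", EXAMPLE_GROUPS)`; Prometheus picks up the change through file watching or at the next `refresh_interval`, with no reload needed.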

Relabel configs

重新标记是一个强大的工具,可以在抓取之前动态重写目标的标签集。

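scrape_configs:
  - job_name: 'example'
    # The __meta_kubernetes_* labels used below only exist with Kubernetes SD
    kubernetes_sd_configs:
      - role: pod

    # Each rule illustrates one action; rules are applied in order
    relabel_configs:
      # keep: retain only targets whose namespace matches
      - source_labels: [__meta_kubernetes_namespace]
        action: keep
        regex: 'production|staging'

      # drop: discard targets whose pod name matches
      - source_labels: [__meta_kubernetes_pod_name]
        action: drop
        regex: 'test-.*'

      # replace: set instance to the host part of the address
      - source_labels: [__address__]
        target_label: instance
        regex: '(.+):.*'
        replacement: '$1'

      # replace with a constant value: add a new label
      - target_label: cluster
        replacement: 'us-central'

      # labelmap: copy pod labels onto the target under their own names
      - action: labelmap
        regex: '__meta_kubernetes_pod_label_(.+)'

    metric_relabel_configs:
      # labeldrop: remove matching labels from scraped samples
      - action: labeldrop
        regex: 'pod_template_hash'

      # labelkeep removes every label NOT matching its regex, so it is omitted
      # here: the regex would have to cover all labels (including __name__)
      # needed to keep series unique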

Relabel 动作类型:

Action	说明
`replace`	替换标签值(默认)
`keep`	保留匹配 regex 的目标
`drop`	丢弃匹配 regex 的目标
`labelkeep`	保留匹配 regex 的标签
`labeldrop`	删除匹配 regex 的标签
`labelmap`	将标签名映射到新名称
`hashmod`	计算哈希值用于分片
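The mechanics shared by `replace`, `keep`, and `drop` are easy to miss in YAML: the source label values are joined with a separator (`;` by default), the regex must match the whole joined string, and `$1`-style references in the replacement expand to capture groups. The sketch below models just those three actions. It is an illustration, not the Prometheus implementation, and the sample labels are made up.

```python
import re


def relabel(labels, rule):
    """Apply one relabel rule; return new labels, or None if the target is dropped."""
    sources = rule.get("source_labels", [])
    separator = rule.get("separator", ";")
    value = separator.join(labels.get(name, "") for name in sources)
    # Relabel regexes are anchored at both ends, hence fullmatch.
    match = re.fullmatch(rule.get("regex", "(.*)"), value)
    action = rule.get("action", "replace")

    if action == "keep":
        return dict(labels) if match else None
    if action == "drop":
        return None if match else dict(labels)
    if action == "replace":
        if match is None:
            return dict(labels)  # no match: labels stay unchanged
        # Translate $1-style references into Python's \g<1> syntax.
        template = re.sub(r"\$(\d+)", r"\\g<\1>", rule.get("replacement", "$1"))
        result = dict(labels)
        result[rule["target_label"]] = match.expand(template)
        return result
    raise ValueError(f"action not covered by this sketch: {action}")


# The custom-port rule from the Kubernetes example, applied to sample labels.
port_rule = {
    "source_labels": ["__address__",
                      "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    "regex": r"([^:]+)(?::\d+)?;(\d+)",
    "replacement": "$1:$2",
    "target_label": "__address__",
}
pod = {"__address__": "10.0.0.5:8080",
       "__meta_kubernetes_pod_annotation_prometheus_io_port": "9102"}
print(relabel(pod, port_rule)["__address__"])
```

Full anchoring is the detail that most often surprises: a `keep` rule with `regex: 'prod'` does not match the value `production`.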

Alerting configuration

alerting:
  # 告警重新标记
  alert_relabel_configs:
    - source_labels: [dc]
      regex: 'dc1'
      target_label: severity
      replacement: 'critical'

  # Alertmanager 配置
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager1:9093'
            - 'alertmanager2:9093'

      # 超时设置
      timeout: 10s

      # 路径前缀
      path_prefix: /alertmanager

Remote write and remote read

# 远程写入(用于长期存储)
remote_write:
  - url: 'http://remote-storage:9201/write'

    # 写入超时
    remote_timeout: 30s

    # 队列配置
    queue_config:
      capacity: 10000
      max_shards: 50
      max_samples_per_send: 1000

    # 写入重新标记(过滤不需要的指标)
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop

# 远程读取
remote_read:
  - url: 'http://remote-storage:9201/read'
    read_recent: true

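Storage configuration

# Retention is normally set with command-line flags rather than in this block:
#   --storage.tsdb.retention.time=15d
#   --storage.tsdb.retention.size=512GB   (maximum bytes of storage blocks to retain)
storage:
  tsdb:
    # How far behind the newest data an out-of-order sample may be
    # and still be ingested (0s disables out-of-order ingestion)
    out_of_order_time_window: 0s

  exemplars:
    # Maximum number of exemplars kept in the in-memory circular buffer
    max_exemplars: 100000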

Complete configuration example

# 全局配置
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    datacenter: 'us-east-1'

# 运行时配置
runtime:
  gogc: 75

# 规则文件
rule_files:
  - '/etc/prometheus/rules/*.yml'

# 抓取配置
scrape_configs:
  # Prometheus 自身
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporters (使用文件服务发现)
  - job_name: 'node'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/nodes/*.json'

    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  # Kubernetes Pods
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod

    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

# 告警配置
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# 远程写入
remote_write:
  - url: 'http://thanos-receive:19291/api/v1/receive'
    queue_config:
      capacity: 10000
      max_samples_per_send: 1000

# Storage: retention is set with command-line flags, for example
#   --storage.tsdb.retention.time=30d --storage.tsdb.retention.size=1TB

Summary

通过本节,我们学习了 Prometheus 配置的核心内容：

全局配置: 抓取间隔、评估间隔、外部标签等
抓取配置: 如何定义监控目标、静态配置和服务发现
服务发现: 支持 25+ 种自动发现机制(Kubernetes、Consul、EC2 等)
认证与安全: Basic Auth、Bearer Token、OAuth 2.0、TLS 配置
重新标记: 动态修改标签,实现灵活的目标过滤和标签管理
告警配置: 配置 Alertmanager 集成
远程存储: 远程写入/读取,实现长期数据存储

这些配置选项为 Prometheus 提供了极大的灵活性,可以适应各种监控场景和基础设施环境。

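3.4 Recording Rules

Recording rules let you precompute frequently used or computationally expensive expressions and save the result as a new time series. Querying the precomputed result is usually much faster than running the original expression each time, which is especially useful for dashboards that evaluate the same expression on every refresh.

Rule configuration basics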

Prometheus 支持两种类型的规则:

Recording Rules (记录规则): 预计算表达式,保存为新的时间序列
Alerting Rules (告警规则): 定义告警条件,触发通知

规则文件格式: YAML

在配置文件中加载规则:

rule_files:
  - '/etc/prometheus/rules/*.yml'
  - '/etc/prometheus/alerts/*.yml'

运行时重新加载规则:

# 发送 SIGHUP 信号
kill -s SIGHUP <PID>

# 或使用 HTTP 接口
curl -X POST http://localhost:9090/-/reload

⚠️ 只有所有规则文件格式正确时,更改才会被应用

语法检查工具:

promtool check rules /path/to/example.rules.yml

Rule file structure

groups:
  - name: <组名>              # 必需,文件内唯一
    interval: <持续时间>       # 可选,规则评估频率
    limit: <整数>              # 可选,限制告警/序列数量
    query_offset: <持续时间>   # 可选,查询时间偏移
    labels:                    # 可选,添加到所有规则的标签
      <标签名>: <标签值>
    rules:
      - <规则定义>

基本示例:

groups:
  - name: example
    interval: 30s
    rules:
      # 记录规则
      - record: code:prometheus_http_requests_total:sum
        expr: sum by (code) (prometheus_http_requests_total)

Recording rule syntax

# 记录规则名称(必需,必须是有效的指标名称)
record: <string>

# PromQL 表达式(必需)
# 每次评估时执行,结果保存为新的时间序列
expr: <string>

# 添加或覆盖的标签(可选)
labels:
  <labelname>: <labelvalue>

完整配置示例:

groups:
  # 1. HTTP 请求聚合
  - name: http_requests_aggregation
    interval: 30s
    rules:
      # 按状态码聚合请求总数
      - record: code:prometheus_http_requests_total:sum
        expr: sum by (code) (prometheus_http_requests_total)

      # 按 job 聚合请求总数
      - record: job:prometheus_http_requests_total:sum
        expr: sum by (job) (prometheus_http_requests_total)

      # 计算每个 job 的请求速率 (5分钟平均)
      - record: job:prometheus_http_requests_total:rate5m
        expr: sum by (job) (rate(prometheus_http_requests_total[5m]))

  # 2. 资源使用率聚合
  - name: resource_usage
    interval: 1m
    rules:
      # CPU 使用率聚合
      - record: instance:node_cpu_utilization:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      # 内存使用率
      - record: instance:node_memory_utilization:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
          )
        labels:
          team: infrastructure

  # 3. 业务指标聚合
  - name: business_metrics
    interval: 5m
    rules:
      # 计算 API 请求成功率
      - record: job:api_success_rate:ratio
        expr: |
          sum by (job) (rate(api_requests_total{status=~"2.."}[5m]))
          /
          sum by (job) (rate(api_requests_total[5m]))

      # 计算 99 分位延迟
      - record: job:request_latency_seconds:p99
        expr: histogram_quantile(0.99, sum by (job, le) (rate(request_duration_seconds_bucket[5m])))

记录规则命名最佳实践¶

命名格式: level:metric:operations

level: 聚合级别 (如 job, instance, cluster)
metric: 原始指标名称
operations: 应用的操作 (如 sum, rate5m, ratio)

示例:

规则名称	说明
`job:http_requests_total:rate5m`	按 job 聚合的 HTTP 请求 5 分钟速率
`instance:node_cpu:utilization`	按实例聚合的 CPU 使用率
`cluster:memory:available_bytes`	集群级别的可用内存
`code:http_errors_total:sum`	按状态码聚合的错误总数
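The `level:metric:operations` convention can be checked mechanically before a rule file is committed. The following is a minimal Python sketch and not part of Prometheus or promtool; the helper name `parse_rule_name` is hypothetical, and it assumes a well-formed name contains exactly two colons.

```python
from typing import NamedTuple


class RuleName(NamedTuple):
    level: str
    metric: str
    operations: str


def parse_rule_name(name: str) -> RuleName:
    """Split a recording rule name of the form level:metric:operations."""
    parts = name.split(":")
    # Exactly three non-empty parts are expected by this sketch.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected level:metric:operations, got {name!r}")
    return RuleName(*parts)


print(parse_rule_name("job:http_requests_total:rate5m"))
```

A check like this could run in CI over every `record:` value, so that names without a level or operations part are caught early.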

规则组配置选项¶

选项	说明	默认值
`name`	规则组名称(必需,唯一)	-
`interval`	规则评估间隔	`global.evaluation_interval`
`limit`	告警/序列数量限制	`0` (无限制)
`query_offset`	查询时间偏移	`global.rule_query_offset`
`labels`	附加到所有规则的标签	-

规则评估性能优化¶

1. 设置合适的评估间隔

groups:
  # 高频评估 - 关键指标
  - name: critical_metrics
    interval: 15s
    rules:
      - record: job:up:count
        expr: count by (job) (up == 1)

  # 低频评估 - 聚合指标
  - name: daily_aggregations
    interval: 5m
    rules:
      - record: cluster:cpu_usage:avg24h
        expr: avg_over_time(cluster:cpu_usage:ratio[24h])

2. 使用查询偏移确保数据可用

groups:
  - name: remote_write_metrics
    # 偏移 1 分钟,确保远程写入的数据已到达
    query_offset: 1m
    rules:
      - record: job:requests_total:rate5m
        expr: sum by (job) (rate(requests_total[5m]))

3. 限制规则产生的序列数量

groups:
  - name: high_cardinality_metrics
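    # Allow at most 10000 series per rule; if a rule exceeds the limit, its evaluation fails and all of its output is discarded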
    limit: 10000
    rules:
      - record: path:http_requests_total:sum
        expr: sum by (path) (http_requests_total)

3.5 Alerting Rules (告警规则)¶

告警规则允许你基于 Prometheus 表达式定义告警条件,并在告警触发时发送通知到外部服务(如 Alertmanager)。

告警规则语法¶

# 告警名称(必需,必须是有效的标签值)
alert: <string>

# PromQL 表达式(必需)
# 当表达式结果为真时,告警激活
expr: <string>

# 持续时间阈值(可选)
# 告警在持续触发这段时间后才会 firing
[ for: <duration> | default = 0s ]

# 保持触发时间(可选)
# 告警条件消失后继续触发的时间
[ keep_firing_for: <duration> | default = 0s ]

# 附加标签(可选)
# 标签值支持模板化
labels:
  [ <labelname>: <tmpl_string> ]

# 注解(可选)
# 用于存储描述、Runbook 链接等信息
# 注解值支持模板化
annotations:
  [ <labelname>: <tmpl_string> ]

告警规则示例¶

基础告警:

groups:
  - name: instance_alerts
    rules:
      # 实例宕机告警
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "实例 {{ $labels.instance }} 宕机"
          description: "{{ $labels.instance }} (job: {{ $labels.job }}) 已宕机超过 5 分钟"

高级告警配置:

groups:
  - name: application_alerts
    labels:
      team: backend
    rules:
      # 1. API 高延迟告警
      - alert: APIHighRequestLatency
        expr: |
          histogram_quantile(0.5, 
            rate(api_http_request_latencies_second_bucket[5m])
          ) > 1
        for: 10m
        labels:
          severity: warning
          service: api
        annotations:
          summary: "{{ $labels.instance }} API 延迟过高"
          description: "实例 {{ $labels.instance }} 的中位数请求延迟超过 1s (当前值: {{ $value }}s)"
          dashboard: "https://grafana.example.com/d/api-dashboard"
          runbook: "https://wiki.example.com/runbooks/api-latency"

      # 2. 错误率过高告警
      - alert: HighErrorRate
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        keep_firing_for: 10m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} 错误率过高"
          description: "{{ $labels.job }} 的 5xx 错误率超过 5% (当前: {{ $value | humanizePercentage }})"

      # 3. 磁盘空间不足告警
      - alert: DiskSpaceLow
        expr: |
          (
            node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"}
            /
            node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"}
          ) < 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} 磁盘空间不足"
          description: |
            实例 {{ $labels.instance }} 的挂载点 {{ $labels.mountpoint }} 
            可用空间低于 10% (当前: {{ $value | humanizePercentage }})

      # 4. 内存使用率过高告警
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} 内存使用率过高"
          description: "实例 {{ $labels.instance }} 内存使用率超过 90% (当前: {{ $value | humanizePercentage }})"

告警状态转换¶

graph LR
    A[Inactive<br/>未触发] -->|expr = true| B[Pending<br/>待触发]
    B -->|持续 for 时间| C[Firing<br/>触发中]
    B -->|expr = false| A
    C -->|expr = false<br/>& keep_firing_for = 0| A
    C -->|expr = false<br/>& keep_firing_for > 0| D[Keep Firing<br/>保持触发]
    D -->|超过 keep_firing_for| A
    D -->|expr = true| C

状态说明:

状态	说明
Inactive	告警表达式结果为假,告警未激活
Pending	表达式为真,但未达到 `for` 持续时间
Firing	表达式为真且持续时间超过 `for` 阈值
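Keep Firing	The expression is no longer true, but the alert keeps firing until `keep_firing_for` has elapsed (Prometheus still reports the state as `firing`; this is not a separate state in the API)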
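The transitions in the table can be traced with a small simulation. This is a simplified Python sketch for a single series, assuming evaluations happen at a fixed interval; the function name `simulate_alert` and its parameters are illustrative and do not correspond to any Prometheus API.

```python
def simulate_alert(expr_results, interval, for_=0.0, keep_firing_for=0.0):
    """Return the alert state after each evaluation.

    expr_results: one boolean per evaluation (True = the expression matched).
    interval, for_, keep_firing_for: durations in seconds.
    """
    states = []
    state = "inactive"
    active_since = None    # when the expression first became true
    resolved_since = None  # when the expression turned false while firing
    for i, is_true in enumerate(expr_results):
        now = i * interval
        if is_true:
            resolved_since = None
            if state == "inactive":
                state, active_since = "pending", now
            if state == "pending" and now - active_since >= for_:
                state = "firing"
        elif state == "firing" and keep_firing_for > 0:
            if resolved_since is None:
                resolved_since = now
            # Keep firing until keep_firing_for has elapsed.
            if now - resolved_since >= keep_firing_for:
                state, active_since, resolved_since = "inactive", None, None
        else:
            state, active_since, resolved_since = "inactive", None, None
        states.append(state)
    return states


print(simulate_alert([True, True, True, False, False, False, True],
                     interval=60, for_=120, keep_firing_for=120))
```

With `for_=0` the alert goes straight to `firing` on the first matching evaluation, which is why a rule without `for` tends to produce short-lived alerts.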

告警模板化¶

告警的 labels 和 annotations 支持使用 Go 模板语法。

可用变量:

变量	说明	示例
`$labels`	告警实例的标签键值对	`{{ $labels.instance }}`
`$value`	告警表达式的评估值	`{{ $value }}`
`$externalLabels`	全局配置的外部标签	`{{ $externalLabels.cluster }}`

模板函数:

annotations:
  # 格式化百分比
  usage: "{{ $value | humanizePercentage }}"

  # 格式化数字(添加单位)
  memory: "{{ $value | humanize }}B"

  # 格式化为 1024 进制
  disk: "{{ $value | humanize1024 }}B"

  # 格式化时间戳
  timestamp: "{{ $value | humanizeTimestamp }}"

完整模板示例:

groups:
  - name: template_examples
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
          namespace: "{{ $labels.namespace }}"
          pod: "{{ $labels.pod }}"
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} 频繁重启"
          description: |
            命名空间: {{ $labels.namespace }}
            Pod: {{ $labels.pod }}
            容器: {{ $labels.container }}
            重启速率: {{ $value | humanize }} 次/秒
            集群: {{ $externalLabels.cluster }}
          grafana: "https://grafana/d/pods?var-namespace={{ $labels.namespace }}&var-pod={{ $labels.pod }}"

告警最佳实践¶

1. 合理设置 for 持续时间

# ❌ 不好 - 可能产生大量瞬时告警
- alert: HighCPU
  expr: cpu_usage > 0.8
  # 没有 for,立即告警

# ✅ 好 - 只在持续高负载时告警
- alert: HighCPU
  expr: cpu_usage > 0.8
  for: 10m  # 持续 10 分钟才告警

2. 使用 keep_firing_for 防止抖动

- alert: ServiceDown
  expr: up{job="myservice"} == 0
  for: 5m
  keep_firing_for: 10m  # 服务恢复后继续告警 10 分钟
  annotations:
    summary: "服务可能存在不稳定问题"

3. 分级告警严重程度

groups:
  - name: disk_alerts
    rules:
      # 警告级别
      - alert: DiskSpaceWarning
        expr: disk_free_percent < 20
        for: 30m
        labels:
          severity: warning

      # 严重级别
      - alert: DiskSpaceCritical
        expr: disk_free_percent < 10
        for: 15m
        labels:
          severity: critical

      # 紧急级别
      - alert: DiskSpaceEmergency
        expr: disk_free_percent < 5
        for: 5m
        labels:
          severity: emergency

4. 提供详细的上下文信息

- alert: DatabaseConnectionPoolExhausted
  expr: db_connection_pool_usage > 0.95
  for: 5m
  labels:
    severity: critical
    component: database
  annotations:
    summary: "{{ $labels.instance }} 数据库连接池即将耗尽"
    description: |
      数据库连接池使用率: {{ $value | humanizePercentage }}
      实例: {{ $labels.instance }}
      数据库: {{ $labels.database }}
    impact: "可能导致应用无法建立新的数据库连接,影响业务"
    action: |
      1. 检查是否存在连接泄漏
      2. 考虑增加连接池大小
      3. 检查慢查询
    runbook: "https://wiki.example.com/runbooks/db-connection-pool"
    dashboard: "https://grafana.example.com/d/db-dashboard?var-instance={{ $labels.instance }}"

查看告警状态¶

1. 通过 Prometheus UI

访问 http://prometheus:9090/alerts 查看所有告警的当前状态。

2. 通过 API 查询

# List all active alerts (both pending and firing)
curl http://localhost:9090/api/v1/alerts

# 查询 ALERTS 指标
curl -g 'http://localhost:9090/api/v1/query?query=ALERTS'

3. ALERTS 合成指标

Prometheus 为每个告警自动创建 ALERTS 时间序列:

# 查询所有 firing 状态的告警
ALERTS{alertstate="firing"}

# 查询所有 pending 状态的告警
ALERTS{alertstate="pending"}

# 查询特定告警
ALERTS{alertname="InstanceDown", alertstate="firing"}

集成 Alertmanager¶

Prometheus 负责告警检测,Alertmanager 负责告警通知、分组、抑制和静默。

在 prometheus.yml 中配置:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager1:9093'
            - 'alertmanager2:9093'
      timeout: 10s

告警流程:

graph LR
    A[Prometheus<br/>评估规则] -->|告警触发| B[Prometheus<br/>发送告警]
    B --> C[Alertmanager<br/>接收告警]
    C --> D[分组<br/>Group]
    D --> E[抑制<br/>Inhibition]
    E --> F[静默<br/>Silence]
    F --> G[路由<br/>Route]
    G --> H1[Email]
    G --> H2[Slack]
    G --> H3[PagerDuty]
    G --> H4[Webhook]

3.6 Template Examples (模板示例)¶

Prometheus 在告警的注解和标签中支持模板化,也支持在控制台页面中使用模板。模板基于 Go 模板系统,可以执行查询、迭代数据、使用条件语句和格式化数据。

告警字段模板¶

基本模板:

- alert: InstanceDown
  expr: up == 0
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "实例 {{ $labels.instance }} 宕机"
    description: "{{ $labels.instance }} (job: {{ $labels.job }}) 已宕机超过 5 分钟"

⚠️ 告警模板在每次规则迭代时对每个触发的告警执行,应保持查询和模板的轻量化。对于复杂模板,建议链接到控制台页面。

简单迭代¶

显示实例列表及其状态:

{{ range query "up" }}
  {{ .Labels.instance }} {{ .Value }}
{{ end }}

输出示例:

localhost:9090 1
localhost:9100 1
server1:9100 0

特殊变量 . 包含当前循环迭代的样本值。

显示单个值¶

获取特定指标值:

{{ with query "some_metric{instance='someinstance'}" }}
  {{ . | first | value | humanize }}
{{ end }}

说明:

with 语句用于错误处理,防止查询无结果时出错
first 获取第一个结果
value 提取数值
humanize 格式化数字

使用控制台 URL 参数¶

动态查询参数:

{{ with printf "node_memory_MemTotal{job='node',instance='%s'}" .Params.instance | query }}
  {{ . | first | value | humanize1024 }}B
{{ end }}

访问方式:

console.html?instance=hostname

此时 .Params.instance 的值为 hostname。

高级迭代示例¶

显示网络接口流量表格:

<table>
{{ range printf "node_network_receive_bytes{job='node',instance='%s',device!='lo'}" .Params.instance | query | sortByLabel "device"}}
  <tr><th colspan=2>{{ .Labels.device }}</th></tr>
  <tr>
    <td>接收</td>
    <td>{{ with printf "rate(node_network_receive_bytes{job='node',instance='%s',device='%s'}[5m])" .Labels.instance .Labels.device | query }}{{ . | first | value | humanize }}B/s{{end}}</td>
  </tr>
  <tr>
    <td>发送</td>
    <td>{{ with printf "rate(node_network_transmit_bytes{job='node',instance='%s',device='%s'}[5m])" .Labels.instance .Labels.device | query }}{{ . | first | value | humanize }}B/s{{end}}</td>
  </tr>
{{ end }}
</table>

说明:

sortByLabel "device" 按设备名称排序
在 range 循环中,. 变为循环变量,因此 .Params.instance 不再可用

定义可重用模板¶

定义模板:

{{/* 定义模板 */}}
{{define "myTemplate"}}
  执行某些操作
{{end}}

{{/* 使用模板 */}}
{{template "myTemplate"}}

多参数模板:

{{define "myMultiArgTemplate"}}
  第一个参数: {{.arg0}}
  第二个参数: {{.arg1}}
{{end}}

{{/* 使用 args 函数传递多个参数 */}}
{{template "myMultiArgTemplate" (args 1 2)}}

常用模板函数¶

函数	说明	示例	输出
`humanize`	Format a number with SI (base-1000) prefixes	`{{ 1234567 | humanize }}`	`1.235M`
`humanize1024`	Format a number with base-1024 prefixes	`{{ 1048576 | humanize1024 }}B`	`1MiB`
`humanizePercentage`	Format a ratio as a percentage	`{{ 0.8234 | humanizePercentage }}`	`82.34%`
`humanizeDuration`	Format a duration given in seconds	`{{ 3661 | humanizeDuration }}`	`1h 1m 1s`
`humanizeTimestamp`	Format a Unix timestamp (UTC)	`{{ 1609459200 | humanizeTimestamp }}`	`2021-01-01 00:00:00 +0000 UTC`
`title`	Capitalize the first letter of each word	`{{ "hello" | title }}`	`Hello`
`toUpper`	Convert to upper case	`{{ "hello" | toUpper }}`	`HELLO`
`toLower`	Convert to lower case	`{{ "HELLO" | toLower }}`	`hello`
`stripPort`	Remove the port from a host:port string	`{{ "host:8080" | stripPort }}`	`host`
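To get a feel for how the `humanize*` functions round and scale values, the Python sketch below approximates three of them. It is an illustration only and not the Prometheus implementation: the Python function names are made up, values are formatted to four significant digits, and `humanize` here skips the sub-unit prefixes (m, u, n, ...) that apply to values below 1.

```python
import math


def _scale(value, base, prefixes):
    """Divide value by base until it drops below base, tracking the prefix."""
    prefix = ""
    for p in prefixes:
        if abs(value) < base:
            break
        prefix = p
        value /= base
    return "%.4g%s" % (value, prefix)


def humanize(value):
    """Approximate `humanize` for |value| >= 1 (SI prefixes, base 1000)."""
    if math.isnan(value) or math.isinf(value) or abs(value) < 1:
        return "%.4g" % value  # sub-unit prefixes are omitted in this sketch
    return _scale(value, 1000, ["k", "M", "G", "T", "P", "E", "Z", "Y"])


def humanize1024(value):
    """Approximate `humanize1024` (binary prefixes, base 1024)."""
    if math.isnan(value) or math.isinf(value) or abs(value) <= 1:
        return "%.4g" % value
    return _scale(value, 1024, ["ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi", "Yi"])


def humanize_percentage(ratio):
    """Approximate `humanizePercentage`: a ratio of 0.5 becomes 50%."""
    return "%.4g%%" % (ratio * 100)


print(humanize(1234567), humanize1024(1048576) + "B", humanize_percentage(0.8234))
```

Note that `humanize1024` returns only the number and prefix, which is why templates usually append the unit (`B`) themselves.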

实用模板示例¶

1. 格式化告警消息:

annotations:
  summary: |
    {{ $labels.job }} 服务异常

  description: |
    Service: {{ $labels.job }}
    Instance: {{ $labels.instance }}
    Status: {{ if eq $value 0.0 }}down{{ else }}running{{ end }}

  metrics: |
    CPU 使用率: {{ with query (printf "instance:cpu_usage:ratio{instance='%s'}" $labels.instance) }}{{ . | first | value | humanizePercentage }}{{ end }}
    内存使用率: {{ with query (printf "instance:memory_usage:ratio{instance='%s'}" $labels.instance) }}{{ . | first | value | humanizePercentage }}{{ end }}

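2. Generate a runbook link dynamically (inside annotation templates `$labels` holds the labels of the series that triggered the alert, so `alertname` and labels set in the rule's `labels` block are not available there):

annotations:
  runbook: |
    https://wiki.example.com/runbooks/{{ $labels.job | toLower | reReplaceAll " " "-" }}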

3. Conditional formatting (the conditions use `$value`, because rule-level labels such as `severity` are not part of `$labels` inside annotation templates, and one annotation cannot read another):

annotations:
  severity_emoji: |
    {{ if gt $value 0.9 }}🔴{{ else if gt $value 0.7 }}⚠️{{ else }}ℹ️{{ end }}

  message: |
    {{ if gt $value 0.9 }}
      Critical: usage is above 90%
    {{ else if gt $value 0.7 }}
      Warning: usage is above 70%
    {{ else }}
      Notice: usage is {{ $value | humanizePercentage }}
    {{ end }}

4. 表格格式化:

annotations:
  details: |
    | 指标 | 当前值 | 阈值 |
    |------|--------|------|
    | CPU 使用率 | {{ with query "instance:cpu_usage:ratio" }}{{ . | first | value | humanizePercentage }}{{ end }} | 80% |
    | 内存使用率 | {{ with query "instance:memory_usage:ratio" }}{{ . | first | value | humanizePercentage }}{{ end }} | 80% |
    | 磁盘使用率 | {{ with query "instance:disk_usage:ratio" }}{{ . | first | value | humanizePercentage }}{{ end }} | 80% |

模板调试技巧¶

1. 使用 promtool 验证模板:

promtool check rules rules.yml

2. 在告警注解中输出调试信息:

annotations:
  debug: |
    Labels: {{ $labels }}
    Value: {{ $value }}
    External Labels: {{ $externalLabels }}

3. 测试查询结果:

annotations:
  query_result: |
    {{ with query "up{instance='localhost:9090'}" }}
      结果数量: {{ . | len }}
      第一个值: {{ . | first | value }}
    {{ else }}
      查询无结果
    {{ end }}

总结¶

通过本节,我们学习了 Prometheus 规则和模板的核心内容:

Recording Rules (记录规则):

预计算频繁使用或计算昂贵的表达式
提供命名最佳实践 (level:metric:operations)
支持规则分组、评估间隔和限制配置
适用于仪表板、告警和长期存储优化

Alerting Rules (告警规则):

基于 PromQL 表达式定义告警条件
支持告警状态转换 (Inactive → Pending → Firing)
提供 for 和 keep_firing_for 控制告警行为
与 Alertmanager 集成实现通知路由

Template Examples (模板示例):

基于 Go 模板系统的强大模板化能力
支持在告警标签、注解和控制台中使用
提供丰富的模板函数 (humanize、格式化等)
可定义可重用模板,提高配置复用性

这些功能组合使用,能够构建强大的监控和告警系统,满足复杂的生产环境需求。

4. Querying (查询)¶

4.1 Basics (基础)¶

Prometheus 提供了一种功能强大的查询语言 PromQL (Prometheus Query Language),允许用户实时选择和聚合时间序列数据。

PromQL 查询类型¶

1. Instant Query (即时查询)

在单个时间点评估
返回该时刻的数据
UI 中使用 "Table" 标签页

2. Range Query (范围查询)

在起始和结束时间之间的等间隔步长评估
相当于在多个时间戳运行即时查询
UI 中使用 "Graph" 标签页

graph LR
    A[PromQL 查询] --> B[Instant Query<br/>即时查询]
    A --> C[Range Query<br/>范围查询]
    B --> D[返回单个时间点数据<br/>Table视图]
    C --> E[返回时间范围数据<br/>Graph视图]

表达式数据类型¶

PromQL 表达式可以评估为以下四种类型之一:

数据类型	说明	示例用途
`Instant Vector` (即时向量)	包含单个时间戳的一组时间序列,每个序列一个样本	当前 CPU 使用率
`Range Vector` (范围向量)	包含一段时间内数据点的一组时间序列	过去5分钟的请求数据
`Scalar` (标量)	简单的浮点数值	阈值、常数
`String` (字符串)	简单的字符串值(当前未使用)	-

查询类型限制:

即时查询: 支持所有数据类型
范围查询: 只支持 Scalar 和 Instant Vector

本部分详细介绍 PromQL 查询语言的基础知识、操作符、函数和实用示例。由于内容较多,以下是简明版本,完整文档请参考官方文档。

核心概念:

Instant Query: 即时查询,返回单个时间点数据
Range Query: 范围查询,返回时间范围内数据

数据类型:

Instant Vector (即时向量) - 时间序列集合,每个序列一个样本
Range Vector (范围向量) - 时间序列集合,包含一段时间的数据
Scalar (标量) - 浮点数值
String (字符串) - 字符串值

选择器语法:

# 基本选择
http_requests_total

# 标签过滤
http_requests_total{job="prometheus", method="GET"}

# 正则匹配
http_requests_total{job=~".*server", status!~"4.."}

# 范围向量
http_requests_total[5m]

# 时间修饰符
http_requests_total offset 5m
http_requests_total @ 1609746000

标签匹配运算符:

= : 等于
!= : 不等于
=~ : 正则匹配
!~ : 正则不匹配
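The four matchers can be modelled in a few lines. The Python sketch below is an approximation for illustration: it uses Python's `re` module rather than the RE2 syntax Prometheus uses, and the function name `matches` is hypothetical. It does reflect two PromQL behaviours: regular expressions are fully anchored, and a missing label is treated as an empty string.

```python
import re


def matches(labels, matchers):
    """Return True if a series' labels satisfy every (name, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like ""
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # fully anchored
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator {op!r}")
        if not ok:
            return False
    return True


series = {"__name__": "http_requests_total", "job": "api-server", "status": "200"}
print(matches(series, [("job", "=~", ".*server"), ("status", "!~", "4..")]))
```

Because of the anchoring, `job=~"api"` does not match `api-server`; the pattern would need to be `api.*`.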

4.2 Operators (操作符)¶

算术操作符: + - * / % ^

比较操作符: == != > < >= <=

逻辑/集合操作符:

and - 交集
or - 并集
unless - 补集

向量匹配:

# One-to-One 匹配
method_code:http_errors:rate5m{code="500"} 
  / ignoring(code) 
  method:http_requests:rate5m

# Many-to-One 匹配
method_code:http_errors:rate5m 
  / ignoring(code) group_left 
  method:http_requests:rate5m

聚合操作符:

sum, avg, min, max, count
topk, bottomk, quantile
stddev, stdvar, group
count_values

# 按标签聚合
sum by (job) (http_requests_total)
sum without (instance) (http_requests_total)

# 取最大的 k 个值
topk(5, http_requests_total)

# 分位数
quantile(0.95, http_request_duration_seconds)

4.3 Functions (函数)¶

速率函数:

rate(http_requests_total[5m])           # 平均速率
irate(http_requests_total[5m])          # 瞬时速率
increase(http_requests_total[1h])       # 总增量
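The core idea behind `increase()` and `rate()` on a counter, including the handling of counter resets, can be sketched as follows. This is a deliberately simplified Python model: the real functions also extrapolate to the edges of the time window, which is omitted here, and the names `simple_increase` and `simple_rate` are made up for this example.

```python
def simple_increase(samples):
    """Total counter increase over (timestamp, value) samples, handling resets.

    When a value drops, the counter is assumed to have restarted from zero,
    so the new value itself counts as the increase for that step.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    duration = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / duration


# The counter resets between t=15 and t=30 (20 -> 5).
samples = [(0, 10), (15, 20), (30, 5), (45, 15)]
print(simple_increase(samples), simple_rate(samples))
```

This also shows why `rate()` should be applied before aggregating with `sum`: a reset can only be detected on an individual series, not on a sum of series.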

时间聚合函数:

avg_over_time(node_cpu_usage[5m])
max_over_time(node_cpu_usage[24h])
min_over_time(node_cpu_usage[24h])
sum_over_time(http_requests_total[1h])
quantile_over_time(0.9, latency[5m])

直方图函数:

# 计算 P90 延迟
histogram_quantile(0.9, 
  rate(http_request_duration_seconds_bucket[10m])
)

# 计算平均值
histogram_avg(rate(http_request_duration_seconds[5m]))

# 计算区间占比
histogram_fraction(0, 0.2, rate(http_request_duration_seconds[1h]))

数学函数:

abs(v)            # 绝对值
ceil(v)           # 向上取整
floor(v)          # 向下取整
round(v)          # 四舍五入
sqrt(v)           # 平方根
ln(v), log2(v), log10(v)   # 对数
exp(v)            # 指数
clamp_min(v, min), clamp_max(v, max)  # 钳位

缺失数据检测:

absent(up{job="critical-service"})               # 检测指标缺失
absent_over_time(http_requests_total{job="api"}[1h])  # 时间范围内缺失

预测函数:

predict_linear(node_filesystem_free_bytes[1h], 4*3600)  # 预测未来值
deriv(node_temperature_celsius[1h])     # 计算导数

标签操作:

label_replace(v, "dst", "$1", "src", "(.*):.*")  # 替换标签
label_join(v, "dst", ",", "src1", "src2")        # 连接标签

时间函数:

time()                    # 当前时间戳
timestamp(v)              # 样本时间戳
year(), month(), day_of_month(), hour(), minute()  # 时间组件

4.4 Examples (示例)¶

1. 计算请求速率

# QPS
sum(rate(http_requests_total[5m]))

# 按服务聚合
sum by (service) (rate(http_requests_total[5m]))

2. 计算成功率/错误率

# 成功率
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# 错误率
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (service) (rate(http_requests_total[5m]))

3. 计算资源使用率

# CPU 使用率
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# 内存使用率
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100

# 磁盘使用率
(node_filesystem_size_bytes - node_filesystem_avail_bytes) 
/ node_filesystem_size_bytes * 100

4. 计算延迟分位数

# P50/P90/P99 延迟
histogram_quantile(0.5, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.9, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
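For classic histograms, `histogram_quantile` estimates a quantile by linear interpolation inside the bucket that contains the requested rank. The Python sketch below illustrates that idea under simplifying assumptions: buckets are given as cumulative counts, there are at least two buckets, upper bounds are non-negative, and `0 < q <= 1`. It is not the Prometheus implementation, and the bucket values are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from classic histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
    ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


buckets = [(0.1, 10), (0.5, 30), (1.0, 40), (math.inf, 40)]
print(histogram_quantile(0.5, buckets))
```

The interpolation explains why the accuracy of a reported quantile depends on the bucket layout: the estimate can never be more precise than the width of the bucket it falls into.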

5. Top K 查询

# Top 5 instances by CPU utilization (1 minus the idle fraction)
topk(5, 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

# 请求量最大的 10 个服务
topk(10, sum by (service) (rate(http_requests_total[5m])))

6. 预测和趋势

# 预测 4 小时后磁盘是否会满
predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0

# 流量环比增长
(rate(http_requests_total[5m]) - rate(http_requests_total[5m] offset 1h))
/ rate(http_requests_total[5m] offset 1h)

7. 计算 Uptime

# 运行时长(秒)
time() - node_boot_time_seconds

# 转换为天
(time() - node_boot_time_seconds) / 86400

8. 统计计数

# 运行中的实例数
count(up == 1)

# 每个应用的实例数
count by (app) (up)

# 不同版本的实例数
count_values("version", build_version)

9. SLA 可用性

# 过去 30 天的可用性百分比
sum(rate(http_requests_total{status=~"2..|3.."}[30d]))
/
sum(rate(http_requests_total[30d]))
* 100

10. 告警相关查询

# 检测服务宕机
up{job="critical-service"} == 0

# 检测指标缺失
absent(up{job="critical-service"})

# 高错误率
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) > 0.05

# Sustained high CPU load (non-idle fraction averaged over all CPUs)
(1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8

PromQL 最佳实践¶

1. 性能优化

✅ 使用记录规则预计算复杂查询
✅ 先聚合后查询,减少时间序列数量
✅ 避免高基数标签
✅ 使用合适的时间范围

2. 准确性

✅ Counter 使用 rate() 或 increase()
✅ Gauge 使用 delta() 或直接查询
✅ 计算速率时先 rate() 再聚合
✅ 注意操作符优先级,必要时使用括号

3. 可读性

✅ 使用有意义的记录规则名称
✅ 添加注释说明查询目的
✅ 将复杂查询分解为多个步骤
✅ 保持一致的格式和缩进

4. 告警查询

✅ 使用 rate() 而非 irate() (更稳定)
✅ 设置合适的 for 持续时间避免抖动
✅ 使用 absent() 检测缺失指标
✅ 考虑使用布尔运算符提高可读性

4.5 API (HTTP API)¶

Prometheus 提供了完整的 HTTP API 用于查询数据和管理服务器。当前稳定版本API位于 /api/v1 端点。

API 响应格式¶

所有 API 返回 JSON 格式:

{
  "status": "success" | "error",
  "data": <...>,

  // 错误时包含
  "errorType": "<string>",
  "error": "<string>",

  // 警告信息(可选)
  "warnings": ["<string>"],
  "infos": ["<string>"]
}

HTTP 状态码:

200 - 成功
400 - 参数错误
422 - 表达式无法执行
503 - 查询超时或中止

1. 表达式查询 API¶

即时查询 (Instant Query)

# GET 请求
GET /api/v1/query

# 参数
query=<string>        # PromQL 表达式
time=<timestamp>      # 评估时间戳(可选,默认当前时间)
timeout=<duration>    # 超时时间(可选)
limit=<number>        # 返回序列数量限制(可选)

示例:

# 查询 up 指标在特定时间点的值
curl 'http://localhost:9090/api/v1/query?query=up&time=2015-07-01T20:10:51.781Z'

# Query the per-second CPU time rate of every instance (-g stops curl from treating [5m] as a glob range)
curl -g 'http://localhost:9090/api/v1/query?query=rate(node_cpu_seconds_total[5m])'

响应示例:

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "up",
          "job": "prometheus",
          "instance": "localhost:9090"
        },
        "value": [1435781451.781, "1"]
      }
    ]
  }
}
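A response of this shape can be flattened into a plain mapping. The following Python sketch assumes the envelope shown above (`status`, `data.resultType`, `data.result`); the helper name `vector_to_dict` and the choice of the `instance` label as key are illustrative.

```python
import json


def vector_to_dict(payload, label="instance"):
    """Map the chosen label of each series to its float sample value."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise RuntimeError(f'{body.get("errorType")}: {body.get("error")}')
    if body["data"]["resultType"] != "vector":
        raise ValueError("expected an instant vector result")
    # Each sample is [unix_timestamp, "value-as-string"].
    return {
        s["metric"].get(label, ""): float(s["value"][1])
        for s in body["data"]["result"]
    }


payload = """
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {"__name__": "up", "job": "prometheus",
                                 "instance": "localhost:9090"},
                      "value": [1435781451.781, "1"]}]}}
"""
print(vector_to_dict(payload))
```

Sample values arrive as strings so that `NaN` and `+Inf` can be represented in JSON; `float()` accepts those spellings as well.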

范围查询 (Range Query)

GET /api/v1/query_range

# 参数
query=<string>    # PromQL 表达式
start=<timestamp> # 开始时间
end=<timestamp>   # 结束时间
step=<duration>   # 查询步长
timeout=<duration># 超时(可选)
limit=<number>    # 序列数量限制(可选)

示例:

# 查询过去 30 秒的 up 指标,步长 15s
curl 'http://localhost:9090/api/v1/query_range?query=up&start=2015-07-01T20:10:30.781Z&end=2015-07-01T20:11:00.781Z&step=15s'
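A range query behaves like an instant query evaluated at `start`, `start + step`, and so on up to `end`. The Python sketch below lists those evaluation timestamps; the function name is hypothetical and the timestamps are plain numbers in seconds.

```python
def evaluation_timestamps(start, end, step):
    """Timestamps at which a range query is evaluated: start, start+step, ... <= end."""
    if step <= 0:
        raise ValueError("step must be positive")
    count = int((end - start) // step) + 1
    return [start + i * step for i in range(count)]


# A 30-second window with a 15-second step yields three evaluation points.
print(evaluation_timestamps(0, 30, 15))
```

The number of points grows with the window and shrinks with the step, so for long time ranges a larger `step` keeps the response size manageable.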

2. 元数据查询 API¶

查询时间序列 (Series)

GET /api/v1/series

# 参数
match[]=<selector>  # 序列选择器(必需,可重复)
start=<timestamp>   # 开始时间(可选)
end=<timestamp>     # 结束时间(可选)
limit=<number>      # 返回数量限制(可选)

示例:

# 查询匹配的时间序列
curl -g 'http://localhost:9090/api/v1/series?match[]=up&match[]=process_start_time_seconds{job="prometheus"}'

查询标签名称

GET /api/v1/labels

# 参数
start=<timestamp>     # 开始时间(可选)
end=<timestamp>       # 结束时间(可选)
match[]=<selector>    # 过滤选择器(可选)
limit=<number>        # 返回数量限制(可选)

示例:

curl 'http://localhost:9090/api/v1/labels'

响应:

{
  "status": "success",
  "data": [
    "__name__",
    "instance",
    "job",
    "method",
    "status"
  ]
}

查询标签值

GET /api/v1/label/<label_name>/values

# 示例
curl 'http://localhost:9090/api/v1/label/job/values'

3. 目标和规则查询¶

查询抓取目标

GET /api/v1/targets

# 参数
state=<active|dropped|any>  # 过滤目标状态(可选)
scrapePool=<name>            # 过滤抓取池(可选)

示例:

curl 'http://localhost:9090/api/v1/targets?state=active'

查询规则

GET /api/v1/rules

# 参数
type=<alert|record>       # 过滤规则类型(可选)
rule_name[]=<name>        # 过滤规则名称(可选)
rule_group[]=<name>       # 过滤规则组(可选)
file[]=<path>             # 过滤文件路径(可选)
exclude_alerts=<bool>     # 排除告警(可选)

查询告警

GET /api/v1/alerts

# 返回所有活跃告警
curl 'http://localhost:9090/api/v1/alerts'

4. 状态和配置 API¶

查询配置

GET /api/v1/status/config

# 返回当前加载的配置文件(YAML格式)

查询启动标志

GET /api/v1/status/flags

# 返回 Prometheus 启动参数

查询运行时信息

GET /api/v1/status/runtimeinfo

# Returns runtime information (start time, time series count, goroutine count, storage retention, etc.)

查询构建信息

GET /api/v1/status/buildinfo

# 返回版本、Git 版本号、Go 版本等

查询 TSDB 统计

GET /api/v1/status/tsdb

# 参数
limit=<number>  # 限制返回项数(默认10)

# 返回时间序列数据库的基数统计

5. 管理 API (需启用 --web.enable-admin-api)¶

创建快照

POST /api/v1/admin/tsdb/snapshot

# 参数
skip_head=<bool>  # 跳过 head block(可选)

删除序列数据

POST /api/v1/admin/tsdb/delete_series

# 参数
match[]=<selector>  # 序列选择器(必需,可重复)
start=<timestamp>   # 开始时间(可选)
end=<timestamp>     # 结束时间(可选)

清理墓碑

POST /api/v1/admin/tsdb/clean_tombstones

# 清理已删除的数据,释放磁盘空间

6. 实用工具 API¶

格式化查询

GET /api/v1/format_query

# 参数
query=<string>  # PromQL 表达式

# 返回格式化后的查询字符串
curl 'http://localhost:9090/api/v1/format_query?query=foo/bar'

解析查询为 AST

GET /api/v1/parse_query

# 参数
query=<string>  # PromQL 表达式

# 返回查询的抽象语法树(JSON格式)

7. 常用 API 调用示例¶

1. 查询当前 CPU 使用率

curl -g 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'

2. 查询过去 1 小时的请求速率

curl -g 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=rate(http_requests_total[5m])' \
  --data-urlencode "start=$(date -u -d '1 hour ago' +%s)" \
  --data-urlencode "end=$(date -u +%s)" \
  --data-urlencode 'step=60s'

3. 查询特定 job 的所有标签

curl -g 'http://localhost:9090/api/v1/series?match[]=up{job="prometheus"}'

4. 查询所有 job 的标签值

curl 'http://localhost:9090/api/v1/label/job/values'

5. 查询当前活跃的抓取目标

curl 'http://localhost:9090/api/v1/targets?state=active' | jq '.data.activeTargets[] | {instance: .labels.instance, health: .health}'

6. 查询所有 firing 状态的告警

curl 'http://localhost:9090/api/v1/alerts' | jq '.data.alerts[] | select(.state=="firing")'

7. 获取 Prometheus 版本信息

curl 'http://localhost:9090/api/v1/status/buildinfo' | jq '.data.version'

8. 查询 TSDB 中的时间序列数量

curl 'http://localhost:9090/api/v1/status/runtimeinfo' | jq '.data.timeSeriesCount'

8. API 使用最佳实践¶

1. 使用 POST 方式发送复杂查询

对于长查询字符串,使用 POST 避免 URL 长度限制:

curl -X POST 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (job) (rate(http_requests_total{status=~"5..",path=~"/api/.*"}[5m]))'

2. 合理设置超时

curl 'http://localhost:9090/api/v1/query?query=complex_query&timeout=30s'

3. 使用 limit 限制返回数据量

curl 'http://localhost:9090/api/v1/query?query=up&limit=100'

4. URL 编码参数

使用 --data-urlencode 或手动编码:

# 使用 --data-urlencode
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=http_requests_total{method="GET"}'

# 手动编码
curl 'http://localhost:9090/api/v1/query?query=http_requests_total%7Bmethod%3D%22GET%22%7D'
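The same encoding can be produced in code instead of by hand. This is a small Python sketch using only the standard library; the helper name `query_url` is made up for the example.

```python
from urllib.parse import urlencode


def query_url(base_url, promql):
    """Build an instant-query URL with the PromQL expression percent-encoded."""
    return f"{base_url}/api/v1/query?" + urlencode({"query": promql})


print(query_url("http://localhost:9090", 'http_requests_total{method="GET"}'))
```

Letting a library do the encoding avoids mistakes with characters such as `{`, `"`, `[` and `+` that are common in PromQL.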

5. 处理 JSON 响应

使用 jq 处理 JSON:

# 提取数据部分
curl 'http://localhost:9090/api/v1/query?query=up' | jq '.data'

# 提取所有指标值
curl 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result[].value[1]'

# 检查状态
curl 'http://localhost:9090/api/v1/query?query=up' | jq -r '.status'

6. 错误处理

# "sum(" is deliberately malformed PromQL, so the API answers with status "error"
response=$(curl -s -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=sum(')
status=$(echo "$response" | jq -r '.status')

if [ "$status" = "error" ]; then
  echo "Error: $(echo "$response" | jq -r '.error')"
  echo "Type: $(echo "$response" | jq -r '.errorType')"
fi

9. 编程语言客户端示例¶

Python 示例:

import requests
import json
from datetime import datetime, timedelta

# 基础配置
PROMETHEUS_URL = "http://localhost:9090"

# 即时查询
def instant_query(query):
    url = f"{PROMETHEUS_URL}/api/v1/query"
    params = {"query": query}
    response = requests.get(url, params=params)
    return response.json()

# 范围查询
def range_query(query, start, end, step="15s"):
    url = f"{PROMETHEUS_URL}/api/v1/query_range"
    params = {
        "query": query,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": step
    }
    response = requests.get(url, params=params)
    return response.json()

# 使用示例
if __name__ == "__main__":
    # 查询当前 CPU 使用率
    result = instant_query('rate(node_cpu_seconds_total[5m])')
    print(json.dumps(result, indent=2))

    # 查询过去 1 小时的数据
    end = datetime.now()
    start = end - timedelta(hours=1)
    result = range_query('up', start, end, '60s')
    print(json.dumps(result, indent=2))

Go 示例:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/prometheus/client_golang/api"
    v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
    client, err := api.NewClient(api.Config{
        Address: "http://localhost:9090",
    })
    if err != nil {
        panic(err)
    }

    v1api := v1.NewAPI(client)
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // 即时查询
    result, warnings, err := v1api.Query(ctx, "up", time.Now())
    if err != nil {
        panic(err)
    }
    if len(warnings) > 0 {
        fmt.Printf("Warnings: %v\n", warnings)
    }
    fmt.Printf("Result: %v\n", result)

    // 范围查询
    r := v1.Range{
        Start: time.Now().Add(-time.Hour),
        End:   time.Now(),
        Step:  time.Minute,
    }
    result, warnings, err = v1api.QueryRange(ctx, "rate(http_requests_total[5m])", r)
    if err != nil {
        panic(err)
    }
    fmt.Printf("Result: %v\n", result)
}

API 快速参考¶

API 端点	方法	用途
`/api/v1/query`	GET/POST	即时查询
`/api/v1/query_range`	GET/POST	范围查询
`/api/v1/series`	GET/POST	查询时间序列
`/api/v1/labels`	GET/POST	查询标签名称
`/api/v1/label/<name>/values`	GET	查询标签值
`/api/v1/targets`	GET	查询抓取目标
`/api/v1/rules`	GET	查询规则
`/api/v1/alerts`	GET	查询告警
`/api/v1/status/config`	GET	查询配置
`/api/v1/status/flags`	GET	查询标志
`/api/v1/status/runtimeinfo`	GET	查询运行时信息
`/api/v1/status/buildinfo`	GET	查询构建信息
`/api/v1/status/tsdb`	GET	查询 TSDB 统计
`/api/v1/admin/tsdb/snapshot`	POST	创建快照
`/api/v1/admin/tsdb/delete_series`	POST	删除序列
`/api/v1/admin/tsdb/clean_tombstones`	POST	清理墓碑

5. Storage (存储)¶

Prometheus 包含本地磁盘时间序列数据库,同时可选地与远程存储系统集成。

5.1 本地存储¶

磁盘布局¶

Block 结构:

采样数据按 2 小时 分组为 block
每个 block 包含:
chunks/ - 时间序列样本数据(最多 512MB/文件)
index - 索引文件(指标名称和标签到时间序列的映射)
meta.json - 元数据文件
tombstones - 删除记录

目录结构示例:

./data
├── 01BKGV7JBM69T2G1BGBGM6KB12/    # 2小时 block
│   ├── chunks/
│   │   └── 000001
│   ├── index
│   ├── tombstones
│   └── meta.json
├── chunks_head/                    # Head block (内存中)
│   └── 000001
└── wal/                            # 写前日志
    ├── 000000002
    └── checkpoint.00000001/
        └── 00000000

WAL (Write-Ahead Log)¶

特性:

保护内存中的当前 block 防止崩溃丢失
文件大小: 128MB/段
最少保留: 3 个 WAL 文件
高流量服务器可能保留更多(至少 2 小时原始数据)

启用 WAL 压缩:

--storage.tsdb.wal-compression
# 可将 WAL 大小减半,CPU 开销很小
# v2.11.0 引入,v2.20.0 默认启用

压缩 (Compaction)¶

初始 2 小时 blocks 会在后台压缩成更大的 blocks
最大 block 大小: 保留时间的 10% 或 31 天(取较小值)
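The upper bound on block size stated above is simple arithmetic. A minimal Python sketch, with a hypothetical function name:

```python
from datetime import timedelta


def max_block_duration(retention: timedelta) -> timedelta:
    """Largest compacted block span: 10% of the retention time, capped at 31 days."""
    return min(retention / 10, timedelta(days=31))


print(max_block_duration(timedelta(days=15)))   # default 15d retention
print(max_block_duration(timedelta(days=400)))  # long retention hits the cap
```

With the default 15-day retention the largest block covers 36 hours, so expired data is removed in steps of at most that size.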

存储配置参数¶

参数	说明	默认值
`--storage.tsdb.path`	数据库目录路径	`data/`
`--storage.tsdb.retention.time`	样本保留时间	`15d`
`--storage.tsdb.retention.size`	最大存储字节数	`0`(禁用)
`--storage.tsdb.wal-compression`	启用 WAL 压缩	`true`(v2.20+)

支持的时间单位: y, w, d, h, m, s, ms

支持的存储单位: B, KB, MB, GB, TB, PB, EB

容量规划¶

估算公式:

所需磁盘空间 = 保留时间(秒) × 每秒采样数 × 每样本字节数

每样本平均大小: 1-2 bytes

示例计算:

# 假设: 10万时间序列,15秒抓取间隔,保留15天
采样率 = 100,000 / 15 = 6,666 samples/s
所需空间 = 15天 * 86400秒 * 6,666 * 2 bytes 
         ≈ 17.3 GB
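The estimate above can be wrapped in a small helper to try other scenarios. This Python sketch simply restates the formula; the function name is hypothetical, and the default of 2 bytes per sample is the upper end of the 1-2 byte range quoted above.

```python
def required_disk_bytes(series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Disk estimate: retention seconds x samples per second x bytes per sample."""
    samples_per_second = series / scrape_interval_s
    return retention_days * 86400 * samples_per_second * bytes_per_sample


# 100,000 series, 15s scrape interval, 15 days of retention.
estimate = required_disk_bytes(100_000, 15, 15)
print(f"{estimate / 1e9:.1f} GB")
```

Doubling the scrape interval or halving the number of series halves the estimate, which matches the levers listed in this section.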

降低采样率的方法:

减少时间序列数量(更少的目标或每个目标更少的序列)
增加抓取间隔
使用记录规则预聚合

保留策略¶

时间保留:

--storage.tsdb.retention.time=30d

大小保留:

--storage.tsdb.retention.size=50GB

建议:

如果同时设置时间和大小保留,首先触发的生效
大小保留建议设置为磁盘空间的 80-85%
过期 block 清理在后台进行,最多可能需要 2 小时

数据完整性¶

文件系统要求:

✅ 必须使用 POSIX 兼容的文件系统
❌ 不支持 NFS(包括 AWS EFS)
✅ 强烈推荐使用 本地文件系统

备份建议:

使用快照 API 进行备份
非快照备份可能丢失最后 2 小时的数据(上次 WAL 同步以来)

数据恢复:

如果本地存储损坏:

备份存储目录
从备份恢复损坏的 block 目录
最后手段: 删除损坏的文件/WAL

5.2 远程存储集成¶

Prometheus 通过以下方式与远程存储系统集成:

四种集成方式¶

graph LR
    A[Prometheus 服务器] -->|Remote Write| B[远程存储]
    C[其他客户端] -->|Remote Write| A
    B -->|Remote Read| A
    A -->|Remote Read| D[查询客户端]

Remote Write (写出): Prometheus 将采集的样本写入远程 URL
Remote Write Receiver (接收): Prometheus 接收其他客户端的样本
Remote Read (读取): Prometheus 从远程 URL 读取样本数据
Remote Read Server (提供): Prometheus 为客户端提供样本数据

协议特性¶

编码方式:

Snappy 压缩的 Protocol Buffer
通过 HTTP 传输

版本:

Write 协议: 1.0 (稳定) + 2.0 (实验性)
Read 协议: 尚未稳定

配置位置: prometheus.yml 的 remote_write 和 remote_read 部分

Remote Write Receiver¶

启用方式:

prometheus --web.enable-remote-write-receiver

端点: /api/v1/write

⚠️ 注意: 这不是高效的样本采集方式,仅用于特定低流量场景,不适合替代抓取机制

Remote Read 限制¶

Prometheus 仅从远程端获取原始序列数据
所有 PromQL 评估仍在 Prometheus 本地进行
存在可扩展性限制(所有数据需先加载到查询服务器)
完全分布式 PromQL 评估目前不可行

5.3 数据回填 (Backfilling)¶

OpenMetrics 格式回填¶

用途: 从其他监控系统或时间序列数据库迁移数据

使用方法:

# 从 OpenMetrics 格式创建 blocks
promtool tsdb create-blocks-from openmetrics <input_file> [<output_dir>]

# 移动生成的 blocks 到 Prometheus 数据目录
mv <output_dir>/* /path/to/prometheus/data/

注意事项:

⚠️ 不要回填最近 3 小时的数据(可能与 head block 重叠)
每个 block 包含 2 小时的数据
需要启用 --storage.tsdb.allow-overlapping-blocks (v2.38 及以下)
回填数据受保留策略约束

调整 Block 大小:

# 使用更大的 block 持续时间(适合回填大量历史数据)
promtool tsdb create-blocks-from openmetrics input.txt \
  --max-block-duration=7d

⚠️ 注意: 大 block 会影响基于时间的保留策略效果

Recording Rules 回填¶

用途: 为新创建的记录规则生成历史数据

使用方法:

promtool tsdb create-blocks-from rules \
  --start 1617079873 \
  --end 1617097873 \
  --url http://prometheus:9090 \
  rules.yaml rules2.yaml

限制:

重复运行会创建重复数据
所有规则都会被评估
同组规则无法看到其他规则的结果
告警会被忽略

6. Federation (联邦)¶

Federation 允许一个 Prometheus 服务器从另一个 Prometheus 服务器抓取选定的时间序列。

6.1 使用场景¶

1. 分层联邦 (Hierarchical Federation)

用于扩展到拥有数十个数据中心和数百万节点的环境。

graph TD
    A[全局 Prometheus<br/>聚合数据] --> B[数据中心1 Prometheus<br/>详细数据]
    A --> C[数据中心2 Prometheus<br/>详细数据]
    A --> D[数据中心3 Prometheus<br/>详细数据]
    B --> E[实例级监控]
    C --> F[实例级监控]
    D --> G[实例级监控]

架构特点:

下层服务器: 采集详细数据(实例级)
上层服务器: 采集聚合数据(job级)
提供全局聚合视图 + 本地详细视图

2. 跨服务联邦 (Cross-Service Federation)

一个服务的 Prometheus 从另一个服务的 Prometheus 抓取数据,实现跨服务查询和告警。

示例场景:

集群调度器: 提供资源使用信息(CPU、内存)
服务实例: 提供应用特定指标
通过联邦合并两类指标到一个服务器

6.2 配置 Federation¶

Federation 端点: /federate

必需参数: match[] - 选择要暴露的时间序列

选择器示例:

up
{job="api-server"}
{__name__=~"job:.*"}

完整配置示例:

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s

    # 不覆盖源服务器暴露的标签
    honor_labels: true

    # Federation 端点
    metrics_path: '/federate'

    # 选择要联邦的指标
    params:
      'match[]':
        - '{job="prometheus"}'           # 所有 prometheus job 的指标
        - '{__name__=~"job:.*"}'         # 所有 job: 开头的指标(聚合指标)
        - '{__name__=~"instance:.*"}'    # 所有 instance: 开头的指标

    static_configs:
      - targets:
          - 'source-prometheus-1:9090'
          - 'source-prometheus-2:9090'
          - 'source-prometheus-3:9090'

关键配置说明:

配置项	说明	必需
`honor_labels: true`	保留源服务器的标签,不覆盖	✅
`metrics_path: '/federate'`	Federation 端点路径	✅
`params.match[]`	时间序列选择器(可多个)	✅

最佳实践:

1. 只联邦必要的指标

params:
  'match[]':
    # ❌ 不好 - 联邦所有指标
    - '{__name__=~".+"}'

    # ✅ 好 - 只联邦聚合指标
    - '{__name__=~"job:.*"}'
    - '{__name__=~"instance:.*"}'

2. 使用记录规则进行预聚合

下层服务器创建记录规则:

groups:
  - name: federation_aggregation
    interval: 30s
    rules:
      # 为联邦准备的聚合指标
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: instance:node_cpu:avg
        expr: avg by (instance) (rate(node_cpu_seconds_total[5m]))

上层服务器联邦这些指标:

params:
  'match[]':
    - '{__name__=~"job:.*"}'
    - '{__name__=~"instance:.*"}'

3. 合理设置抓取间隔

scrape_interval: 30s  # Federation 可以使用较长间隔

原生直方图支持:

要通过 federation 抓取原生直方图,需要:

# 在抓取服务器启用原生直方图
prometheus --enable-feature=native-histograms

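⚠️ Note: if the federated metrics mix sample types (float64, counter histogram, gauge histogram) under the same metric name, the federation payload contains several metric families with that name. This technically violates the protobuf exposition format rules, but Prometheus still ingests all of the metrics correctly.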

7. Management API (管理 API)¶

Prometheus 提供了一组管理 API 用于自动化和集成。

7.1 健康检查 API¶

健康检查 (Health Check)

GET  /-/healthy
HEAD /-/healthy

用途: 检查 Prometheus 是否正在运行

返回: 始终返回 200

示例:

curl -s http://localhost:9090/-/healthy
# 返回 "Prometheus Server is Healthy."

# 或仅检查状态码
curl -I http://localhost:9090/-/healthy

就绪检查 (Readiness Check)

GET  /-/ready
HEAD /-/ready

用途: 检查 Prometheus 是否准备好处理查询

返回: 准备好时返回 200,否则返回 503

示例:

# 检查就绪状态
curl -s http://localhost:9090/-/ready

# Kubernetes 就绪探针配置
readinessProbe:
  httpGet:
    path: /-/ready
    port: 9090
  initialDelaySeconds: 30
  periodSeconds: 5

区别:

/-/healthy: 服务进程是否存活
/-/ready: 服务是否准备好处理流量(WAL 重放完成等)

7.2 配置管理 API¶

重新加载配置

POST /-/reload
PUT  /-/reload

用途: 触发配置和规则文件重新加载

前提: 需启用 --web.enable-lifecycle 标志

示例:

# 重新加载配置
curl -X POST http://localhost:9090/-/reload

# 或使用 PUT
curl -X PUT http://localhost:9090/-/reload

替代方法 - 发送 SIGHUP 信号:

# 查找 Prometheus 进程 ID
ps aux | grep prometheus

# 发送 SIGHUP 信号
kill -HUP <PID>

优雅关闭

POST /-/quit
PUT  /-/quit

用途: 触发 Prometheus 优雅关闭

前提: 需启用 --web.enable-lifecycle 标志

示例:

# 优雅关闭 Prometheus
curl -X POST http://localhost:9090/-/quit

替代方法 - 发送 SIGTERM 信号:

# 优雅关闭
kill -TERM <PID>

# 或使用 systemd
systemctl stop prometheus

7.3 使用场景¶

1. Kubernetes 健康和就绪探针

apiVersion: v1
kind: Pod
metadata:
  name: prometheus
spec:
  containers:
  - name: prometheus
    image: prom/prometheus:latest
    ports:
    - containerPort: 9090

    # 存活探针
    livenessProbe:
      httpGet:
        path: /-/healthy
        port: 9090
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3

    # 就绪探针
    readinessProbe:
      httpGet:
        path: /-/ready
        port: 9090
      initialDelaySeconds: 30
      periodSeconds: 5
      timeoutSeconds: 3
      successThreshold: 1
      failureThreshold: 3

2. 自动配置更新脚本

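#!/bin/bash

NEW_CONFIG=/path/to/new/prometheus.yml
LIVE_CONFIG=/etc/prometheus/prometheus.yml
BACKUP_CONFIG=/path/to/backup/prometheus.yml

# Validate the new file before it replaces the live configuration
# (relative rule_files paths are resolved against the directory of the checked file)
if ! promtool check config "$NEW_CONFIG"; then
  echo "Invalid config, not reloading"
  exit 1
fi

# Back up the current configuration, then install the new one
cp "$LIVE_CONFIG" "$BACKUP_CONFIG"
cp "$NEW_CONFIG" "$LIVE_CONFIG"

# Reload; -f makes curl exit non-zero on HTTP errors (needs --web.enable-lifecycle)
if curl -fsS -X POST http://localhost:9090/-/reload; then
  echo "Successfully reloaded configuration"
else
  echo "Failed to reload configuration"
  # Restore the previous configuration
  cp "$BACKUP_CONFIG" "$LIVE_CONFIG"
  exit 1
fi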

3. 监控脚本

import requests
import time

PROMETHEUS_URL = "http://localhost:9090"

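def check_health():
    """Return True if /-/healthy answers with HTTP 200."""
    try:
        response = requests.get(f"{PROMETHEUS_URL}/-/healthy", timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        # Connection errors and timeouts count as "not healthy"
        return False

def check_ready():
    """Return True if /-/ready answers with HTTP 200."""
    try:
        response = requests.get(f"{PROMETHEUS_URL}/-/ready", timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        # Connection errors and timeouts count as "not ready"
        return False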

# 持续监控
while True:
    health = check_health()
    ready = check_ready()

    print(f"Health: {'✓' if health else '✗'} | Ready: {'✓' if ready else '✗'}")

    if health and not ready:
        print("⚠️  Prometheus is healthy but not ready (possibly replaying WAL)")
    elif not health:
        print("🔴 Prometheus is not healthy!")

    time.sleep(10)

4. CI/CD 集成

# GitLab CI 示例
deploy:
  stage: deploy
  script:
    # 部署新配置
    - kubectl apply -f prometheus-config.yaml

    # 等待 ConfigMap 更新
    - sleep 5

    # 触发重新加载
    - |
      kubectl exec -n monitoring prometheus-0 -- \
        wget --post-data="" -O- http://localhost:9090/-/reload

    # 验证重新加载成功
    - |
      kubectl exec -n monitoring prometheus-0 -- \
        wget -O- http://localhost:9090/-/ready

总结¶

Storage (存储):

本地存储采用 2 小时 block 结构
WAL 保护内存数据防止崩溃
支持时间和大小两种保留策略
可通过 Remote Write/Read 集成远程存储
支持数据回填用于迁移和规则历史数据生成

Federation (联邦):

实现 Prometheus 服务器之间的数据共享
支持分层和跨服务两种场景
使用 /federate 端点和 match[] 参数
建议只联邦聚合指标以提高效率

Management API (管理API):

/-/healthy - 健康检查(存活探针)
/-/ready - 就绪检查(就绪探针)
/-/reload - 重新加载配置
/-/quit - 优雅关闭
后两者需启用 --web.enable-lifecycle

8. Command Line (命令行工具)¶

8.1 Prometheus 命令¶

Prometheus 监控服务器启动命令。

基本用法¶

prometheus [flags]

核心启动参数¶

配置文件:

# 指定配置文件
--config.file=prometheus.yml          # 默认: prometheus.yml

# 配置自动重载间隔
--config.auto-reload-interval=30s     # 默认: 30s

Web 服务器:

# 监听地址(可重复指定)
--web.listen-address=0.0.0.0:9090     # 默认: 0.0.0.0:9090

# 外部访问 URL
--web.external-url=http://prometheus.example.com

# TLS/认证配置
--web.config.file=/path/to/web-config.yml

# 页面标题
--web.page-title="Production Prometheus"

# CORS 源
--web.cors.origin='https?://(domain1|domain2).com'

启用功能:

# 启用生命周期 API(重载/关闭)
--web.enable-lifecycle

# 启用管理 API
--web.enable-admin-api

# 启用 Remote Write 接收器
--web.enable-remote-write-receiver

# 启用 OTLP 接收器
--web.enable-otlp-receiver

存储配置¶

本地存储 (Server 模式):

# 数据目录
--storage.tsdb.path=data/

# 时间保留
--storage.tsdb.retention.time=15d     # 单位: y,w,d,h,m,s,ms

# 大小保留
--storage.tsdb.retention.size=50GB    # 单位: B,KB,MB,GB,TB,PB,EB

# 禁用锁文件
--storage.tsdb.no-lockfile

Agent 模式存储:

# Agent 数据目录
--storage.agent.path=data-agent/

# WAL 压缩
--storage.agent.wal-compression=true

# 保留时间
--storage.agent.retention.min-time=0h
--storage.agent.retention.max-time=0h

远程存储:

# 刷新截止时间
--storage.remote.flush-deadline=1m

# Remote Read 限制
--storage.remote.read-sample-limit=50000000
--storage.remote.read-concurrent-limit=10

查询配置¶

# 查询超时
--query.timeout=2m

# 最大并发查询数
--query.max-concurrency=20

# 最大样本数
--query.max-samples=50000000

# 回溯时长
--query.lookback-delta=5m

规则和告警配置¶

规则评估:

# 最大并发规则评估数
--rules.max-concurrent-evals=4

# 告警 for 容错时间
--rules.alert.for-outage-tolerance=1h

# 告警 for 宽限期
--rules.alert.for-grace-period=10m

# 告警重发延迟
--rules.alert.resend-delay=1m

Alertmanager:

# 通知队列容量
--alertmanager.notification-queue-capacity=10000

# 批量通知大小
--alertmanager.notification-batch-size=256

# 关闭时清空队列
--alertmanager.drain-notification-queue-on-shutdown=true

特性标志¶

--enable-feature=feature1,feature2

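# Available features (excerpt):
# - exemplar-storage              exemplar storage
# - native-histograms             native histograms
# - promql-experimental-functions experimental PromQL functions
# - extra-scrape-metrics          extra scrape metrics
# - memory-snapshot-on-shutdown   in-memory snapshot on shutdown
# - otlp-deltatocumulative        OTLP delta-to-cumulative conversion
# Note: auto-gomaxprocs is default behavior since 3.0 and no longer needs a flag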

日志配置¶

# 日志级别
--log.level=info                # 选项: debug, info, warn, error

# 日志格式
--log.format=logfmt             # 选项: logfmt, json

Agent 模式¶

# 以 Agent 模式运行
--agent

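# Agent mode characteristics:
# - Forwards samples via Remote Write; keeps only a WAL on disk, no queryable local TSDB
# - No querying, rule evaluation, or alerting
# - Lighter weight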

完整启动示例¶

生产环境配置:

prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --web.listen-address=0.0.0.0:9090 \
  --web.external-url=https://prometheus.example.com \
  --web.enable-lifecycle \
  --web.enable-admin-api \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB \
  --query.timeout=2m \
  --query.max-concurrency=20 \
  --log.level=info \
  --log.format=json

Agent 模式:

prometheus \
  --agent \
  --config.file=/etc/prometheus/agent.yml \
  --storage.agent.path=/var/lib/prometheus-agent \
  --web.listen-address=0.0.0.0:9090 \
  --enable-feature=native-histograms

8.2 Promtool 工具¶

Prometheus 监控系统的工具箱,用于验证、测试、查询和调试。

基本用法¶

promtool [flags] <command> [args...]

常用命令列表¶

命令	说明
`check`	检查资源有效性
`query`	对 Prometheus 服务器执行查询
`test`	单元测试
`tsdb`	TSDB 相关命令
`debug`	获取调试信息
`push`	推送数据到 Prometheus
`promql`	PromQL 格式化和编辑(实验性)

1. Check 命令 - 验证配置¶

检查配置文件:

# 检查 Prometheus 配置
promtool check config prometheus.yml

# 只检查语法
promtool check config --syntax-only prometheus.yml

# 启用 linting
promtool check config --lint=all prometheus.yml

# Agent 模式配置
promtool check config --agent agent.yml

检查规则文件:

# 检查规则文件
promtool check rules /path/to/rules/*.yml

# 启用 linting
promtool check rules --lint=all rules.yml

# 忽略未知字段
promtool check rules --ignore-unknown-fields rules.yml

检查 Web 配置:

promtool check web-config web-config.yml

检查服务发现:

# 执行服务发现并查看结果
promtool check service-discovery prometheus.yml job_name

# 设置超时
promtool check service-discovery --timeout=30s prometheus.yml job_name

检查 Prometheus 健康状态:

# 健康检查
promtool check healthy --url=http://localhost:9090

# 就绪检查
promtool check ready --url=http://localhost:9090

检查指标格式:

# 检查指标有效性
cat metrics.prom | promtool check metrics

# 从远程检查
curl -s http://localhost:9090/metrics | promtool check metrics

2. Query 命令 - 执行查询¶

即时查询:

# 基本即时查询
promtool query instant http://localhost:9090 'up'

# 指定时间
promtool query instant \
  --time='2024-01-01T12:00:00Z' \
  http://localhost:9090 \
  'rate(http_requests_total[5m])'

范围查询:

promtool query range \
  --start='2024-01-01T00:00:00Z' \
  --end='2024-01-01T01:00:00Z' \
  --step=60s \
  http://localhost:9090 \
  'rate(http_requests_total[5m])'

查询时间序列:

# 查询匹配的序列
promtool query series \
  --match='{job="prometheus"}' \
  --match='{__name__=~"job:.*"}' \
  --start='2024-01-01T00:00:00Z' \
  --end='2024-01-01T01:00:00Z' \
  http://localhost:9090

查询标签值:

# 查询标签值
promtool query labels \
  --match='{job="prometheus"}' \
  http://localhost:9090 \
  job

分析查询:

# 分析直方图使用模式
promtool query analyze \
  --server=http://localhost:9090 \
  --type=histogram \
  --duration=1h \
  --match='{__name__=~"http_request_duration.*"}'

3. Test 命令 - 单元测试¶

测试规则:

# 运行规则单元测试
promtool test rules test.yml

# 运行特定测试组
promtool test rules --run='TestGroup.*' test.yml

# 启用调试
promtool test rules --debug test.yml

# 显示差异
promtool test rules --diff test.yml

# 输出 JUnit XML
promtool test rules --junit=results.xml test.yml

测试文件示例:

# test.yml
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{job="api",instance="0"}'
        values: '0 100 200 300 400'

    alert_rule_test:
      - eval_time: 5m
        alertname: HighRequestRate
        exp_alerts:
          - exp_labels:
              severity: warning
              job: api
            exp_annotations:
              summary: "High request rate"

4. TSDB 命令 - 数据库操作¶

列出 Blocks:

# 列出所有 blocks
promtool tsdb list data/

# 人类可读格式
promtool tsdb list -r data/

分析 TSDB:

# 分析数据库
promtool tsdb analyze data/

# 扩展分析
promtool tsdb analyze --extended data/

# 限制结果数量
promtool tsdb analyze --limit=50 data/

# 分析特定序列
promtool tsdb analyze --match='{job="api"}' data/

导出数据:

# 导出为 OpenMetrics 格式
promtool tsdb dump-openmetrics \
  --min-time=1609459200000 \
  --max-time=1609545600000 \
  --match='{job="prometheus"}' \
  data/

# 导出原始数据
promtool tsdb dump \
  --min-time=1609459200000 \
  --max-time=1609545600000 \
  data/

回填数据:

# 从 OpenMetrics 文件创建 blocks
promtool tsdb create-blocks-from openmetrics \
  input.txt \
  data/

# 为记录规则回填
promtool tsdb create-blocks-from rules \
  --start=1609459200 \
  --end=1609545600 \
  --url=http://localhost:9090 \
  --output-dir=data/ \
  --eval-interval=15s \
  rules.yml
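
The examples above use two timestamp units: the `tsdb dump` commands are given 13-digit millisecond values, while `create-blocks-from rules` is given 10-digit Unix seconds. A small conversion sketch for producing either form from a UTC date; the helper names are hypothetical.

```python
from datetime import datetime, timezone


def to_epoch_seconds(iso_utc: str) -> int:
    """Parse 'YYYY-MM-DDTHH:MM:SS' as UTC and return Unix seconds."""
    moment = datetime.strptime(iso_utc, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    return int(moment.timestamp())


def to_epoch_millis(iso_utc: str) -> int:
    """Same instant expressed in milliseconds."""
    return to_epoch_seconds(iso_utc) * 1000


if __name__ == "__main__":
    print(to_epoch_seconds("2021-01-01T00:00:00"))  # value used for --start above
    print(to_epoch_millis("2021-01-01T00:00:00"))   # value used for --min-time above
```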

性能基准测试:

promtool tsdb bench write \
  --metrics=10000 \
  --scrapes=3000 \
  --out=benchout \
  samples.json

5. Debug 命令 - 调试信息¶

获取性能分析数据:

promtool debug pprof http://localhost:9090

获取指标:

promtool debug metrics http://localhost:9090

获取所有调试信息:

promtool debug all http://localhost:9090

6. Push 命令 - 推送数据¶

# 推送指标到 Remote Write 端点
promtool push metrics \
  --label=job=test \
  --label=instance=local \
  --timeout=30s \
  http://localhost:9090/api/v1/write \
  metrics.txt

# 从标准输入读取
cat metrics.txt | promtool push metrics http://localhost:9090/api/v1/write

7. PromQL 命令 - 格式化 (实验性)¶

需要 --experimental 标志。

格式化查询:

promtool --experimental promql format \
  'sum(rate(http_requests_total[5m]))by(job)'

# 输出:
# sum by(job) (rate(http_requests_total[5m]))

编辑标签匹配器:

# 设置标签
promtool --experimental promql label-matchers set \
  --type='=' \
  'up' \
  'job' \
  'prometheus'
# 输出: up{job="prometheus"}

# 删除标签
promtool --experimental promql label-matchers delete \
  'up{job="prometheus",instance="localhost"}' \
  'instance'
# 输出: up{job="prometheus"}

实用 Shell 脚本示例¶

自动化配置验证:

#!/bin/bash
# validate-config.sh

set -e

CONFIG_FILE="prometheus.yml"
RULES_DIR="/etc/prometheus/rules"

echo "Validating Prometheus configuration..."
if promtool check config "$CONFIG_FILE"; then
    echo "✓ Configuration is valid"
else
    echo "✗ Configuration is invalid"
    exit 1
fi

echo "Validating rules..."
for rule_file in "$RULES_DIR"/*.yml; do
    if promtool check rules "$rule_file"; then
        echo "✓ $rule_file is valid"
    else
        echo "✗ $rule_file is invalid"
        exit 1
    fi
done

echo "All checks passed!"

CI/CD 集成示例:

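#!/bin/bash
# ci-validate.sh

set -euo pipefail

# Validate all configuration files.
# Process substitution keeps the loop in the current shell, so "exit 1" really stops
# the script (in "find | while ...", it would only leave the pipeline's subshell).
while read -r file; do
    if [[ $file == *"prometheus"* ]]; then
        promtool check config "$file" || exit 1
    elif [[ $file == *"rules"* ]]; then
        promtool check rules "$file" || exit 1
    fi
done < <(find . -name "*.yml" -type f)

# Run unit tests
promtool test rules tests/*.yml || exit 1

echo "✓ All validations passed"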

查询性能测试:

#!/bin/bash
# query-benchmark.sh

PROMETHEUS_URL="http://localhost:9090"
QUERIES=(
    "up"
    "rate(http_requests_total[5m])"
    "histogram_quantile(0.99, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))"
)

for query in "${QUERIES[@]}"; do
    echo "Testing: $query"
    time promtool query instant "$PROMETHEUS_URL" "$query"
    echo "---"
done

命令行工具最佳实践¶

1. 配置验证

✅ 在重载配置前始终使用 promtool check config
✅ 在 CI/CD 中集成配置验证
✅ 使用 --lint 检查常见问题

2. 规则测试

✅ 为所有规则编写单元测试
✅ 在 CI/CD 中运行 promtool test rules
✅ 使用 --debug 排查测试问题

3. TSDB 维护

✅ 定期使用 tsdb analyze 检查基数问题
✅ 使用 tsdb list 监控 block 大小
✅ 在迁移前使用 tsdb dump 导出数据

4. 查询优化

✅ 使用 query analyze 分析查询模式
✅ 在生产环境测试前先用 promtool query 验证
✅ 使用 --format 参数获取不同格式的输出

5. 安全性

✅ 使用 --http.config.file 配置 TLS 客户端
✅ 避免在命令行中暴露敏感信息
✅ 使用环境变量或配置文件传递凭证

9. Migration (迁移指南)¶

Prometheus 3.0 迁移指南¶

Prometheus 3.0 包含重大变更。本指南帮助从 2.x 迁移到 3.0+。

已移除的特性标志¶

以下特性标志已移除并成为默认行为:

旧特性标志	新行为	说明
`promql-at-modifier`	默认启用	`@` 修饰符
`promql-negative-offset`	默认启用	负偏移支持
`new-service-discovery-manager`	默认启用	新服务发现管理器
`expand-external-labels`	默认启用	外部标签支持环境变量 `${var}`
`no-default-scrape-port`	默认启用	不再自动添加端口
`agent`	使用 `--agent`	独立的 Agent 标志
`remote-write-receiver`	使用 `--web.enable-remote-write-receiver`	独立的标志
`auto-gomemlimit`	默认启用	自动设置 GOMEMLIMIT
`auto-gomaxprocs`	默认启用	自动设置 GOMAXPROCS

⚠️ 注意: 继续使用这些标志会记录警告

配置变更¶

1. 经典直方图配置重命名:

# v2.x
scrape_configs:
  - job_name: 'app'
    scrape_classic_histograms: true  # ❌ 旧名称

# v3.0
scrape_configs:
  - job_name: 'app'
    always_scrape_classic_histograms: true  # ✅ 新名称

2. Remote Write HTTP/2 默认值变更:

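# The default is false in v3.0; enable it explicitly if needed.
# enable_http2 is one of the HTTP client settings that sit directly on the remote_write entry.
remote_write:
  - url: https://remote-storage.example.com
    enable_http2: true  # explicitly enable HTTP/2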

PromQL 变更¶

1. 正则表达式匹配换行符

# v3.0 中 . 匹配换行符
{label=~".*"}  # 现在匹配包含 \n 的字符串

# 如需 v2 行为,使用 [^\n]
{label=~"foo[^\n]*"}  # 不匹配换行符
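
The difference can be reproduced by analogy with Python's `re` module. This is only an analogy: Prometheus uses Go's regular expression engine, and the flag names below are Python's. `fullmatch` is used because PromQL regex matchers are fully anchored.

```python
import re

value = "foo\nbar"

# Analogue of the v3 behavior: "." also matches "\n" (Python needs re.DOTALL for that)
v3_like = re.fullmatch(r"foo.*", value, flags=re.DOTALL) is not None

# Analogue of the v2 behavior: "." stops at a newline
v2_like = re.fullmatch(r"foo.*", value) is not None

# The workaround shown above: exclude newlines explicitly
workaround = re.fullmatch(r"foo[^\n]*", value, flags=re.DOTALL) is not None

if __name__ == "__main__":
    print(v3_like, v2_like, workaround)
```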

2. 范围选择器变更 (左开右闭)

# 假设样本间隔 1 分钟
foo[5m]  # v2: 可能返回 5 或 6 个样本
         # v3: 始终返回 5 个样本

影响:

子查询受影响最大
foo[1m:1m] 在 v3 中只返回 1 个点(不足以计算 rate)
修复: 扩展窗口 foo[2m:1m] 返回 2 个点
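
The sample counts above follow directly from the interval boundaries. A minimal sketch with timestamps in seconds and samples exactly one minute apart; `select_v2` and `select_v3` are hypothetical names modelling the closed and left-open windows.

```python
def select_v2(samples, t, window):
    """Closed interval [t - window, t], the Prometheus 2.x behavior."""
    return [s for s in samples if t - window <= s <= t]


def select_v3(samples, t, window):
    """Left-open, right-closed interval (t - window, t], the Prometheus 3.x behavior."""
    return [s for s in samples if t - window < s <= t]


# Sample timestamps in seconds, exactly one minute apart
samples = list(range(0, 601, 60))

if __name__ == "__main__":
    t = 600  # evaluation time aligned with a sample
    print(len(select_v2(samples, t, 300)), len(select_v3(samples, t, 300)))
    print(len(select_v2(samples, t, 60)), len(select_v3(samples, t, 60)))
```

With a 1-minute window aligned to a sample, the closed interval yields two points and the left-open one yields a single point, which is why `foo[1m:1m]` no longer provides enough data for `rate`.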

3. holt_winters 函数重命名:

# v2.x
holt_winters(metric[5m], 0.5, 0.5)  # ❌

# v3.0 (需启用实验性函数)
double_exponential_smoothing(metric[5m], 0.5, 0.5)  # ✅

# 启动参数
prometheus --enable-feature=promql-experimental-functions

抓取协议变更¶

Prometheus 3.0 对 Content-Type 头更严格:

v2.x 行为:

缺少 Content-Type → 默认为文本格式
可能导致数据错误

v3.0 行为:

缺少/无效 Content-Type → 抓取失败
需配置 fallback 协议:

scrape_configs:
  - job_name: 'legacy_app'
    fallback_scrape_protocol: PrometheusText0.0.4

其他重要变更¶

1. TSDB 格式和降级

TSDB 格式在 v2.55 已变更
v3.0 只能降级到 v2.55+,不能更低
建议: 先升级到 v2.55 测试,再升级到 v3.0

2. UTF-8 指标名称支持

# 保留旧验证行为(全局)
global:
  metric_name_validation_scheme: legacy

# 或按 job 配置
scrape_configs:
  - job_name: 'new_app'
    metric_name_validation_scheme: utf8
  - job_name: 'legacy_app'
    metric_name_validation_scheme: legacy

3. 日志格式变更

# v2.x
ts=2024-10-23T22:01:06.074Z caller=main.go:627 level=info msg="..."

# v3.0
time=2024-10-24T00:03:07.542+02:00 level=INFO source=main.go:640 msg="..."

4. le 和 quantile 标签归一化

# v2.x
my_histogram{le="1"}  # 文本格式
my_histogram{le="1.0"}  # protobuf 格式

# v3.0 统一归一化
my_histogram{le="1.0"}  # 所有格式

# 需更新查询
# ❌ le="1"
# ✅ le="1.0"
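
When updating dashboards and rules in bulk, a helper that rewrites integer-looking `le` values can be handy. This sketch only covers the simple case shown above (plain integers gain a trailing `.0`); it is an assumption-laden illustration, and Prometheus' own canonical formatting may differ for other values such as very large bounds.

```python
import re

_INTEGER = re.compile(r"[+-]?\d+")


def normalize_le(value: str) -> str:
    """Append ".0" to integer-looking bucket bounds; leave everything else untouched."""
    if _INTEGER.fullmatch(value):
        return value + ".0"
    return value


if __name__ == "__main__":
    for bound in ["1", "0.5", "10", "+Inf"]:
        print(bound, "->", normalize_le(bound))
```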

5. Alertmanager API v1 移除

# v2.x
alerting:
  alertmanagers:
    - api_version: v1  # ❌ 不再支持

# v3.0 (需要 Alertmanager 0.16.0+)
alerting:
  alertmanagers:
    - api_version: v2  # ✅

10. Feature Flags (特性标志)¶

使用 --enable-feature 启用实验性或破坏性变更的特性。

prometheus --enable-feature=feature1,feature2

可用特性标志¶

1. exemplar-storage - 示例存储

--enable-feature=exemplar-storage

存储 OpenMetrics 示例(如 trace_id)
固定大小循环缓冲区
每个示例约 100 字节内存
持久化到 WAL
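
The "fixed-size circular buffer" idea and the rough memory cost can be sketched as follows. This is a conceptual model only, not Prometheus' implementation; the class name and the tuple layout are hypothetical, and the 100-byte figure is the approximate value quoted above.

```python
from collections import deque

APPROX_BYTES_PER_EXEMPLAR = 100  # rough figure quoted above


class ExemplarBuffer:
    """Fixed-size buffer: once full, the oldest exemplar is overwritten."""

    def __init__(self, capacity: int):
        self._items = deque(maxlen=capacity)

    def add(self, trace_id: str, value: float, ts: int) -> None:
        self._items.append((trace_id, value, ts))

    def __len__(self) -> int:
        return len(self._items)

    def oldest(self):
        return self._items[0]


def approx_memory_bytes(capacity: int) -> int:
    """Back-of-the-envelope memory estimate for a buffer of the given capacity."""
    return capacity * APPROX_BYTES_PER_EXEMPLAR


if __name__ == "__main__":
    print(approx_memory_bytes(100_000) / 1e6, "MB (approx.) for 100k exemplars")
```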

2. native-histograms - 原生直方图

--enable-feature=native-histograms

采集原生直方图(高分辨率直方图)
实验性: 可能有破坏性变更
使用 protobuf 格式
可与经典直方图共存

配置示例:

scrape_configs:
  - job_name: 'app'
    always_scrape_classic_histograms: true  # 同时保留经典直方图

3. extra-scrape-metrics - 额外抓取指标

--enable-feature=extra-scrape-metrics

添加以下指标:

scrape_timeout_seconds - 配置的超时时间
scrape_sample_limit - 样本限制
scrape_body_size_bytes - 响应体大小

用途:

# 检查接近超时的目标
scrape_duration_seconds / scrape_timeout_seconds > 0.9

# 检查接近限制的目标
scrape_samples_post_metric_relabeling / (scrape_sample_limit > 0) > 0.9

4. memory-snapshot-on-shutdown - 关闭时内存快照

--enable-feature=memory-snapshot-on-shutdown

关闭时保存内存快照到磁盘
重启时快速恢复
减少 WAL 重放时间

5. concurrent-rule-eval - 并发规则评估

--enable-feature=concurrent-rule-eval

同组内无依赖的规则并发执行
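Used together with --rules.max-concurrent-evals (default 4), the same flag listed under the rule evaluation settings of the prometheus command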
提高规则评估速度

6. auto-reload-config - 自动重载配置

--enable-feature=auto-reload-config
--config.auto-reload-interval=30s

自动检测配置变更
定期检查文件校验和
原子更新文件以避免问题

7. otlp-deltatocumulative - OTLP Delta 转换

--enable-feature=otlp-deltatocumulative

将 OTLP delta 指标转换为累积
内存状态存储增量
重启后从零开始(导致计数器重置)
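
Why a restart looks like a counter reset becomes clear from a toy model of the conversion: the running totals live only in memory. The class below is a hypothetical sketch, not the actual OTLP receiver code.

```python
class DeltaToCumulative:
    """Keeps a running total per series in memory only."""

    def __init__(self):
        self._totals = {}

    def add(self, series: str, delta: float) -> float:
        """Add one delta sample and return the cumulative value to store."""
        self._totals[series] = self._totals.get(series, 0.0) + delta
        return self._totals[series]


if __name__ == "__main__":
    conv = DeltaToCumulative()
    print([conv.add("requests", d) for d in (1, 2, 3)])
    # A fresh instance models a restart: the total starts again from zero
    print(DeltaToCumulative().add("requests", 4))
```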

8. promql-experimental-functions - 实验性 PromQL 函数

--enable-feature=promql-experimental-functions

启用实验性函数(如 double_exponential_smoothing)。

9. type-and-unit-labels - 类型和单位标签

--enable-feature=type-and-unit-labels

注入 __type__ 和 __unit__ 保留标签
从元数据获取类型和单位信息
支持基于类型/单位的查询

10. old-ui - 旧版 UI

--enable-feature=old-ui

使用 Prometheus 2.x 的旧版 Web UI。

其他特性:

promql-per-step-stats - 每步统计
created-timestamp-zero-ingestion - Created 时间戳注入
delayed-compaction - 延迟压缩
promql-delayed-name-removal - 延迟 __name__ 移除
promql-duration-expr - 持续时间算术表达式
otlp-native-delta-ingestion - OTLP 原生 Delta
metadata-wal-records - 元数据 WAL 记录
use-uncached-io - 非缓存 I/O (Linux)

11. Security (安全)¶

安全模型概述¶

⚠️ Important: Prometheus HTTP endpoints should not be exposed to the public internet, including:

/metrics endpoint
API endpoints (/api/v1/*)
/debug/pprof debug endpoints

安全假设¶

Prometheus:

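✅ Untrusted users are assumed to have access to the HTTP endpoints and logs
✅ Untrusted users can therefore query all time series data
❌ Untrusted users cannot change the configuration, rule files, or command-line flags; only trusted users can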

目标:

不可信用户可运行抓取目标
默认情况下目标不能伪装成其他目标
honor_labels: true 会移除此保护

管理 API¶

v2.0+ 安全控制:

# 启用管理 API (删除时间序列等)
--web.enable-admin-api

# 启用生命周期 API (重载/关闭)
--web.enable-lifecycle

端点:

/api/*/admin/* - 管理功能
/-/reload - 重载配置
/-/quit - 关闭服务

组件安全¶

Alertmanager:

任何访问者可创建、解析告警
任何访问者可创建、修改、删除静默

Pushgateway:

任何访问者可创建、修改、删除指标
通常与 honor_labels 一起使用
可伪造任意时间序列

Exporters:

SNMP/Blackbox Exporter 从 URL 参数获取目标
可能泄露认证信息(如 HTTP Basic Auth、SNMP community)

认证、授权和加密¶

TLS 支持:

# web-config.yml (passed via --web.config.file, not prometheus.yml)
tls_server_config:
  cert_file: server.crt
  key_file: server.key
  client_ca_file: ca.crt

# Minimum TLS 1.2
# Goal: Qualys SSL Labs grade A

HTTP Basic 认证:

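# web-config.yml (passed via --web.config.file)
basic_auth_users:
  admin: <bcrypt-hash>  # placeholder: the bcrypt hash of the password, never the plaintext

# Passwords are stored as bcrypt hashes
# Use TLS as well; otherwise credentials travel in cleartext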

客户端认证:

tls_config:
  insecure_skip_verify: false  # 不跳过 SSL 验证

API 安全¶

CSRF 防护:

无内置 CSRF 保护(为兼容 cURL 等工具)
建议在反向代理层阻止可变端点

XSS 防护:

非可变端点设置 CORS 头
示例: Access-Control-Allow-Origin

PromQL 注入:

# ❌ 危险 - 用户输入未转义
up{job="<user_input>"}

# 如果 user_input = "} or some_metric{zzz="
# 结果: up{job=""} or some_metric{zzz=""}
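
One way to defuse this is to escape user input before placing it inside a quoted label value. A minimal sketch, assuming the value always ends up inside double quotes; `escape_label_value` and `up_selector` are hypothetical helper names, not part of any Prometheus client library.

```python
def escape_label_value(value: str) -> str:
    """Escape a string for use inside a double-quoted PromQL label matcher."""
    return (value.replace("\\", "\\\\")   # backslashes first
                 .replace('"', '\\"')     # then double quotes
                 .replace("\n", "\\n"))   # and literal newlines


def up_selector(job: str) -> str:
    """Build an `up` selector for one job from untrusted input."""
    return f'up{{job="{escape_label_value(job)}"}}'


if __name__ == "__main__":
    print(up_selector("api"))
    print(up_selector('"} or some_metric{zzz="'))
```

After escaping, the malicious input stays inside a single string literal, so the query selects a job with an odd name instead of pulling in `some_metric`.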

Grafana 注意事项:

Dashboard 权限 ≠ 数据源权限
代理模式下用户可运行任意查询

秘密管理¶

秘密字段:

配置中标记为秘密的字段不会暴露在日志/API
不要在其他字段存放秘密
保护磁盘文件权限

外部秘密:

依赖的秘密(如 AWS_SECRET_KEY)可能被泄露
用户需自行保护

DoS 防护¶

缓解措施:

部分负载和查询限制
过多/昂贵查询会导致组件崩溃

用户责任:

提供足够资源 (CPU、RAM、磁盘、IOPS、文件描述符、带宽)
监控所有组件
自动重启失败组件

漏洞报告¶

公开漏洞:

在 GitHub 提交 Bug 报告

未公开漏洞:

私下报告给 MAINTAINERS
抄送: prometheus-team@googlegroups.com
7 天内修复核心组件(Prometheus、Alertmanager、Node Exporter 等)

扫描工具用户注意:

提交前分析报告,避免误报
使用开源工具(如 govulncheck)验证

12. Overview (概览)¶

什么是 Prometheus?¶

Prometheus 是开源的系统监控和告警工具包,最初在 SoundCloud 构建。2016 年加入 CNCF,成为继 Kubernetes 之后的第二个托管项目。

核心特点:

采集并存储指标为时间序列数据
指标 + 时间戳 + 可选键值对标签

主要特性¶

多维数据模型 - 指标名称和键值对标识时间序列
PromQL - 灵活的查询语言
无依赖分布式存储 - 单服务器节点自治
Pull 模型 - 通过 HTTP 拉取时间序列
Push 支持 - 通过中间网关支持推送
服务发现 - 静态配置或服务发现
图形和仪表板 - 多种可视化支持

什么是指标?¶

指标: 数值测量

时间序列: 随时间变化的记录

示例:

Web 服务器: 请求时间
数据库: 活动连接数、活动查询数

作用: 理解应用行为,诊断问题,指导扩容

组件生态¶

graph TB
    A[Prometheus Server<br/>抓取并存储数据] --> B[时间序列数据库]
    C[客户端库] --> D[应用程序]
    E[Pushgateway<br/>短期任务] --> A
    F[Exporters<br/>HAProxy/StatsD等] --> A
    A --> G[Alertmanager<br/>告警处理]
    A --> H[Grafana<br/>可视化]
    I[服务发现] --> A

核心组件:

Prometheus Server - 抓取和存储时间序列
客户端库 - 应用程序埋点
Pushgateway - 短期任务支持
Exporters - 第三方服务(HAProxy、StatsD 等)
Alertmanager - 告警处理

语言: 大多数组件用 Go 编写,易于部署

架构¶

Prometheus 从已埋点的作业抓取指标:

直接抓取
通过 Pushgateway(短期任务)

存储样本并运行规则:

聚合和记录新时间序列
生成告警

Grafana 或其他 API 消费者可视化数据。

适用场景¶

适合:

✅ 记录纯数值时间序列
✅ 机器中心监控
✅ 高度动态的面向服务架构
✅ 微服务环境
✅ 多维数据收集和查询

可靠性设计:

独立服务器,不依赖网络存储
故障期间可查看统计信息
不需要复杂基础设施

不适用场景¶

不适合:

❌ 需要 100% 准确性(如按请求计费)
❌ 采集的数据可能不够详细和完整
❌ 建议使用其他系统收集计费数据
✅ Prometheus 用于其余监控

13. Alertmanager (告警管理器)¶

核心功能¶

Alertmanager 处理客户端(如 Prometheus)发送的告警,负责:

去重 (Deduplication)
分组 (Grouping)
路由 (Routing)
静默 (Silencing)
抑制 (Inhibition)

核心概念¶

1. Grouping (分组)¶

目的: 将相似告警归类为单个通知

场景:

数百个服务实例 → 网络分区 → 一半实例无法访问数据库

无分组: 发送数百条告警

有分组: 发送一条告警,列出所有受影响实例

配置:

route:
  group_by: ['cluster', 'alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-ops'
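
The effect of `group_by` can be illustrated with a few lines of Python: alerts that share the same values for the grouping labels fall into one bucket and therefore one notification. The alert name and label values below are made up for the example, and the function is a conceptual sketch rather than Alertmanager's dispatcher.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts (dicts of labels) by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# 100 instances in one cluster firing the same (hypothetical) alert
alerts = [
    {"alertname": "DBUnreachable", "cluster": "eu-1", "instance": f"app-{i}"}
    for i in range(100)
]
groups = group_alerts(alerts, ["cluster", "alertname"])

if __name__ == "__main__":
    for key, members in groups.items():
        print(key, len(members), "alerts in one notification")
```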

2. Inhibition (抑制)¶

目的: 某些告警触发时,抑制其他相关告警

场景:

整个集群不可达 → 数百个实例告警

抑制规则: 集群不可达告警触发时,抑制该集群的所有其他告警

配置:

inhibit_rules:
  # source_match/target_match are deprecated; use the matchers form
  - source_matchers:
      - severity="critical"
      - alertname="ClusterDown"
    target_matchers:
      - severity="warning"
    equal: ['cluster']

3. Silences (静默)¶

目的: 在给定时间内静默告警

Configured through: the web UI (also possible via amtool or the API)

示例:

计划维护期间静默告警
基于标签匹配器
支持等式和正则表达式

Web UI 操作:

# 访问 Alertmanager UI
http://alertmanager:9093

# 创建静默
Silences → Create Silence → 设置匹配器和时间

高可用性¶

集群配置:

# Alertmanager 1
alertmanager \
  --cluster.peer=alertmanager-2:9094 \
  --cluster.peer=alertmanager-3:9094

# Alertmanager 2
alertmanager \
  --cluster.peer=alertmanager-1:9094 \
  --cluster.peer=alertmanager-3:9094

Prometheus 配置:

# ⚠️ 不要负载均衡,而是提供完整列表
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager-1:9093
          - alertmanager-2:9093
          - alertmanager-3:9093

配置示例¶

完整路由树:

route:
  # 根路由
  receiver: 'default-receiver'
  group_by: ['cluster', 'alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

  # Child routes (match/match_re are deprecated; use matchers)
  routes:
    - matchers:
        - severity="critical"
      receiver: 'pager'
      continue: true

    - matchers:
        - severity="warning"
      receiver: 'slack'

    - matchers:
        - service=~"foo|bar"
      receiver: 'team-ops'
      group_by: ['service']

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'team@example.com'

  - name: 'pager'
    pagerduty_configs:
      - service_key: '<key>'

  - name: 'slack'
    slack_configs:
      - api_url: '<webhook_url>'
        channel: '#alerts'

通知接收器¶

支持的接收器:

Email
PagerDuty
Slack
Webhook
OpsGenie
VictorOps
WeChat
Pushover
等等...

Webhook 示例:

receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://example.com/webhook'
        send_resolved: true

最佳实践¶

1. 合理分组

按集群、服务、严重程度分组
避免过度分组导致通知风暴

2. 使用抑制

上游问题抑制下游告警
减少噪音

3. 静默管理

维护期间使用静默
设置合理的静默时间
避免永久静默

4. 高可用部署

至少 3 个 Alertmanager 实例
Prometheus 配置完整的 Alertmanager 列表
不要负载均衡

5. 测试通知

使用 amtool 测试配置
验证路由规则
测试所有接收器

14. Alerting Configuration (Alertmanager 配置)¶

Alertmanager 通过命令行标志和配置文件进行配置。

14.1 配置文件基础¶

启动命令:

alertmanager --config.file=alertmanager.yml

重载配置:

# 发送 SIGHUP 信号
kill -HUP <pid>

# 或使用 HTTP 端点
curl -X POST http://localhost:9093/-/reload

14.2 全局配置¶

global:
  # SMTP 配置
  smtp_from: 'alertmanager@example.com'
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'password'
  smtp_require_tls: true

  # Slack 配置
  slack_api_url: 'https://hooks.slack.com/services/xxx'

  # PagerDuty 配置
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

  # OpsGenie 配置
  opsgenie_api_key: 'xxx'
  opsgenie_api_url: 'https://api.opsgenie.com/'

  # 解析超时
  resolve_timeout: 5m

  # HTTP 客户端配置
  http_config:
    tls_config:
      insecure_skip_verify: false

14.3 路由配置 (Route)¶

路由定义告警如何分组和发送到接收器。

核心参数:

参数	说明	默认值
`receiver`	接收器名称	必需
`group_by`	分组标签	-
`group_wait`	分组等待时间	`30s`
`group_interval`	组内新告警间隔	`5m`
`repeat_interval`	重复发送间隔	`4h`
`continue`	继续匹配兄弟节点	`false`
`matchers`	标签匹配器	-

完整示例:

route:
  # 根路由 - 匹配所有告警
  receiver: 'default-receiver'
  group_by: ['cluster', 'alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  # 子路由
  routes:
    # 数据库告警
    - receiver: 'database-pager'
      group_wait: 10s
      matchers:
        - service=~"mysql|cassandra"

    # 前端团队告警
    - receiver: 'frontend-pager'
      group_by: ['product', 'environment']
      matchers:
        - team="frontend"

    # 非工作时间静默
    - receiver: 'dev-pager'
      matchers:
        - service="inhouse-service"
      mute_time_intervals:
        - offhours
        - holidays
      continue: true

    # Active only during off-hours and holidays
    - receiver: 'on-call-pager'
      matchers:
        - service="inhouse-service"
      active_time_intervals:
        - offhours
        - holidays
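
How `continue` changes the walk over sibling routes can be sketched with a simplified model. It only handles equality matchers on a single level and ignores nested routes and time intervals, so it is an illustration of the matching order rather than Alertmanager's actual algorithm; the `route` function and the dict layout are hypothetical.

```python
def route(alert, routes, default_receiver):
    """Return receivers for an alert: the first matching child wins unless it sets continue."""
    receivers = []
    for child in routes:
        if all(alert.get(k) == v for k, v in child["match"].items()):
            receivers.append(child["receiver"])
            if not child.get("continue", False):
                break
    # No child matched: the alert stays with the parent route's receiver
    return receivers or [default_receiver]


# Simplified, equality-only version of some child routes from the example above
routes = [
    {"receiver": "frontend-pager", "match": {"team": "frontend"}},
    {"receiver": "dev-pager", "match": {"service": "inhouse-service"}, "continue": True},
    {"receiver": "on-call-pager", "match": {"service": "inhouse-service"}},
]

if __name__ == "__main__":
    print(route({"service": "inhouse-service"}, routes, "default-receiver"))
    print(route({"service": "other"}, routes, "default-receiver"))
```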

分组策略:

# 按所有标签分组(禁用聚合)
group_by: ['...']

# 按特定标签分组
group_by: ['cluster', 'alertname', 'severity']

14.4 时间间隔 (Time Intervals)¶

定义静默或激活路由的时间范围。

基本示例:

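time_intervals:
  # Off-hours
  - name: offhours
    time_intervals:
      # start_time must be earlier than end_time, so the night is split into two ranges
      - times:
          - start_time: '00:00'
            end_time: '09:00'
          - start_time: '17:00'
            end_time: '24:00'
        weekdays: ['monday:friday']
        location: 'Asia/Shanghai'
      - weekdays: ['saturday', 'sunday']
        location: 'Asia/Shanghai'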

  # 节假日
  - name: holidays
    time_intervals:
      - days_of_month: ['1']
        months: ['january', 'may', 'october']
      - days_of_month: ['25']
        months: ['december']

时间字段:

times - 时间范围 (HH:MM - HH:MM)
weekdays - 星期几 (sunday - saturday)
days_of_month - 月份中的天 (1-31, 支持负数)
months - 月份 (january-december 或 1-12)
years - 年份 (如 2024:2025)
location - 时区 (如 Asia/Shanghai, UTC, Local)
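
To sanity-check an interval definition before deploying it, the same logic can be modelled in a few lines. The sketch below encodes an off-hours window (weekends, plus weekdays before 09:00 and from 17:00 onward) and treats the end of a range as exclusive. A fixed UTC+8 offset stands in for Asia/Shanghai so that no time zone database is needed; that substitution and the function name are assumptions for illustration.

```python
from datetime import datetime, time, timedelta, timezone

# Fixed UTC+8 offset standing in for Asia/Shanghai (assumption: no DST to worry about)
SHANGHAI = timezone(timedelta(hours=8))


def in_offhours(moment: datetime) -> bool:
    """Weekends, or Monday-Friday before 09:00 / from 17:00 onward, in UTC+8."""
    local = moment.astimezone(SHANGHAI)
    if local.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return local.time() < time(9, 0) or local.time() >= time(17, 0)


if __name__ == "__main__":
    print(in_offhours(datetime.now(timezone.utc)))
```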

14.5 抑制规则 (Inhibition)¶

当某个告警触发时,抑制其他相关告警。

基本示例:

inhibit_rules:
  # When the cluster is down, inhibit warning-level alerts from the same cluster
  - source_matchers:
      - alertname="ClusterDown"
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['cluster']

  # 节点宕机时抑制该节点的所有告警
  - source_matchers:
      - alertname="NodeDown"
    target_matchers:
      - alertname!="NodeDown"
    equal: ['instance']

字段说明:

source_matchers - 源告警匹配器(抑制器)
target_matchers - 目标告警匹配器(被抑制)
equal - 必须相等的标签列表
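
The interplay of the three fields can be sketched as a small predicate. It supports equality matchers only and leaves out Alertmanager's extra rule that prevents an alert from inhibiting itself, so treat it as a conceptual model with hypothetical names and data structures.

```python
def matches(labels, matchers):
    """Equality-only matchers: every (name, value) pair must be present."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing, rule):
    """True if some firing alert satisfies the rule's source side for this target."""
    if not matches(target, rule["target"]):
        return False
    return any(
        matches(source, rule["source"])
        # Labels listed in `equal` must agree; a missing label is treated like an empty one
        and all(source.get(name, "") == target.get(name, "") for name in rule["equal"])
        for source in firing
    )


rule = {
    "source": {"alertname": "ClusterDown", "severity": "critical"},
    "target": {"severity": "warning"},
    "equal": ["cluster"],
}
firing = [{"alertname": "ClusterDown", "severity": "critical", "cluster": "a"}]

if __name__ == "__main__":
    print(is_inhibited({"alertname": "HighLatency", "severity": "warning", "cluster": "a"}, firing, rule))
```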

14.6 标签匹配器 (Matchers)¶

UTF-8 匹配器 (推荐):

matchers:
  - alertname = "HighCPU"           # 等于
  - severity != "info"              # 不等于
  - service =~ "api|web"            # 正则匹配
  - instance !~ "test.*"            # 正则不匹配
  - foo = "bar,baz"                 # 可包含特殊字符
  - team = "前端"                   # 支持 UTF-8

组合匹配器:

# YAML 列表形式
matchers:
  - alertname = "Watchdog"
  - severity =~ "warning|critical"

# 短格式
matchers: [ 'alertname = "Watchdog"', 'severity =~ "warning|critical"' ]

# PromQL 风格
matchers: [ '{alertname="Watchdog", severity=~"warning|critical"}' ]

UTF-8 严格模式:

# 启用 UTF-8 严格模式
alertmanager --enable-feature="utf8-strict-mode"

# 验证配置
amtool check-config alertmanager.yml

14.7 接收器 (Receivers)¶

Email 接收器¶

receivers:
  - name: 'email-team'
    email_configs:
      - to: 'team@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alerts@example.com'
        auth_password: 'password'
        require_tls: true
        headers:
          Subject: '{{ .GroupLabels.alertname }}: {{ .Status }}'
        html: '{{ template "email.default.html" . }}'

Slack 接收器¶

receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        send_resolved: true

PagerDuty 接收器¶

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: 'xxx'
        severity: 'critical'
        description: '{{ .GroupLabels.alertname }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
          num_firing: '{{ .Alerts.Firing | len }}'

Webhook 接收器¶

receivers:
  - name: 'webhook-receiver'
    webhook_configs:
      - url: 'http://example.com/webhook'
        send_resolved: true
        max_alerts: 0  # 0 = 无限制
        http_config:
          basic_auth:
            username: 'user'
            password: 'pass'

Webhook JSON 格式:

{
  "version": "4",
  "groupKey": "<string>",
  "status": "firing|resolved",
  "receiver": "<string>",
  "groupLabels": {},
  "commonLabels": {},
  "commonAnnotations": {},
  "externalURL": "<string>",
  "alerts": [
    {
      "status": "firing|resolved",
      "labels": {},
      "annotations": {},
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>",
      "generatorURL": "<string>",
      "fingerprint": "<string>"
    }
  ]
}
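
A webhook receiver typically starts by parsing this body and counting alerts by status. A minimal sketch using only the standard library; the example payload is hypothetical and contains just the fields the function reads.

```python
import json


def summarize(payload: str) -> dict:
    """Count firing/resolved alerts in an Alertmanager webhook body."""
    data = json.loads(payload)
    counts = {"firing": 0, "resolved": 0}
    for alert in data.get("alerts", []):
        status = alert.get("status")
        if status in counts:
            counts[status] += 1
    return {"receiver": data.get("receiver"), "status": data.get("status"), **counts}


# Hypothetical example payload following the shape above
example = json.dumps({
    "version": "4",
    "status": "firing",
    "receiver": "webhook-receiver",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "HighCPU"}},
        {"status": "resolved", "labels": {"alertname": "HighCPU"}},
    ],
})

if __name__ == "__main__":
    print(summarize(example))
```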

其他接收器¶

OpsGenie:

opsgenie_configs:
  - api_key: 'xxx'
    message: '{{ .GroupLabels.alertname }}'
    priority: 'P1'
    responders:
      - type: 'team'
        name: 'ops-team'

Telegram:

telegram_configs:
  - bot_token: 'xxx'
    chat_id: 123456789
    message: '🔥 {{ .GroupLabels.alertname }}'
    parse_mode: 'HTML'

WeChat (微信):

wechat_configs:
  - api_secret: 'xxx'
    corp_id: 'xxx'
    agent_id: '1000002'
    to_user: 'user123'
    message_type: 'markdown'

Pushover:

pushover_configs:
  - user_key: 'xxx'
    token: 'xxx'
    priority: '2'  # 紧急
    retry: 30s
    expire: 1h

14.8 HTTP 配置¶

TLS 配置:

http_config:
  tls_config:
    ca_file: /path/to/ca.crt
    cert_file: /path/to/client.crt
    key_file: /path/to/client.key
    server_name: 'example.com'
    insecure_skip_verify: false

Basic 认证:

http_config:
  basic_auth:
    username: 'admin'
    password: 'secret'

Bearer Token:

http_config:
  authorization:
    type: 'Bearer'
    credentials: 'token123'

OAuth2:

http_config:
  oauth2:
    client_id: 'xxx'
    client_secret: 'xxx'
    token_url: 'https://auth.example.com/token'
    scopes: ['read', 'write']

代理配置:

http_config:
  proxy_url: 'http://proxy.example.com:8080'
  no_proxy: '127.0.0.1,localhost'

14.9 完整配置示例¶

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'password'
  slack_api_url: 'https://hooks.slack.com/services/xxx'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default'
  group_by: ['cluster', 'alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h

  routes:
    # Critical 告警立即发送
    - receiver: 'pagerduty'
      matchers:
        - severity="critical"
      group_wait: 0s
      repeat_interval: 5m
      continue: true

    # Warning 告警发送到 Slack
    - receiver: 'slack'
      matchers:
        - severity="warning"

    # 数据库告警
    - receiver: 'database-team'
      matchers:
        - service=~"mysql|postgres"
      mute_time_intervals:
        - weekends

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'

  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'xxx'

  - name: 'slack'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'

  - name: 'database-team'
    email_configs:
      - to: 'dba@example.com'

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['cluster', 'alertname']

time_intervals:
  - name: weekends
    time_intervals:
      - weekdays: ['saturday', 'sunday']

14.10 模板变量¶

常用变量:

# 告警状态
{{ .Status }}  # firing 或 resolved

# 标签
{{ .GroupLabels.alertname }}  # 分组标签
{{ .CommonLabels.instance }}  # 公共标签

# 注解
{{ .CommonAnnotations.summary }}
{{ .CommonAnnotations.description }}

# 告警列表
{{ range .Alerts }}
  Instance: {{ .Labels.instance }}
  Summary: {{ .Annotations.summary }}
{{ end }}

# 告警数量
{{ .Alerts.Firing | len }}  # Firing 数量
{{ .Alerts.Resolved | len }}  # Resolved 数量

14.11 验证和测试¶

验证配置:

# 检查配置文件
amtool check-config alertmanager.yml

# UTF-8 严格模式检查
amtool check-config alertmanager.yml --enable-feature="utf8-strict-mode"

测试告警:

# 发送测试告警
amtool alert add alertname="TestAlert" severity="warning" instance="localhost"

# 查看告警
amtool alert query

# 创建静默
amtool silence add alertname="TestAlert" --duration=1h --comment="Testing"

# 查看静默
amtool silence query

14.12 最佳实践¶

1. 合理的分组策略

# ✅ 好 - 按集群和告警名称分组
group_by: ['cluster', 'alertname']

# ❌ 差 - 禁用聚合(产生大量通知)
group_by: ['...']

2. 合理的时间间隔

group_wait: 10s       # 等待更多相同组的告警
group_interval: 5m    # 组内新告警等待时间
repeat_interval: 4h   # 重复发送间隔(应为 group_interval 的倍数)
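
The comment above recommends that `repeat_interval` be a multiple of `group_interval`. A minimal sketch of a pre-deployment check follows; the helper names `parse_duration` and `check_intervals` are hypothetical, and the parser only covers simple duration strings such as `10s`, `5m` or `1h30m`.

```python
import re

# Seconds per unit for the duration suffixes handled by this sketch.
_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}
_PART = re.compile(r"(\d+)(ms|[smhdwy])")


def parse_duration(text: str) -> float:
    """Convert a duration such as '10s', '5m' or '1h30m' to seconds."""
    pos, total = 0, 0.0
    for match in _PART.finditer(text):
        if match.start() != pos:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNITS[match.group(2)]
        pos = match.end()
    if not text or pos != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total


def check_intervals(group_interval: str, repeat_interval: str) -> bool:
    """Return True when repeat_interval is a whole multiple of group_interval."""
    gi = parse_duration(group_interval)
    ri = parse_duration(repeat_interval)
    return ri >= gi and ri % gi == 0


if __name__ == "__main__":
    print(check_intervals("5m", "4h"))
```

Such a check could run in CI next to `amtool check-config`, which validates syntax but is not assumed here to check this relationship.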

3. 使用抑制规则

# 上游故障抑制下游告警
inhibit_rules:
  - source_matchers:
      - alertname="APIDown"
    target_matchers:
      - alertname=~".*Slow|.*Error"
    equal: ['service']
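
To make the rule above concrete, here is a simplified model of the inhibition decision. It is a sketch, not Alertmanager's implementation: matchers are plain `(name, operator, value)` tuples, only `=` and `=~` are handled, and the alert dictionaries are hypothetical. Regex matchers are treated as fully anchored, which is how Alertmanager evaluates them.

```python
import re


def matches(labels: dict, matchers: list) -> bool:
    """Check a label set against (name, op, value) matchers; op is '=' or '=~'."""
    for name, op, value in matchers:
        actual = labels.get(name, "")
        if op == "=" and actual != value:
            return False
        if op == "=~" and re.fullmatch(value, actual) is None:
            return False
    return True


def is_inhibited(target: dict, firing_alerts: list, rule: dict) -> bool:
    """A target is muted when a firing source alert shares all 'equal' labels."""
    if not matches(target, rule["target_matchers"]):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, rule["source_matchers"]):
            continue
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


# The rule from the example above, expressed as data.
RULE = {
    "source_matchers": [("alertname", "=", "APIDown")],
    "target_matchers": [("alertname", "=~", ".*Slow|.*Error")],
    "equal": ["service"],
}

if __name__ == "__main__":
    api_down = {"alertname": "APIDown", "service": "checkout"}
    slow = {"alertname": "CheckoutSlow", "service": "checkout"}
    print(is_inhibited(slow, [api_down], RULE))
```

The `equal` list is what keeps an outage in one service from muting alerts for another.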

4. 多级路由

routes:
  # Critical → PagerDuty + Slack
  - receiver: 'pagerduty'
    matchers:
      - severity="critical"
    continue: true

  - receiver: 'slack'
    matchers:
      - severity="critical"
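
The effect of `continue: true` can be illustrated with a small model of sibling-route matching. This is a sketch under simplifying assumptions: one routing level only, equality matchers only, and a hypothetical `match` dictionary instead of Alertmanager's matcher syntax.

```python
def pick_receivers(labels: dict, routes: list, default_receiver: str) -> list:
    """Walk sibling routes in order; stop at the first match unless it sets continue."""
    receivers = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break
    # Alerts that match no child route fall back to the parent's receiver.
    return receivers or [default_receiver]


# The two routes from the example above, expressed as data.
ROUTES = [
    {"receiver": "pagerduty", "match": {"severity": "critical"}, "continue": True},
    {"receiver": "slack", "match": {"severity": "critical"}},
]

if __name__ == "__main__":
    print(pick_receivers({"severity": "critical"}, ROUTES, "default"))
```

Without `continue` on the first route, the critical alert would stop at `pagerduty` and never reach `slack`.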

5. 使用模板文件

templates:
  - '/etc/alertmanager/templates/*.tmpl'

6. 定期测试

定期发送测试告警
验证所有接收器工作正常
测试抑制和静默规则

配置快速参考¶

路由参数:

receiver - 接收器名称
group_by - 分组标签
group_wait - 初始等待 (默认: 30s)
group_interval - 组内间隔 (默认: 5m)
repeat_interval - 重复间隔 (默认: 4h)
matchers - 匹配器列表
continue - 继续匹配 (默认: false)
mute_time_intervals - 静默时间段
active_time_intervals - 激活时间段

抑制参数:

source_matchers - 源告警匹配器
target_matchers - 目标告警匹配器
equal - 相等标签列表

接收器类型:

email_configs - 邮件
slack_configs - Slack
pagerduty_configs - PagerDuty
webhook_configs - Webhook
opsgenie_configs - OpsGenie
telegram_configs - Telegram
wechat_configs - 微信
pushover_configs - Pushover
discord_configs - Discord
msteams_configs - Microsoft Teams
victorops_configs - VictorOps
sns_configs - AWS SNS
webex_configs - Webex

15. Notification Examples (通知示例)¶

以下是使用 Go 模板系统的 Alertmanager 通知配置示例。

15.1 自定义 Slack 通知¶

添加 Wiki 链接:

global:
  slack_api_url: 'https://hooks.slack.com/services/xxx'

route:
  receiver: 'slack-notifications'
  group_by: [alertname, datacenter, app]

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        text: 'https://internal.myorg.net/wiki/alerts/{{ .GroupLabels.app }}/{{ .GroupLabels.alertname }}'

效果: 根据告警的 app 和 alertname 标签,自动生成对应的 Wiki 文档链接。

15.2 访问注解 (Annotations)¶

Prometheus 告警规则:

groups:
  - name: Instances
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: 'Instance {{ $labels.instance }} down'
          description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'

Alertmanager 接收器:

receivers:
  - name: 'team-x'
    slack_configs:
      - channel: '#alerts'
        text: |
          <!channel>
          summary: {{ .CommonAnnotations.summary }}
          description: {{ .CommonAnnotations.description }}

关键点:

{{ .CommonAnnotations.summary }} - 访问公共注解中的摘要
{{ .CommonAnnotations.description }} - 访问公共注解中的描述
<!channel> - Slack 的 @channel 提及

15.3 遍历所有告警¶

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts'
        title: |
          {{ range .Alerts }}{{ .Annotations.summary }}
          {{ end }}
        text: |
          {{ range .Alerts }}{{ .Annotations.description }}
          {{ end }}

效果: 遍历所有接收到的告警,每个告警的摘要和描述各占一行。

15.4 定义可复用模板¶

步骤 1: 创建模板文件 /etc/alertmanager/templates/myorg.tmpl

{{ define "slack.myorg.text" }}
https://internal.myorg.net/wiki/alerts/{{ .GroupLabels.app }}/{{ .GroupLabels.alertname }}
{{ end }}

{{ define "slack.myorg.title" }}
🚨 [{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}
{{ end }}

{{ define "email.subject" }}
[{{ .Status }}] {{ .GroupLabels.alertname }} ({{ .Alerts.Firing | len }} firing)
{{ end }}

步骤 2: 在配置中引用模板

global:
  slack_api_url: 'https://hooks.slack.com/services/xxx'

templates:
  - '/etc/alertmanager/templates/myorg.tmpl'

route:
  receiver: 'slack-notifications'
  group_by: [alertname, datacenter, app]

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        title: '{{ template "slack.myorg.title" . }}'
        text: '{{ template "slack.myorg.text" . }}'

15.5 更多实用模板示例¶

1. 显示告警数量和列表

title: |
  {{ .GroupLabels.alertname }} ({{ .Alerts.Firing | len }} firing, {{ .Alerts.Resolved | len }} resolved)

text: |
  🔥 *Firing:*
  {{ range .Alerts.Firing }}
  • {{ .Labels.instance }}: {{ .Annotations.summary }}
  {{ end }}

  ✅ *Resolved:*
  {{ range .Alerts.Resolved }}
  • {{ .Labels.instance }}: {{ .Annotations.summary }}
  {{ end }}

2. 根据严重程度设置颜色

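slack_configs:
  - channel: '#alerts'
    # Keep the template on one line: newlines and indentation inside a block scalar
    # would become part of the rendered color value.
    color: '{{ if eq .Status "firing" }}{{ if eq .CommonLabels.severity "critical" }}danger{{ else if eq .CommonLabels.severity "warning" }}warning{{ else }}#439FE0{{ end }}{{ else }}good{{ end }}'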

3. 格式化时间

text: |
  {{ .Alerts.Firing | len }} alert(s) firing
  {{ range .Alerts }}
  - {{ .Labels.instance }} (started {{ .StartsAt.Format "2006-01-02 15:04:05" }})
  {{ end }}

4. 添加 Prometheus 查询链接

text: |
  {{ range .Alerts }}
  Instance: {{ .Labels.instance }}
  Query: {{ .GeneratorURL }}
  {{ end }}

5. Email 通知模板

receivers:
  - name: 'email-team'
    email_configs:
      - to: 'team@example.com'
        headers:
          Subject: '{{ template "email.subject" . }}'
        html: |
          <!DOCTYPE html>
          <html>
          <head>
            <style>
              .firing { background-color: #ff0000; color: white; }
              .resolved { background-color: #00ff00; color: black; }
            </style>
          </head>
          <body>
            <h2>{{ .GroupLabels.alertname }}</h2>
            <p><strong>Status:</strong> {{ .Status }}</p>
            <p><strong>Severity:</strong> {{ .CommonLabels.severity }}</p>

            <h3>Firing Alerts ({{ .Alerts.Firing | len }})</h3>
            <ul>
            {{ range .Alerts.Firing }}
              <li class="firing">
                <strong>{{ .Labels.instance }}</strong><br/>
                {{ .Annotations.description }}
              </li>
            {{ end }}
            </ul>

            <h3>Resolved Alerts ({{ .Alerts.Resolved | len }})</h3>
            <ul>
            {{ range .Alerts.Resolved }}
              <li class="resolved">
                <strong>{{ .Labels.instance }}</strong><br/>
                Resolved at {{ .EndsAt.Format "2006-01-02 15:04:05" }}
              </li>
            {{ end }}
            </ul>
          </body>
          </html>

6. PagerDuty 自定义详情

pagerduty_configs:
  - routing_key: 'xxx'
    description: '{{ .GroupLabels.alertname }}'
    details:
      firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
      num_firing: '{{ .Alerts.Firing | len }}'
      num_resolved: '{{ .Alerts.Resolved | len }}'
      cluster: '{{ .CommonLabels.cluster }}'
      runbook: 'https://runbooks.example.com/{{ .GroupLabels.alertname }}'

7. Webhook JSON 自定义

webhook_configs:
  - url: 'http://example.com/webhook'
    send_resolved: true

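The receiving service can process the JSON in code (a Flask sketch):

from flask import Flask, request

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def webhook():
    data = request.get_json(force=True)

    for alert in data.get('alerts', []):
        labels = alert.get('labels', {})
        annotations = alert.get('annotations', {})
        # instance and summary are optional, so use .get() to avoid a KeyError
        print(f"Alert: {labels.get('alertname')}")
        print(f"Instance: {labels.get('instance')}")
        print(f"Summary: {annotations.get('summary')}")
        print(f"Status: {alert.get('status')}")

    return '', 200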

15.6 模板函数¶

常用函数:

# 字符串操作
{{ .CommonLabels.alertname | toUpper }}    # 大写
{{ .CommonLabels.alertname | toLower }}    # 小写
{{ .CommonLabels.alertname | title }}      # 首字母大写

# 列表操作
{{ .Alerts.Firing | len }}                 # 长度
{{ index .Alerts 0 }}                      # 索引

# 条件判断
{{ if eq .Status "firing" }}Firing{{ else }}Resolved{{ end }}
{{ if gt (.Alerts.Firing | len) 5 }}Too many alerts!{{ end }}

# 时间格式化
{{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ .EndsAt.Format "15:04" }}

15.7 完整模板文件示例¶

/etc/alertmanager/templates/custom.tmpl

{{ define "slack.title" }}
[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}
{{ end }}

{{ define "slack.text" }}
*Cluster:* {{ .CommonLabels.cluster }}
*Severity:* {{ .CommonLabels.severity }}

{{ if gt (len .Alerts.Firing) 0 }}
🔥 *Firing ({{ .Alerts.Firing | len }}):*
{{ range .Alerts.Firing }}
  • *{{ .Labels.instance }}*
    {{ .Annotations.description }}
    <{{ .GeneratorURL }}|View Query>
{{ end }}
{{ end }}

{{ if gt (len .Alerts.Resolved) 0 }}
✅ *Resolved ({{ .Alerts.Resolved | len }}):*
{{ range .Alerts.Resolved }}
  • {{ .Labels.instance }}
{{ end }}
{{ end }}
{{ end }}

{{ define "email.subject" }}
[{{ .Status }}] {{ .GroupLabels.alertname }} - {{ .CommonLabels.cluster }}
{{ end }}

{{ define "email.html" }}
<h2>{{ .GroupLabels.alertname }}</h2>
<table>
  <tr><td>Status:</td><td>{{ .Status }}</td></tr>
  <tr><td>Severity:</td><td>{{ .CommonLabels.severity }}</td></tr>
  <tr><td>Cluster:</td><td>{{ .CommonLabels.cluster }}</td></tr>
  <tr><td>Firing:</td><td>{{ .Alerts.Firing | len }}</td></tr>
  <tr><td>Resolved:</td><td>{{ .Alerts.Resolved | len }}</td></tr>
</table>

<h3>Details</h3>
{{ range .Alerts }}
<div style="border-left: 3px solid {{ if eq .Status "firing" }}red{{ else }}green{{ end }}; padding-left: 10px; margin: 10px 0;">
  <strong>{{ .Labels.instance }}</strong><br/>
  {{ .Annotations.description }}<br/>
  <small>Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }}</small>
</div>
{{ end }}
{{ end }}

16. Alertmanager Management API¶

Alertmanager 提供管理 API 用于自动化和集成。

16.1 健康检查¶

端点:

GET  /-/healthy
HEAD /-/healthy

用途: 检查 Alertmanager 是否运行

返回: 始终返回 200

示例:

# 检查健康状态
curl http://localhost:9093/-/healthy

# 仅检查状态码
curl -I http://localhost:9093/-/healthy

Kubernetes 存活探针:

livenessProbe:
  httpGet:
    path: /-/healthy
    port: 9093
  initialDelaySeconds: 30
  periodSeconds: 10

16.2 就绪检查¶

端点:

GET  /-/ready
HEAD /-/ready

用途: 检查 Alertmanager 是否准备好处理流量

返回: 准备好时返回 200,否则返回 503

示例:

curl http://localhost:9093/-/ready

Kubernetes 就绪探针:

readinessProbe:
  httpGet:
    path: /-/ready
    port: 9093
  initialDelaySeconds: 10
  periodSeconds: 5
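
Outside Kubernetes, the same readiness check can be polled from a script. The sketch below uses only the standard library; `http_probe` and `wait_until_ready` are hypothetical helper names, and the URL in the usage note assumes the default port 9093 used throughout this section.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts and non-2xx answers such as 503.
        return False


def wait_until_ready(probe, attempts: int = 10, delay: float = 1.0, sleep=time.sleep) -> int:
    """Call probe() until it succeeds; return the attempt number, or 0 on failure."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return 0
```

A typical call would be `wait_until_ready(lambda: http_probe("http://localhost:9093/-/ready"))`. Passing the probe as a callable keeps the retry logic testable without a running Alertmanager.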

16.3 重载配置¶

端点:

POST /-/reload

用途: 触发配置文件重新加载

示例:

curl -X POST http://localhost:9093/-/reload

替代方法 - SIGHUP 信号:

# 查找进程 ID
ps aux | grep alertmanager

# 发送 SIGHUP 信号
kill -HUP <PID>

# 或使用 systemd
systemctl reload alertmanager

16.4 自动化脚本示例¶

配置更新脚本:

#!/bin/bash
# update-alertmanager-config.sh
set -euo pipefail

CONFIG_FILE="/etc/alertmanager/alertmanager.yml"
BACKUP_DIR="/etc/alertmanager/backups"
ALERTMANAGER_URL="http://localhost:9093"

# Create a backup (make sure the backup directory exists first)
timestamp=$(date +%Y%m%d_%H%M%S)
mkdir -p "$BACKUP_DIR"
cp "$CONFIG_FILE" "$BACKUP_DIR/alertmanager_${timestamp}.yml"

# Validate the new configuration
if amtool check-config "$CONFIG_FILE"; then
    echo "✅ Configuration is valid"

    # Reload; -f makes curl exit non-zero on an HTTP error response
    if curl -fsS -X POST "${ALERTMANAGER_URL}/-/reload"; then
        echo "✅ Configuration reloaded successfully"

        # Verify readiness
        sleep 2
        if curl -fsS "${ALERTMANAGER_URL}/-/ready" > /dev/null; then
            echo "✅ Alertmanager is ready"
        else
            echo "⚠️  Alertmanager not ready, check logs"
            exit 1
        fi
    else
        echo "❌ Failed to reload configuration"
        exit 1
    fi
else
    echo "❌ Configuration is invalid"
    exit 1
fi

健康监控脚本:

#!/usr/bin/env python3
import requests
import time
import sys

ALERTMANAGER_URL = "http://localhost:9093"

def check_health():
    try:
        response = requests.get(f"{ALERTMANAGER_URL}/-/healthy", timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        # Connection errors and timeouts count as unhealthy; other bugs still surface
        return False

def check_ready():
    try:
        response = requests.get(f"{ALERTMANAGER_URL}/-/ready", timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False

def main():
    while True:
        healthy = check_health()
        ready = check_ready()

        status = "✅" if healthy and ready else "❌"
        print(f"{status} Health: {healthy} | Ready: {ready}")

        if not healthy:
            print("🔴 Alertmanager is not healthy!")
            sys.exit(1)

        time.sleep(30)

if __name__ == "__main__":
    main()

CI/CD 集成示例:

# GitLab CI
deploy-alertmanager:
  stage: deploy
  script:
    # Validate the configuration
    - amtool check-config alertmanager.yml

    # Deploy the configuration. Folded scalars (>-) join the lines with spaces,
    # so no shell backslash continuations are needed inside YAML.
    - >-
      kubectl create configmap alertmanager-config
      --namespace monitoring
      --from-file=alertmanager.yml
      --dry-run=client -o yaml | kubectl apply -f -

    # Trigger a reload
    - >-
      kubectl exec -n monitoring alertmanager-0 --
      wget --post-data="" -O- http://localhost:9093/-/reload

    # Verify readiness
    - sleep 5
    - >-
      kubectl exec -n monitoring alertmanager-0 --
      wget -O- http://localhost:9093/-/ready

Docker Compose 健康检查:

version: '3.8'
services:
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:9093/-/healthy"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

16.5 API 使用最佳实践¶

1. 在配置更新前验证

# ✅ 好 - 先验证再重载
amtool check-config alertmanager.yml && \
  curl -X POST http://localhost:9093/-/reload

# ❌ 差 - 直接重载可能导致错误
curl -X POST http://localhost:9093/-/reload

2. 使用就绪检查而非健康检查

# ✅ 好 - 使用就绪检查确保可以处理流量
curl http://localhost:9093/-/ready

# ⚠️  健康检查只表明进程存活,不代表可以处理请求
curl http://localhost:9093/-/healthy

3. 重载后验证

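# Reload; -f makes curl exit non-zero on an HTTP error response
curl -fsS -X POST http://localhost:9093/-/reload || echo "Reload failed, check the logs"

# Wait, then verify readiness
sleep 2
curl -fsS http://localhost:9093/-/ready || echo "Alertmanager is not ready"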

4. 自动重启失败的实例

# systemd service 配置
[Service]
Restart=always
RestartSec=5s

5. 日志监控

# 监控重载操作
journalctl -u alertmanager -f | grep -i "reload"

# 或 Docker logs
docker logs -f alertmanager | grep -i "reload"

总结¶

Notification Examples (通知示例):

自定义 Slack 通知文本和格式
访问和使用告警注解
遍历所有告警
定义可复用模板文件
丰富的模板函数和示例

Management API (管理API):

/-/healthy - 健康检查
/-/ready - 就绪检查
/-/reload - 重载配置
自动化脚本和 CI/CD 集成
最佳实践

这两个章节提供了实用的模板和自动化方案,帮助运维人员高效管理 Alertmanager!

17. Practices (最佳实践)¶

17.1 Naming (命名规范)¶

指标和标签命名约定不是强制的,但可作为风格指南和最佳实践集合。

指标名称规范¶

必须 (MUST):

符合数据模型的有效字符
具有应用前缀(命名空间)

# ✅ 好
prometheus_notifications_total    # Prometheus 服务器特定
process_cpu_seconds_total         # 客户端库导出
http_request_duration_seconds     # 所有 HTTP 请求

# ❌ 差
notifications_total               # 缺少前缀

使用单一单位(不混用秒和毫秒,秒和字节)
使用基础单位(秒、字节、米,而非毫秒、兆字节、千米)
具有复数形式的单位后缀

# ✅ 好
http_request_duration_seconds
node_memory_usage_bytes
http_requests_total                # 无单位累积计数
process_cpu_seconds_total          # 带单位累积计数
foobar_build_info                  # 元数据伪指标
data_pipeline_last_processed_timestamp_seconds

# ❌ 差
http_request_duration_ms          # 应使用秒
node_memory_usage_megabytes       # 应使用字节
http_request_total                # 单数形式
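
Some of these conventions can be checked mechanically. The following is a rough sketch of a name linter; `lint_metric_name` and the `NON_BASE_UNITS` table are hypothetical and deliberately incomplete, and the character check uses the classic metric-name pattern rather than the wider UTF-8 names that newer Prometheus versions accept.

```python
import re

# Classic metric-name pattern from the Prometheus data model.
VALID_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")

# A small, non-exhaustive map from non-base unit suffixes to the base unit.
NON_BASE_UNITS = {
    "ms": "seconds",
    "milliseconds": "seconds",
    "minutes": "seconds",
    "kilobytes": "bytes",
    "megabytes": "bytes",
    "percent": "ratio",
}


def lint_metric_name(name: str) -> list:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    if VALID_NAME.fullmatch(name) is None:
        problems.append("invalid characters")
    for part in name.split("_"):
        if part in NON_BASE_UNITS:
            problems.append(f"use base unit '{NON_BASE_UNITS[part]}' instead of '{part}'")
    return problems


if __name__ == "__main__":
    for candidate in ("http_request_duration_seconds", "http_request_duration_ms"):
        print(candidate, lint_metric_name(candidate))
```

A missing application prefix or a singular unit cannot be detected reliably from the name alone, so those rules still need review by a person.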

建议 (SHOULD):

按词法排序时的便利分组

# ✅ 好 - 公共组件在前,相关指标排在一起
prometheus_tsdb_head_truncations_closed_total
prometheus_tsdb_head_truncations_established_total
prometheus_tsdb_head_truncations_failed_total
prometheus_tsdb_head_truncations_total

# 也可以 - 更易读但可能分散
prometheus_tsdb_head_closed_truncations_total
prometheus_tsdb_head_established_truncations_total

跨所有标签维度表示相同的逻辑测量对象

# ✅ 好 - sum() 或 avg() 有意义
request_duration_seconds
bytes_transferred
resource_usage_ratio

# ❌ 差 - sum() 无意义(混合了不同类型)
queue_capacity_and_size  # 混合容量和当前大小

为什么在名称中包含单位和类型?¶

Prometheus 强烈建议包含单位和类型,原因:

指标消费可靠性和 UX
纯 YAML 配置中可直接理解指标类型和单位
有助于事件响应时快速理解 PromQL 表达式
避免指标冲突
随着采用增长,缺少单位信息会导致序列冲突
例如 process_cpu 可能是秒或毫秒

标签规范¶

使用标签区分测量对象的特征:

# ✅ 好
api_http_requests_total{operation="create|update|delete"}
api_request_duration_seconds{stage="extract|transform|load"}

# ❌ Bad - label values encoded in the metric name
api_http_requests_create_total
api_http_requests_update_total
api_http_requests_delete_total

注意事项:

⚠️ 警告: 每个唯一的键值标签对组合代表一个新时间序列,会显著增加存储数据量。

不要使用高基数标签:

❌ 用户 ID

❌ 邮箱地址

❌ 其他无界值集合

基础单位¶

类别	基础单位	备注
Time	`seconds`
Temperature	`celsius`	首选 celsius 而非 kelvin(实用性)
Length	`meters`
Bytes	`bytes`
Bits	`bytes`	始终使用 bytes 避免混淆
Percent	`ratio`	值为 0-1(而非 0-100),后缀为 `_ratio`
Voltage	`volts`
Electric current	`amperes`
Energy	`joules`
Power	-	优先导出 counter of joules,用 `rate(joules[5m])` 得到瓦特
Mass	`grams`	优先 grams 而非 kilograms 以避免 kilo 前缀问题

百分比示例:

# ✅ 好
disk_usage_ratio  # 值范围 0.0 - 1.0
cpu_usage_ratio

# 或使用模式 A_per_B
requests_per_second
errors_per_requests

17.2 Instrumentation (埋点)¶

如何埋点¶

简短答案: 埋点一切!

每个库、子系统和服务至少应有几个指标
埋点应是代码的组成部分
在使用指标的同一文件中实例化指标类

三种服务类型¶

1. Online-Serving Systems (在线服务系统)

人类或系统期待即时响应
例如:数据库、HTTP 请求

关键指标:

执行的查询数
错误数
延迟
进行中的请求数(可选)

最佳实践:

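✅ Monitor on both the client side and the server side
✅ Count requests when they end (not when they start), so the count lines up with error and latency statistics
✅ Alert on latency at only one point in the stack (this avoids duplicate alerts for the same slowdown)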

2. Offline Processing (离线处理)

无人主动等待响应
通常批量处理

关键指标(每个阶段):

进入的项目数
进行中的项目数
最后处理时间
发出的项目数

心跳技巧: 发送带时间戳的虚拟项目通过系统,导出最近心跳时间戳。

3. Batch Jobs (批处理任务)

不连续运行
抓取困难

关键指标:

最后成功时间 (✅ 最重要)
各阶段耗时
总运行时间
最后完成时间
处理的记录总数

推送到 PushGateway:

# Push metrics when the batch job finishes
echo "batch_job_last_success_timestamp_seconds $(date +%s)" | \
  curl --data-binary @- http://pushgateway:9091/metrics/job/batch_job
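
When a batch job pushes several gauges at once, the request body can be generated instead of written by hand. The sketch below only renders the text exposition format and does no network I/O; `render_gauges` and the metric names in the usage example are hypothetical.

```python
def render_gauges(samples: dict) -> str:
    """Render {name: (help_text, value)} as Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in samples.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The body must end with a newline to be accepted as valid exposition text.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = render_gauges({
        "batch_job_last_success_timestamp_seconds": ("Unix time of the last successful run.", 1700000000),
        "batch_job_duration_seconds": ("Total runtime of the last run.", 42.5),
    })
    print(body, end="")
```

The resulting string could be sent with `curl --data-binary @-` as in the shell example, or with any HTTP client.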

最佳实践:

批处理运行超过几分钟 → 也使用 pull 监控
运行频率超过每 15 分钟 → 考虑转为守护进程

子系统埋点¶

Libraries (库):

# ✅ Good - the library instruments itself
import prometheus_client as prom

db_query_duration = prom.Histogram(
    'db_query_duration_seconds',
    'Database query duration',
    ['database', 'operation']  # distinguish the resources being accessed
)

@db_query_duration.labels('users', 'select').time()
def query_users():
    ...  # run the query; the decorator observes its duration

Logging (日志):

每一行日志代码 → 对应一个 counter
多个相关日志 → 可共享一个 counter
导出 info/error/warning 总行数

Failures (失败):

from prometheus_client import Counter

http_requests_total = Counter('http_requests_total', 'Total requests', ['status'])
http_request_errors = Counter('http_request_errors_total', 'Failed requests', ['type'])

# Record both the total and the failures so a failure ratio is easy to compute
http_requests_total.labels(status='200').inc()
http_request_errors.labels(type='timeout').inc()

Threadpools (线程池):

排队请求数
使用中的线程数
总线程数
处理的任务数
任务耗时
排队等待时间

Caches (缓存):

总查询数
命中数
总延迟
后端系统的查询数、错误数、延迟

需要注意的事项¶

1. 使用标签

# ✅ 好 - 单个指标 + 标签
http_responses_total{code="500"}
http_responses_total{code="403"}
http_responses_total{code="200"}

# ❌ 差 - 多个指标
http_responses_500_total
http_responses_403_total
http_responses_200_total

规则: 指标名称的任何部分都不应程序化生成(使用标签代替)

2. 不要过度使用标签

成本:

每个 labelset 增加 RAM、CPU、磁盘和网络开销

指导原则:

保持基数 < 10
超过 10 的指标,整个系统限制在少数几个
大多数指标应无标签

示例:

# ✅ 可接受 - node_exporter 在 10k 节点上
# 每个节点约 10 个文件系统 = 100k 时间序列
node_filesystem_avail

# ❌ 不可接受 - 添加用户配额
# 10k 用户 × 10k 节点 = 1 亿时间序列 (太多!)
node_filesystem_user_quota
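
The arithmetic behind these two cases is a simple product. A minimal sketch, where `series_count` is a hypothetical helper and each label is assumed to vary independently of the others:

```python
from math import prod


def series_count(label_cardinalities: dict, targets: int = 1) -> int:
    """Upper bound on time series: targets times the product of label value counts."""
    return targets * prod(label_cardinalities.values())


if __name__ == "__main__":
    # About 10 filesystems on each of 10,000 nodes.
    print(series_count({"filesystem": 10}, targets=10_000))
    # A per-user quota metric with 10,000 users on 10,000 nodes.
    print(series_count({"user": 10_000}, targets=10_000))
```

Because the factors multiply, adding one unbounded label is enough to move a metric from manageable to unmanageable.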

最佳实践:

✅ 从无标签开始
✅ 根据具体用例逐步添加标签

3. Counter vs. Gauge, Summary vs. Histogram

简单规则:

值能下降 → Gauge
值只能上升 → Counter

# Counter - 只能增加(或重置)
http_requests_total = Counter(...)         # ✅
bytes_transferred_total = Counter(...)     # ✅

# Gauge - 可上可下
memory_usage_bytes = Gauge(...)            # ✅
in_progress_requests = Gauge(...)          # ✅
temperature_celsius = Gauge(...)           # ✅

# ❌ 错误用法
rate(gauge_metric[5m])  # 不要对 Gauge 使用 rate()

4. 时间戳,而非"时间间隔"

# ✅ 好 - 导出 Unix 时间戳
last_success_timestamp_seconds = Gauge(...)
last_success_timestamp_seconds.set(time.time())

# 查询时计算间隔
time() - last_success_timestamp_seconds

# ❌ 差 - 导出间隔(需要更新逻辑)
time_since_last_success_seconds = Gauge(...)

5. 避免缺失指标

问题: 直到事件发生才出现的时间序列难以处理

解决方案: 预先导出默认值(如 0)

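from prometheus_client import Counter

# Most client libraries export 0 automatically only for metrics WITHOUT labels.
# A labelled metric such as this one has no series until labels() is called.
errors_total = Counter('errors_total', 'Errors', ['type'])

# ✅ Good - initialize every known label combination so each series starts at 0
for error_type in ['timeout', 'connection', 'parse']:
    errors_total.labels(type=error_type)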

6. 内循环优化

性能关键代码 (> 100k 调用/秒):

Java counter 增量: 12-17ns
限制内循环中的指标更新
避免标签(或缓存标签查找结果)
注意涉及时间/持续时间的指标(可能涉及系统调用)

最佳实践: 使用基准测试确定影响

17.3 Histograms (直方图)¶

注意: 本文档早于原生直方图(v2.40 实验性,v3.8 稳定)。

Count 和 Sum¶

Histograms 和 Summaries 都采样观测值,跟踪:

观测数量 (_count 后缀) - Counter
观测值总和 (_sum 后缀) - Counter (如无负值)

计算平均值:

# 过去 5 分钟的平均请求时长
  rate(http_request_duration_seconds_sum[5m])
/
  rate(http_request_duration_seconds_count[5m])
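
Dividing the two rates works because the window length cancels out, leaving the increase of the sum over the increase of the count. A small numeric sketch, with a hypothetical helper `average_over_window` and made-up counter readings:

```python
def average_over_window(sum_start: float, sum_end: float,
                        count_start: float, count_end: float) -> float:
    """Mean observation over a window, from _sum and _count at its start and end."""
    observations = count_end - count_start
    if observations == 0:
        return float("nan")  # no observations in the window, the mean is undefined
    return (sum_end - sum_start) / observations


if __name__ == "__main__":
    # 100 requests took 30 seconds in total during the window.
    print(average_over_window(120.0, 150.0, 1000, 1100))
```

This ignores counter resets, which `rate()` handles in PromQL.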

Apdex 分数¶

SLO 示例: 95% 请求在 300ms 内

# 300ms 内的请求比例
  sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
/
  sum(rate(http_request_duration_seconds_count[5m])) by (job)

Apdex 分数计算:

目标: 300ms
容忍: 1.2s

(
    sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
  +
    sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m])) by (job)
) / 2 / sum(rate(http_request_duration_seconds_count[5m])) by (job)

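Note: the sum is divided by 2 because buckets are cumulative. The le="0.3" bucket is already contained in the le="1.2" bucket, so adding both counts satisfied requests twice and tolerated requests once; halving gives satisfied + tolerated / 2.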
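
The same calculation can be written out on plain numbers. This is a sketch with a hypothetical helper `apdex` that takes cumulative bucket counts, using the 300ms target and 1.2s tolerated thresholds from above:

```python
def apdex(le_target: float, le_tolerated: float, total: float) -> float:
    """Apdex from cumulative bucket counts: (satisfied + tolerating / 2) / total."""
    satisfied = le_target
    tolerating = le_tolerated - le_target  # requests between the two thresholds
    return (satisfied + tolerating / 2) / total


if __name__ == "__main__":
    # 80 of 100 requests finished within 0.3s, 95 of 100 within 1.2s.
    print(apdex(80, 95, 100))
    # The PromQL form adds both cumulative buckets and halves the sum.
    print((80 + 95) / 2 / 100)
```

Both expressions are algebraically identical, which is why the PromQL query does not need to subtract one bucket from the other.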

Quantiles (分位数)¶

φ-quantile: φ*N 排名的观测值 (0 ≤ φ ≤ 1)

0.5-quantile = 中位数
0.95-quantile = 95th percentile

Histogram vs. Summary:

特性	Histogram	Summary
配置	选择适合范围的桶	选择所需 φ-quantiles 和滑动窗口
客户端性能	非常便宜(仅增加计数器)	昂贵(流式分位数计算)
服务器性能	需计算分位数	低成本
时间序列数	每个桶一个	每个分位数一个
分位数误差	受桶宽度限制	受配置值限制
规范	Ad-hoc(PromQL)	客户端预配置
聚合	✅ 可聚合	❌ 通常不可聚合

关键区别 - 聚合:

# ❌ 差 - Summary 平均分位数无统计意义
avg(http_request_duration_seconds{quantile="0.95"})

# ✅ 好 - Histogram 可正确聚合
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

选择建议¶

两条经验法则:

需要聚合 → Histogram
其他情况:
知道值的范围和分布 → Histogram
需要精确分位数,无论范围和分布 → Summary

分位数误差¶

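Histogram error example:

Bucket configuration: {le="0.1"}, {le="0.2"}, {le="0.3"}, {le="0.45"}

The true 95th percentile is 220ms
The histogram reports 295ms (linear interpolation assumes observations are spread evenly across the 200-300ms bucket)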
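
The 295ms figure can be reproduced with a simplified version of the interpolation. The sketch below is not the Prometheus implementation: `histogram_quantile` here is a hypothetical Python function over classic cumulative buckets, and it ignores native histograms and other edge cases.

```python
def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from sorted (upper_bound, cumulative_count) buckets."""
    total = buckets[-1][1]
    if not 0 < q <= 1 or total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return float("nan")


if __name__ == "__main__":
    # All 100 observations sit near 220ms, so they land in the (0.2, 0.3] bucket.
    spike = [(0.1, 0), (0.2, 0), (0.3, 100), (0.45, 100), (float("inf"), 100)]
    print(histogram_quantile(0.95, spike))
```

The estimate lands at 95% of the way through the bucket, close to 0.295 seconds, regardless of where inside the bucket the observations really are.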

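Summary error example:

Configuration: 0.95±0.01 (the reported value lies between the 94th and 96th percentile)

With a wide distribution, the 94th (270ms) and 96th (330ms) percentiles can be far apart
The reported value may then not tell you whether you are inside or outside a 300ms SLO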

结论:

Histogram: 控制观测值维度的误差
Summary: 控制 φ 维度的误差

17.4 Alerting (告警最佳实践)¶

核心原则¶

简单化告警:

仅告警症状(symptoms),而非每个可能原因
告警应关联终端用户痛点
告警应链接到相关控制台
允许容错以适应小波动

告警命名¶

推荐: 使用 CamelCase

# ✅ 推荐
- alert: HighRequestLatency
- alert: DatabaseDown
- alert: ServiceUnavailable

# 也可以,但不推荐
- alert: high_request_latency

告警内容¶

Online-Serving Systems:

groups:
  - name: latency_errors
    rules:
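      # ✅ Good - alert on high latency at the top of the stack
      - alert: HighLatency
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m

      # ✅ Good - alert on user-visible errors (more than 5% of requests failing)
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m

      # ❌ Bad - do not alert on latency at several layers
      # If user-facing latency is fine, a slow lower layer does not need an alert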

Offline Processing:

- alert: ProcessingStalled
  expr: time() - pipeline_last_processed_timestamp_seconds > 3600
  for: 10m
  annotations:
    summary: "Pipeline {{ $labels.pipeline }} stalled"

Batch Jobs:

- alert: BatchJobFailed
  # The job runs every 4 hours and takes 1 hour. The threshold should leave time
  # for at least 2 full runs, so 10 hours is reasonable.
  expr: time() - batch_job_last_success_timestamp_seconds > 36000  # 10 hours
  for: 0m
  annotations:
    summary: "Batch job {{ $labels.job }} hasn't succeeded in 10 hours"

容量告警¶

- alert: DiskSpaceWarning
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15
  for: 30m
  labels:
    severity: warning

- alert: DiskSpaceCritical
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.05
  for: 10m
  labels:
    severity: critical

元监控 (Metamonitoring)¶

监控监控基础设施本身:

# ✅ 好 - 症状检测(黑盒测试)
- alert: AlertingPipelineBroken
  expr: up{job="blackbox-test"} == 0
  # 测试: 告警从 PushGateway → Prometheus → Alertmanager → Email

# 也可以 - 单组件告警
- alert: PrometheusDown
  expr: up{job="prometheus"} == 0

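Supplement whitebox monitoring with external blackbox monitoring:

It catches problems that are otherwise invisible
It serves as a fallback when internal systems fail completely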

17.5 Recording Rules (记录规则)¶

命名约定¶

格式: level:metric:operations

level - 聚合级别和标签
metric - 指标名称(从 counter 去除 _total,使用 rate/irate 时)
operations - 操作列表(最新操作在前)

示例:

# level = instance_path (有 instance 和 path 标签)
# metric = requests
# operations = rate5m (5分钟 rate)
- record: instance_path:requests:rate5m
  expr: rate(requests_total[5m])

# level = path (只有 path 标签,聚合掉 instance)
- record: path:requests:rate5m
  expr: sum without (instance)(instance_path:requests:rate5m)

命名规则¶

简化操作:

省略 _sum(如有其他操作)
合并结合操作 (min_min = min)
无明显操作时使用 sum
比率操作用 _per_ 分隔,操作称为 ratio

平均观测大小:

去除 _count/_sum 后缀
用 mean 替换 rate
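
The naming convention is regular enough to generate names programmatically. The sketch below covers only the basic pattern and the `_total` stripping rule; `rule_name` is a hypothetical helper and does not implement the `_per_` ratio or `mean` rules.

```python
def rule_name(level_labels: list, metric: str, operations: list) -> str:
    """Build a level:metric:operations name; operations are listed newest first."""
    uses_rate = any(op.startswith(("rate", "irate")) for op in operations)
    if uses_rate and metric.endswith("_total"):
        # Counters lose their _total suffix once rate() or irate() is applied.
        metric = metric[: -len("_total")]
    return f"{'_'.join(level_labels)}:{metric}:{'_'.join(operations)}"


if __name__ == "__main__":
    print(rule_name(["instance", "path"], "requests_total", ["rate5m"]))
    print(rule_name(["job"], "request_latency_seconds_count", ["avg", "rate5m"]))
```

Generating names this way keeps the level in the name aligned with the labels that actually remain after aggregation.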

聚合规则¶

规则 1: 聚合比率时,分别聚合分子和分母

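# ❌ Bad - do not average a ratio or average an average
avg without (instance)(instance:request_failures_per_requests:ratio_rate5m)

# ✅ Good - aggregate numerator and denominator separately, then divide
  sum without (instance)(instance:request_failures:rate5m)
/
  sum without (instance)(instance:requests:rate5m)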
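
A small numeric example shows why averaging ratios misleads when instances carry very different load. The helper names and the traffic figures below are hypothetical.

```python
def ratio_of_sums(failures: list, requests: list) -> float:
    """Correct aggregation: sum numerators and denominators, then divide."""
    return sum(failures) / sum(requests)


def average_of_ratios(failures: list, requests: list) -> float:
    """Misleading aggregation: every instance counts equally, whatever its traffic."""
    ratios = [f / r for f, r in zip(failures, requests)]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    failures = [1, 1]      # failures per second on two instances
    requests = [1000, 2]   # requests per second on the same instances
    print(ratio_of_sums(failures, requests))
    print(average_of_ratios(failures, requests))
```

The nearly idle instance pulls the average of ratios to about 25%, while only about 0.2% of all requests actually fail.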

Rule 2: Always specify a without clause listing the labels being aggregated away

# ✅ 好 - 保留其他标签(job 等)
sum without (instance)(instance_path:requests:rate5m)

# ❌ 差 - 可能丢失有用标签
sum(instance_path:requests:rate5m)

完整示例¶

聚合请求率:

# Step 1: 实例+路径级别
- record: instance_path:requests:rate5m
  expr: rate(requests_total{job="myjob"}[5m])

# Step 2: 路径级别(聚合掉 instance)
- record: path:requests:rate5m
  expr: sum without (instance)(instance_path:requests:rate5m{job="myjob"})

失败率比例:

# Step 1: 失败率
- record: instance_path:request_failures:rate5m
  expr: rate(request_failures_total{job="myjob"}[5m])

# Step 2: 实例+路径级别比率
- record: instance_path:request_failures_per_requests:ratio_rate5m
  expr: |2
      instance_path:request_failures:rate5m{job="myjob"}
    /
      instance_path:requests:rate5m{job="myjob"}

# Step 3: 路径级别比率(正确聚合)
- record: path:request_failures_per_requests:ratio_rate5m
  expr: |2
      sum without (instance)(instance_path:request_failures:rate5m{job="myjob"})
    /
      sum without (instance)(instance_path:requests:rate5m{job="myjob"})

# Step 4: Job 级别比率
- record: job:request_failures_per_requests:ratio_rate5m
  expr: |2
      sum without (instance, path)(instance_path:request_failures:rate5m{job="myjob"})
    /
      sum without (instance, path)(instance_path:requests:rate5m{job="myjob"})

平均延迟(Summary):

- record: instance_path:request_latency_seconds_count:rate5m
  expr: rate(request_latency_seconds_count{job="myjob"}[5m])

- record: instance_path:request_latency_seconds_sum:rate5m
  expr: rate(request_latency_seconds_sum{job="myjob"}[5m])

# mean 替换 rate(因为是平均观测大小)
- record: instance_path:request_latency_seconds:mean5m
  expr: |2
      instance_path:request_latency_seconds_sum:rate5m{job="myjob"}
    /
      instance_path:request_latency_seconds_count:rate5m{job="myjob"}

- record: path:request_latency_seconds:mean5m
  expr: |2
      sum without (instance)(instance_path:request_latency_seconds_sum:rate5m{job="myjob"})
    /
      sum without (instance)(instance_path:request_latency_seconds_count:rate5m{job="myjob"})

平均速率(avg 函数):

- record: job:request_latency_seconds_count:avg_rate5m
  expr: avg without (instance, path)(instance_path:request_latency_seconds_count:rate5m{job="myjob"})

验证规则:

聚合时,without 子句中的标签从输出 level 中移除
无聚合时,level 始终匹配
如果不匹配,规则可能有错误

17.6 Pushing Metrics (推送指标)¶

何时使用 Pushgateway?¶

仅限特定情况,不建议盲目使用

Pushgateway 的问题¶

问题 1: 单点故障和瓶颈

多实例通过单个 Pushgateway → 成为瓶颈

问题 2: 失去自动健康监控

丢失 Prometheus 的 up 指标(每次抓取生成)

问题 3: 永不遗忘

Pushgateway 永远不会忘记推送的序列
必须手动通过 API 删除
实例重命名/删除后指标仍然存在
需要手动同步生命周期

唯一有效用例¶

Service-Level Batch Jobs:

与特定机器/实例语义无关的批处理
例如:为整个服务删除用户的批处理

特点:

指标不应包含 machine 或 instance 标签
减少管理 Pushgateway 中过期指标的负担

# ✅ 好 - Service-level batch job
echo "user_cleanup_total 150" | \
  curl --data-binary @- http://pushgateway:9091/metrics/job/user_cleanup

# ❌ 差 - Machine-specific batch job
echo "backup_size_bytes 1073741824" | \
  curl --data-binary @- http://pushgateway:9091/metrics/job/backup/instance/server1

替代策略¶

1. 防火墙/NAT 问题:

# 将 Prometheus 移到网络屏障内
# 或使用 PushProx

PushProx 架构:

Proxy 在网络外部
Targets 主动连接到 proxy
Prometheus 通过 proxy 拉取

2. 机器相关批处理任务:

使用 Node Exporter 的 textfile collector:

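# ✅ Good - use the textfile collector. Write to a temporary file and rename it,
# so node_exporter never reads a half-written file.
TEXTFILE_DIR=/var/lib/node_exporter/textfile_collector
cat > "$TEXTFILE_DIR/backup.prom.$$" << EOF
# HELP backup_last_success_timestamp_seconds Unix time of the last successful backup
# TYPE backup_last_success_timestamp_seconds gauge
backup_last_success_timestamp_seconds $(date +%s)
# HELP backup_size_bytes Size of last backup
# TYPE backup_size_bytes gauge
backup_size_bytes 1073741824
EOF
mv "$TEXTFILE_DIR/backup.prom.$$" "$TEXTFILE_DIR/backup.prom"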

优点:

自动生命周期管理
机器宕机时指标自动消失
无需管理 Pushgateway
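
A batch job written in Python can produce the textfile collector input directly. The sketch below writes the file atomically so a scrape never sees partial content; `write_textfile` is a hypothetical helper, and the directory is whatever path the node exporter is configured to read.

```python
import os
import tempfile


def write_textfile(directory: str, filename: str, body: str) -> str:
    """Write body to directory/filename atomically via a temporary file and rename."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(body)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file readable by the owner only
        final_path = os.path.join(directory, filename)
        os.replace(tmp_path, final_path)  # atomic within the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

The temporary file is created in the target directory so that the rename stays on one filesystem, and its `.tmp` suffix keeps it from being picked up as a `.prom` file.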

最佳实践总结¶

Pushgateway 使用检查清单:

是 service-level batch job?(不是特定机器)
指标不包含 instance/machine 标签?
可以手动管理指标生命周期?

如果任何一项为 No → 使用替代方案:

机器相关批处理 → Node Exporter textfile collector
防火墙问题 → PushProx 或移动 Prometheus
其他情况 → 重新考虑架构

Practices 快速参考¶

Naming (命名):

应用前缀 + 基础单位 + 复数后缀
基数 < 10,避免高基数标签

Instrumentation (埋点):

Counter vs Gauge:值能下降用 Gauge
导出时间戳,而非时间间隔
避免缺失指标,预初始化

Histograms:

需要聚合 → Histogram
需要精确分位数 → Summary

Alerting:

告警症状而非原因
关联用户痛点
在栈顶层告警

Recording Rules:

格式:level:metric:operations
始终用 without 子句
聚合比率时分别聚合分子和分母

Pushing:

仅用于 service-level batch jobs
机器相关任务 → textfile collector
防火墙问题 → PushProx

18. Exporters (导出器列表)¶

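The following lists official and third-party Prometheus exporters for a wide range of systems, services, and applications. Entries marked 🏅 are official exporters maintained in the Prometheus GitHub organization.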

Databases (数据库)¶

名称	链接
Aerospike exporter	GitHub
AWS RDS exporter	GitHub
ClickHouse exporter	GitHub
Consul exporter 🏅	GitHub
Couchbase exporter	GitHub
CouchDB exporter	GitHub
Druid exporter	GitHub
Elasticsearch exporter	GitHub
EventStore exporter	GitHub
IoTDB exporter	GitHub
KDB+ exporter	GitHub
Memcached exporter 🏅	GitHub
MongoDB exporter	GitHub
MongoDB query exporter	GitHub
MongoDB Node.js Driver exporter	GitHub
MSSQL server exporter	GitHub
MySQL router exporter	GitHub
MySQL server exporter 🏅	GitHub
OpenTSDB exporter	GitHub
Oracle DB exporter	GitHub
PgBouncer exporter	GitHub
PostgreSQL exporter	GitHub
Presto exporter	GitHub
ProxySQL exporter	GitHub
RavenDB exporter	GitHub
Redis exporter	GitHub
RethinkDB exporter	GitHub
SQL exporter	GitHub
Tarantool metric library	GitHub
Twemproxy exporter	GitHub

Hardware Related (硬件相关)¶

名称	链接
apcupsd exporter	GitHub
BIG-IP exporter	GitHub
Bosch Sensortec BMP/BME exporter	GitHub
Collins exporter	GitHub
Dell Hardware OMSA exporter	GitHub
Disk usage exporter	GitHub
Fortigate exporter	GitHub
IBM Z HMC exporter	GitHub
IoT Edison exporter	GitHub
InfiniBand exporter	GitHub
IPMI exporter	GitHub
knxd exporter	GitHub
Modbus exporter	GitHub
Netgear Cable Modem exporter	GitHub
Netgear Router exporter	GitHub
Network UPS Tools (NUT) exporter	GitHub
Node/system metrics exporter 🏅	GitHub
NVIDIA GPU exporter	GitHub
ProSAFE exporter	GitHub
SmartRAID exporter	GitLab
Waveplus Radon Sensor exporter	GitHub
Weathergoose Climate Monitor exporter	GitHub
Windows exporter	GitHub
Intel® Optane™ PMem Controller exporter	GitHub

Issue Trackers & CI/CD (问题跟踪 & 持续集成)¶

名称	链接
Bamboo exporter	GitHub
Bitbucket exporter	GitHub
Confluence exporter	GitHub
Jenkins exporter	GitHub
JIRA exporter	GitHub

Messaging Systems (消息系统)¶

名称	链接
Beanstalkd exporter	GitHub
EMQ exporter	GitHub
Gearman exporter	GitHub
IBM MQ exporter	GitHub
Kafka exporter	GitHub
NATS exporter	GitHub
NSQ exporter	GitHub
Mirth Connect exporter	GitHub
MQTT blackbox exporter	GitHub
MQTT2Prometheus	GitHub
RabbitMQ exporter	GitHub
RabbitMQ Management Plugin exporter	GitHub
RocketMQ exporter	GitHub
Solace exporter	GitHub

Storage (存储)¶

名称	链接
Ceph exporter	GitHub
Ceph RADOSGW exporter	GitHub
Gluster exporter	GitHub
GPFS exporter	GitHub
Hadoop HDFS FSImage exporter	GitHub
HPE CSI info metrics provider	Docs
HPE storage array exporter	GitHub
Lustre exporter	GitHub
NetApp E-Series exporter	GitHub
Pure Storage exporter	GitHub
ScaleIO exporter	GitHub
Tivoli Storage Manager/IBM Spectrum Protect	GitHub

HTTP (HTTP 服务器)¶

名称	链接
Apache exporter	GitHub
HAProxy exporter 🏅	GitHub
Nginx metric library	GitHub
Nginx VTS exporter	GitHub
Passenger exporter	GitHub
Squid exporter	GitHub
Tinyproxy exporter	GitHub
Varnish exporter	GitHub
WebDriver exporter	GitHub

APIs (API 服务)¶

名称	链接
AWS ECS exporter	GitHub
AWS Health exporter	GitHub
AWS SQS exporter	GitHub
AWS SQS Prometheus exporter	GitHub
Azure Health exporter	GitHub
BigBlueButton exporter	GitHub
Cloudflare exporter	GitLab
Cryptowat exporter	GitHub
DigitalOcean exporter	GitHub
Docker Cloud exporter	GitHub
Docker Hub exporter	GitHub
Fastly exporter	GitHub
GitHub exporter	GitHub
Gmail exporter	GitHub
GraphQL exporter	GitHub
InstaClustr exporter	GitHub
Mozilla Observatory exporter	GitHub
OpenWeatherMap exporter	GitHub
Pagespeed exporter	GitHub
Rancher exporter	GitHub
Speedtest exporter	GitHub
Tankerkönig API exporter	GitHub

Logging (日志)¶

名称	链接
Fluentd exporter	GitHub
Google's mtail log data extractor	GitHub
Grok exporter	GitHub

FinOps (成本管理)¶

名称	链接
AWS Cost exporter	GitHub
Azure Cost exporter	GitHub
Kubernetes Cost exporter	GitHub

Miscellaneous (其他)¶

名称	链接
ACT Fibernet exporter	Git
BIND exporter	GitHub
BIND query exporter	GitHub
Bitcoind exporter	GitHub
Blackbox exporter 🏅	GitHub
Bungeecord exporter	GitHub
BOSH exporter	GitHub
cAdvisor	GitHub
Cachet exporter	GitHub
ccache exporter	GitHub
c-lightning exporter	GitHub
DHCPD leases exporter	GitHub
Dovecot exporter	GitHub
Dnsmasq exporter	GitHub
eBPF exporter	GitHub
eBPF network traffic exporter	GitHub
Ethereum Client exporter	GitHub
FFmpeg exporter	GitHub
File statistics exporter	GitHub
JFrog Artifactory exporter	GitHub
Hostapd exporter	GitHub
IBM Security Verify Access exporter	GitLab
IPsec exporter	GitHub
ipset exporter	GitHub
IRCd exporter	GitHub
Linux HA ClusterLabs exporter	GitHub
JMeter plugin	GitHub
JSON exporter	GitHub
Kannel exporter	GitHub
Kemp LoadBalancer exporter	GitHub
Kibana exporter	GitHub
kube-state-metrics	GitHub
Locust exporter	GitHub
Meteor JS web framework exporter	Atmosphere
Minecraft exporter module	GitHub
Minecraft exporter	GitHub
NetBird exporter	GitHub
Nomad exporter	GitLab
nftables exporter	GitHub
OpenStack exporter	GitHub
OpenStack blackbox exporter	GitHub
OpenVPN exporter	GitHub
oVirt exporter	GitHub
Pact Broker exporter	GitHub
PHP-FPM exporter	GitHub
PowerDNS exporter	GitHub
Podman exporter	GitHub
Prefect2 exporter	GitHub
Process exporter	GitHub
rTorrent exporter	GitHub
Rundeck exporter	GitHub
SABnzbd exporter	GitHub
SAML exporter	GitHub
Script exporter	GitHub
Shield exporter	GitHub
Smokeping prober	GitHub
SMTP/Maildir MDA blackbox prober	GitHub
SoftEther exporter	GitHub
SSH exporter	GitHub
Teamspeak3 exporter	GitHub
Transmission exporter	GitHub
Unbound exporter	GitHub
WireGuard exporter	GitHub
Xen exporter	GitHub

图例:

🏅 = 官方维护的 Exporter

Statistics:

Databases: 30
Hardware: 24
CI/CD: 5
Messaging systems: 14
Storage: 12
HTTP: 9
APIs: 22
Logging: 3
FinOps: 3
Miscellaneous: 64

Total: 186 exporters (sum of the categories above)

附录: Golang Exporter 完整示例¶

概述¶

这是一个完整的 Golang Exporter 示例,展示了如何使用 Prometheus 客户端库实现:

Gauge - 内存使用、CPU 核心数等
Histogram - HTTP 请求延迟分布

完整代码¶

package main

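import (
 "context"
 "errors"
 "flag"
 "fmt"
 "log"
 "math/rand"
 "net/http"
 "os"
 "os/signal"
 "strconv"
 "syscall"
 "time"

 "github.com/prometheus/client_golang/prometheus"
 "github.com/prometheus/client_golang/prometheus/promhttp"
 "github.com/shirou/gopsutil/cpu"
 "github.com/shirou/gopsutil/mem"
)

var listenPort string

// ExporterMetrics holds the metric descriptors exposed by the custom collector
type ExporterMetrics struct {
 connectInfo *prometheus.Desc
 memInfo     *prometheus.Desc
 cpuNums     *prometheus.Desc
}

// Describe sends every descriptor this collector may emit to the channel
func (collector *ExporterMetrics) Describe(ch chan<- *prometheus.Desc) {
 ch <- collector.connectInfo
 ch <- collector.memInfo
 ch <- collector.cpuNums
}

// Collect is called on every scrape and emits the current values.
// connection_status is described but not collected in this example.
func (collector *ExporterMetrics) Collect(ch chan<- prometheus.Metric) {
 collector.collectMemoryMetrics(ch)
 collector.collectCPUMetrics(ch)
}

// collectCPUMetrics reports the number of logical CPU cores
func (collector *ExporterMetrics) collectCPUMetrics(ch chan<- prometheus.Metric) {
 cores, err := cpu.Counts(true)
 if err != nil {
  log.Printf("Error collecting CPU count: %v\n", err)
  return
 }

 ch <- prometheus.MustNewConstMetric(
  collector.cpuNums, prometheus.GaugeValue, float64(cores),
 )
}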

// collectMemoryMetrics 收集内存使用信息
func (collector *ExporterMetrics) collectMemoryMetrics(ch chan<- prometheus.Metric) {
 memoryStats, err := mem.VirtualMemory()
 if err != nil {
  log.Printf("Error collecting memory stats: %v\n", err)
  return
 }

 ch <- prometheus.MustNewConstMetric(
  collector.memInfo, prometheus.GaugeValue, float64(memoryStats.Free), "free",
 )
 ch <- prometheus.MustNewConstMetric(
  collector.memInfo, prometheus.GaugeValue, float64(memoryStats.Used), "used",
 )
 ch <- prometheus.MustNewConstMetric(
  collector.memInfo, prometheus.GaugeValue, float64(memoryStats.Total), "total",
 )
}

// 内存指标信息
func memDesc() *prometheus.Desc {
 return prometheus.NewDesc(
  "memory_usage",
  "Memory usage metrics including total, used, and free memory",
  []string{"type"},
  nil)
}

// cpu指标信息
func cpuDesc() *prometheus.Desc {
 return prometheus.NewDesc(
  "cpu_cores",
  "Number of CPU cores available",
  nil,
  nil)
}

// 连接指标信息
func connDesc() *prometheus.Desc {
 return prometheus.NewDesc(
  "connection_status",
  "Status of network connections",
  []string{"local_addr", "local_port", "remote_addr", "remote_port", "status", "pid"},
  nil)
}

// newMetricsCollector 创建并初始化一个新的指标收集器
func newMetricsCollector() *ExporterMetrics {
 return &ExporterMetrics{
  connectInfo: connDesc(),
  memInfo:     memDesc(),
  cpuNums:     cpuDesc(),
 }
}

// =============== Histogram 示例 ===============

// 创建 Histogram 指标 - HTTP 请求延迟
var (
 httpRequestDuration = prometheus.NewHistogramVec(
  prometheus.HistogramOpts{
   Name: "http_request_duration_seconds",
   Help: "HTTP request latency distributions",
   // 自定义桶边界: 1ms, 5ms, 10ms, 50ms, 100ms, 500ms, 1s, 5s
   Buckets: []float64{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5},
  },
  []string{"method", "endpoint", "status"}, // 标签
 )
)

// 模拟 API 处理函数 - 用于演示 Histogram
func apiHandler(w http.ResponseWriter, r *http.Request) {
 // 记录请求开始时间
 start := time.Now()

 // 模拟业务处理（随机延迟 10-500ms）
 processingTime := time.Duration(10+rand.Intn(490)) * time.Millisecond
 time.Sleep(processingTime)

 // 模拟不同的响应状态
 statusCode := 200
 if rand.Float64() < 0.1 { // 10% 概率返回 500
  statusCode = 500
 }

 // 记录请求耗时到 Histogram
 duration := time.Since(start).Seconds()
 httpRequestDuration.WithLabelValues(
  r.Method,
  r.URL.Path,
  strconv.Itoa(statusCode),
 ).Observe(duration)

 // 返回响应
 w.WriteHeader(statusCode)
 fmt.Fprintf(w, "Request processed in %.3f seconds\n", duration)
}

// 健康检查接口
func health(w http.ResponseWriter, r *http.Request) {
 w.Write([]byte("health"))
}

func main() {
 flag.StringVar(&listenPort, "port", "28880", "exporter listen port")
 flag.Parse()

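 // Seed the random number generator. Since Go 1.20 the global generator is
 // seeded automatically and rand.Seed is deprecated; the call is kept for
 // older toolchains
 rand.Seed(time.Now().UnixNano())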

 // 注册自定义 Collector (Gauge)
 allMetrics := newMetricsCollector()
 prometheus.MustRegister(allMetrics)

 // 注册 Histogram 指标
 prometheus.MustRegister(httpRequestDuration)

 // 路由配置
 http.Handle("/metrics", promhttp.Handler())
 http.HandleFunc("/healthz", health)
 http.HandleFunc("/api/users", apiHandler)
 http.HandleFunc("/api/orders", apiHandler)
 http.HandleFunc("/api/products", apiHandler)

 server := &http.Server{Addr: fmt.Sprintf(":%s", listenPort)}

 log.Printf("Starting server on port %s\n", listenPort)
 log.Printf("Metrics: http://localhost:%s/metrics\n", listenPort)

 go func() {
  if err := server.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
   log.Printf("ListenAndServe(): %s\n", err)
   panic(err)
  }
 }()

 sigChan := make(chan os.Signal, 1)
 // SIGKILL cannot be caught, so only SIGTERM and SIGINT are registered
 signal.Notify(sigChan, syscall.SIGTERM, os.Interrupt)

 sig := <-sigChan
 log.Printf("Received signal %v, shutting down gracefully...\n", sig)

 timeout, cancel := context.WithTimeout(context.Background(), 60*time.Second)
 defer cancel()

 if err := server.Shutdown(timeout); err != nil {
  log.Printf("Server Close Error: %s\n", err)
 } else {
  log.Println("server Close Successful")
 }
}

使用说明¶

1. 安装依赖

go mod init exporter-demo
go get github.com/prometheus/client_golang/prometheus
go get github.com/prometheus/client_golang/prometheus/promhttp
go get github.com/shirou/gopsutil

2. 运行 Exporter

# 默认端口 28880
go run main.go

# 自定义端口
go run main.go -port 9090

3. 测试 API 端点

# 生成一些请求,产生 Histogram 数据
for i in {1..100}; do
  curl http://localhost:28880/api/users
  curl http://localhost:28880/api/orders
  curl http://localhost:28880/api/products
done

4. 查看指标

访问 http://localhost:28880/metrics 查看暴露的指标:

# Gauge 指标 - 内存使用
memory_usage{type="free"} 8589934592
memory_usage{type="used"} 7340032000
memory_usage{type="total"} 17179869184

# Histogram 指标 - HTTP 请求延迟
http_request_duration_seconds_bucket{method="GET",endpoint="/api/users",status="200",le="0.001"} 0
http_request_duration_seconds_bucket{method="GET",endpoint="/api/users",status="200",le="0.005"} 0
http_request_duration_seconds_bucket{method="GET",endpoint="/api/users",status="200",le="0.01"} 0
http_request_duration_seconds_bucket{method="GET",endpoint="/api/users",status="200",le="0.05"} 5
http_request_duration_seconds_bucket{method="GET",endpoint="/api/users",status="200",le="0.1"} 18
http_request_duration_seconds_bucket{method="GET",endpoint="/api/users",status="200",le="0.5"} 100
http_request_duration_seconds_bucket{method="GET",endpoint="/api/users",status="200",le="1"} 100
http_request_duration_seconds_bucket{method="GET",endpoint="/api/users",status="200",le="5"} 100
http_request_duration_seconds_bucket{method="GET",endpoint="/api/users",status="200",le="+Inf"} 100
http_request_duration_seconds_sum{method="GET",endpoint="/api/users",status="200"} 25.384
http_request_duration_seconds_count{method="GET",endpoint="/api/users",status="200"} 100
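
The `le` buckets in the output above are cumulative: each bucket counts every observation at or below its upper bound, and `+Inf` always equals `_count`. The following is a minimal, standard-library-only sketch of that bookkeeping. The `histogram` type and its methods are illustrative names, not the client_golang implementation.

```go
package main

import "fmt"

// histogram is a simplified stand-in for a Prometheus classic histogram.
// It is an illustration only, not the client_golang implementation.
type histogram struct {
	bounds []float64 // upper bounds (le), sorted ascending; +Inf is implicit
	counts []uint64  // per-bucket (non-cumulative) counts; last slot is +Inf
	sum    float64
	count  uint64
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe records v in the first bucket whose upper bound is >= v.
func (h *histogram) observe(v float64) {
	i := 0
	for i < len(h.bounds) && v > h.bounds[i] {
		i++
	}
	h.counts[i]++
	h.sum += v
	h.count++
}

// cumulative returns the counts in the form exposed on /metrics: every
// bucket includes all observations at or below its bound; the last entry
// is the +Inf bucket.
func (h *histogram) cumulative() []uint64 {
	out := make([]uint64, len(h.counts))
	var running uint64
	for i, c := range h.counts {
		running += c
		out[i] = running
	}
	return out
}

func main() {
	// Same bucket boundaries as the exporter example above
	h := newHistogram([]float64{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5})
	for _, v := range []float64{0.03, 0.07, 0.3} {
		h.observe(v)
	}
	fmt.Println(h.cumulative(), h.count)
}
```

Because the handler sleeps for at least 10 ms, the three smallest buckets stay at 0 in the sample output, which is a hint that those boundaries add series without adding information for this workload.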

代码要点解析¶

1. Gauge 指标 (自定义 Collector)¶

// 实现 prometheus.Collector 接口
type ExporterMetrics struct {
    memInfo *prometheus.Desc
}

// Describe - 描述指标
func (c *ExporterMetrics) Describe(ch chan<- *prometheus.Desc) {
    ch <- c.memInfo
}

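// Collect - gather the current values on every scrape
func (c *ExporterMetrics) Collect(ch chan<- prometheus.Metric) {
    memoryStats, err := mem.VirtualMemory()
    if err != nil {
        // Skip this scrape instead of dereferencing a nil result
        log.Printf("Error collecting memory stats: %v", err)
        return
    }
    ch <- prometheus.MustNewConstMetric(
        c.memInfo,
        prometheus.GaugeValue,
        float64(memoryStats.Free),
        "free",
    )
}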

特点:

动态收集系统指标
每次抓取时实时获取最新值
适合 CPU、内存、磁盘等系统指标

2. Histogram 指标¶

// 定义 Histogram
var httpRequestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name: "http_request_duration_seconds",
        Help: "HTTP request latency distributions",
        Buckets: []float64{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5},
    },
    []string{"method", "endpoint", "status"},
)

// 记录观测值
httpRequestDuration.WithLabelValues("GET", "/api/users", "200").Observe(0.123)

桶配置选项:

// 1. 自定义桶
Buckets: []float64{0.001, 0.01, 0.1, 1, 10}

// 2. 默认桶
Buckets: prometheus.DefBuckets  // [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]

// 3. 线性桶 (起始值, 宽度, 数量)
Buckets: prometheus.LinearBuckets(0, 10, 10)  // [0, 10, 20, ..., 90]

// 4. 指数桶 (起始值, 因子, 数量)
Buckets: prometheus.ExponentialBuckets(1, 2, 10)  // [1, 2, 4, 8, ..., 512]
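
To make the (start, width, count) and (start, factor, count) parameters concrete, here is a small standard-library sketch that generates the same kind of boundary slices. `linearBuckets` and `exponentialBuckets` are hypothetical local helpers written for illustration; in real code the `prometheus.LinearBuckets` and `prometheus.ExponentialBuckets` functions shown above should be used.

```go
package main

import "fmt"

// linearBuckets returns count upper bounds, starting at start and
// increasing by width each step.
func linearBuckets(start, width float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start += width
	}
	return buckets
}

// exponentialBuckets returns count upper bounds, starting at start and
// multiplying by factor each step.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	fmt.Println(linearBuckets(0, 10, 10))      // 0, 10, 20, ... 90
	fmt.Println(exponentialBuckets(1, 2, 10)) // 1, 2, 4, ... 512
}
```

Note that `count` is the number of finite boundaries; the `+Inf` bucket is always added on top of them.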

Prometheus 配置¶

将此 Exporter 添加到 Prometheus 抓取配置:

scrape_configs:
  - job_name: 'custom-exporter'
    static_configs:
      - targets: ['localhost:28880']
    scrape_interval: 15s

PromQL 查询示例¶

查询 Histogram P99 延迟:

# 各端点 P99 延迟
histogram_quantile(0.99, 
  sum by (endpoint, le) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# 各端点平均延迟
sum by (endpoint) (rate(http_request_duration_seconds_sum[5m])) 
/ 
sum by (endpoint) (rate(http_request_duration_seconds_count[5m]))

# 请求 QPS
sum by (endpoint) (rate(http_request_duration_seconds_count[5m]))

# 错误率
sum by (endpoint) (rate(http_request_duration_seconds_count{status="500"}[5m])) 
/ 
sum by (endpoint) (rate(http_request_duration_seconds_count[5m]))
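
`histogram_quantile` does not see individual observations; it estimates the quantile from the cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified, standard-library illustration of that idea, fed with the sample bucket counts from the `/metrics` output earlier in this appendix. `estimateQuantile`, `bucket` and `exampleBuckets` are illustrative names, and the function leaves out edge cases that Prometheus handles.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket is one cumulative histogram bucket.
type bucket struct {
	upperBound float64 // the le label
	count      float64 // cumulative count of observations <= upperBound
}

// exampleBuckets mirrors the sample /metrics output shown earlier.
func exampleBuckets() []bucket {
	return []bucket{
		{0.001, 0}, {0.005, 0}, {0.01, 0}, {0.05, 5}, {0.1, 18},
		{0.5, 100}, {1, 100}, {5, 100}, {math.Inf(1), 100},
	}
}

// estimateQuantile is a simplified sketch of the classic-histogram quantile
// estimate. It assumes 0 < q <= 1 and buckets sorted by upperBound with
// +Inf as the last entry.
func estimateQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 || q <= 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// First finite bucket whose cumulative count reaches the rank
	b := sort.Search(len(buckets)-1, func(i int) bool {
		return buckets[i].count >= rank
	})
	if b == len(buckets)-1 {
		// The rank falls into +Inf: report the highest finite bound
		return buckets[len(buckets)-2].upperBound
	}
	lower, below := 0.0, 0.0
	if b > 0 {
		lower = buckets[b-1].upperBound
		below = buckets[b-1].count
	}
	inBucket := buckets[b].count - below
	// Assume observations are spread evenly inside the bucket
	return lower + (buckets[b].upperBound-lower)*(rank-below)/inBucket
}

func main() {
	fmt.Printf("P50 ~ %.4f s\n", estimateQuantile(0.50, exampleBuckets()))
	fmt.Printf("P99 ~ %.4f s\n", estimateQuantile(0.99, exampleBuckets()))
}
```

With these counts both estimates land inside the 0.1 to 0.5 s bucket, which holds 82 of the 100 observations. This is why bucket boundaries near the latencies you care about matter: the estimate can never be more precise than the bucket it falls into.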

最佳实践¶

1. Histogram 桶设计

✅ 根据实际延迟分布设计桶边界
✅ 覆盖 SLO 阈值附近的桶
✅ 避免桶过多(通常 10-20 个桶)

2. 标签使用

✅ 保持标签基数低(<1000)
✅ 使用有意义的标签(method, endpoint, status)
❌ 避免高基数标签(user_id, request_id)
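
The reason for the cardinality advice above is multiplicative: every combination of label values becomes its own set of series, and a classic histogram multiplies that again by its buckets. A rough upper-bound calculation, as a standard-library sketch (`histogramSeries` is a hypothetical helper, and the label counts are taken from the exporter example: one method, three endpoints, two status codes):

```go
package main

import "fmt"

// histogramSeries estimates the maximum number of time series one classic
// histogram can produce: for every label combination there is one series
// per finite bucket, plus the +Inf bucket, _sum and _count.
func histogramSeries(buckets int, labelValues ...int) int {
	combinations := 1
	for _, n := range labelValues {
		combinations *= n
	}
	return combinations * (buckets + 3)
}

func main() {
	// 8 buckets; 1 method, 3 endpoints, 2 status codes
	fmt.Println(histogramSeries(8, 1, 3, 2))

	// The same histogram with an extra user_id label of 10,000 values
	fmt.Println(histogramSeries(8, 1, 3, 2, 10000))
}
```

The first call yields 66 series, the second 660,000. This is an upper bound, since only combinations that are actually observed create series.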

3. 指标命名

✅ 使用描述性名称
✅ 包含单位后缀(_seconds, _bytes)
✅ 遵循 Prometheus 命名规范

4. 性能优化

✅ 使用 HistogramVec 而非多个 Histogram
✅ 缓存标签值,避免重复分配
✅ 在高频路径使用 Observe() 而非 ObserveWithExemplar()

扩展示例¶

添加 Counter:

var httpRequestsTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total number of HTTP requests",
    },
    []string{"method", "endpoint", "status"},
)

// 在 apiHandler 中递增
httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(statusCode)).Inc()

添加 Summary:

var httpRequestDurationSummary = prometheus.NewSummaryVec(
    prometheus.SummaryOpts{
        Name: "http_request_duration_summary_seconds",
        Help: "HTTP request latency summary",
        Objectives: map[float64]float64{
            0.5: 0.05,   // P50, 误差 ±5%
            0.9: 0.01,   // P90, 误差 ±1%
            0.99: 0.001, // P99, 误差 ±0.1%
        },
    },
    []string{"method", "endpoint"},
)

// 记录观测值
httpRequestDurationSummary.WithLabelValues("GET", "/api/users").Observe(0.123)

故障排查¶

问题 1: 指标未暴露

# 检查注册
prometheus.MustRegister(httpRequestDuration)

# 检查路由
http.Handle("/metrics", promhttp.Handler())

问题 2: Histogram 无数据

# 确保调用了 Observe()
httpRequestDuration.WithLabelValues(...).Observe(duration)

# 检查桶配置是否合理
Buckets: []float64{0.001, 0.01, 0.1, 1, 10}

问题 3: 内存占用高

# 减少标签基数
# 减少桶数量
# 使用标签缓存

总结¶

这个示例展示了:

✅ Gauge 指标 - 使用自定义 Collector
✅ Histogram 指标 - 记录延迟分布
✅ 优雅关闭 - 处理信号和超时
✅ 多端点路由 - 健康检查、指标、API

完整代码可作为生产环境 Exporter 的起点,根据实际需求扩展更多指标类型和业务逻辑。

附录: Pushgateway Golang 示例¶

概述¶

Pushgateway 是 Prometheus 生态中用于接收短期任务推送指标的中间组件。本示例展示如何使用 Golang 客户端推送指标到 Pushgateway。

适用场景¶

✅ 适合使用 Pushgateway:

批处理任务(Batch Jobs)
短期运行的脚本
定时任务(Cron Jobs)
无法被 Prometheus 抓取的服务

❌ 不适合使用 Pushgateway:

长期运行的服务(应使用 Pull 模式)
高频更新的指标
需要自动服务发现的场景

完整代码¶

package main

import (
 "log"
 "time"

 "github.com/prometheus/client_golang/prometheus"
 "github.com/prometheus/client_golang/prometheus/push"
)

// MetricCollector 封装指标收集器
type MetricCollector struct {
 counter   *prometheus.CounterVec
 gauge     *prometheus.GaugeVec
 histogram *prometheus.HistogramVec
 registry  *prometheus.Registry
}

func NewMetricCollector() *MetricCollector {
 // 创建一个带标签的计数器
 counter := prometheus.NewCounterVec(
  prometheus.CounterOpts{
   Name: "example_counter_total",
   Help: "Example counter metric",
  },
  []string{"service", "endpoint", "status"},
 )

 // 创建一个带标签的 gauge
 gauge := prometheus.NewGaugeVec(
  prometheus.GaugeOpts{
   Name: "example_gauge_value",
   Help: "Example gauge metric",
  },
  []string{"service", "region", "instance_type"},
 )

 // 创建一个带标签的直方图
 histogram := prometheus.NewHistogramVec(
  prometheus.HistogramOpts{
   Name:    "example_histogram_seconds",
   Help:    "Example histogram metric",
   Buckets: prometheus.LinearBuckets(0, 0.1, 10),
  },
  []string{"service", "operation", "status"},
 )

 // 创建 Registry
 registry := prometheus.NewRegistry()
 registry.MustRegister(counter)
 registry.MustRegister(gauge)
 registry.MustRegister(histogram)

 return &MetricCollector{
  counter:   counter,
  gauge:     gauge,
  histogram: histogram,
  registry:  registry,
 }
}

func (mc *MetricCollector) RecordMetrics() {
 // 记录计数器
 mc.counter.With(prometheus.Labels{
  "service":  "api",
  "endpoint": "/users",
  "status":   "success",
 }).Inc()

 mc.counter.With(prometheus.Labels{
  "service":  "api",
  "endpoint": "/orders",
  "status":   "error",
 }).Inc()

 // 记录 Gauge
 mc.gauge.With(prometheus.Labels{
  "service":       "backend",
  "region":        "us-east-1",
  "instance_type": "t2.micro",
 }).Set(42.0)

 mc.gauge.With(prometheus.Labels{
  "service":       "backend",
  "region":        "eu-west-1",
  "instance_type": "t2.small",
 }).Set(56.0)

 // 记录 Histogram
 mc.histogram.With(prometheus.Labels{
  "service":   "database",
  "operation": "query",
  "status":    "success",
 }).Observe(0.23)

 mc.histogram.With(prometheus.Labels{
  "service":   "database",
  "operation": "insert",
  "status":    "success",
 }).Observe(0.15)
}

func (mc *MetricCollector) PushMetrics(pushGatewayURL string, jobName string) error {
 pusher := push.New(pushGatewayURL, jobName).
  Gatherer(mc.registry)

 // 添加分组标签
 pusher.Grouping("instance", "example_instance")
 pusher.Grouping("environment", "production")

 // 推送指标
 return pusher.Push()
}

// 使用 Basic Auth 推送
func (mc *MetricCollector) PushMetricsWithBasicAuth(
 pushGatewayURL, jobName, username, password string,
) error {
 pusher := push.New(pushGatewayURL, jobName).
  Gatherer(mc.registry).
  BasicAuth(username, password)

 pusher.Grouping("instance", "example_instance")
 pusher.Grouping("environment", "production")

 return pusher.Push()
}

// 添加指标（保留旧数据）
func (mc *MetricCollector) AddMetrics(pushGatewayURL, jobName string) error {
 pusher := push.New(pushGatewayURL, jobName).
  Gatherer(mc.registry)

 pusher.Grouping("instance", "example_instance")

 // 使用 Add() 而非 Push()
 return pusher.Add()
}

// 删除指标
func (mc *MetricCollector) DeleteMetrics(pushGatewayURL, jobName string) error {
 pusher := push.New(pushGatewayURL, jobName)

 pusher.Grouping("instance", "example_instance")
 pusher.Grouping("environment", "production")

 return pusher.Delete()
}

func main() {
 collector := NewMetricCollector()
 pushGatewayURL := "http://localhost:9091"
 jobName := "batch_job"

 // 记录并推送指标
 collector.RecordMetrics()
 err := collector.PushMetrics(pushGatewayURL, jobName)
 if err != nil {
  log.Printf("Could not push to Pushgateway: %v", err)
 } else {
  log.Println("Successfully pushed metrics to Pushgateway")
 }

 // 30秒后删除指标
 time.Sleep(30 * time.Second)
 err = collector.DeleteMetrics(pushGatewayURL, jobName)
 if err != nil {
  log.Printf("Could not delete metrics: %v", err)
 } else {
  log.Println("Successfully deleted metrics from Pushgateway")
 }
}

使用说明¶

1. 启动 Pushgateway¶

Docker 方式:

docker run -d -p 9091:9091 prom/pushgateway

二进制方式:

# 下载
wget https://github.com/prometheus/pushgateway/releases/download/v1.6.2/pushgateway-1.6.2.linux-amd64.tar.gz
tar xvfz pushgateway-1.6.2.linux-amd64.tar.gz
cd pushgateway-1.6.2.linux-amd64

# 启动
./pushgateway

访问 http://localhost:9091 查看 Web UI。

2. 安装 Go 依赖¶

go mod init pushgateway-demo
go get github.com/prometheus/client_golang/prometheus
go get github.com/prometheus/client_golang/prometheus/push

3. 运行示例¶

go run main.go

输出:

Successfully pushed metrics to Pushgateway
Successfully deleted metrics from Pushgateway

4. 查看推送的指标¶

访问 http://localhost:9091/metrics:

# Counter
example_counter_total{environment="production",instance="example_instance",job="batch_job",service="api",endpoint="/users",status="success"} 1
example_counter_total{environment="production",instance="example_instance",job="batch_job",service="api",endpoint="/orders",status="error"} 1

# Gauge
example_gauge_value{environment="production",instance="example_instance",job="batch_job",service="backend",region="us-east-1",instance_type="t2.micro"} 42

# Histogram
example_histogram_seconds_bucket{environment="production",instance="example_instance",job="batch_job",service="database",operation="query",status="success",le="0.1"} 0
example_histogram_seconds_bucket{environment="production",instance="example_instance",job="batch_job",service="database",operation="query",status="success",le="0.2"} 0
example_histogram_seconds_bucket{environment="production",instance="example_instance",job="batch_job",service="database",operation="query",status="success",le="0.3"} 1
example_histogram_seconds_sum{environment="production",instance="example_instance",job="batch_job",service="database",operation="query",status="success"} 0.23
example_histogram_seconds_count{environment="production",instance="example_instance",job="batch_job",service="database",operation="query",status="success"} 1

核心概念¶

1. Push vs Add vs Delete¶

// Push - 覆盖指定 job/grouping 的所有指标
pusher.Push()

// Add - 添加指标，保留已有指标
pusher.Add()

// Delete - 删除指定 job/grouping 的所有指标
pusher.Delete()

区别示例:

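// The snippets assume two pushers for the same job/grouping key:
// gaugePusher gathers only the gauge, counterPusher only the counter.
// A single pusher whose Registry holds both metrics sends both every time.

// First push
gauge.Set(10)
gaugePusher.Push()    // Pushgateway: gauge = 10

// Second push with Push(): the whole group is replaced
counter.Inc()
counterPusher.Push()  // Pushgateway: gauge removed, counter = 1

// Same second push with Add(): only metrics with the same name are replaced
counter.Inc()
counterPusher.Add()   // Pushgateway: gauge = 10, counter = 1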
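
The difference between the two operations can be modeled without a running Pushgateway. The sketch below is a toy, standard-library model of one job/grouping key: `push` behaves like an HTTP PUT and replaces the whole group, while `add` behaves like an HTTP POST and only replaces metrics with the same name. The `group` type and both functions are illustrative, not part of the `push` package.

```go
package main

import (
	"fmt"
	"sort"
)

// group models the metrics stored under one job/grouping key,
// keyed by metric name.
type group map[string]float64

// push mimics Push(): everything previously stored in the group is dropped.
func push(_ group, metrics map[string]float64) group {
	next := group{}
	for name, value := range metrics {
		next[name] = value
	}
	return next
}

// add mimics Add(): metrics with the same name are replaced, others are kept.
func add(current group, metrics map[string]float64) group {
	next := group{}
	for name, value := range current {
		next[name] = value
	}
	for name, value := range metrics {
		next[name] = value
	}
	return next
}

// names returns the metric names in the group, sorted for stable output.
func names(g group) []string {
	out := make([]string, 0, len(g))
	for name := range g {
		out = append(out, name)
	}
	sort.Strings(out)
	return out
}

func main() {
	gauge := map[string]float64{"example_gauge_value": 10}
	counter := map[string]float64{"example_counter_total": 1}

	replaced := push(push(nil, gauge), counter)
	fmt.Println("Push then Push:", names(replaced))

	merged := add(push(nil, gauge), counter)
	fmt.Println("Push then Add: ", names(merged))
}
```

In the first sequence only the counter survives; in the second both metrics remain in the group.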

2. 分组标签 (Grouping Labels)¶

分组标签用于区分不同的指标来源:

pusher.Grouping("instance", "server-1")
pusher.Grouping("environment", "production")

URL 映射:

http://localhost:9091/metrics/job/batch_job/instance/server-1/environment/production
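
The mapping from job name and grouping labels to the URL path can be sketched with the standard library alone. `groupingURL` and the `label` type are hypothetical helpers for illustration; the real `push` package builds this URL internally and also handles cases this sketch ignores, such as label values that contain `/` or are empty.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// label is one grouping label. A slice is used instead of a map so that
// the order of path segments is predictable in this example.
type label struct {
	name  string
	value string
}

// groupingURL builds <base>/metrics/job/<job>/<name>/<value>/... for simple
// label values. It is a simplified sketch, not the push package's encoder.
func groupingURL(base, job string, grouping []label) string {
	parts := []string{"metrics", "job", url.PathEscape(job)}
	for _, l := range grouping {
		parts = append(parts, l.name, url.PathEscape(l.value))
	}
	return strings.TrimRight(base, "/") + "/" + strings.Join(parts, "/")
}

func main() {
	fmt.Println(groupingURL("http://localhost:9091", "batch_job", []label{
		{"instance", "server-1"},
		{"environment", "production"},
	}))
}
```

Every distinct combination of job and grouping labels is a separate group in Pushgateway, which is why `Push()` and `Delete()` must use the same grouping labels to address the same data.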

最佳实践:

✅ 使用有意义的分组标签(instance, environment, region)
✅ 保持分组标签一致
❌ 避免高基数分组标签(user_id, request_id)

3. Registry 使用¶

为什么使用自定义 Registry?

// ❌ 不好 - 使用默认 Registry
prometheus.MustRegister(gauge)
pusher := push.New(url, job).Gatherer(prometheus.DefaultGatherer)
// 会推送所有默认的 Go 运行时指标

// ✅ 好 - 使用自定义 Registry
registry := prometheus.NewRegistry()
registry.MustRegister(gauge)
pusher := push.New(url, job).Gatherer(registry)
// 只推送显式注册的指标

实际应用场景¶

场景 1: Cron Job 批处理¶

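// backup_job.go
// performBackup is a placeholder for the real backup routine.
func runBackup() {
 collector := NewMetricCollector()

 // Run the backup
 start := time.Now()
 err := performBackup()
 duration := time.Since(start).Seconds()

 // Record metrics. With() panics unless it receives exactly the label
 // names a vector was declared with, so the counter gets all three labels
 // and the duration uses its own gauge instead of collector.gauge.
 status := "success"
 if err != nil {
  status = "error"
 } else {
  durationGauge := prometheus.NewGauge(prometheus.GaugeOpts{
   Name: "backup_duration_seconds",
   Help: "Duration of the last successful backup in seconds",
  })
  durationGauge.Set(duration)
  collector.registry.MustRegister(durationGauge)
 }

 collector.counter.With(prometheus.Labels{
  "service":  "backup",
  "endpoint": "nightly", // placeholder value for the declared label
  "status":   status,
 }).Inc()

 // Push to Pushgateway
 if pushErr := collector.PushMetrics("http://pushgateway:9091", "backup_job"); pushErr != nil {
  log.Printf("Could not push to Pushgateway: %v", pushErr)
 }
}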

Crontab 配置:

0 2 * * * /usr/local/bin/backup_job

场景 2: 数据处理脚本¶

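// processRecord is a placeholder for the real per-record logic.
func processData(batchID string, records []string) error {
 // Dedicated Registry and gauge: With() panics unless it receives exactly
 // the declared label names, so collector.gauge (declared with
 // service/region/instance_type) cannot carry batch_id/status.
 registry := prometheus.NewRegistry()
 recordsGauge := prometheus.NewGaugeVec(
  prometheus.GaugeOpts{
   Name: "example_gauge_value",
   Help: "Example gauge metric",
  },
  []string{"batch_id", "status"}, // keep the set of batch IDs small
 )
 registry.MustRegister(recordsGauge)

 // Count processed and failed records
 recordsProcessed := 0
 recordsFailed := 0

 for _, record := range records {
  if processRecord(record) {
   recordsProcessed++
  } else {
   recordsFailed++
  }
 }

 // Record metrics
 recordsGauge.WithLabelValues(batchID, "processed").Set(float64(recordsProcessed))
 recordsGauge.WithLabelValues(batchID, "failed").Set(float64(recordsFailed))

 // Push
 return push.New("http://pushgateway:9091", "data_processor").
  Gatherer(registry).
  Push()
}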

场景 3: 定期推送 (长期运行脚本)¶

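// getQueueSize is a placeholder for the real queue lookup.
// Note: the guidance earlier in this appendix recommends Pull mode for
// long-running services; use periodic pushes only when scraping is not
// an option.
func monitorWorker() {
 collector := NewMetricCollector()

 // Dedicated gauge: collector.gauge is declared with
 // service/region/instance_type, so With() on a "queue" label would panic.
 queueGauge := prometheus.NewGaugeVec(
  prometheus.GaugeOpts{
   Name: "worker_queue_size",
   Help: "Number of items waiting in the queue",
  },
  []string{"queue"},
 )
 collector.registry.MustRegister(queueGauge)

 ticker := time.NewTicker(30 * time.Second)
 defer ticker.Stop()

 for range ticker.C {
  // Collect metrics
  queueGauge.WithLabelValues("tasks").Set(float64(getQueueSize()))

  // Push
  if err := collector.PushMetrics("http://pushgateway:9091", "worker"); err != nil {
   log.Printf("Push failed: %v", err)
  }
 }
}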

Prometheus 配置¶

配置 Prometheus 抓取 Pushgateway:

scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true  # 重要: 保留推送的 job 标签
    static_configs:
      - targets: ['localhost:9091']

关键配置说明:

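honor_labels: true - keeps the job and instance labels that were pushed to Pushgateway
Without it, Prometheus attaches its own job and instance labels for the Pushgateway target and renames the pushed ones to exported_job and exported_instance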
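
The effect of `honor_labels` on conflicting labels can be illustrated with a small standard-library model. `mergeLabels` is a hypothetical function that mimics the documented conflict rule (scraped labels win when `honor_labels` is true; otherwise the scraped label is renamed with an `exported_` prefix). It is not Prometheus code.

```go
package main

import "fmt"

// mergeLabels combines labels found in the scraped data with the labels
// Prometheus attaches for the scrape target.
func mergeLabels(scraped, target map[string]string, honorLabels bool) map[string]string {
	out := map[string]string{}
	for name, value := range scraped {
		out[name] = value
	}
	for name, value := range target {
		if _, conflict := scraped[name]; !conflict {
			out[name] = value
			continue
		}
		if honorLabels {
			continue // keep the label that came with the scraped data
		}
		// Target label wins; the scraped one is preserved under a new name
		out["exported_"+name] = scraped[name]
		out[name] = value
	}
	return out
}

func main() {
	scraped := map[string]string{"job": "batch_job", "instance": "server-1"}
	target := map[string]string{"job": "pushgateway", "instance": "localhost:9091"}

	fmt.Println(mergeLabels(scraped, target, true))
	fmt.Println(mergeLabels(scraped, target, false))
}
```

With `honor_labels: false`, queries such as `push_time_seconds{job="batch_job"}` from this appendix would not match, because the pushed job name ends up in `exported_job`.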

PromQL 查询示例¶

# 查询批处理任务成功率
sum by (service) (example_counter_total{status="success"})
/
sum by (service) (example_counter_total)

# 查询最近一次推送时间
push_time_seconds{job="batch_job"}

# 查询各批处理任务的处理记录数
example_gauge_value{status="processed"}

# 告警: 批处理任务失败
example_counter_total{status="error"} > 0

# 告警: Pushgateway 长时间未收到推送
time() - push_time_seconds{job="batch_job"} > 3600

最佳实践¶

1. 使用 Add 而非 Push (多指标场景)¶

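// Applies when different pushers (with different Registries) share one
// job/grouping key; gaugePusher and counterPusher are such pushers.

// ❌ Bad - the second Push replaces the whole group
gauge.Set(10)
gaugePusher.Push()

counter.Inc()
counterPusher.Push()  // gauge is removed!

// ✅ Good - Add only replaces metrics with the same name
gauge.Set(10)
gaugePusher.Push()

counter.Inc()
counterPusher.Add()   // gauge and counter are both kept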

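2. Delete metrics only when they are no longer needed¶

// Pushed metrics stay in Pushgateway until they are deleted, so Prometheus
// can scrape them after the job has exited. Deleting right after Push()
// (for example in a defer) removes them before the next scrape sees them.
func runJob() {
 collector := NewMetricCollector()

 // Run the job...
 collector.RecordMetrics()
 if err := collector.PushMetrics(pushGatewayURL, jobName); err != nil {
  log.Printf("Could not push to Pushgateway: %v", err)
 }
}

// Call this separately, for example when the job is decommissioned
func cleanupJob() {
 if err := NewMetricCollector().DeleteMetrics(pushGatewayURL, jobName); err != nil {
  log.Printf("Could not delete metrics: %v", err)
 }
}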

3. 添加时间戳指标¶

// Pushgateway 自动添加的指标
push_time_seconds{job="batch_job",instance="server-1"} 1700000000
push_failure_time_seconds{job="batch_job",instance="server-1"} 0

// 用于告警
time() - push_time_seconds{job="batch_job"} > 3600

4. 错误处理和重试¶

func PushWithRetry(collector *MetricCollector, url, job string) error {
 maxRetries := 3
 var err error

 for i := 0; i < maxRetries; i++ {
  err = collector.PushMetrics(url, job)
  if err == nil {
   return nil
  }

  log.Printf("Push attempt %d failed: %v", i+1, err)
  time.Sleep(time.Duration(i+1) * time.Second)
 }

 return fmt.Errorf("push failed after %d retries: %w", maxRetries, err)
}

安全性¶

1. 使用 Basic Auth¶

pusher := push.New(url, job).
 BasicAuth("username", "password")

Pushgateway startup flag (authentication is configured through the web configuration file):

./pushgateway \
  --web.config.file=/etc/pushgateway/auth.yml

auth.yml:

basic_auth_users:
  admin: $2y$10$... # bcrypt hash

2. TLS 加密¶

import (
 "crypto/tls"
 "net/http"
)

tlsConfig := &tls.Config{
 InsecureSkipVerify: false,
}

client := &http.Client{
 Transport: &http.Transport{
  TLSClientConfig: tlsConfig,
 },
}

pusher := push.New("https://pushgateway:9091", job).
 Client(client)

常见问题¶

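Q1: Why do metrics stay in Pushgateway after the job has finished?

A: Pushgateway keeps pushed metrics until they are deleted, so that Prometheus can still scrape them after the job exits. Call Delete() once the group is no longer needed (after Prometheus has had a chance to scrape it), or restart a Pushgateway that runs without --persistence.file:

// Remove the group once it is no longer needed
err := collector.DeleteMetrics(url, job)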

Q2: 多个实例推送到同一 job?

A: 使用不同的分组标签区分:

hostname, _ := os.Hostname()
pusher.Grouping("instance", hostname)

Q3: Push 失败怎么办?

A: 实现重试机制,记录到本地日志:

err := collector.PushMetrics(url, job)
if err != nil {
 // 记录到本地文件
 logMetrics(collector.registry)
}

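Q4: Is data lost when Pushgateway restarts?

A: By default Pushgateway keeps pushed metrics in memory only, so they are lost on restart. Options:

Enable persistence with --persistence.file
Have the batch jobs push their metrics again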

./pushgateway --persistence.file=/data/metrics.db

监控 Pushgateway¶

关键指标:

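# Size of incoming push request bodies, in bytes (summary)
pushgateway_http_push_size_bytes

# HTTP requests handled by Pushgateway
pushgateway_http_requests_total

# Rate of requests that did not return a 2xx status
rate(pushgateway_http_requests_total{code!~"2.."}[5m])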

告警规则:

groups:
  - name: pushgateway
    rules:
      # Pushgateway 宕机
      - alert: PushgatewayDown
        expr: up{job="pushgateway"} == 0
        for: 5m

      # 批处理任务长时间未推送
      - alert: BatchJobStale
        expr: time() - push_time_seconds{job="batch_job"} > 86400
        annotations:
          summary: "批处理任务超过 24 小时未运行"

总结¶

Pushgateway 适用于:

✅ 批处理任务、Cron Jobs
✅ 短期运行的脚本
✅ 无法被 Prometheus 抓取的服务

关键要点:

使用自定义 Registry 避免推送不必要的指标
使用 Add() 保留已有指标,使用 Push() 覆盖
任务结束时使用 Delete() 清理指标
配置 Prometheus 时设置 honor_labels: true
添加错误处理和重试机制

不要用于:

❌ 长期运行的服务(使用 Pull 模式)
❌ 高频更新的指标
❌ 需要服务发现的场景

完整代码可作为生产环境批处理任务监控的起点! 🎯

附录: Textfile Collector 示例¶

概述¶

Textfile Collector 是 Node Exporter 的一个功能，允许通过读取 .prom 格式的文本文件来暴露自定义指标。这对于以下场景非常有用：

批处理任务生成指标
脚本定期更新指标
无法直接集成 Prometheus 客户端的程序

本示例展示如何使用 expfmt 包将指标写入 textfile 格式。

适用场景¶

✅ 适合使用 Textfile Collector:

Cron 定时任务生成指标
Shell/Python 脚本导出指标
第三方工具输出转换为 Prometheus 格式
静态指标（配额、阈值等）

✅ 相比 Pushgateway 的优势:

不需要额外组件（复用 Node Exporter）
自动生命周期管理（文件删除 = 指标消失）
更简单的部署
本地文件系统，无网络依赖

完整代码¶

package main

import (
 "fmt"
 "log"
 "os"
 "time"

 "github.com/prometheus/client_golang/prometheus"
 "github.com/prometheus/common/expfmt"
)

// MetricCollector 指标收集器
type MetricCollector struct {
 registry *prometheus.Registry
 gauge    *prometheus.GaugeVec
 counter  *prometheus.CounterVec
}

func NewMetricCollector() *MetricCollector {
 // 创建自定义 Registry
 registry := prometheus.NewRegistry()

 // 创建 Gauge 指标
 gauge := prometheus.NewGaugeVec(
  prometheus.GaugeOpts{
   Name: "custom_metric_gauge",
   Help: "Custom gauge metric for textfile collector",
  },
  []string{"service", "region"},
 )

 // 创建 Counter 指标
 counter := prometheus.NewCounterVec(
  prometheus.CounterOpts{
   Name: "custom_metric_counter_total",
   Help: "Custom counter metric for textfile collector",
  },
  []string{"service", "status"},
 )

 // 注册指标
 registry.MustRegister(gauge)
 registry.MustRegister(counter)

 return &MetricCollector{
  registry: registry,
  gauge:    gauge,
  counter:  counter,
 }
}

// RecordMetrics 记录指标
func (mc *MetricCollector) RecordMetrics() {
 mc.gauge.With(prometheus.Labels{
  "service": "api",
  "region":  "us-east-1",
 }).Set(42.5)

 mc.gauge.With(prometheus.Labels{
  "service": "database",
  "region":  "eu-west-1",
 }).Set(78.9)

 mc.counter.With(prometheus.Labels{
  "service": "api",
  "status":  "success",
 }).Inc()

 mc.counter.With(prometheus.Labels{
  "service": "api",
  "status":  "error",
 }).Add(3)
}

// WriteToTextfile 将指标写入 textfile
func (mc *MetricCollector) WriteToTextfile(filename string) error {
 file, err := os.Create(filename)
 if err != nil {
  return fmt.Errorf("failed to create file: %w", err)
 }
 defer file.Close()

 // 从 Registry 收集指标
 metricFamilies, err := mc.registry.Gather()
 if err != nil {
  return fmt.Errorf("failed to gather metrics: %w", err)
 }

 // 使用 expfmt 将指标写入文件
 for _, mf := range metricFamilies {
  if _, err := expfmt.MetricFamilyToText(file, mf); err != nil {
   return fmt.Errorf("failed to write metric family: %w", err)
  }
 }

 return nil
}

// WriteToTextfileAtomic 原子写入（推荐）
func (mc *MetricCollector) WriteToTextfileAtomic(filename string) error {
 // 创建临时文件
 tmpFile := filename + ".tmp." + fmt.Sprintf("%d", time.Now().UnixNano())

 file, err := os.Create(tmpFile)
 if err != nil {
  return fmt.Errorf("failed to create temp file: %w", err)
 }

 // 收集并写入指标
 metricFamilies, err := mc.registry.Gather()
 if err != nil {
  file.Close()
  os.Remove(tmpFile)
  return fmt.Errorf("failed to gather metrics: %w", err)
 }

 for _, mf := range metricFamilies {
  if _, err := expfmt.MetricFamilyToText(file, mf); err != nil {
   file.Close()
   os.Remove(tmpFile)
   return fmt.Errorf("failed to write metric family: %w", err)
  }
 }

 if err := file.Close(); err != nil {
  os.Remove(tmpFile)
  return fmt.Errorf("failed to close temp file: %w", err)
 }

 // 原子重命名
 if err := os.Rename(tmpFile, filename); err != nil {
  os.Remove(tmpFile)
  return fmt.Errorf("failed to rename temp file: %w", err)
 }

 return nil
}

func main() {
 collector := NewMetricCollector()
 collector.RecordMetrics()

 // 写入 textfile
 outputFile := "/var/lib/node_exporter/textfile_collector/custom_metrics.prom"
 if err := collector.WriteToTextfileAtomic(outputFile); err != nil {
  log.Fatalf("Failed to write textfile: %v", err)
 }

 log.Printf("Metrics written to: %s", outputFile)
}

使用说明¶

1. 安装 Node Exporter¶

二进制安装:

wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

启动 Node Exporter (启用 textfile collector):

# 创建 textfile 目录
mkdir -p /var/lib/node_exporter/textfile_collector

# 启动 Node Exporter
./node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector

2. 安装 Go 依赖¶

go mod init textfile-demo
go get github.com/prometheus/client_golang/prometheus
go get github.com/prometheus/common/expfmt

3. 运行示例¶

# 修改输出路径到 textfile 目录
go run main.go

4. 生成的 .prom 文件内容¶

查看生成的文件:

cat /var/lib/node_exporter/textfile_collector/custom_metrics.prom

输出示例:

# HELP custom_metric_counter_total Custom counter metric for textfile collector
# TYPE custom_metric_counter_total counter
custom_metric_counter_total{service="api",status="error"} 3
custom_metric_counter_total{service="api",status="success"} 1
# HELP custom_metric_gauge Custom gauge metric for textfile collector
# TYPE custom_metric_gauge gauge
custom_metric_gauge{region="eu-west-1",service="database"} 78.9
custom_metric_gauge{region="us-east-1",service="api"} 42.5
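
The file above is plain text in the Prometheus exposition format, so a program that cannot pull in the client library can produce it with the standard library alone. The following sketch formats one sample line and writes the file atomically. `formatSample` and `writeAtomic` are hypothetical helpers, and the example writes to a temporary directory rather than the real textfile directory.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strconv"
	"strings"
)

// labelEscaper escapes the characters that are special inside label values.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// formatSample renders one sample line with labels sorted by name.
func formatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+`="`+labelEscaper.Replace(labels[k])+`"`)
	}
	line := name
	if len(pairs) > 0 {
		line += "{" + strings.Join(pairs, ",") + "}"
	}
	return line + " " + strconv.FormatFloat(value, 'g', -1, 64)
}

// writeAtomic writes the lines to a temporary file in the same directory
// and renames it into place, so a reader never sees a half-written file.
func writeAtomic(path string, lines []string) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp.*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless once the rename has succeeded
	if _, err := tmp.WriteString(strings.Join(lines, "\n") + "\n"); err != nil {
		tmp.Close()
		return err
	}
	// CreateTemp uses mode 0600; widen it so a different user (such as the
	// account running Node Exporter) can read the file
	if err := tmp.Chmod(0o644); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func run() error {
	dir, err := os.MkdirTemp("", "textfile_collector")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "custom_metrics.prom")
	lines := []string{
		"# HELP custom_metric_gauge Custom gauge metric for textfile collector",
		"# TYPE custom_metric_gauge gauge",
		formatSample("custom_metric_gauge",
			map[string]string{"service": "api", "region": "us-east-1"}, 42.5),
	}
	if err := writeAtomic(path, lines); err != nil {
		return err
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	fmt.Print(string(data))
	return nil
}

func main() {
	if err := run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

The temporary file name does not end in `.prom`, so the textfile collector ignores it until the rename makes the complete file visible.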

5. 访问指标¶

访问 Node Exporter: http://localhost:9100/metrics

搜索 custom_metric:

custom_metric_counter_total{service="api",status="error"} 3
custom_metric_counter_total{service="api",status="success"} 1
custom_metric_gauge{region="eu-west-1",service="database"} 78.9
custom_metric_gauge{region="us-east-1",service="api"} 42.5

核心 API 详解¶

1. registry.Gather()¶

从 Registry 收集所有指标:

metricFamilies, err := registry.Gather()
if err != nil {
    return err
}

// metricFamilies 是 []*dto.MetricFamily 类型
// 每个 MetricFamily 包含:
// - Name: 指标名称
// - Help: 帮助信息
// - Type: 指标类型 (COUNTER, GAUGE, HISTOGRAM, SUMMARY)
// - Metric: 具体的指标值和标签

2. expfmt.MetricFamilyToText()¶

将 MetricFamily 写入文本格式:

for _, mf := range metricFamilies {
    // 写入 Prometheus text format
    _, err := expfmt.MetricFamilyToText(file, mf)
    if err != nil {
        return err
    }
}

支持的格式:

// Text format (默认,推荐)
expfmt.MetricFamilyToText(writer, metricFamily)

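// OpenMetrics format. Note: the textfile collector reads the text format
// above, and the name of this format constant differs between
// prometheus/common releases.
encoder := expfmt.NewEncoder(writer, expfmt.FmtOpenMetrics)
encoder.Encode(metricFamily)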

实际应用场景¶

场景 1: 备份任务指标¶

// backup_monitor.go
package main

import (
 "log"
 "os"
 "time"

 "github.com/prometheus/client_golang/prometheus"
 "github.com/prometheus/common/expfmt"
)

func recordBackupMetrics(backupSize int64, duration time.Duration, success bool) error {
 registry := prometheus.NewRegistry()

 // 备份大小
 sizeGauge := prometheus.NewGauge(prometheus.GaugeOpts{
  Name: "backup_size_bytes",
  Help: "Size of the backup in bytes",
 })
 sizeGauge.Set(float64(backupSize))
 registry.MustRegister(sizeGauge)

 // Backup duration
 durationGauge := prometheus.NewGauge(prometheus.GaugeOpts{
  Name: "backup_duration_seconds",
  Help: "Duration of the backup in seconds",
 })
 durationGauge.Set(duration.Seconds())
 registry.MustRegister(durationGauge)

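 // Time of the last successful backup.
 // Caveat: on failure the gauge keeps its zero value, and because the file is
 // rewritten on every run, the previously recorded success time is lost.
 timestampGauge := prometheus.NewGauge(prometheus.GaugeOpts{
  Name: "backup_last_success_timestamp",
  Help: "Timestamp of the last successful backup",
 })
 if success {
  timestampGauge.Set(float64(time.Now().Unix()))
 }
 registry.MustRegister(timestampGauge)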

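 // Write the textfile
 return writeMetrics(registry, "/var/lib/node_exporter/textfile_collector/backup.prom")
}

func writeMetrics(registry *prometheus.Registry, filename string) error {
 // Write to a temporary file first, then rename it into place atomically.
 tmpFile := filename + ".tmp"
 file, err := os.Create(tmpFile)
 if err != nil {
  return err
 }

 metricFamilies, err := registry.Gather()
 if err != nil {
  file.Close()
  os.Remove(tmpFile)
  return err
 }

 for _, mf := range metricFamilies {
  if _, err := expfmt.MetricFamilyToText(file, mf); err != nil {
   file.Close()
   os.Remove(tmpFile)
   return err
  }
 }

 // Close before renaming so the file is complete when it becomes visible.
 if err := file.Close(); err != nil {
  os.Remove(tmpFile)
  return err
 }

 return os.Rename(tmpFile, filename)
}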

func main() {
 // Run the backup (hard-coded sample values for illustration)
 backupSize := int64(1024 * 1024 * 500) // 500 MiB
 duration := 120 * time.Second
 success := true

 if err := recordBackupMetrics(backupSize, duration, success); err != nil {
  log.Fatalf("Failed to record metrics: %v", err)
 }

 log.Println("Backup metrics recorded successfully")
}

Crontab:

0 2 * * * /usr/local/bin/backup.sh && /usr/local/bin/backup_monitor

Scenario 2: Disk quota monitoring¶

// disk_quota.go
// Note: writeMetrics is the helper from backup_monitor.go (scenario 1).
// Copy it into this program, together with its "os" and expfmt imports, to build.
package main

import (
 "log"
 "os/exec"

 "github.com/prometheus/client_golang/prometheus"
)

type QuotaInfo struct {
 User       string
 UsedBytes  int64
 LimitBytes int64
}

func getQuotaInfo() ([]QuotaInfo, error) {
 // Run the quota command
 cmd := exec.Command("quota", "-v")
 output, err := cmd.Output()
 if err != nil {
  return nil, err
 }
 _ = output // parsing is omitted in this example

 // Hard-coded sample values standing in for the parsed output
 quotas := []QuotaInfo{
  {User: "alice", UsedBytes: 5368709120, LimitBytes: 10737418240},
  {User: "bob", UsedBytes: 8589934592, LimitBytes: 10737418240},
 }

 return quotas, nil
}

func recordQuotaMetrics() error {
 registry := prometheus.NewRegistry()

 usedGauge := prometheus.NewGaugeVec(
  prometheus.GaugeOpts{
   Name: "disk_quota_used_bytes",
   Help: "Disk quota used in bytes",
  },
  []string{"user"},
 )

 limitGauge := prometheus.NewGaugeVec(
  prometheus.GaugeOpts{
   Name: "disk_quota_limit_bytes",
   Help: "Disk quota limit in bytes",
  },
  []string{"user"},
 )

 registry.MustRegister(usedGauge)
 registry.MustRegister(limitGauge)

 quotas, err := getQuotaInfo()
 if err != nil {
  return err
 }

 // One time series per user for each gauge
 for _, q := range quotas {
  usedGauge.WithLabelValues(q.User).Set(float64(q.UsedBytes))
  limitGauge.WithLabelValues(q.User).Set(float64(q.LimitBytes))
 }

 return writeMetrics(registry, "/var/lib/node_exporter/textfile_collector/disk_quota.prom")
}

func main() {
 if err := recordQuotaMetrics(); err != nil {
  log.Fatalf("Failed to record quota metrics: %v", err)
 }
}

Crontab:

*/5 * * * * /usr/local/bin/disk_quota_monitor

Scenario 3: Generating metrics from a shell script¶

Bash version:

#!/bin/bash
# generate_metrics.sh

TEXTFILE_DIR="/var/lib/node_exporter/textfile_collector"
METRIC_FILE="$TEXTFILE_DIR/custom.prom"
TMP_FILE="$METRIC_FILE.$$"

# Generate the metrics
cat > "$TMP_FILE" << EOF
# HELP custom_database_connections Active database connections
# TYPE custom_database_connections gauge
custom_database_connections{database="primary"} $(mysql -e "SHOW STATUS LIKE 'Threads_connected'" | awk 'NR==2 {print $2}')
custom_database_connections{database="replica"} $(mysql -h replica -e "SHOW STATUS LIKE 'Threads_connected'" | awk 'NR==2 {print $2}')

# HELP custom_log_errors_total Total number of errors in logs
# TYPE custom_log_errors_total counter
custom_log_errors_total $(grep -c ERROR /var/log/app.log)
EOF

# Move into place atomically
mv "$TMP_FILE" "$METRIC_FILE"

Python version:

#!/usr/bin/env python3
from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

registry = CollectorRegistry()

# Create the metric
connections = Gauge(
    'custom_database_connections',
    'Active database connections',
    ['database'],
    registry=registry
)

# Set the values
connections.labels(database='primary').set(42)
connections.labels(database='replica').set(38)

# Write the file
write_to_textfile(
    '/var/lib/node_exporter/textfile_collector/custom.prom',
    registry
)

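Best practices¶

1. Use atomic writes¶

// ✅ Good: atomic write, so Node Exporter never reads a partially written file
tmpFile := filename + ".tmp"
os.Create(tmpFile)
// ... write the content
os.Rename(tmpFile, filename)

// ❌ Bad: writing in place, so a scrape may read incomplete data
os.Create(filename)
// ... write the content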
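
For scripts that do not use a client library, the same temp-file-then-rename pattern can be sketched with the Python standard library alone. The function name write_atomically is hypothetical. The sketch assumes the temporary file lives in the same directory as the target, so the rename stays on one file system, and it relies on the textfile collector only reading files that match *.prom.

```python
import os
import tempfile


def write_atomically(path, content):
    """Write content to path via a temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The suffix is not ".prom", so Node Exporter ignores the partial file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".partial")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data reaches the disk
        os.chmod(tmp_path, 0o644)   # mkstemp creates the file with mode 0600
        os.replace(tmp_path, path)  # atomic on POSIX within one file system
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)     # do not leave temp files behind
        raise
```

A call such as write_atomically("/var/lib/node_exporter/textfile_collector/custom.prom", "demo_metric 1\n") either replaces the file completely or leaves the previous version untouched.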

2. Add a file generation timestamp metric¶

// Record when the file was generated, to monitor whether the script is still running
timestampGauge := prometheus.NewGauge(prometheus.GaugeOpts{
 Name: "textfile_scrape_timestamp",
 Help: "Timestamp of the last textfile generation",
})
timestampGauge.Set(float64(time.Now().Unix()))
registry.MustRegister(timestampGauge)

Alerting rule:

- alert: TextfileStale
  expr: time() - textfile_scrape_timestamp > 3600
  annotations:
    summary: "Textfile has not been updated for over 1 hour"

3. Error handling¶

func writeMetricsWithErrorHandling(registry *prometheus.Registry, filename string) error {
 tmpFile := filename + ".tmp"

 // Create the temporary file
 file, err := os.Create(tmpFile)
 if err != nil {
  return fmt.Errorf("create temp file: %w", err)
 }

 // Gather the metrics
 mfs, err := registry.Gather()
 if err != nil {
  file.Close()
  os.Remove(tmpFile)
  return fmt.Errorf("gather metrics: %w", err)
 }

 // Write the metrics
 for _, mf := range mfs {
  if _, err := expfmt.MetricFamilyToText(file, mf); err != nil {
   file.Close()
   os.Remove(tmpFile)
   return fmt.Errorf("write metric: %w", err)
  }
 }

 // Close the file
 if err := file.Close(); err != nil {
  os.Remove(tmpFile)
  return fmt.Errorf("close file: %w", err)
 }

 // Atomic rename
 if err := os.Rename(tmpFile, filename); err != nil {
  os.Remove(tmpFile)
  return fmt.Errorf("rename file: %w", err)
 }

 return nil
}

4. File permissions¶

# Make sure the Node Exporter user can read the files
chmod 644 /var/lib/node_exporter/textfile_collector/*.prom

# Directory permissions
chmod 755 /var/lib/node_exporter/textfile_collector

Prometheus configuration¶

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

PromQL query examples¶

# Query a textfile metric
custom_metric_gauge

# Check whether the textfile is stale
time() - textfile_scrape_timestamp > 3600

# Backup size trend
backup_size_bytes

# Users above 90% of their disk quota
disk_quota_used_bytes / disk_quota_limit_bytes > 0.9
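
To make the quota expression concrete, the short sketch below reproduces its arithmetic in plain Python with the sample values hard-coded in getQuotaInfo() from scenario 2. The function name over_threshold is hypothetical. It only mirrors the per-user division and the comparison filter, not PromQL's label matching in general.

```python
GIB = 1024 ** 3

# Sample values from scenario 2: user -> (used bytes, limit bytes)
quotas = {
    "alice": (5 * GIB, 10 * GIB),
    "bob": (8 * GIB, 10 * GIB),
}


def over_threshold(quotas, threshold=0.9):
    """Mimic `used / limit > threshold`: keep only users above the threshold."""
    result = {}
    for user, (used, limit) in quotas.items():
        ratio = used / limit
        if ratio > threshold:
            result[user] = ratio
    return result


print(over_threshold(quotas))        # {} because alice is at 0.5 and bob at 0.8
print(over_threshold(quotas, 0.75))  # {'bob': 0.8}
```

With the sample data the 0.9 query returns no series, which is the expected state when no user is near the limit.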

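Monitoring the Textfile Collector¶

Related metrics exposed by Node Exporter:

# Set to 1 if the textfile collector failed to read or parse a file
node_textfile_scrape_error

# Modification time (Unix timestamp) of each textfile that was read
node_textfile_mtime_seconds

# Number of textfiles
count(node_textfile_mtime_seconds)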

Alerting rules:

groups:
  - name: textfile_collector
    rules:
      # Textfile read error
      - alert: TextfileScrapeError
        expr: node_textfile_scrape_error == 1
        annotations:
          summary: "Textfile collector read error"

      # Textfile is stale
      - alert: TextfileStale
        expr: time() - node_textfile_mtime_seconds > 3600
        annotations:
          summary: "Textfile has not been updated for over 1 hour"
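
The staleness rule compares the current time with each file's modification time. The same check can be run locally, for example while debugging a cron job. This is a minimal sketch; the function name stale_textfiles and its parameters are hypothetical, and it inspects the files directly rather than querying Prometheus.

```python
import os
import time


def stale_textfiles(directory, max_age_seconds=3600, now=None):
    """Return the .prom files whose mtime is older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".prom"):
            continue  # the textfile collector only reads *.prom files
        mtime = os.path.getmtime(os.path.join(directory, name))
        # Same comparison as: time() - node_textfile_mtime_seconds > 3600
        if now - mtime > max_age_seconds:
            stale.append(name)
    return stale
```

Calling stale_textfiles("/var/lib/node_exporter/textfile_collector") lists the files that would currently trigger the TextfileStale alert.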

FAQ¶

Q1: Metrics do not show up in Node Exporter?

A: Check the following:

# 1. Does the file exist?
ls -la /var/lib/node_exporter/textfile_collector/

# 2. File permissions
chmod 644 /var/lib/node_exporter/textfile_collector/*.prom

# 3. Is the file format valid?
cat /var/lib/node_exporter/textfile_collector/*.prom

# 4. Check the Node Exporter logs
journalctl -u node_exporter -f

# 5. Check for scrape errors
curl http://localhost:9100/metrics | grep node_textfile_scrape_error

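Q2: Metrics cannot be read because of a format error?

A: Make sure the file follows the text exposition format. The HELP and TYPE lines are optional (a metric without TYPE is treated as untyped), but including them is recommended:

# ✅ Valid format
# HELP metric_name Metric description
# TYPE metric_name gauge
metric_name{label="value"} 42

# ❌ Invalid format (label value is not quoted)
metric_name{label=value} 42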
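
A rough pre-flight check can catch the most common mistakes before a file reaches the collector directory. The sketch below is an approximation written for illustration, not the parser Node Exporter uses. The name check_textfile is hypothetical, and the regular expression covers only simple sample lines. Lines carrying a timestamp are reported as invalid on purpose, since the textfile collector does not accept client-side timestamps.

```python
import re

# One label pair: name="value", where the value may contain escaped characters.
_LABEL = r'[a-zA-Z_][a-zA-Z0-9_]*="(?:[^"\\\n]|\\.)*"'
# A sample line without a timestamp: name, optional {labels}, then the value.
_SAMPLE = re.compile(
    r'^[a-zA-Z_:][a-zA-Z0-9_:]*'
    r'(?:\{\s*(?:' + _LABEL + r'(?:\s*,\s*' + _LABEL + r')*\s*,?)?\s*\})?'
    r'\s+(\S+)$'
)


def check_textfile(text):
    """Return a list of problems found in the content of a .prom file."""
    problems = []
    if text and not text.endswith("\n"):
        problems.append("file does not end with a newline")
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and HELP/TYPE/comment lines are not checked
        match = _SAMPLE.match(stripped)
        if match is None:
            problems.append(f"line {lineno}: not a valid sample line")
            continue
        try:
            float(match.group(1))  # accepts 42, 78.9, 1e3, NaN, +Inf, -Inf
        except ValueError:
            problems.append(f"line {lineno}: value is not a number")
    return problems
```

An empty result does not guarantee that Node Exporter accepts the file, so node_textfile_scrape_error remains the authoritative signal.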

Q3: How do I remove metrics?

A: Delete the corresponding .prom file:

rm /var/lib/node_exporter/textfile_collector/custom.prom

Node Exporter drops these metrics automatically on the next scrape.

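Performance considerations¶

Number of files (rules of thumb):

✅ One file per script/job is recommended
✅ Keep the total number of files below roughly 100
❌ Avoid a large number of tiny files

File size:

✅ Keep each file below roughly 10 MB
❌ Avoid exporting a very large number of time series

Update frequency:

✅ Batch jobs: update on demand
✅ Periodic jobs: every 1-5 minutes
❌ Avoid high-frequency updates (more often than every 10 s)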
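
The file-count and file-size guidelines above can be checked mechanically. The following sketch uses only the standard library; the function name audit_textfile_dir is hypothetical, and its default thresholds simply restate the rules of thumb from this section (100 files, 10 MB per file).

```python
import os


def audit_textfile_dir(directory, max_files=100, max_bytes=10 * 1024 * 1024):
    """Return human-readable warnings for a textfile collector directory."""
    warnings = []
    prom_files = sorted(n for n in os.listdir(directory) if n.endswith(".prom"))
    if len(prom_files) >= max_files:
        warnings.append(
            f"{len(prom_files)} .prom files (guideline: fewer than {max_files})"
        )
    for name in prom_files:
        size = os.path.getsize(os.path.join(directory, name))
        if size >= max_bytes:
            warnings.append(f"{name}: {size} bytes (guideline: below {max_bytes})")
    return warnings
```

An empty list means the directory is within both guidelines.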

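Summary¶

The Textfile Collector is a good fit for:

✅ Metrics produced by cron/batch jobs
✅ Metrics exported from Shell/Python scripts
✅ Static metrics and configuration information
✅ Cases where the local file system is accessible

Key points:

Collect metrics with registry.Gather()
Write them to a file with expfmt.MetricFamilyToText()
Use a temporary file plus an atomic rename
Add a timestamp metric to monitor whether the script is running
Set correct file permissions (644)

Comparison with Pushgateway:

Feature	Textfile Collector	Pushgateway
Deployment complexity	Low (reuses Node Exporter)	Medium (requires an extra component)
Network dependency	None (local files)	Yes (HTTP push)
Lifecycle	Automatic (delete the file)	Manual (requires a Delete call)
Typical use case	Local scheduled jobs	Remote batch jobs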
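
The "manual" lifecycle in the Pushgateway column refers to its HTTP API: a metric group lives under /metrics/job/<job>/<label>/<value> and stays there until a DELETE request removes it. The sketch below only builds the URL and the request object and sends nothing. The helper names are hypothetical, and it assumes label values that are non-empty and contain no "/" (Pushgateway handles those through a separate base64 form).

```python
import urllib.parse
import urllib.request


def pushgateway_group_url(base_url, job, grouping_key=None):
    """Build the URL of one metric group: /metrics/job/<job>{/<label>/<value>}."""
    parts = ["metrics", "job", urllib.parse.quote(job, safe="")]
    for name, value in (grouping_key or {}).items():
        parts += [name, urllib.parse.quote(str(value), safe="")]
    return base_url.rstrip("/") + "/" + "/".join(parts)


def build_delete_request(base_url, job, grouping_key=None):
    """Build (but do not send) the DELETE request that removes one metric group."""
    url = pushgateway_group_url(base_url, job, grouping_key)
    return urllib.request.Request(url, method="DELETE")
```

Passing the returned request to urllib.request.urlopen would perform the deletion. With the textfile collector, the equivalent step is removing the .prom file.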

The complete code above can serve as a starting point for exporting metrics from batch jobs.

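Appendix: Complete Python & Shell examples¶

This appendix provides complete Prometheus integration examples in Python and Shell, covering three main modes: HTTP exporter, Pushgateway, and textfile collector.

Python HTTP Exporter¶

Overview¶

Use the prometheus_client library to expose metrics over an HTTP endpoint; this suits long-running services.

Complete code¶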

#!/usr/bin/env python3
import time
import psutil
from prometheus_client import start_http_server, Gauge, Counter, Histogram, Info, REGISTRY

# Define the metrics
cpu_usage = Gauge('system_cpu_usage_percent', 'CPU usage', ['cpu'], registry=REGISTRY)
memory_usage = Gauge('system_memory_usage_bytes', 'Memory usage', ['type'], registry=REGISTRY)
request_count = Counter('http_requests_total', 'HTTP requests', ['method', 'endpoint', 'status'], registry=REGISTRY)
request_duration = Histogram('http_request_duration_seconds', 'Request latency', ['endpoint'], buckets=[0.01, 0.05, 0.1, 0.5, 1, 2, 5], registry=REGISTRY)
app_info = Info('application', 'Application info', registry=REGISTRY)

def update_system_metrics():
    cpu_percent = psutil.cpu_percent(interval=1, percpu=True)
    for i, percent in enumerate(cpu_percent):
        cpu_usage.labels(cpu=f'cpu{i}').set(percent)

    mem = psutil.virtual_memory()
    memory_usage.labels(type='total').set(mem.total)
    memory_usage.labels(type='used').set(mem.used)

def main():
    app_info.info({'version': '1.0.0', 'env': 'production'})
    port = 8000
    start_http_server(port)
    print(f"Metrics: http://localhost:{port}/metrics")

    try:
        while True:
            update_system_metrics()
            time.sleep(5)
    except KeyboardInterrupt:
        print("Shutting down...")

if __name__ == '__main__':
    main()

Install and run¶

# Install the dependencies
pip install prometheus-client psutil

# Run
python3 python_http_exporter.py

# View the metrics
curl http://localhost:8000/metrics

Python Pushgateway¶

Complete code¶

#!/usr/bin/env python3
import time
import random
import socket
from prometheus_client import CollectorRegistry, Gauge, Counter
from prometheus_client import push_to_gateway, delete_from_gateway

registry = CollectorRegistry()

batch_size = Gauge('batch_processing_size', 'Batch size', ['batch_id'], registry=registry)
batch_processed = Counter('batch_items_processed_total', 'Items processed', ['batch_id', 'status'], registry=registry)

def process_batch(batch_id, item_count):
    print(f"Processing batch: {batch_id}")
    batch_size.labels(batch_id=batch_id).set(item_count)

    for i in range(item_count):
        time.sleep(random.uniform(0.001, 0.01))
        if random.random() < 0.9:
            batch_processed.labels(batch_id=batch_id, status='success').inc()
        else:
            batch_processed.labels(batch_id=batch_id, status='failed').inc()

def main():
    pushgateway_url = 'localhost:9091'
    job_name = 'batch_job'
    grouping_key = {'instance': socket.gethostname()}

    batch_id = f"batch_{int(time.time())}"
    process_batch(batch_id, 100)

    push_to_gateway(pushgateway_url, job=job_name, registry=registry, grouping_key=grouping_key)
    print(f"Metrics pushed to {pushgateway_url}")

    time.sleep(30)
    delete_from_gateway(pushgateway_url, job=job_name, grouping_key=grouping_key)

if __name__ == '__main__':
    main()

Run¶

# Start Pushgateway
docker run -d -p 9091:9091 prom/pushgateway

# Run the script
python3 python_pushgateway.py

Python Textfile Collector¶

Complete code¶

#!/usr/bin/env python3
import os
import time
import psutil
from prometheus_client import CollectorRegistry, Gauge
from prometheus_client import write_to_textfile

TEXTFILE_DIR = '/var/lib/node_exporter/textfile_collector'
METRIC_FILE = os.path.join(TEXTFILE_DIR, 'custom_metrics.prom')

if not os.path.exists(TEXTFILE_DIR):
    TEXTFILE_DIR = '/tmp/textfile_collector'
    METRIC_FILE = os.path.join(TEXTFILE_DIR, 'custom_metrics.prom')
    os.makedirs(TEXTFILE_DIR, exist_ok=True)

registry = CollectorRegistry()

disk_usage = Gauge('custom_disk_usage_percent', 'Disk usage', ['mountpoint'], registry=registry)
scrape_timestamp = Gauge('custom_textfile_scrape_timestamp', 'Last scrape time', registry=registry)

def collect_metrics():
    for partition in psutil.disk_partitions():
        try:
            usage = psutil.disk_usage(partition.mountpoint)
            disk_usage.labels(mountpoint=partition.mountpoint).set(usage.percent)
        except PermissionError:
            continue

    scrape_timestamp.set(time.time())

def main():
    print("Collecting metrics...")
    collect_metrics()

    tmp_file = f"{METRIC_FILE}.tmp"
    write_to_textfile(tmp_file, registry)
    os.rename(tmp_file, METRIC_FILE)

    print(f"Metrics written to: {METRIC_FILE}")

if __name__ == '__main__':
    main()

Crontab configuration¶

*/5 * * * * /usr/bin/python3 /path/to/python_textfile_collector.py

Shell Textfile Collector¶

Complete code¶

#!/bin/bash
set -e

TEXTFILE_DIR="/var/lib/node_exporter/textfile_collector"
METRIC_FILE="$TEXTFILE_DIR/shell_metrics.prom"

if [ ! -d "$TEXTFILE_DIR" ]; then
    TEXTFILE_DIR="/tmp/textfile_collector"
    METRIC_FILE="$TEXTFILE_DIR/shell_metrics.prom"
    mkdir -p "$TEXTFILE_DIR"
fi

TMP_FILE="$METRIC_FILE.$$"

: > "$TMP_FILE"

# System load
LOAD_1=$(uptime | awk -F'load average:' '{print $2}' | awk -F, '{print $1}' | tr -d ' ')
cat >> "$TMP_FILE" << EOF
# HELP shell_system_load System load average
# TYPE shell_system_load gauge
shell_system_load{period="1m"} $LOAD_1

EOF

# Disk usage
cat >> "$TMP_FILE" << 'EOF'
# HELP shell_disk_usage_percent Disk usage percentage
# TYPE shell_disk_usage_percent gauge
EOF

df -h | awk 'NR>1 && $1 ~ /^\// {
    gsub(/%/, "", $5)
    printf "shell_disk_usage_percent{mountpoint=\"%s\"} %s\n", $6, $5
}' >> "$TMP_FILE"

echo "" >> "$TMP_FILE"

# Memory usage
MEMORY_TOTAL=$(free -b | awk '/^Mem:/ {print $2}')
cat >> "$TMP_FILE" << EOF
# HELP shell_memory_bytes Memory usage
# TYPE shell_memory_bytes gauge
shell_memory_bytes{type="total"} $MEMORY_TOTAL

EOF

# Timestamp
cat >> "$TMP_FILE" << EOF
# HELP shell_textfile_scrape_timestamp Timestamp
# TYPE shell_textfile_scrape_timestamp gauge
shell_textfile_scrape_timestamp $(date +%s)
EOF

mv "$TMP_FILE" "$METRIC_FILE"
chmod 644 "$METRIC_FILE"

echo "Metrics written to: $METRIC_FILE"

Usage¶

chmod +x shell_textfile_collector.sh
./shell_textfile_collector.sh

# Crontab
*/5 * * * * /path/to/shell_textfile_collector.sh

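Comparison summary¶

Choosing a language¶

Feature	Python	Go	Shell
Performance	Medium	High	Low
Deployment	Requires a Python runtime	Single binary	No dependencies
Development speed	High	Medium	Medium
Typical use	Data processing, scripts	Production services	System monitoring

Choosing a mode¶

Mode	Strengths	Weaknesses	Typical use
HTTP Exporter	Real-time updates	Requires an exposed port	Long-running services
Pushgateway	Suited to batch jobs	Requires an extra component	Batch jobs
Textfile	Simple, no dependencies	Updated only periodically	Scheduled jobs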

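Best practices¶

HTTP Exporter¶

# Use a custom registry and pass it to every metric
registry = CollectorRegistry()

# Keep label cardinality reasonable (each metric needs its own name within a registry)
by_service = Gauge('metric_by_service', 'desc', ['service'], registry=registry)  # ✅ low cardinality
by_user = Gauge('metric_by_user', 'desc', ['user_id'], registry=registry)  # ❌ high cardinality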
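
Cardinality matters because every distinct combination of label values becomes its own time series. As a back-of-the-envelope sketch, an upper bound for one metric is the product of the number of distinct values per label. The function name max_series and the counts below are made-up illustration values.

```python
import math


def max_series(distinct_values_per_label):
    """Upper bound on time series for one metric: product of distinct label values."""
    return math.prod(distinct_values_per_label.values())


print(max_series({"service": 20}))                      # 20
print(max_series({"service": 20, "user_id": 100_000}))  # 2000000
```

Adding a single user_id label turns 20 series into as many as two million, which is why identifiers such as user IDs are better kept out of labels.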

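Pushgateway¶

# Push once when the job finishes. Do not delete right after pushing,
# otherwise Prometheus has no chance to scrape the metrics.
push_to_gateway(url, job='batch', registry=registry)

# Delete the group only when its metrics are no longer needed
# (for example, when the job is retired)
delete_from_gateway(url, job='batch')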

Textfile Collector¶

# Atomic write
tmp_file = f"{metric_file}.tmp"
write_to_textfile(tmp_file, registry)
os.rename(tmp_file, metric_file)

Prometheus configuration¶

scrape_configs:
  - job_name: 'python_exporter'
    static_configs:
      - targets: ['localhost:8000']

  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['localhost:9091']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

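Summary¶

This appendix provides complete example code covering:

✅ 3 languages: Go, Python, Shell
✅ 3 modes: HTTP, Pushgateway, Textfile
✅ 4 metric types: Counter, Gauge, Histogram, Info
✅ Best practices: atomic writes, error handling, label usage

Several examples use hard-coded sample values, so treat them as starting points and adapt paths, error handling, and data collection before using them in production.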