OpenTelemetry and Langfuse Integration Deployment Guide¶
This document provides a detailed guide to deploying the OpenTelemetry Collector on Kubernetes with Helm and integrating it with Langfuse to collect telemetry data from LLM services.
Prerequisites¶
- A running Kubernetes cluster (v1.19+)
- Helm (v3.0+) installed
- kubectl configured with access to the cluster
- A basic understanding of Kubernetes and observability concepts
- An LLM service deployed in the cluster, or reachable from it
Architecture Overview¶
In this architecture:
- The LLM service emits telemetry in OpenTelemetry format
- The OpenTelemetry Collector receives this data
- The Collector forwards the data to Langfuse for analysis and visualization
- Optionally, the data can also be sent to other backends such as Prometheus or Jaeger
Deploying Langfuse¶
1. Add the Langfuse Helm repository¶
helm repo add langfuse https://langfuse.github.io/helm-charts
helm repo update
2. Create the Langfuse configuration file¶
Create a file named langfuse-values.yaml:
global:
  # Set to your domain
  host: langfuse.your-domain.com

postgresql:
  enabled: true
  auth:
    username: langfuse
    password: "your-secure-password"
    database: langfuse

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: langfuse.your-domain.com
      paths:
        - path: /
          pathType: Prefix

auth:
  # Generate a secure secret key
  secretKey: "your-secure-secret-key"
  # If using Auth0 or another identity provider
  # auth0:
  #   clientId: "your-auth0-client-id"
  #   clientSecret: "your-auth0-client-secret"
  #   issuer: "https://your-tenant.auth0.com"
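The placeholder credentials above (the PostgreSQL password and secretKey) should be replaced with randomly generated values. One common way to generate one, assuming openssl is available:

openssl rand -base64 32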
3. Install Langfuse¶
helm install langfuse langfuse/langfuse -f langfuse-values.yaml -n langfuse --create-namespace
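Before continuing, it is worth confirming that the release came up healthy, for example:

kubectl get pods -n langfuse
kubectl get ingress -n langfuse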
4. Obtain the Langfuse API keys¶
Once the deployment is complete, open the Langfuse UI and create a new project. Note down the generated API keys (public and secret key); you will need them when configuring the OpenTelemetry Collector.
Deploying the OpenTelemetry Collector¶
1. Add the OpenTelemetry Helm repository¶
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
2. Create the OpenTelemetry Collector configuration file¶
Create a file named otel-collector-values.yaml:
mode: deployment

config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

  processors:
    batch:
      timeout: 10s
      send_batch_size: 1024
    memory_limiter:
      check_interval: 1s
      limit_mib: 1000
      spike_limit_mib: 200
    resource:
      attributes:
        - action: insert
          key: environment
          value: production

  exporters:
    logging:
      loglevel: debug
    otlphttp:
      endpoint: "http://langfuse.langfuse.svc.cluster.local:3000/api/public/otel"
      headers:
        "X-Langfuse-Public-Key": "your-langfuse-public-key"
        "X-Langfuse-Secret-Key": "your-langfuse-secret-key"

  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, batch, resource]
        exporters: [otlphttp, logging]
      metrics:
        receivers: [otlp]
        processors: [memory_limiter, batch, resource]
        exporters: [otlphttp, logging]
      logs:
        receivers: [otlp]
        processors: [memory_limiter, batch, resource]
        exporters: [otlphttp, logging]

serviceAccount:
  create: true
  annotations: {}
  name: ""

service:
  type: ClusterIP
  ports:
    - name: otlp-grpc
      port: 4317
      protocol: TCP
      targetPort: 4317
    - name: otlp-http
      port: 4318
      protocol: TCP
      targetPort: 4318

resources:
  limits:
    cpu: 1
    memory: 2Gi
  requests:
    cpu: 200m
    memory: 400Mi

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8888"
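A note on authentication: depending on your Langfuse version, the /api/public/otel endpoint may expect standard HTTP Basic authentication (public key as the username, secret key as the password) rather than the custom headers shown above. In that case the header value can be built as follows and set as "Authorization": "Basic <value>" in the otlphttp exporter:

echo -n "your-langfuse-public-key:your-langfuse-secret-key" | base64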
3. Install the OpenTelemetry Collector¶
helm install otel-collector open-telemetry/opentelemetry-collector -f otel-collector-values.yaml -n otel --create-namespace
4. Verify the Collector deployment¶
kubectl get pods -n otel
kubectl logs -f deployment/otel-collector -n otel
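Before wiring up a real service, you can send a hand-crafted trace to confirm that the OTLP/HTTP receiver and the export pipeline work. A minimal smoke test, assuming the Service is named otel-collector (the chart may append a suffix; adjust to the actual Service name):

kubectl port-forward -n otel svc/otel-collector 4318:4318 &
# Post a minimal OTLP/JSON trace with placeholder IDs
curl -s -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{
    "resourceSpans": [{
      "resource": {
        "attributes": [{ "key": "service.name", "value": { "stringValue": "smoke-test" } }]
      },
      "scopeSpans": [{
        "spans": [{
          "traceId": "5b8efff798038103d269b633813fc60c",
          "spanId": "eee19b7ec3c1b174",
          "name": "test-span",
          "kind": 1,
          "startTimeUnixNano": "1700000000000000000",
          "endTimeUnixNano": "1700000001000000000"
        }]
      }]
    }]
  }'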
Configuring the LLM Service to Send Telemetry¶
For vLLM Services¶
If you deploy your LLM service with vLLM, you need to integrate the OpenTelemetry SDK into the service. An example configuration:
- Add environment variables to the vLLM service's Deployment or StatefulSet:
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.otel.svc.cluster.local:4317"
  - name: OTEL_SERVICE_NAME
    value: "llm-inference-service"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "service.namespace=ai,service.version=v1,deployment.environment=production"
- Initialize OpenTelemetry in the Python code:
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Initialize OpenTelemetry; resource attributes are picked up from
# OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES
resource = Resource.create({})
tracer_provider = TracerProvider(resource=resource)
trace.set_tracer_provider(tracer_provider)

# Configure the OTLP exporter; the endpoint is read from OTEL_EXPORTER_OTLP_ENDPOINT
otlp_exporter = OTLPSpanExporter()
span_processor = BatchSpanProcessor(otlp_exporter)
tracer_provider.add_span_processor(span_processor)

# Auto-instrument outgoing HTTP requests
RequestsInstrumentor().instrument()

# Get a tracer
tracer = trace.get_tracer(__name__)

# Use it during LLM inference
def generate_text(prompt):
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("llm.prompt", prompt)
        span.set_attribute("llm.model", "DeepSeek-R1-Distill-Qwen-1.5B")
        # Run the actual inference
        result = actual_generate_function(prompt)
        span.set_attribute("llm.completion", result)
        # Character count as a rough proxy; use the tokenizer's count if available
        span.set_attribute("llm.tokens", len(result))
        return result
Using Kubernetes Sidecar Injection¶
For languages with OpenTelemetry auto-instrumentation support, the SDK configuration can be injected automatically. This relies on the OpenTelemetry Operator being installed in the cluster (see the Instrumentation example after this list):
- Enable auto-injection for the namespace (the Operator reads this as an annotation, not a label):
kubectl annotate namespace your-llm-namespace instrumentation.opentelemetry.io/inject-sdk=true
- Add annotations to the workload:
annotations:
  instrumentation.opentelemetry.io/inject-sdk: "true"
  instrumentation.opentelemetry.io/container-names: "your-llm-container"
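Injection only happens if an Instrumentation resource exists for the Operator to apply. A minimal sketch, pointing at the Collector deployed above (the resource name llm-instrumentation is arbitrary, and this assumes the Operator is installed):

kubectl apply -f - <<EOF
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: llm-instrumentation
  namespace: your-llm-namespace
spec:
  exporter:
    endpoint: http://otel-collector.otel.svc.cluster.local:4317
  propagators:
    - tracecontext
    - baggage
EOF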
Verifying the Deployment¶
1. Check the OpenTelemetry Collector logs¶
kubectl logs -f deployment/otel-collector -n otel
You should see log entries indicating that telemetry is being received and exported successfully.
2. Verify that Langfuse receives data¶
Log in to the Langfuse UI and open the Traces page to confirm that telemetry from the LLM service is visible.
3. Test the end-to-end flow¶
Run an LLM inference request, then look up the corresponding trace in Langfuse.
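To trigger such a request by hand, assuming the service exposes vLLM's OpenAI-compatible API and is reachable as llm-inference-service in the ai namespace (both names are assumptions matching the environment variables above):

kubectl port-forward -n ai svc/llm-inference-service 8000:8000 &
# Send a completion request through the OpenAI-compatible endpoint
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "DeepSeek-R1-Distill-Qwen-1.5B", "prompt": "Hello", "max_tokens": 16}'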
Monitoring and Alerting¶
Monitoring the OpenTelemetry Collector¶
- Set up Prometheus scraping for the Collector:
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  namespace: otel
spec:
  endpoints:
    - port: metrics
      interval: 15s
  selector:
    matchLabels:
      app.kubernetes.io/name: opentelemetry-collector
EOF
- Create basic alerting rules:
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: otel-collector-alerts
  namespace: otel
spec:
  groups:
    - name: otel-collector
      rules:
        - alert: OtelCollectorDown
          expr: up{job="otel-collector"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "OpenTelemetry Collector is down"
            description: "OpenTelemetry Collector has been down for more than 5 minutes."
        - alert: OtelCollectorHighMemory
          expr: process_resident_memory_bytes{job="otel-collector"} > 1.8e+9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "OpenTelemetry Collector high memory usage"
            description: "OpenTelemetry Collector is using more than 1.8GB of memory."
EOF
Monitoring Langfuse¶
Set up a Langfuse health check (the probe_success metric below assumes Langfuse is probed by a blackbox exporter; a Probe sketch follows the rule):
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: langfuse-alerts
  namespace: langfuse
spec:
  groups:
    - name: langfuse
      rules:
        - alert: LangfuseDown
          expr: probe_success{job="langfuse"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Langfuse is down"
            description: "Langfuse has been down for more than 5 minutes."
EOF
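The probe_success metric only exists if something is actually probing Langfuse. One way to produce it is a Prometheus Operator Probe backed by a blackbox exporter; a minimal sketch, assuming a blackbox exporter is reachable at blackbox-exporter.monitoring:9115 (that name and namespace are assumptions):

kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: langfuse
  namespace: langfuse
spec:
  jobName: langfuse
  prober:
    url: blackbox-exporter.monitoring:9115
  module: http_2xx
  targets:
    staticConfig:
      static:
        - http://langfuse.langfuse.svc.cluster.local:3000
EOF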
Troubleshooting¶
OpenTelemetry Collector Issues¶
- The Collector fails to start
  - Check the configuration file syntax
  - Verify that the resource limits are reasonable
  - Inspect the detailed logs:
    kubectl logs -f deployment/otel-collector -n otel
- The Collector cannot reach Langfuse
  - Verify that the Langfuse service is reachable:
    kubectl exec -it deployment/otel-collector -n otel -- curl -v langfuse.langfuse.svc.cluster.local:3000/api/public/health
  - Check that the API keys are configured correctly
- Memory usage is too high
  - Tune the memory_limiter processor configuration
  - Increase the resource limits
  - Consider horizontal autoscaling
Langfuse Issues¶
- Database connection problems
  - Check the PostgreSQL connection configuration
  - Verify that the database is healthy:
    kubectl exec -it statefulset/langfuse-postgresql -n langfuse -- pg_isready
- API errors
  - Check the Langfuse logs:
    kubectl logs -f deployment/langfuse -n langfuse
  - Verify the API key permissions
LLM Service Issues¶
- No telemetry is being sent
  - Confirm that the OpenTelemetry SDK is initialized correctly
  - Verify the environment variable configuration
  - Check that network policies allow traffic to the Collector
Best Practices¶
Scalability Considerations¶
- Scale the Collector horizontally
For high-traffic scenarios, configure an HPA:
kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector
  namespace: otel
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
EOF
- Run the Collector in agent mode
For large clusters, consider deploying the Collector as a DaemonSet:
helm install otel-agent open-telemetry/opentelemetry-collector \
--set mode=daemonset \
-f otel-agent-values.yaml \
-n otel
Security Best Practices¶
- Store sensitive information in Kubernetes Secrets
kubectl create secret generic langfuse-api-keys \
--from-literal=public-key=your-public-key \
--from-literal=secret-key=your-secret-key \
-n otel
Then reference them in the Collector configuration (see the extraEnvs sketch below for how the variables reach the pods):
exporters:
  otlphttp:
    endpoint: "http://langfuse.langfuse.svc.cluster.local:3000/api/public/otel"
    headers:
      "X-Langfuse-Public-Key": ${LANGFUSE_PUBLIC_KEY}
      "X-Langfuse-Secret-Key": ${LANGFUSE_SECRET_KEY}
- Configure network policies
Restrict network access between the Collector and Langfuse:
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-otel-to-langfuse
  namespace: langfuse
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: langfuse
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: otel
          podSelector:
            matchLabels:
              app.kubernetes.io/name: opentelemetry-collector
      ports:
        - protocol: TCP
          port: 3000
EOF
Data Collection Best Practices¶
- Standardize LLM attributes
Use consistent attribute names across all LLM services:
  - llm.model - model name
  - llm.prompt - input prompt
  - llm.completion - generated text
  - llm.tokens - number of tokens used
  - llm.latency - inference latency
  - llm.temperature - sampling temperature
- Collect appropriate context
Capture enough contextual information, but avoid including sensitive data:
def process_llm_request(request_data):
    with tracer.start_as_current_span("llm.request") as span:
        # Add basic attributes
        span.set_attribute("llm.model", request_data.get("model"))
        span.set_attribute("llm.temperature", request_data.get("temperature", 0.7))
        # Redact sensitive content before recording the prompt
        prompt = request_data.get("prompt", "")
        sanitized_prompt = sanitize_sensitive_data(prompt)
        span.set_attribute("llm.prompt", sanitized_prompt)
        # Handle the request...
Conclusion¶
With this guide you have deployed the OpenTelemetry Collector on Kubernetes with Helm and integrated it with Langfuse to collect and analyze LLM telemetry. This setup provides comprehensive observability for LLM services, letting you monitor performance, trace request flows, and optimize model behavior.
As your LLM applications grow, this architecture can be extended with additional storage backends, more sophisticated processing pipelines, and richer visualizations.