OpenTelemetry and Langfuse Integration Deployment Guide

This document provides a detailed guide to deploying the OpenTelemetry Collector on Kubernetes with Helm and integrating it with Langfuse to collect telemetry from LLM services.

Prerequisites

  • A running Kubernetes cluster (v1.19+)
  • Helm (v3.0+) installed
  • kubectl configured with access to the cluster
  • A basic understanding of Kubernetes and observability concepts
  • An LLM service deployed in, or reachable from, the cluster

Architecture Overview

(Figure: OpenTelemetry and Langfuse architecture)

In this architecture:

  1. LLM services emit telemetry in OpenTelemetry format
  2. The OpenTelemetry Collector receives the data
  3. The Collector forwards it to Langfuse for analysis and visualization
  4. Optionally, the data can also be sent to other backends such as Prometheus or Jaeger

Deploy Langfuse

1. Add the Langfuse Helm repository

helm repo add langfuse https://langfuse.github.io/helm-charts
helm repo update
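
To confirm the chart is now visible locally:

helm search repo langfuse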

2. Create the Langfuse configuration file

Create a file named langfuse-values.yaml:

global:
  # Set this to your domain
  host: langfuse.your-domain.com

postgresql:
  enabled: true
  auth:
    username: langfuse
    password: "your-secure-password"
    database: langfuse

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: langfuse.your-domain.com
      paths:
        - path: /
          pathType: Prefix

auth:
  # Generate a secure secret key
  secretKey: "your-secure-secret-key"
  # If using Auth0 or another identity provider
  # auth0:
  #   clientId: "your-auth0-client-id"
  #   clientSecret: "your-auth0-client-secret"
  #   issuer: "https://your-tenant.auth0.com"

3. Install Langfuse

helm install langfuse langfuse/langfuse -f langfuse-values.yaml -n langfuse --create-namespace
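
Check that the pods come up, and reach the UI through a port-forward until the ingress DNS is in place (the service name langfuse is a sketch; match it to what the chart actually creates):

kubectl get pods -n langfuse
kubectl port-forward svc/langfuse 3000:3000 -n langfuse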

4. Obtain the Langfuse API keys

Once the deployment is up, open the Langfuse UI and create a new project. Note the generated public and secret API keys; you will need them when configuring the OpenTelemetry Collector.
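
Langfuse's OTLP endpoint authenticates with HTTP Basic auth, using the public key as the username and the secret key as the password. One way to produce the header value (the keys below are placeholders):

# Encode "<public-key>:<secret-key>" for the Authorization header
export LANGFUSE_AUTH=$(echo -n "pk-lf-your-public-key:sk-lf-your-secret-key" | base64)
echo "Authorization: Basic ${LANGFUSE_AUTH}"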

Deploy the OpenTelemetry Collector

1. Add the OpenTelemetry Helm repository

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

2. Create the OpenTelemetry Collector configuration file

Create a file named otel-collector-values.yaml:

mode: deployment

config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

  processors:
    batch:
      timeout: 10s
      send_batch_size: 1024
    memory_limiter:
      check_interval: 1s
      limit_mib: 1000
      spike_limit_mib: 200
    resource:
      attributes:
        - action: insert
          key: environment
          value: production

  exporters:
    # The debug exporter replaces the deprecated (now removed) logging exporter
    debug:
      verbosity: detailed
    otlphttp:
      endpoint: "http://langfuse.langfuse.svc.cluster.local:3000/api/public/otel"
      headers:
        # Langfuse's OTLP endpoint uses HTTP Basic auth:
        # base64-encode "<public-key>:<secret-key>"
        Authorization: "Basic your-base64-encoded-api-keys"

  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, resource, batch]
        exporters: [otlphttp, debug]
      # Langfuse's OTLP endpoint currently ingests traces only, so metrics and
      # logs go to the debug exporter here; wire in other backends as needed.
      metrics:
        receivers: [otlp]
        processors: [memory_limiter, resource, batch]
        exporters: [debug]
      logs:
        receivers: [otlp]
        processors: [memory_limiter, resource, batch]
        exporters: [debug]

serviceAccount:
  create: true
  annotations: {}
  name: ""

service:
  type: ClusterIP
  ports:
    - name: otlp-grpc
      port: 4317
      protocol: TCP
      targetPort: 4317
    - name: otlp-http
      port: 4318
      protocol: TCP
      targetPort: 4318

resources:
  limits:
    cpu: 1
    memory: 2Gi
  requests:
    cpu: 200m
    memory: 400Mi

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8888"

3. Install the OpenTelemetry Collector

helm install otel-collector open-telemetry/opentelemetry-collector -f otel-collector-values.yaml -n otel --create-namespace

4. Verify the Collector deployment

kubectl get pods -n otel
kubectl logs -f deployment/otel-collector -n otel

Configure LLM Services to Send Telemetry

For vLLM services

If you serve models with vLLM, integrate the OpenTelemetry SDK into the service. An example configuration:

  1. Add environment variables to the vLLM Deployment or StatefulSet:
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.otel.svc.cluster.local:4317"
  - name: OTEL_SERVICE_NAME
    value: "llm-inference-service"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "service.namespace=ai,service.version=v1,deployment.environment=production"
  2. Initialize OpenTelemetry in the Python code:
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Initialize OpenTelemetry; Resource.create({}) also merges in OTEL_RESOURCE_ATTRIBUTES
resource = Resource.create({})
tracer_provider = TracerProvider(resource=resource)
trace.set_tracer_provider(tracer_provider)

# Configure the OTLP exporter (endpoint comes from OTEL_EXPORTER_OTLP_ENDPOINT)
otlp_exporter = OTLPSpanExporter()
span_processor = BatchSpanProcessor(otlp_exporter)
tracer_provider.add_span_processor(span_processor)

# Auto-instrument outgoing HTTP requests
RequestsInstrumentor().instrument()

# Get a tracer
tracer = trace.get_tracer(__name__)

# Use inside the LLM inference path
def generate_text(prompt):
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("llm.prompt", prompt)
        span.set_attribute("llm.model", "DeepSeek-R1-Distill-Qwen-1.5B")

        # Run the actual inference
        result = actual_generate_function(prompt)

        span.set_attribute("llm.completion", result)
        span.set_attribute("llm.tokens", len(result))

        return result
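
These snippets assume the OpenTelemetry Python packages are installed in the serving image; at minimum:

pip install opentelemetry-sdk \
            opentelemetry-exporter-otlp-proto-grpc \
            opentelemetry-instrumentation-requests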

Injection via the OpenTelemetry Operator

For languages with OpenTelemetry auto-instrumentation support, the OpenTelemetry Operator can inject the SDK configuration automatically:

  1. Enable automatic injection for the namespace:
kubectl annotate namespace your-llm-namespace instrumentation.opentelemetry.io/inject-sdk=true
  2. Add annotations to the workload's pod template (see the Instrumentation sketch after this list):
annotations:
  instrumentation.opentelemetry.io/inject-sdk: "true"
  instrumentation.opentelemetry.io/container-names: "your-llm-container"
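
Note that inject-sdk only takes effect when the OpenTelemetry Operator is installed and an Instrumentation resource exists in the target namespace; a minimal sketch (names are placeholders):

kubectl apply -f - <<EOF
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: llm-instrumentation
  namespace: your-llm-namespace
spec:
  exporter:
    endpoint: http://otel-collector.otel.svc.cluster.local:4317
  propagators:
    - tracecontext
    - baggage
EOF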

Verify the Deployment

1. Check the OpenTelemetry Collector logs

kubectl logs -f deployment/otel-collector -n otel

You should see log entries showing telemetry being successfully received and exported.

2. Confirm that Langfuse receives data

Log in to the Langfuse UI and open the Traces page to confirm that telemetry from the LLM service is visible.

3. Test the end-to-end flow

Run an LLM inference request, then look up the corresponding trace in Langfuse. If no real traffic is available yet, you can hand-craft a test span as sketched below.
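
A minimal OTLP/HTTP smoke test (all span fields and the smoke-test service name are arbitrary placeholder values; the Service name created by the chart may differ from otel-collector):

kubectl port-forward svc/otel-collector 4318:4318 -n otel &
curl -s -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"smoke-test"}}]},"scopeSpans":[{"spans":[{"traceId":"5b8efff798038103d269b633813fc60c","spanId":"eee19b7ec3c1b174","name":"llm.generate","kind":1,"startTimeUnixNano":"1700000000000000000","endTimeUnixNano":"1700000001000000000"}]}]}]}'

The span should then show up both in the Collector's debug output and in Langfuse.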

Monitoring and Alerting

Monitoring the OpenTelemetry Collector

  1. Set up Prometheus monitoring for the Collector:
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: otel-collector
  namespace: otel
spec:
  endpoints:
  - port: metrics
    interval: 15s
  selector:
    matchLabels:
      app.kubernetes.io/name: opentelemetry-collector
EOF
  2. Create basic alerting rules:
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: otel-collector-alerts
  namespace: otel
spec:
  groups:
  - name: otel-collector
    rules:
    - alert: OtelCollectorDown
      expr: up{job="otel-collector"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "OpenTelemetry Collector is down"
        description: "OpenTelemetry Collector has been down for more than 5 minutes."
    - alert: OtelCollectorHighMemory
      expr: process_resident_memory_bytes{job="otel-collector"} > 1.8e+9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "OpenTelemetry Collector high memory usage"
        description: "OpenTelemetry Collector is using more than 1.8GB of memory."
EOF

Monitoring Langfuse

Set up a Langfuse availability alert (the probe_success metric below assumes a Blackbox Exporter probe job named langfuse):

kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: langfuse-alerts
  namespace: langfuse
spec:
  groups:
  - name: langfuse
    rules:
    - alert: LangfuseDown
      expr: probe_success{job="langfuse"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Langfuse is down"
        description: "Langfuse has been down for more than 5 minutes."
EOF

Troubleshooting

OpenTelemetry Collector issues

  1. Collector fails to start
     • Check the configuration file syntax
     • Verify that the resource limits are reasonable
     • Inspect the logs: kubectl logs -f deployment/otel-collector -n otel

  2. Collector cannot connect to Langfuse
     • Verify that the Langfuse service is reachable: kubectl exec -it deployment/otel-collector -n otel -- curl -v langfuse.langfuse.svc.cluster.local:3000/api/public/health
     • Check that the API keys are configured correctly

  3. Memory usage too high
     • Tune the memory_limiter processor configuration
     • Raise the resource limits
     • Consider horizontal autoscaling

Langfuse issues

  1. Database connection problems
     • Check the PostgreSQL connection configuration
     • Verify that the database is healthy: kubectl exec -it statefulset/langfuse-postgresql -n langfuse -- pg_isready

  2. API errors
     • Check the Langfuse logs: kubectl logs -f deployment/langfuse -n langfuse
     • Verify the API key permissions

LLM service issues

  1. No telemetry is being sent
     • Confirm that the OpenTelemetry SDK initializes correctly
     • Verify the environment variable configuration
     • Check that network policies allow traffic to the Collector (a quick connectivity check follows)
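
A quick connectivity check from inside the LLM namespace (namespace and image are placeholders; an empty OTLP request is enough to prove reachability):

kubectl run otel-conncheck --rm -it --image=curlimages/curl -n your-llm-namespace -- \
  curl -sv -X POST http://otel-collector.otel.svc.cluster.local:4318/v1/traces \
  -H "Content-Type: application/json" -d '{}'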

Best Practices

Scalability considerations

  1. Scale the Collector horizontally

For high-traffic scenarios, configure an HPA:

kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector
  namespace: otel
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
EOF
  2. Use Collector agent mode

For large clusters, consider deploying the Collector as a DaemonSet (a values-file sketch follows the command):

helm install otel-agent open-telemetry/opentelemetry-collector \
  --set mode=daemonset \
  -f otel-agent-values.yaml \
  -n otel
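
A minimal otel-agent-values.yaml sketch, assuming the agents only receive OTLP locally and forward everything to the gateway Deployment installed earlier:

config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
  exporters:
    otlp:
      # Forward to the central gateway Collector
      endpoint: otel-collector.otel.svc.cluster.local:4317
      tls:
        insecure: true
  service:
    pipelines:
      traces:
        receivers: [otlp]
        exporters: [otlp]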

Security best practices

  1. Store sensitive credentials in Kubernetes Secrets
kubectl create secret generic langfuse-api-keys \
  --from-literal=auth="$(echo -n 'your-public-key:your-secret-key' | base64)" \
  -n otel

Then reference it in the Collector configuration:

exporters:
  otlphttp:
    endpoint: "http://langfuse.langfuse.svc.cluster.local:3000/api/public/otel"
    headers:
      Authorization: "Basic ${env:LANGFUSE_AUTH}"
  2. Configure network policies

Restrict network access between the Collector and Langfuse:

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-otel-to-langfuse
  namespace: langfuse
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: langfuse
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: otel
      podSelector:
        matchLabels:
          app.kubernetes.io/name: opentelemetry-collector
    ports:
    - protocol: TCP
      port: 3000
EOF

Data collection best practices

  1. Standardize LLM attributes

Use consistent attribute names across all LLM services:

  • llm.model - model name
  • llm.prompt - input prompt
  • llm.completion - generated text
  • llm.tokens - number of tokens used
  • llm.latency - inference latency
  • llm.temperature - sampling temperature

  2. Capture appropriate context

Capture enough context to be useful, but avoid recording sensitive data (a sanitizer sketch follows the example):

def process_llm_request(request_data):
    with tracer.start_as_current_span("llm.request") as span:
        # Add basic attributes
        span.set_attribute("llm.model", request_data.get("model"))
        span.set_attribute("llm.temperature", request_data.get("temperature", 0.7))

        # Sanitize the prompt content
        prompt = request_data.get("prompt", "")
        sanitized_prompt = sanitize_sensitive_data(prompt)
        span.set_attribute("llm.prompt", sanitized_prompt)

        # Handle the request...
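
sanitize_sensitive_data is not defined above; a minimal regex-based sketch (illustrative only, production systems should use a dedicated PII-detection library):

import re

def sanitize_sensitive_data(text: str) -> str:
    """Mask common PII patterns (emails, long digit runs) before tracing."""
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    text = re.sub(r"\b\d{6,}\b", "[NUMBER]", text)
    return text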

Conclusion

With this guide you have deployed the OpenTelemetry Collector on Kubernetes using Helm and integrated it with Langfuse to collect and analyze LLM telemetry. The setup provides end-to-end observability for LLM services, letting you monitor performance, trace request flows, and tune model behavior.

As your LLM workloads grow, this architecture can be extended with additional backends, more sophisticated processing pipelines, and richer visualizations.
