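OpenTelemetry: The Observability Standard

What is OpenTelemetry

OpenTelemetry (OTEL) is the CNCF's observability standard project. It unifies the data specification and the collection SDKs for the three pillars of observability: metrics, logs, and traces.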

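Application (OTEL SDK)
  └──► OTEL Collector (receive, process, export)
              ├──► Prometheus (metrics)
              ├──► Jaeger (traces)
              ├──► Elasticsearch (logs)
              └──► any OTLP-compatible backend

Advantages:
  - Instrument once, export to multiple backends
  - Avoids vendor lock-in
  - A unified data model
  - Auto-instrumentation (no changes to business code)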

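OTEL Collector architecture

Receivers (ingest) → Processors (process) → Exporters (export)

Receivers:
  - otlp (gRPC/HTTP)
  - prometheus (scrape)
  - jaeger
  - zipkin
  - kafka
  - filelog

Processors:
  - batch (send in batches)
  - memory_limiter (cap memory usage)
  - filter (drop unwanted data)
  - transform (rewrite data)
  - tail_sampling (tail-based sampling)
  - resource (add resource attributes)

Exporters:
  - otlp (send to another Collector or a backend)
  - prometheus (expose /metrics)
  - jaeger (removed from recent Collector releases; Jaeger now ingests OTLP directly)
  - elasticsearch
  - kafka
  - logging (for debugging; superseded by the debug exporter in recent releases)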

Collector configuration

yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
  
  filter/drop_debug:
    traces:
      span:
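        # Drop health-check spans. http.url usually holds the full URL, so adjust the
        # attribute (for example http.target or url.path) to what your instrumentation emits.
        - 'attributes["http.url"] == "/health"'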
  
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-policy
        type: latency
        latency: {threshold_ms: 500}
      - name: sample-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

exporters:
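  # Requires a Collector release that still ships the jaeger exporter;
  # on newer releases, point an otlp exporter at Jaeger instead.
  jaeger: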
    endpoint: jaeger:14250
    tls:
      insecure: true
  
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: otel
  
  elasticsearch:
    endpoints: ["http://es:9200"]
    logs_index: otel-logs
    traces_index: otel-traces
  
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop_debug, tail_sampling, batch]
      exporters: [jaeger, otlp/tempo]
    
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheus]
    
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [elasticsearch]
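
The three tail_sampling policies above combine as a logical OR: a trace is exported if any one policy votes to keep it. The sketch below models that decision in plain Java. It is an illustration, not the Collector's implementation: TraceSummary, Policy, and the hash-based probabilistic check are hypothetical stand-ins, and the handling of a trace that sits exactly on the latency threshold is an assumption.

```java
import java.util.List;

/** Illustrative model of a tail-sampling decision; not the Collector's own code. */
final class TailSamplingSketch {

    /** Hypothetical summary of a finished trace, evaluated after decision_wait has elapsed. */
    static final class TraceSummary {
        final String traceId;
        final boolean hasError;
        final long durationMs;

        TraceSummary(String traceId, boolean hasError, long durationMs) {
            this.traceId = traceId;
            this.hasError = hasError;
            this.durationMs = durationMs;
        }
    }

    /** A policy votes to keep a trace (true) or abstains (false). */
    interface Policy {
        boolean keep(TraceSummary trace);
    }

    // Mirrors error-policy: keep traces that contain an ERROR status.
    static final Policy ERROR_POLICY = trace -> trace.hasError;

    // Mirrors slow-policy: keep traces slower than the threshold (500 ms in the config above).
    static Policy latencyPolicy(long thresholdMs) {
        return trace -> trace.durationMs > thresholdMs;
    }

    // Mirrors sample-policy: keep roughly `percentage` percent of traces.
    // Assumption: a hash of the trace ID stands in for the Collector's sampling function.
    static Policy probabilisticPolicy(int percentage) {
        return trace -> Math.floorMod(trace.traceId.hashCode(), 100) < percentage;
    }

    /** A trace is exported as soon as one policy votes to keep it. */
    static boolean shouldExport(TraceSummary trace, List<Policy> policies) {
        return policies.stream().anyMatch(policy -> policy.keep(trace));
    }

    public static void main(String[] args) {
        List<Policy> policies = List.of(ERROR_POLICY, latencyPolicy(500), probabilisticPolicy(10));
        TraceSummary failed = new TraceSummary("trace-a", true, 40);
        TraceSummary slow = new TraceSummary("trace-b", false, 1200);
        System.out.println("failed trace kept: " + shouldExport(failed, policies));
        System.out.println("slow trace kept:   " + shouldExport(slow, policies));
    }
}
```

Because the decision needs the whole trace, every span of a trace has to reach the same Collector instance before decision_wait expires.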

Using the Java SDK

Auto-instrumentation (recommended)

bash
# Download the OTEL Java agent
wget https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

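# Attach the agent when starting the application (no code changes needed).
# Port 4317 is the Collector's gRPC receiver; agent 2.x defaults to http/protobuf,
# so the protocol is set to grpc explicitly.
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=order-service \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
  -Dotel.exporter.otlp.protocol=grpc \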
  -Dotel.traces.sampler=parentbased_traceidratio \
  -Dotel.traces.sampler.arg=0.1 \
  -Dotel.logs.exporter=otlp \
  -Dotel.metrics.exporter=otlp \
  -jar app.jar
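
The sampler settings above (parentbased_traceidratio with argument 0.1) mean that a span with a parent follows the parent's decision, while a root span is kept for roughly 10% of trace IDs. The following is a minimal sketch of that logic in plain Java. The class, method, and parameter names are hypothetical, and the threshold arithmetic is a simplification rather than the SDK's exact algorithm.

```java
/** Illustrative model of a parent-based, trace-ID-ratio sampler; not the SDK's own code. */
final class ParentBasedRatioSketch {

    /**
     * @param parentSampled null for a root span, otherwise the parent's sampled flag
     * @param traceIdHex    trace ID as 32 lowercase hex characters
     * @param ratio         the sampler argument, between 0.0 and 1.0
     */
    static boolean shouldSample(Boolean parentSampled, String traceIdHex, double ratio) {
        if (parentSampled != null) {
            return parentSampled; // child spans inherit the parent's decision
        }
        if (ratio >= 1.0) {
            return true;
        }
        if (ratio <= 0.0) {
            return false;
        }
        // Root span: derive a deterministic decision from the low 64 bits of the trace ID,
        // so every service reaches the same verdict for the same trace.
        long low64 = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        long nonNegative = low64 & Long.MAX_VALUE; // clear the sign bit
        long threshold = (long) (ratio * Long.MAX_VALUE);
        return nonNegative < threshold;
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println("root span:             " + shouldSample(null, traceId, 0.1));
        System.out.println("child, parent sampled: " + shouldSample(Boolean.TRUE, traceId, 0.1));
    }
}
```

Deriving the decision from the trace ID rather than from a random number keeps a trace either fully sampled or fully dropped across services.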

Frameworks covered by auto-instrumentation (partial list):

  • Spring Boot / Spring MVC / Spring WebFlux
  • JDBC / Hibernate / MyBatis
  • Kafka / RabbitMQ / RocketMQ
  • Redis (Jedis/Lettuce)
  • gRPC / Dubbo
  • HTTP clients (OkHttp/HttpClient/RestTemplate/Feign)

Manual instrumentation

java
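// Imports assume the opentelemetry-api, opentelemetry-sdk, and opentelemetry-exporter-otlp artifacts.
// ResourceAttributes comes from the semantic-conventions artifact; its package varies by version.
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

// Initialize the SDK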
SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
    .addSpanProcessor(BatchSpanProcessor.builder(
        OtlpGrpcSpanExporter.builder()
            .setEndpoint("http://otel-collector:4317")
            .build()
    ).build())
    .setResource(Resource.getDefault().merge(
        Resource.create(Attributes.of(
            ResourceAttributes.SERVICE_NAME, "order-service",
            ResourceAttributes.SERVICE_VERSION, "1.2.0"
        ))
    ))
    .build();

OpenTelemetrySdk openTelemetry = OpenTelemetrySdk.builder()
    .setTracerProvider(tracerProvider)
    .buildAndRegisterGlobal();

// Use the Tracer
Tracer tracer = openTelemetry.getTracer("order-service", "1.2.0");

Span span = tracer.spanBuilder("processOrder")
    .setAttribute("order.id", orderId)
    .startSpan();

try (Scope scope = span.makeCurrent()) {
    processOrder(orderId);
} catch (Exception e) {
    span.recordException(e);
    span.setStatus(StatusCode.ERROR);
    throw e;
} finally {
    span.end();
}

Custom metrics

java
// Micrometer + OTEL bridge (recommended for Spring Boot)
@Bean
MeterRegistry meterRegistry(OpenTelemetry openTelemetry) {
    return OpenTelemetryMeterRegistry.builder(openTelemetry).build();
}

// Or use the OTEL Metrics API directly
Meter meter = openTelemetry.getMeter("order-service");

LongCounter orderCounter = meter.counterBuilder("orders.created")
    .setDescription("Total orders created")
    .setUnit("1")
    .build();

DoubleHistogram latencyHistogram = meter.histogramBuilder("order.processing.duration")
    .setDescription("Order processing duration")
    .setUnit("ms")
    .build();

// Record measurements
orderCounter.add(1, Attributes.of(AttributeKey.stringKey("status"), "success"));
latencyHistogram.record(150.0, Attributes.of(AttributeKey.stringKey("service"), "order"));
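
To make the histogram recording above more concrete, here is a small sketch of how an explicit-bucket histogram could aggregate values such as the 150.0 ms measurement. It uses only the Java standard library. The class name and the bucket boundaries (100, 250, 500) are hypothetical and chosen for illustration; they are not the SDK's defaults. Boundaries are treated as inclusive upper bounds.

```java
import java.util.Arrays;

/** Illustrative explicit-bucket histogram; not the SDK's aggregation code. */
final class HistogramBucketSketch {
    private final double[] upperBounds; // sorted, inclusive upper bounds
    private final long[] counts;        // one extra slot for values above the last bound
    private double sum;

    HistogramBucketSketch(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        this.counts = new long[upperBounds.length + 1];
    }

    /** Adds one measurement to the first bucket whose upper bound is >= value. */
    void record(double value) {
        int bucket = 0;
        while (bucket < upperBounds.length && value > upperBounds[bucket]) {
            bucket++;
        }
        counts[bucket]++;
        sum += value;
    }

    long[] counts() {
        return counts.clone();
    }

    double sum() {
        return sum;
    }

    public static void main(String[] args) {
        HistogramBucketSketch latency = new HistogramBucketSketch(100, 250, 500);
        latency.record(150.0); // falls in (100, 250]
        latency.record(100.0); // falls in (-inf, 100]
        latency.record(900.0); // falls in the overflow bucket
        System.out.println(Arrays.toString(latency.counts())); // prints [1, 1, 0, 1]
    }
}
```

Only bucket counts and the sum are exported, which is why percentile accuracy on the backend depends on how well the boundaries match the real latency range.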

Kubernetes deployment

yaml
# OTEL Operator (recommended)
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
spec:
  mode: daemonset  # one Collector per node (the Operator expects the lowercase value)
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch: {}
    exporters:
      jaeger:
        endpoint: jaeger:14250
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [jaeger]

---
# Automatically inject the OTEL agent
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation
spec:
  exporter:
    # The Operator exposes a Collector resource as Service <name>-collector
    endpoint: http://otel-collector-collector:4317
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest

yaml
# Enable auto-injection with a Pod annotation
metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: "true"
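
The tracecontext propagator listed in the Instrumentation resource carries trace context between services in the W3C traceparent HTTP header, which has the form version-traceid-spanid-flags. The sketch below formats and parses that header with the Java standard library only. The class and method names are hypothetical, and it handles only version 00; in a real service the injected agent does this work.

```java
/** Illustrative W3C traceparent formatter and parser (version 00 only). */
final class TraceParentSketch {

    /** Builds a header such as 00-<32 hex trace id>-<16 hex span id>-01. */
    static String format(String traceId, String spanId, boolean sampled) {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    /** Returns {traceId, spanId, flags}, or null when the header is malformed. */
    static String[] parse(String header) {
        String[] parts = header.split("-");
        if (parts.length != 4
                || !parts[0].equals("00")
                || !parts[1].matches("[0-9a-f]{32}")
                || !parts[2].matches("[0-9a-f]{16}")
                || !parts[3].matches("[0-9a-f]{2}")) {
            return null;
        }
        // All-zero trace and span IDs are defined as invalid.
        if (parts[1].equals("0".repeat(32)) || parts[2].equals("0".repeat(16))) {
            return null;
        }
        return new String[] {parts[1], parts[2], parts[3]};
    }

    /** The lowest bit of the flags byte is the sampled flag. */
    static boolean isSampled(String flagsHex) {
        return (Integer.parseInt(flagsHex, 16) & 0x01) != 0;
    }

    public static void main(String[] args) {
        String header = format("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true);
        String[] fields = parse(header);
        System.out.println(header);
        System.out.println("sampled: " + isSampled(fields[2]));
    }
}
```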

Correlating traces, metrics, and logs

OTEL's core value lies in linking the three pillars together:

java
// Trace IDs are injected into logs automatically (via the OTEL logging instrumentation)
// The MDC automatically contains trace_id and span_id
log.info("Processing order {}", orderId);
// Output: {"message":"Processing order 123","trace_id":"abc123","span_id":"def456"}

// In Grafana:
// 1. View the logs in Grafana Explore
// 2. Click the trace_id to jump to Jaeger and see the full trace
// 3. In Jaeger, click a span to jump back to Grafana and see the metrics for that time range

Troubleshooting cases

Case 1: Collector out of memory

Symptom: the OTEL Collector process is killed by OOM.

Fix

yaml
processors:
  memory_limiter:
    check_interval: 1s
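    limit_mib: 512        # hard limit
    spike_limit_mib: 128  # headroom for spikes; soft limit = 512 - 128 = 384 MiB
    # Above the soft limit the Collector refuses new data, which pushes back on senders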
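
The two limits above define a soft limit (limit_mib minus spike_limit_mib) and a hard limit (limit_mib). A minimal sketch of the resulting decision follows, assuming the processor refuses data above the soft limit and additionally forces garbage collection above the hard limit. The class and enum names are hypothetical, and the exact boundary comparisons are an assumption.

```java
/** Illustrative model of the memory_limiter decision; not the Collector's own code. */
final class MemoryLimiterSketch {

    enum Action { ACCEPT, REFUSE, REFUSE_AND_FORCE_GC }

    private final long hardLimitMib;
    private final long softLimitMib;

    MemoryLimiterSketch(long limitMib, long spikeLimitMib) {
        this.hardLimitMib = limitMib;
        this.softLimitMib = limitMib - spikeLimitMib; // 512 - 128 = 384 in the config above
    }

    /** Called every check_interval with the current memory usage. */
    Action check(long usedMib) {
        if (usedMib >= hardLimitMib) {
            return Action.REFUSE_AND_FORCE_GC;
        }
        if (usedMib >= softLimitMib) {
            return Action.REFUSE; // senders see errors and are expected to retry
        }
        return Action.ACCEPT;
    }

    public static void main(String[] args) {
        MemoryLimiterSketch limiter = new MemoryLimiterSketch(512, 128);
        for (long usedMib : new long[] {300, 400, 600}) {
            System.out.println(usedMib + " MiB -> " + limiter.check(usedMib));
        }
    }
}
```

Since refusal only helps if it happens before the data is buffered, memory_limiter is normally the first processor in each pipeline, as in the configuration earlier in this document.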

Case 2: Data loss

Symptom: some traces or metrics never reach the backend.

Diagnosis

bash
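# Inspect the Collector's internal metrics (exporter and processor series)
curl -s http://otel-collector:8888/metrics | grep -E 'otelcol_(exporter|processor)'

# Key metrics:
# otelcol_exporter_sent_spans - spans sent successfully
# otelcol_exporter_send_failed_spans - spans that failed to send
# otelcol_processor_dropped_spans - spans dropped by processors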
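
To turn the counters above into a single health signal, the scraped text can be summed per metric name and reduced to a send-failure ratio. The sketch below does this with the Java standard library. It assumes the plain Prometheus text format without timestamps, the sample values are made up, and the class and method names are hypothetical. Some Collector versions append a _total suffix to counter names, so the names may need adjusting.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative parser for Collector self-metrics in Prometheus text format. */
final class ExporterMetricsSketch {

    /** Sums sample values per metric name, ignoring labels and comment lines. */
    static Map<String, Double> sumByName(String metricsText) {
        Map<String, Double> totals = new HashMap<>();
        for (String rawLine : metricsText.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            int lastSpace = line.lastIndexOf(' ');
            if (lastSpace < 0) {
                continue;
            }
            String series = line.substring(0, lastSpace).trim();
            int brace = series.indexOf('{');
            String name = brace >= 0 ? series.substring(0, brace) : series;
            double value = Double.parseDouble(line.substring(lastSpace + 1));
            totals.merge(name, value, Double::sum);
        }
        return totals;
    }

    /** Share of spans that failed to send, across all exporters. */
    static double failureRatio(Map<String, Double> totals) {
        double sent = totals.getOrDefault("otelcol_exporter_sent_spans", 0.0);
        double failed = totals.getOrDefault("otelcol_exporter_send_failed_spans", 0.0);
        double attempted = sent + failed;
        return attempted == 0 ? 0.0 : failed / attempted;
    }

    public static void main(String[] args) {
        // Made-up sample output for illustration only.
        String sample = String.join("\n",
                "# TYPE otelcol_exporter_sent_spans counter",
                "otelcol_exporter_sent_spans{exporter=\"jaeger\"} 900",
                "otelcol_exporter_sent_spans{exporter=\"otlp/tempo\"} 50",
                "otelcol_exporter_send_failed_spans{exporter=\"jaeger\"} 50");
        System.out.println("failure ratio: " + failureRatio(sumByName(sample)));
    }
}
```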

Fix

  • Raise the Collector's resource limits
  • Configure retries and a persistent queue

yaml
exporters:
  jaeger:
    retry_on_failure:
      enabled: true
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
      storage: file_storage  # persistent queue so data survives a Collector restart; the file_storage extension must also be declared and enabled
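
The retry_on_failure block above amounts to exponential backoff that gives up once max_elapsed_time is used up. The sketch below computes such a schedule in plain Java. The initial interval, multiplier, and maximum interval are hypothetical parameters introduced for illustration (the exporter has its own settings for them), and the random jitter a real implementation adds is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative exponential backoff schedule; not the exporter's own retry code. */
final class RetryBackoffSketch {

    /** Returns the wait before each retry, stopping before the total would exceed maxElapsedMs. */
    static List<Long> schedule(long initialMs, double multiplier, long maxIntervalMs, long maxElapsedMs) {
        if (initialMs <= 0 || multiplier < 1.0) {
            throw new IllegalArgumentException("initialMs must be > 0 and multiplier >= 1.0");
        }
        List<Long> waits = new ArrayList<>();
        long elapsed = 0;
        double interval = initialMs;
        while (true) {
            long wait = (long) Math.min(interval, maxIntervalMs);
            if (elapsed + wait > maxElapsedMs) {
                break; // retry budget exhausted: the batch is dropped
            }
            waits.add(wait);
            elapsed += wait;
            interval *= multiplier;
        }
        return waits;
    }

    public static void main(String[] args) {
        // 300 s budget as in max_elapsed_time above; the other values are assumptions.
        List<Long> waits = schedule(5_000, 1.5, 30_000, 300_000);
        System.out.println(waits.size() + " retries: " + waits);
    }
}
```

While an export is being retried, new batches wait in sending_queue, so queue_size has to cover the expected outage at the incoming data rate.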
