OpenTelemetry — The Observability Standard
What is OpenTelemetry
OpenTelemetry (OTEL) is the CNCF's observability standard project. It unifies the data specifications and collection SDKs for the three pillars of observability: Metrics, Logs, and Traces.
Application (OTEL SDK)
  └──► OTEL Collector (receive, process, export)
         ├──► Prometheus (Metrics)
         ├──► Jaeger (Traces)
         ├──► Elasticsearch (Logs)
         └──► any OTLP-compatible backend
Advantages:
- Instrument once, export to multiple backends
- No vendor lock-in
- A unified data model
- Automatic instrumentation (no changes to business code)

OTEL Collector Architecture
Receivers (receive) → Processors (process) → Exporters (export)
Receivers:
- otlp (gRPC/HTTP)
- prometheus (scrape)
- jaeger
- zipkin
- kafka
- filelog
Processors:
- batch (batched sending)
- memory_limiter (memory limiting)
- filter (filtering)
- transform (transformation)
- tail_sampling (tail-based sampling)
- resource (add resource attributes)
Exporters:
- otlp (send to another Collector or a backend)
- prometheus (expose /metrics)
- jaeger
- elasticsearch
- kafka
- logging (for debugging; newer Collector releases replace it with the debug exporter)

Collector Configuration
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
  filter/drop_debug:
    traces:
      span:
        - 'attributes["http.url"] == "/health"'  # drop health-check spans
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-policy
        type: latency
        latency: {threshold_ms: 500}
      - name: sample-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

exporters:
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: otel
  elasticsearch:
    endpoints: ["http://es:9200"]
    logs_index: otel-logs
    traces_index: otel-traces
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop_debug, tail_sampling, batch]
      exporters: [jaeger, otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [elasticsearch]
```

Java SDK Usage
Automatic Instrumentation (Recommended)
```bash
# Download the OTEL Java agent
wget https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

# Attach the agent at startup (no code changes required)
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=order-service \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
  -Dotel.traces.sampler=parentbased_traceidratio \
  -Dotel.traces.sampler.arg=0.1 \
  -Dotel.logs.exporter=otlp \
  -Dotel.metrics.exporter=otlp \
  -jar app.jar
```

Frameworks covered by auto-instrumentation (a partial list):
- Spring Boot / Spring MVC / Spring WebFlux
- JDBC / Hibernate / MyBatis
- Kafka / RabbitMQ / RocketMQ
- Redis (Jedis/Lettuce)
- gRPC / Dubbo
- HTTP clients (OkHttp/HttpClient/RestTemplate/Feign)
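The same settings can also be supplied through the SDK's standard autoconfiguration environment variables, which is often more convenient in containers. A sketch of the env-var equivalents of the `-D` flags above (`app.jar` is a placeholder):

```shell
# Env-var equivalents of the -D system properties (standard OTEL autoconfiguration names)
export OTEL_SERVICE_NAME=order-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1
export OTEL_LOGS_EXPORTER=otlp
export OTEL_METRICS_EXPORTER=otlp
# java -javaagent:opentelemetry-javaagent.jar -jar app.jar
```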
Manual Instrumentation
```java
// Initialize the SDK
SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
    .addSpanProcessor(BatchSpanProcessor.builder(
        OtlpGrpcSpanExporter.builder()
            .setEndpoint("http://otel-collector:4317")
            .build()
    ).build())
    .setResource(Resource.getDefault().merge(
        Resource.create(Attributes.of(
            ResourceAttributes.SERVICE_NAME, "order-service",
            ResourceAttributes.SERVICE_VERSION, "1.2.0"
        ))
    ))
    .build();

OpenTelemetrySdk openTelemetry = OpenTelemetrySdk.builder()
    .setTracerProvider(tracerProvider)
    .buildAndRegisterGlobal();

// Use the Tracer
Tracer tracer = openTelemetry.getTracer("order-service", "1.2.0");
Span span = tracer.spanBuilder("processOrder")
    .setAttribute("order.id", orderId)
    .startSpan();
try (Scope scope = span.makeCurrent()) {
    processOrder(orderId);
} catch (Exception e) {
    span.recordException(e);
    span.setStatus(StatusCode.ERROR);
    throw e;
} finally {
    span.end();
}
```

Custom Metrics
```java
// Micrometer + OTEL bridge (recommended for Spring Boot)
@Bean
MeterRegistry meterRegistry(OpenTelemetry openTelemetry) {
    return OpenTelemetryMeterRegistry.builder(openTelemetry).build();
}

// Or use the OTEL Metrics API directly
Meter meter = openTelemetry.getMeter("order-service");
LongCounter orderCounter = meter.counterBuilder("orders.created")
    .setDescription("Total orders created")
    .setUnit("1")
    .build();
DoubleHistogram latencyHistogram = meter.histogramBuilder("order.processing.duration")
    .setDescription("Order processing duration")
    .setUnit("ms")
    .build();

// Record measurements
orderCounter.add(1, Attributes.of(AttributeKey.stringKey("status"), "success"));
latencyHistogram.record(150.0, Attributes.of(AttributeKey.stringKey("service"), "order"));
```

Kubernetes Deployment
```yaml
# OTEL Operator (recommended)
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
spec:
  mode: daemonset  # one Collector per node
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch: {}
    exporters:
      jaeger:
        endpoint: jaeger:14250
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [jaeger]
---
# Auto-inject the OTEL agent
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
```

```yaml
# Enable auto-injection via a Pod annotation
metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: "true"
```

Correlating Traces, Metrics, and Logs
The core value of OTEL is tying the three pillars together:
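For example, with the Java agent's logger-MDC instrumentation enabled, `trace_id` and `span_id` appear as MDC keys, so a Logback pattern can print them on every log line. A sketch (the appender name and layout are illustrative, not prescribed):

```xml
<!-- logback.xml sketch: trace_id/span_id are populated in the MDC by the OTEL Java agent -->
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
  <encoder>
    <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level trace_id=%X{trace_id} span_id=%X{span_id} %logger - %msg%n</pattern>
  </encoder>
</appender>
```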
```java
// The Trace ID is injected into logs automatically (via the OTEL Log Bridge)
// The MDC automatically contains trace_id and span_id
log.info("Processing order {}", orderId);
// Output: {"message":"Processing order 123","trace_id":"abc123","span_id":"def456"}

// In Grafana:
// 1. Inspect logs in Grafana Explore
// 2. Click the trace_id → jump to Jaeger to see the full trace
// 3. Click a span in Jaeger → jump to Grafana to see the Metrics for that time window
```

Troubleshooting Cases
Case 1: Collector Out of Memory
Symptom: the OTEL Collector process gets OOM-killed.
Fix:
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512        # hard limit
    spike_limit_mib: 128  # spike allowance
# When the limit is exceeded, the Collector refuses new data and signals backpressure
```

Case 2: Data Loss
Symptom: some traces or metrics never reach the backend.
Triage:
```bash
# Inspect the Collector's internal metrics
curl http://otel-collector:8888/metrics | grep otelcol_exporter

# Key metrics:
# otelcol_exporter_sent_spans        - spans sent successfully
# otelcol_exporter_send_failed_spans - spans that failed to send
# otelcol_processor_dropped_spans    - spans dropped by processors
```

Fix:
- Increase the Collector's resource limits
- Configure retries and a persistent queue
```yaml
exporters:
  jaeger:
    retry_on_failure:
      enabled: true
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
      storage: file_storage  # persistent queue, so data survives Collector restarts
```
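Note that `storage: file_storage` refers to the `file_storage` extension (shipped in the contrib distribution), which must also be declared and enabled under `service`. A minimal sketch, with an assumed storage directory:

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/storage  # must exist and be writable by the Collector process

service:
  extensions: [file_storage]
```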