
etcd — Strongly Consistent KV Store

Architecture and the Raft Protocol

etcd is a distributed key-value store originally developed at CoreOS. It relies on the Raft consensus protocol for strong consistency and is the core datastore that Kubernetes depends on.

etcd cluster (3 nodes):
  Node1 (Leader) ──[AppendEntries]──► Node2 (Follower)
                 ──[AppendEntries]──► Node3 (Follower)
  
  Write path:
  1. Client sends the write to the Leader
  2. Leader appends the entry to its log and replicates it to the Followers
  3. Once a majority of nodes (the quorum) acknowledge, the Leader commits
  4. Success is returned to the client
  
  Reads (linearizable):
  1. Client reads from the Leader
  2. Leader confirms it is still the Leader (ReadIndex mechanism)
  3. The latest data is returned
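The commit rule in the write path hinges on quorum size. A minimal sketch in plain Go (no etcd dependency; `quorum` and `faultTolerance` are illustrative helpers, not etcd APIs) of how majority size determines fault tolerance:

```go
package main

import "fmt"

// quorum returns the number of nodes that must acknowledge a write
// before the Raft leader may commit it: a strict majority.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

// faultTolerance returns how many nodes can fail while the cluster
// can still reach quorum.
func faultTolerance(clusterSize int) int {
	return clusterSize - quorum(clusterSize)
}

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("nodes=%d quorum=%d tolerates=%d failures\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

This is why etcd clusters use odd sizes: going from 3 to 4 nodes raises the quorum from 2 to 3 without tolerating any additional failures.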

Raft Elections

Term:
  Every election starts a new term
  Each node casts at most one vote per term
  
Election trigger:
  A Follower receives no heartbeat for longer than its election timeout (randomized, 150~300ms in the Raft paper)
  → becomes a Candidate and starts an election
  → wins a majority of votes → becomes the Leader
  → fails to win a majority → waits for the next election

Basic Operations

bash
# Set key-value pairs
etcdctl put /config/db/url "jdbc:mysql://mysql:3306/orders"
etcdctl put /config/db/password "secret123"

# Read
etcdctl get /config/db/url
etcdctl get /config/ --prefix  # prefix query

# Delete
etcdctl del /config/db/password

# Watch (subscribe to changes)
etcdctl watch /config/ --prefix

# Transaction (atomic compare-and-swap): non-interactive etcdctl txn reads
# compares, success ops, and failure ops as blank-line-separated sections
etcdctl txn <<EOF
create("/lock/order") = "0"

put /lock/order "node1"

get /lock/order

EOF
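The transaction above is a compare-and-swap: take the lock only if the key does not exist yet, otherwise read back the current holder. A local sketch of those if/then/else semantics (plain Go map, no etcd; `txnLock` and `store` are illustrative names, not etcd APIs):

```go
package main

import "fmt"

// txnLock mimics the etcdctl transaction: if key is absent, set it to
// owner (success branch); otherwise return the current holder (failure
// branch). In etcd the compare and the chosen branch run atomically.
func txnLock(store map[string]string, key, owner string) (acquired bool, holder string) {
	if _, exists := store[key]; !exists { // compare: key not created yet
		store[key] = owner // success branch: put
		return true, owner
	}
	return false, store[key] // failure branch: get
}

func main() {
	store := map[string]string{}
	ok, holder := txnLock(store, "/lock/order", "node1")
	fmt.Println(ok, holder) // true node1
	ok, holder = txnLock(store, "/lock/order", "node2")
	fmt.Println(ok, holder) // false node1
}
```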

Leases and Distributed Locks

bash
# Create a lease (TTL=30s); the second field of the output is the lease ID
LEASE_ID=$(etcdctl lease grant 30 | awk '{print $2}')

# Attach a key to the lease (the key is deleted automatically when the lease expires)
etcdctl put /lock/order "node1" --lease=$LEASE_ID

# Keep the lease alive (renews it periodically)
etcdctl lease keep-alive $LEASE_ID

# Revoke the lease (immediately deletes all attached keys)
etcdctl lease revoke $LEASE_ID

Distributed Locks with the Go Client

go
import (
    "context"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
    "go.etcd.io/etcd/client/v3/concurrency"
)

func acquireLock(client *clientv3.Client, key string) error {
    // Create a Session (manages a Lease and its keep-alives internally)
    session, err := concurrency.NewSession(client, concurrency.WithTTL(30))
    if err != nil {
        return err
    }
    defer session.Close()
    
    // Create the mutex
    mutex := concurrency.NewMutex(session, "/locks/"+key)
    
    // Lock (blocks until acquired or the context expires)
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    
    if err := mutex.Lock(ctx); err != nil {
        return fmt.Errorf("failed to acquire lock: %w", err)
    }
    defer mutex.Unlock(context.Background())
    
    // Critical section
    doWork()
    return nil
}

Watch Mechanism

Watch is one of etcd's core features; Kubernetes' Informer machinery is built on top of it:

go
// Watch all changes under a prefix
watchChan := client.Watch(context.Background(), "/services/", clientv3.WithPrefix())

for resp := range watchChan {
    for _, event := range resp.Events {
        switch event.Type {
        case clientv3.EventTypePut:
            fmt.Printf("PUT: %s = %s\n", event.Kv.Key, event.Kv.Value)
        case clientv3.EventTypeDelete:
            fmt.Printf("DELETE: %s\n", event.Kv.Key)
        }
    }
}

Watch and Revision

go
// Start watching from a known Revision (so no historical events are missed)
resp, err := client.Get(ctx, "/services/", clientv3.WithPrefix())
if err != nil {
    return err
}
startRevision := resp.Header.Revision

// Watch from startRevision+1 onward
watchChan := client.Watch(ctx, "/services/",
    clientv3.WithPrefix(),
    clientv3.WithRev(startRevision+1),
)

Role in Kubernetes

Kubernetes stores all cluster state in etcd:
  /registry/pods/default/my-pod
  /registry/services/default/my-service
  /registry/deployments/default/my-deployment
  /registry/secrets/default/my-secret
  ...

The API Server is the only component that accesses etcd directly:
  kubectl → API Server → etcd
  Controller Manager → API Server → etcd (via Watch)
  Scheduler → API Server → etcd (via Watch)

Cluster Operations

Membership Management

bash
# List cluster members
etcdctl member list -w table

# Add a new member
etcdctl member add etcd4 --peer-urls=http://192.168.1.14:2380

# Remove a member
etcdctl member remove <member-id>

# Update a member's peer URL
etcdctl member update <member-id> --peer-urls=http://new-ip:2380

Backup and Restore

bash
# Back up (take a snapshot)
etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db

# Verify the snapshot
etcdctl snapshot status /backup/etcd-snapshot.db -w table

# Restore (run on every node, each with its own --name and --initial-advertise-peer-urls)
etcdctl snapshot restore /backup/etcd-snapshot.db \
    --name etcd1 \
    --initial-cluster "etcd1=http://192.168.1.11:2380,etcd2=http://192.168.1.12:2380,etcd3=http://192.168.1.13:2380" \
    --initial-cluster-token etcd-cluster-1 \
    --initial-advertise-peer-urls http://192.168.1.11:2380 \
    --data-dir /var/lib/etcd-restored

Compaction and Defragmentation

bash
# Get the current revision
REV=$(etcdctl endpoint status --write-out="json" | python3 -c "import sys,json; print(json.load(sys.stdin)[0]['Status']['header']['revision'])")

# Compact old history (keep roughly the most recent 1000 revisions)
etcdctl compact $((REV - 1000))

# Defragment (reclaims disk space after compaction)
etcdctl defrag --cluster
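On a young cluster, subtracting a fixed retention window from the current revision can go to zero or below, which is not a valid compaction target. A small sketch of the retention arithmetic (`compactTarget` is an illustrative helper, not an etcd API):

```go
package main

import "fmt"

// compactTarget returns the revision to compact to when keeping
// `retain` recent revisions, or 0 to signal that compaction should
// be skipped because not enough history has accumulated yet.
func compactTarget(currentRev, retain int64) int64 {
	if currentRev <= retain {
		return 0 // skip: fewer than `retain` revisions exist
	}
	return currentRev - retain
}

func main() {
	fmt.Println(compactTarget(50000, 1000)) // 49000
	fmt.Println(compactTarget(500, 1000))   // 0
}
```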

Performance Tuning

yaml
# etcd configuration (etcd.yaml)
# Heartbeat interval (default 100ms)
heartbeat-interval: 100

# Election timeout (default 1000ms; should be roughly 10x the heartbeat interval)
election-timeout: 1000

# Snapshot trigger threshold (default: every 10000 log entries)
snapshot-count: 10000

# Automatic compaction (keep 1 hour of history)
auto-compaction-mode: periodic
auto-compaction-retention: "1h"

# Backend disk quota (default 2GB; writes are rejected once it is exceeded)
quota-backend-bytes: 8589934592  # 8GB
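quota-backend-bytes takes a raw byte count, which is easy to get wrong by a factor of 1000 vs 1024. A quick check that the value above is 8 GiB (`gib` is an illustrative helper):

```go
package main

import "fmt"

// gib converts a count of gibibytes (2^30 bytes) into the raw byte
// value that quota-backend-bytes expects.
func gib(n int64) int64 {
	return n << 30
}

func main() {
	fmt.Println(gib(2)) // 2147483648  (the default 2GB quota)
	fmt.Println(gib(8)) // 8589934592  (the value configured above)
}
```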

Troubleshooting Case Studies

Case 1: Cluster Has No Leader

Symptom: all write operations fail with etcdserver: no leader

Diagnosis

bash
etcdctl endpoint status --cluster -w table
etcdctl endpoint health --cluster

Causes

  • A majority of nodes are down (e.g. 2 of 3 in a 3-node cluster)
  • Network partition
  • Disk I/O so slow that heartbeats time out

Remediation

  1. Bring the failed nodes back up
  2. If data is corrupted, restore from a snapshot

Case 2: Disk Space Alarm (mvcc: database space exceeded)

Symptom: etcd rejects writes with mvcc: database space exceeded

Remediation

bash
# 1. Compact old revisions
REV=$(etcdctl endpoint status -w json | python3 -c "...")
etcdctl compact $REV

# 2. Defragment
etcdctl defrag

# 3. Clear the alarm
etcdctl alarm disarm

# 4. Long term: raise the quota or enable automatic compaction

Case 3: Kubernetes etcd Data Corruption

Symptom: the API Server fails to start, and the etcd logs report data corruption.

Recovery Steps

bash
# 1. Stop etcd on all nodes
systemctl stop etcd

# 2. Restore from the most recent snapshot
etcdctl snapshot restore /backup/latest.db \
    --data-dir /var/lib/etcd-new \
    ...

# 3. Swap in the restored data directory
mv /var/lib/etcd /var/lib/etcd-broken
mv /var/lib/etcd-new /var/lib/etcd

# 4. Restart etcd
systemctl start etcd

Monitoring Metrics

Metric                                       Meaning                  Alert threshold
etcd_server_has_leader                       cluster has a Leader     alert immediately if = 0
etcd_server_leader_changes_seen_total        Leader change count      frequent changes
etcd_disk_wal_fsync_duration_seconds         WAL fsync latency        p99 > 10ms
etcd_disk_backend_commit_duration_seconds    backend commit latency   p99 > 25ms
etcd_mvcc_db_total_size_in_bytes             DB size                  > 80% of quota
etcd_network_peer_round_trip_time_seconds    peer round-trip time     p99 > 150ms
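The DB-size threshold is relative to the configured quota rather than an absolute number. A tiny sketch of that check in integer arithmetic (`dbSizeAlarming` is an illustrative helper, not a Prometheus rule):

```go
package main

import "fmt"

// dbSizeAlarming reports whether the backend DB size has crossed the
// 80%-of-quota alert threshold; dbBytes*5 > quotaBytes*4 is the
// integer form of dbBytes > 0.8*quotaBytes.
func dbSizeAlarming(dbBytes, quotaBytes int64) bool {
	return dbBytes*5 > quotaBytes*4
}

func main() {
	quota := int64(8) << 30 // 8 GiB quota, as configured above
	fmt.Println(dbSizeAlarming(int64(6)<<30, quota)) // false (75% of quota)
	fmt.Println(dbSizeAlarming(int64(7)<<30, quota)) // true  (87.5% of quota)
}
```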

PaaS Middleware Ecosystem Deep-Dive Documentation