etcd — a strongly consistent KV store
Architecture and the Raft protocol
etcd is a distributed KV store originally developed at CoreOS. It relies on the Raft protocol for strong consistency and is the core datastore Kubernetes depends on.
etcd cluster (3 nodes):
Node1 (Leader) ──[AppendEntries]──► Node2 (Follower)
──[AppendEntries]──► Node3 (Follower)
Write path:
1. The client sends the write to the Leader
2. The Leader appends the entry to its log and broadcasts it to the Followers
3. Once a majority of nodes (a quorum) confirm, the Leader commits the entry
4. Success is returned to the client
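The majority rule in step 3 is simple arithmetic; a minimal Go sketch (illustrative only, not etcd's code):

```go
package main

import "fmt"

// quorum is the minimum number of confirmations needed to commit a write.
func quorum(clusterSize int) int { return clusterSize/2 + 1 }

// faultTolerance is how many nodes may fail while the cluster stays writable.
func faultTolerance(clusterSize int) int { return (clusterSize - 1) / 2 }

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("cluster=%d quorum=%d tolerates=%d\n", n, quorum(n), faultTolerance(n))
	}
	// cluster=1 quorum=1 tolerates=0
	// cluster=3 quorum=2 tolerates=1
	// cluster=5 quorum=3 tolerates=2
}
```

This is also why etcd clusters use odd sizes: a 4th node raises the quorum to 3 without improving fault tolerance.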
Reads (linearizable):
1. The client reads from the Leader
2. The Leader confirms it is still the Leader (the ReadIndex mechanism)
3. The latest data is returned
Raft elections
Term:
Every election starts a new term
Each node casts at most one vote per term
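The one-vote-per-term rule can be sketched as follows (a toy model, not etcd's implementation):

```go
package main

import "fmt"

// node remembers which candidate it voted for in each term.
type node struct {
	votedFor map[uint64]string
}

// grantVote returns true if the vote is granted: a node votes at most
// once per term, but may re-confirm a vote for the same candidate.
func (n *node) grantVote(term uint64, candidate string) bool {
	if prev, ok := n.votedFor[term]; ok {
		return prev == candidate
	}
	n.votedFor[term] = candidate
	return true
}

func main() {
	n := &node{votedFor: map[uint64]string{}}
	fmt.Println(n.grantVote(5, "node2")) // true: first vote in term 5
	fmt.Println(n.grantVote(5, "node3")) // false: already voted in term 5
	fmt.Println(n.grantVote(6, "node3")) // true: a new term, a new vote
}
```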
Election trigger:
A Follower receives no heartbeat for longer than the election timeout (150-300 ms)
→ it becomes a Candidate and starts an election
→ it wins a majority of votes → it becomes the Leader
→ it fails to win a majority → it waits for the next election
Basic operations
bash
# Set key-value pairs
etcdctl put /config/db/url "jdbc:mysql://mysql:3306/orders"
etcdctl put /config/db/password "secret123"
# Read
etcdctl get /config/db/url
etcdctl get /config/ --prefix # prefix query
# Delete
etcdctl del /config/db/password
# Watch (observe changes)
etcdctl watch /config/ --prefix
# Transaction (atomic operation)
etcdctl txn <<EOF
compares:
value("/lock/order") = ""
success:
put /lock/order "node1"
failure:
get /lock/order
EOF
Leases and distributed locks
bash
# Create a lease (TTL = 30 s)
LEASE_ID=$(etcdctl lease grant 30 | awk '{print $2}')
# Attach a key to the lease (the key is deleted automatically when the lease expires)
etcdctl put /lock/order "node1" --lease=$LEASE_ID
# Keep-alive (keeps the lease active)
etcdctl lease keep-alive $LEASE_ID
# Revoke the lease (its keys are deleted immediately)
etcdctl lease revoke $LEASE_ID
Go client: a distributed lock
go
import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func acquireLock(client *clientv3.Client, key string) error {
	// Create a Session (manages the lease internally)
	session, err := concurrency.NewSession(client, concurrency.WithTTL(30))
	if err != nil {
		return err
	}
	defer session.Close()
	// Create the mutex
	mutex := concurrency.NewMutex(session, "/locks/"+key)
	// Lock (blocks until acquired or the context times out)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := mutex.Lock(ctx); err != nil {
		return fmt.Errorf("failed to acquire lock: %w", err)
	}
	defer mutex.Unlock(context.Background())
	// Critical section
	doWork() // application-defined work
	return nil
}
Watch mechanism
Watch is one of etcd's core features; Kubernetes' Informer mechanism is built on it:
go
// Watch every change under a prefix
watchChan := client.Watch(context.Background(), "/services/", clientv3.WithPrefix())
for resp := range watchChan {
	for _, event := range resp.Events {
		switch event.Type {
		case clientv3.EventTypePut:
			fmt.Printf("PUT: %s = %s\n", event.Kv.Key, event.Kv.Value)
		case clientv3.EventTypeDelete:
			fmt.Printf("DELETE: %s\n", event.Kv.Key)
		}
	}
}
Watch and Revision
go
// Watch from a specific revision (no historical events are lost)
resp, _ := client.Get(ctx, "/services/", clientv3.WithPrefix())
startRevision := resp.Header.Revision
// Start watching from startRevision+1
watchChan := client.Watch(ctx, "/services/",
	clientv3.WithPrefix(),
	clientv3.WithRev(startRevision+1),
)
etcd's role in Kubernetes
Kubernetes stores all of its cluster state in etcd:
/registry/pods/default/my-pod
/registry/services/default/my-service
/registry/deployments/default/my-deployment
/registry/secrets/default/my-secret
...
The API Server is the only component that accesses etcd directly:
kubectl → API Server → etcd
Controller Manager → API Server → etcd (via Watch)
Scheduler → API Server → etcd (via Watch)
Cluster operations
Member management
bash
# List cluster members
etcdctl member list -w table
# Add a new member
etcdctl member add etcd4 --peer-urls=http://192.168.1.14:2380
# Remove a member
etcdctl member remove <member-id>
# Update a member's peer URL
etcdctl member update <member-id> --peer-urls=http://new-ip:2380
Backup and restore
bash
# Back up (take a snapshot)
etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db
# Verify the snapshot (note: in etcd >= 3.5, snapshot status/restore have moved to the etcdutl tool)
etcdctl snapshot status /backup/etcd-snapshot.db -w table
# Restore (run on every node, each with its own --name and --initial-advertise-peer-urls)
etcdctl snapshot restore /backup/etcd-snapshot.db \
--name etcd1 \
--initial-cluster "etcd1=http://192.168.1.11:2380,etcd2=http://192.168.1.12:2380,etcd3=http://192.168.1.13:2380" \
--initial-cluster-token etcd-cluster-1 \
--initial-advertise-peer-urls http://192.168.1.11:2380 \
--data-dir /var/lib/etcd-restored
Compaction and defragmentation
bash
# Get the current revision
REV=$(etcdctl endpoint status --write-out="json" | python3 -c "import sys,json; print(json.load(sys.stdin)[0]['Status']['header']['revision'])")
# Compact historical revisions (keep the most recent 1000)
etcdctl compact $((REV - 1000))
# Defragment (reclaim disk space)
etcdctl defrag --cluster
Performance tuning
yaml
# etcd configuration (etcd.yaml)
# Heartbeat interval (default 100 ms)
heartbeat-interval: 100
# Election timeout (default 1000 ms; should be about 10x the heartbeat)
election-timeout: 1000
# Snapshot trigger threshold (default 10000 log entries)
snapshot-count: 10000
# Automatic compaction (keep 1 hour of history)
auto-compaction-mode: periodic
auto-compaction-retention: "1h"
# Backend disk quota (default 2 GB; writes are rejected once exceeded)
quota-backend-bytes: 8589934592 # 8 GB
Troubleshooting cases
Case 1: the etcd cluster has no Leader
Symptom: every write fails with etcdserver: no leader.
Diagnosis:
bash
etcdctl endpoint status --cluster -w table
etcdctl endpoint health --cluster
Causes:
- A majority of nodes are down (e.g. 2 of 3 in a 3-node cluster)
- Network partition
- Disk I/O so slow that heartbeats time out
Remedy:
- Bring the failed nodes back up
- If data is corrupted, restore from a snapshot
Case 2: disk-space alarm (mvcc: database space exceeded)
Symptom: etcd rejects writes with mvcc: database space exceeded.
Remedy:
bash
# 1. Compact historical revisions
REV=$(etcdctl endpoint status -w json | python3 -c "...")
etcdctl compact $REV
# 2. Defragment
etcdctl defrag
# 3. Disarm the alarm
etcdctl alarm disarm
# 4. Long term: raise the quota or tune auto-compaction
Case 3: corrupted etcd data in Kubernetes
Symptom: the API Server cannot start; the etcd logs report data corruption.
Recovery steps:
bash
# 1. Stop etcd on every node
systemctl stop etcd
# 2. Restore from the most recent snapshot
etcdctl snapshot restore /backup/latest.db \
--data-dir /var/lib/etcd-new \
...
# 3. Swap in the restored data directory
mv /var/lib/etcd /var/lib/etcd-broken
mv /var/lib/etcd-new /var/lib/etcd
# 4. Restart etcd
systemctl start etcd
Monitoring metrics
| Metric | Meaning | Alert threshold |
|---|---|---|
| etcd_server_has_leader | whether a Leader exists | = 0: alert immediately |
| etcd_server_leader_changes_seen_total | number of Leader changes | frequent changes |
| etcd_disk_wal_fsync_duration_seconds | WAL fsync latency | p99 > 10 ms |
| etcd_disk_backend_commit_duration_seconds | backend commit latency | p99 > 25 ms |
| etcd_mvcc_db_total_size_in_bytes | DB size | > 80% of quota |
| etcd_network_peer_round_trip_time_seconds | peer round-trip time | p99 > 150 ms |
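The thresholds in the table can be wired into simple alert predicates; a sketch (function names are hypothetical, threshold values taken from the table):

```go
package main

import "fmt"

// Alert predicates mirroring the thresholds in the table above.
func noLeader(hasLeader int) bool { return hasLeader == 0 }

func walFsyncSlow(p99Seconds float64) bool { return p99Seconds > 0.010 }

func backendCommitSlow(p99Seconds float64) bool { return p99Seconds > 0.025 }

// dbNearQuota fires when the DB exceeds 80% of the backend quota.
func dbNearQuota(sizeBytes, quotaBytes int64) bool {
	return sizeBytes*100 >= quotaBytes*80
}

func main() {
	fmt.Println(noLeader(1))                               // false: healthy
	fmt.Println(walFsyncSlow(0.02))                        // true: disk too slow for WAL fsync
	fmt.Println(dbNearQuota(7_000_000_000, 8_589_934_592)) // true: above 80% of an 8 GB quota
}
```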