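Ceph: Unified Distributed Storage

Architecture Overview

Ceph is a unified distributed storage system that provides object storage (RGW), block storage (RBD), and a file system (CephFS) on top of a single cluster.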

┌──────────────────────────────────────────────────────────────────────────────┐
│                                 Access Layer                                 │
│      RGW (S3/Swift)    RBD (block device)    CephFS (POSIX file system)      │
└──────────────────────────────────────────────────────────────────────────────┘
                                       │
┌──────────────────────────────────────────────────────────────────────────────┐
│             RADOS (Reliable Autonomic Distributed Object Store)              │
│  ┌─────────────────┐  ┌──────────────────────┐  ┌─────────────────────────┐  │
│  │     Monitor     │  │       Manager        │  │           OSD           │  │
│  │ (cluster state) │  │ (monitoring/plugins) │  │ (object storage daemon) │  │
│  └─────────────────┘  └──────────────────────┘  └─────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────────┘

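Core Components

Component	Responsibility
Monitor (MON)	Maintains the cluster map; uses Paxos to keep it consistent across monitors
OSD (Object Storage Daemon)	Typically one OSD per disk; handles data storage, replication, and recovery
Manager (MGR)	Cluster monitoring and modules (Dashboard, Prometheus)
MDS (Metadata Server)	Metadata service for CephFS
RGW (RADOS Gateway)	Object storage gateway, compatible with the S3 and Swift APIs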

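CRUSH Algorithm

CRUSH (Controlled Replication Under Scalable Hashing) is Ceph's core data placement algorithm:

Write path:
1. Compute the PG (Placement Group) the object belongs to
   PG = hash(object_name) % pg_num   (simplified; the pool ID is also part of the PG ID)

2. CRUSH maps the PG to an ordered list of OSDs using the Cluster Map
   CRUSH(PG, cluster_map) → [OSD1, OSD2, OSD3]

3. The client writes to the primary OSD, which replicates the data to the secondary OSDs

Advantages:
  - No central metadata server on the data path (unlike the HDFS NameNode)
  - Clients compute data locations themselves, with no per-object lookup
  - Data rebalances automatically when the cluster is expanded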
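
Step 1 of the write path can be sketched in a few lines of shell. This is only an illustration of the "hash, then modulo" idea: the function name `pg_for_object` is hypothetical, and `cksum` (a CRC) stands in for the hash that Ceph actually uses, so the PG numbers it prints will not match a real cluster.

```shell
#!/bin/sh
# Illustrative object -> PG mapping (step 1 of the write path).
# ASSUMPTION: cksum is a stand-in hash; Ceph uses its own hash function and
# a "stable mod", so these numbers will not match a real cluster.

# pg_for_object NAME PG_NUM: print a PG index in the range [0, PG_NUM).
pg_for_object() {
    name=$1
    pg_num=$2
    # cksum prints "<crc> <byte-count>"; keep only the CRC.
    hash=$(printf '%s' "$name" | cksum | cut -d ' ' -f 1)
    echo $(( hash % pg_num ))
}

# The same name always maps to the same PG, so no lookup table is needed.
pg_for_object "my-object" 128
pg_for_object "my-object" 128
```

Both calls print the same index, which is the property that lets every client locate data independently once it holds the current cluster map.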

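CRUSH Rules

bash

# List and inspect CRUSH rules
ceph osd crush rule ls
ceph osd crush rule dump

# Create a replicated rule that places each replica in a different rack
# Syntax: create-replicated <name> <root> <failure-domain> [<device-class>]
# The replica count comes from the pool's size setting, not from the rule
ceph osd crush rule create-replicated replicated_rack default rack

# Show the CRUSH hierarchy
ceph osd crush tree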

Pool Management

bash

# Create a pool
ceph osd pool create my-pool 128  # 128 PGs

# Set the replica count
ceph osd pool set my-pool size 3
ceph osd pool set my-pool min_size 2  # a PG serves I/O only while at least 2 replicas are available

# Create an erasure-coded (EC) pool
ceph osd erasure-code-profile set my-ec-profile k=4 m=2
ceph osd pool create ec-pool 64 erasure my-ec-profile

# Inspect pools
ceph osd pool ls detail
ceph df  # capacity usage
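
The numbers in the commands above (size 3, k=4 m=2, 128 PGs) involve some simple arithmetic that is easy to get wrong. The following sketch works it out. The function names are hypothetical, and the PG sizing uses the common rule of thumb of roughly 100 PGs per OSD, rounded up to a power of two; on recent releases the PG autoscaler may make this manual calculation unnecessary.

```shell
#!/bin/sh
# Capacity and PG sizing arithmetic (rules of thumb, not Ceph APIs).

# usable_pct_replicated SIZE: percent of raw capacity that holds unique data.
usable_pct_replicated() {
    echo $(( 100 / $1 ))
}

# usable_pct_ec K M: same, for an erasure-coded pool with K data + M coding chunks.
usable_pct_ec() {
    echo $(( 100 * $1 / ($1 + $2) ))
}

# suggest_pg_num OSDS SIZE: about 100 PGs per OSD, divided by the replica
# count, rounded up to the next power of two. ASSUMPTION: a single pool
# uses all OSDs; with several pools the total has to be shared among them.
suggest_pg_num() {
    target=$(( $1 * 100 / $2 ))
    pg=1
    while [ "$pg" -lt "$target" ]; do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
}

echo "replica size 3: $(usable_pct_replicated 3)% usable"
echo "EC k=4 m=2:     $(usable_pct_ec 4 2)% usable"
echo "10 OSDs, size 3: pg_num $(suggest_pg_num 10 3)"
```

The comparison shows why erasure coding is attractive for capacity: a 4+2 profile tolerates two lost chunks while storing about twice as much data per raw byte as three-way replication, at the cost of more CPU and network work per write.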

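RBD (Block Storage)

bash

# Initialize the pool for RBD use (once per pool)
rbd pool init my-pool

# Create an RBD image
rbd create --size 100G my-pool/my-volume

# Map it to a local block device
rbd map my-pool/my-volume
# Output: /dev/rbd0

# Format and mount
mkfs.ext4 /dev/rbd0
mkdir -p /mnt/ceph-volume
mount /dev/rbd0 /mnt/ceph-volume

# Snapshots
rbd snap create my-pool/my-volume@snap1
rbd snap ls my-pool/my-volume

# Clone (copy-on-write clone based on a snapshot)
rbd snap protect my-pool/my-volume@snap1
rbd clone my-pool/my-volume@snap1 my-pool/my-clone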
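
Because a copy-on-write clone keeps reading unmodified blocks from its parent snapshot, the snapshot cannot be removed while the clone depends on it. The reverse of the clone workflow is therefore: flatten the clone, unprotect the snapshot, then delete it. The helper below is a hypothetical dry-run sketch that only prints those `rbd` commands so they can be reviewed first; it does not contact a cluster.

```shell
#!/bin/sh
# Print (do not run) the commands that detach a clone from its parent
# snapshot and then remove that snapshot.

# rbd_detach_plan POOL IMAGE SNAP CLONE
rbd_detach_plan() {
    pool=$1
    image=$2
    snap=$3
    clone=$4
    # Copy all parent data into the clone so it no longer needs the snapshot.
    echo "rbd flatten $pool/$clone"
    # Unprotecting only succeeds once no clone depends on the snapshot.
    echo "rbd snap unprotect $pool/$image@$snap"
    echo "rbd snap rm $pool/$image@$snap"
}

rbd_detach_plan my-pool my-volume snap1 my-clone
```

If a snapshot has several clones, each of them needs to be flattened or deleted before the unprotect step; `rbd children my-pool/my-volume@snap1` lists them.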

Kubernetes CSI

yaml

# StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <cluster-id>
  pool: my-pool
  imageFormat: "2"
  imageFeatures: layering
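  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi
  # Used by the resizer when allowVolumeExpansion is true
  csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: ceph-csi
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi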
reclaimPolicy: Delete
allowVolumeExpansion: true

---
# PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ceph-rbd
  resources:
    requests:
      storage: 10Gi

RGW (Object Storage)

bash

# Create an RGW user
radosgw-admin user create --uid=myuser --display-name="My User"

# Retrieve the access key and secret key
radosgw-admin user info --uid=myuser

# Use the S3-compatible API
aws s3 --endpoint-url http://rgw:7480 mb s3://my-bucket
aws s3 --endpoint-url http://rgw:7480 cp file.txt s3://my-bucket/

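CephFS (File System)

bash

# Create a CephFS file system
# The metadata and data pools must exist first (PG counts here are illustrative),
# and at least one MDS daemon must be running
ceph osd pool create cephfs-meta 32
ceph osd pool create cephfs-data 64
ceph fs new my-fs cephfs-meta cephfs-data

# Mount (kernel client)
mount -t ceph mon1:6789,mon2:6789,mon3:6789:/ /mnt/cephfs \
  -o name=admin,secret=<admin-key>

# Mount (FUSE client)
ceph-fuse -m mon1:6789 /mnt/cephfs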

Cluster Operations

bash

# Check cluster health
ceph status
ceph health detail

# Check OSD status
ceph osd stat
ceph osd tree

# Check PG status
ceph pg stat
ceph pg dump | grep -vi "^pg_stat"

# Check Monitor status
ceph mon stat
ceph quorum_status

# Check I/O performance
ceph osd perf
rados bench -p my-pool 10 write --no-cleanup
rados bench -p my-pool 10 seq
rados -p my-pool cleanup  # remove the benchmark objects left by --no-cleanup
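
For scripted checks (cron jobs, monitoring probes), the first word of the `ceph health` output can be mapped to a numeric code. The sketch below is one way to do it; the function name is hypothetical, and the value 3 for unrecognized output is an assumption borrowed from common monitoring-plugin conventions, not something Ceph defines.

```shell
#!/bin/sh
# Map the text printed by "ceph health" to a numeric status code.

# health_to_code "HEALTH_..." : print 0 (OK), 1 (WARN), 2 (ERR) or 3 (unknown).
health_to_code() {
    case $1 in
        HEALTH_OK*)   echo 0 ;;
        HEALTH_WARN*) echo 1 ;;
        HEALTH_ERR*)  echo 2 ;;
        *)            echo 3 ;;  # ASSUMPTION: 3 means "could not determine"
    esac
}

# Sample inputs (made up for illustration).
health_to_code "HEALTH_OK"
health_to_code "HEALTH_WARN 1 osds down"
```

On a host with admin credentials this could be used as `exit "$(health_to_code "$(ceph health)")"`, so that the calling scheduler or probe sees a non-zero exit status whenever the cluster is not healthy.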

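Troubleshooting Cases

Case 1: OSD Down

Symptom: ceph status reports X osds down, and PGs are in the degraded state.

Resolution:

bash

# Find the down OSDs
ceph osd tree | grep down

# Try restarting the OSD
systemctl restart ceph-osd@<osd-id>

# If the disk has failed, remove the OSD
ceph osd out <osd-id>              # mark it out, which triggers data migration
# Wait until ceph status shows all PGs active+clean before continuing
ceph osd crush remove osd.<osd-id> # remove it from the CRUSH map
ceph auth del osd.<osd-id>         # delete its auth key
ceph osd rm <osd-id>               # remove the OSD from the cluster

# Add a new OSD (after replacing the disk)
ceph-volume lvm create --data /dev/sdb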

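Case 2: PGs Stuck in the degraded State

Symptom: some data has fewer replicas than configured, but the cluster does not recover on its own.

Investigation:

bash

# Inspect the PG
ceph pg <pg-id> query

# Watch recovery progress
ceph status  # look at the recovery figures

# Speed up recovery (temporarily allow more concurrent recovery and backfill operations)
# Higher values compete with client I/O; restore the previous values afterwards
ceph tell osd.* injectargs '--osd-recovery-max-active 5'
ceph tell osd.* injectargs '--osd-max-backfills 5'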

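Case 3: Cluster Full (HEALTH_ERR: full)

Symptom: an OSD's utilization exceeds mon_osd_full_ratio (default 95%), and the cluster rejects writes.

Emergency handling:

bash

# Temporarily raise the threshold (to buy time)
ceph osd set-full-ratio 0.97

# Delete data that is no longer needed
rados -p my-pool rm old-object

# Expand capacity (add new OSDs)

Prevention:

bash

# Lower the thresholds so warnings fire earlier
# (defaults: nearfull 0.85, backfillfull 0.90, full 0.95)
ceph osd set-nearfull-ratio 0.80
ceph osd set-backfillfull-ratio 0.85
ceph osd set-full-ratio 0.90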
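
The three ratios form an ordered ladder: nearfull raises a warning, backfillfull stops backfill onto the OSD, and full blocks writes. The sketch below shows how a utilization figure falls into those bands. The function name and the use of whole percentages are assumptions made for illustration; Ceph's own checks work on fractional ratios.

```shell
#!/bin/sh
# Classify an OSD's utilization against the three thresholds.

# classify_osd_usage USED_PCT NEARFULL_PCT BACKFILLFULL_PCT FULL_PCT
# All arguments are whole percentages (for example 85 for a ratio of 0.85).
classify_osd_usage() {
    used=$1
    nearfull=$2
    backfillfull=$3
    full=$4
    if [ "$used" -ge "$full" ]; then
        echo full            # writes are rejected
    elif [ "$used" -ge "$backfillfull" ]; then
        echo backfillfull    # no more backfill onto this OSD
    elif [ "$used" -ge "$nearfull" ]; then
        echo nearfull        # health warning only
    else
        echo ok
    fi
}

# With the preventive thresholds 0.80 / 0.85 / 0.90:
classify_osd_usage 82 80 85 90
```

Keeping a gap between each pair of thresholds matters: it leaves room for recovery traffic to land somewhere after the first warning and before writes are blocked.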

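Monitoring Metrics

bash

# Enable the Prometheus module
ceph mgr module enable prometheus

# Metrics endpoint
curl http://mgr-host:9283/metrics

Metric	Description
`ceph_health_status`	Cluster health (0=OK, 1=WARN, 2=ERR)
`ceph_osd_up`	Whether the OSD is up (1) or down (0)
`ceph_osd_in`	Whether the OSD is in the cluster (1) or out (0)
`ceph_pg_degraded`	Number of degraded PGs
`ceph_cluster_total_bytes`	Total capacity, in bytes
`ceph_cluster_total_used_bytes`	Used capacity, in bytes
`ceph_osd_apply_latency_ms`	OSD apply latency, in milliseconds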
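
The two capacity metrics are most useful combined into a utilization percentage. The following sketch derives it from the endpoint's text output with `awk`. The function name is hypothetical, the sample values are made up, and it assumes both metrics appear as plain `name value` lines without labels.

```shell
#!/bin/sh
# Compute cluster utilization from Prometheus text-format metrics on stdin.

# cluster_used_pct: print used capacity as a percentage with one decimal.
# Returns non-zero if the total capacity metric is missing or zero.
cluster_used_pct() {
    awk '
        $1 == "ceph_cluster_total_bytes"      { total = $2 }
        $1 == "ceph_cluster_total_used_bytes" { used  = $2 }
        END {
            if (total > 0) printf "%.1f\n", used * 100 / total
            else exit 1
        }
    '
}

# Sample input (made-up numbers); "# HELP" comment lines are ignored.
printf '%s\n' \
    '# HELP ceph_cluster_total_bytes total capacity' \
    'ceph_cluster_total_bytes 1000' \
    'ceph_cluster_total_used_bytes 250' | cluster_used_pct
```

Against a live manager this would be fed from the endpoint above, for example `curl -s http://mgr-host:9283/metrics | cluster_used_pct`. In a Prometheus setup the same ratio is usually expressed as a query or alert rule rather than computed in a script.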
