Multi-Raft Support in Ozone
Multi-Raft in Datanodes
Multi-Raft support allows a Datanode to participate in multiple Ratis replication groups (pipelines) at the same time, which improves write throughput and makes better use of disk and network resources. This is particularly useful when Datanodes have multiple disks or the network has very high bandwidth.
Background
Early Ozone versions supported only one Raft pipeline per Datanode. This limited each Datanode's capacity to handle concurrent writes of replicated data and left disk and network resources under-utilized. Multi-Raft removes that restriction, so several pipelines can keep a Datanode's disks and network busy at once.
Prerequisites
- Ozone 0.5.0 or later
- Replication type: RATIS
- Multiple metadata directories on each Datanode (via hdds.container.ratis.datanode.storage.dir)
- Adequate CPU, memory, and network bandwidth
How It Works
SCM can now create overlapping pipelines: each Datanode can join multiple Raft groups, up to a configurable limit. This boosts concurrency and avoids idle nodes and idle disks. Raft logs are stored separately in different metadata directories to reduce disk contention, and Ratis manages the concurrent logs on each node.
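The effect of the per-Datanode limit on overlapping pipelines can be illustrated with a toy model. This is only a sketch: `build_pipelines` is a hypothetical helper, the greedy "least-loaded nodes first" choice is an assumption made for illustration, and it is not SCM's actual placement policy.

```python
from typing import Dict, List, Tuple


def build_pipelines(nodes: List[str], limit: int, factor: int = 3) -> List[Tuple[str, ...]]:
    """Toy model: keep forming `factor`-node pipelines until too few
    nodes remain below the per-node pipeline limit.

    Not SCM's real algorithm; node choice here is simply least-loaded first.
    """
    load: Dict[str, int] = {n: 0 for n in nodes}
    pipelines: List[Tuple[str, ...]] = []
    while True:
        # Nodes that can still join another pipeline, least loaded first.
        candidates = sorted(
            (n for n in nodes if load[n] < limit), key=lambda n: load[n]
        )
        if len(candidates) < factor:
            break
        members = tuple(candidates[:factor])
        for n in members:
            load[n] += 1
        pipelines.append(members)
    return pipelines


# Three Datanodes, limit 1: a single pipeline, as in early Ozone versions.
print(len(build_pipelines(["dn1", "dn2", "dn3"], limit=1)))  # 1
# Same three Datanodes, limit 2: two overlapping pipelines.
print(len(build_pipelines(["dn1", "dn2", "dn3"], limit=2)))  # 2
```

In this model the number of pipelines never exceeds `limit * len(nodes) // factor`, which is why raising the limit increases the number of concurrent write paths without adding nodes.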
Configuration
- hdds.container.ratis.datanode.storage.dir (no default): A comma-separated list of metadata directory paths.
- ozone.scm.datanode.pipeline.limit (default: 2): The maximum number of pipelines a datanode can be engaged in. The value 0 means that the per-datanode pipeline limit is determined by the number of metadata disks the datanode reports; see the next property.
- ozone.scm.pipeline.per.metadata.disk (default: 2): The number of pipelines allowed per metadata disk, so that the maximum number of pipelines for a datanode is determined by the number of disks in that datanode. This property is effective only when the previous property is set to 0. Its value must be greater than 0.
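How the two limit properties interact can be written down as a small function. This is a sketch under a stated assumption: when the disk-based mode is active, the limit is taken to be the per-disk value multiplied by the number of metadata disks the datanode reports. `effective_pipeline_limit` and its parameter names are hypothetical and are not part of Ozone.

```python
def effective_pipeline_limit(pipeline_limit: int,
                             per_metadata_disk: int,
                             metadata_disks: int) -> int:
    """Return the per-datanode pipeline limit implied by the settings.

    pipeline_limit    -> ozone.scm.datanode.pipeline.limit
    per_metadata_disk -> ozone.scm.pipeline.per.metadata.disk
    metadata_disks    -> number of metadata disks the datanode reports
    """
    if pipeline_limit < 0:
        raise ValueError("pipeline limit must not be negative")
    if pipeline_limit > 0:
        # A fixed limit takes precedence; the per-disk property is ignored.
        return pipeline_limit
    if per_metadata_disk <= 0:
        raise ValueError("per-metadata-disk value must be greater than 0")
    # Assumption: the disk-based limit scales linearly with the disk count.
    return per_metadata_disk * metadata_disks


# Defaults (limit=2): the disk count does not matter.
print(effective_pipeline_limit(2, 2, 4))  # 2
# limit=0 with 4 metadata disks and 2 pipelines per disk.
print(effective_pipeline_limit(0, 2, 4))  # 8
```

Under this reading, leaving the limit at its default of 2 means extra metadata directories do not raise the pipeline count; the limit has to be set to 0 for the disk count to take effect.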
How to Use
- Configure Datanode metadata directories:
  <property>
    <name>hdds.container.ratis.datanode.storage.dir</name>
    <value>/disk1/ratis,/disk2/ratis,/disk3/ratis,/disk4/ratis</value>
  </property>
- Set pipeline limits in ozone-site.xml:
  <property>
    <name>ozone.scm.datanode.pipeline.limit</name>
    <value>0</value>
  </property>
  <property>
    <name>ozone.scm.pipeline.per.metadata.disk</name>
    <value>2</value>
  </property>
- Restart SCM and Datanodes.
- Validate with:
  ozone admin pipeline list
  ozone admin datanode list
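Before the restart step, it may help to sanity-check the edited configuration. The sketch below parses the text of an ozone-site.xml file with Python's standard library and flags settings that contradict the rules described in this document. `read_properties` and `check_multiraft` are hypothetical helpers, not Ozone tooling, and the checks cover only what is stated here.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

STORAGE_DIR = "hdds.container.ratis.datanode.storage.dir"
LIMIT = "ozone.scm.datanode.pipeline.limit"
PER_DISK = "ozone.scm.pipeline.per.metadata.disk"


def read_properties(xml_text: str) -> Dict[str, str]:
    """Collect <property> name/value pairs from Hadoop-style config XML."""
    root = ET.fromstring(xml_text)
    props: Dict[str, str] = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def check_multiraft(props: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems (empty if none found)."""
    problems: List[str] = []
    dirs = [d.strip() for d in props.get(STORAGE_DIR, "").split(",") if d.strip()]
    if len(dirs) < 2:
        problems.append(f"{STORAGE_DIR}: fewer than two metadata directories")
    # Fall back to the documented defaults (2 and 2) when a property is unset.
    limit = int(props.get(LIMIT, "2"))
    per_disk = int(props.get(PER_DISK, "2"))
    if limit < 0:
        problems.append(f"{LIMIT}: must not be negative")
    if limit == 0 and per_disk <= 0:
        problems.append(f"{PER_DISK}: must be greater than 0 when {LIMIT} is 0")
    return problems
```

To check a file on disk, read it and pass its contents, for example `check_multiraft(read_properties(open(path).read()))`, where `path` is wherever your deployment keeps ozone-site.xml.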
Operational Tips
- Monitor with the ozone admin CLI and the Recon UI.
- Ensure pipeline count matches expectations.
- For optimal I/O isolation, configure hdds.container.ratis.datanode.storage.dir with paths on multiple distinct physical disks. The Ratis pipelines will be distributed accordingly.
- Be cautious with very high pipeline counts due to memory/CPU overhead.
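The advice about distinct physical disks can be partially checked on a Datanode host. The sketch below groups the configured directories by the device ID that the operating system reports for each path. `shared_devices` is a hypothetical helper, and the check is only an approximation: the device ID identifies a mounted filesystem, so two partitions or logical volumes on one physical disk would still look distinct.

```python
import os
from collections import defaultdict
from typing import Dict, List


def group_by_device(paths: List[str]) -> Dict[int, List[str]]:
    """Group existing paths by the st_dev of the filesystem they live on."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for path in paths:
        groups[os.stat(path).st_dev].append(path)
    return dict(groups)


def shared_devices(storage_dir_value: str) -> List[List[str]]:
    """Return groups of directories that share a device.

    `storage_dir_value` is the comma-separated value of
    hdds.container.ratis.datanode.storage.dir.
    """
    paths = [p.strip() for p in storage_dir_value.split(",") if p.strip()]
    return [ps for ps in group_by_device(paths).values() if len(ps) > 1]
```

An empty result means every directory sits on its own filesystem. Any returned group lists directories whose Raft logs would compete for the same device.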
Limitations
- Global configuration: cannot set per-node limits
- Requires restart after changing storage dirs
- No effect on Erasure-Coding (EC) pipelines
References
- Design doc: HDDS-1564 Ozone multi-raft support