[HDDS-5713] DiskBalancer for Datanode (implementing)
Background
Apache Ozone initially distributes containers evenly across the multiple disks of each Datanode, so the I/O load is balanced from the start. However, over the operational lifetime of a cluster, disk imbalance can occur for the following reasons:
- Adding new disks to expand datanode storage space.
- Replacing old broken disks with new disks.
- Massive block or replica deletions.
This uneven utilization of disks can create performance bottlenecks, as over-utilized disks become hotspots that limit the overall throughput of the Datanode. To address this, a new feature, DiskBalancer, is introduced to ensure even data distribution across the disks within a Datanode.
Proposed Solution
The DiskBalancer is a feature that evenly distributes data across the different disks of a Datanode.
It detects imbalance within a Datanode using a metric borrowed from the HDFS DiskBalancer, called Volume Data Density. This metric is calculated for each disk using the following formulas:
AverageUtilization = TotalUsed / TotalCapacity
VolumeUtilization = diskUsedSpace / diskCapacity
VolumeDensity = | VolumeUtilization - AverageUtilization |
Here, VolumeUtilization is each disk's individual utilization, and AverageUtilization is the ideal utilization every disk should match for the data to be evenly distributed.
A disk is considered a candidate for balancing if its VolumeDataDensity exceeds a configurable threshold. The DiskBalancer then moves containers from the most utilized disk to the least utilized disk. DiskBalancer can be triggered manually by CLI commands.
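For example, with hypothetical numbers: consider a Datanode with two 100 GB disks, where d1 holds 80 GB and d2 holds 20 GB:
AverageUtilization = (80 + 20) / (100 + 100) = 0.5
VolumeUtilization(d1) = 80 / 100 = 0.8
VolumeUtilization(d2) = 20 / 100 = 0.2
VolumeDensity(d1) = |0.8 - 0.5| = 0.3
VolumeDensity(d2) = |0.2 - 0.5| = 0.3
With a threshold of 0.1, both disks exceed it, so the DiskBalancer would move containers from d1 (source) to d2 (destination).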
High-Level DiskBalancer Implementation
The design consists of two parts:
Client & DN - Direct Communication:
Administrators use the ozone admin datanode diskbalancer CLI to manage and monitor the feature.
- Clients communicate directly with datanodes via RPC using the DiskBalancerProtocol interface, bypassing SCM.
- Clients can control the DiskBalancer job by sending requests directly to datanodes, including:
  - Start/stop DiskBalancer operations
  - Update configuration parameters
  - Query DiskBalancer status and volume density reports
- Each datanode performs its own authentication (via RPC) and authorization checks (using OzoneAdmins based on the ozone.administrators configuration).
- For batch operations, clients can use the --in-service-datanodes flag to automatically query SCM for all IN_SERVICE datanodes and execute commands on all of them.
DN - DiskBalancer Service:
All balancing operations are performed on the datanodes themselves.
A daemon thread, the Scheduler, runs periodically on each Datanode.
- It calculates the VolumeDataDensity for all volumes.
- If an imbalance is detected (i.e., density > threshold), it moves a set of closed containers from the most over-utilized disk (source) to the least utilized disk (destination).
- The scheduler dispatches these move tasks to a pool of Worker threads for parallel execution (see the sketch after the note below).
Note: SCM is used only for datanode discovery when using the --in-service-datanodes flag. SCM provides a list of IN_SERVICE datanodes for batch operations but
does not participate in DiskBalancer control operations (start/stop/update/status/report). All DiskBalancer operations are performed directly between client and datanode.
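For illustration, the scheduler loop described above can be sketched as follows. This is a minimal sketch: the class and method names are hypothetical, not the actual Ozone implementation, and the worker pool size is arbitrary.

```java
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the periodic balancing loop; not the real Ozone code.
class DiskBalancerScheduler {

  static class Volume {
    long used;
    long capacity;
    double utilization() { return (double) used / capacity; }
  }

  private final ExecutorService workers = Executors.newFixedThreadPool(4);
  private final double threshold; // configurable VolumeDataDensity threshold

  DiskBalancerScheduler(double threshold) { this.threshold = threshold; }

  /** One iteration of the periodic loop run by the Scheduler daemon thread. */
  void runOnce(List<Volume> volumes) {
    long totalUsed = volumes.stream().mapToLong(v -> v.used).sum();
    long totalCapacity = volumes.stream().mapToLong(v -> v.capacity).sum();
    double average = (double) totalUsed / totalCapacity;

    // Most over-utilized volume is the source, least utilized the destination.
    Volume source = volumes.stream()
        .max(Comparator.comparingDouble(Volume::utilization)).orElseThrow();
    Volume dest = volumes.stream()
        .min(Comparator.comparingDouble(Volume::utilization)).orElseThrow();

    // Balance only when the source's VolumeDataDensity exceeds the threshold.
    if (Math.abs(source.utilization() - average) > threshold) {
      workers.submit(() -> moveOneClosedContainer(source, dest));
    }
  }

  private void moveOneClosedContainer(Volume source, Volume dest) {
    // Placeholder: pick a CLOSED container on 'source' and move it to 'dest'
    // (see "Container Move Process" below).
  }
}
```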
Container Move Process
Suppose we are moving container C1 (in CLOSED state) from source disk D1 to destination disk D2:
1. A temporary copy, Temp C1-CLOSED, is created in the temp directory of the destination disk D2.
2. Temp C1-CLOSED is transitioned to the Temp C1-RECOVERING state. This Temp C1-RECOVERING container is then atomically moved to the final destination directory of D2 as C1-RECOVERING.
3. A new container import is initiated for the C1-RECOVERING container.
4. Once the import is successful, all the metadata updates are done for this new container created on D2 (transitioning it to C1-CLOSED).
5. Finally, the original container C1-CLOSED on D1 is deleted.
D1 ----> C1-CLOSED --- (5) ---> C1-DELETED
             |
            (1)
             |
             v
D2 ----> Temp C1-CLOSED --- (2) ---> Temp C1-RECOVERING --- (3) ---> C1-RECOVERING --- (4) ---> C1-CLOSED
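For illustration, the move sequence can be sketched as below. The helper method names and on-disk layout details are hypothetical, and error handling/rollback is omitted for brevity.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of the five-step container move; not the real Ozone code.
class ContainerMover {

  void move(Path sourceContainer, Path destTempDir, Path destContainerDir)
      throws IOException {
    // (1) Copy the CLOSED container into the temp directory of the
    //     destination disk as "Temp C1-CLOSED".
    Path tempCopy = destTempDir.resolve(sourceContainer.getFileName());
    copyRecursively(sourceContainer, tempCopy);

    // (2) Transition the copy to RECOVERING, then atomically move it into
    //     the final container directory so a crash never exposes a
    //     half-written container.
    markRecovering(tempCopy);
    Path recovering = destContainerDir.resolve(sourceContainer.getFileName());
    Files.move(tempCopy, recovering, StandardCopyOption.ATOMIC_MOVE);

    // (3) + (4) Import the RECOVERING container and apply all metadata
    //     updates, transitioning it back to CLOSED on the destination disk.
    importContainer(recovering);

    // (5) Finally, delete the original container from the source disk.
    deleteRecursively(sourceContainer);
  }

  private void copyRecursively(Path from, Path to) throws IOException {
    // Placeholder: recursively copy the container's chunk and metadata files.
  }

  private void markRecovering(Path container) throws IOException {
    // Placeholder: rewrite the container's state to RECOVERING.
  }

  private void importContainer(Path container) throws IOException {
    // Placeholder: import the container and update datanode metadata.
  }

  private void deleteRecursively(Path container) throws IOException {
    // Placeholder: remove the original container directory.
  }
}
```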
DiskBalancing Policies
By default, the DiskBalancer uses specific policies to decide which disks to balance and which containers to move. These are configurable, but the default implementations provide robust and safe behavior.
- DefaultVolumeChoosingPolicy: This is the default policy for selecting the source and destination volumes. It identifies the most over-utilized volume as the source and the most under-utilized volume as the destination by comparing each volume's utilization against the Datanode's average. The calculation also accounts for data that is already in the process of being moved, ensuring decisions are based on the future state of the volumes.
- DefaultContainerChoosingPolicy: This is the default policy for selecting which container to move from a source volume. It iterates through the containers on the source disk and picks the first one that is in a CLOSED state and is not already being moved by another balancing operation. To optimize performance and avoid re-scanning the same containers repeatedly, it caches a container iterator for each volume; a cached entry expires one hour after it was last used, or is invalidated once the iterator has been fully consumed.
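A rough sketch of a container-choosing policy along these lines is shown below, using Guava's cache for the one-hour expiry; the class names and the ContainerInfo interface are illustrative, not the actual Ozone types.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.Iterator;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of DefaultContainerChoosingPolicy's caching behavior.
class ContainerChoosingPolicySketch {

  /** Illustrative stand-in for the real container descriptor. */
  interface ContainerInfo {
    boolean isClosed();
    boolean isBeingMoved();
  }

  // Per-volume container iterators, expiring one hour after last use.
  private final Cache<String, Iterator<ContainerInfo>> iterators =
      CacheBuilder.newBuilder()
          .expireAfterAccess(1, TimeUnit.HOURS)
          .build();

  ContainerInfo chooseContainer(String volumeId,
      Iterable<ContainerInfo> containersOnVolume) {
    Iterator<ContainerInfo> it = iterators.getIfPresent(volumeId);
    if (it == null || !it.hasNext()) {
      // (Re)build the iterator when missing or fully consumed.
      it = containersOnVolume.iterator();
      iterators.put(volumeId, it);
    }
    while (it.hasNext()) {
      ContainerInfo candidate = it.next();
      // Pick the first CLOSED container not already being moved.
      if (candidate.isClosed() && !candidate.isBeingMoved()) {
        return candidate;
      }
    }
    return null; // nothing movable on this volume right now
  }
}
```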
Security Design
DiskBalancer follows the same security model as other services:
- Authentication: Clients communicate directly with datanodes via RPC. In secure clusters, RPC authentication (Kerberos) is required.
- Authorization: After successful authentication, each datanode performs authorization checks using OzoneAdmins based on the ozone.administrators configuration:
  - Admin operations (start, stop, update): Require the authenticated user to be in ozone.administrators or to belong to a group in ozone.administrators.groups.
  - Read-only operations (status, report): Do not require admin privileges; any authenticated user can query status and reports.
By default, if ozone.administrators is not configured, only the user who launched the datanode service has admin privileges. This ensures that DiskBalancer operations are restricted to authorized administrators while allowing read-only access for monitoring purposes.
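For illustration, the authorization rule can be sketched as follows; the class and method names are hypothetical, and the real OzoneAdmins implementation differs in detail.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the per-datanode authorization check.
class DiskBalancerAuthorizer {

  private final Set<String> adminUsers;   // from ozone.administrators
  private final Set<String> adminGroups;  // from ozone.administrators.groups

  DiskBalancerAuthorizer(String adminsConf, String adminGroupsConf) {
    this.adminUsers = new HashSet<>(Arrays.asList(adminsConf.split(",")));
    this.adminGroups = new HashSet<>(Arrays.asList(adminGroupsConf.split(",")));
  }

  /**
   * Admin operations (start/stop/update) require membership in the admin
   * lists; read-only operations (status/report) only require authentication.
   */
  boolean isAuthorized(String user, Set<String> userGroups, boolean adminOp) {
    if (!adminOp) {
      return true; // status/report: any authenticated user
    }
    return adminUsers.contains(user)
        || userGroups.stream().anyMatch(adminGroups::contains);
  }
}
```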
CLI Interface Design
The DiskBalancer CLI provides five main commands that communicate directly with datanodes:
- start - Initiates DiskBalancer on specified datanodes or on all in-service datanodes, with optional configuration parameters.
- stop - Stops DiskBalancer operations on specified datanodes.
- update - Updates DiskBalancer configuration.
- status - Retrieves current DiskBalancer status including running state, metrics, and configuration.
- report - Retrieves volume density report showing imbalance analysis.
The CLI supports:
- Direct datanode addressing: Commands can target specific datanodes by hostname or IP address
- Batch operations: The --in-service-datanodes flag queries SCM for all IN_SERVICE and HEALTHY datanodes and executes commands on all of them
- Flexible input: Datanode addresses can be provided as positional arguments or read from stdin
- Output formats: Results can be displayed in human-readable format or JSON for programmatic access
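For example, invocations might look like the following. Only the subcommand names and the --in-service-datanodes flag come from this design; the hostnames are placeholders.

```
# Start the balancer on two specific datanodes
ozone admin datanode diskbalancer start dn1.example.com dn2.example.com

# Start the balancer on every IN_SERVICE datanode known to SCM
ozone admin datanode diskbalancer start --in-service-datanodes

# Query the status and volume density report of one datanode
ozone admin datanode diskbalancer status dn1.example.com
ozone admin datanode diskbalancer report dn1.example.com
```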
Operational State Awareness
DiskBalancer automatically responds to datanode operational state changes:
- When a datanode enters the DECOMMISSIONING or MAINTENANCE state, DiskBalancer automatically pauses (transitions to the PAUSED state).
- When a datanode returns to the IN_SERVICE state, DiskBalancer automatically resumes (if it was previously RUNNING).
- If DiskBalancer was explicitly stopped via CLI, it remains STOPPED even after the datanode returns to the IN_SERVICE state.
This ensures DiskBalancer respects datanode lifecycle management and does not interfere with maintenance or decommissioning operations.
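A minimal sketch of this state handling, with a hypothetical enum and event methods:

```java
// Hypothetical sketch of how operational-state events affect the balancer.
class OperationalStateHandler {

  enum BalancerState { RUNNING, PAUSED, STOPPED }

  private BalancerState state = BalancerState.STOPPED;
  private boolean pausedByNodeState = false;

  /** Datanode left IN_SERVICE (entered DECOMMISSIONING or MAINTENANCE). */
  void onNodeNotInService() {
    if (state == BalancerState.RUNNING) {
      state = BalancerState.PAUSED;
      pausedByNodeState = true;
    }
  }

  /** Datanode returned to IN_SERVICE: resume only if we paused it ourselves. */
  void onNodeInService() {
    if (state == BalancerState.PAUSED && pausedByNodeState) {
      state = BalancerState.RUNNING;
      pausedByNodeState = false;
    }
  }

  /** Explicit CLI stop keeps the balancer STOPPED across state changes. */
  void onCliStop() {
    state = BalancerState.STOPPED;
    pausedByNodeState = false;
  }
}
```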
Feature Flag
The DiskBalancer feature is gated behind a feature flag (hdds.datanode.disk.balancer.enabled) to allow controlled rollout. By default, the feature is disabled. When disabled, the DiskBalancer service is not initialized on datanodes, and the CLI commands are hidden from the main help output to prevent accidental usage.
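For illustration, the datanode can gate service initialization on this flag roughly as follows; the surrounding class is hypothetical, but the key name is from this design and OzoneConfiguration is the standard Ozone configuration API.

```java
import org.apache.hadoop.hdds.conf.OzoneConfiguration;

// Hypothetical sketch of gating DiskBalancer startup on the feature flag.
class DiskBalancerBootstrap {

  static final String DISK_BALANCER_ENABLED_KEY =
      "hdds.datanode.disk.balancer.enabled";

  /** The service is only initialized when the flag is explicitly enabled. */
  static boolean shouldInitialize(OzoneConfiguration conf) {
    return conf.getBoolean(DISK_BALANCER_ENABLED_KEY, false); // default: off
  }
}
```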
DiskBalancer Metrics
The DiskBalancer service exposes JMX metrics on each Datanode for real-time monitoring. These metrics provide insights into the balancer’s activity, progress, and overall health.
| DiskBalancer Service Metrics | Description |
|---|---|
| SuccessCount | The number of successful balance jobs. |
| SuccessBytes | The total bytes for successfully balanced jobs. |
| FailureCount | The number of failed balance jobs. |
| moveSuccessTime | The time spent on successful container moves. |
| moveFailureTime | The time spent on failed container moves. |
| runningLoopCount | The total number of times the balancer's main loop has run. |
| idleLoopNoAvailableVolumePairCount | The number of loops where balancing did not run because no suitable source/destination volume pair could be found. |
| idleLoopExceedsBandwidthCount | The number of loops where balancing did not run due to bandwidth limits. |