Thursday, 29 March 2012

Balancing data equally among disks on a datanode

What's the best method of achieving a balanced datanode with respect to disk usage? If we have 5 * 2TB disks in a datanode, and they're each 95% utilised and one fails, how do we then redistribute the data to the new disk?

There doesn't appear to be any method within Hadoop (0.20.x) that achieves this. Google doesn't give me any answers other than manually moving the blocks to the new disk. That's a lot of effort. If I can't find an existing tool, I'll write a shell script that automatically spreads blocks out equally amongst all available disks.

The plan: stop the datanode process, rebalance the blocks across the disks within the datanode, then restart the datanode process and let it rescan for block locations.
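The middle step, moving a block from one disk to another while the datanode is stopped, might look something like the sketch below. It is written in Python purely for illustration. It assumes that each block is stored as a file named blk_<id> alongside a companion blk_<id>_<generation stamp>.meta checksum file, and that the two must travel together. The function name move_block and the example paths are hypothetical.

```python
import glob
import os
import shutil


def move_block(block_path, dest_dir):
    """Move one block file and its .meta companion into dest_dir.

    Assumption: the datanode process is stopped, so nothing else is
    reading or writing these files while they are moved.
    """
    os.makedirs(dest_dir, exist_ok=True)
    # The trailing underscore stops blk_123 from matching blk_1234's meta file.
    companions = sorted(glob.glob(glob.escape(block_path) + "_*.meta"))
    moved = []
    for path in [block_path] + companions:
        target = os.path.join(dest_dir, os.path.basename(path))
        shutil.move(path, target)  # also works across filesystems (disks)
        moved.append(target)
    return moved
```

For example, move_block("/data1/dfs/current/blk_123", "/data5/dfs/current/subdir7") would relocate the block and its checksum file to the new disk. Both paths here are made up and depend on how dfs.data.dir is configured.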

Any major flaws to this plan or is there something around already that does this?




I wrote a small shell script to do this. It works out how much data is on the node and what the average utilisation percentage would be if that data were spread evenly across the disks. It then finds and moves the required number of blocks (it picks the block size out of hdfs-site.xml) onto the least-used disk, into a randomly named subdirXX that contains fewer than 60 blocks.
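The arithmetic behind that can be sketched roughly as follows. This is a Python illustration, not the shell script itself. The function name rebalance_plan is hypothetical, and it assumes every block is a full block_size in length, which is only an approximation because the last block of a file is usually smaller.

```python
def rebalance_plan(used, capacity, block_size):
    """Work out the even-spread utilisation and per-disk block deltas.

    used, capacity: per-disk byte counts (ints), in the same disk order.
    block_size: HDFS block size in bytes (the script reads it from
    hdfs-site.xml; here it is simply passed in).
    Returns (average_percent, deltas). A negative delta means the disk
    should give up that many blocks; a positive one means it should
    receive that many.
    """
    total_used = sum(used)
    total_capacity = sum(capacity)
    average_percent = 100 * total_used / total_capacity

    deltas = []
    for disk_used, disk_capacity in zip(used, capacity):
        # Bytes this disk would hold at the average utilisation.
        target = total_used * disk_capacity // total_capacity
        surplus = disk_used - target
        blocks = abs(surplus) // block_size  # whole blocks only (assumption)
        deltas.append(-blocks if surplus > 0 else blocks)
    return average_percent, deltas
```

With four disks at 95% and one freshly replaced empty disk, the average comes out at 76%, so each old disk sheds a quarter of what the new disk needs to receive.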

Seems to work fine....

Second Edit
Decommission the node via the namenode so that all of its blocks are copied elsewhere, then just format all of its data disks. It's the cleanest way of ensuring a single node is balanced.
