I've not been around much lately. I've been on holiday (vacation) for a couple of weeks and work has been relatively quiet.
This week I've been benchmarking some of the external nodes that process data and move it into HDFS. Some of the handover testing carried out showed benchmark speeds slower than average. We have (as most will be aware) commodity hardware for the external nodes. The nodes are configured with RAID1 for the OS, and RAID0 for processing.
The RAID0 array is made up of 4 standard 7200rpm 1TB disks. They are all HP badged but are of varying models/speeds. The customer said the previously handed-over nodes posted benchmarks of around 250MB/s write and 200MB/s read. The slower results were showing 190MB/s write and 150MB/s read. Considerably slower.
So, where do I start?
First thing that sprung to mind: how is my customer testing? The answer: dd.
# dd if=/dev/zero of=/mnt/raid/test1 bs=1024k count=10240
Looks reasonable. 10GB file. I replicate the customer tests and see the same speeds. I run the same test on an already accepted server and it returns significantly faster results (270MB/s write, 490MB/s read). Wow, that seems a lot different. Is it being cached? How? Where? How do I test without the cache?
Following some research, I stumbled across the "oflag=direct" argument, which bypasses the page cache and writes directly to disk. I try dd with the new argument and the speeds are much slower. Each disk only appears to be writing at 20MB/s. Can you read/write directly to a software RAIDed device? It doesn't make sense to me. Do some more research. I've used bonnie++ previously for benchmarking, and iozone comes highly recommended too. I've never used iozone, so I install it and start bashing out some benchmarks.
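For reference, the cache-aware variants of the test look something like this (the path is carried over from the dd test above; the `conv=fdatasync` and drop_caches steps are my additions, not part of the customer's original test):

```shell
# Write bypassing the page cache entirely (O_DIRECT):
dd if=/dev/zero of=/mnt/raid/test1 bs=1024k count=10240 oflag=direct

# Alternative: keep the cache, but flush before dd reports its
# timing, so the quoted throughput includes the flush to disk:
dd if=/dev/zero of=/mnt/raid/test1 bs=1024k count=10240 conv=fdatasync

# Read test: drop the page cache first so reads actually hit the disks:
sync && echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/raid/test1 of=/dev/null bs=1024k
```

The `conv=fdatasync` form is often the fairer comparison: it still uses the kernel's normal writeback path but forces everything to disk before the timing is printed.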
All *slow* nodes, when tested under iozone, return seemingly sane results. Command used:
# iozone -U /mnt/raid -f /mnt/raid/test1 -s 100g -i 0 -i 1 -e
It returns ~250MB/s write and ~200MB/s read. I can live with that. It seems sane. I run the same test again on the anomalously fast, previously accepted server and I see the same results as before (270MB/s write, 490MB/s read).
Next thing.....what's different in the OS configuration? Same RAID0 configuration. sysctl results were identical. The BIOS was configured the same. None of the disks were showing as faulty.
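The comparison boiled down to a few commands (the md device name and the dump filenames are examples, not our actual layout):

```shell
# RAID layout: chunk size, member devices, state:
mdadm --detail /dev/md0

# Kernel tunables: dump on each node, then diff the two dumps:
sysctl -a | sort > /tmp/sysctl-$(hostname).txt
diff /tmp/sysctl-fast-node.txt /tmp/sysctl-slow-node.txt
```

An empty diff from the sysctl dumps is what ruled out kernel tuning as the culprit here.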
Next thing.....what's different in terms of hardware? They're the same hardware (even down to the disks). Although this wasn't strictly true, as I found out. Being commodity hardware, the hard disks fail...a lot. They're replaced by our hardware chaps. Only, they must have run out of the older disks and started replacing them with newer models. Could a couple of faster disks in the RAID0 array account for the speed increase?
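Spotting mixed models is straightforward once you actually look (the device name in the smartctl line is an example):

```shell
# Model, size, and rotational flag for every disk in one shot:
lsblk -d -o NAME,MODEL,SIZE,ROTA

# Or query an individual disk:
smartctl -i /dev/sda | grep -E 'Model|Rotation Rate'
```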
You always hear that "your RAID array is only as fast as the slowest disk!" Is that really the case? Do some more research.....
When writing to a RAID0 volume, each stripe has to be written in full before moving on to the next stripe/disk, so in a sense you are limited by the slowest disk. But if you have one slow disk and three fast ones, the speeds should still be a lot faster than if you had four slow disks, right? Let's test.....
I found all the types of disk we had in our clusters and started to benchmark each disk type using iozone (same command as above). Depending on the model of disk, I saw variations of up to 70MB/s. Write speeds varied from 70MB/s up to around 110MB/s, and read speeds from 200MB/s up to 490MB/s where all four disks are new.
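A rough sketch of the per-disk-type loop (device names, partition layout, and mount point are examples; each candidate disk carries its own filesystem for the test):

```shell
#!/bin/bash
# Benchmark candidate disks one at a time with the same iozone
# profile used for the array (write + re-read, flush included).
for dev in sdb sdc sdd sde; do
  mnt=/mnt/bench
  mkdir -p "$mnt"
  mount "/dev/${dev}1" "$mnt"
  echo "=== /dev/$dev ==="
  iozone -f "$mnt/test1" -s 10g -i 0 -i 1 -e
  umount "$mnt"
done
```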
What have I learnt after all this? We need to take note of the disks going into our clusters. If we care about performance (and using RAID0 in the first place suggests we do), then fitting faster disks makes sense.
We now have benchmarks for each type of disk. I want to test each and every disk to help weed out the "bad eggs" within our cluster.
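As a quick first pass at weeding, a raw sequential-read sweep with dd flags up obviously slow members without needing a filesystem on the disk (device names are examples; this is a sketch, not a tested tool):

```shell
#!/bin/bash
# Read 2GB straight off each raw device, bypassing the page cache,
# and print just the throughput figure dd reports.
for dev in /dev/sd{a,b,c,d}; do
  printf '%s: ' "$dev"
  dd if="$dev" of=/dev/null bs=1M count=2048 iflag=direct 2>&1 \
    | awk '/copied/ {print $(NF-1), $NF}'
done
```

Any disk reporting well below its model's known baseline is a candidate for replacement.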
Do I have any massive misunderstandings above? Is there a better way of benchmarking disks / software-RAIDed volumes?