I've not been around much lately. I've been on holiday (vacation) for a couple of weeks and work has been relatively quiet.
This week I've been benchmarking some of the external nodes that process data and move it into HDFS. Some of the handover testing showed benchmark speeds slower than average. We have (as most will be aware) commodity hardware for the external nodes. The nodes are configured with RAID1 for the OS and RAID0 for processing.
The RAID0 array is made up of four standard 7200rpm 1TB disks. They are all HP badged but are of varying models and speeds. The customer said the previously handed-over nodes posted benchmarks of around 250MB/s write and 200MB/s read. The slower results were showing 190MB/s write and 150MB/s read. Considerably slower.
So, where do I start?
First thing that sprang to mind: how is my customer testing? The answer: dd.
# dd if=/dev/zero of=/mnt/raid/test1 bs=1024k count=10240
Looks reasonable: a 10GB file. I replicate the customer's tests and see the same speeds. I run the same test on an already accepted server and it returns significantly faster results (270MB/s write, 490MB/s read). Wow, that seems a lot different. Is it being cached? How? Where? How do I test without the cache?
Following some research, I stumbled across the "oflag=direct" argument, which bypasses the buffer cache and writes directly to disk. I try dd with the new argument and the speeds are much slower: each disk only appears to be writing at 20MB/s. Can you read/write directly to a software-RAIDed device? It doesn't make sense to me. Do some more research. I've used bonnie++ previously for benchmarking, and iozone comes highly recommended too. I've never used iozone, so I install it and start bashing out some benchmarks.
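For reference, the direct-I/O run is just the original command with the extra flag:
# dd if=/dev/zero of=/mnt/raid/test1 bs=1024k count=10240 oflag=direct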
All *slow* nodes when tested under iozone return seemingly sane results. Command used:
# iozone -U /mnt/raid -f /mnt/raid/test1 -s 100g -i 0 -i 1 -e
It returns ~250MB/s write and ~200MB/s read. I can live with that. It seems sane. I run the same test again on the unusually fast, previously accepted server and I see the same results as before (270MB/s write, 490MB/s read).
Next thing..... what's different between the OS configurations? Same RAID0 configuration. sysctl settings were identical. The BIOS was configured the same. None of the disks were showing as faulty.
Next thing...... what's different in terms of hardware? They're the same hardware (even down to the disks). Although this wasn't strictly true, as I found out. Being commodity hardware, the hard disks fail... a lot. They're replaced by our hardware chaps. Only, they must have run out of the older disks and started replacing them with newer models. Could a couple of faster disks in the RAID0 array account for the speed increase?
You always hear that "your RAID array is only as fast as its slowest disk!" Is that really the case? Do some more research.....
When writing to a RAID0 volume, each stripe has to be written in full before moving on to the next stripe/disk, so in a sense you are limited by the slowest disk. But if you have one slow disk and three fast ones, the speeds should still be a lot faster than if you had four slow disks, right? Let's test.....
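To put rough numbers on that (my own back-of-envelope, assuming ideal lockstep striping, not anything measured): four 70MB/s disks should stream at roughly 4 x 70 = 280MB/s, and strictly in lockstep, swapping three members for 110MB/s models still leaves the stripe capped near 280MB/s by the slow disk. It's only write-back caching and per-disk queueing that let the faster members pull ahead, which is exactly why it's worth testing rather than assuming.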
I found all the types of disk we had in our clusters and started to benchmark each disk type using iozone (same command as above). Depending on the model of disk, I saw variations of up to 70MB/s in speed. Write speeds varied from 70MB/s up to around 110MB/s, and read speeds from 200MB/s up to 490MB/s where all four disks were new.
What have I learnt after all this? We need to take note of the disks going into our clusters. If we care about performance (using RAID0 in the first place suggests we do), then putting faster disks in makes sense.
We now have benchmarks for each type of disk. I want to test each and every disk to help weed out the "bad eggs" within our cluster.
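As a starting point for that, a quick per-disk sweep like the sketch below would do. To be clear, this is my rough sketch rather than our benchmark suite: the device names, the 60MB/s threshold, and the assumption that dd reports its rate in MB/s are all placeholders.
#!/bin/bash
# Flag member disks whose raw sequential read falls below a threshold.
THRESHOLD=60   # MB/s; hypothetical cut-off for a "bad egg"

for DISK in /dev/sdb /dev/sdc /dev/sdd /dev/sde
do
    # Read 1GB straight from the device, bypassing the page cache
    # (iflag=direct), then pull the MB/s figure out of dd's stderr.
    SPEED=$(dd if=$DISK of=/dev/null bs=1M count=1024 iflag=direct 2>&1 |
            awk '/copied/ {print int($(NF-1))}')
    echo "${DISK}: ${SPEED}MB/s"
    if [ "${SPEED:-0}" -lt "$THRESHOLD" ]; then
        echo "${DISK}: below threshold - possible bad egg"
    fi
done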
Do I have any massive misunderstandings above? Is there a better way of benchmarking disks / software raided volumes?
Saturday, 21 April 2012
XBMC Live Cursor showing up after power cycling TV
A massive annoyance on my XBMC Live media centre (XBMC Live (Eden) running on a Foxconn NT-A3500): if I turn my TV off and on again, the X cursor remains on the screen. This didn't happen when I first installed XBMC Live, and I didn't consciously change any settings to make this the case. The only ways I could find to get around it were to reboot the media centre or jump on the laptop and restart lightdm.
Then I got creative and wrote a quick script to run under cron (every 5 minutes) to detect when the TV is powered on and restart lightdm:
#!/bin/bash
# Find a TV power-on event in syslog and restart lightdm to clear the
# stuck X cursor.
TVON="HDMI status: Pin=3 Presence_Detect=1 ELD_Valid=1"

# Only bother with a syslog that has been written in the last 5 minutes
# (the cron interval).
for i in $(find /var/log/ -maxdepth 1 -name syslog -mmin -5)
do
    # Grab the most recent HDMI presence-detect line.
    TEST=$(grep "$TVON" "$i" | tail -n 1)
    if [ -n "$TEST" ]; then
        # Is this a new event, or one we've already handled?
        TIMESTAMP=$(echo "$TEST" | awk '{print $6}' | sed 's/[][]//g')
        OLD_TIMESTAMP=$(cat /tmp/old_timestamp.txt 2>/dev/null)
        if [ "${TIMESTAMP}" != "${OLD_TIMESTAMP}" ]; then
            # New power-on event: restart lightdm and remember the timestamp.
            service lightdm restart
            echo "${TIMESTAMP}" > /tmp/old_timestamp.txt
        fi
    fi
done
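The cron side is just an every-five-minutes entry (shown here in /etc/cron.d form; the script path is hypothetical, it's wherever you drop it):
*/5 * * * * root /usr/local/bin/tv_cursor_fix.sh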
Not elegant, but currently functional (No idea if restarting XBMC this way could be the cause of my previous post?).
Cursor no more..... :-)
XBMC Live remote not working?
Slightly off topic, but my son was watching Thomas the Tank Engine on the TV yesterday (via XBMC Live on a Foxconn NT-A3500). Everything was fine. I put him to bed, came down, and the remote wouldn't work anymore.
I have an MCE remote I bought from Maplin (model RC118) that just worked; I didn't need to perform any button translation, which was good. Now it's broken. Tried a few things:
- Restart xbmc (service lightdm restart)
- Restart server
- Reseat Infrared receiver
Nothing works.
dmesg shows me that the Infrared receiver is registering:
mceusb 8-2:1.0: Registered Formosa21 eHome Infrared Transceiver on usb8:2
XBMC just isn't able to understand the key presses. Now, I know from reading the XBMC forums that lirc is partly responsible for remote controls. I try to restart lirc (service lirc restart) and it complains about missing kernel modules. W000000t. Nothing has happened on the media centre to remove kernel modules in the time I was (ahem, I mean my son was) watching Thomas the Tank Engine.
To google.....
Found the 'irw' command:
irw
000000037ff05bf2 00 Home mceusb_hauppauge
It works, which proves the hardware is functioning. Now, what's up with XBMC? Nothing obvious in the logs (other than the reference to lirc not starting). Google tells me to check ~/.xbmc/userdata/Lircmap.xml. Oh, right. That doesn't exist. I'm assuming that's my problem. But this worked like an hour ago. What's caused the file to disappear? (I still don't have a clue.)
Oh well, to resolve:
cp -p /usr/share/xbmc/system/Lircmap.xml /home/bleh/.xbmc/userdata/
Edit Lircmap.xml in the new location. I deleted all sections other than the one starting:
<lircmap>
<remote device="mceusb">
<---------Stuff--------->
</remote>
</lircmap>
Then I renamed mceusb to the name irw was showing the device as (mceusb_hauppauge). I then had to edit the /etc/lirc/hardware.conf file to remove the offending kernel module:
REMOTE_MODULES="lirc_dev "
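So the surviving section of my Lircmap.xml ended up looking like this (the placeholder stands for the untouched button mappings):
<lircmap>
<remote device="mceusb_hauppauge">
<---------Stuff--------->
</remote>
</lircmap>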
Restarted lirc, followed by restarting XBMC and voila. A working remote.
I need to read up to understand what happened, whether removing the kernel module (lirc_mceusb2) has any nasty side effects, and about lirc in general.
Hope this helps anyone with the same issue!
Thursday, 29 March 2012
Balancing data equally among disks on a datanode
What's the best method of achieving a balanced datanode with respect to disk usage? If we have 5 * 2TB disks in a datanode, and they're each 95% utilised and one fails, how do we then redistribute the data to the new disk?
There doesn't appear to be any method within hadoop (0.20.x) that achieves this. Google doesn't appear to give me any answers other than manually moving the blocks to the new disk. That's a lot of effort. If I can't find one, I'll write a shell script that automatically spreads blocks out equally amongst all available disks.
Stop the datanode process, rebalance within the datanode, restart the datanode process and let it scan for block locations.
Any major flaws to this plan or is there something around already that does this?
Gracias
A.M
**********
Edit
**********
I wrote a small shell script to do this. It works out how much data is on a node, and what the average utilisation percentage would be if it were spread evenly across each disk. It then finds and moves the required number of blocks (picking the block size out of hdfs-site.xml) onto the least used disk, into a randomly named subdirXX containing fewer than 60 blocks.
Seems to work fine....
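The core of the idea is only a few lines. Below is a minimal sketch rather than the actual script: the /data1../data5 mount points, the <dir>/current layout and the batch size of 100 are my placeholder assumptions, and the datanode must be stopped first (as per the plan above).
#!/bin/bash
# Move a batch of HDFS block files from the fullest data disk to the
# emptiest one. Run only while the datanode process is stopped.
DISKS="/data1 /data2 /data3 /data4 /data5"

# Rank the disks by df use% (column 5) to find the fullest and emptiest.
FULLEST=$(df -P $DISKS | tail -n +2 | sort -rn -k5 | head -1 | awk '{print $6}')
EMPTIEST=$(df -P $DISKS | tail -n +2 | sort -n -k5 | head -1 | awk '{print $6}')

# Move block files together with their blk_*_*.meta checksum files.
find "$FULLEST/current" -name 'blk_*' ! -name '*.meta' | head -100 |
while read BLOCK
do
    mv "$BLOCK" "$BLOCK"_*.meta "$EMPTIEST/current/" 2>/dev/null
done
# Restart the datanode afterwards; it rescans its dirs and reports the
# blocks from their new locations.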
**********
Second Edit
**********
Decommission the node within the namenode so that all blocks are copied elsewhere, then just format all data disks. It's the cleanest way of ensuring a single node is balanced.
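For anyone wanting the mechanics of that, it's the standard 0.20.x decommissioning dance (the excludes file being whatever dfs.hosts.exclude points at):
# Add the node's hostname to the dfs.hosts.exclude file, then:
hadoop dfsadmin -refreshNodes
# Wait for the node to show as "Decommissioned" in the namenode UI,
# stop its datanode, format the data disks, remove it from the excludes
# file and run -refreshNodes again to bring it back in.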
Wednesday, 28 March 2012
Contributing to apache hadoop #2
I've always wanted to contribute something back to the open source community. I thought I'd start off with something ridiculously simple; they offer *newbie* JIRAs, after all. I picked one, created a patch and tested it. I uploaded the patch.
And I waited for a few days like the "HowToContribute" page says....
After a week, I asked in the IRC channel what I needed to do to get the patch reviewed. The answer was to add a comment asking for someone to review it. So I did that.....
I saw they were asking for patches to be included in 1.0.2, so I mailed the release manager (Matt Foley) asking him to include it. He said he'd try and get it reviewed. The RC was released for 1.0.2 (minus my patch).
I uploaded the patch on the 18th Feb 2012. I last updated the JIRA asking for someone to look at it on the 29th Feb 2012. It's now the end of March and there's still no movement.
It does beg the question: "What's the point?" May as well invest my time in something else. But what?
The calm before the storm.....
I've had some quiet time at work in recent weeks whilst waiting for some of our clusters to enter their outage periods.
It was decided to move from our yahoo-0.20.10 install to apache-0.20.203.0, as it was the latest stable version when the decision was made. I did try to sway the decision towards 1.0.0, but it was not mine to make. We started off with our dev clusters, which went smoothly, then did the first of our main clusters last week, which also went smoothly in the end.
Had some oddities during the namenode upgrade due to a corrupt edits log, but once that was corrected things went well.
Finally, I set up collectl to multicast its metrics to a local gmond (3.2.x), which then forwards them to its gmetad server and into an RRD database. Ganglia has some nice default graphs but lacks some features that I want; for example, I want to be able to have hierarchical groups and summary tables:
Currently, if I specify the following groups in gmetad:
Masters (consisting of namenode, secondary and jobtracker)
DataNode_Rack01 (consists of 20 datanodes)
DataNode_Rack02 (consists of 20 datanodes)
DataNode_Rack03 (consists of 20 datanodes)
It will, by default, summarise at the Masters level, at each of the DataNode racks, and across all of those groups. I want the ability to summarise on, say, just the DataNode groups, and I don't see a way of doing that in ganglia. I've created some custom PHP scripts to make some graphs, and have even created the relevant summary graphs manually, but they are horrible: hundreds of lines and ridiculously hard to keep up to date.
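For context, those groups are just data_source entries in gmetad.conf, along these lines (the hostnames are invented for illustration):
data_source "Masters" namenode01:8649
data_source "DataNode_Rack01" rack01-dn01:8649
data_source "DataNode_Rack02" rack02-dn01:8649
data_source "DataNode_Rack03" rack03-dn01:8649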
Has anyone done anything similar? Know of a way around this? Have some examples they can throw around?
Sunday, 4 March 2012
Hadoop sloooow start up
Restarted one of our hadoop clusters following an outage this week, and it became apparent that the datanodes were taking a long time to register with the namenode. A similar cluster registers all datanodes within a few minutes; this cluster took an hour. I could tail the namenode log and watch the nodes register, then lose connectivity (lose heartbeat) before reconnecting. The network is absolutely not saturated.
There are no firewalls involved. There is more than enough memory for the namenode to run. The heap is large and there is plenty of memory for the OS. There isn't anything apparent in the OS logs to suggest why this is happening.
I have collectl running, so I'll start to investigate those logs tomorrow morning.
What can I monitor, in terms of the namenode and network connectivity / open connections / sockets in use, that's hadoop-specific?
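On the plain OS side, the obvious starting point is watching connection counts on the namenode's RPC port (8020 assumed here; substitute whatever fs.default.name specifies):
# Count established TCP connections to the namenode RPC port
netstat -tan | grep ':8020 ' | grep -c ESTABLISHED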
The OS limits aren't an issue either. Could it be worth turning on debug logging for the namenode? In my experience of doing this on the jobtracker, all it does extra is log heartbeat calls. That's not what I'm looking for.
I need to get an outage on the cluster so that I can strace the namenode process on startup to see if that sheds any light on things.
*****************************************
Update
*****************************************
When the tasktrackers restart, they clean up old files (including the distributed cache). This can take some time (tens of minutes). This was the cause of the issue.