Saturday 30 June 2012

Benchmarking software raid?

I've not been around much lately. I've been on holiday (vacation) for a couple of weeks and work has been relatively quiet.

This week I've been benchmarking some of the external nodes that process data and move it into HDFS. Some of the handover testing showed benchmark speeds slower than average. As most will be aware, we use commodity hardware for the external nodes. Each node is configured with RAID1 for the OS and RAID0 for processing.

The RAID0 array is made up of four standard 7200rpm 1TB disks. They are all HP badged but are of varying models/speeds. The customer said the previously handed-over nodes posted benchmarks of around 250MB/s write and 200MB/s read. The slower nodes were showing 190MB/s write and 150MB/s read. Considerably slower.

So, where do I start?

First thing that sprang to mind: how is my customer testing? The answer: dd

 # dd if=/dev/zero of=/mnt/raid/test1 bs=1024k count=10240

Looks reasonable: a 10GB file. I replicate the customer's tests and see the same speeds. I run the same test on an already accepted server and it returns significantly faster results (270MB/s write, 490MB/s read). Wow, that's a big difference. Is it being cached? How? Where? How do I test without the cache?

Following some research, I stumbled across the "oflag=direct" argument, which bypasses the page cache and writes directly to the device. I try dd with the new argument and the speeds are much slower; each disk only appears to be writing at 20MB/s. Can you read/write directly to a software RAIDed device? It doesn't make sense to me. Do some more research. I've used bonnie++ previously for benchmarking, and iozone comes highly recommended too. I've never used iozone, so I install it and start bashing out some benchmarks.
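
For reference, the direct-I/O variants look something like this; dropping the page cache (the echo into drop_caches) is just an alternative way of keeping cached data out of a read test:

 # dd if=/dev/zero of=/mnt/raid/test1 bs=1024k count=10240 oflag=direct
 # sync ; echo 3 > /proc/sys/vm/drop_caches
 # dd if=/mnt/raid/test1 of=/dev/null bs=1024k iflag=direct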

All *slow* nodes when tested under iozone return seemingly sane results. Command used:

 # iozone -U /mnt/raid -f /mnt/raid/test1 -s 100g -i 0 -i 1 -e

It returns ~250MB/s write and ~200MB/s read. I can live with that; it seems sane. I run the same test again on the previously accepted server with the unusually high numbers and I see the same results (270MB/s write, 490MB/s read).

Next thing.....what's different in the OS configuration? Same RAID0 configuration, identical sysctl output, same BIOS settings, and none of the disks were showing as faulty.

Next thing......what's different in terms of hardware? Supposedly they're the same hardware (even down to the disks), although this wasn't strictly true, as I found out. Being commodity hardware, the hard disks fail...a lot. They're replaced by our hardware chaps. Only, they must have run out of the older disks and started replacing them with newer models. Could a couple of faster disks in the RAID0 array account for the speed increase?

You always hear that "Your raid array is only as fast as the slowest disk!". Is that really the case? Do some more research.....

When writing sequentially to a RAID0 volume, each chunk in a stripe has to be written to its disk before the array can move on, so in a sense you are limited by the slowest disk. But if you have one slow disk and three fast ones, the speeds should still be a lot faster than with four slow disks, right? Let's test.....
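
As a quick sanity check before the proper benchmarks, mdadm will show what's actually in the array (members and chunk size), and hdparm gives a rough per-disk read figure without touching the data; the device names here are assumptions:

 # mdadm --detail /dev/md0
 # for d in /dev/sd[a-d] ; do hdparm -t $d ; done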

I found all the types of disk we had in our clusters and started to benchmark each disk type using iozone (same command as above). Depending on the model of disk, I saw variations of up to 70MB/s. Write speeds varied from 70MB/s up to around 110MB/s, and read speeds from 200MB/s up to 490MB/s where all four disks were new.

What have I learnt after all this? We need to take note of the disks hitting our clusters. If we care about performance (using raid0 in the first place assumes we do), then putting faster disks in makes sense.

We now have benchmarks for each type of disk. I want to test each and every disk to help weed out the "bad eggs" within our cluster.

Do I have any massive misunderstandings above? Is there a better way of benchmarking disks / software raided volumes?

Saturday 21 April 2012

XBMC Live Cursor showing up after power cycling TV

A massive annoyance on my XBMC Live media centre (XBMC Live (Eden) running on a Foxconn NT-A3500): if I turn my TV off and on again, the X cursor remains on the screen. This didn't happen when I first installed XBMC Live, and I didn't consciously change any settings to cause it. The only ways I could find to get around it were to reboot the media centre or jump on the laptop and restart lightdm.

Then I got creative and wrote a quick script to run under cron (every 5 minutes) to detect when the TV is powered on and restart lightdm (the crontab entry is shown after the script).

#!/bin/bash
# Restart lightdm when the TV is powered back on (the HDMI hotplug shows up in syslog).
TVON="HDMI status: Pin=3 Presence_Detect=1 ELD_Valid=1"

# Only look at syslog if it has been written to in the last 5 minutes.
for i in $(find /var/log/ -maxdepth 1 -name syslog -mmin -5)
do
    # Grab the most recent HDMI hotplug line.
    TEST=$(grep "$TVON" "$i" | tail -n 1)
    if [ -n "$TEST" ] ; then
        # Is this a new event, or one we have already handled?
        TIMESTAMP=$(echo "$TEST" | awk '{print $6}' | tr -d '[]')
        OLD_TIMESTAMP=$(cat /tmp/old_timestamp.txt 2>/dev/null)
        if [ "${TIMESTAMP}" != "${OLD_TIMESTAMP}" ] ; then
            # New hotplug event - restart lightdm and remember the timestamp.
            service lightdm restart
            echo "${TIMESTAMP}" > /tmp/old_timestamp.txt
        fi
    fi
done
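
The cron side is just a standard five-minute entry in root's crontab (it needs to be root to restart lightdm; the script path here is an assumption):

*/5 * * * * /usr/local/bin/tv_cursor_fix.sh >/dev/null 2>&1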

Not elegant, but currently functional (no idea whether restarting XBMC this way could be the cause of my previous post).

Cursor no more..... :-)

XBMC Live remote not working?

Slightly off topic but my son was watching Thomas the Tank Engine on the TV yesterday (via XBMC live on a Foxconn NT-A3500). Everything was fine. I put him to bed, came down and the remote wouldn't work anymore.

I have an MCE remote I bought from Maplin (model RC118) that just worked; I didn't need to perform any button translation, which was good. Now it's broken. Tried a few things:


  • Restart xbmc (service lightdm restart)
  • Restart server
  • Reseat Infrared receiver
Nothing works. 

dmesg shows me that the Infrared receiver is registering:

mceusb 8-2:1.0: Registered Formosa21 eHome Infrared Transceiver on usb8:2
XBMC just isn't able to understand the key presses. Now I know from reading the xbmc forums that lirc is partly responsible for remote controls. I try to restart lirc (service lirc restart) and it complains about missing kernel modules. w000000t. Nothing has happened on the media centre to remove kernel modules in the time I was (ahem, I mean my son was) watching Thomas the Tank Engine.

To google.....

Found the 'irw' command:
irw
000000037ff05bf2 00 Home mceusb_hauppauge

It works, which proves the hardware is functioning. So what's up with XBMC? Nothing obvious in the logs (other than the reference to lirc not starting). Google tells me to check ~/.xbmc/userdata/Lircmap.xml. Oh, right. That doesn't exist. I'm assuming that's my problem. But this worked like an hour ago. What's caused the file to disappear? (I still don't have a clue.)

Oh well, to resolve:

cp -p /usr/share/xbmc/system/Lircmap.xml /home/bleh/.xbmc/userdata/

Edit Lircmap.xml in the new location. I deleted all sections other than the one starting:

<lircmap>
        <remote device="mceusb">
<---------Stuff--------->
        </remote>
</lircmap>

Then I renamed mceusb to the name irw was showing the device as (mceusb_hauppauge). I then had to edit the /etc/lirc/hardware.conf file to remove the offending kernel module:

REMOTE_MODULES="lirc_dev "
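
For clarity, after the rename the device line in my Lircmap.xml ends up as:

        <remote device="mceusb_hauppauge">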

Restarted lirc, followed by restarting XBMC and voila. A working remote.

I need to read up to understand what happened, whether removing the kernel module (lirc_mceusb2) has any nasty side effects and on lirc in general.

Hope this helps anyone with the same issue!


Thursday 29 March 2012

Balancing data equally among disks on a datanode

What's the best method of achieving a balanced datanode with respect to disk usage? If we have 5 * 2TB disks in a datanode, and they're each 95% utilised and one fails, how do we then redistribute the data to the new disk?

There doesn't appear to be any method within hadoop (0.20.x) that achieves this. Google doesn't give me any answers other than manually moving the blocks to the new disk. That's a lot of effort. If I can't find one, I'll write a shell script that automatically spreads blocks out equally amongst all available disks.

Stop the datanode process, rebalance within the datanode, restart the datanode process and let it scan for block locations.

Any major flaws to this plan or is there something around already that does this?

Gracias

A.M

**********
Edit
**********

I wrote a small shell script to do this. It works out how much data is on the node and what the average utilisation percentage would be if it were spread evenly across each disk. It then finds and moves the required number of blocks (picking the block size out of hdfs-site.xml) onto the least-used disk, into a randomly chosen subdirXX containing fewer than 60 blocks. A rough sketch of the idea is below.
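
This is not the script itself, just a minimal sketch of the approach; the data directory paths are assumptions, and the datanode must be stopped before you touch anything:

#!/bin/bash
# Sketch: shuffle blocks from the fullest data dir to the emptiest one.
# Assumes the datanode is stopped and dfs.data.dir is /data/1 ... /data/4.
DISKS="/data/1 /data/2 /data/3 /data/4"

# Fullest and emptiest data directories by used space.
SRC=$(du -s $DISKS | sort -rn | head -n 1 | awk '{print $2}')
DST=$(du -s $DISKS | sort -n | head -n 1 | awk '{print $2}')
[ "$SRC" = "$DST" ] && exit 0

# Find a subdir on the destination disk with fewer than 60 blocks in it.
DSTDIR=$(find "$DST" -type d -name "subdir*" | while read d
do
    [ $(ls "$d" | wc -l) -lt 60 ] && echo "$d" && break
done)
[ -z "$DSTDIR" ] && exit 1

# Move a batch of blocks, keeping each block's .meta checksum file with it.
find "$SRC" -type f -name "blk_*" ! -name "*.meta" | head -n 10 | while read BLK
do
    mv "$BLK" ${BLK}_*.meta "$DSTDIR"/
done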

Seems to work fine....


**********
Second Edit
**********
Decommission the node within the namenode so that all blocks are copied elsewhere, then just format all data disks. It's the cleanest way of ensuring a single node is balanced.
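
For reference, decommissioning on 0.20 is just the excludes-file dance (this assumes dfs.hosts.exclude already points at the excludes file in hdfs-site.xml; the hostname and path are placeholders):

 # echo "datanode14.example.com" >> /etc/hadoop/conf/excludes
 # hadoop dfsadmin -refreshNodes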

Wednesday 28 March 2012

Contributing to apache hadoop #2

I've always wanted to contribute something back into the open source community. I thought I'd start off with something ridiculously simple. They offer *newbie* JIRAs after all. I picked one, created a patch and tested it. I uploaded the patch.

And I waited for a few days like the "HowToContribute" page says....

After a week, I asked in the IRC channel what I needed to do to get the patch reviewed. The answer was to add a comment asking for someone to review it. So I did that.....

I saw they were asking for patches to be included in 1.0.2, so I mailed the release manager (Matt Foley) asking him to include it. He said he'd try and get it reviewed. The RC was released for 1.0.2 (minus my patch).

I uploaded the patch on the 18th Feb 2012. I last updated the JIRA asking for someone to look at it on the 29th Feb 2012. It's now the end of March and there's still no movement.

It does beg the question of "What's the point?". May as well invest my time in something else. But what?

The calm before the storm.....

Have had some quiet time in recent weeks at work whilst waiting for some of our clusters to enter their outage periods.

It was decided to move from our yahoo-0.20.10 install to apache-0.20.203.0, as it was the latest stable version when the decision was made. I did try to sway the decision towards 1.0.0, but it wasn't mine to make. Started off with our dev clusters, which went smoothly, then did the first of our main clusters last week, which also went smoothly in the end.

Had some oddities during the namenode upgrade due to a corrupt edits log but once that was corrected things went well.

Finally set up collectl to multicast its metrics to a local gmond (3.2.x), which then forwards to its gmetad server and into an RRD database. Ganglia has some nice default graphs but lacks some features that I want. For example, I want to be able to have hierarchical groups and summary tables:

Currently, if I specify the following groups in gmetad:

Masters (consisting of namenode, secondary and jobtracker)
DataNode_Rack01 (consists of 20 datanodes)
DataNode_Rack02 (consists of 20 datanodes)
DataNode_Rack03 (consists of 20 datanodes)

By default it will summarise at the Masters level, at each of the DataNode racks, and across all of those groups. I want the ability to summarise on, say, just the DataNode groups. I don't see a way of doing that in Ganglia. I've created some custom PHP scripts to make some graphs, and have even created the relevant summary graphs manually, but they are horrible: hundreds of lines and ridiculously hard to keep up to date.
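
For context, the groups above are just standard data_source lines in gmetad.conf, something along these lines (hostnames are placeholders):

data_source "Masters" namenode01:8649
data_source "DataNode_Rack01" rack01-dn01:8649
data_source "DataNode_Rack02" rack02-dn01:8649
data_source "DataNode_Rack03" rack03-dn01:8649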

Has anyone done anything similar? Know of a way around this? Have some examples they can throw around?

Sunday 4 March 2012

Hadoop sloooow start up

Restarted one of our hadoop clusters following an outage this week and it became apparent that the datanodes were taking a long time to register with the namenode. A similar cluster registers all datanodes within a few minutes; this cluster took an hour. I could tail the namenode log and the nodes would register, then lose connectivity (lose heartbeat) before reconnecting. The network is absolutely not saturated.

There are no firewalls involved. There is more than enough memory for the namenode to run. The heap is large and there is plenty of memory for the OS. There isn't anything apparent in the OS logs to suggest why this is happening.

I have collectl running, so I'll start to investigate those logs tomorrow morning.

What can I monitor in terms of the namenode and network connectivity / open connections / sockets in use that's Hadoop specific?

The OS limits aren't an issue either. Could it be worth turning on debug logging for the namenode? In my experience of doing this on the jobtracker, all it does extra is log the heartbeat calls. That's not what I'm looking for.

I need to get an outage on the cluster so that I can strace the namenode process on startup to see if that sheds any light on things.
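
When I do get the outage, the plan is roughly to attach strace to the namenode and watch the connection states on its RPC port while the datanodes register. Port 8020 is assumed from the default fs.default.name; adjust to suit:

 # strace -f -tt -o /tmp/namenode.strace -p $(pgrep -f org.apache.hadoop.hdfs.server.namenode.NameNode)
 # netstat -ant | grep ':8020' | awk '{print $6}' | sort | uniq -c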

*****************************************
Update
*****************************************

When the tasktrackers restart, they clean up old files (including the distributed cache). This can take some time (tens of minutes), and this turned out to be the cause of the issue.

Thursday 1 March 2012

Nagios and check_mk

Still waiting on a hadoop committer / reviewer to look at my first patch, so in the meantime I've been looking into check_mk and what it offers us as a layer on top of nagios.

It's pretty damn cool. Downloaded the latest stable version and installed it along with nagios 3.3.1. The install was pretty easy, but would be much better via RPM. Much, much cleaner.

Started to read through the docs provided in the source code and realised they didn't scratch the surface. Anything useful is only documented on the check_mk website. This is becoming more and more frequent, and it's stupidly annoying when you don't have internet access readily available on the system being installed. After downloading quite a few pages of documentation, I managed to write a main.mk that could correctly create the nagios configuration for us, along with the associated service/host groups.

Bastardised the check_mk_agent (xinetd isn't great; we'd much rather configure SSH keys, which we did) so it only picks up specific checks when inventory is run. Most of our checks are run via mrpe for now, as it's an easier migration than trying to integrate them natively into check_mk.

Lots more to do with check_mk over the next two weeks, like integrating it into our event handlers and ensuring the service group definitions let us get our hadoop availability reports (availability of the namenode, jobtracker and secondary namenode processes) right.

What's the standard method of starting hadoop daemons on a cluster? I know Cloudera have init scripts, but besides starting the daemons they don't do a whole lot. I wrote custom init scripts for our clusters. A daemon (tasktracker / datanode) will only start if the following conditions are met (a rough sketch of the checks follows the list):


  • Its relevant master node process is listening for connections (namenode / jobtracker)
  • Its hadoop configuration (hdfs-site.xml / mapred-site.xml / core-site.xml) matches the centrally held configuration master
  • The server passes all necessary hardware checks (no memory errors, no CPU errors, no hard disk errors). 
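
A rough sketch of what those checks look like in practice (heavily simplified, and not our actual init script; the hostnames, ports, paths and the hdfs user below are all assumptions):

#!/bin/bash
# Pre-start checks for a datanode init script (sketch only).
NAMENODE=namenode01
NN_PORT=8020
CONF_DIR=/etc/hadoop/conf
CONF_MASTER=/mnt/config-master/hadoop/conf

# 1. The master process must be listening before we start.
if ! nc -z -w 5 "$NAMENODE" "$NN_PORT" ; then
    echo "namenode not listening on $NAMENODE:$NN_PORT - not starting" >&2
    exit 1
fi

# 2. Local config must match the centrally held master copy.
for f in core-site.xml hdfs-site.xml mapred-site.xml ; do
    if ! diff -q "$CONF_DIR/$f" "$CONF_MASTER/$f" >/dev/null ; then
        echo "$f differs from the config master - not starting" >&2
        exit 1
    fi
done

# 3. Basic hardware sanity from the kernel log (memory / disk errors).
if dmesg | egrep -qi 'ata.*error|EDAC.*(CE|UE)' ; then
    echo "hardware errors in dmesg - not starting" >&2
    exit 1
fi

# All checks passed - start the daemon as normal.
su - hdfs -c "/usr/lib/hadoop/bin/hadoop-daemon.sh start datanode"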

* Failing hard disks are quite painful within hadoop clusters. Failed disks are easy to find. Failing disks are horrible. What do you class as a failing disk? When it's throwing out ATA errors like nobody's business? A single ATA error? When its performance has slowed? How do you test disk performance in a live environment? Do you take the whole node offline and run some form of benchmarking test? Do you see an issue, remove it instantly and work on it in the background?
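
For what it's worth, SMART is a cheap first check when deciding whether a disk is on its way out, although it's far from definitive (device name is a placeholder):

 # smartctl -H /dev/sda
 # smartctl -A /dev/sda | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'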

Saturday 25 February 2012

Success at last...hadoop-1.0.0 passes all tests


Re-ran the tests last night and only one (new) test failed:

[junit] Test org.apache.hadoop.metrics2.impl.TestMetricsSystemImpl FAILED

So I assume the GangliaMetrics test was fixed by the hosts file too. 

I found MAPREDUCE-3894, which seems to explain the failure and suggests it's intermittent. I've tested this by re-running just that test and, lo and behold, it passed:

# ant test -Dtestcase=TestMetricsSystemImpl
# BUILD SUCCESSFUL
# Total time: 24 seconds

Finally, a clean test run (even with my patch attached). Now I just need someone to look at / approve my patch:

https://issues.apache.org/jira/browse/MAPREDUCE-3807

Any clue as to how I draw someone's attention to this?

Friday 24 February 2012

The quest continues

The next run finished with two failures. Getting closer:


HADOOP-7949 [junit] Test org.apache.hadoop.ipc.TestSaslRPC FAILED
[junit] Test org.apache.hadoop.metrics2.impl.TestGangliaMetrics FAILED

On further inspection, the TestSaslRPC failure was tracked down to HADOOP-7949 and the need for localhost to appear before localhost.localdomain in the /etc/hosts file.
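
In other words, the loopback line wants to look something like this (anything after the two localhost entries is whatever your box already has):

127.0.0.1   localhost localhost.localdomain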

Wow, just wow.

It shouldn't be this difficult to set up a build environment. I've resorted to going to the hadoop dev mailing lists to see if anyone can help with the last one.

Fingers crossed!!!


Thursday 23 February 2012

And so the real testing begins.....

First run through of the ant tests with ant 1.7.1, and the following tests failed (along with the JIRAs that help describe the failures):


HADOOP-7949 [junit] Test org.apache.hadoop.ipc.TestSaslRPC FAILED
MAPREDUCE-3357 [junit] Test org.apache.hadoop.filecache.TestMRWithDistributedCache FAILED
MAPREDUCE-2073 [junit] Test org.apache.hadoop.filecache.TestTrackerDistributedCacheManager FAILED
HBASE-3285 [junit] Test org.apache.hadoop.hdfs.TestFileAppend4 FAILED
MAPREDUCE-3594 [junit] Test org.apache.hadoop.streaming.TestUlimit FAILED
Too many open files [junit] Test org.apache.hadoop.mapred.TestCapacityScheduler FAILED

Increased the number of open files to fix the last test, chmod'd my home dir to resolve MAPREDUCE-2073, and ran again.....
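
The open files fix was the usual ulimit / limits.conf change (64000 is an arbitrary number, and "builduser" is a placeholder for whichever account runs the tests):

 # ulimit -n 64000
 # echo "builduser - nofile 64000" >> /etc/security/limits.conf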

Wednesday 22 February 2012

Abject failure continued 2

After trawling mailing lists again, I finally stumbled across this:

http://grokbase.com/p/hadoop/common-dev/1188pq2p5h/vote-release-0-20-204-0-rc0

The useful part:

I hit this one too. If you look at that test case, you'll see it has an @Ignore on it. For some unknown reason, when you use ant 1.8.2 junit does the wrong thing. Use ant 1.7.1 and the test cases will be properly ignored.
-- Owen

To quote Bobby Boucher....."that information would have been useful yesterday!!!!!"

Finally got around to installing ant 1.7.1 and re-running ant test.

A few more hours to go....

Thursday 16 February 2012

Abject failure continued......

Built my CentOS VM, checked out the source, ensured all dependencies were met, ran ant test, and it failed with the same tests as before. Time to go to the mailing lists to see what I'm doing wrong....

Monday 13 February 2012

Abject failure.....

Well that didn't go to plan. Failure once again. A few tests failed. I can read the log output within:

build/test/TEST-<test-name>

But I don't see any kind of report that might allude to which of the numerous tests failed. Scrolling back through hundreds of lines of text doesn't seem productive.

Think I need to do some more research (and build a dedicated hadoop build machine in a VM on my laptop). A plain RHEL 5u4 image built from scratch. Let's document the shit outta this process and see where I'm going wrong....

Interrupted slightly tonight by purchasing a new NAS drive (a Synology DS411). Something to play with whilst pulling my hair out over my inability to submit a patch that took me about 5 minutes to actually create.

So far it's taken about 10 hours of effort to get down to a handful of failing unit tests.

To be continued.....

Contributing to apache-hadoop (hadoop-1.0.0)

I decided it was time to contribute to the apache hadoop project. I haven't touched java in a while, so I thought I'd start with something easy. I found an unassigned JIRA, fixed the issue (using vim, as it just worked) and then started to look at how I actually test / submit my patch. There's some information on how to contribute at the following links:

http://wiki.apache.org/hadoop/HowToContribute
http://wiki.apache.org/hadoop/QwertyManiac/BuildingHadoopTrunk

I had a fedora 16 box and installed the necessary software to develop my hadoop patch. I checked out the source using svn:


# svn checkout http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1 hadoop

Then checked my environment by running ant:

# cd hadoop
# ant

This took a while, but built cleanly. Cool. I thought my environment must be good. Think again....

Following the instructions on the HowToContribute page, I created a patch. I then needed to test the patch against the existing tests. Easy....

# ant test

I left it running overnight, as it can take up to 3 hours to complete (as suggested on the mailing lists). I checked the next morning and about 25% of the tests had failed. I read through some of the logs and it looked like permission errors. I'm quite competent in this area, so I dug a little: the JUnit tests expect specific file permissions. After some trial and error, a umask of 0022 seemed to work just fine:

# umask 0022

Re-run the tests and wait a few hours.

It completed again with a lot of errors. Fewer than before, but not zero, so something else must be wrong. I looked through these logs and noticed the regular expressions expect the hostname to be localhost. That's clearly not the case here. I *correct* this and am currently re-running the tests. We shall see how it goes in the morning......

If and when the tests complete OK, I then need to read up on any prerequisites to submitting patches. As this is only a small fix (and a similar one didn't require a new test), I'm not planning to (learn how to) write a JUnit test. Other than that, I'll just submit the patch and hope the community doesn't bite.