Thursday 29 March 2012

Balancing data equally among disks on a datanode

What's the best method of achieving a balanced datanode with respect to disk usage? If we have 5 * 2TB disks in a datanode, and they're each 95% utilised and one fails, how do we then redistribute the data to the new disk?

There doesn't appear to be any method within hadoop (0.20.x) that achieves this, and Google doesn't give me any answers other than manually moving the blocks to the new disk. That's a lot of effort. If I can't find one, I'll write a shell script that automatically spreads blocks out equally amongst all available disks.

The plan: stop the datanode process, rebalance the blocks across the disks within the datanode, restart the datanode process and let it rescan for block locations.

Any major flaws to this plan or is there something around already that does this?

Thanks

A.M

**********
Edit
**********

I wrote a small shell script to do this. It works out how much data is on the node and what the average utilisation percentage would be if the data were spread evenly across each disk. It then finds and moves the required number of blocks (picking the block size out of hdfs-site.xml) onto the least used disk, into a randomly named subdirXX containing fewer than 60 blocks.
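For reference, the rough shape of it is below. This isn't the actual script, just a minimal sketch: it assumes the dfs.data.dir disks are mounted as /data/1 to /data/5, the 0.20 on-disk layout of current/subdirXX directories, and that the datanode is stopped before anything is moved. The real script also checks that the target subdir holds fewer than 60 blocks and derives the number of blocks to move from the configured block size; both are left out here for brevity.

    #!/bin/bash
    # Sketch only: move a handful of blocks from the fullest data disk to the emptiest.
    DISKS="/data/1 /data/2 /data/3 /data/4 /data/5"     # assumed dfs.data.dir mounts

    # fullest and emptiest disks by used percentage
    fullest=$(df -P $DISKS | tail -n +2 | sort -k5 -nr | head -1 | awk '{print $6}')
    emptiest=$(df -P $DISKS | tail -n +2 | sort -k5 -n  | head -1 | awk '{print $6}')

    # pick a subdir on the emptiest disk to receive the blocks
    target=$(find $emptiest/dfs/data/current -maxdepth 1 -type d -name 'subdir*' | head -1)

    # move block files together with their .meta files (blk_<id> and blk_<id>_<genstamp>.meta)
    find $fullest/dfs/data/current -maxdepth 2 -name 'blk_*' ! -name '*.meta' | head -10 |
    while read blk; do
      mv "$blk" ${blk}_*.meta "$target/"
    done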

Seems to work fine....


**********
Second Edit
**********
Decommission the node within the namenode so that all blocks are copied elsewhere, then just format all data disks. It's the cleanest way of ensuring a single node is balanced.
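In commands, the decommission route looks roughly like this, assuming dfs.hosts.exclude in hdfs-site.xml points at /etc/hadoop/conf/excludes and the node is datanode05 (both made-up names):

    # add the node to the exclude file and tell the namenode to re-read it
    echo "datanode05.example.com" >> /etc/hadoop/conf/excludes
    hadoop dfsadmin -refreshNodes

    # wait until the node is reported as Decommissioned
    hadoop dfsadmin -report | grep -A 2 datanode05

    # then stop the datanode, rebuild the filesystems on its data disks,
    # remove it from the exclude file, -refreshNodes again and start it back up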

Wednesday 28 March 2012

Contributing to apache hadoop #2

I've always wanted to contribute something back into the open source community. I thought I'd start off with something ridiculously simple. They offer *newbie* JIRAs after all. I picked one, created a patch and tested it. I uploaded the patch.

And I waited for a few days like the "HowToContribute" page says....

After a week, I asked in the IRC channel what I needed to do to get the patch reviewed. The answer was to add a comment asking for someone to review it. So I did that.....

I saw they were asking for patches to be included in 1.0.2, so I mailed the release manager (Matt Foley) and asked him to include mine. He said he'd try to get it reviewed. The RC for 1.0.2 was released (minus my patch).

I uploaded the patch on the 18th Feb 2012. I last updated the JIRA asking for someone to look at it on the 29th Feb 2012. It's now the end of March and there's still no movement.

It does beg the question: "What's the point?". May as well invest my time in something else. But what?

The calm before the storm.....

Have had some quiet time in recent weeks at work whilst waiting for some of our clusters to enter their outage periods.

It was decided to move from our yahoo-0.20.10 install to apache-0.20.203.0 as it was the latest stable version when the decision was made. I did try to sway the decision to 1.0.0, but it was not mine to make. Started off with our dev clusters, which went smoothly, then did the first of our main clusters last week, which also went fine in the end.

Had some oddities during the namenode upgrade due to a corrupt edits log but once that was corrected things went well.

Finally set up collectl to multicast its metrics to a local gmond (3.2.x), which then forwards them to its gmetad server and into an RRD database. Ganglia has some nice default graphs but lacks some features that I want. For example, I want to be able to have hierarchical groups and summary tables:

Currently, if I specify the following groups in gmetad:

Masters (consisting of namenode, secondary and jobtracker)
DataNode_Rack01 (consists of 20 datanodes)
DataNode_Rack02 (consists of 20 datanodes)
DataNode_Rack03 (consists of 20 datanodes)

It will by default summarise at the Masters level, at each of the DataNode racks, and across all of those groups combined. I want the ability to summarise across, say, just the DataNode groups. I don't see a way of doing that in ganglia. I've created some custom php scripts to make some graphs, and have even created the relevant summary graphs manually, but they are horrible: hundreds of lines and ridiculously hard to keep up to date.
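For context, the groups above map onto one data_source line each in gmetad.conf, something like the sketch below (hostnames are made up). gmetad will happily summarise each data_source and the grid as a whole, but there's no obvious knob for an intermediate summary over just the three DataNode_RackXX sources.

    # gmetad.conf (sketch)
    gridname "Hadoop"
    data_source "Masters"         namenode01:8649 jobtracker01:8649
    data_source "DataNode_Rack01" rack01-node01:8649 rack01-node02:8649
    data_source "DataNode_Rack02" rack02-node01:8649 rack02-node02:8649
    data_source "DataNode_Rack03" rack03-node01:8649 rack03-node02:8649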

Has anyone done anything similar? Know of a way around this? Have some examples they can throw around?

Sunday 4 March 2012

Hadoop sloooow start up

Restarted one of our hadoop clusters following an outage this week and it became apparent that the datanodes were taking a long time to register with the namenode. A similar cluster registers all its datanodes within a few minutes; this cluster took an hour. I could tail the namenode log and watch nodes register, then lose connectivity (lose heartbeat) before reconnecting. The network is absolutely not saturated.

There are no firewalls involved. There is more than enough memory for the namenode to run. The heap is large and there is plenty of memory for the OS. There isn't anything apparent in the OS logs to suggest why this is happening.

I have collectl running so I'll start to investigate those logs tomorrow morning.

What can I monitor in terms of the namenode and network connectivity / open connections / sockets in use that's hadoop-specific?
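A few starting points I can think of, sketched below; the RPC port and hostname are assumptions based on a default fs.default.name.

    # how many datanodes have registered so far
    hadoop dfsadmin -report | grep -c '^Name:'

    # connection states on the namenode RPC port, e.g. piles of SYN_RECV or CLOSE_WAIT
    netstat -tan | awk '$4 ~ /:8020$/ {print $6}' | sort | uniq -c

    # RPC and namenode metrics, if the metrics servlet is enabled on the web UI
    curl -s http://namenode01:50070/metrics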

The OS limits aren't an issue either. Could it be worth turning on debug logging for the namenode? In my experience of doing this on the jobtracker, all it adds is logging of heartbeat calls. That's not what I'm looking for.

I need to get an outage on the cluster so that I can strace the namenode process on startup to see if that sheds any light on things.

*****************************************
Update
*****************************************

When the tasktrackers restart, they clean up old files (including distributed cache). This can take some time (tens of minutes). This was the cause of the issue.
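If you want to see this coming before a restart, a quick check of how much there is to clean up looks like this, assuming mapred.local.dir is spread across /data/*/mapred/local (an assumption about the layout):

    for d in /data/*/mapred/local; do du -sh "$d"; done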

Thursday 1 March 2012

Nagios and check_mk

Still waiting on a hadoop committer / reviewer to look at my first patch, so in the meantime I've been looking into check_mk and what it offers us as a layer on top of nagios.

It's pretty damn cool. Downloaded the latest stable version and installed it along with nagios 3.3.1. Install was pretty easy, but would be much better via rpm. Much, much cleaner.

Started to read through the docs provided in the source code and realised they didn't scratch the surface. Anything useful is only documented on the check_mk website. This is becoming more and more frequent, and it's stupidly annoying when you don't have internet access readily available on the system being installed. After downloading quite a few pages of documentation, I managed to write a main.mk that correctly creates the nagios configuration for us, along with the associated service/host groups.
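For anyone in the same position, the shape of the main.mk I ended up with is roughly the sketch below. Hostnames and tags are made up and it's only the skeleton; the idea is that host groups are driven from host tags so check_mk generates the nagios hostgroup definitions for us.

    # main.mk (sketch)
    all_hosts = [
        "namenode01|master",
        "jobtracker01|master",
        "datanode01|worker",
        "datanode02|worker",
    ]

    define_hostgroups = True
    host_groups = [
        ( "hadoop-masters", [ "master" ], ALL_HOSTS ),
        ( "hadoop-workers", [ "worker" ], ALL_HOSTS ),
    ]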

Bastardised the check_mk_agent (xinetd isn't great; we'd much rather configure ssh keys, which we did) so it only picks up specific checks when inventory is run. Most of our checks are run via mrpe for now, as it's an easier migration than trying to integrate them natively into check_mk.
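The mrpe side is just the agent's /etc/check_mk/mrpe.cfg, one check per line: a service name followed by the command line to run (the plugin names and paths below are placeholders, not our actual checks).

    Disk_SMART   /usr/lib64/nagios/plugins/check_smart -d /dev/sda
    HDFS_Space   /usr/local/nagios/plugins/check_hdfs_space -w 80 -c 90

The agent runs each line and reports the results under its <<<mrpe>>> section, so they get picked up by inventory much like native checks.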

Lots more to do with check_mk over the next two weeks, like integrating it into our event handlers and ensuring the service group definitions let us generate our hadoop availability reports (availability of the namenode, jobtracker and secondary namenode processes) correctly.

What's the standard method of starting hadoop daemons on a cluster? I know cloudera have init scripts, but besides starting the daemons they don't do a whole lot. I wrote custom init scripts for our clusters. A daemon (tasktracker / datanode) will only start if the following conditions are met (a rough sketch follows the list):


  • Its relevant master node process is listening for connections (namenode / jobtracker)
  • Its hadoop configuration (hdfs-site.xml / mapred-site.xml / core-site.xml) matches the centrally held master configuration
  • The server passes all necessary hardware checks (no memory errors, no cpu errors, no hard disk errors). 
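A stripped-down sketch of the datanode side of that init script is below. The hostnames, ports and paths are placeholders, and the hardware check is reduced to a grep of dmesg for brevity.

    #!/bin/bash
    MASTER=namenode01                           # placeholder master host
    RPC_PORT=8020                               # assumed fs.default.name port
    CONF_DIR=/etc/hadoop/conf
    CENTRAL=http://configmaster/hadoop/conf     # hypothetical central config copy

    # 1. the master process must be listening for connections
    nc -z $MASTER $RPC_PORT || exit 1

    # 2. local config must match the centrally held copy
    for f in core-site.xml hdfs-site.xml mapred-site.xml; do
        curl -s $CENTRAL/$f | diff -q - $CONF_DIR/$f > /dev/null || exit 1
    done

    # 3. crude hardware check: bail if the kernel has logged disk/memory/cpu errors
    dmesg | egrep -qi 'ata.*error|i/o error|machine check' && exit 1

    # all clear: start the daemon
    /usr/lib/hadoop/bin/hadoop-daemon.sh start datanode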

* Failing hard disks are quite painful within hadoop clusters. Failed disks are easy to find. Failing disks are horrible. What do you class as a failing disk? When it's throwing out ATA errors like no-one's business? A single ATA error? When its performance has slowed? How do you test disk performance in a live environment? Do you take the whole node offline and run some form of benchmarking test? Do you see an issue, remove it instantly and work on it in the background?
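For what it's worth, the quick, non-destructive checks I'd start with on a suspect disk look something like this (/dev/sdX is a placeholder):

    # SMART health and attribute counters (reallocated/pending sectors are the interesting ones)
    smartctl -H -A /dev/sdX

    # rough sequential read speed, bypassing the page cache
    dd if=/dev/sdX of=/dev/null bs=1M count=1024 iflag=direct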