Thursday 1 March 2012

Nagios and check_mk

Still waiting on a Hadoop committer / reviewer to look at my first patch, so in the meantime I've been looking into check_mk and what it offers us as a layer on top of Nagios.

It's pretty damn cool. Downloaded the latest stable version and installed it alongside Nagios 3.3.1. The install was pretty easy, but would be much better via RPM. Much, much cleaner.

Started to read through the docs provided in the source code and realised they didn't scratch the surface. Anything useful is only documented on the check_mk website. This is becoming more and more frequent, and it's stupidly annoying when you don't have internet access readily available on the system being installed. After downloading quite a few pages of documentation, I managed to write a main.mk that correctly creates the Nagios configuration for us, along with the associated service/host groups.
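For context, a minimal main.mk along those lines might look something like the sketch below. The hostnames, tags and service name patterns are all placeholders, and the exact rule-tuple syntax varies between check_mk versions, so treat this as illustrative rather than copy-paste ready:

```python
# main.mk -- illustrative sketch, not our actual config.
# Hosts are tagged so rules can select masters vs workers.
all_hosts = [
    "namenode01|hadoop|master",
    "jobtracker01|hadoop|master",
    "datanode01|hadoop|worker",
    "datanode02|hadoop|worker",
]

# Put every host tagged "hadoop" into one Nagios hostgroup.
host_groups = [
    ( "hadoop-cluster", [ "hadoop" ], ALL_HOSTS ),
]

# Have check_mk emit servicegroup definitions, and group the
# master daemon checks so availability reports can run per group.
define_servicegroups = True
service_groups = [
    ( "hadoop-masters", [ "master" ], ALL_HOSTS, [ "NameNode", "JobTracker" ] ),
]
```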

Bastardised the check_mk_agent (xinetd isn't great; we'd much rather configure SSH keys, which we did) so it only picks up specific checks when inventory is run. Most of our checks are run via MRPE for now, as that's an easier migration than trying to integrate them natively into check_mk.
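MRPE checks are declared on the agent side, one Nagios-style plugin per line. A minimal sketch of an mrpe.cfg, assuming the stock nagios-plugins install path (the Hadoop process check shown is just an example, not one of ours):

```
# /etc/check_mk/mrpe.cfg
# Format: <ServiceDescription> <plugin command line>
Datanode_Proc /usr/lib/nagios/plugins/check_procs -c 1:1 -C java -a datanode
Root_Disk /usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p /
```

The agent runs each line, and check_mk turns the plugin's exit code and output into a normal Nagios service.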

Lots more to do with check_mk over the next two weeks, like integrating it into our event handlers and ensuring the service group definitions let us generate our Hadoop availability reports (availability of the namenode, jobtracker and secondary namenode processes) correctly.

What's the standard method of starting Hadoop daemons on a cluster? I know Cloudera have init scripts, but besides starting the daemons they don't do a whole lot. I wrote custom init scripts for our clusters. A daemon (tasktracker / datanode) will only start if the following conditions are met:


  • Its relevant master node process (namenode / jobtracker) is listening for connections
  • Its Hadoop configuration (hdfs-site.xml / mapred-site.xml / core-site.xml) matches the centrally held configuration master
  • The server passes all necessary hardware checks (no memory errors, no CPU errors, no hard disk errors).

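The gating logic above can be sketched in shell. Everything here is illustrative: the hostname, the central config path, the daemon start command and the kernel-log grep are placeholder assumptions, not the actual scripts:

```shell
#!/bin/sh
# Preflight checks before starting a datanode -- a sketch of the
# three conditions above. All paths/hosts are hypothetical.

MASTER_HOST="namenode01"                      # hypothetical master
MASTER_PORT=8020                              # default HDFS namenode IPC port
CONF_DIR="/etc/hadoop/conf"
CENTRAL_CONF_DIR="/mnt/central/hadoop/conf"   # hypothetical central copy

# 1. Is the master process accepting connections?
master_listening() {
    nc -z -w 5 "$1" "$2"
}

# 2. Does the local config match the centrally held master copy?
config_matches() {
    [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ]
}

# 3. Crude hardware check: any ATA or machine-check noise in the kernel log?
hardware_ok() {
    ! dmesg | grep -qiE 'ata[0-9.]*: *error|machine check'
}

if [ "${1:-}" = "start" ]; then
    master_listening "$MASTER_HOST" "$MASTER_PORT" \
        && config_matches "$CONF_DIR/hdfs-site.xml" "$CENTRAL_CONF_DIR/hdfs-site.xml" \
        && hardware_ok \
        && exec /usr/lib/hadoop/bin/hadoop-daemon.sh start datanode
    echo "preflight checks failed; refusing to start datanode" >&2
    exit 1
fi
```

In a real init script you'd want per-check logging rather than one combined `&&` chain, so the failure reason ends up somewhere useful.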
* Failing hard disks are quite painful within Hadoop clusters. Failed disks are easy to find; failing disks are horrible. What do you class as a failing disk? When it's throwing out ATA errors like no-one's business? A single ATA error? When its performance has slowed? How do you test disk performance in a live environment? Do you take the whole node offline and run some form of benchmarking test? Do you see an issue, remove the disk instantly and work on it in the background?
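One possible (and deliberately crude) heuristic for "failing" is a rising SMART reallocated-sector count. A sketch using smartctl, where the threshold of 5 is an arbitrary example rather than a recommendation:

```shell
#!/bin/sh
# Flag a disk as "failing" when SMART Reallocated_Sector_Ct exceeds a
# threshold. Heuristic sketch only -- the threshold is made up.

# Extract the raw value (10th column) of the reallocated-sector
# attribute from `smartctl -A` output fed on stdin.
realloc_count() {
    awk '/Reallocated_Sector_Ct/ { print $10 }'
}

# Returns success (0) if the disk looks like it is failing.
disk_failing() {
    count=$(smartctl -A "$1" | realloc_count)
    [ "${count:-0}" -gt 5 ]
}
```

This catches one failure mode only; it says nothing about a disk that's merely slow, which is the harder half of the question.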
