I look after a few largish clusters. A typical cluster will consist of:
- Midrange servers for the master nodes (namenode / secondary namenode / jobtracker). When I say midrange, I mean 24-core machines with 128GB RAM and appropriate disk.
- Commodity datanodes (32GB RAM / 16 cores / 5*1TB hard disks)
- Currently Hadoop 0.20.10 (Yahoo's last stable version)
I'm responsible for monitoring these clusters. Our current solution consists of custom nagios scripts (mostly written by myself), ganglia and collectl. I guess this blog is aimed at other sysadmins in a similar scenario, to share and learn new ideas and new ways of monitoring.
Things I've learnt in the past 18 months:
- nagios is ridiculously powerful. You can write checks for *anything*
- swap is evil on datanodes
- collectl is a godsend. With swap turned off and panic_on_oom turned on in the kernel, collectl's process monitoring will tell you which process ate all the memory. The process name will contain the job ID, which you can use to bash the developer who's been massively increasing the JVM sizes.
- Weekly reboots tend to help keep the cluster moving sweetly.
- I love stats. collectl and colplot make me kinda horny.
- Commodity hardware brings commodity hard disks. Push them hard enough and they will fail. A lot. And often. Did I mention a lot?
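As an example of the "checks for *anything*" point: here's a minimal sketch of a nagios-style plugin that counts dead datanodes out of `hadoop dfsadmin -report`. The parsing (the "N total, M dead" line) and the thresholds are my assumptions about the 0.20 report format, not a drop-in plugin.

```shell
#!/bin/sh
# Hypothetical nagios check: alert on dead HDFS datanodes.
# Reads a dfsadmin-style report on stdin; warn/crit thresholds as args.

check_dead_nodes() {
    warn=$1; crit=$2
    # Pull the dead-node count out of a line like:
    #   "Datanodes available: 40 (42 total, 2 dead)"
    dead=$(sed -n 's/.*(\([0-9]*\) total, \([0-9]*\) dead).*/\2/p')
    if [ -z "$dead" ]; then
        echo "UNKNOWN: could not parse dfsadmin report"; return 3
    elif [ "$dead" -ge "$crit" ]; then
        echo "CRITICAL: $dead dead datanodes"; return 2
    elif [ "$dead" -ge "$warn" ]; then
        echo "WARNING: $dead dead datanodes"; return 1
    fi
    echo "OK: $dead dead datanodes"; return 0
}

# In production you'd pipe the real report in, e.g.:
#   hadoop dfsadmin -report | check_dead_nodes 1 3
```

The nice part is that nagios only cares about the exit code (0/1/2/3), so the same skeleton works for disk counts, JVM heap sizes, or anything else you can grep out of a command's output.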
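The no-swap / panic-on-OOM policy above boils down to a few standard Linux knobs (the sysctl is spelled vm.panic_on_oom). A sketch, with illustrative values — whether you really want the box to panic and reboot rather than let the OOM killer pick a victim is a site decision:

```shell
#!/bin/sh
# Illustrative datanode memory policy: no swap, panic instead of thrashing.
# Needs root; values are examples, not recommendations.

swapoff -a                    # disable swap immediately
sysctl -w vm.swappiness=0     # belt and braces if swap ever comes back
sysctl -w vm.panic_on_oom=1   # panic rather than crawl when memory runs out
sysctl -w kernel.panic=60     # auto-reboot 60 seconds after a panic

# Persist across the weekly reboots, e.g. in /etc/sysctl.conf:
#   vm.swappiness = 0
#   vm.panic_on_oom = 1
#   kernel.panic = 60
```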
How is the rest of the world monitoring their hadoop clusters? I see cloudera released Cloudera Manager recently (http://www.cloudera.com/blog/2011/12/cloudera-manager-3-7-released/), which intrigues me. Although the free version is limited to 50 nodes and I hear that their licensing costs are horrific*, I do wonder what it has to offer me and my clusters.
Do you do any log analytics? I've played about with pig and the piggybank's HadoopJobHistoryLoader but haven't implemented anything yet.
* No basis for this statement. Just gossip!