I look after a few largish clusters. A typical cluster will consist of :
Midrange servers for the master nodes (namenode / secondary namenode / jobtracker). When I say midrange, I mean 24-core machines with 128GB RAM and appropriate disk.
Commodity data nodes (32GB RAM / 16 cores / 5*1TB hard disks)
Currently hadoop 0.20.10 (yahoo's last stable version)
I'm responsible for monitoring these clusters. Our current solution consists of custom nagios scripts (mostly written by myself), ganglia and collectl. I guess this blog is aimed at other sysadmins in a similar scenario, to share and learn new ideas and new ways of monitoring.
Things I've learnt in the past 18 months:
- nagios is ridiculously powerful. You can write checks for *anything*
- swap is evil on datanodes
- collectl is a godsend. With swap turned off and panic_on_oom turned on in the kernel, collectl's process monitoring will tell you which process ate all the memory. The entry for that process will contain the job ID, which you can use to bash the developer who's massively increasing the JVM sizes.
- Weekly reboots tend to help keep the cluster moving sweetly.
- I love stats. collectl and colplot make me kinda horny.
- Commodity hardware brings commodity hard disks. Push them hard enough and they will fail. A lot. And often. Did I mention a lot ?
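To make the "checks for *anything*" point concrete, here is a minimal sketch of a custom check that complains when a datanode is using swap. It is not one of my production scripts. The function names and the thresholds (warn on any swap use, critical above 1GB) are assumptions made up for illustration. The only things it relies on are the standard nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the SwapTotal / SwapFree fields in /proc/meminfo.

```python
"""Sketch of a nagios-style swap check for a datanode (illustrative only)."""

# Standard nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_meminfo(text):
    """Return {field: value} for the numeric fields of /proc/meminfo text."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def check_swap(meminfo, warn_kb=0, crit_kb=1024 * 1024):
    """Map swap in use to an (exit_code, message) pair.

    warn_kb and crit_kb are hypothetical thresholds: warn on any swap
    use at all, go critical once more than 1GB is swapped out.
    """
    if "SwapTotal" not in meminfo or "SwapFree" not in meminfo:
        return UNKNOWN, "SWAP UNKNOWN - SwapTotal/SwapFree not found"
    used = meminfo["SwapTotal"] - meminfo["SwapFree"]
    perfdata = " | swap_used=%dKB" % used
    if used > crit_kb:
        return CRITICAL, "SWAP CRITICAL - %d kB in use%s" % (used, perfdata)
    if used > warn_kb:
        return WARNING, "SWAP WARNING - %d kB in use%s" % (used, perfdata)
    return OK, "SWAP OK - no swap in use%s" % perfdata


def main():
    """Read /proc/meminfo, print one status line, return the exit code."""
    try:
        with open("/proc/meminfo") as handle:
            meminfo = parse_meminfo(handle.read())
    except IOError as exc:
        print("SWAP UNKNOWN - %s" % exc)
        return UNKNOWN
    code, message = check_swap(meminfo)
    print(message)
    return code
```

To run it as a real plugin, the script would finish by calling `sys.exit(main())` so nagios sees the exit code. Keeping the threshold logic in `check_swap` means it can be tested against canned meminfo text without touching a live node.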
How is the rest of the world monitoring their hadoop clusters ? I see cloudera released cloudera manager recently (http://www.cloudera.com/blog/2011/12/cloudera-manager-3-7-released/ ) which intrigues me. Although the free version is limited to 50 nodes and I hear that their licensing costs are horrific*, I do wonder what it has to offer me and my clusters.
Do you do any log analytics ? I've played about with pig and the piggybank with the HadoopJobHistoryLoader but haven't implemented anything yet.
regards
Hadoop Admin
* No basis for this statement. Just gossip!
I've heard a lot of comments about collectl, but I've never heard anyone say it makes them horny! You gotta get out more!
But seriously, if you like collectl and colplot you need to spend some time with colmux, especially if you're monitoring a hadoop cluster. Imagine running the equivalent of top across all your hadoop clusters, and instead of top processes you're looking at top disk, top memory usage, top network traffic. In other words, ANY command you can run with collectl you can run via colmux and see an integrated picture across the systems of your choice.
But if that hasn't made you break out in a sweat yet, now imagine running the same cluster-top for any collectl metric against logged data! That's right, you can play back collectl data from n machines and sort the output by top process, disk, etc.
btw - hopefully I'll be releasing a new version next week. Also, have you discovered the --whatsnew switch? I have such a hard time remembering what I've added to each version that I added a switch to tell me [and others].
keep on collectling....
-mark
I've used colmux a little on our development cluster and it worked absolutely fine. I need to do some prep work to ensure it works on our live clusters (checking that it works at scale and, more importantly, whether the underlying OS can deal with the number of required connections), as I'm sure I tried it once and ran into issues with that.
What's the biggest environment (number of nodes) you've had colmux successfully running on?
I've run colmux on over 1K nodes, at least I think I did. But I'll tell you what, I did find a couple of problems with the version that's currently out there, and I plan on releasing a new one some time in the next couple of weeks. I'll try to remember to post a reply here when it's ready, OR if you want to send me a private email I can send you a pre-release version to take for a shake-down run.
-mark