Have had some quiet time in recent weeks at work whilst waiting for some of our clusters to enter their outage periods.
It was decided to move from our yahoo-0.20.10 install to apache-0.20.203.0, as it was the latest stable version when the decision was made. I did try to sway the decision towards 1.0.0, but it was not mine to make. I started with our dev clusters, which went smoothly, then did the first of our main clusters last week, which also went smoothly in the end.
There were some oddities during the namenode upgrade due to a corrupt edits log, but once that was corrected things went well.
I've also finally set up collectl to multicast its metrics to a local gmond (3.2.x), which then forwards to its gmetad server and into an RRD database. Ganglia has some nice default graphs but lacks some features that I want. For example, I want to be able to have hierarchical groups and summary tables.
Currently, if I specify the following groups in gmetad:
Masters (consisting of namenode, secondary and jobtracker)
DataNode_Rack01 (consists of 20 datanodes)
DataNode_Rack02 (consists of 20 datanodes)
DataNode_Rack03 (consists of 20 datanodes)
By default it will summarise at the Masters level, at each of the DataNode racks, and across all of those groups together. I want the ability to summarise on, say, just the DataNode groups. I don't see a way of doing that in Ganglia. I've created some custom PHP scripts to make some graphs and have even created the relevant summary graphs manually, but they are horrible: hundreds of lines long and ridiculously hard to keep up to date.
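To make the kind of summary I'm after concrete, here is a rough Python sketch of summarising one metric over just the groups whose name starts with "DataNode_". It assumes the metrics are available as an XML dump with nested CLUSTER, HOST and METRIC elements carrying NAME and VAL attributes, which is how I understand gmetad's XML output to look. The function name summarise_metric, the hostnames and the sample values are all made up for illustration.

```python
import xml.etree.ElementTree as ET


def summarise_metric(xml_text, metric, cluster_prefix):
    """Sum and average one metric over every host in the clusters
    whose NAME starts with cluster_prefix (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    total = 0.0
    hosts = 0
    for cluster in root.iter("CLUSTER"):
        # Skip groups we do not want in the summary, e.g. Masters.
        if not cluster.get("NAME", "").startswith(cluster_prefix):
            continue
        for host in cluster.iter("HOST"):
            for m in host.iter("METRIC"):
                if m.get("NAME") == metric:
                    total += float(m.get("VAL"))
                    hosts += 1
    mean = total / hosts if hosts else 0.0
    return {"hosts": hosts, "sum": total, "mean": mean}


# Made-up sample in the assumed layout; real output carries many more attributes.
SAMPLE = """
<GANGLIA_XML>
  <GRID NAME="hadoop">
    <CLUSTER NAME="Masters">
      <HOST NAME="namenode"><METRIC NAME="load_one" VAL="9.0"/></HOST>
    </CLUSTER>
    <CLUSTER NAME="DataNode_Rack01">
      <HOST NAME="dn0101"><METRIC NAME="load_one" VAL="1.0"/></HOST>
      <HOST NAME="dn0102"><METRIC NAME="load_one" VAL="2.0"/></HOST>
    </CLUSTER>
    <CLUSTER NAME="DataNode_Rack02">
      <HOST NAME="dn0201"><METRIC NAME="load_one" VAL="3.0"/></HOST>
    </CLUSTER>
  </GRID>
</GANGLIA_XML>
"""

if __name__ == "__main__":
    print(summarise_metric(SAMPLE, "load_one", "DataNode_"))
```

Something like this only gives the current values, not a graph, so it would cover the summary-table side of what I want rather than replace the RRD-based graphs.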
Has anyone done anything similar? Know of a way around this? Have some examples they can throw around?