Sunday 4 March 2012

Hadoop sloooow start up

Restarted one of our Hadoop clusters following an outage this week, and it became apparent that the datanodes were taking a long time to register with the namenode. A similar cluster registers all of its datanodes within a few minutes; this cluster took an hour. Tailing the namenode log, I could see nodes register, then lose connectivity (miss heartbeats) before reconnecting. The network is absolutely not saturated.
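
As a first pass on the timings, I can pull the registration and heartbeat-loss events out of the namenode log and line them up per node. This is only a rough sketch: the log path and the two match strings are assumptions and need checking against the actual log lines on this cluster.

```python
#!/usr/bin/env python
# Rough pass over the namenode log: count datanode registrations and
# heartbeat losses per node and note when they happened.
import re
import sys

LOG_PATH = "/var/log/hadoop/hadoop-namenode.log"                       # assumed location
REGISTER_PAT = re.compile(r"registerDatanode.*?(\d+\.\d+\.\d+\.\d+)")  # assumed message text
LOST_PAT = re.compile(r"[Ll]ost heartbeat.*?(\d+\.\d+\.\d+\.\d+)")     # assumed message text

def scan(path):
    registrations, losses = {}, {}
    with open(path) as log:
        for line in log:
            stamp = line[:23]  # default log4j "yyyy-MM-dd HH:mm:ss,SSS" prefix
            match = REGISTER_PAT.search(line)
            if match:
                registrations.setdefault(match.group(1), []).append(stamp)
                continue
            match = LOST_PAT.search(line)
            if match:
                losses.setdefault(match.group(1), []).append(stamp)
    return registrations, losses

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else LOG_PATH
    registrations, losses = scan(path)
    print("%d datanodes registered, %d lost a heartbeat at least once"
          % (len(registrations), len(losses)))
    for node, stamps in sorted(losses.items()):
        print("%s lost heartbeat %d time(s), first at %s" % (node, len(stamps), stamps[0]))
```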

There are no firewalls involved, and there is more than enough memory for the namenode: the heap is large and there's plenty left over for the OS. There isn't anything obvious in the OS logs to suggest why this is happening.

I have collectl running, so I'll start to investigate those logs tomorrow morning.

What can I monitor, in terms of namenode network connectivity, open connections and sockets in use, that's Hadoop-specific?
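
One Hadoop-specific thing I can watch is the RPC metrics the namenode publishes about itself, which include open connection counts and call queue length. Here's a rough sketch of polling them, assuming our build exposes the /jmx JSON servlet on the web UI port (50070 here) and the RPC port is 8020; the bean name would need adjusting otherwise.

```python
#!/usr/bin/env python
# Poll the namenode's JMX servlet for RPC-level metrics: open connections
# and call queue length, printed every few seconds. Assumes the /jmx JSON
# servlet is available on the web UI port and the RPC port is 8020.
import json
import time
import urllib2

JMX_URL = ("http://namenode-host:50070/jmx"
           "?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020")

def poll(url=JMX_URL, interval=10):
    while True:
        beans = json.load(urllib2.urlopen(url)).get("beans", [])
        for bean in beans:
            print("%s open_connections=%s call_queue=%s" % (
                time.strftime("%H:%M:%S"),
                bean.get("NumOpenConnections"),
                bean.get("CallQueueLength")))
        time.sleep(interval)

if __name__ == "__main__":
    poll()
```

Failing that, counting established connections to the RPC port with netstat on the namenode host would give a cruder view of the same thing.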

The OS limits aren't an issue either. Would it be worth turning on debug logging for the namenode? In my experience of doing this on the jobtracker, all the extra output you get is heartbeat calls, and that's not what I'm looking for.
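
If I do go down that road, one option would be to turn debug on for a single logger over HTTP rather than for the whole namenode, so it can be flipped on and off without a restart and without drowning in heartbeat noise. A sketch, assuming the stock /logLevel servlet on the web UI port (as far as I can tell, this is what `hadoop daemonlog -setlevel` does under the hood); the class named below is just an example target:

```python
#!/usr/bin/env python
# Set one namenode logger to DEBUG (and back to INFO later) over HTTP via
# the /logLevel servlet, instead of editing log4j.properties and restarting.
import urllib
import urllib2

NAMENODE_UI = "http://namenode-host:50070"  # assumed web UI address

def set_level(logger, level):
    params = urllib.urlencode({"log": logger, "level": level})
    reply = urllib2.urlopen("%s/logLevel?%s" % (NAMENODE_UI, params))
    print(reply.read())

if __name__ == "__main__":
    # Example target: the code handling datanode registration, rather than
    # the root logger, to keep heartbeat chatter out of the log.
    set_level("org.apache.hadoop.hdfs.server.namenode.FSNamesystem", "DEBUG")
```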

I need to get an outage on the cluster so that I can strace the namenode process on startup to see if that sheds any light on things.

*****************************************
Update
*****************************************

When the tasktrackers restart, they clean up old files (including the distributed cache). This can take a long time (tens of minutes), and it turned out to be the cause of the issue.
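
For next time, it's worth checking how much is sitting in the tasktracker local directories before a restart, to get a feel for how long the cleanup will take. A quick sketch, where the directory list is an assumption standing in for whatever mapred.local.dir is set to:

```python
#!/usr/bin/env python
# Run on a tasktracker before a restart: report how many files and how many
# GB are sitting in the local dirs (distributed cache included) waiting to
# be cleaned up on startup.
import os

MAPRED_LOCAL_DIRS = ["/data/1/mapred/local", "/data/2/mapred/local"]  # assumed paths

def dir_stats(root):
    files, total_bytes = 0, 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            try:
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
                files += 1
            except OSError:
                pass  # file disappeared mid-walk
    return files, total_bytes

if __name__ == "__main__":
    for root in MAPRED_LOCAL_DIRS:
        files, size = dir_stats(root)
        print("%s: %d files, %.1f GB" % (root, files, size / 1e9))
```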
