EMR master instance is not reachable
I hit a rare issue today: my EMR cluster was not resizing.
Apparently, the master node was having trouble communicating with the EMR service.
The cluster status showed "Master instance is not reachable. Please check your master instance status". But I was able to log in to the master node and use Hadoop and HDFS without any issues. The only problem was that the cluster would not resize.
Searching the internet turned up just one relevant result:
https://forums.aws.amazon.com/thread.jspa?messageID=695687
Although it mentioned that the issue was with the EMR 3.x AMIs and my cluster was on 4.x, I suspected this was the same issue. To verify, I checked whether the instance-controller service was running on my other clusters.
$ ps awux | grep instancecontroller
hadoop 2389 0.4 2.0 3584132 309980 ? Sl Apr13 97:41 /etc/alternatives/jre/bin/java -Xmx1024m -XX:OnOutOfMemoryError=kill -9 %p -XX:MinHeapFreeRatio=10 -server -cp /usr/share/aws/emr/instance-controller/lib/*:/home/hadoop/conf -Dlog4j.defaultInitOverride aws157.instancecontroller.Main
hadoop 18522 0.0 0.0 110460 2080 pts/2 S+ 19:08 0:00 grep --color=auto instancecontroller
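Note that the second line of that output is the grep command matching itself, so a plain grep always finds at least one line. As a minimal sketch, the check can be wrapped in a small helper that skips the grep line. The function name `count_instance_controller` is made up for this example and is not part of EMR.

```shell
# Hypothetical helper (not part of EMR): reads `ps awux`-style lines on stdin
# and prints how many of them mention instancecontroller, ignoring the grep
# process that shows up in its own search results.
count_instance_controller() {
  grep 'instancecontroller' | grep -vc 'grep'
}

# Usage on a master node:
#   ps awux | count_instance_controller
# A result of 0 would suggest the instance controller is not running.
```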
So I tried starting it by hand with the following command. I left out the -XX:OnOutOfMemoryError option because its "kill -9 %p" value was giving an error.
$ /etc/alternatives/jre/bin/java -Xmx1024m -XX:MinHeapFreeRatio=10 -server -cp /usr/share/aws/emr/instance-controller/lib/*:/home/hadoop/conf -Dlog4j.defltInitOverride aws157.instancecontroller.Main &
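A likely explanation for the -XX error (an assumption, it was not confirmed on the cluster): ps prints the command line with the original quoting stripped, so when the line is pasted back into a shell, the option value "kill -9 %p" is split at the spaces into separate arguments. The sketch below uses a throwaway function, `count_args`, to show how the shell splits the option with and without quotes.

```shell
# Throwaway helper for illustration: prints how many arguments it received.
count_args() {
  echo "$#"
}

# Unquoted, as copied from the ps output: the shell splits on the spaces,
# so java would receive "-9" and "%p" as separate arguments.
count_args -XX:OnOutOfMemoryError=kill -9 %p      # prints 3

# Quoted: the whole option reaches the program as a single argument.
count_args '-XX:OnOutOfMemoryError=kill -9 %p'    # prints 1
```

If that is the cause, wrapping the whole option in single quotes should let the original command line run unchanged.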
And the status on the EMR console went back to OK. All good so far.
Later, though, I figured out that downsizing still didn't work. I terminated all the TASK nodes manually, and the CORE nodes too (always keep data in S3 instead of HDFS). HDFS is now showing missing blocks, but who cares.
I tried adding a TASK node and it worked.
Comments
There is an error in the command above. Please change defltInitOverride to defaultInitOverride.
sudo /etc/init.d/instance-controller start
That init script contained the correct command to run for me.