Hi BDE users,
Here are some tips for quickly debugging the 'Bootstrap Failed' error.
When creating/starting/configuring a cluster and some nodes show 'Bootstrap Failed', run the following on a failed node to find out the reason:
sudo cat /var/chef/cache/chef-stacktrace.out
Note: if both master nodes (e.g. zookeeper, namenode, jobtracker, hbase_master) and non-master nodes failed, checking the file on the master node is enough.
This will show the error log of 'chef-client' process started by Serengeti.
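If you are scanning many stack traces, a small helper like this (a sketch; based on the sample output below, the exception class and message appear on the line right after "Generated at ...") pulls out just the exception line:

```shell
# Print the exception line from a chef stack trace file. The file
# starts with "Generated at ...", so the exception is on line 2.
first_chef_error() {
  sed -n '2p' "$1"
}

# On a failed node (guarded so this is a no-op elsewhere):
if [ -f /var/chef/cache/chef-stacktrace.out ]; then
  first_chef_error /var/chef/cache/chef-stacktrace.out
fi
```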
Here are some typical error logs we have seen:
1) Generated at 2013-05-26 23:53:44 -0400
Errno::EHOSTUNREACH: remote_file[/etc/yum.repos.d/cloudera-cdh4.repo] (hadoop_common::add_repo line 45) had an error: Errno::EHOSTUNREACH: No route to host - connect(2)
/usr/lib/ruby/1.9.1/net/http.rb:644:in `block in connect'
This happens when the yum server is not available. Ensure the yum server is running and that the yum repo URL http://.../cloudera-cdh4.repo can be reached.
If the yum repo you added via config-distro.rb is on the BDE server, specify https://fqdn_of_bde_server/.../cloudera-cdh4.repo for the --repos parameter of config-distro.rb, and the baseurl in cloudera-cdh4.repo should also start with https://fqdn_of_bde_server/; otherwise you will see an SSL certificate error when chef-client runs during bootstrapping. If the yum repo is not on the BDE server, specify http://fqdn_or_ip/.../cloudera-cdh4.repo instead, unless the Apache web server on that host redirects http to https.
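The two checks above (is the repo reachable, and does the baseurl use the right scheme) can be sketched as small helpers; the usage paths below are the standard yum locations, and the URL is a placeholder:

```shell
# Print every baseurl in a .repo file so you can confirm the scheme
# matches where the repo lives (https://fqdn_of_bde_server/... when it
# is on the BDE server, plain http:// otherwise).
repo_baseurls() {
  grep -h '^baseurl=' "$1" | cut -d= -f2-
}

# Quick reachability check for the repo URL itself
# (-s silent, -f fail on HTTP errors, -I headers only).
check_repo_url() {
  curl -sfI "$1" > /dev/null && echo "reachable: $1" || echo "NOT reachable: $1"
}

# Usage on a cluster node:
# repo_baseurls /etc/yum.repos.d/cloudera-cdh4.repo
# check_repo_url "https://fqdn_of_bde_server/.../cloudera-cdh4.repo"
```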
2) ERROR: package[hadoop] (/var/chef/cache/cookbooks/hadoop_cluster/libraries/hadoop_cluster.rb:329:in `block in hadoop_package') had an error:
package[hadoop] (hadoop_cluster::default line 329) had an error: Chef::Exceptions::Exec: returned 1, expected 0
This means 'yum install hadoop' failed. The root cause is usually that the rpm or its dependent rpms are missing from the yum server (which indicates a problem with the ova or the code), or that the yum server was not created correctly (recreate the yum server in that case).
You can run 'sudo yum install hadoop' on the node to get the detailed error message.
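On the yum server side, a quick sanity check is whether the rpm (and any dependency you suspect) was actually published into the repo directory; this is a sketch, and the directory argument is an assumption you should point at wherever your repo's rpms live:

```shell
# Return success if any .rpm in the given repo directory matches the
# given package name pattern (directory path is hypothetical).
repo_has_rpm() {
  ls "$2"/*.rpm 2>/dev/null | grep -q "$1"
}

# On the failed node, refreshing metadata before retrying also helps:
# sudo yum clean all && sudo yum makecache && sudo yum install hadoop
```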
3) ERROR: service[start-hadoop-hdfs-datanode] (/var/chef/cache/cookbooks/hadoop_cluster/recipes/datanode.rb:43:in `from_file') had an error:
10.136.29.45 service[start-hadoop-hdfs-datanode] (hadoop_cluster::datanode line 43) had an error: Chef::Exceptions::Exec: /sbin/service hadoop-hdfs-datanode start returned 1, expected 0
This means 'sudo service hadoop-hdfs-datanode start' failed. We need to check the logs in /var/log/hadoop/ to find out why the datanode service can't start.
The same approach applies to other similar errors of the form ERROR: service[start-<service-name>].
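The log check above can be sketched as a one-liner that surfaces recent FATAL/ERROR lines without reading whole files:

```shell
# Print the last 20 FATAL/ERROR lines from all .log files in a
# service's log directory.
recent_service_errors() {
  grep -hE 'FATAL|ERROR' "$1"/*.log 2>/dev/null | tail -n 20
}

# On the failed node (directory from the tip above):
# recent_service_errors /var/log/hadoop
```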
4) ERROR: Net::HTTPServerException: 401 "Unauthorized"
This means the clocks on your nodes and the Serengeti server are not synchronized. In the vSphere Client, configure all ESXi hosts to synchronize their clocks via NTP. The clock difference across all hosts should be less than 20 seconds. Once the setting is applied, it takes several minutes for all the VMs to pick up the new clock. Then you can run 'cluster ... --resume'.
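To verify the skew directly, you can compare epoch timestamps from a node and the Serengeti server; this is a sketch, and the server hostname in the usage line is a placeholder:

```shell
# Absolute clock skew in seconds between two epoch timestamps;
# per the tip above it should stay under 20 seconds.
clock_skew() {
  echo $(( $1 > $2 ? $1 - $2 : $2 - $1 ))
}

# Compare this node against the Serengeti server (requires ssh access;
# 'serengeti_server' is a placeholder hostname):
# clock_skew "$(date +%s)" "$(ssh serengeti_server date +%s)"
```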
5) For other errors, run 'sudo chef-client' on the failed node, and send us its output plus all files in /opt/serengeti/logs/ on the BDE Management Server for debugging.