EdSp
Enthusiast

Serengeti cluster creation error: failed to get IP address of vm ...

Hi all,

I have installed and deployed the Big Data Extensions and am trying to create a simple and small (compute-only, PivotalHD) cluster. All seems to go well; a compute master node and a worker node are cloned from the template VM.

However, in the serengeti log on the management server I can see that the nodes don't get the IP address that was correctly selected for them from the range defined in the BDE.

The log shows that an IP is selected for each node but in the end the cluster creation fails:

2014 Jun 12 11:46:39,638+0000 ERROR SimpleAsyncTaskExecutor-1| com.vmware.bdd.utils.VcVmUtil: Failed to get ip address of VM PHDcluster1-ComputeMaster-0


The template VM has DHCP set for eth0, exactly as described in step 3b of the CentOS installation chapter in the BDE Admin Guide.
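(For reference, the 3-line DHCP setup in the template's ifcfg-eth0 is along these lines; the exact contents are as per the guide:)

DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes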

I have already rebuilt the environment, step by step, exactly according to the guide, but with no luck.

Any help appreciated!

Ed

jessehuvmw
Enthusiast

I think you are using a customized template created by following the VMware vSphere Big Data Extensions documentation.

Did you follow step 9?

9. Remove the /etc/udev/rules.d/70-persistent-net.rules file to prevent the eth number from increasing during the clone operation.

If you do not remove this file, virtual machines cloned from the template cannot get IP addresses. If you power on the Hadoop Template virtual machine to make changes, remove this file before shutting down this virtual machine.
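In practice that step boils down to something like the following, run inside the powered-on Hadoop Template VM just before shutting it down (a sketch, not a verbatim quote from the guide):

rm -f /etc/udev/rules.d/70-persistent-net.rules   # remove the udev rule so clones re-enumerate eth0 cleanly
shutdown -h now                                   # power off without booting the template again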

Cheers, Jesse Hu
admin
Immortal

Hi Ed,

     Could you please provide some more details so we can figure out the root cause? For example:

     1) Inside deployed VMs, does "ifconfig -a" show the expected IP addresses?

     2) Is the IP shown correctly in the vSphere Client? In the "Summary" view of the deployed VM, what is the status of "VMware Tools" and "IP Addresses"?

     3) What is your BDE version, 1.1.0 or 1.0.0? If it is 1.0.0, you MUST delete /etc/udev/rules.d/70-persistent-net.rules as Jesse mentioned.

     4) What is the vCenter server's version as well as build number? You can get this info by clicking "Help" --> "About VMware vSphere" on vSphere client.

Thanks,

-bxd

EdSp
Enthusiast

Hi guys,

Thanks for your suggestions. I did another test yesterday, where I swapped the customized Hadoop template VM out and moved the default Hadoop-template VM back in again. I then created a compute-only Apache cluster with 1 master and 1 worker node. The range of IPs is defined in the BDE as .64 to .80. To my surprise, the master got an IP of .61 and the worker did not get an IP.

(Only in previous deployments did I use a free range of .61 to .80.) Looking in the serengeti.log file on the management server, I can see that it *does* allocate the first 2 IPs (i.e. .64 and .65 for the 2 nodes to be cloned), and that is confirmed by checking the network resource of the cluster, which (after provisioning) shows that the free range now starts with .66.

Let me respond to your questions/suggestions inline below:

1) Inside deployed VMs, does "ifconfig -a" show the expected IP addresses?

On the master node, it shows the IPv4 address (inet addr). On the worker node, there is no IPv4 address, only the usual IPv6 address. The adapter is 'up' though and has the correct number (eth0). The ifcfg-eth0 file on the template VM shows the 3-line DHCP setup as in the BDE Admin Guide.

2) Is the IP shown correctly in the vSphere Client? In the "Summary" view of the deployed VM, what is the status of "VMware Tools" and "IP Addresses"?

For the master node: VMware Tools: Running (Current), IP Addresses: <IPv4 ending .61>

For the worker node: VMware Tools: Running (Current), IP Addresses: <IPv6 value>

3) What is your BDE version, 1.1.0 or 1.0.0? If it is 1.0.0, you MUST delete /etc/udev/rules.d/70-persistent-net.rules as Jesse mentioned.

The BDE version is 1.1.0 which is displayed in the BDE Home screen in the vSphere Web Client.

I am aware of the possible issue due to the udev 70-persistent-net rule and always check that that file is removed before powering off the template VM. I then also double-check that there are no snapshots left on the Hadoop template VM.

4) What is the vCenter server's version as well as build number? You can get this info by clicking "Help" --> "About VMware vSphere" on vSphere client.

The vCenter Server is v5.5.0, build 1476327.

Given the issue of getting the .61 IP, which is outside the defined range of free IPs: if I were to go ahead and reinstall BDE, what would be the best procedure to make the environment 'clean'? I did a re-install earlier this week by going to the URL https://<management server IP>:8443/register-plugin/ and using the radio button for Uninstall. How can I best verify it's all removed?

Apologies for all the questions! Hope it all makes some sense and we can get this cluster working.

Thanks again,

Ed

admin
Immortal

Hi Ed,

    This problem is a little weird; I do not know exactly what happened, so let me try to guess...

   1) "I swapped the customized Hadoop template VM out and moved the default Hadoop-template VM back in again"

      After moving it back, you need to restart your BDE server's Tomcat service (log in to the BDE server and run "sudo /etc/init.d/tomcat restart"); otherwise it will still use the former customized template VM for deployment.

   2) "The range of IPs is defined in the BDE as .64 to .80. To my surprise, the master got an IP of .61. Looking in the serengeti.log file on the management server, I can see that it *does* allocate the first 2 IPs."

     In my opinion this is impossible, except in one very special case: a .61 IP was configured on the customized template VM during customization, and the customized template VM is not a CentOS 6.x OS, so our agent scripts do not work on it. In that case, when the template VM is cloned, all deployed VMs try to fetch the .61 IP; one succeeds and all the others cannot get any IP.

  3) On the master node, it shows the IPv4 address (inet addr). On the worker node, there is no IPv4, only the usual IPv6 address.

    For BDE 1.1.0 we disabled the IPv6 module on the template VM, so if you are using the default template VM there should be no IPv6 addresses on deployed nodes...

  4) What would be the best procedure to make the environment ‘clean’?

   To make the current env clean:

   a) delete all clusters deployed by this BDE server;

    b) delete all networks;

    c) make sure the default template VM is inside the vApp with BDE server, and delete all its snapshots.

    d) restart tomcat service on BDE server.

    e) add network.

    f) create cluster.

   If you want to reinstall a fresh BDE, just shut down the BDE vApp, delete it, and redeploy a new one.
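   As a rough sketch only, steps a), b), e) and f) map to Serengeti CLI commands along the following lines; the cluster/network names, port group and IP range are examples taken from this thread, the mask/gateway/DNS values are placeholders, and the exact option spellings may differ per release, so please check the CLI help. Steps c) and d) are done in the vSphere Client and on the BDE server shell.

   cluster delete --name PHDcluster1              # a) repeat for every deployed cluster
   network delete --name defaultNetwork           # b) repeat for every network
   network add --name defaultNetwork --portGroup pg1 --ip 10.103.1.64-10.103.1.80 --mask 255.255.255.0 --gateway 10.103.1.1 --dns 10.103.1.2    # e)
   cluster create --name PHDcluster1 --distro apache    # f) plus whatever spec options your cluster needs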

  5) Furthermore, could you please attach some logs? e.g.:

   a) /opt/serengeti/logs/serengeti.log on BDE server;

   b) /var/log/boot.log on deployed nodes;

   c) /opt/serengeti/logs/serenget-boot.log on deployed nodes, if it exists;

   d) ifcfg-ethX on deployed nodes.

  6) If possible, please also CC your question to my email (xiaodingbian@vmware.com), so I can reply more promptly.

Thanks,

-bxd

EdSp
Enthusiast

Hi,

Thanks for your feedback, it is starting to make a bit of sense now...

  1) "I swapped the customized Hadoop template VM out and moved the default Hadoop-template VM back in again"

      After moving it back, you need to restart your BDE server's Tomcat service (log in to the BDE server and run "sudo /etc/init.d/tomcat restart"); otherwise it will still use the former customized template VM for deployment.

Ed: This could be the case! I cannot recall explicitly restarting Tomcat… I will test this shortly, after making the environment clean as shown in point 4) below.

   2) "The range of IPs is defined in the BDE as .64 to .80. To my surprise, the master got an IP of .61. Looking in the serengeti.log file on the management server, I can see that it *does* allocate the first 2 IPs."

     In my opinion this is impossible, except in one very special case: a .61 IP was configured on the customized template VM during customization, and the customized template VM is not a CentOS 6.x OS, so our agent scripts do not work on it. In that case, when the template VM is cloned, all deployed VMs try to fetch the .61 IP; one succeeds and all the others cannot get any IP.

Ed: Not 100% sure I fully understand how this could have happened, but I did indeed use address .61 for customizing the template VM!

  3) On the master node, it shows the IPv4 address (inet addr). On the worker node, there is no IPv4, only the usual IPv6 address.

    For BDE 1.1.0 we disabled the IPv6 module on the template VM, so if you are using the default template VM there should be no IPv6 addresses on deployed nodes...

Ed: Could this be due to point 1) above…


  4) What would be the best procedure to make the environment ‘clean’?

   To make the current env clean:

   a) delete all clusters deployed by this BDE server;

    b) delete all networks;

    c) make sure the default template VM is inside the vApp with BDE server, and delete all its snapshots.

    d) restart tomcat service on BDE server.

    e) add network.

    f) create cluster.

   If you want to reinstall a fresh BDE, just shut down the BDE vApp, delete it, and redeploy a new one.

  5) Furthermore, could you please attach some logs? e.g.:

   a) /opt/serengeti/logs/serengeti.log on BDE server;

   b) /var/log/boot.log on deployed nodes;

   c) /opt/serengeti/logs/serenget-boot.log on deployed nodes, if it exists;

   d) ifcfg-ethX on deployed nodes.

Ed: I will send you a, b and d by email. Regarding c), the directory /opt/serengeti does not exist on the nodes.


  6) If possible, please also CC your question to my email(xiaodingbian@vmware.com), then I can reply more timely.

Ed: Ok


I need to do a new test and pay particular attention to your points in 1 and 2 above. I will make sure the environment is cleaned as described above in 4.

Once that test is done, I will report back on the status.

Thanks again for your great feedback, it is much appreciated.

Regards,

Ed

EdSp
Enthusiast

Hi guys,

Some progress has been made. Deploying Apache with the default hadoop-template VM is now working well. This suggests that after creating a custom CentOS VM and swapping the hadoop-template VM back in, I previously may not have restarted Tomcat on the management server, as was suggested in point 1) earlier.

Trying to deploy a Pivotal HD cluster using the CentOS 6.4 VM still doesn't get an IPv4 address, but I have noticed a possible issue in the output of the OS customization installer script (BDE Admin and User's Guide, pg 36, step 8d).

At the end of the output it has:

:

:

warning: chef-11.8.0-1.el6.x86_64.rpm: Header V4 DSA/SHA1 Signature, key ID 83ef826a: NOKEY

Preparing... ########################################### [100%]

      1:chef ########################################### [100%]

Thank you for installing Chef!

ERROR:  Could not find a valid gem 'ruby-shadow' (= 2.1.4) in any repository

ERROR:  Possible alternatives: ruby-shadow

[root@localhost tmp]#


Could this have to do with 'chef'? I didn't see any issues before that point in the output of the customization installer script.

What can I do to resolve this?

Many thanks,

Ed

admin
Immortal

If Tomcat was not restarted and the agents were not installed successfully, then I can understand why this weird issue happened:

"The range of IPs is defined in the BDE as .64 to .80. To my surprise, the master got an IP of .61. Looking in the serengeti.log file on the management server, I can see that it *does* allocate the first 2 IPs."

I have CC'ed Chiqing; he and Jesse have more experience with Chef issues.

EdSp
Enthusiast

Many thanks for your fast responses Bian. Much appreciated!

Regards,

Ed

jessehuvmw
Enthusiast

This error should be the root cause of your issue. Because installer.sh for customizing the template exits on any error, the rest of the scripts in installer.sh were not executed. I can run "gem install ruby-shadow -v 2.1.4" fine on my machine, so is it possible your template VM can't connect to the rubygems server to download the gem file? You can try running that gem command again, or run "wget https://rubygems.org/gems/ruby-shadow/versions/2.1.4" to check whether the internet connection is OK.


  ERROR:  Could not find a valid gem 'ruby-shadow' (= 2.1.4) in any repository

  ERROR:  Possible alternatives: ruby-shadow
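Concretely, something along these lines inside the template VM would verify connectivity and retry the failed gem (just a sketch):

wget https://rubygems.org/gems/ruby-shadow/versions/2.1.4   # check that the template VM can reach rubygems.org
gem install ruby-shadow -v 2.1.4                            # retry the gem that installer.sh failed on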


BTW, BDE 2.0 was released last week (see the VMware vSphere Big Data Extensions Documentation). In BDE 2.0, the Hadoop Template virtual machine now uses CentOS 6.4 as its default operating system. This provides an increase in performance, as well as support for all Hadoop distributions for use with Big Data Extensions, so it can save you the time of creating the customized CentOS 6.4 template. Hope you enjoy it.

Cheers, Jesse Hu
EdSp
Enthusiast

Thanks Jesse,

Just to keep this thread up-to-date...

Yes, that was the issue, and the installer script had stopped at that point. We fixed the wget issue and then ran the installer script again, from that point on.

This resulted in 1 warning message, I think from the gem install command. Is this something to worry about?

[root@localhost tmp]# ./inst2.sh /usr/java/default/jdk1.6.0_31

install chef client and ruby shadow and dependencies

WARNING:  Error fetching data: SocketError: getaddrinfo: Name or service not known (http://rubygems.org/specs.4.8.gz)

Building native extensions.  This could take a while...

Successfully installed ruby-shadow-2.1.4

1 gem installed

Loaded plugins: fastestmirror, security

Loading mirror speeds from cached hostfile

* base: ftp.heanet.ie

* extras: centosn4.centos.org

* updates: centosq3.centos.org

:

:

Complete!

Changing password for user serengeti.

passwd: all authentication tokens updated successfully.

I created a PHD cluster anyway. It correctly gets IPv4 addresses (as the chef client was now installed on the VM template), but bootstrapping the master node VM fails.

The file serengeti.log on the management server shows:

2014 Jun 16 15:11:56,204+0000 ERROR SimpleAsyncTaskExecutor-1| com.vmware.bdd.service.job.software.ironfan.IronfanSoftwareManagementTask: command execution failed. Bootstrapping VM failed.

The file ironfan.log on the management server shows:

I, [2014-06-16T14:56:50.550688 #32545]  INFO -- : Starting software management server on 9090...

I, [2014-06-16T15:10:44.502956 #32545]  INFO -- : ============= Invoking Ironfan Knife CLI =============

I, [2014-06-16T15:10:44.503043 #32545]  INFO -- : knife cluster create PHDcl3 -f /opt/serengeti/logs/task/9/1/PHDcl3.json --yes --bootstrap -V

[2014-06-16T15:10:45+00:00] INFO: HTTP Request Returned 404 Object Not Found: Cannot load data bag item PivotalHD for data bag hadoop_distros   

[2014-06-16T15:10:45+00:00] INFO: HTTP Request Returned 404 Object Not Found: Cannot load data bag item PivotalHD for data bag hadoop_distros

[2014-06-16T15:10:45+00:00] INFO: Inventorying servers in PHDcl3 cluster, all facets, all servers

Finding relevant servers to create:

  +------------------------+------------+----------+--------+-------+-------------+-------------+------------+-------------+

  | Name                   | InstanceID | State    | Flavor | Image | Public IP   | Private IP  | Created At | Launchable? |

  +------------------------+------------+----------+--------+-------+-------------+-------------+------------+-------------+

  | PHDcl3-Worker-0        |            | VM Ready |        |       | 10.103.1.64 | 10.103.1.64 |            | -           |

  | PHDcl3-ComputeMaster-0 |            | VM Ready |        |       | 10.103.1.65 | 10.103.1.65 |            | -           |

  +------------------------+------------+----------+--------+-------+-------------+-------------+------------+-------------+

10.103.1.65 [2014-06-16T16:12:31+01:00] INFO: Set Bootstrap Action to ''

10.103.1.65 [2014-06-16T16:12:31+01:00] INFO: mounting data disk /dev/disk/by-id/scsi-36000c29f10abcee2c7b23d8c761c64f5-part1 at /mnt/scsi-36000c29f10abcee2c7b23d8c761c64f5-part1

10.103.1.65 [2014-06-16T16:12:31+01:00] WARN: Cloning resource attributes for directory[/etc/hadoop/conf] from prior resource (CHEF-3694)

10.103.1.65 [2014-06-16T16:12:31+01:00] WARN: Previous directory[/etc/hadoop/conf]: /var/chef/cache/cookbooks/hadoop_cluster/recipes/hadoop_conf_xml.rb:40:in `from_file'

10.103.1.65 [2014-06-16T16:12:31+01:00] WARN: Current  directory[/etc/hadoop/conf]: /var/chef/cache/cookbooks/hadoop_common/libraries/default.rb:25:in `force_link'

10.103.1.65 [2014-06-16T16:12:31+01:00] WARN: Cloning resource attributes for directory[/var/log/hadoop] from prior resource (CHEF-3694)::

10.103.1.65 [2014-06-16T16:12:31+01:00] WARN: Previous execute[Make sure hadoop owns hadoop dirs]: /var/chef/cache/cookbooks/hadoop_cluster/libraries/hadoop_cluster.rb:382:in `ensure_hadoop_owns_hadoop_dirs'

10.103.1.65 [2014-06-16T16:12:31+01:00] WARN: Current  execute[Make sure hadoop owns hadoop dirs]: /var/chef/cache/cookbooks/hadoop_cluster/libraries/hadoop_cluster.rb:382:in `ensure_hadoop_owns_hadoop_dirs'

10.103.1.65 package hadoop-yarn-resourcemanager is not installed

10.103.1.65 [2014-06-16T16:12:31+01:00] WARN: Cloning resource attributes for execute[fix bug: service status always returns 0] from prior resource (CHEF-3694)

10.103.1.65 [2014-06-16T16:12:31+01:00] WARN: Previous execute[fix bug: service status always returns 0]: /var/chef/cache/cookbooks/hadoop_cluster/libraries/hadoop_cluster.rb:355:in `hadoop_package'

10.103.1.65 [2014-06-16T16:12:31+01:00] WARN: Current  execute[fix bug: service status always returns 0]: /var/chef/cache/cookbooks/hadoop_cluster/libraries/hadoop_cluster.rb:355:in `hadoop_package'

10.103.1.65 [2014-06-16T16:12:31+01:00] WARN: Cloning resource attributes for ruby_block[hadoop-yarn-resourcemanager] from prior resource (CHEF-3694)

10.103.1.65 [2014-06-16T16:12:31+01:00] WARN: Previous ruby_block[hadoop-yarn-resourcemanager]: /var/chef/cache/cookbooks/hadoop_common/libraries/chef_monitor.rb:28:in `set_bootstrap_action'

10.103.1.65 [2014-06-16T16:12:31+01:00] WARN: Current ruby_block[hadoop-yarn-resourcemanager]: /var/chef/cache/cookbooks/hadoop_common/libraries/chef_monitor.rb:28:in `set_bootstrap_action'

10.103.1.65 package hadoop-mapreduce-historyserver is not installed

10.103.1.65 [2014-06-16T16:12:31+01:00] WARN: Cloning resource attributes for execute[fix bug: service status always returns 0] from prior resource (CHEF-3694)

10.103.1.65 [2014-06-16T16:12:31+01:00] WARN: Previous execute[fix bug: service status always returns 0]: /var/chef/cache/cookbooks/hadoop_cluster/libraries/hadoop_cluster.rb:355:in `hadoop_package'

10.103.1.65 [2014-06-16T16:12:31+01:00] WARN: Current  execute[fix bug: service status always returns 0]: /var/chef/cache/cookbooks/hadoop_cluster/libraries/hadoop_cluster.rb:355:in `hadoop_package'

10.103.1.65 Converging 100 resources

:

:

:

:

D, [2014-06-16T15:11:55.203317 #32545] DEBUG -- : progress: <Software::Mgmt::Thrift::OperationStatus finished:false, succeed:false, progress:55, error_msg:"", total:2, success:0, failure:1, running:1>

Bootstrapping cluster PHDcl3 completed with exit status [0, 1] at 2014-06-16 15:11:55 +0000

Creating cluster PHDcl3 completed.

  PHDcl3-Worker-0:      Syncing to cloud

  PHDcl3-ComputeMaster-0: Syncing to cloud

Finished! Current state:

+------------------------+------------+------------------+--------+-------+-------------+-------------+------------+-------------+ 

| Name                   | InstanceID | State            | Flavor | Image | Public IP   | Private IP  | Created At | Launchable? |

+------------------------+------------+------------------+--------+-------+-------------+-------------+------------+-------------+ 

| PHDcl3-Worker-0        |            | Service Ready    |        |       | 10.103.1.64 | 10.103.1.64 |            | -           | 

| PHDcl3-ComputeMaster-0 |            | Bootstrap Failed |        |       | 10.103.1.65 | 10.103.1.65 |            | -           |

+------------------------+------------+------------------+--------+-------+-------------+-------------+------------+-------------+

I, [2014-06-16T15:11:56.132226 #32545]  INFO -- : ============= Ironfan Knife CLI exited with status code 3 =============

D, [2014-06-16T15:11:56.133792 #32545] DEBUG -- : get operation progress for cluster PHDcl3 ...

D, [2014-06-16T15:11:56.195506 #32545] DEBUG -- : progress: <Software::Mgmt::Thrift::OperationStatus finished:true, succeed:false, progress:100, error_msg:"Bootstrapping VM failed.", total:2, success:1, failure:1, running:0>

and the history server logfile shows:

([master node] /var/log/gphd/hadoop-mapreduce/mapred-mapred-historyserver-10.103.1.65.log):

:STARTUP_MSG:   build = file:///var/ci/jenkins/workspace/HudsonHD2_0_1_0HadoopStackPackagingBuild_ongoing_sprint_release/orchard_build/build/hadoop/rpm/BUILD/hadoop-2.0.2-alpha-gphd-2.0.1.0-src/hadoop-common-project/hadoop-common -r Unknown; compiled by 'hadoop' on Tue Jun 25 13:31:15 CST 2013

************************************************************/

2014-06-16 16:12:54,900 INFO  hs.JobHistory (JobHistory.java:init(75)) - JobHistory Init

2014-06-16 16:12:55,731 FATAL hs.JobHistoryServer (JobHistoryServer.java:main(141)) - Error starting JobHistoryServer

org.apache.hadoop.yarn.YarnException: Error creating done directory: [hdfs://SiteA-1G.isilon.spoc:8020/tmp/hadoop-yarn/staging/history/done]

        at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.init(HistoryFileManager.java:421)

        at org.apache.hadoop.mapreduce.v2.hs.JobHistory.init(JobHistory.java:87)

        at org.apache.hadoop.yarn.service.CompositeService.init(CompositeService.java:58)

        at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.init(JobHistoryServer.java:83)

        at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.main(JobHistoryServer.java:138)

Caused by: org.apache.hadoop.ipc.RemoteException(java.lang.SecurityException): No such username! Make sure your client's local username exists on the cluster.

        at org.apache.hadoop.ipc.Client.call(Client.java:1228)

        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)

        at $Proxy10.getFileInfo(Unknown Source)

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

:

:

So still a few issues, but there is progress. Will try to update the thread after fixing them.

Ed


jessehuvmw
Enthusiast

Glad to hear the customized CentOS 6.4 template works. The warning about the rubygems server is OK since the gem was installed correctly.

The error starting the JobHistoryServer might be caused by incorrect HDFS directory permissions:


org.apache.hadoop.yarn.YarnException: Error creating done directory: [hdfs://SiteA-1G.isilon.spoc:8020/tmp/hadoop-yarn/staging/history/done]

Could you please run 'hadoop fs -ls -R /tmp' on the JobHistoryServer node and paste the output? Here is mine:


hadoop fs -ls -R /tmp

drwxrwxrwx   - hdfs hadoop          0 2014-06-16 03:43 /tmp/hadoop-mapred

drwxrwxrwx   - hdfs hadoop          0 2014-06-16 03:43 /tmp/hadoop-mapred/mapred

drwxrwxrwx   - hdfs hadoop          0 2014-06-16 03:43 /tmp/hadoop-yarn

drwxrwxrwt   - mapred mapred          0 2014-06-16 03:43 /tmp/hadoop-yarn/staging

drwxr-xr-x   - mapred mapred          0 2014-06-16 03:43 /tmp/hadoop-yarn/staging/history

drwxrwx---   - mapred mapred          0 2014-06-16 03:43 /tmp/hadoop-yarn/staging/history/done

drwxrwxrwt   - mapred mapred          0 2014-06-16 03:43 /tmp/hadoop-yarn/staging/history/done_intermediate

If the owner of /tmp/hadoop-yarn/staging is not mapred, please run 'hadoop fs -chown -R mapred:mapred  /tmp/hadoop-yarn/staging' to correct it.

Cheers, Jesse Hu
jessehuvmw
Enthusiast

BTW, when deploying a PHD cluster, please create at least 2 NodeManager nodes (each with at least 4 GB of memory). With only 1 worker node, or with too little memory, jobs run on the PHD cluster will hang.

Cheers, Jesse Hu
EdSp
Enthusiast

Hi Jesse,

What you suggested makes a lot of sense; I will include it in my notes document. Unfortunately I had to abandon my BDE 1.1 environment and set up a new BDE 2.0 environment; BDE 1.1 was starting to take too much time to get cluster creation running successfully. With the release of BDE 2.0, cluster creation (for non-Apache clusters) should be a bit easier to set up. I have just added the PHD yum repo 🙂

After creating the first cluster, I get the following error, for each of the nodes:

[2014-06-18T10:47:26.382+0000] Cannot bootstrap node PHDcl1-ComputeMaster-0.

remote_file[/etc/yum.repos.d/phd.repo] (hadoop_common::add_repo line 85) had an error: OpenSSL::SSL::SSLError: SSL_connect returned=1 errno=0 state=SSLv2/v3 read server hello A: unknown protocol

SSH to this node and view the log file /var/chef/cache/chef-stacktrace.out, or run the

From some googling around: does this have to do with IP-to-hostname resolution by DNS? Can I put a hostname in the /etc/hosts file? If so, which IPs/hostnames should go in there? I think just the management server?

Currently the management server doesn't seem to have a 'normal' hostname:

[root@10 1]# cat /etc/hostname

localhost.localdom

[root@10 1]# hostname

10.103.1.60

Should that be changed?

Please find attached the output of running "sudo chef-client" on the master node.

Many thanks again for your feedback. Much appreciated!

Ed

EdSp
Enthusiast

There is a paragraph in the BDE 2.0 Admin Guide, Hadoop Distribution Deployment Types, which contains the following text:

When creating clusters using Hadoop distributions based on Hadoop 2.0 and later, the DNS server in your network must provide forward and reverse FQDN/IP resolution. Without valid DNS and FQDN settings, the cluster creation process might fail, or the cluster is created but does not function. Hadoop distributions based on Hadoop 2.0 and later include Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, Intel 3.x, and Pivotal PHD 1.1 and later releases.

Does that have to do with the SSL Validation Failure I described in my previous reply?

P.S. Creation of an Apache cluster is successful; this error occurs only when I create a PHD cluster.

Ed

MichaelWest
VMware Employee


Ed,

Without digging through every bit of detail in this thread, I am assuming that you created your Yum repository on the BDE management server. You will notice, if you check the httpd.conf file, that all http traffic is redirected to https. That is by design. If you do not have an FQDN defined for the management server, you can get SSL validation errors during bootstrap. Here are the workarounds:

1) Provide FQDN for management server

2) OR modify your template VM: add an entry to /etc/hosts to manually associate a hostname with the management server IP, and set the hostname on the management server (see the sketch after this list).

3) OR  Move your Yum repository to a separate VM from the management server and make sure to use http when connecting.
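For workaround 2), a rough sketch (bde-mgmt.example.local is just an example name; 10.103.1.60 is the management server IP from this thread):

# on the management server: give it a resolvable hostname
hostname bde-mgmt.example.local
# persist it in whichever file your OS uses (/etc/hostname or /etc/sysconfig/network)

# in the Hadoop Template VM: map that name to the management server IP
echo "10.103.1.60  bde-mgmt.example.local" >> /etc/hosts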

I have not tried these with Pivotal, but they resolved the same problem with a Cloudera Yum install.

jessehuvmw
Enthusiast

Hi Ed,

From the output of chef-client, I found that you are using the URL http://10.103.1.60/phd/1/phd.repo as the PHD yum repo, and 10.103.1.60 is the IP of your BDE server. You should use https://10.103.1.60/phd/1/phd.repo instead (i.e. https, not http), because all http connections on the httpd server are redirected to https, and using http causes the "unknown protocol" SSLError. If you create the PHD yum repo on another VM (e.g. you can clone the hadoop-template VM in the BDE vApp to set up the yum repo) and do not redirect http to https, you should use http in the yum repo URL. If the VM serving as the yum repo server has a valid FQDN, you need to use the FQDN (not the IP) in the yum repo URL, as well as in the content of the repo file.
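For example, the repo file served from the BDE server would then look roughly like this; the section name, paths and gpgcheck/sslverify settings below are illustrative, not taken from your actual file:

[phd]
name=Pivotal HD
baseurl=https://10.103.1.60/phd/1/
enabled=1
gpgcheck=0
# sslverify=0 may be needed if the BDE server's httpd uses a self-signed certificate
sslverify=0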

Besides this yum repo URL issue, the DNS server in your network (i.e. the VLAN that the nodes of the Hadoop cluster belong to, not the management server) must provide forward and reverse FQDN/IP resolution, and the DHCP server should be configured to return the FQDN and IP for each DHCP client.

The BDE server is not required to have a valid FQDN, but the nodes are. If you cannot configure the DNS server to provide FQDN resolution, then, as Michael suggested, you can power on the hadoop-template VM and manually add IP-to-FQDN mappings for all node IPs in /etc/hosts (the FQDN can be any string you define). Please follow the topic "Maintain a Customized Hadoop Template Virtual Machine" when modifying the hadoop-template, and remember to restart the Tomcat service after the hadoop-template is modified.
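For example, the entries appended to /etc/hosts inside the powered-on hadoop-template VM might look like this (the names are arbitrary examples; use one line per IP in your BDE IP range):

10.103.1.64  phd-node-64.example.local
10.103.1.65  phd-node-65.example.local
(and so on through 10.103.1.80)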

When creating clusters using Hadoop distributions based on Hadoop 2.0 and later, the DNS server in your network must provide forward and reverse FQDN/IP resolution. Without valid DNS and FQDN settings, the cluster creation process might fail, or the cluster is created but does not function. Hadoop distributions based on Hadoop 2.0 and later include Apache Bigtop, Cloudera CDH4 and CDH5, Hortonworks HDP 2.x, Intel 3.x, and Pivotal PHD 1.1 and later releases.

Cheers, Jesse Hu
EdSp
Enthusiast

Great, thanks all. I will give the management server a proper hostname and get the IP/hostname added to DNS, after which I will test again.


Ed

EdSp
Enthusiast

After giving the management server a proper hostname and adding it to the DNS server there was progress; however, it turned out I had to do something similar for all nodes, as you already suggested in your last response, so I added IP-to-FQDN mappings for all nodes in /etc/hosts. Next I also had to add the mapred user to the Isilon system, which I hadn't done yet. This issue and its solution are described in the Admin Guide, in the troubleshooting section.


I can now finally successfully create a compute-only Pivotal HD cluster on Isilon! 🙂


Many thanks again to you all for your support.

Regards,

Ed
