VMware Cloud Community
STef77
Contributor

Problem enabling HA on cluster with 2 esxi (4.1) hosts

Hi,

When I try to enable HA on a cluster with 2 members, I get the error message below on the second host. Every time I try, one server (arbitrarily) gets its configuration done and the problem occurs on the other one:

HA agent on XXXXXXX in cluster ZZZZZZ in YYYYYYYY has an error : cmd addnode failed for primary node: getrulevar failed: Error [10024]: Rule Not Found: Unknown HA error

This happens during the "Configuring HA" task

Note that there is no explicit message about DNS, a missing swap file, etc.

... Of course, DNS works fine, swap is enabled, etc.

Before starting this topic I searched the forum and tried the following:

Everything there : http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100159...

Manual operation in the console :

/opt/vmware/aam/VMware-aam-ha-uninstall.sh
services.sh stop
services.sh start

then with service console "Reconfigure for VMware HA"

I also tried removing the hosts from the VC and adding them back, and creating a new cluster...

I should also mention that live vMotion works fine both ways (from srv1 to srv2 and from srv2 to srv1).

Both servers are identical (Dell R610, dual Xeon, 48 GB RAM) with 10 NICs per server and a redundant network (2 switches).

The networks are all on distributed vSwitches; there is a dedicated vSwitch with redundant interfaces for heartbeat and another one for vMotion.

We are on a gigabit network; the switches are HP ProCurve and are not saturated by collisions or heavy traffic.

HA simply cannot be activated on 1 of the 2 servers in the cluster (sometimes srv1, sometimes srv2)...

Any help will be greatly appreciated.

Thanks a lot,

STef

10 Replies
Josh26
Virtuoso

Hi,

I know you said DNS works fine, but just to be sure..

Did you configure all lower case hostnames on the ESXi servers themselves, and when they were added to vCenter, and does DNS match this?
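The three-way comparison asked about here (name on the host, name used in vCenter, name in DNS) can be sketched in Python from an admin workstation. This is only an illustrative sketch: the function names are hypothetical, and the example spellings in the usage section are made up, not taken from this environment.

```python
import socket

def case_mismatches(names):
    """Return the names that contain upper-case characters."""
    return [n for n in names if n != n.lower()]

def names_agree(*names):
    """True when every spelling of the host name is exactly identical."""
    return len(set(names)) <= 1

def forward_reverse(name):
    """Resolve name to an IP, then resolve that IP back to a name.

    Raises socket.gaierror or socket.herror when a lookup fails.
    """
    ip = socket.gethostbyname(name)
    reverse_name = socket.gethostbyaddr(ip)[0]
    return ip, reverse_name

if __name__ == "__main__":
    # Hypothetical spellings: as set on the host, as added to vCenter, as in DNS.
    on_host, in_vcenter, in_dns = "esxi-srv1", "ESXi-SRV1", "esxi-srv1"
    print(case_mismatches([on_host, in_vcenter, in_dns]))
    print(names_agree(on_host, in_vcenter, in_dns))
```

Calling `forward_reverse` with the real host name and comparing the returned name against the spelling used in vCenter would cover the "does DNS match this" part of the question.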

STef77
Contributor

Hi Josh,

All names have been lower case since the beginning...

Thanks

STef

a_p_
Leadership

I'm not sure whether this matters, but make sure the name of the management port group is the same on your hosts. I assume the hosts are in the same subnet?

André

STef77
Contributor

The configuration is identical (same port groups, same dvSwitches, etc.) on both servers.

I think the solution will come from answering "Why does the second server fail to configure HA?"

When HA is activated, the HA configuration starts at the same time on both servers.

Sometimes server1 wins the race and sometimes server2 does. The one that wins the race has HA activated properly.

The other one cannot finish the configuration...

Time is also synchronized between the domain controller, VirtualCenter and the 2 hosts.

error.png

In the screenshot you can see the detailed error message. esxi-srv1 has HA successfully enabled (you will have to trust me, I didn't take a screenshot to show it).

STef

STef77
Contributor

Activating DRS works fine.

Changing the MTU from 1500 to 9000 didn't help.

Upgrading the 2 physical servers didn't help either.

I really don't know what to do...

There is just this in the logs, but I don't know how to interpret it:

01/28/11 00:03:28 [wait_agent_startup  ] ConfigurationStatus=complete, heartbeat_config complete

01/28/11 00:03:28 [elapsed_time        ] elapsed time  0 minute(s) and 51 second(s)

VMwareprogress=59

01/28/11 00:03:28 [get_rule_var        ]

01/28/11 00:03:28 [issue_cli_cmd       ] command is '/opt/vmware/aam/bin/ftcli -cmd "getRuleVars VMWareClusterManager"'

01/28/11 00:03:28 [issue_cmd           ] CMD:    /opt/vmware/aam/bin/ftcli -cmd "getRuleVars VMWareClusterManager"

01/28/11 00:03:28 [issue_cmd           ] STATUS: 1

01/28/11 00:03:28 [issue_cmd           ] RESULT:

01/28/11 00:03:28 [issue_cmd           ] Error [10024]: Rule Not Found

01/28/11 00:03:28 [issue_cmd           ]

VMwareerrortext=getrulevar failed:  Error [10024]: Rule Not Found

01/28/11 00:03:28 [vpxa_respond        ] VMwareerrortext=getrulevar failed:  Error [10024]: Rule Not Found

VMwareerrorcat=internalerror

01/28/11 00:03:28 [vpxa_respond        ] VMwareerrorcat=internalerror

01/28/11 00:03:28 [myexit              ] copying /var/lib/vmware/aam/vmware-sites to /var/log/vmware/aam/aam_config_util_addnode.log

01/28/11 00:03:28 [myexit              ] Failure location:

01/28/11 00:03:28 [myexit              ]        function main::myexit called from line 1506

01/28/11 00:03:28 [myexit              ]        function main::get_rule_var called from line 2583

01/28/11 00:03:28 [myexit              ]        function main::manage_primaries called from line 1240

01/28/11 00:03:28 [myexit              ]        function main::add_aam_node called from line 210

01/28/11 00:03:28 [myexit              ] VMwareresult=failure
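One way to read an excerpt like the one above is to pull out just the failing command, its exit status, the error code, and the call chain. The Python sketch below does that for the line layout seen in this thread (timestamp, bracketed function tag, message). The layout is inferred only from these lines, so the regular expressions and the `summarize` helper are assumptions rather than a documented log format.

```python
import re

# Layout assumed from the excerpt: "MM/DD/YY HH:MM:SS [tag        ] message"
LINE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\s*\]\s?(.*)$")
CALL = re.compile(r"function (\S+) called from line (\d+)")
ERROR = re.compile(r"Error \[(\d+)\]: (.+)")

def summarize(log_text):
    """Collect the failing command, its status, the error and the call chain."""
    summary = {"command": None, "status": None, "error": None, "calls": []}
    for raw in log_text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # lines without a timestamp, e.g. "VMwareprogress=59"
        _, tag, msg = match.groups()
        msg = msg.strip()
        if tag == "issue_cmd" and msg.startswith("CMD:"):
            summary["command"] = msg[len("CMD:"):].strip()
        elif tag == "issue_cmd" and msg.startswith("STATUS:"):
            summary["status"] = int(msg.split(":", 1)[1])
        elif tag == "issue_cmd":
            err = ERROR.search(msg)
            if err:
                summary["error"] = (int(err.group(1)), err.group(2).strip())
        elif tag == "myexit":
            call = CALL.search(msg)
            if call:
                summary["calls"].append((call.group(1), int(call.group(2))))
    return summary

# Lines copied from the excerpt posted in this thread.
SAMPLE = """\
01/28/11 00:03:28 [issue_cmd           ] CMD:    /opt/vmware/aam/bin/ftcli -cmd "getRuleVars VMWareClusterManager"
01/28/11 00:03:28 [issue_cmd           ] STATUS: 1
01/28/11 00:03:28 [issue_cmd           ] RESULT:
01/28/11 00:03:28 [issue_cmd           ] Error [10024]: Rule Not Found
01/28/11 00:03:28 [myexit              ] Failure location:
01/28/11 00:03:28 [myexit              ]        function main::myexit called from line 1506
01/28/11 00:03:28 [myexit              ]        function main::get_rule_var called from line 2583
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Read this way, the excerpt shows the `ftcli` call for `getRuleVars VMWareClusterManager` returning status 1 with error 10024 inside `get_rule_var`, which was called from `manage_primaries`. The parser only restates what the log says; it does not explain why the rule is missing.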



STef

idle-jam
Immortal

You could try a back-to-back (direct) LAN connection for the HA console network. This would rule out any MTU, switch or firewall setting that might make HA unstable. Only from there would you move on to troubleshooting the HA agent on each host.

ronmanu07
Enthusiast

Hi STef,

A few things to check:

Are the necessary ports open for communication from the SC to your ESX hosts?

Do you have the hostnames in the local /etc/hosts file on the ESX hosts?

Are the port names and VLANs consistent? Sometimes one character breaks the whole HA config (it has happened to me).

Is the shared storage accessible from all the hosts?

Check the HA logs and SC logs for more detailed error messages.

Just a few things; hope this helps.
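For the /etc/hosts item in the checklist above, a small Python sketch can confirm that every cluster member has an entry. It works on a copy of the file's text, so it can be run on a workstation. The helper names are hypothetical, and the address in the sample is a documentation placeholder (192.0.2.0/24), not an address from this thread.

```python
def parse_hosts(text):
    """Map every host name found in hosts-file text to its IP address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            table[name] = ip
    return table

def missing_hosts(text, expected):
    """Return the expected names that have no entry (comparison is case-sensitive)."""
    table = parse_hosts(text)
    return [name for name in expected if name not in table]

# Hypothetical hosts-file content; the IP address is a placeholder.
SAMPLE = """\
127.0.0.1    localhost
192.0.2.11   esxi-srv1   # management address (placeholder)
"""

if __name__ == "__main__":
    print(missing_hosts(SAMPLE, ["esxi-srv1", "esxi-srv2"]))
```

The lookup is deliberately case-sensitive, so an entry spelled with different capitalization is reported as missing, which ties in with the lower-case question raised earlier in the thread.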

STef77
Contributor

Hi Idle Jam,

I don't really understand what "could do a back to back LAN connectivity for the HA console" means.

The vCenter Server is plugged into the same switch as the ESXi hosts.

The Windows firewall is disabled.

The firewall on the ESXi hosts is also automatically disabled when HA is activated.

Thanks for your help.

Regards,

STef

STef77
Contributor

Hi Ronmanu07,

Thanks for your answer.

I'm using dvSwitches, but I experience the same problem with standard virtual switches.

FT has dedicated traffic (but I'm not actually using it).

All names etc. are the same on both servers (and the hardware is identical).

From the command line of one server (which I had to enable for ESXi) I can ping the other one on each of its addresses (vmkernel for vMotion, management, etc.).

I can also ping the vCenter, and the vCenter can ping each ESXi server on its management address.

Adding entries for my ESXi hosts in /etc/hosts didn't help...

STef

STef77
Contributor

Digging into the logs :

[2011-01-28 21:15:57.862 FFCE4B90 verbose 'Locale' opID=HB-host-361@182-c3] Default resource used for 'LicenseManager.LicenseInfo.dpvmotion.label' expected in module 'default'.
[2011-01-28 21:15:57.862 FFCE4B90 verbose 'Locale' opID=HB-host-361@182-c3] Default resource used for 'LicenseManager.LicenseInfo.vaai.label' expected in module 'default'.
[2011-01-28 21:16:00.013 FFE40B90 verbose 'Statssvc'] HostCtl exception Unable to complete Sysinfo operation.  Please see the VMkernel log file for more details.
[2011-01-28 21:16:00.017 FFE40B90 verbose 'Statssvc'] HostCtl exception Unable to complete Sysinfo operation.  Please see the VMkernel log file for more details.

It says that the licences are OK, but there is still a problem somewhere...

/var/log/vmware/aam # tail aam_config_util_addnode.log
FULLTIME_SITES_TID 00000002
+ 1:8042,8042,8043 esxi-srv2    vmware #FT_Agent_Port=8045
+ 2:8042,8042,8043 esxi-srv1 vmware
01/28/11 21:15:40 [myexit              ] Failure location:
01/28/11 21:15:40 [myexit              ]        function main::myexit called from line 1506
01/28/11 21:15:40 [myexit              ]        function main::get_rule_var called from line 2583
01/28/11 21:15:40 [myexit              ]        function main::manage_primaries called from line 1240
01/28/11 21:15:40 [myexit              ]        function main::add_aam_node called from line 210
01/28/11 21:15:40 [myexit              ] VMwareresult=failure
01/28/11 21:15:40 [elapsed_time        ] Total time for script to complete:  0 minute(s) and 58 second(s)
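Since ping only proves ICMP reachability, a TCP connect test against the port numbers visible in the vmware-sites lines above (8042, 8043 and 8045) is one further check that could be run from a workstation or from the other host. This Python sketch rests on an assumption: that the agent accepts TCP connections on those ports. A refused connection therefore only shows that nothing answered on TCP at that moment, and the sketch does not test UDP at all. The helper names are hypothetical.

```python
import socket

def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False

def probe(host, ports, timeout=2.0):
    """Check several TCP ports on one host and return {port: reachable}."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}

if __name__ == "__main__":
    # Port numbers copied from the vmware-sites lines; host names from this thread.
    for host in ("esxi-srv1", "esxi-srv2"):
        print(host, probe(host, [8042, 8043, 8045]))
```

Comparing the result for the host that configured successfully with the one that failed may show whether the difference is on the network side or inside the agent.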

This doesn't really explain what's wrong...

/var/log/vmware/aam # tail  vmware_esxi-srv1.log
By: FT/Agent on Node: esxi-srv1
MESSAGE: Finished reconfig of heartbeat settings in 0 seconds.
===================================
Info NODE Fri Jan 28 21:15:23 2011
By: FT/Agent on Node: esxi-srv1
MESSAGE: Node esxi-srv1 is running.
===================================
Error FT Fri Jan 28 21:15:23 2011
By: FT/Agent on Node: esxi-srv1
MESSAGE: ftProcMon failed. Being restarted

This one is strange since I'm not using FT yet (even though a port group has already been created for it)... [I disabled FT by unchecking the FT box in the port groups; it didn't change anything...]

STef
