VMware Cloud Community
SafetyMan
Contributor

Upon Reboot: Have to reconfigure HA every time; HA agent on x has an error

Description: Upon rebooting one of the nodes in a cluster, it does not come back online with HA enabled; I have to reconfigure HA after every reboot of the node. This write-up goes through the checks I made to determine where the problem might be and, for my situation, what corrected it.

Summary steps:

Set up a new cluster

Dragged two nodes into the cluster (now I have a 2-node cluster)

Enabled HA; reconfigured HA for both nodes

Created two resource pools

Dragged 5 VMs from the cluster root to one of the resource pools

==============================================

The HA cluster and nodes are running correctly and I can use DRS/VMotion

Now I reboot the 2nd node (not primary).

==============================

When the 2nd node shows up again in the VMware VI Client, it goes into an alert state. The cluster also goes into an alert state.

To re-enable HA on the 2nd node

----

I have to right-click the ESX host in the VI Client and reconfigure HA.

I made the /etc/hosts file on both servers show only the short name of the server (it previously showed the fully qualified domain name [FQDN]).

(nano /etc/hosts)

HOSTS file now looks like:

----


127.0.0.1 localhost.localdomain localhost

10.0.1.1 esx1

10.0.1.2 esx2

# notice the next lines are for NTP

0.pool.ntp.org

1.pool.ntp.org

2.pool.ntp.org

pool.ntp.org
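A hosts file like the one above can be sanity-checked with a few lines of Python. The sketch below is only an illustration and was not part of the original troubleshooting; `parse_hosts` is a hypothetical helper name. It assumes the usual hosts format, where each entry starts with an IP address followed by one or more names, and it reports any non-comment line that does not start with an IP address.

```python
import ipaddress


def parse_hosts(text):
    """Split hosts-file text into (entries, skipped).

    entries maps each host name to the IP address on its line;
    skipped collects non-comment lines that do not start with an IP address.
    """
    entries = {}
    skipped = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        try:
            ipaddress.ip_address(fields[0])
        except ValueError:
            skipped.append(line)  # no leading IP address, so not a usable entry
            continue
        for name in fields[1:]:
            entries[name] = fields[0]
    return entries, skipped


# Sample text shortened from the hosts file shown in this post.
sample = """\
127.0.0.1 localhost.localdomain localhost
10.0.1.1 esx1
10.0.1.2 esx2
# notice the next lines are for NTP
0.pool.ntp.org
pool.ntp.org
"""

entries, skipped = parse_hosts(sample)
print(entries)
print(skipped)
```

To check a real file, pass `open("/etc/hosts").read()` to `parse_hosts` instead of the sample text.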

I also made sure my DNS settings were working correctly:

=========================================

The forward (forward lookup zone) DNS lookup was working: nslookup esx1.safetylca.org

BUT NOT THE reverse (reverse lookup zones) DNS lookup: nslookup 10.0.1.1

The reverse lookup produced the error:

** server can't find 1.1.0.10.in-addr.arpa: NXDOMAIN

Correction: I created a new reverse lookup zone, then deleted and re-created the ESX server records in the forward lookup zone, which also created the matching records in the reverse lookup zone.

ALL LOOKUPS STARTED WORKING

nslookup esx1

nslookup 10.0.1.1

nslookup esx1.mydomain.com

>>>>same for 2nd node
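The same forward and reverse checks can be scripted. This is a minimal sketch, not what was actually run here: `reverse_zone_name` and `check_lookups` are hypothetical helper names, and the lookups go through the system resolver, which may also consult /etc/hosts, so a pass does not prove that DNS alone is correct.

```python
import ipaddress
import socket


def reverse_zone_name(ip):
    """Return the in-addr.arpa name that a reverse lookup of ip queries."""
    return ipaddress.ip_address(ip).reverse_pointer


def check_lookups(name, ip):
    """Return (forward_ok, reverse_ok) using the system resolver."""
    try:
        # Forward lookup: the name should resolve to the expected address.
        forward_ok = socket.gethostbyname(name) == ip
    except OSError:
        forward_ok = False
    try:
        # Reverse lookup: the address should resolve to some host name.
        reverse_ok = bool(socket.gethostbyaddr(ip)[0])
    except OSError:
        reverse_ok = False
    return forward_ok, reverse_ok


# This is the record name reported as NXDOMAIN in the error above.
print(reverse_zone_name("10.0.1.1"))  # prints 1.1.0.10.in-addr.arpa
```

On a machine that uses the same DNS servers as the ESX hosts, `check_lookups("esx1", "10.0.1.1")` should return `(True, True)` once both zones are in place.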

I also examined some log files whose locations I found on other message boards, but again they did not tell me much:

================================================

/opt/LGTOaam512/log

/var/log/vmware/vpx/vpxa.log

Error I found: 547056 warning [vpxaHalStats] Unexpected return result Expect 1 sample, receive 2 (whatever that means).
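For anyone wading through these logs, a small filter can pull out the lines worth reading. This is a rough sketch under assumptions: `flag_lines` is a hypothetical helper, the keyword list is a guess at what matters, and the second sample line is invented purely to show a non-matching line.

```python
def flag_lines(lines, keywords=("warning", "error")):
    """Yield (line_number, line) for lines mentioning any keyword, ignoring case."""
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            yield number, line.rstrip("\n")


sample = [
    "547056 warning [vpxaHalStats] Unexpected return result Expect 1 sample, receive 2\n",
    "hypothetical informational line\n",  # invented filler, not from a real log
]

hits = list(flag_lines(sample))
for number, line in hits:
    print(number, line)
```

Against a real log, open the file and pass the file object in, for example `flag_lines(open("/var/log/vmware/vpx/vpxa.log"))`.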

So finally I decided to remove both machines from the cluster.

============================================

>Enter maintenance mode:

>delete from cluster

>logged into the esxhost: typed reboot

I did the same thing for the primary node also....

Once both ESX hosts had rebooted and no nodes remained in the VirtualCenter cluster, I re-added the two hosts to the cluster.

Rebooted the servers at the same time: IT WORKED

Rebooted the primary node: IT WORKED

Rebooted the 2nd node: IT WORKED AGAIN.

finally........

Search Engine Reference:

HA does not work after a cold boot

HA agent does not work after I power-off the esxhost

Power off

power button off

Power esx host off

Reboot esx host

reboot server

Red exclamation icon

Red invalid cluster

CLUSTER SETTING WHICH NEEDS TO BE UNCHECKED: "allow virtual machines to be started even if they violate availability constraints"

CLUSTER SETTING: strict admission control

PRIMARY HA AGENT

Number of host failures allowed = 1

>Esx command: hostname

5 Replies
Kindred_VMSuppo
Contributor

Have you tried to create a new cluster group and move all ESX servers into it? HA is very picky and once something goes bad, it stays bad.

SafetyMan
Contributor

Early on in the troubleshooting process I did do this.

It did not resolve the problem.

So I started backtracking my steps to ensure the pre-requisites were correctly configured as indicated above.

If deleting the hosts from the cluster had not worked, that was going to be my next step (again).

I was actually 1 step away from calling tech support on this one.

-Doug

psharpley
Enthusiast

You could also try adding the das.isolationaddress option, with its value set to the IP address of something pingable, usually the subnet gateway (if it replies to pings). This is under cluster pool settings > advanced.

I also like to leave the FQDN in the hosts file and just add the short host name at the end of each line. This makes for less dependency on DNS.
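As a rough illustration of that layout, the sketch below formats one hosts entry per server with the FQDN first and the short name as an alias. `hosts_line` is a hypothetical helper, and "mydomain.com" is only a placeholder domain; the IP addresses and short names are the ones from the original post.

```python
def hosts_line(ip, short_name, domain):
    """Format one hosts entry: IP, then FQDN, then the short name as an alias."""
    return f"{ip} {short_name}.{domain} {short_name}"


# "mydomain.com" is a placeholder domain; substitute the real one.
for ip, name in [("10.0.1.1", "esx1"), ("10.0.1.2", "esx2")]:
    print(hosts_line(ip, name, "mydomain.com"))
```

With an entry in this form, both the short name and the FQDN resolve from the hosts file even when DNS is unavailable.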

HA does take several minutes to get going again when a host has been restarted.

steve_pitt
Contributor

Can I ask if you have fully patched both your VMware ESX servers and your VirtualCenter server?

Can you post your current hosts file?

Are your servers getting their correct licensing from a VC server, and can you give us a bit of detail on the physical servers' set-up?

Steve "Bug-Man" Pitt

admin
Immortal

Check your switch to make sure portfast is enabled on the port your service console is plugged into. If you are using a trunked port, you will need portfast trunking enabled.
