Description: After rebooting one of the nodes in a cluster, the node does not come back online with HA enabled; I have to reconfigure HA after every reboot of the node. This procedure walks through the checks I made to determine where the problem might lie and, in my situation, how I corrected it.
Summary steps:
Set up a new cluster
Dragged two nodes into the cluster (now I have a 2-node cluster)
Enabled HA; ran Reconfigure HA for both nodes
Created two resource pools
Dragged 5 VMs from the cluster root to one of the resource pools
==============================================
The HA cluster and nodes are running correctly and I can use DRS/VMotion.
Now I reboot the 2nd node (not primary).
==============================
When the 2nd node shows up again in the VMware VI Client, it goes into
an alert state. The cluster also goes into an alert state.
To re-enable HA on the 2nd node
\----
I have to right-click the ESX host in the VI Client and reconfigure HA.
I changed the /etc/hosts file on both servers to show only the short name of the server (it previously showed the fully qualified domain name \[FQDN]).
(nano /etc/hosts)
HOSTS file now looks like:
\----
127.0.0.1 localhost.localdomain localhost
10.0.1.1 esx1
10.0.1.2 esx2
\# notice the next lines are for NTP
0.pool.ntp.org
1.pool.ntp.org
2.pool.ntp.org
pool.ntp.org
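A quick way to sanity-check a hosts file like the one above is to parse it and list any lines that do not follow the usual `IP name [alias...]` layout. The following is only a minimal sketch: the function name `check_hosts` and the sample text are made up for illustration, and it reads a string rather than the real /etc/hosts.

```python
import ipaddress

def check_hosts(text):
    """Return (entries, problems) for the contents of a hosts file."""
    entries = {}   # name -> IP address
    problems = []  # (line number, line) for lines that are not "IP name [alias...]"
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments, skip blank lines
        if not line:
            continue
        fields = line.split()
        try:
            ipaddress.ip_address(fields[0])  # first field must be an IP address
        except ValueError:
            problems.append((number, raw.strip()))
            continue
        if len(fields) < 2:                  # an IP with no name is not usable
            problems.append((number, raw.strip()))
            continue
        for name in fields[1:]:
            entries[name] = fields[0]
    return entries, problems

# Sample text modelled on the hosts file in this post (hypothetical copy).
SAMPLE = """\
127.0.0.1 localhost.localdomain localhost
10.0.1.1 esx1
10.0.1.2 esx2
# notice the next lines are for NTP
0.pool.ntp.org
"""

entries, problems = check_hosts(SAMPLE)
print(entries)
print(problems)
```

With the sample above, the bare NTP name is reported as a problem line because it has no IP address in front of it, while esx1 and esx2 are picked up as normal entries.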
I also made sure my DNS settings were working correctly:
=========================================
The forward DNS lookup (forward lookup zones) was working: nslookup esx1.safetylca.org
But the reverse DNS lookup (reverse lookup zones) was NOT: nslookup 10.0.1.1
The reverse lookup produced the error:
\** server can't find 1.1.0.10.in-addr.arpa: NXDOMAIN
Correction: Created a new reverse lookup zone, then deleted and re-created the ESX servers in the forward lookup zone, which created the matching records
in the reverse lookup zone
ALL LOOKUPS STARTED WORKING
nslookup esx1
nslookup 10.0.1.1
nslookup esx1.mydomain.com
>>>>same for 2nd node
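The same forward-then-reverse check can be scripted so it is easy to repeat on each node. This is a rough sketch, not a replacement for nslookup: the function names `reverse_name` and `check_host` are made up here, and Python's `socket` calls go through the system resolver, so they may be answered from /etc/hosts rather than from the DNS server.

```python
import ipaddress
import socket

def reverse_name(ip):
    """Name that a reverse (PTR) lookup asks for, given an IP address."""
    return ipaddress.ip_address(ip).reverse_pointer

def check_host(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve name -> IP -> name and report what each step returned."""
    result = {"name": name, "ip": None, "ptr": None, "error": None}
    try:
        result["ip"] = forward(name)              # forward lookup
        result["ptr"] = reverse(result["ip"])[0]  # reverse lookup, primary name
    except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
        result["error"] = str(exc)
    return result

# The record the failing reverse lookup in this post was asking for:
print(reverse_name("10.0.1.1"))  # 1.1.0.10.in-addr.arpa
```

Running `check_host("esx1")` and `check_host("esx2")` on each ESX host should give an `ip` and a `ptr` with no `error` once both zones are set up; a missing reverse record shows up as an `error` with `ptr` left empty.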
I also examined some log files that I had seen mentioned on other message boards, but again they did not tell me much:
================================================
/opt/LGTOaam512/log
/var/log/vmware/vpx/vpxa.log
Error I found: 547056 warning \[vpxaHalStats] Unexpected return result Expect 1 sample, receive 2 (whatever that means).
So finally I decided to remove both machines from the cluster.
============================================
>Entered maintenance mode
>Removed the host from the cluster
>Logged into the ESX host and typed: reboot
I did the same thing for the primary node also....
Once both ESX hosts had rebooted and no nodes remained in the VirtualCenter cluster, I re-added the two hosts to the cluster.
Rebooted the servers at the same time: IT WORKED
Rebooted the primary node: IT WORKED
rebooted the 2nd node: IT WORKED AGAIN.
finally........
Search Engine Reference:
HA does not work after a cold boot
HA agent does not work after I power-off the esxhost
Power off
power button off
Power esx host off
Reboot esx host
reboot server
Red exclamation icon
Red invalid cluster
CLUSTER SETTING WHICH NEEDS TO BE UNCHECKED: "allow virtual machines to be started even if they violate availability constraints"
CLUSTER SETTING: strict admission control
PRIMARY HA AGENT
Number of hosts failures allowed = 1
>Esx command: hostname
Have you tried to create a new cluster group and move all ESX servers into it? HA is very picky and once something goes bad, it stays bad.
Early on in the troubleshooting process I did do this.
It did not resolve the problem.
So I started backtracking my steps to ensure the pre-requisites were correctly configured as indicated above.
If deleting the hosts from the cluster had not worked, that was going to be my next step (again).
I was actually 1 step away from calling tech support on this one.
-Doug
You could also try adding the das.isolationaddress option and the value (IP address) of a pingable address, usually the subnet gateway - if it replies to pings. This is under cluster pool settings > advanced.
I also like to leave the fqdn in hosts, just add the host names at the end of each line. This makes for less dependency on DNS.
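As an illustration of that layout, here is a tiny sketch that prints hosts lines with the FQDN first and the short name after it. The helper `hosts_line` is made up for this example, the addresses and host names are the ones from the original post, and `mydomain.com` is only a placeholder domain.

```python
def hosts_line(ip, short_name, domain):
    """Format one hosts entry as: IP, fully qualified name, short name."""
    return f"{ip} {short_name}.{domain} {short_name}"

# mydomain.com is a placeholder; substitute the real DNS domain.
for ip, name in (("10.0.1.1", "esx1"), ("10.0.1.2", "esx2")):
    print(hosts_line(ip, name, "mydomain.com"))
```

The first line printed is `10.0.1.1 esx1.mydomain.com esx1`, so both the long and the short name resolve locally.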
HA does take several minutes to get going again when a host has been restarted.
Can I ask if you have fully patched both your VMware ESX servers and your Virtual Centre server?
Can you post your current hosts file?
Are your servers getting their correct licensing from a VC server, and can you give us a bit of detail on the physical servers' set-up?
Steve "Bug-Man" Pitt
Check your switch to make sure portfast is enabled on the port your service console is plugged into. If you are using a trunked port, you will need portfast trunking enabled.