VMware Cloud Community
elmonogrande
Contributor

Weirdness with HA

Ok so here's the situation: HA on one of my four ESX servers seems to be "flapping". It periodically turns red and indicates an error with HA. Following advice from the forums, all four servers have the other hosts defined in the /etc/hosts file. They all ping correctly and resolve fine. Running "/opt/LGTOaam512/bin/ft_gethostbyname" is where the weirdness occurs:

[root@esx01 root]# /opt/LGTOaam512/bin/ft_gethostbyname esx01
10.20.1.21 esx01
10.20.1.21 esx01
10.20.1.21 esx01

[root@esx01 root]# /opt/LGTOaam512/bin/ft_gethostbyname esx02
10.20.1.22 esx02
10.20.1.22 esx02
10.20.1.22 esx02

[root@esx01 root]# /opt/LGTOaam512/bin/ft_gethostbyname esx03
10.20.1.23 esx03
10.20.1.23 esx03
10.20.1.23 esx03

[root@esx01 root]# /opt/LGTOaam512/bin/ft_gethostbyname esx04
10.20.1.24 esx04

As you can see, server 4 (the one flapping) only returns one entry. I have verified that DNS is correct for everything. Thoughts?
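For anyone comparing this kind of output across more hosts, a small script can do the counting. This is only a sketch: it does not run ft_gethostbyname itself, it just parses pasted text, and it assumes each result line has the "IP hostname" shape shown in the post. The function names count_entries and odd_ones_out are made up for illustration.

```python
from collections import Counter

# Output pasted from the post above (prompt lines are ignored by the parser).
SAMPLE = """\
[root@esx01 root]# /opt/LGTOaam512/bin/ft_gethostbyname esx01
10.20.1.21 esx01
10.20.1.21 esx01
10.20.1.21 esx01
[root@esx01 root]# /opt/LGTOaam512/bin/ft_gethostbyname esx02
10.20.1.22 esx02
10.20.1.22 esx02
10.20.1.22 esx02
[root@esx01 root]# /opt/LGTOaam512/bin/ft_gethostbyname esx03
10.20.1.23 esx03
10.20.1.23 esx03
10.20.1.23 esx03
[root@esx01 root]# /opt/LGTOaam512/bin/ft_gethostbyname esx04
10.20.1.24 esx04
"""


def count_entries(output: str) -> Counter:
    """Count how often each (ip, hostname) pair appears in the pasted output."""
    counts = Counter()
    for line in output.splitlines():
        parts = line.split()
        # Assumption: result lines are exactly "IP hostname"; anything else
        # (shell prompts, blank lines) is skipped.
        if len(parts) == 2:
            counts[(parts[0], parts[1])] += 1
    return counts


def odd_ones_out(counts: Counter) -> list:
    """Return the pairs whose count differs from the most common count."""
    if not counts:
        return []
    typical = Counter(counts.values()).most_common(1)[0][0]
    return sorted(pair for pair, n in counts.items() if n != typical)


if __name__ == "__main__":
    print(odd_ones_out(count_entries(SAMPLE)))  # [('10.20.1.24', 'esx04')]
```

A different count is only a hint about where to look, not proof that it causes the flapping.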

14 Replies
wobbly1
Expert

do you get the same results if you run this on the other hosts?

elmonogrande
Contributor

Sorry, I failed to include that information in the post. Yes, I get the same results even on the server that is flapping.

wobbly1
Expert

ok, try dropping esx04 out of the cluster and then add it back in again

wally
Enthusiast

We see the same thing on our hosts but

/opt/LGTOaam512/vmware# perl aam_config_util.pl -cmd=listnodes -z

does show the exact same (and in our belief correct) status on all nodes. Which tool can we trust?

We feel this 'flapping' was introduced with the last 2007 patches but we have no real proof of this. We see more HA flapping during vmotion actions, so we suspect that there is a timeout value somewhere that is a bit too critical.

elmonogrande
Contributor

I did one better (before I posted this message). Since I was seeing this happening even worse with VC Patch 2 installed, I rebuilt the entire VC server back to the original 2.0.1 release. This solved the problems I was having with the other 5 servers (2 datacenters). Server 4 was the only one flapping. So, I removed it and added it back in. Still the same thing. FYI, all 6 ESX servers are using the same 3.0.1 release. No updates have been applied to any of these servers.

elmonogrande
Contributor

wally,

I was seeing this issue even worse with Patch 2 during vmotion, too. That's when I decided to roll everything back to the original release. At least now during vmotion, the entire cluster doesn't start to flap the way it used to. Only server 4 does, regardless of where I am vmotioning to. You might be on to something with a timeout value perhaps being a bit too "sensitive". Guess it's time to dig out the documentation.

wobbly1
Expert

just out of interest then, did you have HA/DRS clusters disabled when you applied patch2?

elmonogrande
Contributor

I sure did. At that time, I only had the 1 development datacenter which, incidentally, never experienced an HA issue. The production datacenter was a totally new DC, and from the word "go" it was having problems. That's why I rolled back to 2.0.1.

matthew_hadfiel
Contributor

Hi,

I am also experiencing this same error on 4 of my 14 ESX servers. As with the other environment all servers can resolve and ping each other, both by hosts file and DNS.

elmonogrande, did you ever find out anything more on this? I'm just wondering if anyone else here has lodged a VMware support case about this or had any response or resolution from VMware. Otherwise I'll have to start the process myself.

frankdenneman
Expert

>all four servers have the other hosts defined in the /etc/hosts file.

So do I understand correctly that you have included ALL host names of all your esx servers in the host file on the ESX servers?

So esx01 has the IP addresses and hostnames of esx01, esx02, esx03 and esx04 in the file /etc/hosts?

Do you have resolv.conf configured with your DNS servers?

If so, delete the hostnames of all other ESX servers from the /etc/hosts file.

Only use the hostname of the host itself, and list its short name.

so 123.456.789.000 esx0 esx0.domainname.com

Although you can resolve all hosts just fine, leave the hostnames of other servers out of the hosts file if you have a DNS server to work with.
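To make the suggestion above concrete, here is a rough sketch that scans the text of a hosts file and lists the entries that belong to other machines. It is an illustration only: the function name foreign_entries, the sample file contents, and the domain example.com are hypothetical, and it assumes the usual localhost line should stay.

```python
def foreign_entries(hosts_text, own_short_name,
                    allowed=("localhost", "localhost.localdomain")):
    """List (ip, names) entries that are neither this host nor localhost."""
    flagged = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ip, *names = line.split()
        if not names:
            continue
        short_names = {n.split(".")[0].lower() for n in names}
        if own_short_name.lower() in short_names:
            continue  # the host's own entry stays
        if all(n.lower() in allowed for n in names):
            continue  # assumption: the loopback entry stays too
        flagged.append((ip, names))
    return flagged


# Hypothetical /etc/hosts on esx01, using the addresses from the first post.
SAMPLE_HOSTS = """\
127.0.0.1 localhost.localdomain localhost
10.20.1.21 esx01 esx01.example.com
10.20.1.22 esx02
10.20.1.23 esx03
10.20.1.24 esx04
"""

if __name__ == "__main__":
    for ip, names in foreign_entries(SAMPLE_HOSTS, "esx01"):
        print(ip, " ".join(names))
```

The script only reports candidates; whether to delete them is still a judgment call for each environment.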

Blogging: frankdenneman.nl Twitter: @frankdenneman Co-author: vSphere 4.1 HA and DRS technical Deepdive, vSphere 5x Clustering Deepdive series
VirtualKenneth
Virtuoso

Agreed with Frank D, but that's just because it's easier to manage and reduces the chance of errors in the hosts files.

It shouldn't, however, affect your HA environment if there aren't any typos in the current files.

admin
Immortal

elmonogrande, have you filed an SR on this one?

The multiple entries returned by ft_gethostbyname are probably caused by duplicate entries in your /etc/FT_HOSTS file, so that's likely not the issue. That file is the HA (AAM) cache for name resolution, which keeps HA from depending on DNS during failover.
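If you want to confirm the duplicate theory, something like the sketch below would list repeated lines in a copy of that file. It makes no assumption about the file's format beyond it being plain text with one entry per line; the function name duplicate_lines and the sample text are hypothetical.

```python
from collections import Counter


def duplicate_lines(text):
    """Return {line: count} for lines that occur more than once.

    Lines are compared after collapsing runs of whitespace, so entries
    that differ only in spacing count as the same line.
    """
    normalized = (" ".join(line.split()) for line in text.splitlines())
    counts = Counter(line for line in normalized if line)
    return {line: n for line, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Hypothetical contents, reusing addresses from the first post.
    sample = "10.20.1.21 esx01\n10.20.1.21  esx01\n10.20.1.24 esx04\n"
    print(duplicate_lines(sample))  # {'10.20.1.21 esx01': 2}
```

To try it on real data, copy the file off the host and pass its contents in, for example duplicate_lines(open("FT_HOSTS").read()).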

admin
Immortal

logs would be useful, from the flapping host and the other hosts as well.

admin
Immortal

Also, is it only that host that is having problems? i.e. if you remove that host and add another one in its place, is that ok?
