VMware Cloud Community
polysulfide
Expert

HA not working on host failure

I'm having an issue with my HA. Twice in the last month or so I've had a host server go 'wonky' where the host is disconnected AND the VMs are no longer responsive.

In both of these cases HA didn't bring the VMs back online on other hosts. They just show as (disconnected) in VC. I can wrestle the host server back into a connected state and bring the VMs back online, but that is far from ideal.

I have a 4-node cluster

ESX: 3.5.0, 123630

VC: 2.5.0, 119598

Isolation Response: Power Down VMs

All hosts resolve each other's FQDN and short name

Any ideas what I should be looking for in the logs or any additional options I can configure to make this more resilient?

Thanks,

If it was useful, give me credit

Jason White - VCP

9 Replies
khughes
Virtuoso

Sounds like a communication issue where your VC isn't communicating with the host that gets disconnected. Has it happened on the same host each time, or on different hosts?

  • Kyle

-- Kyle "RParker wrote: I guess I was wrong, everything CAN be virtualized "
Troy_Clavell
Immortal

Does your host actually go down as well as the VMs? There is a known issue where vCenter disconnects from the hosts and they are shown as "not responding", but they usually come back within a few minutes.

There is a U3 patch to take care of it.

See KB http://kb.vmware.com/kb/1007041

polysulfide
Expert

I've had 3 different incidents this month. In the first and third incidents (two different hosts) the host was disconnected and the VMs were down. The VMs remained disconnected and HA didn't migrate them or power them on. The server never came back. A hard reboot resolved the issue; the server wouldn't restart gracefully because it couldn't unmount some of the filesystems. I don't have much more state information, as another admin resolved both of these with a reboot in the interest of bringing the VMs back.

The second incident involved a host that was disconnected but the VMs were all online. This one had a corrupt vpx agent and needed it reinstalled.

ESX350-200810201-UG is installed on all hosts.

Troy_Clavell
Immortal

Do you have enough resources to handle an HA event? What is admission control set to on your HA cluster? You might try changing it to "Allow VMs to be powered on...."

polysulfide
Expert

I generally prevent violating availability constraints so that HA will always have enough resources. Allowing violations during the incident didn't allow any of the VMs to be powered on. Shouldn't it continue to power VMs on one-at-a-time until the constraint is reached anyway?

Thanks,

If it was useful, give me credit

Jason White - VCP

Troy_Clavell
Immortal

I agree... I'm just trying to see what help, if any, I can offer. It may also be worth opening an SR.

khughes
Virtuoso

If it is happening to random hosts, then aside from the vpxa becoming corrupted, we had a similar issue when some networking components were misconfigured. Are you sure all your DNS entries are correct, with no misspellings? I'm assuming you never contacted VMware when one of these events happened so they could take a look at it?

  • Kyle

-- Kyle "RParker wrote: I guess I was wrong, everything CAN be virtualized "
polysulfide
Expert

I verified name resolution for each host from the service console of each host with ping.

I verified that the HA agent is running on each host with ftcli -domain vmware -cmd listnodes

I verified that the FT_HOSTS file on each host was correct.

I verified that HA can resolve every host with ft_gethostbyname
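For anyone who wants to repeat the FQDN and short-name check without pinging by hand, here is a rough sketch in Python 3. It is not a VMware tool, and the host names in the usage comment are made up. It also only tests resolution from the machine it runs on, so it does not replace checking from each service console.

```python
import socket


def check_resolution(hosts, resolve=socket.gethostbyname):
    """Resolve each FQDN and its short name; map name -> IP, or None on failure."""
    results = {}
    for fqdn in hosts:
        short = fqdn.split(".")[0]
        for name in (fqdn, short):
            try:
                results[name] = resolve(name)
            except OSError:  # socket.gaierror is a subclass of OSError
                results[name] = None
    return results


def mismatches(hosts, results):
    """Return FQDNs that failed to resolve or whose short name resolves differently."""
    bad = []
    for fqdn in hosts:
        short = fqdn.split(".")[0]
        if results[fqdn] is None or results[fqdn] != results[short]:
            bad.append(fqdn)
    return bad


# Usage (hypothetical host names, replace with your own cluster members):
#   hosts = ["esx01.example.local", "esx02.example.local"]
#   print(mismatches(hosts, check_resolution(hosts)))
```

An empty list from mismatches means every host resolved and its FQDN and short name agree on the address.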

I didn't contact VMware. I haven't had very good luck with their support. I always get somebody who is less than knowledgeable in the topic I file the request for even after giving explicit details. I get much better support in the community.

I think I'll open one now, maybe there is some info to be gleaned from the logs.

I wonder if there's a CLI method to force an HA event? I see lots of goodies in /opt/vmware/aam/bin

If it was useful, give me credit

Jason White - VCP

polysulfide
Expert

After a little digging, it looks like the same SCSI Reservation error I had before.

If the SCSI bus is blocked, the failed host can't release its VMFS metadata locks or let the other hosts know its VMs are offline, and that could prevent HA from working properly.
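To see whether the reservation errors line up with the times the hosts dropped out, something like the Python 3 sketch below could count matching log lines per hour. Two things here are assumptions, not facts from my logs: the text to match ("reservation conflict") and that each line starts with a syslog-style timestamp such as "Mon DD HH:MM:SS". Adjust both to whatever your logs actually show.

```python
import re
from collections import Counter

# Assumed wording; change this to match the message in your own logs.
PATTERN = re.compile(r"reservation conflict", re.IGNORECASE)


def conflicts_per_hour(lines, pattern=PATTERN):
    """Count matching lines, bucketed by the first 9 characters of each line.

    With a syslog-style "Mon DD HH:MM:SS" prefix (an assumption), the first
    9 characters are the month, day and hour.
    """
    counts = Counter()
    for line in lines:
        if pattern.search(line):
            counts[line[:9]] += 1
    return counts


# Usage (the log path is an assumption, point it at your vmkernel log copy):
#   with open("/var/log/vmkernel") as f:
#       for bucket, n in sorted(conflicts_per_hour(f).items()):
#           print(bucket, n)
```

A burst of matches in the hour before a host went unresponsive would support the reservation theory; no matches would suggest looking elsewhere.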

If it was useful, give me credit

Jason White - VCP
