VMware Cloud Community
dhertanu
Contributor

HA issue in cluster

Hi,

I have two HP ProLiant DL360 Gen10 hosts running ESXi 6.5 in a cluster (VCSA 6.7). After a recent firmware update, the SD cards where ESXi resides started behaving strangely. Sometimes the SD cards aren't detected at boot; sometimes they are, but after booting /bootbank is linked to /tmp; and sometimes they work fine. I noticed that if I change the boot type from UEFI to Legacy BIOS, the SD cards behave normally. So after the last tests I migrated the VMs from one host to the other to avoid downtime, then rebooted the hosts and changed the boot type.
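A quick way to confirm the /bootbank symptom from the ESXi Shell is to check where the path actually resolves. On a healthy host /bootbank is a symlink to a volume under /vmfs/volumes, so resolving to /tmp matches the problem described above. This is a minimal sketch that assumes a Python interpreter is available in the ESXi Shell; the helper name `resolves_under` is hypothetical.

```python
import os


def resolves_under(path, root):
    """Return True if `path`, after following symlinks, is `root` or lies inside it."""
    target = os.path.realpath(path)
    base = os.path.realpath(root)
    # The common path of the two resolved paths equals `base`
    # only when `target` is `base` itself or somewhere below it.
    return os.path.commonpath([target, base]) == base


if __name__ == "__main__":
    # Assumption: run on the ESXi host itself.
    where = os.path.realpath("/bootbank")
    if resolves_under("/bootbank", "/tmp"):
        print("/bootbank resolves to " + where + " (under /tmp)")
    else:
        print("/bootbank resolves to " + where)
```

Run it right after a boot; if it reports a path under /tmp, the host did not mount the boot device as bootbank on that boot.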

After the reboot, one of the hosts is fine, but the other reports HA issues. Basically, the host is there and can communicate with vCenter, but it times out when vCenter tries to reconfigure it for HA.

I checked vpxa.log and fdm.log in particular, and looked through the rest of the logs, but I can't find anything obvious. I also set the boot type on that host back to UEFI, but it didn't help.

Could anyone point me in the right direction to troubleshoot this?

Thanks,

Daniel

10 Replies
NathanosBlightc
Commander

If you have only 2 hosts in the cluster, an HA warning is a normal reaction. When you reboot either host, the remaining host has no other cluster member to fail over to or migrate the VMs to.

Please mark my comment as the Correct Answer if this solution resolved your problem
dhertanu
Contributor

Maybe I wasn't clear enough... The HA issue exists with both hosts up & running.

FlyinRhino67
Contributor

Hey,

I think there was an issue with some versions of 6.5 that made the host temporarily unresponsive.

Which 6.5 build do you have?

Best Regards

Felix

a_p_
Leadership

Just a thought: unless you've already done so, disable HA and then enable it again to see whether that helps.


André

dhertanu
Contributor

Felix,

The hosts are running an HPE customized image, 6.5.0 Update 2, based on ESXi 6.5.0 Update 2, VMkernel release build 9298722.
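Since the fix in this thread turned out to be patching, it can help to record the exact build before and after updating. Running `vmware -v` on the host prints a line such as `VMware ESXi 6.5.0 build-9298722`; the sketch below parses that line into numbers. The function name is hypothetical, and it only extracts the version and build: it does not know which build fixes the issue.

```python
import re

# Matches lines such as "VMware ESXi 6.5.0 build-9298722"
_VERSION_RE = re.compile(r"ESXi\s+(\d+)\.(\d+)\.(\d+)\s+build-(\d+)")


def parse_esxi_build(version_line):
    """Return ((major, minor, patch), build) from `vmware -v` output, or None if no match."""
    m = _VERSION_RE.search(version_line)
    if m is None:
        return None
    major, minor, patch, build = (int(g) for g in m.groups())
    return (major, minor, patch), build
```

Comparing the parsed build numbers of both hosts makes it easy to spot a host that was left on an older build after a VUM remediation.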

Thanks,

Daniel

FlyinRhino67 (Accepted Solution)
Contributor

The issue I had at a customer's site was this: VMware Knowledge Base (KB74966)

Is it possible for you to update the hosts to 6.7, or to the latest 6.5 (HP image)?

There are also other driver issues described here: ESXi 6.5 U2 Hosts become unresponsive - VMKernel.log errors

Regards

Felix

dhertanu
Contributor

André,

I just tried that and it didn't change anything. One host was OK after re-enabling HA; the other one timed out.

Daniel

dhertanu
Contributor

Felix,

I'm trying to patch the hosts using VUM, but I ran into a network issue with outgoing traffic, so I have to fix that first.

Thanks,

Daniel

NathanosBlightc
Commander

Please check the /var/log/fdm.log file for the vSphere HA agent's operations; you may find something related to your issue there.
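fdm.log is verbose, so a simple keyword filter can narrow it down before reading it by hand. This is only a sketch: the keyword list is an assumption about what is worth a closer look, not a description of the fdm.log format, and the helper name `suspicious_lines` is hypothetical.

```python
import re

# Assumed keywords worth a closer look; adjust to taste.
_SUSPICIOUS = re.compile(r"\b(errors?|warning|failed|timed out|timeout)\b", re.IGNORECASE)


def suspicious_lines(lines):
    """Yield each line (without its trailing newline) that contains one of the keywords."""
    for line in lines:
        if _SUSPICIOUS.search(line):
            yield line.rstrip("\n")
```

On the host, passing an open handle for /var/log/fdm.log to `suspicious_lines` and printing the results gives a shorter list to start from; anything around the time the HA reconfiguration timed out is the most interesting part.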

dhertanu
Contributor

Felix,

I applied the latest updates for 6.5 and HA came back. Everything looks normal now.

Thank you everyone for the assistance.

Daniel
