VMware Cloud Community
dgingeri
Enthusiast

We keep getting "this virtual machine failed to become vsphere ha protected" on one cluster

I work for a cloud service provider, and we host many customer private clusters across the country on a single 6.5 U2g VCSA. One customer cluster, with 6 hosts and 60 VMs, keeps having VMs raise the error "this virtual machine failed to become vsphere ha protected...". It's easy to correct, but the fix is manual: turn HA off for the cluster and then turn it back on. It has happened 4 times in the last 3 days. It seems to break every time a VM migrates to a new host, and no, it is not always the same source or destination host. So I can't tell whether HA is actually working, and I don't want to find out during a host failure that it isn't, leaving a high-paying customer with 8-10 VMs down until they're manually restarted. The hosts are all physically identical, with the same firmware levels and the same ESXi version, 6.5 U3, with Enterprise+ licensing.

Is there a more permanent fix for this issue? I haven't been able to find anything in VMware's knowledge base other than the fix I'm already using, which seems to last until the next VM migration at best, and sometimes not at all.
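For anyone who wants to see which VMs vCenter currently reports as unprotected, here is a minimal Python sketch. It assumes the VM objects come from pyVmomi (or anything shaped like it), where `vm.runtime.dasVmProtection.dasProtected` carries the HA protection flag. The function name `unprotected_vms` is made up for illustration. It only reads what vCenter believes, so it cannot prove that a restart would actually happen after a host failure.

```python
from types import SimpleNamespace


def unprotected_vms(vms):
    """Return the names of VMs that do not report vSphere HA protection.

    Assumption: each item looks like a pyVmomi vim.VirtualMachine, i.e. it has
    .name and .runtime.dasVmProtection.dasProtected. A missing or unset
    dasVmProtection is treated as "not protected".
    """
    names = []
    for vm in vms:
        state = getattr(vm.runtime, "dasVmProtection", None)
        if state is None or not state.dasProtected:
            names.append(vm.name)
    return names


if __name__ == "__main__":
    # Stand-in objects with hypothetical names; real ones would come from a
    # pyVmomi container view over the cluster.
    demo = [
        SimpleNamespace(
            name="vm-a",
            runtime=SimpleNamespace(
                dasVmProtection=SimpleNamespace(dasProtected=True)
            ),
        ),
        SimpleNamespace(
            name="vm-b",
            runtime=SimpleNamespace(
                dasVmProtection=SimpleNamespace(dasProtected=False)
            ),
        ),
    ]
    print(unprotected_vms(demo))
```

Running something like this after each vMotion would at least show whether the flag flips on the migrated VM only or on others as well.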

11 Replies
scott28tt
VMware Employee

Moderator: Moved to Availability: HA & FT Discussions


-------------------------------------------------------------------------------------------------------------------------------------------------------------

Although I am a VMware employee I contribute to VMware Communities voluntarily (i.e. not in any official capacity)
VMware Training & Certification blog
nachogonzalez
Commander

Hey, hope you are doing fine

Can you disable HA and re-enable it? This sometimes solves the issue.
What does fdm.log have to say about this?
Which version of ESXi do you have?

Does the error match this? https://kb.vmware.com/s/article/2020082


dgingeri
Enthusiast

Yes, I have done that, 4 times in the last 3 days. It stays "fixed" only until the next VM migration.

nachogonzalez
Commander

Can you share fdm.log so we can check whether there is an issue there?
Also, can you tell us a little more about the HA configuration? How is admission control configured? Are you using any reservations? Which datastores are used for heartbeating, and are they selected automatically?

Lalegre
Virtuoso

Hey @dgingeri,

The first thing that looks wrong to me is that your ESXi hosts are on a higher version than vCenter. Always ensure that your vCenter Server version is equal to or higher than that of your ESXi hosts.

As @nachogonzalez said, fdm.log will have the details of your issue. When you check it, also take a look at this KB, because it applies to your version: https://kb.vmware.com/s/article/66928

RajeevVCP4
Expert

Can you provide the fdm.log file (/var/log/fdm.log), along with the timestamp of the error and the name of the affected VM?
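To narrow the log down before sharing it, a small Python sketch like the following may help. It assumes each fdm.log entry starts with an ISO-style timestamp such as `2020-05-12T14:03:21.123Z` (lines without one are skipped), and that the VM can be identified by a plain substring such as its display name or .vmx path. The function name `filter_fdm_log` is hypothetical.

```python
import re
from datetime import datetime

# Assumed prefix: YYYY-MM-DDTHH:MM:SS at the start of each log entry.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")


def filter_fdm_log(lines, vm_marker, start, end):
    """Return log lines that mention vm_marker and fall within [start, end].

    lines:     any iterable of strings, e.g. an open file handle
    vm_marker: substring identifying the VM (display name or .vmx path)
    start/end: naive datetime bounds in the same time zone as the log
    """
    hits = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # no leading timestamp, e.g. a continuation line
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        if start <= stamp <= end and vm_marker in line:
            hits.append(line.rstrip("\n"))
    return hits
```

Typical use would be to open a copy of /var/log/fdm.log taken from the host, pass the file handle as `lines`, and set the window to a few minutes either side of the time the alarm appeared.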

 

Rajeev Chauhan
VCIX-DCV6.5/VSAN/VXRAIL
Please mark as helpful or correct if my answer is useful to you
daphnissov
Immortal

If you work for a cloud provider, then you're probably in the VSPP program. Why don't you open a ticket with GSS instead since this impacts one of your customers?

andvm
Hot Shot

6.5 U2g VCSA

6.5 U3 ESXi

Am I seeing a mismatch here? The VCSA version should be the same as or higher than the ESXi version.

Lalegre
Virtuoso

6.5 U2g VCSA

6.5 U3 ESXi

The VCSA version should be the same or higher. You have 6.5 U2g for VCSA and 6.5 U3 for ESXi, so your VCSA is at a lower version than your hosts.
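The comparison can be sketched in a few lines of Python. This is only an illustration: it assumes version labels shaped like "6.5 U2g" or "6.5 U3" (major.minor, optional update number, optional patch letter), and the helper names `parse_version` and `vcsa_covers_esxi` are made up. The VMware interoperability matrix and build numbers remain the authoritative check.

```python
import re

# Assumed label shape: "<major>.<minor>" optionally followed by "U<n>[letter]".
LABEL = re.compile(r"^\s*(\d+)\.(\d+)\s*(?:U(\d+)([a-z]?))?\s*$", re.IGNORECASE)


def parse_version(label):
    """Turn a label such as '6.5 U2g' into a comparable tuple (6, 5, 2, 'g')."""
    match = LABEL.match(label)
    if not match:
        raise ValueError(f"unrecognized version label: {label!r}")
    major, minor, update, letter = match.groups()
    return (int(major), int(minor), int(update or 0), (letter or "").lower())


def vcsa_covers_esxi(vcsa_label, esxi_label):
    """True when the VCSA label is the same as or higher than the ESXi label."""
    return parse_version(vcsa_label) >= parse_version(esxi_label)


if __name__ == "__main__":
    # The versions from this thread: VCSA 6.5 U2g against ESXi 6.5 U3.
    print(vcsa_covers_esxi("6.5 U2g", "6.5 U3"))
```

Tuples compare element by element, so update 2 with patch letter "g" still sorts below update 3.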

Please also get the fdm.log as mentioned previously.

balzerb
Contributor

I realize this thread is over a year and a half old, but we recently started running into this exact problem on our cluster. The cluster has been in operation for over 2 years, and we had never encountered this error until recently. We ONLY encounter it when creating new VMs, either from a template or from scratch. The one recent change in the cluster is that we migrated from an aging Fibre Channel NetApp SAN to a new NetApp NFS array. I'm not sure whether that is merely coincidental, and I don't see why the new storage would cause this. Disabling and re-enabling vSphere HA resolves the issue, but it's still troubling that this has suddenly started happening.

Anyway, here are the specifics of our cluster:

  • 6 Dell PowerEdge servers running ESXi 6.7 Standard build 19195723
  • VCSA 6.7 Standard build 19299595

These are the latest versions of each, according to the VMware versions and builds info. Attached is an excerpt of the fdm.log for one of the affected VMs. I know this error is very low-risk, but I would still like to know why it has only recently started happening. I just don't want to run into a situation where a small problem grows into a larger one.

dgingeri
Enthusiast

I have since learned that when this happens, the alert only shows on freshly moved or new VMs in the cluster, but HA is not working on the ENTIRE cluster. There may be no alert saying HA is broken, but it does not bring VMs back up if a host fails.

I have not been able to find a real solution, either. The only way I've been able to correct this is to rebuild the hosts and the cluster: reformat and reinstall each host, then re-add the hosts to vCenter under a whole new cluster, one at a time, moving VMs over by removing them from the inventory of the old cluster and adding them to the new one, again one at a time. It's a pain in the behind, but it does eliminate the issue.
