Are you able to attach the vmware.log from one of the Linux machines from when it restarted? Please attach the log as a file rather than copying and pasting it here.
Also make sure it is from when the VM restarted.
Are the ESXi hosts certified/supported for ESXi 6.7 U1?
Are the hardware BIOS and the hardware components' firmware/driver versions in line with 6.7 U1?
Is the issue observed for VMs on a particular ESXi host or on all ESXi hosts in the cluster?
What VM hardware version is in use on the affected VMs?
Is any particular flavor of Linux repeatedly facing the issue, or are all Linux VMs affected?
Are there VMs running other OSes, and have any odd issues been reported on them?
The hosts are certified and supported for ESXi 6.7 U1, and all BIOS, drivers and firmware were supported for 6.7 U1 as well.
The last suggestion from GSS has been to update the drivers and firmware to the latest available versions. We are now actioning this and will monitor the VMs to see if there are any changes.
I am still intrigued, though, as to why an old driver/firmware would cause VMs to reboot, especially the Linux ones.
I'll keep you guys posted on this one.
After applying the drivers and upgrading firmware on the hosts, we are still seeing the VMs being sporadically restarted by HA. The case has been escalated to a P1 with GSS, and after a couple of phone calls with GSS they have confirmed that there are 5 other customers reporting the same issue on hardware from other vendors. We have other clusters in the same environment that are not impacted, and other customers running vSphere 6.7 U1 who are not impacted by this issue either.
GSS are still working on the root cause, but it looks like it could be due to an issue when VMs are migrated from ESXi 6.5 to ESXi 6.7.
Further updates to follow.
An update on this issue - Engineering have an ESXi patch that provides additional debugging that they'd like to apply. Of the 6 SRs for this issue, one customer has applied the patch above and 4 have reverted their environments to 6.5.
Further updates to follow.
Thanks for keeping the thread alive. Looking forward to seeing what the Engineering team finds here.
One thing I've confirmed with the team here is that all the VMs that are getting restarted are running virtual hardware versions 10 and below. Not sure if this
This information is being sent to the VMware engineering team to be added into the RCA.
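For anyone who wants to run the same check against their own inventory, here is a minimal Python sketch. It assumes you have already exported VM names together with their hardware version identifiers in the usual `vmx-NN` form; the inventory below and the function names are made up for illustration and are not taken from our environment.

```python
import re


def hw_version_number(version_id):
    """Return the numeric part of an identifier such as 'vmx-10', or None."""
    match = re.fullmatch(r"vmx-(\d+)", version_id.strip().lower())
    return int(match.group(1)) if match else None


def vms_at_or_below(inventory, threshold=10):
    """List the VMs whose hardware version is at or below the threshold."""
    selected = []
    for name, version_id in inventory.items():
        number = hw_version_number(version_id)
        # Skip entries whose version string could not be parsed.
        if number is not None and number <= threshold:
            selected.append(name)
    return sorted(selected)


# Hypothetical inventory export (VM name -> hardware version identifier).
inventory = {
    "web01": "vmx-10",
    "db01": "vmx-13",
    "app01": "vmx-08",
    "legacy01": "vmx-07",
}

old_vms = vms_at_or_below(inventory)
print(old_vms)
```

Comparing that list with the VMs that HA has restarted would show whether the pattern holds beyond the ones already checked by hand.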
Your response has prompted more questions from me:
Have you tested upgrading an affected VM's HW version to 13 or above to see whether the issue still occurs?
When you upgraded ESXi from 6.5 to 6.7, were the cluster EVC settings changed, or did you enable VM-based EVC, after which you started seeing this issue?
The restart issue is occurring only once on every VM. So far we've had 90+ VMs restart but not one VM has been restarted more than once.
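If you want to verify the "only once per VM" pattern yourself, a rough Python sketch is below. It assumes the HA restart events have been exported as (VM name, timestamp) pairs; the sample events and function names are hypothetical and only there to show the idea.

```python
from collections import Counter


def restart_counts(events):
    """Count how many restart events were recorded for each VM."""
    return Counter(vm_name for vm_name, _timestamp in events)


def restarted_more_than_once(events):
    """Return the VMs that appear in more than one restart event."""
    counts = restart_counts(events)
    return sorted(vm for vm, count in counts.items() if count > 1)


# Hypothetical export of HA restart events: (VM name, timestamp).
events = [
    ("web01", "2019-01-10T02:14:00"),
    ("db01", "2019-01-11T17:40:00"),
    ("app01", "2019-01-13T09:05:00"),
]

repeat_offenders = restarted_more_than_once(events)
print(len(restart_counts(events)), "VMs restarted;", "repeats:", repeat_offenders)
```

An empty repeats list matches what we are seeing: many VMs restarted, but none of them twice.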
VMware engineering have confirmed that the crash backtrace is the same for all of the reported cases, with the same memory fault, and that it has nothing to do with the hardware version.
VMware GSS have also provided a patch for ESXi, which we will be rolling out to the cluster tonight. If the pattern continues, VMs will continue to get reset, which will allow us to collect the additional information engineering needs to look further into the issue.
More updates to follow.
The patch provided by VMware GSS has been applied to the cluster, and after applying it we are still seeing VMs being reset by HA. Logs for the corresponding VMs and hosts have been uploaded to the engineering team and they are currently reviewing them.
We hope to hear back from them soon, and I will post an update as soon as I do.
Just an update on this case - The customer had requested us to roll back half the cluster to 6.5 as they could not tolerate further VMs crashing.
VMware GSS provided us with a debug patch for 6.7 that adds the logging they need to investigate the issue further. However, all our attempts to install it on one host failed, and the build number on the host wasn't changing. After multiple phone calls with GSS to resolve this, VMware engineering have now supplied us with another image, which has actually worked, and the build number is now updated on the host. Unfortunately, it has taken so long that the customer has now not experienced any VM HA reset events in the last 2 weeks.
We are still planning to roll this out to all 6.7 hosts this week, and if we encounter the issue again we will upload logs to GSS.