ESXi Upgrade to 6.7 U1 causing VMs to restart

oneilv · ‎04-29-2019

Hey guys,

We recently upgraded ESXi on all the servers in our cluster from ESXi 6.5 to ESXi 6.7 Update 1 (EP7 build), and since then multiple VMs (Linux VMs) have been restarting at random. The reboots started with the upgrade and were also seen during vMotion of VMs to other hosts while the upgrade was in progress. Even today, 4 days after the upgrade, some VMs are still rebooting with the error below.

The error we see in the events is:

vmware esx unrecoverable error (vcpu-2) vmk: unable to decompress BPN (I've attached a screenshot as well)

Has anyone come across this?

Cheers,

Onil Varghese

ThompsG · ‎04-29-2019

Hi oneilv,

Are you able to attach the vmware.log from one of the Linux machines from when it restarted? Please attach the log file itself rather than copying and pasting its contents here.

Also make sure it is from when the VM restarted.

Kind regards.

oneilv · ‎04-29-2019

Sure, here you go ThompsG

oneilv · ‎04-29-2019

VM logs attached as requested ThompsG

pragg12 · ‎04-30-2019

Hi,

Are the ESXi hosts certified/supported for ESXi 6.7 U1?

Are the hardware BIOS and the component firmware/driver versions in line with 6.7 U1?

Is the issue observed for VMs on one particular ESXi host or on all ESXi hosts in the cluster?

What VM hardware version is in use on the affected VMs?

Is one particular flavor of Linux repeatedly affected, or are all Linux VMs facing the issue?

Are there VMs running other operating systems, and have any odd issues been reported on them?

Consider marking this response as "Correct" or "Helpful" if you think my response helped you in any way.

oneilv · ‎05-02-2019

Hey pragg12,

The hosts are certified and supported for ESXi 6.7 U1, and the BIOS, drivers and firmware are all supported for 6.7 U1 as well.

The latest suggestion from GSS is to update the drivers and firmware to the latest available versions. We are now actioning this and will monitor the VMs to see whether anything changes.

I am still intrigued, though, as to why an old driver or firmware would cause VMs to reboot, especially the Linux ones.

I'll keep you guys posted on this one.

Cheers, Onil

oneilv · ‎05-06-2019

Hi Guys,

After updating the drivers and firmware on the hosts, we are still seeing VMs being sporadically restarted by HA. The case has been escalated to a P1 with GSS, and after a couple of phone calls GSS have confirmed that 5 other customers are reporting the same issue on other vendors' hardware. We have other clusters in the same environment that are not impacted, and other customers running vSphere 6.7 U1 who are not impacted by this issue either.

GSS are still working on the root cause, but it looks like it could be due to an issue when VMs are migrated from ESXi 6.5 to ESXi 6.7.

Further updates to follow

Cheers, Onil

oneilv · ‎05-08-2019

Hey Guys,

An update on this issue - Engineering have an ESXi patch that provides additional debugging that they'd like to apply. Of the 6 SRs for this issue, one customer has applied the patch above and 4 have reverted their environments to 6.5.

Further updates to follow.

Cheers, Onil

pragg12 · ‎05-09-2019

Thanks for keeping the thread alive. Looking forward to seeing what the Engineering team finds here.


oneilv · ‎05-09-2019

Hey guys,

One thing I've confirmed with the team here is that all the VMs that are getting restarted are running virtual hardware version 10 or below. Not sure if this is relevant.

This information is being sent to the VMware engineering team to be added into the RCA.
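For anyone wanting to check their own inventory against the same pattern, here is a minimal sketch. It assumes you have already exported VM names and hardware versions in the `vmx-NN` form; the `inventory` dictionary and both function names are hypothetical and for illustration only.

```python
import re


def hw_version_number(version: str) -> int:
    """Turn a hardware version string such as 'vmx-10' into the integer 10."""
    match = re.fullmatch(r"vmx-(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognised hardware version: {version!r}")
    return int(match.group(1))


def vms_at_or_below(inventory: dict[str, str], max_version: int = 10) -> list[str]:
    """Return the names of VMs whose hardware version is <= max_version, sorted."""
    return sorted(
        name
        for name, version in inventory.items()
        if hw_version_number(version) <= max_version
    )


# Hypothetical sample data, not taken from the affected cluster.
inventory = {
    "web01": "vmx-08",
    "db01": "vmx-10",
    "app01": "vmx-13",
    "app02": "vmx-14",
}

print(vms_at_or_below(inventory))  # → ['db01', 'web01']
```

Comparing this list with the list of VMs that were reset would show whether the correlation holds in a given environment.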

Cheers, Onil

pragg12 · ‎05-09-2019

Your response has prompted more questions from me:

Have you tested upgrading an affected VM's hardware version to 13 or above to see whether the issue still occurs?

When you upgraded ESXi from 6.5 to 6.7, were the cluster EVC settings changed, or did you enable per-VM EVC, after which you started seeing this issue?


oneilv · ‎05-09-2019

Hi pragg12

The restart issue is occurring only once on every VM. So far we've had 90+ VMs restart but not one VM has been restarted more than once.
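A quick way to verify the "once per VM" pattern from an exported event list could look like the sketch below. The `(vm_name, timestamp)` tuple layout and the sample events are assumptions for illustration; they are not real data from this case.

```python
from collections import Counter


def reset_counts(events: list[tuple[str, str]]) -> Counter:
    """Count HA reset events per VM from (vm_name, timestamp) pairs."""
    return Counter(vm_name for vm_name, _timestamp in events)


def repeat_offenders(events: list[tuple[str, str]]) -> list[str]:
    """Return the sorted names of VMs that were reset more than once."""
    return sorted(vm for vm, count in reset_counts(events).items() if count > 1)


# Hypothetical events; each VM appears exactly once, matching the pattern described.
events = [
    ("lnx-app-01", "2019-05-07T02:14:00"),
    ("lnx-db-03", "2019-05-07T09:41:00"),
    ("lnx-web-12", "2019-05-08T17:05:00"),
]

print(len(reset_counts(events)))  # → 3
print(repeat_offenders(events))   # → []
```

An empty result from `repeat_offenders` is consistent with the observation that no VM has been reset twice.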

VMware engineering have confirmed that the crash backtrace is the same for all the reported cases, showing the same memory fault, and that it has nothing to do with the hardware version.

VMware GSS have also provided a patch for ESXi, which we will be rolling out to the cluster tonight. If the pattern continues, VMs will continue to get reset, which will let us collect the additional information engineering need to look into the issue further.

More updates to follow.

Cheers, Onil

pragg12 · ‎05-16-2019

Hi oneilv

Do you have any further updates on this issue ?


oneilv · ‎05-16-2019

Hi pragg12,

The patch provided by VMware GSS has been applied to the cluster, and we are still seeing VMs being reset by HA. Logs for the corresponding VMs and hosts have been uploaded to the engineering team, who are currently reviewing them.

We hope to hear back from them soon, and I will post an update as soon as I do.

Cheers, Onil

oneilv · ‎06-18-2019

Hey all,

Just an update on this case: the customer asked us to roll back half the cluster to 6.5, as they could not tolerate further VM crashes.

VMware GSS provided us with a debug patch for 6.7 that adds the extra logging they need to investigate further. However, all our attempts to install it on a host failed, and the build number on the host did not change. After multiple phone calls with GSS to resolve this, VMware engineering supplied another image, which has actually worked, and the build number on the host is now updated. Unfortunately, this took so long that the customer has not experienced any VM HA reset events in the last 2 weeks.

We are still planning to roll this out to all 6.7 hosts this week, and if we encounter the issue again we will upload logs to GSS.

Cheers,

Onil

oneilv · ‎06-23-2019

Hi all,

Just to update you all, the debug patch has been applied to all the 6.7 hosts in the cluster, and we have seen VMs being reset in the last week. Some of those resets happened on patched hosts, and the logs are now with VMware engineering. Hopefully they can find the cause of this issue soon.

I will post further updates when we hear back from VMware.

Cheers,

Onil Varghese

oneilv · ‎06-24-2019

Hi All,

The root cause for this issue has been identified by VMware. Please note their statement below:

The root cause for the VM crashing is due to the zlib module which was upgraded to 1.2.11 from 1.1.4 in ESXi 6.7. After upgrading there are few issues with Memory Compression Optimizations. We have changed the code to re-introduce this memory optimization in ESXi 6.7 U3.
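To illustrate in general terms what a failed decompression of a compressed memory page means, here is a small sketch using Python's standard `zlib` module. This is not how ESXi implements memory compression; `PAGE_SIZE`, `compress_page` and `decompress_page` are hypothetical names, and the truncated blob simply stands in for any compressed page that can no longer be decoded.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for the illustration


def compress_page(page: bytes) -> bytes:
    """Compress one page of memory contents."""
    return zlib.compress(page)


def decompress_page(blob: bytes) -> bytes:
    """Decompress a page; zlib.error is raised if the stream is invalid."""
    page = zlib.decompress(blob)
    if len(page) != PAGE_SIZE:
        raise ValueError("decompressed data is not exactly one page")
    return page


page = bytes(range(256)) * 16  # 4096 bytes of sample data
blob = compress_page(page)

# A healthy round trip returns the original page unchanged.
assert decompress_page(blob) == page

# A damaged stream cannot be decoded, so the original page is lost.
try:
    decompress_page(blob[:-8])
    recovered = True
except zlib.error:
    recovered = False

print("zlib version linked into Python:", zlib.ZLIB_VERSION)
print("damaged page recovered:", recovered)
```

When the consumer of that page is a running guest, there is no way to carry on without its contents, which fits the "unrecoverable error" wording in the events.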

We are waiting for VMware to send us an official KB article about this. At this stage we recommend that anyone looking to upgrade to 6.7 wait until 6.7 U3 is released.

Cheers,

Onil Varghese

pragg12 · ‎06-27-2019

Thanks for keeping the thread alive. Looking forward to further updates on this.


pragg12 · ‎09-08-2019

oneilv, any further update on this thread?


oneilv · ‎09-08-2019

Hi pragg12

Please have a look at the post below from my colleague Matt, which contains more details about the issue.

https://virtualtassie.com/2019/quick-post-vsphere-6-7-sporadic-vm-resets-by-vsphere-ha/

If you need any further information please let me know.

Cheers, Onil