Ubuntu VMs freezing since vSphere ESXi 7u3c update

basicmonkey · ‎02-23-2022

Since upgrading our two main hosts to v7 update 3c, I'm getting VM freezes on a couple of VMs. One is happening every few hours.

vSphere reports "The CPU has been disabled by the guest operating system. Power off or reset the virtual" at the time of the lockup. VM needs reset to continue.

One VM on shared storage has had this happen once since the upgrade. The other one, on local SSD, is suffering repeatedly. I've moved the repeat offender to shared storage to see if it happens again.

Both VMs are hardware version 13. I've tried upgrading the repeat offender to 15 and it made no difference.

All my VMs are using NVME storage mode. Only these two have shown this issue, and never before u3c upgrade.

Both VMs had this error in syslog about 15 mins before lockups:

Feb 23 00:46:07 mon kernel: [36834.916147] nvme nvme0: I/O 213 QID 1 timeout, aborting
Feb 23 00:46:07 mon kernel: [36834.916287] nvme nvme0: Abort status: 0x0
Feb 23 00:46:37 mon kernel: [36865.123128] nvme nvme0: I/O 213 QID 1 timeout, reset controller
Feb 23 00:46:37 mon kernel: [36865.165409] nvme nvme0: 15/0/0 default/read/poll queues
Feb 23 02:34:22 mon kernel: [43329.673158] nvme nvme0: I/O 100 QID 4 timeout, aborting
Feb 23 02:34:22 mon kernel: [43329.673393] nvme nvme0: Abort status: 0x0
Feb 23 02:34:52 mon kernel: [43359.880150] nvme nvme0: I/O 100 QID 4 timeout, reset controller
Feb 23 02:34:52 mon kernel: [43359.926041] nvme nvme0: 15/0/0 default/read/poll queues
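
For anyone wanting to check their own guests for the same pattern, here is a rough Python sketch that picks the NVMe timeout messages out of a log. The regex is written against the exact message format in the lines above and may need adjusting for other kernel versions; the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches kernel messages such as:
#   nvme nvme0: I/O 213 QID 1 timeout, aborting
#   nvme nvme0: I/O 213 QID 1 timeout, reset controller
# (format taken from the log excerpt above; other kernels may word it differently)
NVME_TIMEOUT = re.compile(
    r"nvme (?P<ctrl>nvme\d+): I/O (?P<tag>\d+) QID (?P<qid>\d+) timeout, (?P<action>.+)$"
)


def find_nvme_timeouts(lines):
    """Return one dict per NVMe timeout message found in the given log lines."""
    events = []
    for line in lines:
        match = NVME_TIMEOUT.search(line.rstrip("\n"))
        if match:
            events.append({
                "controller": match.group("ctrl"),
                "tag": int(match.group("tag")),
                "qid": int(match.group("qid")),
                "action": match.group("action").strip(),
            })
    return events


def summarize(events):
    """Count timeout events per (controller, action) pair."""
    return Counter((e["controller"], e["action"]) for e in events)
```

On Ubuntu this could be fed an open handle on the syslog file (assumed here to be /var/log/syslog), e.g. `summarize(find_nvme_timeouts(open("/var/log/syslog", errors="replace")))`.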

Versions:

Hypervisor: VMware ESXi, 7.0.3, 19193900
Model: PowerEdge R640
Processor Type: Intel(R) Xeon(R) Gold 6226 CPU @ 2.70GHz
Ubuntu 20.04 LTS
Linux Kernel 5.4.0-100-generic x86_64

Many thanks in advance!

basicmonkey · ‎02-24-2022

Update: VMs dropping like flies, every day. Both on shared storage and local SSD. Going to have to re-image one of my hosts back to 7u2 as they were upgraded using patches. Ugh.

basicmonkey · ‎02-24-2022

Update:

Also NVME errors on my pfSense VM (FreeBSD 12).

I've re-imaged a host to 7u2c and am soaking it for 24 hours to verify it's u3 related. If there are any issues, I'll revert to u2a, which is where the hosts were before I started the upgrade process.

basicmonkey · ‎02-25-2022

No issues on same host back on U2e.

basicmonkey · ‎03-07-2022

Is nobody else having these issues with U3? Seems to be unusable for me at the moment...

michaelrash · ‎03-13-2022

I am seeing this same issue. I added a "SATA controller" and then changed the drive to "SATA controller 0", and my issue has gone away. I have not dug into the issue further.

basicmonkey · ‎03-18-2022

This still exists even with the latest patches (including one NVME driver patch).

Seems to only happen with a few VMs running on the host. Huge amount of headroom, only 15% CPU use and 50GB RAM left.

Would be great to know that VMware are aware!

vanlisur · ‎03-21-2022

I have also faced the same issue and am now trying to roll back the update.

RNadmin · ‎03-30-2022

We're experiencing the same.

Very similar setup to you.

We only see it on selected Ubuntu VMs.

I'm running VCF 4.4, so rolling back to an earlier ESXi version is not an option.

basicmonkey · ‎03-30-2022

Glad it's not just me. It's also happening on my pfSense VM, which is FreeBSD 12, so it's not just an Ubuntu thing.

Would really like to hear that VMware are aware and looking into it.

SusAdmin1 · ‎03-30-2022

For the sake of argument, can you try running PV controllers?

basicmonkey · ‎03-30-2022

This issue doesn't affect the few VMs that are running on SCSI controllers. We moved most over to NVME a while ago to improve performance and it did significantly.

SusAdmin1 · ‎03-30-2022

So then would you say the issue is with NVME in this patch? I only ask because I'm running over 700 Ubuntu VMs and really do not want those kinds of problems. But I want to get these security patches applied all the same.

basicmonkey · ‎03-30-2022

Yes, 98%.

Since the VMs all seem to report an issue with the NVME drive just before they die, it would make sense.

It seems to affect VMs rather randomly, and some are more susceptible than others (can't pin down why).

The commonality is:

NVME storage controller used for boot drive
*nix OS
ESXi 7 u3c
Affects both local SSD and NFS storage (not tried on VSAN)

I've migrated all VMs to 7 u2e hosts and 0 problems. Within 24 hours of migrating to a u3c host I'll lose at least 2 VMs.
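
For anyone checking whether a Linux guest matches the first item in that list, a rough sketch: NVMe-attached disks show up under /sys/block with names starting with "nvme" (for example nvme0n1), while SCSI and SATA disks show up as sd*. The function name is made up for illustration, and this only lists NVMe disks; it does not tell you which one the guest boots from.

```python
import os


def nvme_block_devices(sys_block="/sys/block"):
    """Return the sorted names of block devices whose name starts with 'nvme'."""
    return sorted(
        name for name in os.listdir(sys_block) if name.startswith("nvme")
    )
```

Calling `nvme_block_devices()` inside the guest returns an empty list if no disk is attached through an NVMe controller.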

RNadmin · ‎03-31-2022

I am completely on VSAN.

I have an application which is clustered between 3 VMs (at least one of these VMs was freezing daily). I have changed all the boot drives of these VMs to SCSI controllers (I have left the application drives on NVME). So far the cluster has been stable for 24 hours.

Out of interest, which version of VMTools are you running on the problematic VMs? (I am currently on OPEN-VM-TOOLS 11.3.0. I know 11.3.5 and 12.0.0 are available, but I can't upgrade at this point.)

Arnaud_L · ‎03-31-2022

Do you have any pending updates for the SATA storage firmware (on iDRAC)?

RNadmin · ‎03-31-2022

No Firmware upgrades pending for me.

As I am running full VCF, I am fully upgraded to everything within that. I also have no SATA backed disks other than the ESXi hypervisor itself. (Also VSAN datastore disks are SAS backed).

paudieo · ‎03-31-2022

Please file a VMware support ticket with VMware global services.
There is a similar issue being tracked internally that matches some of the symptoms quoted here.

JohnMorin · ‎05-01-2022

I'm having the exact same issue with an Oracle Linux 8.4 VM.

Currently, the technician assigned to our case isn't working on it, saying it's a VM-related issue. SR# 22322639904

basicmonkey · ‎05-03-2022

Are there any updates with the investigations into this?

Many thanks.