VMware Cloud Community
basicmonkey
Enthusiast

Ubuntu VMs freezing since vSphere ESXi 7u3c update

Since upgrading our two main hosts to v7 update 3c, I'm getting VM freezes on a couple of VMs. One is happening every few hours.

vSphere reports "The CPU has been disabled by the guest operating system. Power off or reset the virtual machine." at the time of the lockup. The VM needs a reset to continue.

One VM on shared storage has had this happen once since the upgrade. The other one, on local SSD, is suffering repeatedly. I've moved the repeat offender to shared storage to see if it happens again.

Both VMs are hardware version 13. I've tried upgrading the repeat offender to version 15 and it made no difference.

All my VMs use NVMe storage controllers. Only these two have shown this issue, and never before the u3c upgrade.

Both VMs logged this error in syslog about 15 minutes before the lockups:

Feb 23 00:46:07 mon kernel: [36834.916147] nvme nvme0: I/O 213 QID 1 timeout, aborting
Feb 23 00:46:07 mon kernel: [36834.916287] nvme nvme0: Abort status: 0x0
Feb 23 00:46:37 mon kernel: [36865.123128] nvme nvme0: I/O 213 QID 1 timeout, reset controller
Feb 23 00:46:37 mon kernel: [36865.165409] nvme nvme0: 15/0/0 default/read/poll queues
Feb 23 02:34:22 mon kernel: [43329.673158] nvme nvme0: I/O 100 QID 4 timeout, aborting
Feb 23 02:34:22 mon kernel: [43329.673393] nvme nvme0: Abort status: 0x0
Feb 23 02:34:52 mon kernel: [43359.880150] nvme nvme0: I/O 100 QID 4 timeout, reset controller
Feb 23 02:34:52 mon kernel: [43359.926041] nvme nvme0: 15/0/0 default/read/poll queues
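
For anyone wanting to check their own guests for the same pattern, here is a minimal Python sketch that scans syslog lines for NVMe timeout messages and measures the gap between each "aborting" line and the matching "reset controller" line. The regular expression is modelled only on the lines quoted above, so the function names and the assumption that other kernels print the same format are mine, not anything confirmed in this thread.

```python
import re

# Pattern modelled on the guest kernel lines quoted above, e.g.
# "Feb 23 00:46:07 mon kernel: [36834.916147] nvme nvme0: I/O 213 QID 1 timeout, aborting"
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+"
    r"nvme (?P<ctrl>nvme\d+): I/O (?P<tag>\d+) QID (?P<qid>\d+) timeout, "
    r"(?P<action>aborting|reset controller)"
)


def find_nvme_timeouts(lines):
    """Return one dict per NVMe timeout line found in `lines`."""
    events = []
    for line in lines:
        m = TIMEOUT_RE.match(line)
        if m:
            events.append({
                "timestamp": m["ts"],
                "uptime": float(m["uptime"]),   # seconds since guest boot
                "controller": m["ctrl"],
                "tag": int(m["tag"]),
                "qid": int(m["qid"]),
                "action": m["action"],
            })
    return events


def abort_to_reset_delays(events):
    """Seconds between an abort attempt and the later controller reset
    for the same (controller, queue, I/O tag)."""
    pending = {}
    delays = []
    for ev in events:
        key = (ev["controller"], ev["qid"], ev["tag"])
        if ev["action"] == "aborting":
            pending[key] = ev["uptime"]
        elif key in pending:
            delays.append(round(ev["uptime"] - pending.pop(key), 3))
    return delays


if __name__ == "__main__":
    # /var/log/syslog is the usual Ubuntu location; adjust for other distros.
    with open("/var/log/syslog", errors="replace") as fh:
        found = find_nvme_timeouts(fh)
    for ev in found:
        print(ev["timestamp"], ev["controller"], "QID", ev["qid"], ev["action"])
    print("abort-to-reset delays (s):", abort_to_reset_delays(found))
```

In the excerpt above, each reset follows its abort by roughly 30 seconds (00:46:07 to 00:46:37, and 02:34:22 to 02:34:52).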

 

Versions:

  • Hypervisor:VMware ESXi, 7.0.3, 19193900
  • Model:PowerEdge R640
  • Processor Type:Intel(R) Xeon(R) Gold 6226 CPU @ 2.70GHz
  • Ubuntu 20.04 LTS
  • Linux Kernel 5.4.0-100-generic x86_64

Many thanks in advance!

27 Replies
basicmonkey
Enthusiast

Update: VMs dropping like flies, every day. Both on shared storage and local SSD. Going to have to re-image one of my hosts back to 7u2 as they were upgraded using patches. Ugh.

basicmonkey
Enthusiast

Update:

I'm also seeing NVMe errors on my pfSense VM (FreeBSD 12).

I've re-imaged a host to 7u2c and am soaking it for 24 hours to verify the problem is u3 related. If any issues appear, I will revert to u2a, which is where the hosts were before I started the upgrade process.

basicmonkey
Enthusiast

No issues on same host back on U2e.

basicmonkey
Enthusiast

Is nobody else having these issues with U3? Seems to be unusable for me at the moment...

michaelrash
Contributor

I am seeing this same issue. I added a "SATA controller" and then moved the drive to "SATA controller 0", and the issue has gone away. I have not dug into it further.

basicmonkey
Enthusiast

This still exists even with the latest patches (including one NVME driver patch).

It seems to happen with only a few of the VMs running on the host. There is a huge amount of headroom: only 15% CPU use and 50 GB of RAM free.

It would be great to know that VMware is aware!

vanlisur
Contributor

I have also faced the same issue, and I am now trying to roll back the update.

RNadmin
Contributor

We're experiencing the same.

Very similar setup to you.

We only see it on selected Ubuntu VMs.

I'm running VCF 4.4, so rolling back to an earlier ESXi version is not an option.

basicmonkey
Enthusiast

Glad it's not just me. It's also happening on my pfSense VM, which is FreeBSD 12, so it's not just an Ubuntu thing.

Would really like to hear that VMware are aware and looking into it.

SusAdmin1
Contributor

For the sake of argument, can you try running paravirtual (PVSCSI) controllers?

basicmonkey
Enthusiast

This issue doesn't affect the few VMs that are running on SCSI controllers. We moved most VMs over to NVMe a while ago to improve performance, and it did so significantly.

SusAdmin1
Contributor

So would you say the issue is with NVMe in this patch? I only ask because I'm running over 700 Ubuntu VMs and really do not want those kinds of problems. But I want to get these security patches applied all the same.

basicmonkey
Enthusiast

Yes, 98%.

Since the VMs all seem to report an issue with the NVMe drive just before they die, it would make sense.

It seems to affect VMs rather randomly, and some are more susceptible than others (I can't pin down why).

The commonality is:

  • NVME storage controller used for boot drive
  • *nix OS
  • ESXi 7 u3c
  • Affects both local SSD and NFS storage (not tried on VSAN)
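
For anyone with a large fleet who wants to see which VMs match the first item in that list, here is a rough Python sketch that reads a VM's .vmx file and groups its virtual disks by controller type. It assumes the usual `key = "value"` layout with disk entries such as `nvme0:0.fileName`, `scsi0:0.fileName` and `sata0:0.fileName`. The function names and the sample file names are hypothetical, and it does not try to work out which disk is the boot disk.

```python
import re

# Assumed .vmx disk entry shape: <type><bus>:<unit>.fileName = "<file>"
DISK_RE = re.compile(
    r'^\s*(?P<type>nvme|scsi|sata|ide)(?P<bus>\d+):(?P<unit>\d+)\.fileName'
    r'\s*=\s*"(?P<file>[^"]*)"',
    re.IGNORECASE,
)


def disks_by_controller(vmx_text):
    """Map controller type ('nvme', 'scsi', ...) to the .vmdk files on it."""
    result = {}
    for line in vmx_text.splitlines():
        m = DISK_RE.match(line)
        # Skip CD-ROM images and other non-disk backings.
        if m and m["file"].lower().endswith(".vmdk"):
            result.setdefault(m["type"].lower(), []).append(m["file"])
    return result


def has_nvme_disk(vmx_text):
    """True if at least one virtual disk sits on a virtual NVMe controller."""
    return bool(disks_by_controller(vmx_text).get("nvme"))


if __name__ == "__main__":
    # Hypothetical sample, not taken from any VM in this thread.
    sample = '''
    virtualHW.version = "13"
    nvme0.present = "TRUE"
    nvme0:0.fileName = "mon.vmdk"
    scsi0:0.fileName = "data.vmdk"
    sata0:0.fileName = "installer.iso"
    '''
    print(disks_by_controller(sample))
    print("NVMe disk present:", has_nvme_disk(sample))
```

Running this over copies of the .vmx files would only flag candidates. Whether a flagged VM actually hits the freeze is a separate question.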

I've migrated all VMs to 7 u2e hosts and 0 problems. Within 24 hours of migrating to a u3c host I'll lose at least 2 VMs.

RNadmin
Contributor

I am completely on VSAN.

I have an application that is clustered across 3 VMs (at least one of these VMs was freezing daily). I have changed the boot drives of all these VMs to SCSI controllers (I have left the application drives on NVMe). So far the cluster has been stable for 24 hours.

Out of interest, which version of VMware Tools are you running on the problematic VMs? (I am currently on open-vm-tools 11.3.0. I know 11.3.5 and 12.0.0 are available, but I can't upgrade at this point.)
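
After a controller change like the one described above, it can help to confirm from inside a Linux guest which disks ended up where. A small Python sketch, assuming the standard Linux naming in which NVMe namespaces appear as `nvme*` and disks behind SCSI or SATA controllers both appear as `sd*`. The function name and the grouping labels are my own.

```python
import os


def classify_block_devices(sys_block="/sys/block"):
    """Group block device names by kernel naming prefix."""
    groups = {"nvme": [], "scsi_or_sata": [], "other": []}
    for name in sorted(os.listdir(sys_block)):
        if name.startswith("nvme"):
            groups["nvme"].append(name)          # e.g. nvme0n1
        elif name.startswith("sd"):
            groups["scsi_or_sata"].append(name)  # e.g. sda
        else:
            groups["other"].append(name)         # loop, sr, dm-, ...
    return groups


if __name__ == "__main__":
    for kind, names in classify_block_devices().items():
        print(f"{kind}: {', '.join(names) if names else '(none)'}")
```

This only reports device names. Mapping a name to the root filesystem still needs something like `lsblk` or `/proc/mounts`, especially when LVM is in use.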

 

Arnaud_L
Enthusiast

Do you have any pending updates for the SATA storage firmware (in iDRAC)?

RNadmin
Contributor

No Firmware upgrades pending for me.

As I am running full VCF, I am fully upgraded to everything within that. I also have no SATA backed disks other than the ESXi hypervisor itself. (Also VSAN datastore disks are SAS backed).

paudieo
VMware Employee

Please file a VMware support ticket with VMware global services.
There is a similar issue being tracked internally that matches some of the symptoms quoted here.

JohnMorin
Contributor

I'm having the exact same issue with an Oracle Linux 8.4 VM.

Currently, the technician assigned to our case is not working on it, saying it is a VM-related issue. SR# 22322639904

basicmonkey
Enthusiast

Are there any updates with the investigations into this?

Many thanks.
