Rosivan
Contributor

ESXi 6.7 U3 host crashes after starting Oracle Linux 7 U9 with Unbreakable Enterprise Kernel

After installing Oracle Linux on a VM on an ESXi 6.7 U3 host, I ran into this issue:

The ESXi 6.7 U3 host crashes with a purple screen when a new Oracle Linux 7 U9 VM is started with the UEK.

WhatsApp Image 2021-02-12 at 09.50.42.jpeg

When I start the VM with the RHCK, everything runs fine.

Has anyone seen something like this?

6 Replies
vbondzio
VMware Employee

That's not supposed to happen ... can you open an SR, upload host logs and DM me the number?

Rosivan
Contributor

Yeah, I've never seen that before.

Our support contract has expired, and I'm trying to renew it.

This weekend I tried migrating the .vmdk to another datastore on a different storage system, and the issue happened again.

So I migrated the .vmdk to a local datastore on the server and, magically, nothing went wrong. I then migrated the .vmdk back to the storage system datastore, and the error came back. To finish my weekend, I left the .vmdk on the local datastore, and the VM has been up since then without any problems.

vbondzio
VMware Employee

So that does seem to confirm that the PSOD prints relevant context, i.e. that it is related to the qfle3i driver. Check whether this happens with the most up-to-date async driver too. I saw some potentially related issues, but I can't say anything without a dump.

Is this really the same VM and the only difference is the kernel used? I.e. same VMX / disks etc.?

vbondzio
VMware Employee

Just for some extra "search-ability" and to explain further: the PSOD was caused by an NMI (non-maskable interrupt), so most likely some device detected a fatal flaw and crashed the host. In most scenarios, those are caused by hardware issues, but they can originate in software too. I've found a similar failure pattern reported before, one or maybe two cases that is, so it's pretty rare. I can't be sure though, because that would require looking at the actual core dump. That being said, what I found _might_ be addressed in the latest async driver for 6.7 and _probably_ in the inbox driver for 7.0 (U1).

Rosivan
Contributor

Yes, it's the same VM; when the boot menu comes up, I can choose which kernel to use.

And yes, you are probably right about the qfle3i driver: after installing the VIB from this KB, https://kb.vmware.com/s/article/56357, my lab host ran the Oracle Linux VM with the UEK without problems.
Now I need to schedule a window to remediate the prod hosts and check whether the issue is solved.

For now, thank you so much for your idea and your help!
After applying the patch to the nodes in the cluster, I will come back here with updated feedback.

vbondzio
VMware Employee

I think the issue in that KB is a different one from what you experienced, but as long as that updated driver also ships with a fix for what you saw, it doesn't really make a difference :-). Fingers crossed it's gone in prod too!
