VMware Cloud Community
Bill_Oyler
Hot Shot

Purple Screen on Cisco UCS server with ESXi 6.5 Build 5146846 (March 9, 2017)

Last week I patched our (7) lab ESXi 6.5 servers to the current patch level (Build 5146846, from March 9, 2017).  The (5) Dell servers have been running flawlessly on this new patch level.  However, the (2) Cisco UCS B200 M4 servers have been crashing constantly with Purple Screen of Death (PSOD).  I verified that the Cisco UCS servers are running the current firmware (BIOS & VIC) supported by Cisco with ESXi 6.5, and I also used Update Manager to apply the current Cisco Image Profile ("Vmware-ESXi-6.5a.0-4887370-Custom-Cisco-6.5.0.2-Bundle.zip" from 2017-03-14) to get the current drivers, but that did not help: the PSODs continued.

The PSODs keep occurring on both ESXi hosts in the 2-node Cisco cluster, causing a complete cluster outage.  First, one ESXi host PSODs.  Then, a little while later, the second ESXi host PSODs.  Sometimes the PSOD takes 5 hours to occur, sometimes only 25 minutes.  (Another PSOD occurred while I was typing this post, about 25 minutes after the blade was rebooted.)  I've captured dozens of PSOD screenshots, and every single one contains the following line:

NOT_IMPLEMENTED bora/vmkernel/sched/cpusched.c:9581

On each PSOD, a different VM name is listed two lines below the above error code, so there is no consistency as to which VM triggers this panic. 

The interesting thing is that the (5) Dell servers are humming along without issue on this patch level.  The Cisco servers are using Intel(R) Xeon(R) CPU E5-2670 v3 CPUs, while the Dell servers are using earlier generation CPUs (Sandy Bridge or Westmere).  Both the Dell servers and Cisco servers are using vSphere Replication 6.5.  The Cisco servers are running a heavier load of VMs, many of which are using multiple vCPUs, so perhaps there is another vCPU scheduling bug that is being triggered? 

We have vRealize Log Insight running, and the following are some of the last messages the ESXi host sent before the PSOD:

[Originator@6876 sub=VpxaHalCnxHostagent opID=WFU-491dcb0b] Applying updates from 215636 to 215637 (at 215636)

[Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: guest, 45. Sent notification immediately.

The vRealize Log Insight VM is running in the Cisco cluster, so I might be missing some of the most important log entries right at the time of the PSOD.

Anyone else running into this??  (Note that this is a "lab" environment, so it's not production impacting.  This is exactly why we have a "lab" environment!!!)

Below is one screen shot example of the PSOD.  They all pretty much look like this:

[Screenshot: pastedImage_8.png]

Bill Oyler Systems Engineer
0 Kudos
16 Replies
Bill_Oyler
Hot Shot

One thing I noticed is that ESXi 6.5 P1 (Build 5146846) has a new default value for the "iovDisableIR" setting.  In ESXi 4.1 through 6.0U2 pre-P4, the default value was FALSE (IR is enabled).  With ESXi 6.5 GA and 6.5a, the default changed to TRUE.  Starting with build 5146846, it is back at FALSE again.  I'm not sure if this is causing the Purple Screen on Cisco, but there were several reports of HPE customers experiencing Purple Screens after this setting changed.  Below is what this setting defaults to in the ESXi 6.5 GA and 6.5a builds, and then what it defaults to in ESXi 6.5 Patch 1:

ESXi 6.5a - Build 4887370:

[root@host1:~] esxcli system settings kernel list -o iovDisableIR
Name         Type Description                                Configured Runtime Default
------------ ---- ------------------------------------------ ---------- ------- -------
iovDisableIR Bool Disable Interrrupt Remapping in the IOMMU. TRUE       TRUE    TRUE

ESXi 6.5 Patch 1 - Build 5146846:

[root@host2:~] esxcli system settings kernel list -o iovDisableIR
Name         Type Description                                Configured Runtime Default
------------ ---- ------------------------------------------ ---------- ------- -------
iovDisableIR Bool Disable Interrrupt Remapping in the IOMMU. FALSE      FALSE   FALSE
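
For anyone comparing this setting across several hosts, here is a minimal Python sketch that parses the table printed by `esxcli system settings kernel list -o iovDisableIR`. It assumes the fixed-width layout shown above, with a dashed ruler line under the header. The names `parse_esxcli_table` and `differs_from_default` are illustrative helpers, not part of any VMware tooling, and `SAMPLE` is simply the Patch 1 output pasted from this post.

```python
import re

# Output pasted from the ESXi 6.5 Patch 1 host above (assumed fixed-width layout).
SAMPLE = """\
Name         Type Description                                Configured Runtime Default
------------ ---- ------------------------------------------ ---------- ------- -------
iovDisableIR Bool Disable Interrrupt Remapping in the IOMMU. FALSE      FALSE   FALSE
"""


def parse_esxcli_table(text):
    """Parse fixed-width esxcli table output into a list of row dicts."""
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    # The ruler is the line made up only of dashes and spaces.
    ruler_idx = next(i for i, ln in enumerate(lines) if set(ln) <= {"-", " "})
    # Each run of dashes marks where one column starts.
    starts = [m.start() for m in re.finditer(r"-+", lines[ruler_idx])]
    stops = starts[1:] + [None]  # the last column runs to end of line

    def cells(line):
        return [line[a:b].strip() for a, b in zip(starts, stops)]

    header = cells(lines[ruler_idx - 1])
    return [dict(zip(header, cells(ln))) for ln in lines[ruler_idx + 1:]]


def differs_from_default(row):
    """True when the configured or runtime value is not the build's default."""
    return row["Configured"] != row["Default"] or row["Runtime"] != row["Default"]


row = parse_esxcli_table(SAMPLE)[0]
print(row["Name"], row["Configured"], row["Runtime"], row["Default"])
print("overridden:", differs_from_default(row))
```

Slicing by the ruler's column offsets, rather than splitting on whitespace, keeps the multi-word Description column intact.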

Bill Oyler Systems Engineer
0 Kudos
cscno
Contributor

I've got the same issue on my Dell servers.  I tried setting iovDisableIR back to TRUE, with no success.  VMware support asked me to revert to 6.5.0a (before the March patches) and said that a fix is scheduled for release sometime in July.  I will test the revert and post back in a few days if it helps.
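
For reference, a minimal sketch of the setting change mentioned above. The underlying command is `esxcli system settings kernel set` with the `--setting` and `--value` options; the Python wrapper, the function names, and the `dry_run` flag are illustrative assumptions only. Kernel settings take effect after a host reboot, and as noted in this post, the change did not stop the PSODs here.

```python
import shutil
import subprocess


def iov_disable_ir_command(value):
    """Build the esxcli argv that sets the iovDisableIR kernel option."""
    return [
        "esxcli", "system", "settings", "kernel", "set",
        "--setting=iovDisableIR",
        "--value=" + ("TRUE" if value else "FALSE"),
    ]


def apply_iov_disable_ir(value, dry_run=True):
    """Return the command; run it only when dry_run is off and esxcli exists."""
    cmd = iov_disable_ir_command(value)
    if not dry_run and shutil.which("esxcli"):
        subprocess.run(cmd, check=True)  # reboot the host afterwards
    return cmd


# Dry run: only prints the command that would be executed on the host.
print(" ".join(apply_iov_disable_ir(True)))
```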

cscno
Contributor

I wasn't able to revert to the earlier build using the instructions in VMware KB 1033604 ("Reverting to a previous version of ESXi"), because that method can only go back one build, not two (I had installed both March patches).  So I had to reinstall ESXi 6.5.0a (build 4887370) from scratch.  So far, it's been a day, and my hosts haven't rebooted yet.

0 Kudos
Bill_Oyler
Hot Shot

Same here.  I re-installed 6.5a from scratch and I've had zero PSODs since March 20th.

Bill Oyler Systems Engineer
0 Kudos
vanree
Enthusiast

We had the same on 2 of our 3 Dell R320 servers this week.

It started with one server losing all access to its iSCSI devices; today a second server had the same problem.

Rolling back to original Dell 6.5 image now.

0 Kudos
AlfredoUN
Contributor

Hello,

We have the same issue with HP DL360 G9 and the latest ESXi Custom ISO Build VMware-ESXi-6.5.0-OS-Release-5146846-HPE-650.9.6.5.27-May2017.

It occurs when a VM puts heavy stress on the vCPUs.

I used the Shift+R recovery option to revert to the previous build, and it seems to work for me.

For us it's not a definitive solution, because we can't keep our ESXi hosts up to date this way.

Have you received any answers from VMware about this issue?

Best Regards,

Alfredo

0 Kudos
Bill_Oyler
Hot Shot

A few weeks ago I patched all servers (Dell and Cisco UCS) to the latest patch level -- ESXi 6.5d Build 5310538 -- and that has solved the Purple Screen of Death issue on Cisco UCS.  No PSOD in several weeks now.

Bill Oyler Systems Engineer
0 Kudos
AlfredoUN
Contributor

Hello Bill,

Did you use the Cisco custom ISO for that?

HPE has not yet released a custom ISO for 6.5d.

I have to wait!

Thanks for your answer.

Have a nice day

0 Kudos
Bill_Oyler
Hot Shot

Yes, Cisco has published the 6.5d custom ISO.  For HPE, you can use their latest custom ISO and simply use Update Manager to apply the latest ESXi patches.  Then, when HPE releases their custom ISO, you'd use Update Manager to update HPE custom VIBs (drivers and software) via the Update Manager URLs they put in their HPE VMware Recipe Book (offline bundles).

Bill Oyler Systems Engineer
0 Kudos
AlfredoUN
Contributor

That's what I have done. Right after upgrading ESXi to 6.5, I pushed all the patches, including the ones in the HPE Software Delivery Repository vibsdepot (aka HPE Online Depot).

I have 3 clusters in my VMware infrastructure.

All 3 were updated to 6.5 with the latest patches, but only one cluster has the PSOD issue.

That cluster is the most stressed one, with the highest load in terms of CPU and IO.

It is also the only cluster whose servers have an LSI 3008 SAS HBA controller, used to access an all-SSD array.

And the issue appears only when there is a VM that stresses the ESXi host.

Do you think making a custom ISO based on the original VMware 6.5d build and adding HPE Drivers can resolve the problem?

0 Kudos
Bill_Oyler
Hot Shot

Did you update all HPE firmware with the Service Pack for ProLiant and any post-SPP firmware updates per the latest HPE Firmware Recipe Book?  Out of date firmware is typically the cause of HPE Purple Screen of Death issues.  If you keep experiencing PSODs with all firmware and drivers up to date, it's time to open a ticket with HPE Support.

Bill Oyler Systems Engineer
0 Kudos
iHenk
Contributor

Same over here on IBM/Lenovo System x servers. It started after applying the latest ESXi 6.5 patches. I'm on 6.5 5310538.

I already upgraded vCenter and NSX to the latest versions to be sure, but no relief.

Peculiarly, the server that PSODs is always the one on which the vCenter Server is running.

0 Kudos
cscno
Contributor

Good to know this is still happening on 6.5.0d.  I was told the fix should be in the major 6.5 U1 release sometime in July.  I reverted to 6.5.0a 4887370 while I wait for the release.  I do have the latest version of vCenter working fine though (6.5.0e 5705665).  It only seems to be a bug in ESXi, not vCenter.

0 Kudos
5mall5nail5
Enthusiast

Just experienced this PSOD today on a Supermicro X9DRI-LN4F+ system with dual E5-2670 CPUs and 256GB of RAM.  Was up ~10 days and then PSOD'd out.  Same exact error.

Any fixes?

0 Kudos
RomaTuan
Contributor

I have the same PSOD on Cisco UCS running ESXi 6.5 build 4887370.

What is the cause of the error?

Please help! See image below.

0 Kudos
cscno
Contributor

I had this issue, and had reverted to 6.5.0a as I waited for a patch from VMware.  I upgraded to ESXi 6.5 U1 (build 5969303) about a week ago, and haven't had a PSOD yet.

0 Kudos