VMware Cloud Community
lshisler
Contributor

ESXi 7.0U1 - 15 hosts experience PSOD within minutes of each other

We already have a support ticket open, but I wanted to see if anyone else has seen this.

At around 5:00 PM EST on 9/23/2021, 15 of our 22 ESXi hosts purple screened nearly simultaneously, over a span of about 5 minutes. This brought our vSAN clusters to a halt and forced HA action on 803 of our 1099 powered-on VMs. Of the 15 hosts that experienced a PSOD, 8 experienced a second PSOD once they had fully rebooted.

Three hours prior to this, we upgraded our vCenter from 7.0 U2 to 7.0 U2d. 10 to 15 minutes prior to the event, we implemented the last of a series of LACP load balancing hash changes. Two dswitches had already been changed successfully: one 2 hours prior, and the other 45 minutes prior to the third change. We quickly listed our changes for the day and decided to shut our vCenter server down. Since then, no further PSODs have occurred.

Our infrastructure runs primarily on 22 ESXi hosts. The hosts are split into a management cluster with 2 hosts, a primary vSAN cluster with 14 hosts and 2 PB of capacity, and a production vSAN cluster with 6 hosts and 350 TB of capacity.

The vSAN hosts are a mix of Dell PE R730xd and R740xd servers running ESXi 7.0 U1. All hardware/driver/firmware combinations are on the vSphere or vSAN compatibility lists for 7.0.1. These hosts have been running this version of ESXi since mid-February 2021. With the exception of the occasional bad stick of RAM, these hosts have experienced no downtime, with the vast majority having 189+ days of uptime.

The management hosts are Dell PE R7515 servers running ESXi 7.0 U2. These hosts were upgraded to ESXi 7.0 U2 in late August 2021.

All hosts use one of two variants of the HBA330, except the R7515s, which use the Dell PCIe expanders with U.2 NVMe drives from Micron. The driver for the HBA330 is 17.00.10.00 and the firmware is 16.17.00.50.

All hosts are using Intel X710-2 NICs (driver 1.10.9.0, firmware 7.20) and Intel X520-DA2 NICs (driver 1.8.9.0, firmware 19.0.12).

Our distributed switches are version 6.5 or 6.6 and are managed by a Cisco ACI infrastructure, meaning we make our distributed switch policy changes inside ACI, which then pushes them to vCenter.

Attached is a screenshot of a core dump. All of the core dumps resemble this one.

We are still working with VMware support, but it appears to be an issue they've either not seen before or are having a hard time identifying.

 
