ESXi 4.1 lock up/freeze IBM HS22s

WayneDADC · ‎04-16-2012

Hi,

We have a problem with our HS22 blades running ESXi 4.1 occasionally locking up with an orange screen. The problem first appeared on 21 March and has appeared on a different blade every day or two since.

Blades running ESXi 4.1 occasionally report loss of network connectivity or loss of network uplink redundancy, or lock up at random with the orange screen still showing on the console. There appears to be no correlation between the timing of the network faults and the freezes. The BladeCenter detects the lock-up and attempts to restart the blade, but it hangs during boot and has to be powered down, unjacked from the chassis, reinserted and powered up again to clear. The Brocade converged switch module logs show the Ethernet port dropping briefly during a network fault and shutting down when the blade freezes. No storage or upstream Ethernet problems have been found, and the converged switches do not show outages on any of the ports used by the other blades.
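
For anyone wanting to repeat the timing comparison, a minimal Python sketch of how it could be scripted is below. The function name, the 30-minute window and the sample timestamps are all assumptions for illustration; they are not values taken from our logs.

```python
from datetime import datetime, timedelta


def faults_near_freezes(fault_times, freeze_times, window=timedelta(minutes=30)):
    """Pair each freeze with the network faults logged within +/- window of it."""
    pairs = []
    for freeze in freeze_times:
        nearby = [fault for fault in fault_times if abs(fault - freeze) <= window]
        pairs.append((freeze, nearby))
    return pairs


# Placeholder timestamps for illustration only (hypothetical, not from real logs).
faults = [datetime(2012, 1, 1, 10, 0), datetime(2012, 1, 1, 15, 0)]
freezes = [datetime(2012, 1, 1, 10, 20), datetime(2012, 1, 2, 3, 0)]

for freeze, nearby in faults_near_freezes(faults, freezes):
    print(freeze.isoformat(), "->", [fault.isoformat() for fault in nearby])
```

A freeze with an empty list had no network fault logged inside the window. Widening or narrowing the window shows how sensitive the "no correlation" conclusion is to that choice.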

After disabling IRQ routing in VMware on two blades, the failure is now reported as "System board, (PCI Slot 2) bus uncorrectable error" at the point of failure, rather than as a generic CPU failure.

Environment

HS22 (7870) blades with two Intel Xeon E5540/2.53GHz CPUs, 48GB (12x4G) RAM, 2x 600GB HDD RAID0 internal storage, Brocade 2-port 10Gb Converged Network Adapter (CFFh), running ESXi 4.1.0 Update 2 (build 502767)

IBM BladeCenter H (8852) with 2x AMM (firmware build BPET54V), 2x Cisco WS-CBS3012-IBM-I (software 12.2(52)SE - management network), 2x Brocade 8470 converged switch modules. IPv6 is disabled throughout.

10GE is connected to Core switch (2x Cisco 6509 configured as a single VSS) via OM3 MM fibre with LC connectors.

8GFC is connected to SAN switch (4x IBM 2498-B40 arranged as two redundant fabrics) via OM3 MM fibre with LC connectors.

Storage is 3x DS4700, 1x DS3524 connected to SAN switch (4x IBM 2498-B40 arranged as two redundant fabrics) via OM3 MM fibre with LC connectors.

Other blades in the environment, none of which are experiencing any problems:

HS21 (type 8853) with QLogic QMI1830 CNAs running VMware ESXi 5.0

HS21 (type 8853) with QLogic QMI1830 CNAs running Windows 2008 R2 x64

HS22 (type 8853) with QLogic QMI1830 CNAs running Windows 2008 R2 x64

JS22 (type 7998) with QLogic QMI1830 CNAs running VIO 2.2.0.11-FP-24 SP-02

PS701 (type 8406) with QLogic QMI1830 CNAs running VIO 2.2.0.11-FP-24 SP-02

Changes in previous 6 months

No changes between the end of July 2011 and 15 January 2012 due to peak season lock-down apart from AMM.

14 Nov 11 Updated AMM firmware to BPET62F to get all to a consistent level (PTM chassis needed this release to fix a problem with the pass-through module). This triggered a bug in the AMM that caused the AMMs not to be recognised:

"Stand-by MM failure on system management bus, check devices"
"Failure reading device on system management bus 0"

27 Feb 12 Rolled AMM firmware back to BPET54V

26-27 Jan 12 Re-patched SAN fabrics to ensure full redundancy.

Problems first manifested on 21 Mar 12.

23 Mar 12 IBM replaced the blade and CPU1, retaining the CNA, storage and memory. The blade failed again eight days later.

27 Mar 12 Connected second storage controller on DS3524, changed default block size and upgraded firmware in preparation for SVC. No systems report storage problems.

27-30 Mar 12 Upgraded Dynamic System Analysis to v4.01, uEFI flash to 1.17, Brocade 10G boot code to 3.0.0.0, LSI 1064e SAS controller to 1.30.10.00, Broadcom NetXtreme to 6.2.0 and IMM to 1.32 (YUOOD4G) from mounted ISO as directed by IBM.

16 April 12 Disable IRQ routing on AUHUNESXi02 and AUHUNESXi05 (per http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=MIGR-5086606&brandind=5000008&myns=x0...
and http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=103026...)

Ongoing - additional guest VMs have been added to the environment progressively over 6 months. VMware logs do not show excessive load on memory, CPU, network or storage.

I've seen similar discussions on Dell and HP hardware. Could there have been an ESXi update that went awry?

Wayne
