ESXi 5.1.0 crashes under heavy network load

acantos · ‎08-17-2013

I have a VM that has an NFS-based disk. When I do something IO (and possibly VM-network) intensive on the VM, ESXi crashes.

The motherboard is an Intel DH57JG, the chipset is Intel H57, and the Ethernet controller is an 82578DC (hence the e1000 driver in the stack trace). Here's the stack trace. I copied it by hand, so there might be a few 8's that should be 0's.

VMWare ESXi 5.1.0 [Releasebuild-1065491 x86_64]
#PF Exception 14 in world 2476:helper31-7 IP 0x418027119bld addr 0x18
PTEs:0x0:
cr0=0x8001003d cr2=0x18 cr3=0xdb5ac000 cr4=0x216c
frame=0x412286b1bcc0 ip=0x418027119b1d err=9 rflags=0x10202
rax=0x0 rbx=0x412286b1bd88 rcx=0x41000162a700
rdx=0xbad000e rbp=0x412286b1bd0 rsi=0x412286b1bdd0
rdi=0x412401cdc840 r8=0x412286b1bd18 r9=0x0
r10=0x100 r11=0x0 r12=0x0
r13=0x412401fc9a88 r14=0x410006286c80 r15=0x4100063610d0
*PCPU2:2476/helper31-7
PCPU 0: VVSHV
Code start: 0x418027000000 VMK uptime: 14:03:47:28.991
0x412206b1bdb0:[0x418027119b1d]Net_PktFree@vmkernel#nover+0x10 stack: 0x412206b1bde0
0x412206b1bdf0:[0x41802753a401]skb_release_data@com.vmware.driverAPI#9.2+0xa4 stack: 0x410006206c80
0x412206b1be10:[0x41802753a507]__kfree_skb@com.vmware.driverAPI#9.2+0x2a stack: 0x412206b1be60
0x412206b1be60:[0x41802768ba00]e1000_clean_rx_ring@#+0xbb stack: 0x0
0x412206b1be90:[0x41802760bf43]e1000e_down@#+0xde stack: 0x410006201580
0x412206b1beb0:[0x41802760bf99]e1000e_reinit_locked@#+0x3c stack: 0x62024c0
0x412206b1bf60:[0x41802755aac3]vmklnx_workqueue_callout@com.vmware.driverAPI#9.2+0x11a stack: 0x0
0x412206b1bff8:[0x41802784842f]helpFunc@vmkernel@nover+0x52e stack: 0x0
0x412206b1bff8[0x0] stack: 0x0
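
For anyone comparing this trace against their own, here is a minimal Python sketch that pulls the call chain out of the backtrace text. The frame layout it assumes (frame address, return address in brackets, then symbol@module+offset) is inferred only from the trace above, and `parse_frames` is a hypothetical helper name, not a VMware tool.

```python
import re

# Assumed frame shape, inferred from the hand-copied trace above:
#   0x<frame addr>:[0x<return addr>]<function>@<module>+0x<offset> stack: ...
FRAME_RE = re.compile(r"0x[0-9a-f]+:\[(0x[0-9a-f]+)\]([^\s+]+)\+(0x[0-9a-f]+)")


def parse_frames(trace):
    """Return (function, module, offset) tuples, innermost frame first."""
    frames = []
    for _ip, symbol, offset in FRAME_RE.findall(trace):
        # Split "Net_PktFree@vmkernel#nover" at the first "@".
        function, _, module = symbol.partition("@")
        frames.append((function, module, int(offset, 16)))
    return frames


# Three frames copied from the trace above.
sample = (
    "0x412206b1bdb0:[0x418027119b1d]Net_PktFree@vmkernel#nover+0x10 stack: 0x412206b1bde0 "
    "0x412206b1bdf0:[0x41802753a401]skb_release_data@com.vmware.driverAPI#9.2+0xa4 stack: 0x410006206c80 "
    "0x412206b1be60:[0x41802768ba00]e1000_clean_rx_ring@#+0xbb stack: 0x0"
)

for function, module, offset in parse_frames(sample):
    print(f"{function:24} {module:28} +{offset:#x}")
```

Read from the bottom up, the full trace shows a workqueue callout running e1000e_reinit_locked, which calls e1000e_down and e1000_clean_rx_ring, and the page fault happens in Net_PktFree while a receive buffer is being freed.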

One thing I forgot to mention: this was a 5.0.0->5.1.0 upgrade, and the upgrade didn't "stick" the first time, so I might be dealing with corrupted files.

acantos · ‎08-29-2013

I'll give that a try. Someone also suggested I completely disable power management.

I also tried running just the VM that's been crashing the host. I ran a command that's been triggering it (FreeBSD's "portsnap fetch update"), and I was shocked at how quickly the host crashed. It almost feels like the vmdk must be incredibly corrupt, or something, but I saw the crash on bulk DB inserts over NFS, too.

acantos · ‎08-29-2013

One massive oversight: the vmdk wasn't actually involved here. It turns out 99%+ of the IO on the guest was either NFS or another network protocol. There's still an issue, and it may very well be a bad network driver, but the trigger seems to be guest network traffic.

acantos · ‎08-29-2013

No luck disabling hyperthreading.

acantos · ‎08-29-2013

Shutting off hyperthreading, disabling the power-saving features, and updating the microcode, all at the same time, didn't work either.

admin · ‎08-29-2013

With all of those things ruled out, I suspect a TLB tagging issue. Can you try adding the following to your /etc/vmware/config (with all VMs powered off):

monitor.virtual_mmu = software

This should force a TLB flush on every transition between the virtual machine monitor and the vmkernel.
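
If you would rather script the change than edit the file by hand, a rough Python sketch follows. It works on the text of a config file and sets the option exactly once, so running it twice does not add a duplicate line. `set_option` is a hypothetical helper, and the `some.other_option` line is made-up sample content, not something from a real /etc/vmware/config. The value is written unquoted, as in the line above; whether quotes are needed is not something this thread establishes.

```python
def set_option(config_text, key, value):
    """Return config_text with exactly one `key = value` line."""
    wanted = f"{key} = {value}"
    lines, found = [], False
    for line in config_text.splitlines():
        # Treat "name = value" lines whose name matches as the same option.
        if "=" in line and line.split("=", 1)[0].strip() == key:
            if not found:
                lines.append(wanted)  # replace the first occurrence
                found = True
            continue  # drop any later duplicates
        lines.append(line)
    if not found:
        lines.append(wanted)  # option was absent, so append it
    return "\n".join(lines) + "\n"


# Made-up existing content, for illustration only.
existing = 'some.other_option = "value"\n'
updated = set_option(existing, "monitor.virtual_mmu", "software")
print(updated, end="")
```

The sketch only transforms a string. Reading and writing the actual file, and taking a backup first, is left out on purpose.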

acantos · ‎08-29-2013

It seems to be working, but connections between guests don't seem to work as well. At least it's not crashing. I'm going to run things overnight, see how that goes.

No luck. I'm getting the e1000_clean_rx_ring error again, though at least it's after an hour, not instantly.
