VMware Cloud Community
acantos
Contributor

ESXi 5.1.0 crashes under heavy network load

I have a VM that has an NFS-based disk. When I do something I/O-intensive (and possibly VM-network-intensive) on the VM, ESXi crashes.

The motherboard is an Intel DH57JG, the chipset is Intel H57, and the Ethernet controller is 82578DC (hence the e1000 driver in the stack trace). Here's the stack trace. I copied it by hand, so there might be a few 8's that should be 0's.

VMware ESXi 5.1.0 [Releasebuild-1065491 x86_64] #PF Exception 14 in world 2476:helper31-7 IP 0x418027119b1d addr 0x18
PTEs:0x0:
cr0=0x8001003d cr2=0x18 cr3=0xdb5ac000 cr4=0x216c
frame=0x412286b1bcc0 ip=0x418027119b1d err=9 rflags=0x10202
rax=0x0 rbx=0x412286b1bd88 rcx=0x41000162a700
rdx=0xbad000e rbp=0x412286b1bd0 rsi=0x412286b1bdd0
rdi=0x412401cdc840 r8=0x412286b1bd18 r9=0x0
r10=0x100 r11=0x0 r12=0x0
r13=0x412401fc9a88 r14=0x410006286c80 r15=0x4100063610d0
*PCPU2:2476/helper31-7
PCPU  0: VVSHV
Code start: 0x418027000000 VMK uptime: 14:03:47:28.991
0x412206b1bdb0:[0x418027119b1d]Net_PktFree@vmkernel#nover+0x10 stack: 0x412206b1bde0
0x412206b1bdf0:[0x41802753a401]skb_release_data@com.vmware.driverAPI#9.2+0xa4 stack: 0x410006206c80
0x412206b1be10:[0x41802753a507]__kfree_skb@com.vmware.driverAPI#9.2+0x2a stack: 0x412206b1be60
0x412206b1be60:[0x41802768ba00]e1000_clean_rx_ring@#+0xbb stack: 0x0
0x412206b1be90:[0x41802760bf43]e1000e_down@#+0xde stack: 0x410006201580
0x412206b1beb0:[0x41802760bf99]e1000e_reinit_locked@#+0x3c stack: 0x62024c0
0x412206b1bf60:[0x41802755aac3]vmklnx_workqueue_callout@com.vmware.driverAPI#9.2+0x11a stack: 0x0
0x412206b1bff8:[0x41802784842f]helpFunc@vmkernel#nover+0x52e stack: 0x0
0x412206b1bff8[0x0] stack: 0x0
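
For anyone reading along: the backtrace lists the faulting frame first, so reading it bottom-up gives the call path. Here is a minimal Python sketch that pulls the symbol names out of the frames above. The regex and the `call_chain` name are my own, and it assumes every frame follows the `address:[ip]symbol@module+offset` layout seen in this trace, which was copied by hand.

```python
import re

# Matches frames such as:
#   0x412206b1bdb0:[0x418027119b1d]Net_PktFree@vmkernel#nover+0x10 stack: 0x...
# The final "[0x0]" line has no symbol, so it is skipped.
FRAME_RE = re.compile(r"^0x[0-9a-f]+:?\[0x[0-9a-f]+\](?P<symbol>[^@\s]+)@")

def call_chain(backtrace):
    """Return symbol names ordered from outermost caller to faulting function."""
    symbols = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            symbols.append(match.group("symbol"))
    # The PSOD prints the faulting frame first; reverse for caller -> callee.
    return list(reversed(symbols))

# Frames copied from the post above.
TRACE = """\
0x412206b1bdb0:[0x418027119b1d]Net_PktFree@vmkernel#nover+0x10 stack: 0x412206b1bde0
0x412206b1bdf0:[0x41802753a401]skb_release_data@com.vmware.driverAPI#9.2+0xa4 stack: 0x410006206c80
0x412206b1be10:[0x41802753a507]__kfree_skb@com.vmware.driverAPI#9.2+0x2a stack: 0x412206b1be60
0x412206b1be60:[0x41802768ba00]e1000_clean_rx_ring@#+0xbb stack: 0x0
0x412206b1be90:[0x41802760bf43]e1000e_down@#+0xde stack: 0x410006201580
0x412206b1beb0:[0x41802760bf99]e1000e_reinit_locked@#+0x3c stack: 0x62024c0
0x412206b1bf60:[0x41802755aac3]vmklnx_workqueue_callout@com.vmware.driverAPI#9.2+0x11a stack: 0x0
0x412206b1bff8:[0x41802784842f]helpFunc@vmkernel@nover+0x52e stack: 0x0
0x412206b1bff8[0x0] stack: 0x0
"""

print(" -> ".join(call_chain(TRACE)))
```

Read that way, the fault in Net_PktFree is reached while the driver's reset path (e1000e_reinit_locked, then e1000e_down) is cleaning the RX ring from a workqueue helper.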

One thing I forgot to mention: this was a 5.0.0->5.1.0 upgrade, and the upgrade didn't "stick" the first time, so I might be dealing with corrupted files.

25 Replies
acantos
Contributor

I'll give that a try. Someone also suggested I completely disable power management.

I also tried running just the VM that's been crashing the host. I ran a command that has been triggering it (FreeBSD's "portsnap fetch update"), and I was shocked at how quickly the host crashed. It almost feels like the vmdk must be badly corrupted, but I saw the crash on bulk DB inserts over NFS, too.

acantos
Contributor

One massive oversight: the vmdk wasn't actually involved here. It turns out 99%+ of the I/O on the guest was either NFS or another network protocol. There's still an issue, and it may very well be a bad network driver, but the trigger seems to be guest network traffic.

acantos
Contributor

No luck disabling hyperthreading.

acantos
Contributor

Shutting off hyperthreading, disabling power-saving features, and updating the microcode, all together, didn't help either.

admin
Immortal

With all of those things ruled out, I suspect a TLB tagging issue. Can you try adding the following to your /etc/vmware/config (with all VMs powered off):

monitor.virtual_mmu = software

This should force a TLB flush on every transition between the virtual machine monitor and the vmkernel.
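
If you would rather script the change than edit the file by hand, here is a rough Python sketch of the edit. The `add_setting` helper is hypothetical, it assumes the config file is plain text with one setting per line, and it is meant to illustrate the steps (back up, then append once) rather than to be run as-is in the ESXi shell.

```python
import shutil
from pathlib import Path

# Line quoted from the post above.
SETTING = "monitor.virtual_mmu = software"

def add_setting(config_path, line=SETTING):
    """Append `line` to the config file unless it is already present.

    Returns True if the file was changed. When the file already exists,
    a copy with a .bak suffix is written before it is modified.
    """
    path = Path(config_path)
    text = path.read_text() if path.exists() else ""
    if line in (existing.strip() for existing in text.splitlines()):
        return False  # already set, nothing to do
    if path.exists():
        shutil.copy2(path, str(path) + ".bak")
    if text and not text.endswith("\n"):
        text += "\n"
    path.write_text(text + line + "\n")
    return True

# Example (path from the post; only with all VMs powered off):
# add_setting("/etc/vmware/config")
```

Running it a second time leaves the file untouched, so the line is never duplicated.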

acantos
Contributor

It seems to be working, but connections between guests don't seem to work as well. At least it's not crashing. I'm going to run things overnight and see how that goes.

No luck. I'm getting the e1000_clean_rx_ring error again, though at least it's after an hour, not instantly.
