Hi,
I'm running on ESX 5.1. The guest is CentOS 6.3 with vmxnet3 1.1.32.0.
When the kernel runs short of skbs under load, the machine crashes with the following stack trace:
[77782.262000] ------------[ cut here ]------------
[77782.262000] kernel BUG at /tmp/build_rpm_sandbox.vy6fT4/BUILD/vmware-open-vm-tools-9.0.0/vmxnet3/vmxnet3_drv.c:1467!
[77782.262000] invalid opcode: 0000 [#2] SMP
[77782.262000] last sysfs file: /sys/module/ipv6/initstate
[77782.262000] CPU 0
[77782.262000] Modules linked in: ip_vs libcrc32c u2p(P)(U) nls_cp950 netconsole hades(P)(U) imp_iTCO_wdt(U) hades_wd(U) ip6table_filter ip6_tables ip6t_REJECT configfs vmblock(U) vsock(U) fuse vmware_balloon ipt_REJECT iptable_filter ip_tables ppdev parport_pc parport vmxnet3(U) ipv6 sg i2c_piix4 i2c_core vmci(U) shpchp ext3 jbd mbcache sd_mod crc_t10dif sr_mod cdrom mptspi mptscsih mptbase scsi_transport_spi pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: netconsole]
[77782.262000]
[77782.262000] Pid: 14178, comm: gw.x Tainted: P --------------- 2.6.32-279.el6.imp4.x86_64 #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
[77782.262000] RIP: 0010:[<ffffffffa01b0a1f>] [<ffffffffa01b0a1f>] vmxnet3_rq_rx_complete+0x9bf/0xbc0 [vmxnet3]
[77782.262000] RSP: 0018:ffff8800282037c8 EFLAGS: 00010046
[77782.262000] RAX: 0000000000000040 RBX: ffff880239dd0ee0 RCX: 0000000000001000
[77782.262000] RDX: 0000000000000004 RSI: 00000000000006c8 RDI: 0000000000000000
[77782.262000] RBP: ffff880028203838 R08: ffff880238629800 R09: 00000000000006c8
[77782.262000] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88023a94aea0
[77782.262000] R13: ffff8802365a4c10 R14: ffff880239dd06e0 R15: ffff88023862aa18
[77782.262000] FS: 00007f61f2aea700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
[77782.262000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[77782.262000] CR2: 0000000000a9f270 CR3: 000000008beaa000 CR4: 00000000000006f0
[77782.262000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[77782.262000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[77782.262000] Process gw.x (pid: 14178, threadinfo ffff880080828000, task ffff880080820970)
[77782.262000] Stack:
[77782.262000] ffff88023b9c0c40 ffff88023b9c0c80 0000000000000001 ffff880239dd0ee8
[77782.262000] <d> ffff880028203868 0000001081151947 0000000000000001 000000c100000000
[77782.262000] <d> ffff88023b9c0c80 ffff880239dd06e0 ffff880239dd0ee8 0000000000000010
[77782.262000] Call Trace:
[77782.262000] <IRQ>
[77782.262000] [<ffffffffa01b0da3>] vmxnet3_poll_rx_only+0x53/0x110 [vmxnet3]
[77782.262000] [<ffffffff8143c835>] netpoll_poll_dev+0xd5/0x490
[77782.262000] [<ffffffff8143cc9a>] find_skb+0x8a/0x90
[77782.262000] [<ffffffff8143cf96>] netpoll_send_udp+0x36/0x230
[77782.262000] [<ffffffffa001633b>] write_msg+0xbb/0x110 [netconsole]
[77782.262000] [<ffffffff81064a65>] __call_console_drivers+0x75/0x90
[77782.262000] [<ffffffff81064aca>] _call_console_drivers+0x4a/0x80
[77782.262000] [<ffffffff81064fde>] release_console_sem+0x4e/0x220
[77782.262000] [<ffffffff81065797>] vprintk+0x247/0x560
[77782.262000] [<ffffffff814e81ce>] printk+0x41/0x43
[77782.262000] [<ffffffff810a3ed7>] print_modules+0x97/0xf0
[77782.262000] [<ffffffff8100e066>] show_registers+0x46/0x280
[77782.262000] [<ffffffff814ee23a>] ? atomic_notifier_call_chain+0x1a/0x20
[77782.262000] [<ffffffff814ec053>] __die+0xb3/0xf0
[77782.262000] [<ffffffff8100f178>] die+0x48/0x90
[77782.262000] [<ffffffff814eba44>] do_trap+0xc4/0x160
[77782.262000] [<ffffffff8100cd95>] do_invalid_op+0x95/0xb0
[77782.262000] [<ffffffffa01b0a1f>] ? vmxnet3_rq_rx_complete+0x9bf/0xbc0 [vmxnet3]
[77782.262000] [<ffffffff81151947>] ? cache_alloc_refill+0x2f7/0x580
[77782.262000] [<ffffffff8100be3b>] invalid_op+0x1b/0x20
[77782.262000] [<ffffffffa01b0a1f>] ? vmxnet3_rq_rx_complete+0x9bf/0xbc0 [vmxnet3]
[77782.262000] [<ffffffffa01b01fe>] ? vmxnet3_rq_rx_complete+0x19e/0xbc0 [vmxnet3]
[77782.262000] [<ffffffff81427c9c>] ? dev_hard_start_xmit+0x2bc/0x3f0
[77782.262000] [<ffffffff81276950>] ? swiotlb_map_page+0x0/0x100
[77782.262000] [<ffffffffa01b0da3>] vmxnet3_poll_rx_only+0x53/0x110 [vmxnet3]
[77782.262000] [<ffffffff8142bd62>] net_rx_action+0x102/0x2f0
[77782.262000] [<ffffffff8106cde1>] __do_softirq+0xc1/0x1e0
[77782.262000] [<ffffffff8100c1ac>] call_softirq+0x1c/0x30
[77782.262000] <EOI>
[77782.262000] [<ffffffff8100ddb5>] ? do_softirq+0x65/0xa0
[77782.262000] [<ffffffff8106cc7a>] local_bh_enable+0x9a/0xb0
[77782.262000] [<ffffffff8142c0eb>] dev_queue_xmit+0x19b/0x6f0
[77782.262000] [<ffffffff814639b0>] ? ip_finish_output+0x0/0x310
[77782.262000] [<ffffffff81463aec>] ip_finish_output+0x13c/0x310
[77782.262000] [<ffffffff81463d78>] ip_output+0xb8/0xc0
[77782.262000] [<ffffffff8146303f>] ? __ip_local_out+0x9f/0xb0
[77782.262000] [<ffffffff81463075>] ip_local_out+0x25/0x30
[77782.262000] [<ffffffff81463550>] ip_queue_xmit+0x190/0x420
[77782.262000] [<ffffffff8147821e>] tcp_transmit_skb+0x3fe/0x7b0
[77782.262000] [<ffffffff8147a5db>] tcp_write_xmit+0x1fb/0xa20
[77782.262000] [<ffffffff8147af90>] __tcp_push_pending_frames+0x30/0xe0
[77782.262000] [<ffffffff8146a5de>] tcp_push+0x6e/0x90
[77782.262000] [<ffffffff8146b5ba>] tcp_sendmsg+0x64a/0xa20
[77782.262000] [<ffffffff81417147>] sock_aio_write+0x167/0x180
[77782.262000] [<ffffffffa020969d>] ? hook_function+0x1d/0x30 [hades_wd]
[77782.262000] [<ffffffff81167a2a>] do_sync_write+0xfa/0x140
[77782.262000] [<ffffffff8108aa50>] ? autoremove_wake_function+0x0/0x40
[77782.262000] [<ffffffff8105e1f0>] ? pick_next_task_fair+0xd0/0x130
[77782.262000] [<ffffffff811fe706>] ? security_file_permission+0x16/0x20
[77782.262000] [<ffffffff81167df4>] vfs_write+0x184/0x1a0
[77782.262000] [<ffffffff81168711>] sys_write+0x51/0x90
[77782.262000] [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[77782.262000] Code: 00 00 00 c7 83 b4 00 00 00 00 00 00 00 31 c0 83 f1 01 88 8b b8 00 00 00 e9 1c fb ff ff 48 83 bb c8 00 00 00 00 0f 85 5a fb ff ff <0f> 0b eb fe 0f 0b eb fe 0f 0b 0f 1f 80 00 00 00 00 eb f7 8b 83
[77782.262000] RIP [<ffffffffa01b0a1f>] vmxnet3_rq_rx_complete+0x9bf/0xbc0 [vmxnet3]
[77782.262000] RSP <ffff8800282037c8>
crash-x86_64> mod -s vmxnet3 vmxnet.ko
mod: vmxnet3: cannot find or load object file: vmxnet.ko
crash-x86_64> mod -s vmxnet3 vmxnet3.ko
MODULE NAME SIZE OBJECT FILE
ffffffffa01b7ac0 vmxnet3 60947 vmxnet3.ko
crash-x86_64> gdb list * vmxnet3_rq_rx_complete+0x9bf
0xffffffffa01b0a1f is in vmxnet3_rq_rx_complete (/tmp/build_rpm_sandbox.vy6fT4/BUILD/vmware-open-vm-tools-9.0.0/vmxnet3/vmxnet3_drv.c:1467).
1462 PCI_DMA_FROMDEVICE);
1463 rxd->addr = cpu_to_le64(rbi->dma_addr);
1464 rxd->len = rbi->len;
1465
1466 } else {
1467 BUG_ON(ctx->skb == NULL && !skip_page_frags);
1468 /* non SOP buffer must be type 1 in most cases */
1469 BUG_ON(rbi->buf_type != VMXNET3_RX_BUF_PAGE);
1470 BUG_ON(rxd->btype != VMXNET3_RXD_BTYPE_BODY);
1471
The line was added by this commit: https://github.com/torvalds/linux/commit/5318d809d7b4975ce5e5303e8508f89a5458c2b6
Any ideas?
This looks more like an open-vm-tools problem to me. Do you really need it? vmxnet3 has been supported in the kernel sources since 2.6.32, and if you run incompatible versions of the kernel and open-vm-tools, you might get problems like this...
BTW, open-vm-tools is NOT the same as vmware-tools! Moreover, your open-vm-tools is quite old (9.0.0). The current stable branch is 9.4.0, so maybe it is time to update it (vmware-tools is even further along, IIRC at 9.6.x)...
Forgot to mention that I also applied the netpoll race condition patch ([PATCH] vmxnet3: fix netpoll race condition, Linux Network Development).
As far as I can tell, that fix is not related to the problem I'm seeing.
Thanks for mentioning the VM tools. We forgot to update our Linux repositories to reference 5.5 after upgrading, and VMware hasn't deployed the newer tools to the 5.1 repo. I just updated and tested, and their 5.5 repo is indeed serving the 9.4 branch.
@JarryG Indeed, upgrading vmware-tools to the latest tools from ESX 5.5 solved the problem, even though we run on ESXi 5.1.
