VMware Cloud Community
chaddy
Contributor

Lost network connectivity ESX 4

Hi all,

This morning our second VMware server hit an issue that our first VMware server hit about 6 weeks ago. Our VMs intermittently became unresponsive over the network, and we couldn't connect to the service console. After rebooting via the CLI we were able to connect to the service console and start up the VMs, and everything works as normal. No alerts are listed in vCenter, but upon connecting to the CLI I found errors in /var/log/messages and /var/log/vmkwarning, as described in KB1017458. I verified that we do have the ESX400-201002401-BG patch installed, as mentioned in KB1017458 (it wasn't installed on the first server the last time we experienced this issue).

vmkwarning log:

Jun 30 07:31:25 cma32 vmkernel: 6:12:46:44.715 cpu0:4096)VMNIX: WARNING: NetCos: 1075: virtual HW appears wedged (bug number 90831), resetting

Jun 30 07:31:25 cma32 vmkernel: 6:12:46:44.715 cpu11:4119)WARNING: Net: 1205: non-forced disable with 128 packets in flight.

Jun 30 07:31:25 cma32 vmkernel: 6:12:46:44.715 cpu11:4119)WARNING: Net: 1210: forced disable with 128 packets in flight.

Jun 30 07:32:30 cma32 vmkernel: 6:12:47:47.811 cpu1:4238)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic0: transmit timed out

Jun 30 07:32:31 cma32 vmkernel: 6:12:47:50.795 cpu8:4230)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic0: transmit timed out

Jun 30 07:33:40 cma32 vmkernel: 6:12:48:59.824 cpu8:4242)WARNING: CpuSched: 965: world 4242(helper21-12) did not yield PCPU 8 for 4001 msec, refCharge=7996 msec, coreCharge=8329 msec,

Jun 30 07:36:00 cma32 vmkernel: 6:12:51:19.519 cpu5:4289)ALERT: Heartbeat: 518: PCPU 4 didn't have a heartbeat for 3780 seconds. may be locked up

Jun 30 07:36:00 cma32 vmkernel: 6:12:51:19.519 cpu5:4289)WARNING: NMI: 1612: Sending NMI IPI to PCPU 4 to get its backtrace (Src 1, Req 1)

Jun 30 07:36:00 cma32 vmkernel: 6:12:51:19.519 cpu4:4285)ALERT: NMI: 2001: NMI IPI received. Was eip(base):ebp:cs (Src 0x1)

Jun 30 07:36:00 cma32 vmkernel: 6:12:51:19.595 cpu13:4280)ALERT: Heartbeat: 518: PCPU 6 didn't have a heartbeat for 3780 seconds. may be locked up

Jun 30 07:36:00 cma32 vmkernel: 6:12:51:19.595 cpu13:4280)WARNING: NMI: 1612: Sending NMI IPI to PCPU 6 to get its backtrace (Src 1, Req 1)

Jun 30 07:36:00 cma32 vmkernel: 6:12:51:19.595 cpu6:4275)ALERT: NMI: 2001: NMI IPI received. Was eip(base):ebp:cs (Src 0x1)

Jun 30 07:39:41 cma32 vmkernel: 6:12:55:00.921 cpu0:4096)VMNIX: WARNING: NetCos: 1075: virtual HW appears wedged (bug number 90831), resetting

Jun 30 07:39:41 cma32 vmkernel: 6:12:55:00.921 cpu10:4117)WARNING: Net: 1205: non-forced disable with 128 packets in flight.

Jun 30 07:39:41 cma32 vmkernel: 6:12:55:00.921 cpu10:4117)WARNING: Net: 1210: forced disable with 128 packets in flight.

Jun 30 07:43:40 cma32 vmkernel: 6:12:59:00.071 cpu8:4237)WARNING: CpuSched: 965: world 4237(helper21-7) did not yield PCPU 8 for 4001 msec, refCharge=7998 msec, coreCharge=8331 msec,

Jun 30 07:47:00 cma32 vmkernel: 6:13:02:20.151 cpu0:4096)VMNIX: WARNING: NetCos: 1075: virtual HW appears wedged (bug number 90831), resetting

Jun 30 07:47:00 cma32 vmkernel: 6:13:02:20.151 cpu8:4117)WARNING: Net: 1205: non-forced disable with 128 packets in flight.

Jun 30 07:47:00 cma32 vmkernel: 6:13:02:20.151 cpu8:4117)WARNING: Net: 1210: forced disable with 128 packets in flight.

Jun 30 07:50:46 cma32 vmkernel: 6:13:06:02.321 cpu1:4242)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic1: transmit timed out

Jun 30 07:53:28 cma32 vmkernel: 6:13:08:48.321 cpu8:4240)WARNING: CpuSched: 965: world 4240(helper21-10) did not yield PCPU 8 for 4002 msec, refCharge=7995 msec, coreCharge=8241 msec,
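
For anyone comparing their own logs against this excerpt, here is a minimal sketch that tallies transmit timeouts per interface and counts the "wedged" resets. It assumes Python 3 run on a workstation against a copy of the log, not on the ESX host itself. The function and variable names are made up for illustration and are not part of any VMware tool.

```python
import re
from collections import Counter

# Patterns copied from the log excerpt above. The names below are
# illustrative only and do not come from any VMware utility.
WATCHDOG_RE = re.compile(r"NETDEV WATCHDOG: (\S+): transmit timed out")
WEDGED_RE = re.compile(r"virtual HW appears wedged")


def summarize(lines):
    """Return (timeouts per interface, number of 'wedged' resets)."""
    timeouts = Counter()
    wedged = 0
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            timeouts[match.group(1)] += 1
        elif WEDGED_RE.search(line):
            wedged += 1
    return timeouts, wedged


# Four lines taken from the vmkwarning excerpt in this post.
sample = [
    "Jun 30 07:31:25 cma32 vmkernel: 6:12:46:44.715 cpu0:4096)VMNIX: WARNING: "
    "NetCos: 1075: virtual HW appears wedged (bug number 90831), resetting",
    "Jun 30 07:32:30 cma32 vmkernel: 6:12:47:47.811 cpu1:4238)WARNING: LinNet: "
    "netdev_watchdog: NETDEV WATCHDOG: vmnic0: transmit timed out",
    "Jun 30 07:32:31 cma32 vmkernel: 6:12:47:50.795 cpu8:4230)WARNING: LinNet: "
    "netdev_watchdog: NETDEV WATCHDOG: vmnic0: transmit timed out",
    "Jun 30 07:50:46 cma32 vmkernel: 6:13:06:02.321 cpu1:4242)WARNING: LinNet: "
    "netdev_watchdog: NETDEV WATCHDOG: vmnic1: transmit timed out",
]

timeouts, wedged = summarize(sample)
print(dict(timeouts), wedged)
```

To run it over a whole file, pass an open file object instead of the sample list, for example `summarize(open("vmkwarning.copy"))`, where the file name is a placeholder for wherever you saved the log.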

messages log:

Jun 30 06:38:44 cma32 kernel: [560207.735093] NETDEV WATCHDOG: vswif0: transmit timed out

Jun 30 06:45:51 cma32 kernel: [560634.167440] NETDEV WATCHDOG: vswif0: transmit timed out

Jun 30 06:51:56 cma32 kernel: [560998.696585] NETDEV WATCHDOG: vswif0: transmit timed out

Jun 30 06:54:42 cma32 kernel: [561164.391621] NETDEV WATCHDOG: vswif0: transmit timed out

Jun 30 07:01:48 cma32 kernel: [561588.905511] NETDEV WATCHDOG: vswif0: transmit timed out

Jun 30 07:08:52 cma32 kernel: [562012.361168] NETDEV WATCHDOG: vswif0: transmit timed out

Jun 30 07:15:51 cma32 kernel: [562430.779261] NETDEV WATCHDOG: vswif0: transmit timed out

Jun 30 07:18:29 cma32 vobd: Jun 30 07:18:29.589: 563515795364us: [vprob.net.redundancy.degraded] Uplink redundancy degraded on virtual switch "vSwitch0". Physical NIC vmnic2 is down. 3 uplinks still up. Affected port groups: "Service Console", "VM Network", "VM Network", "VM Network", "VM Network", "VM Network".

Jun 30 07:24:20 cma32 kernel: [562939.132972] NETDEV WATCHDOG: vswif0: transmit timed out

Jun 30 07:31:25 cma32 kernel: [563362.547808] NETDEV WATCHDOG: vswif0: transmit timed out

Jun 30 07:32:31 cma32 vobd: Jun 30 07:32:31.270: 564321486766us: [vprob.net.redundancy.degraded] Uplink redundancy degraded on virtual switch "vSwitch0". Physical NIC vmnic0 is down. 2 uplinks still up. Affected port groups: "Service Console", "VM Network", "VM Network", "VM Network", "VM Network", "VM Network".

Jun 30 07:39:41 cma32 kernel: [563857.837700] NETDEV WATCHDOG: vswif0: transmit timed out

Jun 30 07:47:00 cma32 kernel: [564296.256806] NETDEV WATCHDOG: vswif0: transmit timed out
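
The vobd lines in this log show which uplinks dropped and how many were left each time. A rough sketch for pulling those events out as a timeline follows. It assumes Python 3 on a workstation with a copy of the messages log. The helper name `uplink_events` is hypothetical, and the pattern is based only on the two vobd lines shown here, so other ESX builds may word the message differently.

```python
import re

# Pattern derived from the two vobd lines in this post. It ignores the
# event tag before "Uplink", so it does not depend on how that tag is
# rendered. The names here are illustrative, not a VMware API.
UPLINK_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) .*"
    r'virtual switch "([^"]+)"\. '
    r"Physical NIC (\S+) is down\. (\d+) uplinks? still up"
)


def uplink_events(lines):
    """Return (timestamp, vswitch, nic, uplinks_remaining) per event."""
    events = []
    for line in lines:
        match = UPLINK_RE.search(line)
        if match:
            stamp, vswitch, nic, remaining = match.groups()
            events.append((stamp, vswitch, nic, int(remaining)))
    return events


# The two vobd lines from the messages excerpt, plus one kernel line
# that the pattern should skip.
sample = [
    'Jun 30 07:18:29 cma32 vobd: Jun 30 07:18:29.589: 563515795364us: '
    '[vprob.net.redundancy.degraded] Uplink redundancy degraded on virtual '
    'switch "vSwitch0". Physical NIC vmnic2 is down. 3 uplinks still up. '
    'Affected port groups: "Service Console", "VM Network", "VM Network", '
    '"VM Network", "VM Network", "VM Network".',
    'Jun 30 07:24:20 cma32 kernel: [562939.132972] NETDEV WATCHDOG: vswif0: '
    'transmit timed out',
    'Jun 30 07:32:31 cma32 vobd: Jun 30 07:32:31.270: 564321486766us: '
    '[vprob.net.redundancy.degraded] Uplink redundancy degraded on virtual '
    'switch "vSwitch0". Physical NIC vmnic0 is down. 2 uplinks still up. '
    'Affected port groups: "Service Console", "VM Network", "VM Network", '
    '"VM Network", "VM Network", "VM Network".',
]

events = uplink_events(sample)
for stamp, vswitch, nic, remaining in events:
    print(stamp, vswitch, nic, "down,", remaining, "uplinks left")
```

Seeing the remaining-uplink count fall over time makes it easier to tell a single flapping port from the whole card going away.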

Also, the NIC we are using is a 4-port NetXen HP NC375i, which has some documented problems, although most of those are about not all 4 ports working, and all 4 are working for me. This issue prompted me to do what I should've done a while ago: set up the service console on its own Intel NIC, so now just the VMs are running on 3 ports of the NetXen NIC.

Any ideas what could be causing this?

Thanks,

-Chad

6 Replies
Simonds
Contributor

I have the same issue now. Just upgraded to new hosts with this card (NC375T) running latest firmware 4.0.530 and the hosts will randomly lose network connectivity...

HELP?

Mackopes
Enthusiast

We had so many problems with the quad-port NetXen cards in our DL370 G6s that we ended up ripping them ALL out and replacing them with Broadcom cards...

Aaron

kesparlat
Enthusiast

I have the same problem. In my case, the physical adapter shows the link as up, but the Service Console detects it as down. It seems like a driver error.

I've upgraded to this driver (4.0.570):

http://downloads.vmware.com/d/details/esx4x_qla_nx_nic_dt/ZHcqYmRAdyViZHdlZQ

Best regards.

EdZ
Contributor

We are seeing the same thing happening in our environment with ESX 4.0. One of the VMs intermittently loses network connectivity, and it can be restored by clicking the "connected" box in the VM under network settings.

Thanks,

Ed

vGuy
Expert

Hello EdZ,

If you are facing a physical NIC issue, all of your VMs will be affected, not just one. I suggest you double-check the settings of the affected VM. For example, the VM port group should be the same on all the ESX hosts in the cluster... HTH

kesparlat
Enthusiast

I submitted an SR with HP (the server vendor), and they confirmed the bug with the brocade driver. I'm waiting for a newer one.

Currently, each of the NICs is paired with one from a different chipset (Intel) to avoid loss of network connectivity on the hosts.
