6 Replies Latest reply on Dec 15, 2010 1:24 AM by kesparlat

    Lost network connectivity ESX 4

    chaddy Novice

       

      Hi all,

       

       

      Just this morning this issue happened on our second VMware server; it happened 6 weeks ago on our first VMware server. Our VMs intermittently became unresponsive over the network and we couldn't connect to the service console. After rebooting via the CLI we were able to connect to the service console and start up the VMs, and everything works as normal. No alerts are listed in vCenter, but upon connecting to the CLI I found errors in /var/log/messages and /var/log/vmkwarning, as described in KB1017458. I verified that we do have the ESX400-201002401-BG patch mentioned in KB1017458 installed (it wasn't installed on the first server the last time we experienced this issue).

       

       

       

       

       

      vmkwarning log:

       

       

      Jun 30 07:31:25 cma32 vmkernel: 6:12:46:44.715 cpu0:4096)VMNIX: WARNING: NetCos: 1075: virtual HW appears wedged (bug number 90831), resetting

      Jun 30 07:31:25 cma32 vmkernel: 6:12:46:44.715 cpu11:4119)WARNING: Net: 1205: non-forced disable with 128 packets in flight.

      Jun 30 07:31:25 cma32 vmkernel: 6:12:46:44.715 cpu11:4119)WARNING: Net: 1210: forced disable with 128 packets in flight.

      Jun 30 07:32:30 cma32 vmkernel: 6:12:47:47.811 cpu1:4238)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic0: transmit timed out

      Jun 30 07:32:31 cma32 vmkernel: 6:12:47:50.795 cpu8:4230)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic0: transmit timed out

      Jun 30 07:33:40 cma32 vmkernel: 6:12:48:59.824 cpu8:4242)WARNING: CpuSched: 965: world 4242(helper21-12) did not yield PCPU 8 for 4001 msec, refCharge=7996 msec, coreCharge=8329 msec, 

      Jun 30 07:36:00 cma32 vmkernel: 6:12:51:19.519 cpu5:4289)ALERT: Heartbeat: 518: PCPU 4 didn't have a heartbeat for 3780 seconds. may be locked up

      Jun 30 07:36:00 cma32 vmkernel: 6:12:51:19.519 cpu5:4289)WARNING: NMI: 1612: Sending NMI IPI to PCPU 4 to get its backtrace (Src 1, Req 1)

      Jun 30 07:36:00 cma32 vmkernel: 6:12:51:19.519 cpu4:4285)ALERT: NMI: 2001: NMI IPI received. Was eip(base):ebp:cs 0x4785dd(0x418008a00000):0x4100c05ef3e8:0x4010(Src 0x1)

      Jun 30 07:36:00 cma32 vmkernel: 6:12:51:19.595 cpu13:4280)ALERT: Heartbeat: 518: PCPU 6 didn't have a heartbeat for 3780 seconds. may be locked up

      Jun 30 07:36:00 cma32 vmkernel: 6:12:51:19.595 cpu13:4280)WARNING: NMI: 1612: Sending NMI IPI to PCPU 6 to get its backtrace (Src 1, Req 1)

      Jun 30 07:36:00 cma32 vmkernel: 6:12:51:19.595 cpu6:4275)ALERT: NMI: 2001: NMI IPI received. Was eip(base):ebp:cs 0x4785cb(0x418008a00000):0x4100c059f3e8:0x4010(Src 0x1)

      Jun 30 07:39:41 cma32 vmkernel: 6:12:55:00.921 cpu0:4096)VMNIX: WARNING: NetCos: 1075: virtual HW appears wedged (bug number 90831), resetting

      Jun 30 07:39:41 cma32 vmkernel: 6:12:55:00.921 cpu10:4117)WARNING: Net: 1205: non-forced disable with 128 packets in flight.

      Jun 30 07:39:41 cma32 vmkernel: 6:12:55:00.921 cpu10:4117)WARNING: Net: 1210: forced disable with 128 packets in flight.

      Jun 30 07:43:40 cma32 vmkernel: 6:12:59:00.071 cpu8:4237)WARNING: CpuSched: 965: world 4237(helper21-7) did not yield PCPU 8 for 4001 msec, refCharge=7998 msec, coreCharge=8331 msec, 

      Jun 30 07:47:00 cma32 vmkernel: 6:13:02:20.151 cpu0:4096)VMNIX: WARNING: NetCos: 1075: virtual HW appears wedged (bug number 90831), resetting

      Jun 30 07:47:00 cma32 vmkernel: 6:13:02:20.151 cpu8:4117)WARNING: Net: 1205: non-forced disable with 128 packets in flight.

      Jun 30 07:47:00 cma32 vmkernel: 6:13:02:20.151 cpu8:4117)WARNING: Net: 1210: forced disable with 128 packets in flight.

      Jun 30 07:50:46 cma32 vmkernel: 6:13:06:02.321 cpu1:4242)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic1: transmit timed out

      Jun 30 07:53:28 cma32 vmkernel: 6:13:08:48.321 cpu8:4240)WARNING: CpuSched: 965: world 4240(helper21-10) did not yield PCPU 8 for 4002 msec, refCharge=7995 msec, coreCharge=8241 msec,

       

       

      messages log:

       

       

      Jun 30 06:38:44 cma32 kernel: [560207.735093] NETDEV WATCHDOG: vswif0: transmit timed out

      Jun 30 06:45:51 cma32 kernel: [560634.167440] NETDEV WATCHDOG: vswif0: transmit timed out

      Jun 30 06:51:56 cma32 kernel: [560998.696585] NETDEV WATCHDOG: vswif0: transmit timed out

      Jun 30 06:54:42 cma32 kernel: [561164.391621] NETDEV WATCHDOG: vswif0: transmit timed out

      Jun 30 07:01:48 cma32 kernel: [561588.905511] NETDEV WATCHDOG: vswif0: transmit timed out

      Jun 30 07:08:52 cma32 kernel: [562012.361168] NETDEV WATCHDOG: vswif0: transmit timed out

      Jun 30 07:15:51 cma32 kernel: [562430.779261] NETDEV WATCHDOG: vswif0: transmit timed out

      Jun 30 07:18:29 cma32 vobd: Jun 30 07:18:29.589: 563515795364us: [vprob.net.redundancy.degraded] Uplink redundancy degraded on virtual switch "vSwitch0". Physical NIC vmnic2 is down. 3 uplinks still up. Affected port groups: "Service Console", "VM Network", "VM Network", "VM Network", "VM Network", "VM Network".

      Jun 30 07:24:20 cma32 kernel: [562939.132972] NETDEV WATCHDOG: vswif0: transmit timed out

      Jun 30 07:31:25 cma32 kernel: [563362.547808] NETDEV WATCHDOG: vswif0: transmit timed out

      Jun 30 07:32:31 cma32 vobd: Jun 30 07:32:31.270: 564321486766us: [vprob.net.redundancy.degraded] Uplink redundancy degraded on virtual switch "vSwitch0". Physical NIC vmnic0 is down. 2 uplinks still up. Affected port groups: "Service Console", "VM Network", "VM Network", "VM Network", "VM Network", "VM Network".

      Jun 30 07:39:41 cma32 kernel: [563857.837700] NETDEV WATCHDOG: vswif0: transmit timed out

      Jun 30 07:47:00 cma32 kernel: [564296.256806] NETDEV WATCHDOG: vswif0: transmit timed out
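
      To make it easier to compare the two hosts, here is a rough sketch of how the excerpts above could be tallied per interface. It is only an illustration: the function name `summarize`, the `SAMPLE` list (three watchdog lines and one "wedged" line copied from the vmkwarning excerpt), and the two regular expressions are my own assumptions, not anything from the KB. It also assumes a recent Python 3 on a workstation holding a copy of the logs, not the Python shipped in the ESX 4 service console.

```python
import re
from collections import Counter

# Patterns taken from the message text in the log excerpts (assumed stable).
WATCHDOG = re.compile(r"NETDEV WATCHDOG: (\w+): transmit timed out")
WEDGED = re.compile(r"virtual HW appears wedged")


def summarize(lines):
    """Count watchdog transmit timeouts per interface and 'wedged' resets."""
    timeouts = Counter()
    wedged_resets = 0
    for line in lines:
        match = WATCHDOG.search(line)
        if match:
            timeouts[match.group(1)] += 1  # e.g. vmnic0, vmnic1, vswif0
        elif WEDGED.search(line):
            wedged_resets += 1
    return timeouts, wedged_resets


# Four lines copied from the vmkwarning excerpt, used as sample input.
SAMPLE = [
    "Jun 30 07:31:25 cma32 vmkernel: 6:12:46:44.715 cpu0:4096)VMNIX: WARNING: NetCos: 1075: virtual HW appears wedged (bug number 90831), resetting",
    "Jun 30 07:32:30 cma32 vmkernel: 6:12:47:47.811 cpu1:4238)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic0: transmit timed out",
    "Jun 30 07:32:31 cma32 vmkernel: 6:12:47:50.795 cpu8:4230)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic0: transmit timed out",
    "Jun 30 07:50:46 cma32 vmkernel: 6:13:06:02.321 cpu1:4242)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic1: transmit timed out",
]

if __name__ == "__main__":
    timeouts, wedged_resets = summarize(SAMPLE)
    for nic, count in sorted(timeouts.items()):
        print(f"{nic}: {count} transmit timeouts")
    print(f"wedged resets: {wedged_resets}")
```

      The same watchdog pattern should also match the vswif0 lines in the messages log, so passing an open file object for a copy of either log to `summarize` would give per-interface counts for the whole file.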

       

       

       

       

       

      Also, the NIC we are using is a 4-port NetXen HP NC375i, which has some documented problems, although most of those problems involve not all 4 ports working, and all 4 are working for me. This issue prompted me to do what I should have done a while ago: set up the service console on its own Intel NIC, so now only the VMs are running on 3 ports of the NetXen NIC.

       

       

      Any ideas what could be causing this?

       

       

      Thanks,

       

       

      -Chad