VMware Cloud Community
tehpwn
Contributor

Network Layer stops working randomly

Hello everyone,

we've had a very strange problem since upgrading a Fujitsu Primergy RX300 S5 server (48 GB RAM) from ESXi 5.1 to 6.0.

Almost every morning from 7 to 8 p.m. the network layer of the machine stops working, i.e. the whole network of the physical server gets cut off and no connection to the VMs is possible.

Restarting the network from the console doesn't help, but I assume the VMs are still up and running, since pinging them from the hypervisor's console works.

The logs are not very comprehensive; right now we are trying to find out what exactly can cause the network layer to stop working.

There is another ESXi server in our network that has worked flawlessly (so far) with 6.0. Its hardware is mostly the same, and it runs on the same network switch.

Anyone here have an idea what could actually cause this behaviour?

1 Solution

Accepted Solutions
tehpwn
Contributor

So I put the new NIC in the machine on Friday; we'll see what happens.

Edit: OK, it's been a week now since the last incident and the installation of the new network card. With all due caution, I'll declare this solved: the new card seems to work.

19 Replies
Alistar
Expert

Hello,

this might be caused by a driver/firmware mismatch on your hosts. I have had the network stack crash on one ESXi host that ran a different firmware version from our standard. Have you made sure that both are up to date? Also, vmkernel.log and vmkwarning.log would help a lot if you could capture them for the time window of an incident. The logs are very chatty, and I think we can retrieve useful information from them.

Cheers!

Stop by my blog if you'd like 🙂 I dabble in vSphere troubleshooting, PowerCLI scripting and NetApp storage - and I share my journeys at http://vmxp.wordpress.com/
BoneTrader
Enthusiast

Did you check the HCL?

I don't think the S5 is certified for 5.5 or higher.

Sateesh_vCloud

I agree with BoneTrader: this hardware is not supported for ESXi 6.0.

1.PNG

------------------------------------------------------------------------- Follow me @ www.vmwareguruz.com Please consider marking this answer "correct" or "helpful" if you found it useful T. Sateesh VCIX-NV, VCAP 5-DCA/DCD,VCP 6-NV,VCP 5 DCV/Cloud/DT, ZCP IBM India Pvt. Ltd
laura_g
Contributor

Actually, we had the same issue last night, 24 hours after upgrading to ESXi 6.0 on an HP DL585 G7 with Broadcom 5719 NICs, and that server is definitely supported. It was upgraded using the HP custom ISO, with the latest HP bundle updates and the latest NIC firmware applied as well.

The network went down and restarting it from the console didn't help. The logs contained "LinNet: netdev_watchdog:3678: NETDEV WATCHDOG: vmnic1: transmit timed out" for multiple vmnics, and while I was logged on to the console the CPU seemed to spike and the host became unresponsive for periods of time.

It doesn't help that you apparently cannot edit the allocated resources for the ESXi host anymore, so the host CPU is capped at 279 MHz.
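For anyone who wants to check which vmnics are affected, here is a minimal sketch that counts the watchdog timeouts per NIC in a copy of vmkernel.log. It only assumes that the message looks like the one quoted above; the sample lines are illustrative, not taken from a real log, and the function name is made up for this example.

```python
import re
from collections import Counter

# Matches the watchdog message quoted in this thread, e.g.
# "LinNet: netdev_watchdog:3678: NETDEV WATCHDOG: vmnic1: transmit timed out"
WATCHDOG_RE = re.compile(r"NETDEV WATCHDOG: (vmnic\d+): transmit timed out")


def count_watchdog_timeouts(lines):
    """Return a Counter mapping vmnic name -> number of transmit timeouts."""
    counts = Counter()
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative sample lines (assumed format, not copied from a real host).
sample_lines = [
    "LinNet: netdev_watchdog:3678: NETDEV WATCHDOG: vmnic1: transmit timed out",
    "some unrelated vmkernel message",
    "LinNet: netdev_watchdog:3678: NETDEV WATCHDOG: vmnic0: transmit timed out",
    "LinNet: netdev_watchdog:3678: NETDEV WATCHDOG: vmnic1: transmit timed out",
]

for nic, hits in sorted(count_watchdog_timeouts(sample_lines).items()):
    print(nic, hits)
```

To use it on a real log, pass an open file object instead of `sample_lines`, since the function accepts any iterable of lines.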

tehpwn
Contributor

Hello all,

thank you for your answers. We'll definitely look into the firmware issue; maybe there is an update available from Fujitsu.

And yes, we should've checked the HCL more thoroughly before upgrading. We will try to provide the logs next time this happens.

If this persists, what do you think our options are? Downgrade to 5.1? Is that actually possible?

Or maybe put some new NICs into the server?

The reason for the upgrade was that we had added RAM to the server and ran into the 32 GB limit of ESXi 5.0.

Regards,

tehpwn
Contributor

OK, so it happened again two days ago, this time in the early evening.

I've got hold of a new network card (Intel Pro/1000) and will put it in the server over the weekend. Does anyone know how the hypervisor reacts to a change of network cards?

wallstru
Contributor

FYI, I have had this happen twice to one of my cluster hosts, about a week apart. My hosts are Dell PowerEdge R710 servers, and this is happening with a clean install and configuration of ESXi 6.0 on both of them. I sent my logs to VMware support and was told the following:

We see "LinNet: netdev_watchdog:3678: NETDEV WATCHDOG:" alerts in the vmkernel logs, which are causing the host to lose connectivity. This is a known bug in ESXi 6.0 and currently we do not have a resolution for it. The temporary fix is to reboot the host so that it can recover the network settings.

For now we recommend rolling back to ESXi 5.5 (http://kb.vmware.com/kb/1033604, "Reverting to a previous version of ESXi").

Just thought I'd pass this info along for anyone else who encounters this error and comes searching for it. I just received some instructions from VMware support because they want to gather debug logs to try to track the issue down. I'll update this thread with more info if I get it.

wally

controlac
Contributor

I believe we're having the exact same problem on our R810 host. We recently updated to ESXi 6.0 using the Dell customized bundle and now I'm seeing the NETDEV_WATCHDOG error in the vmkwarning.log. Did you end up finding a solution, or did you revert as recommended? Just wondering the best way to proceed at this point.

wallstru
Contributor

i wasn't able to revert because i had wiped and reinstalled 6.0 from scratch on these hosts.  vmware support is waiting to hear back from me with some details on my environment because they want to run some diagnostic code on my hosts to see if they can pinpoint the problem.  it's the beginning of the school year though, so i haven't had time to get back to them with the info yet (and luckily, i haven't had any more host crashes lately either).

wally

virtualkingraj
Contributor

The Fujitsu Primergy RX300 S5 server is not listed as supported in the HCL for 6.0.

controlac
Contributor

Just a quick update. I opened a case with VMware, and they were able to confirm this is an issue with ESXi 6.0.

They provided a workaround via a script that must be run on each affected 6.0 host. It must also be reapplied manually each time the host reboots. This seems to have cleared up the problem, and rumor has it that a permanent fix will be released sometime in September or early October.

simplyskeptic
Contributor

Can you share this script? I'm having the same issue with the same hardware. Thanks!

controlac
Contributor

They didn't provide me with the script itself; the engineering team ran it directly on our servers. If you open a support case, they should be able to do this for you.

simplyskeptic
Contributor

Thanks for the reply.  I'm using the free version of ESXi, so I don't believe that I'm entitled to make that support call.  Hopefully someone else can add to this thread a method of working around the problem until a patch is publicly provided.

dexterous
Contributor

Here's the KB covering this issue:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=212466...

The script is attached to the above case and to this post but here's a text version:

#!/bin/python
# Python 2 syntax (print statements). Run on the ESXi host itself.
# Argument 1 pins every MSI-X interrupt to the pcpu it currently targets;
# argument 0 hands the interrupts back to automatic control.
import sys
from vmware import vsi

INTR_TRACKER_PREFIX = "/sched/intrTracker/intrCookies/"
INTR_HARDWARE_PREFIX = "/hardware/interrupts/cookieList/"

if len(sys.argv) != 2:
   print "Usage: ./stop_intr_mig.py <stopFlag>"
   print "stopFlag: 1 to stop, 0 to resume"
   sys.exit(1)

intrCookies = vsi.list(INTR_TRACKER_PREFIX)
shouldStop = int(sys.argv[1])

for intr in intrCookies:
   intr = int(intr)
   intrType = vsi.get(INTR_HARDWARE_PREFIX + "%d/CookieInfo" % intr)['type']
   if intrType != 3:
      print "skipping intr %d" % intr
      continue # skip non-MSIX interrupts
   if (shouldStop != 0):
      # Pin the interrupt to the pcpu it is currently delivered to.
      pcpu = vsi.get(INTR_TRACKER_PREFIX + "%d/stats" % intr)['currDest']
      vsi.set(INTR_TRACKER_PREFIX + "%d/manageManual" % intr, pcpu)
      print "pinning interrupt %d to pcpu %d" % (intr, pcpu)
   else:
      # Return the interrupt to automatic management.
      vsi.set(INTR_TRACKER_PREFIX + "%d/manageAuto" % intr, 0) # 0 is ignored
      print "restore interrupt %d to auto control" % intr
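The decision logic of the script above can be followed without an ESXi host. The sketch below is a dry run in Python 3: it replaces the `vsi` calls with a plain dictionary, so the sample data, the function name `plan_actions` and the constant `INTR_TYPE_MSIX` are all made up for illustration. It only reports what the script would do and changes nothing.

```python
INTR_TYPE_MSIX = 3  # the script treats type 3 as MSI-X (per its own comment)


def plan_actions(interrupts, should_stop):
    """Return the action the script would take for each interrupt.

    `interrupts` maps cookie -> {"type": int, "currDest": int}. This dict is a
    stand-in for the VSI nodes that the real script reads on the host.
    """
    actions = []
    for cookie in sorted(interrupts):
        info = interrupts[cookie]
        if info["type"] != INTR_TYPE_MSIX:
            # Non-MSI-X interrupts are left alone in both modes.
            actions.append((cookie, "skip", None))
        elif should_stop:
            # Stop migration: pin to the pcpu the interrupt currently targets.
            actions.append((cookie, "pin", info["currDest"]))
        else:
            # Resume: hand the interrupt back to automatic control.
            actions.append((cookie, "auto", None))
    return actions


# Made-up sample data, not taken from a real host.
fake_interrupts = {
    16: {"type": 1, "currDest": 0},
    50: {"type": 3, "currDest": 2},
    51: {"type": 3, "currDest": 5},
}

for cookie, action, pcpu in plan_actions(fake_interrupts, should_stop=True):
    print(cookie, action, pcpu)
```

The point to take away is that the workaround does not choose a new CPU for each interrupt; it freezes whatever destination the interrupt has at the moment the script runs, which fits with having to rerun it after every reboot.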

controlac
Contributor

Here are more detailed instructions on how to apply the script. Remember that it must be run again each time the affected host is restarted.

1. Copy the script (stop_intr_mig.py) to an available datastore.
2. Connect to the affected host through SSH and run the following commands:
3. cd /vmfs/volumes
4. ls
5. cd "datastore name"
6. chmod +x stop_intr_mig.py
7. /bin/python stop_intr_mig.py 1   (no space between /bin and /python)

satish_halemani
Enthusiast

We have multiple ESXi hosts, and copying this Python script to each one and running it manually will take a lot of effort.

Is there any way to copy the script to multiple ESXi hosts remotely and run it on all of them in one go?

Thanks in advance

Satish
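One possible approach, sketched here under several assumptions: SSH is enabled on every host, key-based login as root works from your workstation, `scp` and `ssh` are installed locally, and the datastore name contains no spaces. The host names and the datastore name are placeholders. The script runs on your workstation under Python 3 and defaults to a dry run that only prints the commands.

```python
import subprocess

SCRIPT = "stop_intr_mig.py"


def build_commands(host, datastore, user="root"):
    """Return the scp and ssh command lines for one host (nothing is executed)."""
    remote_script = f"/vmfs/volumes/{datastore}/{SCRIPT}"
    copy_cmd = ["scp", SCRIPT, f"{user}@{host}:{remote_script}"]
    # Invoking the interpreter directly, so no chmod +x is needed.
    run_cmd = ["ssh", f"{user}@{host}", f"/bin/python {remote_script} 1"]
    return copy_cmd, run_cmd


def apply_to_hosts(hosts, datastore, dry_run=True):
    """Copy and run the script on each host; with dry_run, only print the commands."""
    for host in hosts:
        for cmd in build_commands(host, datastore):
            print(" ".join(cmd))
            if not dry_run:
                # check=True stops at the first host where a command fails.
                subprocess.run(cmd, check=True)


# Hypothetical host and datastore names; replace with your own.
apply_to_hosts(["esxi01.example.com", "esxi02.example.com"], "datastore1")
```

Call `apply_to_hosts(..., dry_run=False)` once the printed commands look right. Since the workaround does not survive a reboot, the same call would have to be repeated after any host restarts.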

sandsturm911
Contributor

Unbelievable, VMware has released the patch for this problem:

VMware KB: VMware ESXi 6.0, Patch ESXi600-201510401-BG: Updates esx-base

I will install it today on my test infra and see if this helps...