VMware Cloud Community
COS
Expert

VMs lose network connectivity randomly

We have been experiencing some VMs losing network connectivity sporadically. A VM stays online for a while, then suddenly acts like it is no longer on the network. Everything appears to be correct: the vNICs are connected and you can get to the console. But I can't ping the VM from outside the host, and the VM can't ping its default gateway.

Hardware is four HP Gen8 LFF servers with HP NC364T quad-port NICs, running ESXi 5 U1 with a clustered vCenter Server.

Anyone experience this?

If I vMotion it to another host, it comes back online. That's been our temporary solution.

Thanks

26 Replies
hussainbte
Expert

SSH to the host as root and run the commands below.

1) net-stats -l

Note the port number for the VM.

Then run:

2) vsish

3) cd net/portsets/<switchname>/ports/<port number from step 1>/   (switchname is your virtual switch name; it can be a standard switch or a DVS)

4) cat teamUplink

This will tell you which uplink the VM is currently using.
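For reference, steps 1-4 above can be scripted in one pass. This is just a sketch: the VM name "MyVM" and switch name "vSwitch0" are placeholders, and the sample net-stats output is made up to show the column layout (the last column is the client name). `vsish -e get` reads the same node as the interactive cd/cat sequence.

```shell
# Sketch for an ESXi shell; "MyVM" and "vSwitch0" are placeholders.
# Sample of what "net-stats -l" prints (one line per switch port):
sample='PortNum          Type SubType SwitchName       MACAddress         ClientName
33554433            4       0 vSwitch0         aa:bb:cc:dd:ee:01  vmnic0
50331650            5       9 vSwitch0         aa:bb:cc:dd:ee:02  MyVM.eth0'

# Step 1: pull the port number for the VM out of the listing.
# On the host you would pipe the real command instead:
#   port=$(net-stats -l | awk '$NF ~ /^MyVM/ {print $1}')
port=$(printf '%s\n' "$sample" | awk '$NF ~ /^MyVM/ {print $1}')
echo "$port"

# Steps 2-4: read the active uplink for that port without an
# interactive vsish session (run this on the ESXi host itself):
#   vsish -e get /net/portsets/vSwitch0/ports/$port/teamUplink
```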

If you found my answers useful please consider marking them as Correct OR Helpful.

Regards, Hussain
https://virtualcubes.wordpress.com/
coolsport00
Enthusiast

@RSNTeam -

Thank you for your response! I certainly don't wish problems on your environment, but I'm glad to hear someone else is experiencing similar issues. It's been frustrating that I haven't been able to find *anything* on this.

I mean, seriously: at the beginning of our 6U3 > 6.5U1 migration, something as SIMPLE as disconnecting a host from the old VCSA and then connecting it to the new VCSA - no host upgrades yet, no VMware Tools upgrades yet, just a new vCenter - and VMs, though pingable, had services that stopped working, be it the web UI, domain controller services (DNS, directory services), etc. Why would connecting a host with running VMs attached to the host's vSS cause such an issue? To be fair, rebooting the VMs took care of those initial problems, but I certainly wonder why it happened.

Then, after I began upgrading hosts and connecting them to a vDS (ver 6.5 vDS), that's when we noticed the other communication oddities I mentioned earlier (i.e. a Cisco voice server not getting NTP info from the cluster master, the occasional VM not being pingable, or latent communication within our network, etc.). Don't get it. 😕

Since you're having issues on an older version of vSphere, I'd like to ask if you've heard about the issue VMware has/had with their vmxnet3 network adapter? A few years ago, when we upgraded our environment from 5.5 > 6.0, we experienced EXTREME latency on VMs running the vmxnet3 network adapter. We mainly saw this in an "app cluster" (web, app, and db VM servers) when the VMs were on different hosts. When we placed the app VMs on the same host, the latency went away. What we ended up having to do was change the network adapter back to an E1000. I think 6U3 resolved that issue, though. This article shares several issues experienced with vmxnet3, if you haven't seen it: https://vinfrastructure.it/2016/05/several-issues-vmxnet3-virtual-adapter/ . Maybe this is what you all are experiencing? Although you did say your environment is fine when VMs/hosts are migrated back to a vSS, so I'm not sure. Just thought I'd share it.

Anyway, thank you for responding. Maybe someone has a suggestion? I haven't received a response on my communities post yet. 😕

Regards.

RSNTeam
Contributor

A happy and prosperous new year to everyone!!

coolsport00

You're welcome 🙂. I am also struggling to get some hints at the root cause of this problem.

I cannot imagine that we are affected by the VMXNET3 problem, because what we experience is the following:

Our VMs are configured like that:

The 1st interface is for management traffic and is connected to a vSS local to each ESXi server.

The 2nd interface is for productive traffic and is connected to a vDS.

The connection problems, which occur regularly, are experienced ONLY on the 2nd interface for productive traffic.

While the issue occurs, the affected VMs are

- not able to ping their gateway

- not able to reach other hosts in their network, EXCEPT for VMs running on the SAME ESXi host

So any traffic that would leave the ESXi host does not come back.

So what we now suspect, of course, is that there must be some problem with the distributed switch we are using.

But neither VMware support nor my research on the web has pointed me in this direction. There is no evidence that this can be related to a distributed switch.

VMware support told us that they suspect a problem with the MAC tables on some physical switch.

This seems to be a valid suspicion, but it does not explain why our problem does not occur with the 1st interface for management traffic.

We have not had a single incident where the 1st interface was affected. It was, in every single case, the 2nd interface.
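One way to test the "frames never leave the host vs. physical switch drops them" question is to capture at both the VM's switchport and the physical uplink. A hedged sketch, assuming ESXi 5.5 or later (which ships the pktcap-uw utility); the port number and uplink name below are placeholders for your own values:

```shell
# Sketch: check whether outbound frames actually reach the physical
# uplink. Port number and uplink name are placeholders.
port=50331650
uplink=vmnic0

# Build the two capture commands (run them on the ESXi shell itself,
# in two sessions, while pinging the gateway from the affected guest):
cmd_port="pktcap-uw --switchport $port -o /tmp/port.pcap"
cmd_uplink="pktcap-uw --uplink $uplink -o /tmp/uplink.pcap"
echo "$cmd_port"
echo "$cmd_uplink"

# If the ICMP request shows up in port.pcap but not in uplink.pcap,
# the frame dies inside the host's virtual switch; if it appears in
# both, the physical-switch MAC table theory gains weight.
```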

If someone does have some idea what could be going on, please share it!

Greetings,

RSNTeam

jcb_sw
Contributor

Hello,

We encountered a similar kind of issue this morning.

A virtual machine had lost its network connection.

We could only ping its own IP address; we didn't try to ping other VMs on the same ESXi host.

We couldn't ping the Gateway.

We tried removing the network card and re-creating a new one, but it didn't solve our issue.

Finally, the issue was solved by connecting the network card to a different port ID on the dvSwitch.

We're opening a case with VMware to understand the root cause of the issue.

Morpheus187
Contributor

Hello

I experienced a similar problem today.

Multiple VMs lost network connection. We started troubleshooting and found out that the machines could ping each other on the same host, but not anything outside on the network.

The strange thing was that not all machines on that host were affected; just a few of them had this kind of problem, and we could resolve some by disconnecting and reconnecting the NIC inside the VM.

But two machines still refused to get back on the network, so we restarted the first host, and they worked on that host; but when we moved them back to the original host, they stopped again. So we rebooted the second host, and all problems were gone. Every machine works on every host again.

I did some more intense troubleshooting on an affected machine and installed Wireshark. I could see the machine desperately sending out ARP requests, asking for the MAC of the default GW, without an answer. But I could also see the default GW sending ARP requests, asking my machine for its MAC address, which made absolutely no sense to me.

I triple-checked the switch configuration and everything else I could check, without finding any misconfiguration.

It just looked to me like ESXi was eating some packets on some machines, and only a restart of the host resolved the problem.
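The symptom described above (both sides ARPing with no answer getting through) is consistent with a stale MAC table entry somewhere in the path. A hedged workaround sketch from inside a Linux guest: send gratuitous ARP so upstream switches relearn the VM's MAC. The interface name and IP address are placeholders; arping here is the iputils version.

```shell
# Sketch of a guest-side workaround for a stale MAC table entry:
# announce our own MAC with gratuitous ARP so upstream switches can
# relearn it. "eth0" and the address are placeholders for your VM.
iface=eth0
addr=192.168.1.50

# -U = unsolicited (gratuitous) ARP announcing $addr from $iface.
# Build the command here; run it on the affected guest as root:
cmd="arping -U -c 3 -I $iface $addr"
echo "$cmd"
```

If the gratuitous ARP restores connectivity until the next incident, that points at the MAC learning path (physical switch or vDS port) rather than at the guest itself.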

jgyles
Contributor
Contributor

Hello COS​ - were you ever able to get a solution for this? I know it is quite old, but I'm hoping someone can shed some light on this problem. We are experiencing it now, and there doesn't seem to be much information on the web about it.

ThereAreSomeWho
Contributor

I am experiencing this problem on some new hosts that I'm hoping to migrate to. My virtual standard switch contains four active adapters. Two are using the BNX2 driver and two are using the IGN driver. The Intel adapters with the IGN driver are the ones that seem to fail every 20 days. This is happening on two different servers connected to two different physical switches.

I've never experienced this issue on our old hardware. However, with the old hardware, I have a virtual standard switch with two active adapters instead of four. One of those adapters is also using the BNX2 driver, while the other is using the ne1000 driver.

So, the major differences I see between the two sets of hardware are:

  • different models of Intel NICs
  • the Intel NICs on the new hardware support SR-IOV (though it's not enabled), while no other NIC does
  • the virtual switch on the server without the issue contains two physical adapter ports, while the virtual switch on the new servers contains four

My manager wants me to try to get this going without stealing hardware from the old servers to put in the new. He's convinced I have an incorrect setting somewhere. I haven't found the "stop working after 20 days" setting yet though.
