VMware Cloud Community
COS
Expert

VMs lose network connectivity randomly

We have been experiencing some VMs losing network connectivity sporadically. A VM stays online for a while, then suddenly acts as if it is no longer on the network. Everything appears to be configured correctly: the vNICs are connected and I can get to the console. But I can't ping the VM from outside the host, and the VM can't ping its default gateway.

Hardware is four HP ProLiant Gen8 LFF servers with HP NC364T quad-port NICs, running ESXi 5 U1, clustered under vCenter Server.

Has anyone experienced this?

If I vMotion the VM to another host, it comes back online. That has been our temporary workaround.

Thanks

26 Replies
dkraut
Enthusiast

Define "acts like it is no longer on the network". If you can ping to/from the VM, it is on the network. Maybe you are having a name resolution (DNS) issue?

COS
Expert

From within the VM, I cannot ping anything but itself (localhost). There is no reply when pinging the default gateway, so name resolution to anything else on the network obviously fails as well.

From outside the VM, other VMs on the same subnet and the same DV switch get no reply when pinging either the affected VM's DNS name or its IP address. Those other VMs are live on the network, meaning they respond to pings and can authenticate to all resources on the network.

So it's acting like it is not on the network.

Again, if I vMotion it to another host, it starts replying to pings and *is* on the network.

Now, for kicks, I vMotioned it back onto the host it was on when it dropped off the network, and it stays on the network... weird, I know!

I can't seem to replicate it on demand. We only find out when customers call in saying their VM is no longer reachable.

COS
Expert

Let me also add what we did to troubleshoot:

We disconnected the vNIC and reconnected it. Failed.

We set the VM to DHCP and then put the static IP back. Failed.

We rebooted the VM. Failed.

We removed the NIC from the VM, booted it up, removed the ghosted NIC device, then added the NIC back. Failed.

We restored the VM from a snapshot. Failed.

The only thing that seems to work is vMotioning it to another host.

technobro1
Contributor

I can also report this, on a host with a Realtek 8168 Gigabit Ethernet NIC. The host logged:

Cannot connect to the specified gateway 192.168.1.1. Failed to set it.
error   9/30/2012 11:48:13 AM   localhost.localdomain

Lost network connectivity on virtual switch "vSwitch0". Physical NIC vmnic0 is down. Affected portgroups: "Management Network".
error   9/30/2012 11:48:13 AM   localhost.localdomain

I disabled the virtual NIC and re-enabled it, and the connection came back.

Windows 64-bit guest, ESXi 5.1.

dkraut
Enthusiast

Sorry, I misread your original post. We had a similar problem many moons ago, but I'm not sure it's relevant. Are you using different port groups and DRS? Are the VMs being moved around when this occurs? If so, do you have the same port groups on all ESXi hosts? What was happening with us was that occasionally a VM would be vMotioned from one host to another, but the new host did not have the correct port group/VLAN, so it would lose connectivity until we either created the necessary port group or vMotioned it back to a host that had the correct port group.
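
If you want a quick way to compare the networking config host by host from the ESXi shell, something along these lines works (a rough sketch; on a vDS the port groups are defined centrally in vCenter, so the first command mainly applies to standard vSwitches):

esxcli network vswitch standard portgroup list   # standard port groups on this host with their vSwitch and VLAN ID
esxcli network vswitch dvs vmware list           # distributed switches this host participates in, with their uplinks

Running those on each host in the cluster and diffing the output makes a missing port group or a wrong VLAN ID stand out quickly.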

COS
Expert

All hosts in the cluster have the appropriate port groups. They're all configured via host profiles.

It's just weird that when I vMotion the VM off the current host, say "Host10", to another host, say "Host20", the network comes back online. And when I vMotion it back from "Host20" to the original "Host10", the network stays online. It's like networking for the affected VM gets "hung" until it is vMotioned.

Yes, DRS is enabled and VMs move dynamically.

So if the port groups/VLANs were incorrect, the VM's network should go offline again when I vMotion it back to the original host. But that's not what happens.

Scratching my head on this one...

karthickvm
VMware Employee

Hello COS,

I suspect the issue is with the physical switch. Please follow the steps below; they may not resolve the issue, but they should at least help narrow it down.

1. When the issue occurs, i.e. when you are not able to ping the VM, check the physical switch's MAC address table and see whether the VM's NIC MAC address is still in it (a command sketch follows after these steps).

2. Also, at the time of the issue, try pinging the other VMs in the same port group on the same ESXi host.

3. If you are able to ping VMs within the same host and port group, then the physical switch needs to be checked.

4. If you are not able to ping the VM even from within the same ESXi host, then the host's network configuration needs to be re-validated.
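
A rough sketch of steps 1 and 2, assuming a Cisco IOS access switch (command names vary by vendor and IOS version; the MAC and IP below are just placeholders):

show mac address-table address 0050.56xx.xxxx   # on the pSwitch: is the VM's MAC learned, and on which port? (older IOS: show mac-address-table)
ping <affected-VM-IP>                            # from another VM in the same port group on the same host: this path never leaves the vSwitch

If the second test works while the outside world cannot reach the VM, the vSwitch side looks healthy and the suspicion shifts to the pSwitch.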

I hope this helps sort it out.

Karthic.
vRNI TPM
kermic
Expert

Agree with karthickvm

Sounds like a physical switch issue to me as well. The main reason: when a VM is migrated via vMotion, one of the last steps of the migration is that the destination host sends out a notification so the physical switch updates its MAC tables (basically the host is telling the pSwitch that the VM's MAC address now lives on the port attached to the destination host). If the VM regains network access right after migration, it looks like the problem clears once the MAC tables are updated on the pSwitch.

I'd probably ask my network admin to take a look at the pSwitch.

Other things to check:

Are all VMs affected or only some? If only some, are there any signs of MAC conflicts anywhere (log entries in the guest, duplicate-MAC errors on the pSwitch)?

Which pNIC load balancing policy are you using? Route based on IP hash can show similar symptoms in some cases if the pSwitches are not EtherChannel capable.
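
For reference, on a standard vSwitch both settings can be read straight from the ESXi shell (a hedged sketch, assuming the uplinks hang off vSwitch0; on a vDS you would look at the port group's teaming and failover policy in vCenter instead):

esxcli network vswitch standard policy failover get -v vSwitch0   # shows Load Balancing and Notify Switches, among others
esxcli network vswitch standard list                              # standard vSwitches on the host and their uplinks

"Notify Switches" is the setting that drives the MAC-table update described above, so it is worth confirming it has not been turned off.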

WBR

Imants

alvinswim
Hot Shot

I had that exact same problem with ESX 4/4.1, and it turned out to be one of our core switches acting up. We had Dell, Cisco, and VMware all working with us and no one could figure it out. One day we decided to reboot our core switches, and the problem went away as mysteriously as it showed up. We didn't have physical wiring issues or anything. I think over time there is some sort of buildup of state that can cause this situation, but that's entirely a guess. I'd say, if you can, just reload all your physical switches.

here's the link to my previous post:

http://communities.vmware.com/thread/319531

good luck..

COS
Expert

Still working on the issue. We're making some changes on the hosts. I'll post what we did and the results when we're done.

Thanks everyone!

OB_Juan
Contributor

We're running into the EXACT same issue with only one of the hosts in a five-node cluster. Running ESXi 5.0 build 469512 on a Dell PowerEdge R610. When it happens, it doesn't happen to all the guests. This morning I had two guests on there, and only one lost its network connectivity. Changing it to another network doesn't help, but like COS said, if we vMotion it to another host, the network comes back. And if we vMotion it back to the "bad host", the network stays connected.

I haven't noticed it being triggered by anything in particular, but before this latest occurrence I had storage-migrated the VM to another datastore. The migration finished at 4:06 PM and we started getting ping failures directly after. So I vMotioned it to another host, and back to the bad one, and it's as happy as can be (for now).

Please let me know if you guys find anything!

OB_Juan
Contributor

I checked the CDP (Cisco Discovery Protocol) information for both NICs in the Configuration tab, under Networking, and compared the info from the "bad" host to an unaffected host in the same cluster. ONE of the NICs on the bad host is plugged into a switch port on a different VLAN.

I have a ticket open for our switch guys to check it out. It would make sense: if only one of the host's uplink NICs is configured improperly, only the guests that happen to use that NIC lose connectivity, while the others hum along just fine on the properly configured NIC.

I'll let you know what happens, but this seems to be the smoking gun.
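
For anyone who wants the same CDP details without going through the vSphere Client, they can also be pulled on the host itself; a hedged sketch (vmnic0 is only an example, and the option syntax may vary slightly between builds):

vim-cmd hostsvc/net/query_networkhint --pnic-name=vmnic0   # CDP neighbor info for that uplink: switch device ID, port ID, VLAN

Comparing that output across hosts is a quick way to spot an uplink patched into the wrong switch port or VLAN.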

alvinswim
Hot Shot

When we faced this issue, we shut down one of the two NICs on the hosts. Basically, when the NICs are teamed, vSphere chooses which NIC to send a VM's traffic out of based on virtual port ID (and load, I think).

For example:

On Host A, VM1 happens to be on vmnic0. Let's say you lose connectivity there and you vMotion to Host B, where VM1 comes up on vmnic1. You will likely end up blaming the first host for the bad connectivity.

I was able to track down this behaviour because at the time we still had console access and access to esxtop. I haven't tried it with 5.0, but I imagine esxtop is still there if you enable the shell. Anyway, the test here would be to disable vmnic0 on Host A and force everything onto vmnic1 (commands sketched below). If things come back to life, then you have one of three things:

1. A bad network segment, i.e. a bad access switch or a bad core switch on that segment

2. A bad cable on vmnic0

3. A bad vmnic0

Either way, in my case we had a bad core switch on one segment that affected vmnic0 on all hosts. By that point we had rebooted all of our switches and that cleared the issue, so we never pinpointed whether rebooting only the one bad switch would have been enough. We had it replaced a few weeks later.
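
On ESXi 5.x the uplink test described above can be done from the shell; a minimal sketch, assuming the suspect uplink is vmnic0 (substitute your own vmnic name, and plan for the remaining uplink carrying all traffic during the test):

esxcli network nic list              # current link state of every uplink on the host
esxcli network nic down -n vmnic0    # take the suspect uplink down; teaming moves the VMs to the remaining uplink
# ... retest connectivity to the affected VM ...
esxcli network nic up -n vmnic0      # bring the uplink back when the test is done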

OB_Juan
Contributor

That's a good approach, alvinswim. Our switch guys wrote back and said that indeed, one of the ports on the physical switch was only configured for one of our supported VLANs. So they added the other, and so far, so good.

I'm not looking at my vCenter right now... is there a way to find out which host NIC a guest is using at a given time?

dkraut
Enthusiast

Go here and grab a free copy of RVTools. http://www.robware.net

It will provide a wealth of info regarding the network setup on your hosts and VMs.

alvinswim
Hot Shot

Enable SSH, remote into the host, and at the command prompt type esxtop.

In esxtop, hit the letter "n" for the network view, and you will see which VM is on which vmnic.
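
If you prefer a one-off query to a live esxtop screen, esxcli can report the same mapping; a hedged sketch (the World ID is a placeholder you get from the first command):

esxcli network vm list                     # running VMs with their World ID, name and networks
esxcli network vm port list -w <world-id>  # for that VM: port ID, MAC address and the "Team Uplink" (vmnicX) currently in use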

RSNTeam
Contributor

Hi COS,

@all:

Sorry to reply to such an old thread, but we have the exact same problem here.

I stumbled upon your thread while googling for this exact problem, because we are currently experiencing the same issue in our VMware environment.

As I did not find any similar threads:

Did you find a solution to this problem back in 2012, and if so, how did you solve it?
I also suspect the network hardware is the problem, but of course I need proof to give to the network admins before they start analyzing on their side...

Would be very nice to hear from you as we are currently in the dark with this phenomenon.

Thanks in advance!

Cheers, RSNTeam

coolsport00
Enthusiast

@RSNTeam -

I *just* posted a network issue we are having in 6.5U1 here: Odd Network Issues Since Migrating Environment From vSphere 6U3 > 6.5U1

Can you read my post and see if you are having something similar? Not all of our VMs have a network issue or show as 'down'. I think it's mostly degraded communication rather than hard 'down' type stuff, although we have had a VM become unpingable (glad that part is rare). What we noticed is that it seems to happen solely when VMs run on a vDS, not a vSS. Curious to hear about your issue.

Thanks.

RSNTeam
Contributor

coolsport00

We are having a very similar issue to yours.

But we are currently using vSphere 6.0 U3, not 6.5 U1.

We regularly observe VMs that are suddenly unable to communicate via certain NICs.

These NICs are always the ones connected to the vDS. The NICs connected to the vSS are always fine.

Our vDS switches are still at version 5.5. We currently suspect this could be the issue and plan to upgrade them to 6.0 as well.
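
As a starting point for narrowing it down, the vDS as seen from an affected host can be dumped from the shell; a rough sketch (switch names, World IDs and vmnic names are placeholders for your environment):

esxcli network vswitch dvs vmware list     # each vDS this host participates in, with its uplinks and MTU
esxcli network vm port list -w <world-id>  # for an affected VM: its port, MAC address and current Team Uplink (vmnicX)

Noting which uplink the affected VMs were pinned to at failure time is also the kind of evidence the network team can check against their MAC tables.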

If that does not solve the issue, we will shift our analysis towards the pSwitches. As stated somewhere above, others found the cause of this problem in a faulty MAC table somewhere on the pSwitches.

But it will be hard to gather enough evidence for the network admins to start their analysis...

It would be good to know how COS resolved his issue back in 2012.

Greetings,

RSN Team
