VMware Cloud Community
Muldov
Contributor

Random disconnects of guests after migrating to new DC

So, some background information.

We have 4 datacenters with 4 separate vCenter instances running on ESXi 6.0, one of which is about to be decommissioned, and a 10G OTV link that connects them. When we move certain boxes to the new DC, they randomly lose network connectivity, sometimes after an hour, sometimes after a few hours. The one we are using to test this at the moment is a UCS Central appliance 1.5(1b), which uses an E1000 adapter, and there is no way that I know of to flip that to VMXNET. To get it back, a vMotion works, or disconnecting and reconnecting the NIC from the guest's edit properties wakes it back up.

That said,

The hardware in the old DC is IBM, while the new DC runs on Cisco UCS. In the old DC, all the systems run fine with no issues. When we migrate them to the new DC, they lose NIC connectivity at random times. So far 2 boxes are experiencing this issue, but one is critical, and we're up against the wall on the deadline to be out of the old datacenter.

We haven't found any specific pattern to when or why. The only parallels are that both boxes are Linux and both appear to be appliances. One is HW version 10, the other 11. They are on different VLANs and run different flavors of Linux: one looks like it was built on Cisco's version of RHEL, locked down without the usual RHEL interface; the other is Emperor. One has updated Tools, the other has no Tools, but even with Tools installed the issue is the same.
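For anyone trying to pin down the timing, a minimal Python sketch along these lines could log exactly when a guest stops and starts answering pings, so the drops can be lined up against other events. The function names (`is_reachable`, `transitions`, `watch`) are made up for illustration, and the `ping` flags assume a Linux box doing the probing.

```python
import subprocess
import time
from datetime import datetime


def is_reachable(host, timeout_s=2):
    """Send a single ICMP echo using the system ping.

    Assumption: Linux ping, where -c is the packet count and -W is the
    reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def transitions(samples):
    """Return only the (timestamp, state) samples where the state changed.

    The first sample is always included, since it establishes the
    initial state.
    """
    changes = []
    previous = None
    for stamp, state in samples:
        if state != previous:
            changes.append((stamp, state))
            previous = state
    return changes


def watch(host, interval_s=5):
    """Probe forever and print a timestamped line on every UP/DOWN change."""
    previous = None
    while True:
        state = is_reachable(host)
        if state != previous:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {host} {'UP' if state else 'DOWN'}")
            previous = state
        time.sleep(interval_s)
```

Calling `watch()` with the guest's address from a box in each datacenter would give one log per vantage point to compare.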

The only similarity between the boxes is that both use the E1000 adapter and both are some flavor of Linux.

As far as the vDS goes, the only difference between the one that works and the one that does not is in teaming and failover: the one that works has a standby NIC with failback set to No, while the one that does not work has both NICs active and failback set to Yes.

Network failover detection is set to link status only, and the vDS has plenty of available ports.

The kicker is we have 100+ other VMs with similar configurations residing on the same VLANs as those 2 systems, and more than 500 others on other VLANs, all running with no problem at all.

No errors on the guest or host that I can find at all.

Any thoughts?

EDIT: So it seems any change at all to the portgroup wakes it back up. It died again, I simply flipped failback from Yes to No, and it started pinging again... I'm gonna go bald with this one.

7 Replies
RAJ_RAJ
Expert

Hi,

Please check the ESXi version of the hosts in the new DC and the old DC.

Also try removing and reinstalling VMware Tools on the VMs; use the guest-OS-managed VM tools.

RAJESH RADHAKRISHNAN VCA -DCV/WM/Cloud,VCP 5 - DCV/DT/CLOUD, ,VCP6-DCV, EMCISA,EMCSA,MCTS,MCPS,BCFA https://ae.linkedin.com/in/rajesh-radhakrishnan-76269335 Mark my post as "helpful" or "correct" if I've helped resolve or answered your query!
Muldov
Contributor

Hi Raj,

No luck. We've removed the Tools multiple times, with no luck. The new DC is on a newer release, but only recently; when this issue first popped up, they were all on the same release.

Also, I did manage to swap the NIC on the one guest from E1000 to VMXNET3, and it didn't help.

One guest has no tools at all.

RAJ_RAJ
Expert

Hi,

Please share the version and build number of ESXi. In 6.0 there is a known bug affecting host network connectivity.

Ref: ESXi 6.0 host loses network connectivity randomly (2124669) | VMware KB

Muldov
Contributor

The release we are on now is 6.0.0 4192238

RAJ_RAJ
Expert

Hi,

Could you please check the KB below? You may want to update the drivers for the ESXi host as well.

ESXi 6.0 host loses network connectivity randomly (2124669) | VMware KB

Muldov
Contributor

Hi Raj,

I engaged VMware Support and we're actually trying that right now: updating the NIC drivers on the host to the latest via esxcli ;) I'll update later if we have any luck with it. The Cisco drivers and firmware are all up to date.

Something else interesting we found is that not all VLANs lose connectivity to it. Servers on other hosts in other VLANs and datacenters are still able to access it, and L2 remains functional. It's almost like the route is being lost somewhere. Now the tricky thing is, why only those specific VMs, and why only from those source networks?
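To keep track of which source networks lose access while others keep working, something like the following rough Python sketch could summarize probe results collected from several vantage points. The function name `failing_sources` and the network and VM labels in the usage note are placeholders for illustration, not anything from our environment.

```python
from collections import defaultdict


def failing_sources(observations):
    """Group failed probes by target.

    observations: iterable of (source_network, target, reachable) tuples,
    for example gathered by pinging the guest from a box on each VLAN.
    Returns {target: sorted list of source networks that could not reach it};
    targets that every source can reach are left out.
    """
    failed = defaultdict(set)
    for source, target, reachable in observations:
        if not reachable:
            failed[target].add(source)
    return {target: sorted(sources) for target, sources in failed.items()}
```

Feeding it tuples such as `("net-a", "vm1", True)` and `("net-b", "vm1", False)` returns `{"vm1": ["net-b"]}`, which makes it easier to see whether the same source networks fail every time.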

Thanks for your replies, sir!

Muldov
Contributor

No luck with the enic updates...

However, here's an interesting update.

As we tested, when the system drops, we can still ping it from other datacenter locations and via L2.

Now, again, I mentioned this was moved from one datacenter to another. They are using an OTV link between the 2 to allow the VLANs to exist in both locations. If I ping that target guest from a box still physically located in the old datacenter, the box comes back...

 
