Neismark
Contributor

VMs randomly losing network connection on ESXi 6.5

Hi all,

I've been trying to figure this out for almost two days now and can't get anywhere...

I run ESXi 6.5 (HPE Custom image, installed on an SD card) on an HP DL380 G9 machine.

I started the ESXi host up, configured it and started creating VMs with the built-in web client - one Windows Server 2012 R2 and two Windows Server 2016. The first three VMs were fine: they got their updates from Microsoft and joined the Windows domain, and one machine was promoted to a DC. In other words, they had network access without problems, and each one rebooted several times without problems. I even copied several GB of data to one of them without problems.

In the meantime, I created a fourth Windows Server 2016 VM. After installing and updating it, I also installed VMware Tools on that one (I had left that out on the other ones).

This fourth VM came up but was not reachable from the local network. The OS also reported "no internet connection". The only IP I could successfully ping was that of the ESXi host itself. After SSHing into the ESXi host, I could also ping the VM from there.

I installed VMware Tools on the other VMs, and apparently from that point on the other VMs started losing network connectivity more or less randomly. I even saw one VM lose network connectivity while I was changing things on two other ones and hadn't touched that one...

While debugging for hours on end, I've seen all sorts of weird stuff. One VM could successfully ping three of its neighbor VMs, but not the fourth one, while at the same time being able to ping the Google DNS resolver, but neither its default gateway nor my workstation (and of course it wasn't reachable from there via ICMP or RDP).

To cut it short: I went through the VMware troubleshooting KB article from A to Z - to no avail.

My config:

- One Host with 128 GB RAM and two 6-core CPUs (No second host at this time)

- All 4 hardware NICs connected to a Netgear ProSafe switch

- One vSwitch, two port groups (Management and VMs)

- All VMs have static IPv4 addresses, IPv6 switched off for now

- VMs have 16 GB RAM and 1 CPU for a start

- Configured LACP on the switch over all 4 NICs

- Completely switched off STP on the switch

- Switched off the Windows firewalls completely

What I tried (most of it multiple times):

- Rebooting the VMs

- Disabling/Reenabling NIC in Windows

- Removing NIC and re-adding it

- Replacing the E1000 NIC with VMXNET3

- Different vSwitch configs (notify switches on/off, IP hash/originating port)

I'm even thinking about going back to the latest 6.0 and redoing everything from scratch, which would mean half a week of work going down the drain.

I would be really grateful for any hints on what else to check or test.

As I'm fairly new to VMware, maybe there's some stuff to be found out on the CLI?

Regards,

Mark

3 Replies
a_p_
Leadership

Welcome to the Community,

LACP is only supported with Distributed Virtual Switches. For Standard vSwitches I'd suggest you go with the default network configuration, i.e. multiple physical switch ports (tagged/untagged, as required) without any channeling, and the "Route Based on Originating Virtual Port" policy.

If you want to configure channeling anyway, please make sure you read https://kb.vmware.com/s/article/1004048 to find out whether your switch offers similar configuration settings.
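To illustrate why the teaming policy and the physical switch configuration need to match, here is a minimal Python sketch of how the two policies could pick an uplink. It is a simplified model, not VMware's actual implementation: the function names, the sample IP addresses and the port-ID modulo rule are assumptions made for illustration. The IP hash model (XOR of source and destination address, modulo the number of uplinks) follows the commonly documented description of that policy.

```python
import ipaddress


def uplink_by_ip_hash(src_ip: str, dst_ip: str, n_uplinks: int) -> int:
    """Simplified IP hash model: XOR both IPv4 addresses, modulo uplink count."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % n_uplinks


def uplink_by_port_id(virtual_port_id: int, n_uplinks: int) -> int:
    """Simplified originating-port model (assumption): one fixed uplink per virtual port."""
    return virtual_port_id % n_uplinks


if __name__ == "__main__":
    n_uplinks = 4                 # four physical NICs, as in the original post
    vm_ip = "192.168.1.50"        # hypothetical VM address
    destinations = ["192.168.1.1", "192.168.1.20", "192.168.1.51", "8.8.8.8"]

    # With IP hash, the chosen uplink depends on the destination address.
    for dst in destinations:
        print(f"IP hash: {vm_ip} -> {dst} uses uplink {uplink_by_ip_hash(vm_ip, dst, n_uplinks)}")

    # With originating port ID, the same VM always uses the same uplink.
    print(f"Port ID: virtual port 7 always uses uplink {uplink_by_port_id(7, n_uplinks)}")
```

In this model, IP hash spreads one VM's traffic over several uplinks depending on the destination, so it only works if the physical switch treats those ports as one static channel. Originating port ID pins each VM to a single uplink, so it expects plain, unchanneled switch ports. A mismatch between the two sides could plausibly lead to connectivity that works for some destinations and not for others.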

André

Neismark
Contributor

Thanks for the hint. I did switch that off again, but it wasn't the solution.

The solution was much easier: Reboot the ESXi host. Now everything works like a charm again...

bgwright04
Contributor

We're having a similar issue, but if we migrate the VM to another host, connectivity is restored.
