VMware Cloud Community
barrypsaha
Contributor

Recovered VMs - No network

Hi - I wonder if anyone has come across this before or has any suggestions / ideas.

We have 5 ESXi hosts in our primary data centre and 5 in our DR DC, all running VMware ESXi 6.7.0, build 18828794. Site Recovery Manager protection groups and recovery plans mirror the live VMs to the DR DC. Everything runs OK, but after running the recovery plans some VMs randomly come back up without a working network connection. Testing has shown this is not specific to any one VM, host or vNIC driver. Repairing the broken NIC also seems to require a different fix every time: sometimes a vMotion to another host fixes it, sometimes we have to add another vNIC, and sometimes changing the network the NIC is connected to and then changing it back fixes it. What I can say with certainty is that it never happens when recovering back to the primary DC. I can even recover a "disconnected" VM back and it's fine in the primary.

 

Anyone got any ideas or seen this before?

 

 

5 Replies
kjdfhaueiase
Enthusiast

Hi,
We are also seeing some VMs randomly drop off the network on 6.7. I am not sure what triggers it, but sometimes it happens after a vMotion.
In fact, we vMotioned the VM to 7 hosts in our cluster while continuously pinging it. On 4 of the 7 we got a ping response; on the other 3, no dice.

Yet later, one of the hosts we had tested successfully also dropped the VM from the network.

DRS is on, and backup recovery automation is happening too (I can see snapshots occurring in the events pane).

This machine is Linux while the other machines are Windows based. We were going to move the machine to a non-DRS cluster, but we now know it happens without vMotioning.

What didn't work:

- Upgrading the hardware version of the VM
- Changing the NIC from e1000 to VMXNET3

We are at a loss as to how to keep this machine up.

Another thing we noticed, which may or may not be related: someone who is given explicit rights to the machine (and no other rights) cannot fix the problem with a reboot or restart; however, cluster admins CAN remedy the problem with a reboot. So there could be some permission failures related to what's happening in the background. That's just a hunch, as nothing is showing up in the VM's log.

ngo_s
Contributor

Sounds like you need to look at the physical network infrastructure. For example, do you have PortFast (or the equivalent spanning tree feature on your switches) enabled on the physical switch ports that are connected to your ESXi hosts?

https://kb.vmware.com/s/article/1003804

I would troubleshoot by looking at the ARP tables on your physical switches to see whether they know where the VM is located (in terms of ESXi host) before and after a vMotion.
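To make that before/after comparison less error-prone, something like the rough Python sketch below could help. It assumes you have copied the switch's table output into text yourself, and that each entry looks like `<vlan> <mac> <type> <port>`; that layout, the MAC addresses and the port names are all made up for illustration, so adjust the parsing to whatever your switches actually print.

```python
def parse_mac_table(text):
    """Parse lines of the assumed form '<vlan> <mac> <type> <port>' into {mac: port}."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip headers, separator rows and blank lines.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        _vlan, mac, _entry_type, port = fields
        table[mac.lower()] = port
    return table


def mac_location_change(before, after, mac):
    """Return (port_before, port_after) for one MAC; None means 'not in table'."""
    mac = mac.lower()
    return before.get(mac), after.get(mac)


# Hypothetical switch output captured before and after a vMotion.
BEFORE = """\
Vlan    Mac Address       Type        Ports
----    -----------       ----        -----
  10    0050.56aa.bb01    DYNAMIC     Gi1/0/1
  10    0050.56aa.bb02    DYNAMIC     Gi1/0/2
"""

AFTER = """\
Vlan    Mac Address       Type        Ports
----    -----------       ----        -----
  10    0050.56aa.bb01    DYNAMIC     Gi1/0/1
  10    0050.56aa.bb02    DYNAMIC     Gi1/0/5
"""

if __name__ == "__main__":
    before, after = parse_mac_table(BEFORE), parse_mac_table(AFTER)
    for mac in sorted(set(before) | set(after)):
        old, new = mac_location_change(before, after, mac)
        status = "unchanged" if old == new else "moved"
        print(f"{mac}: {old} -> {new} ({status})")
```

If a VM was moved to another host but its MAC still shows on the old port (or has vanished), that points at the physical side rather than the guest.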

Another possible cause is a mismatch in the port group configuration on the target hosts if you are using standard vSwitches. A mismatched physical port configuration is also possible. If you are using a distributed virtual switch, ensure you have the proper VLAN configuration on the physical ports connected to each ESXi host's physical vmnics.
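As a rough illustration of the standard vSwitch check, here is a small Python sketch that flags port groups which are missing or carry a different VLAN ID on some hosts. It assumes you have already collected each host's port group names and VLAN IDs by whatever means you prefer; the host names, port group names and VLAN IDs below are hypothetical, and names are compared exactly as written.

```python
def find_portgroup_mismatches(host_portgroups):
    """Return {portgroup: {host: vlan_or_None}} for port groups that differ between hosts.

    host_portgroups maps host name -> {port group name: VLAN ID}.
    None in the result means the port group does not exist on that host.
    """
    all_names = set()
    for portgroups in host_portgroups.values():
        all_names.update(portgroups)

    mismatches = {}
    for name in sorted(all_names):
        per_host = {host: pgs.get(name) for host, pgs in host_portgroups.items()}
        # More than one distinct value means a different VLAN or a missing port group.
        if len(set(per_host.values())) > 1:
            mismatches[name] = per_host
    return mismatches


# Hypothetical inventory for three DR hosts.
inventory = {
    "dr-esx01": {"VM Network": 10, "App-VLAN20": 20},
    "dr-esx02": {"VM Network": 10, "App-VLAN20": 200},
    "dr-esx03": {"VM Network": 10},
}

if __name__ == "__main__":
    for name, per_host in find_portgroup_mismatches(inventory).items():
        print(f"Port group '{name}' differs between hosts:")
        for host, vlan in sorted(per_host.items()):
            print(f"  {host}: {'missing' if vlan is None else f'VLAN {vlan}'}")
```

With the sample data, only "App-VLAN20" is reported, because it has a different VLAN on dr-esx02 and is missing on dr-esx03.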

Other steps would be to ensure that all your ESXi hosts and peripheral devices are on the same firmware/driver versions, or to upgrade them to the latest supported on the VMware HCL - https://www.vmware.com/resources/compatibility/search.php. I'm surprised VMware support has not told you to do this yet.

 

depping
Leadership

9 out of 10 times this is a Spanning Tree misconfiguration indeed... Would also be my first guess.

CatherineLuke
Contributor

Having the same type of issue. I was using VMware Workstation, and then a minute later, after a reboot, I didn't have internet on any of my VMs anymore. I'm thinking the issue has something to do with a Windows update and possibly WSL. I'm on Windows 11. Also, I don't know if it's a related problem, but the Windows VM won't boot at all unless I set the number of CPUs down to 1, and it has no network in either case. The network adapter is on, but not seen by the guest. This issue has survived a complete reinstall of VMware Player. There was an error message.
