VMware Cloud Community
Bruticusmaximus
Enthusiast

Windows 2012 VM drops off network

This is going to be a long one. We have a ticket open with VMware, our hardware vendors, and Microsoft on this one. We're going in circles.

We have 2 VMs running on the same host. Over the past 2 weeks, these VMs have dropped off the network about 5 times. To fix it, you have to disable and re-enable the NIC.

  • While the VM is down, it can't ping anything: not the gateway, not another VM on the same host and virtual switch, not any other VM in the environment.
  • vMotion to another host doesn't fix it.
  • The VM is not under heavy load when this happens.
  • ipconfig /all looks normal.
  • Disconnecting and reconnecting the NIC at the VM hardware level doesn't fix it.
  • The Windows event logs contain entries that are symptoms of the VM being cut off from the network (can't resolve host names, authentication errors, etc.), but there is no event showing that the NIC disconnected.
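The checks above could be scripted so that each failure gets a timestamp and a rough classification. This is only a sketch under assumptions: the target addresses are placeholders from a documentation range, and `classify` and `check_once` are hypothetical helper names, not part of any tool mentioned in this thread.

```python
import platform
import subprocess
from datetime import datetime

# Placeholder addresses (documentation range); replace with real targets.
TARGETS = {
    "gateway": "192.0.2.1",
    "same_host_vm": "192.0.2.10",
    "other_host_vm": "192.0.2.20",
}


def ping(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo via the OS ping command; True if it exits with 0.

    Caveat: the exit code is only a rough signal. On Windows, a
    "Destination host unreachable" reply can still produce exit code 0.
    """
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    return subprocess.run(cmd, capture_output=True).returncode == 0


def classify(results: dict[str, bool]) -> str:
    """Summarize reachability: 'ok', 'isolated' (nothing answers), or 'partial'."""
    if all(results.values()):
        return "ok"
    if not any(results.values()):
        return "isolated"
    return "partial"


def check_once(targets=TARGETS, probe=ping) -> str:
    """Probe every target once and return a timestamped log line."""
    results = {name: probe(addr) for name, addr in targets.items()}
    stamp = datetime.now().isoformat(timespec="seconds")
    return f"{stamp} {classify(results)} {results}"
```

Running `check_once()` from a scheduled task every minute and appending the line to a file would give a timeline to compare against the event logs. An "isolated" result matches the symptom described here, where not even a VM on the same virtual switch answers.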

This is on a Vblock. VCE has looked at everything hardware-wise and sees nothing. There is no indication that there was even an outage.

VMware has looked at everything and sees nothing on the ESXi side. VMware Tools is up to date.

Microsoft is looking at it and still can't find anything.

Any thoughts? Any ideas on what else I can test for?

16 Replies
daphnissov
Immortal

Search for the commonalities first, then the distinctions. Setting aside the infrastructure layers, what do these VMs have in common? Are any other VMs experiencing this? Were they clones? New builds? Do they come from the same lineage? Also, what tests (other than those mentioned) have you performed at the OS/application layer to isolate the problem?

BerndtSchumann
Enthusiast

Although VMware Tools seems to be up to date, did you try reinstalling it?

Bruticusmaximus
Enthusiast

The VMs were created from the same template about 6 months ago. We have about 1400 VMs in our environment, about 300 of them with this same configuration, and these are the only two with this issue. I really wish it was happening on another, less critical server so I could beat this to death.

Bruticusmaximus
Enthusiast

We have considered this. We were going to uninstall Tools, delete the NIC, re-add the NIC, and install Tools again. Initially, VMware Tools was not up to date, so we updated it. I thought for sure that would fix it.

Microsoft wants to set up some tracing for when it happens again, so I'm not sure we'll make a change until after that.

Bruticusmaximus
Enthusiast

Here's what we're trying next if it happens again: boot into safe mode with networking. If we can ping from safe mode, that points to a driver issue in Windows (AV, backup agent, etc.). Both VMware and Microsoft recommended this. I'll let you know.

pragg12
Hot Shot

Hi,

From your description, it seems to me that the issue may be at the VM's OS level.

Has the VM hardware version been upgraded to the latest?

What's the Windows OS version?

Is all the software installed on these 2 VMs compatible with that Windows version?

Can you share a list of the software installed on these 2 VMs, highlighting what they have in common?

Is there any anti-virus software installed on these 2 VMs? If yes, can you disable or uninstall it and then check whether the issue recurs?

Consider marking this response as "Correct" or "Helpful" if you think my response helped you in any way.
jkastner
Contributor

I have had a similar situation. It ended up being the Windows Network Location Awareness service periodically setting the connection to a Public network instead of the Domain network, which made Windows Firewall block pretty much all connections. I find it does this quite often when using the VMXNET3 adapter instead of the e1000e.

Bruticusmaximus
Enthusiast

The OS is Server 2012 R2, fully patched.

SQL is the only thing running on one of the VMs. The other one is running just the Informatica application, which is fully compatible with 2012. We did a rebuild of this system last summer. We wanted to go with 2016, but we verified with the vendor that only 2012 is supported.

Anti-Virus software has been uninstalled.

Bruticusmaximus
Enthusiast

That's interesting. Was there anything in the Windows event logs that pointed to this? Our event logs don't even show a network disconnect warning. They just show things like "Can't resolve host name" and "Can't authenticate to AD" and stuff like that. They log the symptoms of the problem, but not the problem itself.

jkastner
Contributor

I was suffering from the same DNS resolution issues and not being able to ping anything on the network, and it was by chance that I happened to notice the network connection was set to Public. That tipped me off, and from there I started looking at the Windows Firewall logs and discovered it that way. The logs won't show a disconnect, because technically it is still connected. But the Public profile is set up by default to deny most traffic, which makes the connection basically useless. Solutions are to either disable the Network Location Awareness service or change the default so networks are assigned at least the Private profile.
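A minimal sketch of how the profile could be checked automatically during an outage, assuming PowerShell's `Get-NetConnectionProfile` cmdlet is available on the guest (it reports `NetworkCategory` as `Public`, `Private`, or `DomainAuthenticated`). The helper names are hypothetical.

```python
import subprocess


def parse_categories(output: str) -> list[str]:
    """Split cmdlet output into one category string per connection."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def is_restrictive(category: str) -> bool:
    """True for the Public profile, which blocks most inbound traffic by default."""
    return category.strip().lower() == "public"


def current_categories() -> list[str]:
    """Ask PowerShell for the category of every active connection (Windows only)."""
    cmd = [
        "powershell", "-NoProfile", "-Command",
        "(Get-NetConnectionProfile).NetworkCategory",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_categories(out)
```

On the affected guest, `any(is_restrictive(c) for c in current_categories())` returning True while the VM is unreachable would support the firewall-profile explanation; False would rule it out.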

Bruticusmaximus
Enthusiast

Thanks.  I'm going to add this to the things to check next time.

Bruticusmaximus
Enthusiast

UGH !!! It happened again just a little while ago. It was set to "Domain network" and not "Public" while the issue was happening. I was hoping it would say "Public" so I could go back to MS with something. Nice tip though.

pragg12
Hot Shot

Have you tried a TCP/IP stack reset?

Check this blog link: Truly reset the TCP/IP stack

If that doesn't work, or you have already tried it, try removing the IP configuration from the VM's NIC and then removing the NIC from the VM completely. Make sure the local administrator password works first, since you will be logging in at the console without a network. Reboot the VM without the NIC. Then add a new NIC to the VM (you can reuse the previous NIC's MAC address if required) and reconfigure the network settings.

Bruticusmaximus
Enthusiast

This is our next plan. Power down, remove the NIC, power back up, and make sure it didn't leave a ghost NIC in Device Manager. Power down again and add the NIC back in.

pragg12
Hot Shot

Let us know how it goes.

Bruticusmaximus
Enthusiast

We had RSS (Receive Side Scaling) turned on at the hardware level and on the NIC in the OS. We shut it off, and it hasn't happened in about 3 weeks now. Having said that, I'm sure it will happen tonight.

RSS is on by default.  We have it on for all 1400 of our VMs.  A lot are configured just like these four.  So .... it makes no sense but, I'm happy it fixed it for now.
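For anyone wanting to audit the guest-side RSS setting across many VMs, here is a rough sketch. It assumes the `Get-NetAdapterRss` PowerShell cmdlet is available in the guest and that its `Name` and `Enabled` properties are exported as JSON; the function names are hypothetical, and the thread does not say which method was actually used to turn RSS off.

```python
import json
import subprocess


def rss_enabled_adapters(json_text: str) -> list[str]:
    """Return the names of adapters whose RSS 'Enabled' flag is true.

    ConvertTo-Json emits a bare object for one adapter and an array for
    several, so both shapes are accepted. Empty output means no adapters.
    """
    if not json_text.strip():
        return []
    data = json.loads(json_text)
    if isinstance(data, dict):
        data = [data]
    return [adapter["Name"] for adapter in data if adapter.get("Enabled")]


def query_rss() -> list[str]:
    """Run the cmdlet locally (Windows only) and list RSS-enabled adapters."""
    cmd = [
        "powershell", "-NoProfile", "-Command",
        "Get-NetAdapterRss | Select-Object Name, Enabled | ConvertTo-Json",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return rss_enabled_adapters(out)
```

An empty list from `query_rss()` on the two problem VMs would confirm the guest-side change took effect. The matching cmdlet for turning it off on one adapter is `Disable-NetAdapterRss -Name <adapter>`.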
