Aelirik
Contributor

ESXi 5 VM / Server 2008 R2 loses network connection after reboot

Hey guys,

I am fairly new to the forums but I have been dealing with ESXi for a while now.

I have recently been having an issue with some Server 2008 R2 Standard virtual machines on an ESXi 5 host where, when the host is rebooted, it loses its networking. The network adapter is still attached to the machine and the network addressing is all still intact, but I am unable to ping the default gateway.

I can ping other servers on the same subnet, but this obviously kills all connections for internet access. Since this is a Citrix terminal server, it's not such a helpful thing to have happen after applying updates.

The only way I have been able to resolve this so far is to disconnect the adapter while it is live, reboot the VM and, once it is back up and running, reconnect the network adapter. I have experienced this on other VMs hosted under this ESXi environment, but not all of them are experiencing the same problem. So far it has only been a few of them, all set up very similarly with VMXNET 3 adapters and the Server 2008 R2 OS.

One of them was a fresh install of Server 2008 R2, and after running some Windows updates it had the same issue as the one that I am currently having trouble with.

I have patched the servers with the latest updates from VMware, expecting that this might help resolve the problem; however, here I am looking for a bit more information or help on these issues.

I have spent a bit of time working through a few different steps to try to work out what is causing this, and tried removing and replacing adapters, but so far with no real luck. I am just lucky enough that I can get the VM back online without too much hassle at this stage.

Any help would be appreciated.

Cheers,

Tom

53 Replies
Josh26
Virtuoso

LEHPSTS wrote:

We have 2 switches trunked also. Why would a trunked switchport arbitrarily block 1 MAC address when others on that same vNIC in sequence work fine? I have this from two different VMs on two different switch ports from two different hosts.

Our situation is resolved when we set DHCP on the NIC, allow it to assign a private 169.xx IP, then reassign the static address it had to begin with. However, a reboot starts this process all over again.

Someone in an earlier post or alternate thread even changed the MAC to no avail...

If you are using a broken load balancing protocol like "route by IP hash" you would see exactly the scenario you describe.

LEHPSTS
Contributor

Single NIC, so no load balancing.

LEHPSTS
Contributor

Update: Removed the offending NIC from the VM (verified nothing hidden in Device Manager), uninstalled VMware Tools and shut down the VM. Removed the NIC from VM Settings and added a new VMXNET3 NIC (auto-assigned new MAC). Restarted the VM and reloaded VMware Tools to discover the NIC (VMXNET3). Restarted the VM and configured the appropriate VLAN tag in the adapter properties, waited for DHCP to time out and added the static address for the VLAN. Connectivity restored. Rebooted 3 times and holding strong.

Observable activity: Watched the switchport ARP table and the router ARP table. When the VM initializes, we see an ARP entry on VLAN1, presumably while the adapter initializes and then applies the appropriate VLAN tag configuration. We then see the same MAC address on the appropriate VLAN about 5-7 seconds later, and the VM is at the login screen. We saw this same activity prior to the above; however, before the above steps, we would also see an IP address assigned to the MAC in the router ARP table for VLAN1 (we have a small DHCP range on VLAN1 to allow for updates at build time). This assignment only happened PRIOR to the above. It has not happened on this VM since the above was completed. One could draw the conclusion that something in the VM network stack was hosed; it could even be something in the template. I am attempting to replicate on another problem VM. Will post results.
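
The ARP check described above (the same MAC showing up on VLAN1 as well as on its intended VLAN) can be done mechanically once the tables are exported. This is only a minimal Python sketch and not something from the posts: the `ArpEntry` record, the `macs_on_unexpected_vlans` helper, and the sample MAC, VLAN 20 and IP addresses are all hypothetical, and it assumes you have already turned the switch and router ARP output into (VLAN, MAC, IP) rows yourself.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class ArpEntry(NamedTuple):
    """One row from an exported ARP table (hypothetical layout)."""
    vlan: int
    mac: str
    ip: str


def macs_on_unexpected_vlans(
    entries: Iterable[ArpEntry],
    expected_vlan_by_mac: Dict[str, int],
) -> Dict[str, List[int]]:
    """Return {mac: [vlans]} for MACs seen with an IP outside their expected VLAN."""
    # Compare MACs case-insensitively, since devices print them differently.
    expected = {mac.lower(): vlan for mac, vlan in expected_vlan_by_mac.items()}
    stray = defaultdict(set)
    for entry in entries:
        mac = entry.mac.lower()
        if mac in expected and entry.vlan != expected[mac]:
            stray[mac].add(entry.vlan)
    return {mac: sorted(vlans) for mac, vlans in stray.items()}


if __name__ == "__main__":
    # Made-up sample data: the VM should live on VLAN 20 only.
    sample = [
        ArpEntry(1, "00:50:56:AA:BB:CC", "192.168.1.50"),
        ArpEntry(20, "00:50:56:aa:bb:cc", "10.0.20.15"),
    ]
    print(macs_on_unexpected_vlans(sample, {"00:50:56:aa:bb:cc": 20}))
```

A MAC reported here would match the "IP assigned on VLAN1" state seen before the NIC was rebuilt; an empty result would match the state after.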

flea59
Contributor

Were you able to replicate the problem, LEHPSTS? We're having the same problem; however, if I log onto the affected VM and disable/re-enable the NIC, the static address is restored. Thanks.

Aelirik
Contributor

I managed to solve our problem with this the other day. It simply came down to someone not doing their job correctly when implementing VLANs on the switch. When we moved these problematic VMs to a dvSwitch, they would play up and stop responding. When we moved them back to a host which wasn't on a dvSwitch, they would work fine after a bit of playing around with disconnecting / reconnecting the network adapters.

As our switches are in an HA setup, we checked the VLANs between the switches and noticed that the 2 specific VLANs we were having trouble with were not listed on the switches. When we put the VLANs on the trunk ports on the switches, it all started working fine. We could drag and drop VMs between hosts on the dvSwitch and couldn't get them to fail. All in all it came down to a networking fault, which just required a lot of time, troubleshooting and double-checking of the entire setup to make sure nothing was missed.

flea59
Contributor

In our case it turned out to be the way Cisco's IP Device Tracking feature interacted with the gratuitous ARP as implemented in Server 2008 R2. We have disabled IP Device Tracking for now. An alternative is to disable the gratuitous ARP via the DWORD:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\ArpRetryCount=0
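
For anyone who wants to script that registry change rather than edit it by hand, here is a rough Python sketch that builds the equivalent `reg add` command line. It is an illustration only: the `arp_retry_count_command` helper and the `--apply` switch are made up for this example, it assumes an elevated prompt on the Windows guest, and by default it just prints the command instead of changing anything.

```python
import subprocess
import sys
from typing import List

# Same key as in the post, using reg.exe's HKLM abbreviation.
TCPIP_PARAMETERS_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def arp_retry_count_command(value: int = 0) -> List[str]:
    """Build the `reg add` argument list that sets the ArpRetryCount DWORD."""
    if value < 0:
        raise ValueError("ArpRetryCount must not be negative")
    return [
        "reg", "add", TCPIP_PARAMETERS_KEY,
        "/v", "ArpRetryCount",
        "/t", "REG_DWORD",
        "/d", str(value),
        "/f",  # overwrite an existing value without prompting
    ]


if __name__ == "__main__":
    command = arp_retry_count_command(0)
    # Show the command so it can be reviewed or pasted into a prompt.
    print(subprocess.list2cmdline(command))
    # Only touch the registry when explicitly asked to, and only on Windows.
    if sys.platform == "win32" and "--apply" in sys.argv:
        subprocess.run(command, check=True)
```

Printing first keeps the change reviewable, which seems sensible for a setting that alters ARP behaviour on a production server.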

borgward
Contributor

We had this problem and noted that the NIC (inside Windows) is "allowed to go to sleep" in its properties.

Since we turned off this 'feature' we haven't had any of these problems.

Be sure to do this on the template if you deploy from template for new servers.

Al G.

Morpheus0026
Contributor

I had a similar problem. I ran a continuous ping to my VM host; when the Applying Computer Settings screen came up it would respond to pings for a bit, and then when the CTRL-ALT-DEL screen came up the replies stopped.

Despite the fact that my server was deployed from the same 2008 R2 template that all of my servers were deployed with, I found that the Windows firewall settings were not the same. I fixed my firewall settings and this resolved my issue.

SAHEALTHNOC
Contributor

We had a similar problem with template builds from an ESXi 5.1 host (build 1743533) that would drop their NIC settings every time the VM settings were updated or a migration was performed.

This KB fixed the issue and also indicates it will fix the issue on reboot as well.

http://kb.vmware.com/kb/2078352

supraturtle
Contributor

This thread is a little old, but I thought you all might appreciate my findings:

I have a clustered ESXi 5.1 environment with many subnets, many VM versions and various virtual hardware. In my findings, virtual adapter hardware, VM version, subnet, rules, load balancing... nothing seems to matter. All my Windows machines (7, 8, 2003, 2008, 2012) seem to exhibit this behavior after applying Windows updates (starting with updates from about September 2013 and later; 2k3 and 2k8 most commonly).

With fair consistency I have network connections go to "Unidentified Network" after applying Windows updates from about 1 year ago onward. After the reboot, network becomes unidentified.

My results also support that this behavior occurs after more than one update--so there may be several catalyst updates.

*** To fix this, I simply console in with vSphere, right-click the adapters in Network Connections and disable them. After a few moments' pause I re-enable them. The network picks right up again. ***
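
The same disable / pause / re-enable cycle can be scripted from inside the guest with `netsh`. This is a hedged Python sketch, not what the poster actually ran: the helper names and the adapter name "Local Area Connection" are assumptions, it needs an elevated prompt, and it has to be started from the VM console (or a scheduled task) because the network is down at that point.

```python
import subprocess
import sys
import time
from typing import List


def adapter_toggle_commands(adapter_name: str) -> List[List[str]]:
    """Return the netsh argument lists that disable, then enable, an adapter."""
    base = ["netsh", "interface", "set", "interface", adapter_name]
    return [base + ["admin=disabled"], base + ["admin=enabled"]]


def bounce_adapter(adapter_name: str, pause_seconds: float = 5.0) -> None:
    """Disable the adapter, wait a few moments, then enable it again."""
    if sys.platform != "win32":
        raise RuntimeError("netsh is only available on Windows")
    disable, enable = adapter_toggle_commands(adapter_name)
    subprocess.run(disable, check=True)
    time.sleep(pause_seconds)  # the "few moments' pause" from the post
    subprocess.run(enable, check=True)


if __name__ == "__main__":
    # "Local Area Connection" is a placeholder; use the name shown in
    # Network Connections on the affected VM.
    for command in adapter_toggle_commands("Local Area Connection"):
        print(subprocess.list2cmdline(command))
```

Running the file as-is only prints the two commands; calling `bounce_adapter(...)` on the guest performs the actual toggle.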

Rebooting, registry changes, all sorts of messing around just got me in deep water for this. I run test environments which are expected to be as default as possible.

The best my research turned up was an apparent security policy feature, introduced by certain updates delivered via Windows Update, that caused similar issues on other systems. If your machines install all updates automatically, this will appear to just 'magically' happen by the time you notice it. I manually monitor and apply my updates across the system, hence I noticed the drop-outs and was able to troubleshoot right away.

I hope this saves someone a lot of trouble...

Ratoka
Contributor

We are having exactly this issue.  Disabling and re-enabling the adapters fixes the issues.  We are on 5.5 though. 

vBrendan
Enthusiast

Did you ever find a permanent solution for this? Although disabling and re-enabling the adapter works, it does not help when you schedule work for the middle of the morning and have to check connectivity every time. Thanks

Ratoka
Contributor

Unfortunately not for us.  Strangely enough this is a really low occurrence for us as well.  I would say that about 1 in 20 reboots have the issue.

WillFulmer
Enthusiast

TerryAsnet
Contributor

Hi There,

I am also having the same issue, but running ESX 5.1. The issue, where I lose connectivity to the default gateway and to network devices on the same subnet, seems to occur only on Server 2012; 2008 R2 works fine. My resolution is to disable the network adapter in the VM and re-enable it to get it up and going.

I'm also getting this same issue on another ESX server running 5.5 with Windows Server 2012.

mikejroberts
Enthusiast

Just experienced the exact same issue on 5.5 update 2 with the latest patches.  So far I have only seen the issue with two VMs (out of hundreds that were rebooted) and all of the usual Windows troubleshooting was futile.  Only disconnecting or removing the NIC helped (tried vMotioning, rebooting, dumping the IP settings, disabling/enabling within Windows, repairing VMware tools, reinstalling tools, etc..).  Both of the VMs (2008 R2 and 2012 R2) were using VMXNet3 with 9.4.10.x tools. 

Ratoka
Contributor

I have noticed that this primarily happens on my VMs that have NLB installed and running.  Are you by chance seeing the same thing?

mikejroberts
Enthusiast

We don't use NLB and I was notified that we had the same issue with a RedHat VM over the weekend.  VMXNet3 seems to be the common factor.

Ratoka
Contributor

We have stayed away from the e1000 NICs due to instability in 5.5.  I don't have it bookmarked, but VMware has a KB saying to minimize the number of e1000 NICs; if I remember correctly there was a memory leak.

Found it: VMware KB: VMware ESXi 5.x host experiences a purple diagnostic screen mentioning E1000PollRxRing an...

mikejroberts
Enthusiast

We also moved away from E1000 and E1000E because of the PSOD issue.  This particular problem has been with VMXNet3.
