Aelirik
Contributor

ESXi 5 VM / Server 2008 R2 loses network connection after reboot

Hey guys,

I am fairly new to the forums but I have been dealing with ESXi for a while now.

I've recently been having an issue with some Server 2008 R2 Standard virtual machines on an ESXi 5 host: after a reboot, the machine loses its networking. The network adapter is still attached to the machine and the network addressing is all intact, but I am unable to ping the default gateway.

I can ping other servers on the same subnet, but this obviously kills all connections for internet access. Since this is a Citrix terminal server, it's not a helpful thing to have happen after applying updates.

The only way I have been able to resolve this so far is to disconnect the adapter while the VM is live, reboot the VM and, once it is back up and running, reconnect the network adapter. I have experienced this on other VMs hosted in this ESXi environment, but not all of them have the problem. So far it has only been a few of them, all set up very similarly with VMXNET 3 adapters and the Server 2008 R2 OS.

One of them was a fresh install of Server 2008 R2, and after running some Windows updates it had the same issue as the one that I am currently having trouble with.

I have patched the servers with the latest updates from VMware expecting that this might resolve the problem; however, here I am looking for a bit more information or help on these issues.

I spent a bit of time working through a few different steps to try to work out what is causing this, and tried removing and replacing adapters, but so far with no real luck. I'm just lucky that I can get the VM back online without too much hassle at this stage.

Any help would be appreciated.

Cheers,

Tom

53 Replies
bapcare
Contributor

I am currently having this same issue as well. We just installed 3 new physical servers running VMware ESXi 5.5.0 2069190.

My upgrade process was:

Set up the new physical servers (VMware, updates, network, etc.).

Connect the new servers to the same Fibre Channel storage.

Remove the servers from inventory on the old vCenter.

Add them to inventory on the new servers.

Set the network label on the new servers to Production (we didn't use the same name).

So far this problem is only affecting my Citrix servers that are running Windows 2k8 R2 SP1. These servers restart every morning (to clear logons, etc.) and randomly 1 or 2 servers will be off the network.

I have a small DHCP scope in my server range, so they either pick up an address in that range or a 169.254.x.x address. Disabling and re-enabling the adapter works most of the time. Other times I need to set the static address again, at which point Windows complains that the address is already assigned to another network adapter.
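As a rough illustration of the symptom described above, here is a minimal Python sketch that labels the address a VM came up with after its morning restart. The subnet and DHCP pool boundaries are hypothetical placeholders, not values from this environment; only the 169.254.0.0/16 self-assigned (APIPA) range is standard.

```python
import ipaddress

# Hypothetical values for illustration only; substitute your own ranges.
SERVER_SUBNET = ipaddress.ip_network("192.168.0.0/24")
DHCP_POOL_FIRST = ipaddress.ip_address("192.168.0.200")
DHCP_POOL_LAST = ipaddress.ip_address("192.168.0.220")


def classify(addr: str) -> str:
    """Return a rough label for the address a VM ended up with."""
    ip = ipaddress.ip_address(addr)
    if ip.is_link_local:
        # 169.254.0.0/16: self-assigned address, no usable configuration applied
        return "apipa"
    if DHCP_POOL_FIRST <= ip <= DHCP_POOL_LAST:
        # The static config was dropped and the small DHCP pool answered
        return "dhcp-pool"
    if ip in SERVER_SUBNET:
        return "static-range"
    return "unexpected"
```

Running something like this against the addresses reported by each Citrix server after the restart would show at a glance which ones lost their static configuration.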

Another change is that we now let VMware do the load balancing (2 x 10Gb), where the old physical servers were set up with LACP on the switches.

Are there any updates or other things I can try? So far all I have done is disable gratuitous ARP on our Cisco core, which I thought had worked, but then the problem came back after 3-4 days.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=102837...

Thanks All

Aelirik
Contributor

When I created this initial post, I was going through absolutely everything I could to try to resolve it. One of the main things I was doing was simply disconnecting and then reconnecting the adapter (unticking the box within the network adapter settings rather than removing and re-adding it); sometimes this would work, sometimes not. I went through months of different ideas with no luck.

The only thing that resolved it was ensuring the tagged VLANs on the network adapters out of the ESXi hosts were correct. There were multiple hosts with multiple VMs, and these terminal server / Citrix VMs were the only ones being affected. The VLAN tags were assigned against one of the trunk ports but not against the other, meaning sometimes the VM would come up and sometimes it wouldn't. It simply depended on which way the traffic was routing out of the host back to the switches.

As they are new hosts, I would definitely go back and double / triple check the networking configuration and ensure that you have replicated your VLANs across the host NICs and that there isn't one VLAN inadvertently missing. This messed us up for a number of months, and all it came down to was a simple misconfiguration by the guy managing our network.
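The check described above (every trunk port facing a host should carry the same set of VLANs) can be sketched in a few lines of Python. This is only an illustration: the uplink names and VLAN IDs in the usage note are made up, and the allowed-VLAN lists would have to be collected by hand from the switch configuration.

```python
from typing import Dict, Set


def missing_vlans(uplinks: Dict[str, Set[int]]) -> Dict[str, Set[int]]:
    """For each uplink, return the VLANs that some other uplink carries but it does not."""
    # Union of every VLAN seen on any uplink of this host
    all_vlans: Set[int] = set().union(*uplinks.values())
    gaps = {}
    for name, vlans in uplinks.items():
        missing = all_vlans - vlans
        if missing:
            gaps[name] = missing
    return gaps
```

For example, `missing_vlans({"vmnic0": {10, 20, 30, 40}, "vmnic1": {10, 20, 40}})` reports VLAN 30 as missing on `vmnic1`, which is the kind of mismatch that makes a VM work or fail depending on which uplink its traffic happens to leave by.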

Ratoka
Contributor

I have checked and double-checked my VLANs. We are running a fairly small environment with only 4 VLANs presented to VMware. What I suspect at this point is the actual network infrastructure. We are running Cisco 4900Ms for these NICs. Each ESXi host has two 2-port 10Gb NICs. We have one port of each NIC designated for VMs, and each goes to a different switch. I am thinking that the issue is with the NICs being connected to different switches that are not stacked, as neither switch is aware of the other's CAM table. I am digging deeper into the switches as the VMs reboot, but I am thinking of disabling the second NIC to see if the issue is resolved.

Aelirik
Contributor

If you only have 4 VLANs then it is pretty easy to manage; we had quite a number of them being presented to the hosts, which is how it was missed initially. That seems very odd... It is a completely different issue from what I was experiencing. We had four 1Gb interfaces which were load balanced across two switches, which were stacked. Are there other VMs on this infrastructure that experience the same issues? Certainly trying to isolate the network to one path would be a good way to start diagnosing / troubleshooting the issue. I run another network with HP 5500s and HP DL380p servers with 10Gb NICs connected, one to each switch. The HPs are stacked together and we have never experienced an issue with any of the VMs, which are Server 2012 R2, Windows 8, Windows 7 and some Linux VMs. I would be investigating the networking further.

Ratoka
Contributor

That makes sense. I think the difference here is the lack of stacking. We are in the process of a network overhaul which will include stacked switches. At the very least I will report back after.

WillFulmer
Enthusiast

This may be a relevant MS Hotfix

http://support.microsoft.com/kb/2555789 - Blank default gateway may occur after configuring Static IP address following network driver upgrade on Windows 7 and Server 2008 R2

mitcha
Contributor

Hi all, I have been running ESXi 5.1 and just installed 3 Windows Server 2008 R2 servers, and found that they are losing connection to the gateway. The first 2008 R2 server is up and runs solid and does not lose connection, but the second one will drop connection after just 1 minute or so. I have been searching for an answer and have tried a lot of config changes, but they did not help at all. What I did come up with is that all the 2008 R2 servers were on a single VM VLAN, so I moved one of the affected servers to another VLAN with no other 2008 R2 servers on it, and it is working OK now: it is no longer losing connection and is running solid. So if no other changes help, you might want to try having just one 2008 R2 server per VM VLAN and see if it makes any difference.

Well, that's my 2 cents.

Please let me know if that helps, and good luck with this issue.

Regards

mitch

Ratoka
Contributor

Is there any possibility with your configuration that the MAC addresses between these VMs are the same?

mitcha
Contributor

Hi,

No, they were not using the same MAC. From the gateway router (Cisco 2851) I was doing a sho ip arp and sho ip arp x.x.x.x for the IP address of each server and verified that they were different. In the VM config for each I had also checked the MAC, and changed it to other values during testing.

The strange thing was that when the server stopped responding on the network, I would ping from another workstation on a client segment, from the server to the gateway, and from the gateway to the server, and I would do a clear ip arp x.x.x.x to restore the connection.

Both VM servers were installed as new systems, not copies of each other. One was a Datacenter version, the other Enterprise, and I verified the addresses were different. I am running one of the systems as a PDC; the other is an Exchange 2007 server. I then installed another 2008 R2 server and checked connectivity before installing Exchange, and it was losing connection too, so that ruled out Exchange as being part of the issue. I also have a copy of Windows Server 2012 R2 and it was not having any issue, just Windows Server 2008 R2.

After a server lost connection I checked, and the MAC had not changed and the gateway was always the same on each server. There was never the same MAC on both servers. It did not lose the gateway info; it would just stop talking. As long as the ARP timeout was at 0 it would communicate, and as soon as the entry started to age it would stop communicating. I also tried a static ARP entry at the router and on the server, but there was no change in the result.

Also, when I first noticed this, neither server had been activated. I then did the activation on both, but it made no difference. After losing connection they would both still be able to communicate with any devices on the same segment; the one would just lose the ability to communicate off-segment to other VLANs in the network. The only thing that helped me was to move one of them to another segment; then things started to work with no more problems. Really strange.
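For anyone wanting to repeat the duplicate-MAC check on a larger ARP table, here is a minimal Python sketch that scans pasted `show ip arp` output for a MAC address that appears against more than one IP. The sample rows are invented for illustration, and the parser assumes the usual column order (Protocol, Address, Age, Hardware Addr, Type, Interface).

```python
from collections import defaultdict

# Invented sample output, for illustration only.
SAMPLE = """\
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  192.168.0.2             -   0019.aa11.0001  ARPA   Vlan100
Internet  192.168.0.11           12   0050.56a1.0001  ARPA   Vlan100
Internet  192.168.0.12            0   0050.56a1.0002  ARPA   Vlan100
Internet  192.168.0.13            3   0050.56a1.0001  ARPA   Vlan100
"""


def macs_with_multiple_ips(arp_output):
    """Map each MAC seen against more than one IP to the sorted list of those IPs."""
    by_mac = defaultdict(set)
    for line in arp_output.splitlines():
        fields = line.split()
        # Data rows start with "Internet"; the header row starts with "Protocol"
        if len(fields) >= 5 and fields[0] == "Internet":
            by_mac[fields[3].lower()].add(fields[1])
    return {mac: sorted(ips) for mac, ips in by_mac.items() if len(ips) > 1}
```

An empty result matches what was verified by hand above: no two servers share a MAC, so the cause has to be looked for elsewhere.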

Regards

Mitch

jpvhfbt
Contributor

I'll throw my 2 cents in here. We had this same problem with Win2K8 VMs, each on one of two different VLANs. The suggestion about setting ArpRetryCount=0 seems to have solved the issue for us. We also have 2 server access switches that are not stacked, and both are trunked to a core switch (which IS a switch stack). Each ESX host has trunked NICs to both server access switches. My thinking is that, on reboot, the ESX host may decide to pass traffic to another NIC interface, and therefore to another server access switch, and then onto a different trunk on the core switch. In our case, the core switch IS THE SVI for both of these VLANs. I think something is happening within the core switch such that the SVI is not responding to something that Win2K8 is asking for at boot (maybe part of that "Home/Public/Private" network crap, which does NOT belong in a server O/S, IMHO). If the core switch for some reason hasn't recognized a MAC update coming from the second server access switch, the SVI may not respond at all, or may respond on the wrong trunk port to the OLD server access switch.

In any case, setting the ArpRetryCount registry setting mentioned earlier in this thread seems to have fixed the issue for us.  Your mileage may vary.

mitcha
Contributor

As a follow-up, I did resolve my issue. It turns out it was very specific to my system setup and not software related: it was the config on my Cisco 2851 ISR router with a 36-port Fast Ethernet module.

The setup is a Cisco 2851 ISR with a 36-port FE switch module installed in it, and that is where the problem started.

The problem was that when more than one Windows 7 or later device is on the same LAN segment, the first device always worked OK, but the second, third, ... device would keep losing connection to its gateway. The affected devices could still communicate on the local segment but were unable to get off-segment to communicate with other devices and access the internet.

network stick diagram:

PC1 --------------------|3550 gig switch|  \

PC2 --------------------|3550 gig switch|    ----------|C2851 isr router with 36 port FE and 48v POE ip phone supply|

VM server 9 vm's ----|3550 gig switch|  /             |C2851 isr router with 36 port FE and 48v POE ip phone supply|

                                                                           |C2851 isr router with 36 port FE and 48v POE ip phone supply|

ip phone 1---------------------------------------------- |C2851 isr router with 36 port FE and 48v POE ip phone supply|

ip phone 2---------------------------------------------- |C2851 isr router with 36 port FE and 48v POE ip phone supply|

ip phone 3---------------------------------------------- |C2851 isr router with 36 port FE and 48v POE ip phone supply|

                                                                           |C2851 isr router with 36 port FE and 48v POE ip phone supply|

Before I added the 36-port switch module, I had defined the VLANs on the trunk interface on gig 0/0 like this:

interface GigabitEthernet0/0.100
 description vlan 100 192_168_0_0 Server net
 encapsulation dot1Q 100
 ip address 192.168.0.2 255.255.255.0
 ip helper-address 192.168.0.11

At that point the command "int vlan XXX" is not yet recognized by the 2851.

After you add a 36-port switch module to the 2851, the command "int vlan XXX" is recognized,

so I updated them all from GigabitEthernet0/0.100 to:

interface Vlan 100
 description vlan 100 192_168_0_0 Server net
 ip address 192.168.0.2 255.255.255.0
 ip helper-address 192.168.0.11

With Windows 2000 and XP this works OK, with no issues.

The FE ports are also configured like this:

interface FastEthernet1/0
 switchport access vlan 100
 switchport voice vlan 35
 no ip address
 spanning-tree portfast

The 3550 gig switch has its VLANs defined as well, but I found an issue if both the FE ports on the 2851 and the gig ports on the 3550 used the same VLAN, so I had made sure that no ports on either of them were on the same VLAN. This was before the issue was noticed, i.e. the pre-Windows 7 config as it was.

The solution was found by adding another 2851 as a router and removing all VLAN config from the original 2851 that was not used by the FE ports.

So I removed all the "interface GigabitEthernet0/0.XXX" statements and just used the "int vlan XXX" statements for the FE port VLANs. Then on the new 2851 I only added the "interface GigabitEthernet0/0.100" type statements and no "int vlan XXX" statements, as it would not let me without the 36-port module installed.

Then I set up the old 2851 to route through the new 2851 and the 3550 gig-E switch and internet. I tested all devices and there were no issues any more; all running solid.

Next I added the GigabitEthernet0/0.XXX statements back to the old 2851, but only for the VLANs that are off the device (2851), not for the FE VLANs on the 2851. I removed the new 2851 and all is still working.

I figured that the "int vlan XXX" definitions on the 2851 for the VLANs that were on the 3550 were what was causing the issue.


Regards

Mitch

Balogh
Contributor

Had the same problem with a Windows 2003 server, so I disabled the IPSEC service and rebooted.

Not sure if it will work under Windows 2008 R2, but you can try disabling the IPsec Policy Agent service and rebooting.

Dani

Donzai
Contributor

This sounds like a ghost network adapter issue. I had many troubles involving this issue and was able to resolve using steps outlined in Microsoft KB269155 https://support.microsoft.com/en-us/kb/269155

I used Method 1 outlined and was able to remove the ghost adapters. I think the prevention is to uninstall the network adapter before saving the VM as a template.

viquar
Contributor

I had the same issue, which I had been working on for 2 days, and finally got it fixed using the steps below.

  • The network interface to which vmnet0 is bridged. If there are multiple network adapters on the host machine the virtual network might be bridged to the wrong network. To eliminate this problem disable the “VMware bridge protocol” option by going to the connection properties of the network interfaces which should not be used for bridging.
  • IP address for the virtual machine is set incorrectly. There is a possibility of mistyping the IP address if you entered it manually, check if this is correct and also check the subnet mask address.
  • Wrong option selected under network connection. While setting up the virtual machine you might have selected a network connection option other than “Bridged” check for this and change it to bridged if this is so.
  • Finally there might be a firewall running in the virtual machine which is blocking the network data transfer, check for this and create firewall rules appropriately.

Keep Rocking!! VMware Rocks!

Viq
