I am fairly new to the forums but I have been dealing with ESXi for a while now.
I have recently been having an issue with some Server 2008 R2 Standard virtual machines on an ESXi 5 host: when the host is rebooted, the VM loses its networking. The network adapter is still attached to the machine and the network addressing is all still intact, but I am unable to ping the default gateway.
I can ping different servers on the same subnet, but this obviously kills all connections for internet access. Since this is a Citrix terminal server, it's not a helpful thing to have happen after applying updates.
The only way I have been able to resolve this so far is to disconnect the adapter while the VM is live, reboot the VM and, once it is back up and running, reconnect the network adapter. I have experienced this on other VMs hosted in this ESXi environment, but not all of them have the problem. So far it has only been a few of them, all set up very similarly with VMXNET 3 adapters and the Server 2008 R2 OS.
One of them was a fresh install of Server 2008 R2, and after running some Windows updates it had the same issue as the one that I am currently having trouble with.
I have patched the servers with the latest updates from VMware expecting that this might resolve the problem, but here I am, looking for a bit more information or help on these issues.
I have spent a bit of time working through a few different steps to try to work out what is causing this, and tried removing and replacing adapters, but so far with no real luck. I am just lucky that I can get the VM back online without too much hassle at this stage.
Any help would be appreciated.
Kindly check whether your VM NIC is using the E1000 driver, and whether VMware Tools is installed and up to date, since the drivers for your virtual hardware are updated through VMware Tools.
Thanks for your response.
It is definitely installing as a vmxnet3 adapter. I can see that the adapter is correctly picking up the drivers. I have installed and updated VMware Tools to the latest version, with no luck in resolving the problem so far.
Should have posted in the window above.
Checking through the event logs and such doesn't give me any information on the error that I am experiencing. Just the normal events are showing.
How many NICs are on your ESXi host? Are they load balanced? Does the VM in question, or the port group it is in, use all the uplinks or only selected ones? Since the issue is happening to other VMs as well, maybe look at the uplinks and their configuration.
Do you have network load balancing set up on the VMs?
There is just 1 NIC assigned to the VM, so there is no load balancing. It is in its own port group which has VLANs assigned to it. Only one other VM experienced this issue once, and it has since had several reboots without ever experiencing the issue again. There is no NLB set up on the VMs.
It might be a guest OS issue; try installing the hotfix mentioned in the link below.
That looks kind of close, but perhaps I should be looking down the Microsoft path.
As there is no Hyper-V installed, and the network adapter only fails after a restart, I won't apply the fix to the production server without further testing.
It seems close, but there are quite a few differences that don't match the issues that I am experiencing.
I am having the same sort of problem you described.
We have experienced it on a VM with Windows 2008 R2 using the vmxnet3 adapter and another VM with Windows 2003 using a vmxnet2 adapter.
We've tried the disable/enable combination, we've added a new adapter (and deleted the old) but the problem persists.
I was curious if you found anything beyond what is posted here in trying to resolve the issue.
Did you find a cause/fix?
Unfortunately I still haven't resolved the issue. It is incredibly odd, as it's 2 VMs on the same ESXi host as about 10 other VMs, all running the same OS, that do not have this fault.
My other thoughts have been to do with our firewalls, as they are the only 2 VMs on that particular range of IP addressing. I find it incredibly strange the way these VMs act: sometimes it's a quick fix to bring them back online, other times it's an absolute nightmare, yet there is no reason or explanation as to why they lose their network adapters.
I have noticed that they sometimes lose their network settings within Windows: the adapter reverts to DHCP, but as there is no DHCP server it just gives itself a 169.254.x.x address. We then have to set the network adapter to DHCP and then configure the static addressing again.
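For anyone who wants to script a check for this symptom, here is a minimal Python sketch. It is not something taken from this environment: the function name is_apipa and the sample addresses are made up for illustration. It only tests whether an address falls in the IPv4 link-local range 169.254.0.0/16, which is what Windows assigns itself when an adapter is set to DHCP and no DHCP server answers.

```python
import ipaddress


def is_apipa(address: str) -> bool:
    """Return True if address is an IPv4 link-local (169.254.0.0/16) address."""
    ip = ipaddress.ip_address(address)
    # IPv6 also has a link-local range (fe80::/10), so check the version too.
    return ip.version == 4 and ip.is_link_local


if __name__ == "__main__":
    # Sample addresses are illustrative only.
    for addr in ("169.254.12.34", "192.168.10.20"):
        print(addr, "APIPA" if is_apipa(addr) else "not APIPA")
```

How the address is collected from the guest (for example after each reboot) is left out here, since that depends on the tooling available.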
I have spent multiple hours on this problem with no fixes so far. The next thing I will be suggesting is changing the networking altogether as a test, to see if we can get the fault to reproduce that way. I have duplicated the VMs, which experience the same issues, tried complete uninstalls and reinstalls, manually set the MAC address and all kinds of other things to get this fault fixed, but unfortunately I have had no luck.
Has this been happening for you from day one or has this just appeared as a problem after a change was made or something along those lines?
Any information you have in regards to the fault would also be appreciated. I am always open to ideas, especially as I have exhausted every avenue I can think of so far (apart from changing the network it currently sits on).
I have some more information which may be of help to you madden.
We had another network which started producing the same issues. I had a feeling that it was something to do with HA on our firewalls, so we got hold of our vendor and ran some packet sniffers to see what was happening. The initial ARP request, which the network card broadcasts when searching for its default gateway, was not making it to our firewalls. Occasionally, after removing the NIC and rebooting the VM multiple times, it would come online (this was on the second network producing the fault).
From there we checked the network port groups to ensure that the networking had been configured properly. We found that someone had totally screwed up the config: all of the iSCSI and vMotion adapters had been included, and both of the network cards had been configured as active, rather than one active and the other standby. We removed the iSCSI and vMotion adapters from the list, as they have no place in there, and then set one of the adapters to active and the other to standby. We had to restart a couple of the VMs, but most of them came online almost instantly, were able to ping the default gateway and could get internet access without any issues.
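The check described above can be sketched in Python as a rough illustration. Everything in it is an assumption for the sake of the example: the PortGroupTeaming record, the audit_teaming function, the vmnic names and the port group name are all hypothetical, and the data would have to be copied by hand from the port group's NIC teaming settings. The limit of one active uplink reflects the layout intended in this environment, not a general VMware rule.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PortGroupTeaming:
    """Hypothetical record of one VM port group's uplink failover order."""
    name: str
    active: List[str] = field(default_factory=list)
    standby: List[str] = field(default_factory=list)


def audit_teaming(pg: PortGroupTeaming,
                  reserved: Dict[str, str],
                  max_active: int = 1) -> List[str]:
    """Return human-readable problems found in the failover order.

    reserved maps an uplink name to the traffic type it is dedicated to
    (for example iSCSI or vMotion); such uplinks should not carry VM traffic.
    max_active is an assumption about the intended design, not a general rule.
    """
    problems = []
    for nic in pg.active + pg.standby:
        if nic in reserved:
            problems.append(f"{pg.name}: {nic} is reserved for {reserved[nic]}")
    if len(pg.active) > max_active:
        problems.append(
            f"{pg.name}: {len(pg.active)} active uplinks, "
            f"expected at most {max_active}"
        )
    return problems


if __name__ == "__main__":
    # Illustrative values only; none of these names come from the real hosts.
    reserved = {"vmnic2": "iSCSI", "vmnic3": "vMotion"}
    broken = PortGroupTeaming(
        "VM Network", active=["vmnic0", "vmnic1", "vmnic2", "vmnic3"]
    )
    for line in audit_teaming(broken, reserved):
        print(line)
```

A port group with one active and one standby uplink, neither of them reserved, produces an empty list.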
We checked this on the network that is having the intermittent faults and found the same type of setup, with 2 adapters configured as active. We changed one of them to standby but still had the same fault with the default gateway, and the packet sniffer still could not see the ARP requests hitting the firewalls. We put the second adapter back into active mode and the VM came back online, which suggests that it may be something to do with our switches blocking the broadcasts coming through on the first adapter. As we had no more time to test, I haven't gone back to this, but it does suggest something along those lines. If you have the ability to do port mirroring on your switches, mirror one of the ports that the ESXi host is plugged into and run Wireshark on your laptop. You should be able to see the ARP requests hitting your switches. From there, if you can also run packet sniffers on your firewall, check there to see if the ARP requests are coming through. It will certainly give you a good idea of exactly where the problems are.
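To make it clearer what to look for in the capture, here is a small Python sketch that picks an ARP request out of a raw Ethernet frame, including frames carrying an 802.1Q VLAN tag as on a trunked port. It is only an illustration: the function names parse_arp_request and build_arp_request and all of the MAC and IP addresses are made up, and reading frames out of a capture file is not shown.

```python
import socket
import struct

ETHERTYPE_ARP = 0x0806
ETHERTYPE_VLAN = 0x8100  # 802.1Q tag
ARP_FORMAT = "!HHBBH6s4s6s4s"  # 28-byte ARP payload for Ethernet/IPv4


def parse_arp_request(frame: bytes):
    """Return (sender_mac, sender_ip, target_ip) for an ARP request, else None."""
    if len(frame) < 14:
        return None
    ethertype = struct.unpack("!H", frame[12:14])[0]
    offset = 14
    if ethertype == ETHERTYPE_VLAN:
        # Skip the 4-byte VLAN tag; the real EtherType follows it.
        if len(frame) < 18:
            return None
        ethertype = struct.unpack("!H", frame[16:18])[0]
        offset = 18
    if ethertype != ETHERTYPE_ARP or len(frame) < offset + 28:
        return None
    htype, ptype, hlen, plen, oper, sha, spa, _tha, tpa = struct.unpack(
        ARP_FORMAT, frame[offset:offset + 28]
    )
    # Ethernet hardware type, IPv4 protocol type, opcode 1 = request.
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4) or oper != 1:
        return None
    mac = ":".join(f"{b:02x}" for b in sha)
    return mac, socket.inet_ntoa(spa), socket.inet_ntoa(tpa)


def build_arp_request(sender_mac: bytes, sender_ip: str, target_ip: str) -> bytes:
    """Build an untagged broadcast ARP request, used here only as sample input."""
    ethernet = b"\xff" * 6 + sender_mac + struct.pack("!H", ETHERTYPE_ARP)
    arp = struct.pack(
        ARP_FORMAT, 1, 0x0800, 6, 4, 1,
        sender_mac, socket.inet_aton(sender_ip),
        b"\x00" * 6, socket.inet_aton(target_ip),
    )
    return ethernet + arp


if __name__ == "__main__":
    # Made-up VM MAC, VM IP and gateway IP.
    sample = build_arp_request(bytes.fromhex("005056aabbcc"), "10.0.0.5", "10.0.0.1")
    print(parse_arp_request(sample))
```

If the request for the gateway address shows up on the mirrored switch port but never in the firewall capture, the frame is being dropped somewhere between the two.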
Ditto. I recently built a fresh Win2K8 R2 SP1 VM and, after applying all current patches, the problem surfaced. I have several VMs in this boat. We hate to reboot for fear of losing connections. I was hoping someone had a fix before we try what everyone else has done and get the same results. I believe this is an OS patch issue. Win2K8 SP2 (not R2) does NOT have this problem. There must be something in the patches from MS that has broken this, or in the VM tools that interface with VMXnet3 adapters... Any other help is much appreciated...
Curious that you mention Microsoft as being at fault. The one thing we have had is the network settings going missing within the VMs, which also creates a problem. The only thing is that we have 30+ Server 2008 R2 VMs across 5 different hosts and we only experience the issues on two of the VMs, which are on their own network. All the other VMs we can reboot at any stage and never have trouble bringing back up and online.
I have researched all types of Microsoft fixes to try to diagnose the problem, but all attempts have been fruitless so far. I still believe it is an issue with the configuration of the networking in VMware, so I would certainly be inclined to check there first before wasting hours of time, like I have done, going down the Microsoft path.
Interestingly, I do have other Win2K8R2 VMs on different VLANs that do not appear to exhibit the problem. The problem also does not follow the VLAN. I have some that are fine on the troubled VLANs. Further, it does not seem to be isolated to the host or even to VMware. I have a Hyper-V R2 server with the exact same problem with another 2K8 R2 VM. Totally different server but same problem. This is why I think this might be OS related.
Most of the VMware Win2K8R2 guests were cloned from the same template and the template was sysprep'd so it initializes all of the hardware at startup--including adapters. There really is no rhyme or reason to the problem or why it happens on some and not others. It is truly maddening. In the case of my latest VM today, I have 4 others from the same template that work fine on another host and another VLAN. Following that logic, perhaps one might think the virtual network stack is at fault in VMware, but since the problem occurs on another HOST on another PLATFORM, I cannot discount something in the OS. The answer lies here somewhere. I just wish there were more markers to find it.
Hmm wow, that is curious... Our problem is that any machine on a particular VLAN produces the fault, and it is incredibly difficult to bring it back online. We have even used templates as you have and don't experience the same problems. I have tried all different hosts and the problem has followed the VLAN, which certainly looks to us like a networking problem either with that VLAN or with the ESXi host, but we have yet to prove that theory as some out of hours testing needs to be done. I just don't fancy trying to fix those servers again after another restart...
Perhaps a call to Microsoft may be in order to see if they can shed any light on your situation.
Throwing some ideas around..
Legacy? Search for removed NICs using http://support.microsoft.com/kb/315539 and uninstall them. This can be surprisingly helpful.
MAC security/sticky/etc on your physical switch? What is your physical switch and how is it configured?
Have been through the legacy adapters and removed them. Still made no difference to our network. Physical switches are the next thing to be looked at but again, after hours as we can't afford for this environment to go offline during hours.
We have 2 physical switches set up with trunked ports. We have a feeling that one of the switches is blocking the ARP requests, but again, we just have to confirm that.
We have 2 switches trunked also. Why would a trunked switchport arbitrarily block 1 MAC address when others on that same vnic in sequence work fine? I have this from two different VMs on two different switch ports from two different hosts.
Our situation is resolved when we set DHCP on the NIC, allow it to assign itself a 169.254.x.x IP, then reassign the static address it had to begin with. However, a reboot starts this process all over again.
Someone in an earlier post or alternate thread even changed the MAC to no avail...