VMware Cloud Community
vChr1s
Contributor

Network Issues with Server 2008 R2 as Guest after a vMotion

Ok.  This is a strange problem, and I have not been able to find anything similar on the Internet, so before I call support, I figured I'd post something here.

We just started testing with 2008 R2 in our environment.  I brought up our first 2008 R2 DC and everything was fine.  Then yesterday I moved our vCenter server from 2003 R2 to 2008 R2 and hit a strange problem where my ESXi hosts in a different subnet lost their connection to vCenter.  From my workstation I could ping both the hosts and vCenter; from vCenter, however, I could not ping the hosts.  I fought this for an entire day and then decided to build another 2008 R2 server to test with.  I installed it from scratch, then tweaked and modified the server, the whole time keeping a ping running to a DC in another subnet.  Everything was working until the replies suddenly stopped, even though I wasn't doing anything at the time.  Then I noticed that vCenter had vMotioned the 2008 R2 VM from one host to another.  I vMotioned it back, and the pings worked again.  Every time the VM was on host A, pings worked.  When it was vMotioned to host B, they didn't.
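In case it helps anyone reproduce this, the test itself is nothing fancy: a continuous ping to an address in another subnet, with timestamps so the drops can be lined up against the vMotion tasks in vCenter.  A rough Python sketch of the same thing (the target address is just a placeholder):

# ping_watch.py - log when ICMP replies from a cross-subnet address stop/start
# minimal sketch; 10.1.2.3 stands in for a DC in another subnet
import os, subprocess, time
from datetime import datetime

TARGET = "10.1.2.3"                 # placeholder address in another subnet
devnull = open(os.devnull, "w")
last_state = None

while True:
    # one echo request, 2 second timeout (Windows ping syntax)
    rc = subprocess.call(["ping", "-n", "1", "-w", "2000", TARGET],
                         stdout=devnull, stderr=devnull)
    state = "UP" if rc == 0 else "DOWN"
    if state != last_state:
        print(datetime.now().isoformat() + "  " + TARGET + "  " + state)
        last_state = state
    time.sleep(2)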

At first I thought the vMotion itself was causing the problem, so I shut the VM down, cold migrated it, and tried to ping again, but got the same result: I could ping from A but not from B.

My next step was to deploy another new 2008 R2 server from template.  This time I powered it up on B first, and magically, pings worked!  Then I vMotioned it to A... and they stopped.  So I've proven there is nothing wrong with the configuration of either host by itself: I can deploy a 2008 R2 VM to either host initially and everything works.  As soon as it gets vMotioned, network connectivity outside the subnet is lost.

I only have 2003 R2 and 2008 R2, no plain 2008.  This only affects the 2008 R2 VMs I have; my 2003 R2 servers are fine.  And again, this is only when connecting or pinging across to a different subnet.  Pinging within my own subnet works fine all the time on 2008 R2.

Theory/Conclusion:  It seems that with 2008 R2, something network related gets registered (the MAC address of the pNIC???) and it doesn't like moving from host to host.  The reason the 2008 R2 DC is OK is that it has never been vMotioned; I turned off DRS temporarily during work hours.  I have tried searching for this issue but found nothing.  Any help is appreciated before I call VMware support.  Thanks.

Environment Info:

  • All ESXi 4.1 Update 1, build 348481
  • vCenter Server is on 4.1 U1 on Server 2008 R2 Enterprise x64
  • HA/DRS enabled clusters
  • All vSwitches are standard vSwitches

P.S. If any more info is needed, let me know.

Minor Update:  If I vMotion the VM from host A to B and the pings stop, I can remove the vNIC and add a new one, and the pings start working again.  Also, I have tried all of the adapter types (i.e. VMXNET3, VMXNET2, and E1000).  It's almost as if the vNIC in 2008 R2 caches something or is somehow tied to the pNIC on the host.
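If anyone wants to poke at that caching theory before resorting to swapping the vNIC, the guest's ARP/neighbor cache can be dumped and flushed from an elevated prompt on 2008 R2, e.g.

arp -a                                   (dump the ARP cache)
netsh interface ipv4 show neighbors      (the 2008-style view of the same thing)
netsh interface ipv4 delete neighbors    (flush it)

If flushing it on its own brings cross-subnet pings back, that would point at stale neighbor state in the guest rather than at the vNIC itself.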

Chris A. Consider awarding points for "helpful" and/or "correct" answers.
16 Replies
a_p_
Leadership

Rather than looking only at the hosts or the VM, did you also verify that the physical switch's port settings are correct (e.g. spanning tree portfast, switchport mode access, ...)? How did you configure the virtual network?

André

vChr1s
Contributor

Thanks for the reply, André.  I did verify the physical switch ports; nothing has changed there.

I'm not sure I know what you're asking in regards to how the virtual network is configured.  There are four standard vSwitches, each with one port group: one for MGMT, one for vMotion, one for storage, and one for the VM Network.  Storage and vMotion are in their own separate VLANs, while VM Network and MGMT are in the default VLAN for our building's subnet.
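In case it helps, the quickest way I've found to compare the two hosts side by side is from the Tech Support Mode console (or the vSphere CLI equivalents), e.g.

esxcfg-vswitch -l      (vSwitches, port groups, VLAN IDs, uplinks)
esxcfg-nics -l         (pNIC link state, speed, duplex)

So far I haven't spotted any difference between A and B there.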

I really think it's a VM or Server 2008 thing.  Everything has always worked, and continues to work, fine with the 2003 VMs.  Also, after the vMotion, when the pings stop, if I remove the vNIC and add a new one, everything is fine until the next vMotion.

Chris

a_p_
Leadership

Regarding the virtual network configuration, I was thinking about the load balancing policy, e.g. "Route based on originating virtual port ID" or "Route based on IP hash", ... and, depending on that setting, your physical setup (VLANs, EtherChannel, ...).

"if I remove the vNIC, and add a new one, everything is fine until the next vMotion."

Adding a new virtual NIC gives the VM a new MAC address.  To me this looks like a network issue on the physical side.  You said that you can ping systems on the same subnet; does that include the gateway address?

André

vChr1s
Contributor

Thanks again.  Each vSwitch has two pNICs that are both active.  Load balancing is IP hash, and the physical ports on the switch are in trunk mode (the HP equivalent of EtherChannel).
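One thing I'll note about IP hash while we're on it: as I understand it, the uplink is chosen from a hash of the source and destination IPs, so same-subnet traffic and routed traffic can legitimately leave on different pNICs and therefore hit different legs of the trunk.  That's only my rough mental model, not VMware's exact algorithm, but the idea is something like this:

# Illustrative only -- NOT VMware's exact hash, just the idea that the chosen
# uplink depends on (source IP, destination IP) modulo the number of uplinks.
import socket, struct

def ip_to_int(ip):
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def pick_uplink(src_ip, dst_ip, num_uplinks=2):
    return (ip_to_int(src_ip) ^ ip_to_int(dst_ip)) % num_uplinks

# e.g. a guest talking to its gateway vs. a box in another subnet
# (all addresses are made-up placeholders)
print(pick_uplink("192.168.10.25", "192.168.10.1"))    # same-subnet gateway
print(pick_uplink("192.168.10.25", "192.168.20.10"))   # cross-subnet target

Which would at least be consistent with pings inside the subnet surviving while anything routed dies, if one leg of the trunk were misbehaving after the vMotion.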

Yes, I can ping the gateway even though I can't get to other subnets.

a_p_
Leadership

I'm not very familiar with routers and switches; however, I think something goes wrong with updating the MAC address tables.

What I would try is to a) make sure the trunk configuration is correct, which I actually assume it is since the other OSes are working (see the HP part in http://kb.vmware.com/kb/1004048), and b) disable IPv6 in the Windows 2008 R2 VM's network settings to see whether this changes the behavior.

André

vChr1s
Contributor

Yeah, I'm not really a network guy either.  I didn't set up or configure the physical switch side, although I did review that VMware KB you referred to with our network guy when everything was originally set up.

Anyway, a new idea dawned on me overnight (I can't believe I didn't think of this before).  I deployed the same 2008 R2 template to another cluster, in another building, in a separate physical and virtual datacenter.  And... everything worked!  So I guess it's not the VM or the OS by itself.

There are some differences between the two physical datacenters.  Physically, they are different switches: I am having the problem on the cluster using our newer HP ProCurve switches, while the other datacenter still has older Nortels.  However, from what I can gather, the ports are configured the same.
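For what it's worth, the kinds of things we compared on the ProCurve side were along these lines (switch commands from memory, so double-check the syntax against your firmware):

show trunks            (trunk group membership)
show vlans             (VLAN assignments)
show spanning-tree     (STP state per port)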

I suppose the other possibility is that there is a problem with the host or virtual network configuration.  I have looked over the settings multiple times and haven't found any differences.  I have even bounced one of my problem hosts (host A) and the problem remains.

I should be getting a call from VMware support in a few hours and I'll see what they have to say.  Thanks.

Also, I did disable IPv6 per this MS KB article: http://support.microsoft.com/kb/929852
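For reference, that KB basically comes down to one registry value plus a reboot; from an elevated prompt, something along these lines does it (255 = 0xFF, which disables the IPv6 components except loopback; see the KB for what the individual bits control):

reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters /v DisabledComponents /t REG_DWORD /d 255 /f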

mike_laspina
Champion

Hi,

Since the event occurs when your VM's MAC is moved to a new switch port, it's most likely related to a MAC table limit or some other port/MAC blocking function.

Do you have access to the HP switch to see if the VM's MAC is active on the associated switch port?

e.g.

show mac-address a1

http://blog.laspina.ca/ vExpert 2009
vChr1s
Contributor

Mike -

Thanks.  Yes, I do have access.  Right now my test 2008 VM is on host B and the pNICs for the VM Network are on trunk 9.  When I look at the MAC address table for that trunk, I see the 2008 VM's MAC address.  When I look at the trunk for host A's VM Network, I currently do not see it.  I am assuming this is correct?  Also, the VLAN's MAC address table has the MAC address and is showing it on the correct trunk too.

And currently, I cannot ping outside the local subnet from my test 2008 VM on host B.

mike_laspina
Champion

With the MAC address showing up on the port, we know there is no blocking at the moment.  Try disabling and re-enabling the vNIC on your misbehaving VM to see whether it finds a route on a fresh init, without a possible blocking state.

For good measure, also check the protocol binding order on the VM (properties of the NIC inside 2008).

Is the VM multi-homed?

vChr1s
Contributor

All good stuff.  Yep, disabling and re-enabling the vNIC didn't help (although removing the vNIC and adding a new one would have, as before).  The NIC binding has the LAN connection first, and the provider order has the Windows Network first.  And no, this VM has a single NIC; it is not multi-homed.

Again, just to reiterate: I can ping and connect to physical and virtual machines in my local subnet, just not outside it, and only on Server 2008; Server 2003 works fine.  Also, it's not that one host has a problem and the other doesn't.  Whichever host the VM gets deployed to first works.  Once I vMotion it to the other host, inter-subnet traffic stops.  Then inter-subnet traffic comes back once I vMotion the VM back to the original host.  But on my other cluster in a different datacenter, it all works fine.

Thanks for everyone's suggestions.  Keep 'em coming... 🙂

I'm just throwing this out there... but I found this article, http://malaysiavm.com/blog/packet-drop-on-windows-2008-with-nexus-1000-version-1-2/, which mentions some specific issues with the Nexus 1000V and IGMP that were causing "the Windows 2008 R2 virtual machine will not be reachable from external, but you can ping the gateway from the virtual machine console".  It also mentions that this would not be experienced on Server 2003.

While this is most likely not my problem, I have to ask: does Server 2008 introduce, or turn on by default, any new protocols that my switch might not allow, or that would cause issues if the switch doesn't have support for them enabled?

vChr1s
Contributor

Also, when the VM is unable to ping anything outside the local subnet, I cannot ping it from a PC on another subnet either.

mike_laspina
Champion

Hi,

I suspect that we may have a vSwitch NIC teaming backplane issue,

e.g. (crossed order)

Host1 pNic1 goes to physical switch1
Host2 pNic2 goes to physical switch1
Host1 pNic2 goes to physical switch2
Host2 pNic1 goes to physical switch2

Check the physical cables to ensure they connect in the same order on both hosts,

e.g. (matching order)

Host1 pNic1 goes to physical switch1
Host2 pNic1 goes to physical switch1
Host1 pNic2 goes to physical switch2
Host2 pNic2 goes to physical switch2

vChr1s
Contributor

Mike -

Thanks for your response.  I'm fairly certain about the cabling configuration.  Each cable was labeled at both ends with the host and pNIC it was plugged into when it was installed.  Also, vMotions work, and I am not seeing any failures or errors.  However, it is a valid point and I will double check this.  As far as the switch goes, we are connecting to an HP 8212 modular switch; the trunks are spread across modules for redundancy.

I had a chance today to speak with both VMware and HP support.  After talking to both, they each seem reasonably certain that it is some type of issue with the switch relating to ARP.  VMware was not able to offer any ideas as to what "VMware" problem might be occurring.  HP asked me to clear the ARP cache and to upgrade the firmware, which includes some fixes for ARP-related bugs.  I'll have to schedule the firmware upgrade for next week, but I'll report back on Monday about clearing the ARP cache and checking the cables.  Thanks guys.

vChr1s
Contributor

So, just to make a final note: it turns out there was a bug fix/enhancement in a firmware upgrade for our HP 8212 switch that HP said would solve the problem.  I scheduled the downtime this past Friday evening and updated the software and boot ROM on the switch.  Everything went well with the upgrade, and now I can vMotion between hosts without losing pings to different subnets.

According to HP there was an enhancement to ARP handling that they believed would resolve the issue.  Some quick research on my own end turned up information about how Server 2008 issues ARP requests differently than 2003 does.  I chalked this up as the reason why my 2003 VMs worked fine but 2008 didn't.  Anyway, thanks for everyone's help and suggestions.
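If anyone wants to see the difference for themselves, a capture filtered on ARP is the easiest way to compare a 2003 VM and a 2008 R2 VM side by side.  Wireshark does the job; as a scripted alternative, here's a minimal scapy sketch (assumes scapy is installed, admin/root rights, and a capture point that can actually see the VM's traffic, e.g. a mirrored port; the MAC below is a placeholder):

# arp_watch.py - print every ARP frame sourced from the test VM's MAC so its
# behaviour can be compared before/after a vMotion (or against a 2003 VM).
from scapy.all import sniff, ARP, Ether

VM_MAC = "00:50:56:aa:bb:cc"    # placeholder for the test VM's vNIC MAC

def show(pkt):
    if pkt.haslayer(ARP) and pkt[Ether].src.lower() == VM_MAC:
        kind = "who-has" if pkt[ARP].op == 1 else "is-at"
        # the Ethernet destination shows whether a request was broadcast or unicast
        print("%s  %s  %s -> %s" % (pkt[Ether].dst, kind, pkt[ARP].psrc, pkt[ARP].pdst))

sniff(filter="arp", prn=show, store=0)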

rcaldeira
Enthusiast

vChr1s,

I have the same issue with the Nexus 1000V and Cisco 6500 and 2950 switches.

After you upgraded the firmware of your HP switch, is everything OK now?

Rodrigo

vChr1s
Contributor

Rodrigo -

Yes, upgrading the firmware and boot ROM of the HP switch solved my problem; however, I was not using Nexus or Cisco equipment.  Did you take a look at the article I referenced?  The link is below.  What version of the Nexus 1000V are you running?

http://malaysiavm.com/blog/packet-drop-on-windows-2008-with-nexus-1000-version-1-2/
