ESXi v3.5 u4 weird networking problems.

rdmt · ‎04-30-2009

I just installed an ESXi server running v3.5 update 4 and added two VMs. The addresses involved are:

Server 2003 R2 - MS Terminal Server - 172.16.3.200
Server 2008 - MS Active Directory - 172.16.2.1
ESXi server - 172.16.10.3
Default Gateway - 172.16.1.1
Other systems on the network - 172.16.3.x and 172.16.2.x

Both of the VM guest servers (2003 and 2008) never lose connection to the internet or the other servers on the LAN. The Server 2003 box sometimes loses its connection to the rest of the LAN, and frequently to the 2008 VM, which is on the same ESXi server. The 2008 server never loses connection to anything.

This problem occurs maybe 4-5 times a day and lasts anywhere from 2-10 minutes. I don't have to do anything to resolve it; it just starts working again. I have run constant pings to all the boxes involved and can see them stop for a few minutes at a time on the 2003 server but not on the ESXi host or the 2008 server. The 2003 server also sometimes loses connection to only the 2008 server but can still talk to everything else. This is driving me nuts, please help! It is very similar to this report I found: http://communities.vmware.com/thread/178753.

vSwitch setup is as follows:

vSwitch0
    vmnic0 gig full
    DC VM 172.16.2.1
    Management Network 172.16.10.3
vSwitch1
    vmnic1 gig full
    MSTS VM 172.16.3.200

RParker · ‎04-30-2009

Well, Windows 2008 uses IPv6; Windows 2003 does not. That's the first problem, so it could be that IPv6 packets are not being routed properly. This also appears to be a problem with the Windows 2003 driver / NIC. Did you check the logs in Windows 2003 to verify that you don't have a problem with drivers or some other issue?

rdmt · ‎04-30-2009

Thanks for the response. I disabled IPv6 in Server 2008, so I don't think that is the issue. I also checked the event logs and don't see anything related to a network driver problem. The only odd thing I noticed there is that the Server 2008 NIC uses the Intel Pro 1000 driver and the Server 2003 NIC uses the VMware Accelerated AMD PCNet Adapter. I can't remember for sure, but prior to this latest version of ESXi I thought all of the VMs I'd worked on used the Intel driver.

The hardware is a Dell PE1950 with dual NetExtreme II NICs.

twashburn · ‎04-30-2009

I just checked an ESXi 3.5 update 3 test host I have running. For Vista and Server 2008 configurations it uses the E1000 adapter, and XP and Server 2003 VMs use the Flexible adapter.

kjb007 · ‎04-30-2009

What size network are you using? What is your netmask? If you use typical settings, then all 3 of these servers are on separate subnets. Is that correct? Have you verified the IP/netmask is correct for the size network you are using?

-KjB

VMware vExpert

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB

rdmt · ‎04-30-2009

The network mask is 255.255.0.0, so all subnets can talk to each other. It works most of the time, so I don't think it's a basic network configuration problem. It's something that comes and goes.
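
As a quick sanity check of that mask, here is a minimal Python sketch using the standard `ipaddress` module. The addresses are the ones listed in the first post; the labels and the `network_of` helper are names made up for this illustration. It only confirms that, with a 255.255.0.0 mask, all four addresses fall in one network, whereas a 255.255.255.0 mask would split them.

```python
import ipaddress

# Addresses from the first post; the labels are illustrative only.
hosts = {
    "terminal server (2003)": "172.16.3.200",
    "domain controller (2008)": "172.16.2.1",
    "ESXi management": "172.16.10.3",
    "default gateway": "172.16.1.1",
}
netmask = "255.255.0.0"


def network_of(ip, mask):
    """Return the network that `ip` belongs to under `mask`."""
    return ipaddress.ip_interface(f"{ip}/{mask}").network


networks = {name: network_of(ip, netmask) for name, ip in hosts.items()}

# True when every host resolves to the same network under this mask.
same_segment = len(set(networks.values())) == 1

for name, net in networks.items():
    print(f"{name}: {net}")
print("all in one network:", same_segment)
```

If the hosts did not all carry the same mask, running `network_of` per host with its own mask would show which ones disagree.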

rdmt · ‎04-30-2009

I have been able to pinpoint the exact problem now. The two VMs get into the state described above, where they cannot ping each other. I ran constant pings from other machines on the network to all the systems involved and had only one packet dropped out of almost 20K. However, while those constant pings were running, the problem where the two VMs couldn't talk to each other happened at least twice, for a few minutes at a time. So I think the path from the rest of the network to the VMs is fine; it's just that the two VMs hosted on the same physical server lose connection to each other every so often for a brief period.

RParker · ‎04-30-2009

OK, obligatory Windows question: did you turn off the internal firewall on BOTH VMs?

rdmt · ‎04-30-2009

Server 2003 doesn't have the firewall service active. Server 2008 does by default, but it allows traffic on all of the ports that should be necessary, including ICMP. I'm also not sure how a firewall issue would come and go?

kjb007 · ‎04-30-2009

2003 32-bit only gives you the option of Flexible or Enhanced. 2003 64-bit provides e1000, and 2008 provides e1000 and Enhanced. The Flexible adapter would be the AMD Accelerated Adapter. You could manually edit the vmx file for the 2003 VM and change it to e1000. You could also try changing the vNIC in the 2003 VM to Enhanced, or just remove and re-add it.
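
For the manual vmx edit, the setting involved is the `virtualDev` key of the vNIC (for example `ethernet0.virtualDev = "e1000"`). Below is a minimal Python sketch of that text change, assuming the NIC is `ethernet0` and that the file is edited while the VM is powered off and after taking a backup. The `set_virtual_dev` helper and the sample vmx content are made up for illustration and are not taken from the poster's system.

```python
import re


def set_virtual_dev(vmx_text, nic="ethernet0", dev="e1000"):
    """Return vmx_text with <nic>.virtualDev set to dev, adding the line if absent."""
    pattern = re.compile(rf"^{re.escape(nic)}\.virtualDev\s*=.*$", re.MULTILINE)
    line = f'{nic}.virtualDev = "{dev}"'
    if pattern.search(vmx_text):
        # Replace the existing setting in place.
        return pattern.sub(lambda m: line, vmx_text)
    # No explicit adapter type in the file, so append one.
    return vmx_text.rstrip("\n") + "\n" + line + "\n"


# Hypothetical vmx fragment with no explicit adapter type.
sample = 'ethernet0.present = "TRUE"\nethernet0.networkName = "VM Network"\n'
print(set_virtual_dev(sample))
```

The guest will typically detect the e1000 as a new NIC, so the IP settings may need to be re-entered inside Windows afterwards.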

rdmt · ‎04-30-2009

The server 2003 VM is 32-bit and I will try switching it to e1000 in the vmx config file. I will do that this evening off hours and post back with my results. The server 2008 system is 64-bit and already using the e1000 driver presumably.

MrPauloAndersen · ‎04-30-2009

So, the VMs are on the same host but on two separate vSwitches?

If this is the case, the IP traffic will always route through the network.

When VM1 cannot ping VM2 on the same host... What does the VM network media state say the status of the network is? Can VM1 ping another server on the same host on the same vSwitch? Can VM2?

it most likely will not say...

ipconfig /all

Ethernet adapter XXXXXXX Network Connection:

Media State . . . . . . . . . . . : Media disconnected

If this is the case and the virtual network looks OK, then I would say we need to look a little further down the network stack.

Mr. Andersen

rdmt · ‎05-01-2009

Update with hopefully some helpful information:

I was able to change over to the e1000 driver, and immediately after doing so I noticed that I could no longer ping between the two VMs. I dug deeper and discovered that the Server 2008 system had the wrong MAC for the Server 2003 NIC. I deleted the entry from the ARP table and added it statically, and it works now. I did the same thing on the Server 2003 system, although it had the correct MAC. I figure I'd rather have the static entry and KNOW it's right than question it if this comes up again. So far so good; however, this has only been set up for the past 10 minutes at this point, so I'll post back later this evening on whether this resolved the problem. If not, I'll be back for more helpful suggestions.

Anyone ever seen the MAC arp problem before on two VMs like this?

kjb007 · ‎05-01-2009

For multicast, yes; for regular unicast packets, not really. Do you have Notify Switches enabled in your vSwitch config? It is the default, but I'm not sure if this setting was changed or not.

rdmt · ‎05-01-2009

Notify switches is set to enable in the vSwitch config for both vSwitches.

kjb007 · ‎05-01-2009

When / if this problem occurs again, can you try to run the repair option from within Windows to see if things get cleared up? That should force an ARP update from the OS itself.

rdmt · ‎05-01-2009

The problem happened on the Server 2008 box, but this time it could not talk to my workstation. When I ran arp -a I noticed that the default gateway (our firewall) and my workstation were listed with the same MAC. Any ideas on how that would happen? There is no reason why my workstation should try to communicate with the server through the firewall, since we are all plugged into the same switch. I deleted the ARP entry manually and it resolved the problem, although I'm sure the repair method you suggested would have worked as well.
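
To spot this condition quickly, one option is to scan the `arp -a` output for a MAC address that appears against more than one IP. The following is a minimal Python sketch, assuming the usual Windows `arp -a` column layout (Internet Address, Physical Address, Type). The `shared_macs` helper, the sample output, and every MAC and workstation address in it are hypothetical and not taken from the affected server.

```python
import re
from collections import defaultdict

# One ARP table row: IPv4 address, dash-separated MAC, entry type.
ENTRY = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"((?:[0-9a-fA-F]{2}-){5}[0-9a-fA-F]{2})\s+(\w+)\s*$"
)
BROADCAST = "ff-ff-ff-ff-ff-ff"


def shared_macs(arp_output):
    """Map each MAC seen for more than one IP to the list of those IPs."""
    by_mac = defaultdict(list)
    for line in arp_output.splitlines():
        match = ENTRY.match(line)
        if not match:
            continue  # header lines, blank lines, "Interface:" lines
        ip, mac, _entry_type = match.groups()
        mac = mac.lower()
        if mac != BROADCAST:  # broadcast rows legitimately repeat
            by_mac[mac].append(ip)
    return {mac: ips for mac, ips in by_mac.items() if len(ips) > 1}


# Hypothetical output resembling the situation described above.
SAMPLE = """
Interface: 172.16.2.1 --- 0x2
  Internet Address      Physical Address      Type
  172.16.1.1            00-11-22-33-44-36     dynamic
  172.16.3.50           00-11-22-33-44-36     dynamic
  172.16.3.200          00-11-22-33-44-10     dynamic
"""

print(shared_macs(SAMPLE))
```

Running this against output captured while the problem is happening would show which IPs have collapsed onto the same MAC.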

kjb007 · ‎05-01-2009

Very strange indeed. Is the firewall your router also? Are your workstation and the server on the same VLAN/subnet or different ones?

rdmt · ‎07-09-2009

I am replying to this in hopes of getting this resolved, or at least pinpointing the problem. In the weeks since I last updated this I've been using a manual workaround where I statically assign the MAC addresses in the ARP tables of each problematic server. That has worked, but it appears to be less and less reliable as a workaround lately.

I have since dug into our pSwitch setup closer and found that STP (Spanning-Tree) was enabled on all ports on the switch. I tried disabling STP and that didn't seem to help so the last thing I tried was enabling STP with fast learning which is what I have in place right now. The problem still occurs but I've narrowed it down with further information but I'm not sure what it's telling me.

MAC table on the switch shows all correct MAC addresses for the ports.

STP status on all ports including the problematic two is "Forwarding" which I believe is what I want for normal results.

STP path cost on port 1 (Firewall) and port 48 (one of the problematic VMs) is 10. All other ports (2-47) have a path cost value of 1. (I have since moved this from port 48 to port 45 and it now shows path cost value of 1).

STP ports are listed as untagged.

The Server 2008 VM continues to have its ARP table filled with incorrect MAC addresses, but the wrong value is always the MAC address of the firewall. As an example, our terminal server is IP xxx.yyy.3.200 and its MAC ends in 10. The ARP entry for that IP on the Server 2008 VM ends in 36, which is our firewall's MAC address. So it is seeing the firewall's MAC address for several other IPs.
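
One way to document these bad entries is to compare the observed ARP table against a known-good IP-to-MAC list and report whose MAC each wrong entry actually carries. This is a minimal Python sketch of that comparison. The `arp_mismatches` helper is made up for illustration, and the MAC addresses are placeholders that only reuse the last octets mentioned above (10 and 36); the real addresses are not given in the thread.

```python
def arp_mismatches(observed, expected):
    """Return (ip, expected_mac, observed_mac, owner_ip) for each wrong entry.

    owner_ip is the IP that the observed MAC really belongs to according
    to `expected`, or None if that MAC is unknown.
    """
    owner_of = {mac.lower(): ip for ip, mac in expected.items()}
    wrong = []
    for ip, mac in observed.items():
        want = expected.get(ip)
        if want is not None and want.lower() != mac.lower():
            wrong.append((ip, want.lower(), mac.lower(), owner_of.get(mac.lower())))
    return wrong


# Placeholder MACs; only the final octets come from the post.
expected = {
    "172.16.1.1": "00-00-00-00-00-36",    # firewall / default gateway
    "172.16.3.200": "00-00-00-00-00-10",  # terminal server
}
observed = {
    "172.16.1.1": "00-00-00-00-00-36",
    "172.16.3.200": "00-00-00-00-00-36",  # wrong: firewall's MAC
}

for ip, want, got, owner in arp_mismatches(observed, expected):
    print(f"{ip}: expected {want}, saw {got} (belongs to {owner})")
```

Logging these mismatches with a timestamp each time the problem occurs would show whether the wrong MAC is always the firewall's.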

J1mbo · ‎07-10-2009

Do you have more than one vSwitch in the server? Do you have more than one physical NIC in the server? If so, how are they all hooked together?