VMware Cloud Community
kellino
Enthusiast

VERY strange ESX3 ARP and network problems

We've had a case open with VMware for a few weeks now, and I'm putting this out here as a cry for help because all of our server and network engineers are stumped. As you read this you're going to say to yourself "that's not possible", but we've seen it and have reproduced it in WebEx sessions.

This is a bit complicated as there are several conditions and variables so I’ll try to touch on all of these. Here goes…

We have 3 ESX (3.0.1) hosts in a cluster. The first 2 of these ESX hosts (HP DL580 G3) have 3 virtual switches. Each virtual switch has 2 NICs assigned, each NIC going to a different Cisco 4500 switch. In the process of working on this issue we have disabled any load balancing in ESX as well as VLANs and VLAN tagging.

The third host has only 2 NICs and we are using VLANs on it to separate the COS/VMK and VM networks.

The ESX hosts do not appear to be a variable as the problem has been experienced on VM’s on each host.

Here's the fun part: trying to describe the problem. We have about 60 VM's in the cluster. The last 3-5 VM's to experience a "network state change" will have network connectivity issues. It doesn't matter if the VM is Linux or Windows, and the problem will "move" to the last few VM's that were "touched". When the problem "moves away" from a VM, it regains full network connectivity, even if nothing was done to the VM.

Now there are two variations of this “network connectivity problem” – I’m guessing that they both happen with nearly the same frequency.

The first variation is where the VM can't communicate beyond its local subnet. If you look at the ARP table, either there will be no entry for the gateway, or it will appear with a MAC address of all zeros and "invalid" for type. If you try to add a static ARP entry for the gateway it still doesn't help.
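
(For reference, this is roughly how we check and force the gateway entry from inside a guest; the gateway IP and MAC below are just placeholders.)

arp -a                                (Windows: dump the ARP cache and look at the gateway entry)
arp -n                                (Linux equivalent)
arp -d *                              (Windows: flush the cache)
arp -s 10.1.1.1 00-11-22-33-44-55     (Windows: add a static entry for the gateway; Linux uses colons in the MAC)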

The second variation is where the VM can reach some subnets, but not others. Yes, I know – this sounds like a VLAN issue. Imagine our disappointment when we disabled all trunking and VLAN tagging and the problem persisted. The list of subnets that you can get to, and those that you cannot, seems to be static (consistent).

Got all that? Some VM's can't get past their gateway and other VM's will only talk to certain subnets. And this problem happens on any VM (Linux or Windows; most are Windows 2003) and it seems to follow the VM's that last experienced a "network state change".

Here is something we demonstrated in a webex for VMware support. We took a VM and demonstrated it was healthy (ARP table looked good, and could ping a set of 3 hosts). Then we took a snapshot and then reverted to that snapshot. After reverting to the snapshot, one of the 2 network problems had manifested and the VM no longer had a “healthy” network connection.
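
(For anyone who wants to repeat the test, we did nothing fancier than a snapshot and a revert. From the service console that's along these lines, with the VM path being a placeholder; the VI Client snapshot menu works just as well.)

vmware-cmd -l                                                          (list the registered .vmx paths)
vmware-cmd /vmfs/volumes/vmfs1/testvm/testvm.vmx createsnapshot test "arp test" 0 0
vmware-cmd /vmfs/volumes/vmfs1/testvm/testvm.vmx revertsnapshot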

No, this is not a joke – this is a real live problem that has us stumped and unfortunately has our management losing faith in VMware.

I don't know that it is a pure VMware problem, and I'm highly inclined to think that the network (switch and port settings) is at least a factor. But we're stumped nonetheless.

These Cisco 4500 switches support a lot more than just ESX in our datacenter, and none of our physical servers (of which we have more) seem to be suffering any problems whatsoever.

Any ideas would be greatly appreciated.

I’d give 100 points to anyone that can answer this, but you’ll have to settle for 10.

Oh, I should add that we've never had any problems on the virtual switches servicing the COS or VMkernel. The only problems are on vswitches used to connect VM's to the network.

Thanks!


wobbly1
Expert

Have you opened a support call with Cisco to look at this?

Sorry, can't help directly but just some thoughts on what I'd try out..

Have you tried manually setting a MAC address on a VM and seeing whether you can reproduce the problem on it then?

Have you tried to recreate the problem with the host removed from the cluster?

Rumple
Virtuoso

We have our ESX Servers plugged into a Cisco 4510 with no problems like this.

Now I will tell you that we did have an issue with one of the IOS codes on the 4510. We started experiencing issues like certain IPs could not get to www.google.com, for instance, but others could. We also looked at ARP tables, DNS and even packet sniffs without success.

On an affected machine, if all I did was change the IP I could then get out to the net. This started happening more and more frequently around the office with more and more websites, until one night our core switch bounced every interface at 12:45am and then rebooted. It never came back up cleanly; it rebooted again and auto-loaded an older revision of the code. Needless to say that caused quite a smelly storm :O) We had our CCIE and 5 Cisco engineers basically rip that core switch apart and they never found the issue. We eventually upgraded past that code level and have not had any other issues (although our current code does have a supervisor card memory leak that took out the core one time at 8am :) ).

What version of IOS do you have on your 4500? I can check our IOS version Monday morning.

Paul_Lalonde
Commander

Check out:

kb.vmware.com/kb/507

Try setting the MAC statically on your VMs. I have seen similar instances of this behaviour (particularly with certain switches and our PIX firewall) and a refresh of the CAM (switch) and XLATE table (PIX) seems to fix it. It may be the way in which VM MAC addresses can change (i.e. when a VM is migrated via DRS).
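
If you go that route, the .vmx entries typically look something like this (assuming the first vNIC is ethernet0; the address is just an example from VMware's static range):

ethernet0.addressType = "static"
ethernet0.address = "00:50:56:00:12:34"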

Paul

grasshopper
Virtuoso

"The first variation is where the VM can't communicate beyond its local subnet."

From my experience, the most common reason for the behavior in the above variation is an ESX network bond consisting of two NIC ports going to two switch ports that were erroneously configured for different VLANs on the pSwitch. The result is that sometimes the guest can ping beyond the gateway, sometimes it can't.

Of course I'm sure you triple checked that but just throwing it out there.
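
A quick way to double-check it on the physical side is to compare the switchport config of the two uplink ports (interface names here are only examples):

show interfaces GigabitEthernet2/1 switchport
show interfaces GigabitEthernet3/1 switchport

Both should report the same mode and the same access/native VLAN.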

Some other alternative troubleshooting would include temporarily removing the additional members of the bond, so that each vswitch only has one network connection.
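
From the service console that's along the lines of the following (vmnic and vSwitch names are just examples):

esxcfg-vswitch -l                    (list vSwitches and their current uplinks)
esxcfg-vswitch -U vmnic2 vSwitch1    (unlink the second uplink from the vSwitch)
esxcfg-vswitch -L vmnic2 vSwitch1    (relink it when you're done testing)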

Additionally, you may consider making a copy of the before and after config file (.vmx) then performing a diff to see what has changed when doing the test scenario illustrated via webex previously.
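
For example (datastore and VM names are placeholders):

cp /vmfs/volumes/vmfs1/testvm/testvm.vmx /tmp/testvm.vmx.before
(run the snapshot/revert test)
diff /tmp/testvm.vmx.before /vmfs/volumes/vmfs1/testvm/testvm.vmx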

Of course, good old fashioned log analysis is in order as well, so tailing the /var/log/vmkernel and vmware.log files would be helpful too.
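
Something like this, with the vmware.log living in the VM's directory on the datastore (paths are examples):

tail -f /var/log/vmkernel
tail -f /vmfs/volumes/vmfs1/testvm/vmware.log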

kellino
Enthusiast

Thanks for all the advice and insight thus far.

I don't think the MAC issue is relevant here for several reasons --

1) MAC changes simply don't happen nearly as often as we see the problem.

2) I did multiple reboots, reverts and migrations and did not observe a single MAC change

3) I observed the problem appear on a VM, even though the MAC never changed.

Here's something interesting. Some VM's appear to "self-heal" after a period of time, while others will not.

I just found a VM that will work fine on Host A, but never on Host B. And I found another VM that works fine on Host B, but not on Host A. Remember, all VLAN tagging and trunking is disabled on Host B.

All transmit load balancing (ESX) is also disabled. I think I'm going to unplug a cable from each host so that we only have 1 common switch in the mix and see where that takes us. Also if it works out, we will try moving a host to one of our core switches (6500).

ygoodman
Contributor

Hi,

I might be totally wrong, but this sounds like a spanning-tree problem to me.

Keep in mind that virtual switches do not support STP (since they cannot cause a loop).

It might be worth having your network team check out the STP events in your upstream switches. Connecting the Hosts to only one switch may also solve the problem.
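
Even something as simple as the following on the upstream switches should show whether spanning-tree has been acting up around the times the VMs drop off (exact output varies by IOS version):

show spanning-tree summary
show logging | include SPANTREE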

Just for reference, I run 714 VMs on 25 ESX hosts (7 clusters) connected to a Cisco 4506 (and a 6509 backbone), with no such problems, but each of my hosts is connected to a single 4506 (I use teaming only to solve link failure problems, not switch outages).

Good Luck,

Yehuda

kellino
Enthusiast

Thanks Yehuda.

The VMware engineers we've been working with originally told us to enable BOTH portfast and spantree on the switch ports.

I asked our network team if we could have a loop due to spantree, and they said that since spantree is Layer 2, we would be seeing the issue on the whole subnet, as opposed to a problem that appears to move among specific VM's.

gogogo5
Hot Shot

A few things to add:

1. Have you configured enough virtual switchports (defaults to 56 I think) on your VM vSwitch?

2. Have you tried totally removing then re-creating your VM vSwitch? I've seen this fix strange VM network connectivity issues.

3. Shot in the dark this one, but trying to help! Do you configure MAC address security on your switchports?

GavinJ
Hot Shot

Out of interest we had pretty much this exact range of symptoms 2 years back with ESX 2.5.x. At the time we were trying to run 'out-ip' based load-balancing on our servers. Unfortunately we were unable to reach a support fix with VMware for it after weeks of pulling our Cisco network apart and ended up settling with 'out-mac'. You mention you've disabled load-balancing - has modifying your host load-balancing options had any effect on the symptoms?

Gavin

kellino
Enthusiast

Virtual switchports -- that was actually one of the first things we looked at about a month ago. We jacked them up to 120....

Recreating the vSwitch? Didn't try that, but we are seeing this on three hosts -- three hosts that didn't have any issues until either an environmental change (not on the ESX side) or problems of scale exposed this problem.

No MAC security is used. All good ideas, and thanks.... This is why we are stumped! :)

kellino
Enthusiast

We never had any receive load balancing. Each vswitch spans 2 switches.

Transmit load balancing we completely turned off a week or more ago. No changes.

The latest iteration is this....

From Host A we disabled the path to switch B. And from Host B we disabled the path to switch A.

We were still experimenting and still had problems. Then out of the blue today, while I was in a meeting, 9 VMs lost their connectivity -- all on Host B. Within 40 minutes, 5 of these came back on by themselves. The remaining 4 had bad ARP tables and could not go anywhere. We VMotioned these 4 to Host A and the ARP problems persisted even after VM reboots.....

kellino
Enthusiast

Now this is interesting.....

As I mentioned before we have load balancing disabled on ESX ("use explicit failover order") and we also unplugged one of the two NICs, leaving only one path.

Well, one of our network techs, not aware of the situation, saw the dangling cable and plugged it in.

When he plugged it in all the VM's on the host fell off the network and were unreachable. Within 40 minutes most found their way back online on their own, but the rest had invalid ARP tables and would only work if vmotioned to another host.

Here's the million dollar question -- if ESX transmit load balancing is disabled, and someone plugs in a 2nd NIC for the virtual switch, why should it have any impact at all?

Looking at the configuration, the 2nd NIC had higher priority, so when it was plugged in, ESX tried to use it. I understand that, but I don't understand why nothing on this connection had access for 20-40 minutes, and why some VMs still didn't get healthy ARP tables....

After this incident we moved one of our hosts from the 4500 switches to the core 6500 switches. We haven't seen any ARP issues since doing this, but we still have the "selective subnet" problem, where a VM can get to some subnets but not others.

The more we test the less sense this all makes....

gogogo5
Hot Shot

Start stripping the complexity away. Unbind all but one pNIC from the vSwitch hosting your VMs and see if the problems still persist.

asyntax
Enthusiast

This might be out there, but check the number of ports you have assigned to the vswitch you have your VLANs on. Make sure the number of ports is greater than the number of VM's you are running. If not, change it to the max number of VM's you would ever run on that switch and restart the server.
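
You can see the configured versus in-use port count for each vSwitch straight from the service console:

esxcfg-vswitch -l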

jdvcp
Enthusiast

I just resolved a similar issue where VMotioning was causing ARP loops in my Cisco switches. Our network guy re-read the 802.1Q guide and changed some settings:

switchport nonegotiate

spanning-tree portfast trunk

no cdp enable
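
For context, those commands sit under the interface config of each ESX-facing port, roughly like this (the interface name is just an example, and depending on the line card you may also need a dot1q encapsulation line):

interface GigabitEthernet2/1
 switchport mode trunk
 switchport nonegotiate
 spanning-tree portfast trunk
 no cdp enable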

kellino
Enthusiast

Thanks. VMware did mention the first two settings previously -- I'll double check to make sure these are still in place after moving to the core switches.

I could be wrong but I'm guessing that there are two problems here. I'm slightly inclined to think that the "selective subnet" issue I describe above might be a VLAN issue somehow.... still working on this with our network team.... Thanks everyone for the ideas so far....

When I have it all figured out (or at least part of it) I'll follow up here.

kellino
Enthusiast

We already had all of these settings, it seems, except for the 'no cdp enable', which hasn't made any difference.

It's getting much more difficult not to point the finger at VMware.

We've ruled out switch hardware by moving to the 6500 core switches, and nowhere else in our datacenter do we see such symptoms -- only in the ESX environment.

We have plenty of virtual ports, we disabled load balancing and the NICs are set to 1000 Full. Not much else to do on the VMware side -- yet the problems seem to be unique to VMware.

Rumple
Virtuoso

The problem with pointing the finger at VMware is that it would mean most of us should be having the same problems, and since most of us are using Cisco core-level switches (4500+) we would all be running into the same issues.

kellino
Enthusiast

"The problem with pointing the finger at VMware is that it would mean most of us should be having the same problems, and since most of us are using Cisco core-level switches (4500+) we would all be running into the same issues."

Agreed! I made this exact point originally to the network team to solicit their involvement. The only problem is that 3 weeks later we still don't know the root cause, nor can I explain why ARP loops and other such issues would be exclusive to ESX and not the rest of the datacenter.

Just getting desperate, I guess.... If there are no settings on the ESX side to tweak at this point and the problem is only seen on ESX servers -- I'm just having a hard time trying to answer that question.
