Re: VERY strange ESX3 ARP and network problems - Page 2

kellino · ‎04-28-2007

We’ve had a case open with VMware for a few weeks now, and I’m putting this out here as a cry for help because all of our server and network engineers are stumped. As you read this you’re going to say to yourself “that’s not possible”, but we’ve seen it and have reproduced it in WebEx sessions.

This is a bit complicated as there are several conditions and variables so I’ll try to touch on all of these. Here goes…

We have 3 ESX (3.01) hosts in a cluster. The first 2 of these ESX hosts (HP DL580 G3) have 3 virtual switches. Each virtual switch has 2 NICs assigned – each NIC going to a different Cisco 4500 switch. In the process of working on this issue we have disabled any load balancing in ESX as well as VLANs and VLAN tagging.

The third host has only 2 NICs and we are using VLANs on it to separate the COS/VMK and VM networks.

The ESX hosts do not appear to be a variable as the problem has been experienced on VM’s on each host.

Here’s the fun part – trying to describe the problem. We have about 60 VM’s in the cluster. The last 3-5 VM’s to experience a “network state change” will have network connectivity issues. It doesn’t matter if the VM is Linux or Windows, and the problem will “move” to the last few VM’s that were “touched”. When the problem “moves away” from a VM, it regains full network connectivity, even if nothing was done to the VM.

Now there are two variations of this “network connectivity problem” – I’m guessing that they both happen with nearly the same frequency.

The first variation is where the VM can’t communicate beyond its local subnet. If you look at the ARP table, either there will be no entry for the gateway, or it will appear with a MAC address of all zeros and “invalid” for type. If you try to add a static ARP entry for the gateway it still doesn’t help.
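For anyone wanting to check many VMs for this first symptom, here is a minimal sketch that classifies the gateway's ARP entry from the text of a Windows-style `arp -a` listing. The function name, the sample listing, and the IP addresses are hypothetical and only for illustration; it assumes the three-column layout (address, MAC, type) that Windows prints.

```python
ZERO_MAC = "00-00-00-00-00-00"


def gateway_arp_state(arp_output: str, gateway_ip: str) -> str:
    """Classify the gateway's ARP entry as 'ok', 'invalid', or 'missing'.

    Assumes Windows-style `arp -a` rows: <IP> <MAC> <type>.
    """
    for line in arp_output.splitlines():
        fields = line.split()
        # Data rows start with the IP address; header rows never match it.
        if len(fields) >= 3 and fields[0] == gateway_ip:
            mac, entry_type = fields[1].lower(), fields[2].lower()
            if mac == ZERO_MAC or entry_type == "invalid":
                return "invalid"
            return "ok"
    return "missing"


# Hypothetical listing resembling the symptom described above.
sample = (
    "Interface: 10.1.1.50 --- 0x2\n"
    "  Internet Address      Physical Address      Type\n"
    "  10.1.1.1              00-00-00-00-00-00     invalid\n"
    "  10.1.1.20             00-50-56-aa-bb-cc     dynamic\n"
)
print(gateway_arp_state(sample, "10.1.1.1"))
```

Feeding it the real output of `arp -a` from each VM (collected however is convenient) would show which VMs currently have the problem.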

The second variation is where the VM can reach some subnets, but not others. Yes, I know – this sounds like a VLAN issue. Imagine our disappointment when we disabled all trunking and VLAN tagging and the problem persisted. The list of subnets that you can get to, and those that you cannot, seems to be static (consistent).

Got all that? Some VM’s can’t get past their local subnet and other VM’s will only talk to certain subnets. And this problem happens on any VM (Linux or Windows – most are Windows 2003) and it seems to follow the VM’s that last experienced a “network state change”.

Here is something we demonstrated in a webex for VMware support. We took a VM and demonstrated it was healthy (ARP table looked good, and could ping a set of 3 hosts). Then we took a snapshot and then reverted to that snapshot. After reverting to the snapshot, one of the 2 network problems had manifested and the VM no longer had a “healthy” network connection.

No, this is not a joke – this is a real live problem that has us stumped and unfortunately has our management losing faith in VMware.

I don’t know that it is a pure VMware problem, and I’m very highly inclined to think that the network (switch and port settings) is at least a factor. But we’re stumped nonetheless.

These Cisco 4500 switches support a lot more in our datacenter and none of our physical servers (which we have more of) seem to be suffering any problems whatsoever.

Any ideas would be greatly appreciated.

I’d give 100 points to anyone that can answer this, but you’ll have to settle for 10.

Oh, I should add that we've never had any problems on the virtual switches servicing the COS or VMkernel. The only problems are on vSwitches used to connect VM's to the network.

Thanks!


ThompsG · ‎05-07-2007

Hi,

Just a couple of things to ask:

1) Do you have any TACLANE's in your environment?

2) Somebody mentioned upgrading the IOS on your Cisco switches. Has this been tried yet?

We had a similar problem but it was related to the TACLANE's we are using. What we ended up having to do in order to fix it was put a plain old switch between ESX and the real world. No fancy pants switch just a plain old layer 1 switch. Perhaps you could try this as well?

Kind regards,

Glen

kellino · ‎05-10-2007

Eureka!

This is what we've found. When a VM is moved from host A to host B, Switch A's MAC table will basically say "Switch B says he owns this MAC now -- ask him", but if you look at switch B, the MAC table hasn't updated properly. If you manually purge that MAC entry from switch B, it does a broadcast to find the correct MAC/IP pair and everything is pretty.

If we moved a VM from host B to host A, the situation was that each switch was saying "I've got this MAC on one of my ports" -- explaining why we could reach some subnets but not others.
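To make the first case concrete, here is a toy model of two learning switches joined by a trunk. It is only a sketch under assumptions: the class, port names, and MAC address are hypothetical, and the exact contents of the stale entry on switch B are a guess (here it still points back across the trunk). It shows why traffic is dropped while the stale entry exists and why purging it fixes things.

```python
class LearningSwitch:
    """Very small model of a layer 2 switch's MAC table."""

    def __init__(self, name):
        self.name = name
        self.mac_table = {}  # MAC address -> port it was last learned on

    def learn(self, mac, port):
        self.mac_table[mac] = port

    def purge(self, mac):
        self.mac_table.pop(mac, None)

    def egress_ports(self, mac, all_ports, ingress):
        """Known MAC: use the learned port. Unknown MAC: flood."""
        port = self.mac_table.get(mac)
        if port is not None:
            # A switch never sends a frame back out the port it came in on.
            return [] if port == ingress else [port]
        return [p for p in all_ports if p != ingress]


VM_MAC = "00:50:56:aa:bb:cc"  # hypothetical VM MAC
PORTS_A = ["hostA", "router", "trunk"]
PORTS_B = ["hostB", "trunk"]
switch_a = LearningSwitch("A")
switch_b = LearningSwitch("B")

# State after the VM moved from host A to host B:
switch_a.learn(VM_MAC, "trunk")  # A: "switch B owns this MAC now"
switch_b.learn(VM_MAC, "trunk")  # B: stale entry (assumed), never updated


def ports_reached_on_b(mac):
    """Follow a frame from the router through A and across the trunk to B."""
    if "trunk" not in switch_a.egress_ports(mac, PORTS_A, ingress="router"):
        return []  # the frame never leaves switch A toward B
    return switch_b.egress_ports(mac, PORTS_B, ingress="trunk")


before = ports_reached_on_b(VM_MAC)  # stale entry: frame is dropped on B
switch_b.purge(VM_MAC)               # the manual purge
after = ports_reached_on_b(VM_MAC)   # unknown MAC: B floods to hostB
switch_b.learn(VM_MAC, "hostB")      # the VM's reply re-teaches B
print(before, after)
```

In this model the purge works because an unknown destination is flooded, the VM answers, and the answer arrives on the correct port.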

When all ESX hosts are on the same switch, we have no issues.

Our network guys are still working with Cisco to figure out why this is happening, but we are much closer and have an easy work-around. Thanks everyone for all your good suggestions!

kellino · ‎05-10-2007

We don't know the root cause yet, but the problem is that the MAC tables on the Cisco switches aren't being updated properly when hosts move or use different paths.

The temporary workaround is to keep all hosts on the same switch.

jdvcp · ‎05-11-2007

You may have mentioned this, but are you using MAC-out loadbalancing on the vSwitch? I also have the following settings at the vSwitch level (with all underlying port groups inheriting)

Both pNICs as primary

Rolling YES

Notify YES

Link Test

Mac-out

oreeh · ‎05-11-2007

Adjust the ARP / MAC caching timeout.

Most switches have really high default values.

When reconfiguring the NIC teams make sure to set the "Notify Switches" option in the "NIC teaming settings" tab.

kellino · ‎05-11-2007

You may have mentioned this, but are you using
MAC-out loadbalancing on the vSwitch? I also have
the following settings at the vSwitch level (with all
underlying port groups inheriting)
Both pNICs as primary
Rolling YES
Notify YES
Link Test
Mac-out

Those were the original settings, but for the past 4 weeks we were running with load balancing (on ESX) disabled, and for the past 2 weeks we were running with only 1 NIC plugged in, and still experiencing the same symptoms.

jdvcp · ‎05-11-2007

Unbelievable. One way we validated we only had a teaming issue with ARP was to use only 1 NIC...the config you just told me about. In that case, we had no ARP issues.

Are the 2 6500s connected by a trunk which allows all necessary traffic, VLANs, etc. between switches?

Once again, sorry if this was already covered.

kellino · ‎05-11-2007

Adjust the ARP / MAC caching timeout.
Most switches have really high default values.
When reconfiguring the NIC teams make sure to set the
"Notify Switches" option in the "NIC teaming
settings" tab.

I think our network guy said that by default MAC gets purged after 5 minutes of no traffic -- but we proved this wasn't the issue.
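As an illustration of what that aging timer does, here is a minimal sketch of a MAC table whose entries expire after a period with no traffic from the MAC. The class and method names are hypothetical, and the 300-second value simply mirrors the 5 minutes mentioned above; time is passed in explicitly so the behaviour is easy to follow.

```python
class AgingMacTable:
    """MAC table whose entries expire after a period of no traffic."""

    def __init__(self, aging_seconds=300):
        self.aging_seconds = aging_seconds
        self._entries = {}  # MAC -> (port, time the MAC was last seen)

    def learn(self, mac, port, now):
        # Every frame sourced from the MAC refreshes its timestamp.
        self._entries[mac] = (port, now)

    def lookup(self, mac, now):
        """Return the learned port, or None if unknown or aged out."""
        entry = self._entries.get(mac)
        if entry is None:
            return None
        port, last_seen = entry
        if now - last_seen >= self.aging_seconds:
            del self._entries[mac]  # aged out: the next frame gets flooded
            return None
        return port


table = AgingMacTable(aging_seconds=300)
table.learn("00:50:56:aa:bb:cc", "Gi1/1", now=0)  # hypothetical MAC and port
print(table.lookup("00:50:56:aa:bb:cc", now=299))
print(table.lookup("00:50:56:aa:bb:cc", now=300))
```

If aging were the whole story, a stale entry would clear itself once the timer ran out, which fits with aging having been ruled out here.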

I'm almost inclined to think it's an IOS bug. He observed 2 interesting things:

1) ARP response packets being sent TO ESX with a MAC of 00-00-00-00-00. Needless to say, he wasn't sure who or what might have been transmitting these packets.

2) It's almost like at least one switch wasn't able to update its MAC table properly. For example, Switch B received notification from Switch A that he was now the owner of the MAC. The evidence for this was that Switch B's MAC table referred back to Switch A as the owner. But if you looked at Switch A's table, it hadn't updated this MAC. Manually purge this one MAC from the table and it instantly broadcasts, finds it, and correctly updates the table.

kellino · ‎05-11-2007

Are the 2 6500s connected by a trunk which allows all
necessary traffic, VLANs, etc. between switches?

Like a management network for inter-switch communication? Yes, I'm fairly certain they have something like this set up. There really isn't much evidence of any systemic communication problem between the switches. Well, none actually :). I'm inclined to think that somewhere there was more of a functional breakdown than a general communication problem, but that's largely a guess.

oreeh · ‎05-11-2007

I'm almost inclined to think it's an IOS bug. He observed 2 interesting things:
1) ARP response packets being sent TO ESX with a MAC of 00-00-00-00-00

Sounds like a severe bug.

Jan_MS · ‎07-25-2007

Hello,

We have the same problem. Sometimes, when a VM changes network card or physical switch (e.g. VMotion), the MAC address does not show up on the other Cisco switch port. It looks like "Notify Switches" doesn't work correctly!

Has this problem been solved, or is it a bug?

Thanks and regards,

Jan

skalol · ‎12-09-2008

I've read this whole thread without finding an answer...

I've got a similar problem when using failover between NICs. No problems with VMotion, only with failover.

I use only one vSwitch with port groups like this:

Switch Name      Num Ports   Used Ports  Configured Ports  Uplinks
vSwitch0         32          8           32                vmnic3,vmnic2,vmnic1,vmnic0

  PortGroup Name   Internal ID  VLAN ID  Used Ports  Uplinks
  VM Network 204   portgroup8   204      0           vmnic2,vmnic3
  VM Network 1     portgroup4   1        0           vmnic2,vmnic3
  VM Network 203   portgroup1   203      1           vmnic2,vmnic3
  Service Console  portgroup0   203      1           vmnic1,vmnic0
  VMotion          portgroup7   205      1           vmnic0,vmnic1

Loadbalance on ip

Notify Yes

Failback No

Vmnic 2 -> PSwitch 1 / Vmnic3 -> Pswitch2

For example, after disabling the uplink to vmnic2 on one of the pSwitches, only one ping is lost for a VM on that subnet, which is fine! But then, when I re-enable vmnic2, pings are lost and the VM's MAC shows up on both pSwitches! I have to wait several seconds before pings to or from the VM get through again!

Is there a patch for this on 3.0.2 (build 61618)? Why does it work with VMotion but not with failover?

skalol · ‎12-10-2008

OK, right!

I didn't find anything with the search engine, but reading through the patches I found ESX-1003515, which fixes the RARP problems with NIC failover on ESX 3.0.2! So now everything works fine!
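For readers wondering what RARP has to do with this: when "Notify Switches" is on, ESX announces a VM's MAC on the new uplink with a broadcast RARP frame, so the physical switches relearn which port the MAC lives behind. Below is a minimal sketch of the general shape of such a frame, built from the standard RARP layout. The helper names and the MAC address are hypothetical, and exactly which fields ESX fills in is an assumption, not something taken from the patch notes.

```python
import struct

ETHERTYPE_RARP = 0x8035      # EtherType assigned to RARP
BROADCAST = b"\xff" * 6      # Ethernet broadcast destination
RARP_REQUEST_REVERSE = 3     # RARP opcode: "request reverse"


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    raw = bytes(int(part, 16) for part in mac.split(":"))
    if len(raw) != 6:
        raise ValueError(f"not a 6-octet MAC address: {mac!r}")
    return raw


def build_rarp_broadcast(vm_mac: str) -> bytes:
    """Build an unpadded broadcast RARP frame sourced from vm_mac."""
    mac = mac_to_bytes(vm_mac)
    # The Ethernet source address is what the switch actually learns from.
    ethernet_header = BROADCAST + mac + struct.pack("!H", ETHERTYPE_RARP)
    rarp_body = (
        # hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
        # hardware length 6, protocol length 4, opcode
        struct.pack("!HHBBH", 1, 0x0800, 6, 4, RARP_REQUEST_REVERSE)
        + mac + bytes(4)   # sender MAC, sender IP left as 0.0.0.0
        + mac + bytes(4)   # target MAC, target IP left as 0.0.0.0
    )
    return ethernet_header + rarp_body


frame = build_rarp_broadcast("00:50:56:aa:bb:cc")  # hypothetical VM MAC
print(len(frame), frame.hex())
```

If that announcement is not sent (or is sent out the wrong NIC) after a failover or failback, both physical switches can keep an entry for the same MAC until normal traffic or aging corrects it, which matches the symptom described above.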
