We've had a case open with VMware for a few weeks now, and I'm putting this out here as a cry for help because all of our server and network engineers are stumped. As you read this you're going to say to yourself "that's not possible," but we've seen it and have reproduced it in WebEx sessions.
This is a bit complicated, as there are several conditions and variables, so I'll try to touch on all of them. Here goes:
We have 3 ESX (3.0.1) hosts in a cluster. The first 2 of these ESX hosts (HP DL580 G3) have 3 virtual switches. Each virtual switch has 2 NICs assigned, with each NIC going to a different Cisco 4500 switch. In the process of working on this issue we have disabled any load balancing in ESX, as well as VLANs and VLAN tagging.
The third host has only 2 NICs and we are using VLANs on it to separate the COS/VMK and VM networks.
The ESX hosts do not appear to be a variable as the problem has been experienced on VMs on each host.
Here's the fun part: trying to describe the problem. We have about 60 VMs in the cluster. The last 3-5 VMs to experience a network state change will have network connectivity issues. It doesn't matter if the VM is Linux or Windows, and the problem will move to the last few VMs that were touched. When the problem moves away from a VM, it regains full network connectivity, even if nothing was done to the VM.
Now, there are two variations of this network connectivity problem; I'm guessing that they both happen with nearly the same frequency.
The first variation is where the VM can't communicate beyond its local subnet. If you look at the ARP table, either there will be no entry for the gateway, or it will appear with a MAC address of all zeros and "invalid" for type. Trying to add a static ARP entry for the gateway doesn't help either.
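For reference, the check and the (failed) workaround look roughly like this on a Windows guest -- the gateway IP and MAC below are placeholders for your own values; Linux guests use `arp -n` and `arp -s` similarly:

```shell
# Inspect the guest's ARP cache; on a broken VM the gateway entry
# is either missing or shows all zeros with an "invalid" type.
arp -a

# Try pinning a static ARP entry for the gateway (placeholder values).
arp -s 10.0.0.1 00-11-22-33-44-55

# Even with the static entry in place, traffic beyond the subnet still fails.
ping 10.0.1.10
```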
The second variation is where the VM can reach some subnets, but not others. Yes, I know this sounds like a VLAN issue. Imagine our disappointment when we disabled all trunking and VLAN tagging and the problem persisted. The list of subnets that you can get to, and those that you cannot, seems to be static (consistent).
Got all that? Some VMs can't get past the gateway and other VMs will only talk to certain subnets. And this problem happens on any VM (Linux or Windows; most are Windows 2003), and it seems to follow the VMs that last experienced a network state change.
Here is something we demonstrated in a WebEx for VMware support. We took a VM and demonstrated that it was healthy (the ARP table looked good, and it could ping a set of 3 hosts). Then we took a snapshot and reverted to it. After reverting to the snapshot, one of the 2 network problems had manifested and the VM no longer had a healthy network connection.
No, this is not a joke; this is a real, live problem that has us stumped and, unfortunately, has our management losing faith in VMware.
I don't know that it is purely a VMware problem, and I'm highly inclined to think that the network (switch and port settings) is at least a factor. But we're stumped nonetheless.
These Cisco 4500 switches support a lot more in our datacenter and none of our physical servers (which we have more of) seem to be suffering any problems whatsoever.
Any ideas would be greatly appreciated.
I'd give 100 points to anyone who can answer this, but you'll have to settle for 10.
Oh, I should add that we've never had any problems on the virtual switches servicing the COS or VMkernel. The only problems are on vSwitches used to connect VMs to the network.
Thanks!
Have you opened a support call with Cisco to look at this?
Sorry, I can't help directly, but here are some thoughts on what I'd try out...
Have you tried manually setting a MAC address on a VM and seeing whether you can reproduce the problem on it?
Have you tried to recreate the problem with the host removed from the cluster?
We have our ESX Servers plugged into a Cisco 4510 with no problems like this.
Now, I will tell you that we did have an issue with one of the IOS releases on the 4510. We started experiencing issues where certain IPs could not get to www.google.com, for instance, but others could. We also looked at ARP tables, DNS, and even packet sniffs without success.
On an affected machine, if all I did was change the IP, I could then get out to the net. This started happening more and more frequently around the office, with more and more websites, until one night our core switch bounced every interface at 12:45am and then rebooted. It never came up; it rebooted again and auto-loaded an older revision of the code. Needless to say, that caused quite a smelly storm :O) We had our CCIE and 5 Cisco engineers basically rip that core switch apart, and they never found the issue. We eventually upgraded past that code level and have not had any other issues (although our current code does have a supervisor card memory leak that took out the core one time at 8am).
What version of IOS do you have on your 4500? I can check our IOS version Monday morning.
Check out:
kb.vmware.com/kb/507
Try setting the MAC statically on your VMs. I have seen similar instances of this behaviour (particularly with certain switches and our PIX firewall), and a refresh of the CAM (switch) and XLATE (PIX) tables seems to fix it. It may be related to the way in which VM MAC addresses can change (i.e., when a VM is migrated via DRS).
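For what it's worth, pinning a static MAC is a small .vmx edit made with the VM powered off. The address below is illustrative; as I understand it, VMware reserves 00:50:56:00:00:00 through 00:50:56:3F:FF:FF for statically assigned MACs:

```
ethernet0.addressType = "static"
ethernet0.address = "00:50:56:3F:00:01"
```

Any auto-generated address lines for that interface (ethernet0.generatedAddress and its offset) should be removed at the same time.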
Paul
The first variation is where the VM can't communicate beyond its local subnet.
From my experience, the most common reason for the behavior in the above variation is a scenario where an ESX network bond consists of two NIC ports going to two switch ports that were erroneously configured for different VLANs on the pSwitch. The result is that sometimes the guest can ping beyond the gateway and sometimes it can't.
Of course I'm sure you triple checked that but just throwing it out there.
Some other troubleshooting would include temporarily removing the additional members of the bond, so that each vSwitch has only one network connection.
Additionally, you may consider making copies of the config file (.vmx) before and after, then performing a diff to see what changed during the test scenario illustrated via WebEx previously.
Of course, good old fashioned log analysis is in order as well. So performing a tail on the /var/log/vmkernel and vmware.log files would be helpful as well.
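On the service console, that before/after comparison and log watch are just a few commands -- the datastore and VM paths below are placeholders for your own:

```shell
# Save a copy of the VM's config before the snapshot/revert test:
cp /vmfs/volumes/datastore1/testvm/testvm.vmx /tmp/testvm.vmx.before

# ...run the revert test, then see what ESX rewrote:
diff /tmp/testvm.vmx.before /vmfs/volumes/datastore1/testvm/testvm.vmx

# Watch the VMkernel log and the per-VM log while reproducing the issue:
tail -f /var/log/vmkernel /vmfs/volumes/datastore1/testvm/vmware.log
```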
Thanks for all the advice and insight thus far.
I don't think the MAC issue is relevant here for several reasons --
1) MAC changes simply don't happen nearly as often as we observe the problem.
2) I did multiple reboots, reverts, and migrations and did not observe a single MAC change.
3) I observed the problem appear on a VM even though the MAC never changed.
Here's something interesting: some VMs appear to "self-heal" after a period of time, while others will not.
I just found a VM that will work fine on Host A, but never on Host B. And I found another VM that works fine on Host B, but not on Host A. Remember, all VLAN tagging and trunking is disabled on Host B.
All transmit load balancing (ESX) is also disabled. I think I'm going to unplug a cable from each host so that we only have 1 common switch in the mix and see where that takes us. Also if it works out, we will try moving a host to one of our core switches (6500).
Hi,
I might be totally wrong, but this sounds like a spanning-tree problem to me.
Keep in mind that virtual switches do not support STP (since they cannot cause a loop).
It might be worth having your network team check out the STP events in your upstream switches. Connecting the Hosts to only one switch may also solve the problem.
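If it helps, the quick things to have the network team pull on the upstream switches would be along these lines (interface number illustrative):

```
show spanning-tree summary
show spanning-tree interface GigabitEthernet2/1 detail
show logging | include SPANTREE
```

Frequent topology-change events in that output would point at STP reconverging whenever a VM's traffic moves between uplinks.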
Just for reference, I run 714 VMs on 25 ESX hosts (7 clusters) connected to a Cisco 4506 (and a 6509 Backbone), with no such problems, but each of my hosts are connected to a single 4506 (I use teaming only to solve link failure problems, not switch outages).
Good Luck,
Yehuda
Thanks Yehuda.
The VMware engineers we've been working with originally told us to enable BOTH portfast and spantree on the switch ports.
I asked our network team if we could have a loop due to spanning tree, and they said that since spanning tree is Layer 2, we would be seeing the issue on the whole subnet, as opposed to a problem that appears to move among specific VMs.
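For reference, on an access (non-trunking) port facing an ESX uplink, that portfast recommendation boils down to this (interface number illustrative):

```
interface GigabitEthernet2/1
 switchport mode access
 spanning-tree portfast
```

Portfast skips the listening/learning delay when the port comes up, which matters because the vSwitch itself never participates in STP.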
A few things to add:
1. Have you configured enough virtual switchports (defaults to 56 I think) on your VM vSwitch?
2. Have you tried totally removing then re-creating your VM vSwitch? I've seen this fix strange VM network connectivity issues.
3. This one is a shot in the dark, but I'm trying to help! Do you configure MAC address security on your switchports?
Out of interest we had pretty much this exact range of symptoms 2 years back with ESX 2.5.x. At the time we were trying to run 'out-ip' based load-balancing on our servers. Unfortunately we were unable to reach a support fix with VMware for it after weeks of pulling our Cisco network apart and ended up settling with 'out-mac'. You mention you've disabled load-balancing - has modifying your host load-balancing options had any effect on the symptoms?
Gavin
Virtual switchports -- that was actually one of the first things we looked at about a month ago. We jacked them up to 120....
Recreating the vSwitch? We didn't try that, but we are seeing this on three hosts -- three hosts that didn't have any issues until either an environmental change (not on the ESX side) or problems of scale exposed this problem.
No MAC security is used. All good ideas and thanks....This is why we are stumped!
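For anyone checking the same thing, the configured and in-use port counts are visible from the service console; the count itself is changed in the VI Client under the vSwitch properties and, as I understand it, takes effect after a host reboot:

```shell
# Lists each vSwitch with its Num Ports / Used Ports columns,
# plus the portgroups and uplink pNICs bound to it:
esxcfg-vswitch -l
```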
We never had any receive load balancing. Each vSwitch spans 2 switches.
Transmit load balancing we completely turned off a week or more ago. No changes.
The latest iteration is this....
From Host A we disabled the path to switch B. And from Host B we disabled the path to switch A.
We still have the problems we were experimenting with. Then, out of the blue today while I was in a meeting, 9 VMs lost their connectivity -- all on Host B. Within 40 minutes, 5 of these came back on by themselves. The remaining 4 had bad ARP tables and could not go anywhere. We VMotioned these 4 to Host A and the ARP problems persisted even after VM reboots.....
Now this is interesting.....
As I mentioned before we have load balancing disabled on ESX ("use explicit failover order") and we also unplugged one of the two NICs, leaving only one path.
Well, one of our network techs, unaware of the situation, saw the dangling cable and plugged it back in.
When he plugged it in, all the VMs on the host fell off the network and were unreachable. Within 40 minutes most found their way back online on their own, but the rest had invalid ARP tables and would only work if VMotioned to another host.
Here's the million dollar question -- if ESX transmit load balancing is disabled, and someone plugs in a 2nd NIC for the virtual switch, why should it have any impact at all?
Looking at the configuration, the 2nd NIC had higher priority, so when it was plugged in, ESX tried to use it. I understand that, but I don't understand why nothing on this connection had access for 20-40 minutes, and why some VMs still didn't get healthy ARP tables....
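To see what the vSwitch actually does with its uplinks when a cable comes back, we've been dumping state from the service console before and after plugging it in (both are standard ESX 3.x tools):

```shell
# Which pNICs are bound to each vSwitch and portgroup:
esxcfg-vswitch -l
# Physical NIC link state, speed, and duplex:
esxcfg-nics -l
```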
After this incident we moved one of our hosts from the 4500 switches to the core 6500 switches. We haven't seen any ARP issues since doing this, but we still have the "selective subnet" problem, where a VM can get to some subnets but not others.
The more we test the less sense this all makes....
Start stripping the complexity away. Unbind all but one pNIC from the vSwitch hosting your VMs and see if the problems still persist.
This might be out there, but check the number of ports you have assigned to the vSwitch your VLANs are on. Make sure the number of ports is greater than the number of VMs you are running. If not, change it to the maximum number of VMs you would ever run on that switch and restart the server.
I just resolved a similar issue where VMotioning was causing ARP loops in my Cisco switches. Our network guy re-read the 802.1Q guide and changed some settings:
switchport nonegotiate
spanning-tree portfast trunk
no cdp enable
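In context, those commands sit on the host-facing trunk port, something like this (interface number and description are illustrative):

```
interface GigabitEthernet3/1
 description ESX host uplink
 switchport trunk encapsulation dot1q
 switchport mode trunk
 switchport nonegotiate
 spanning-tree portfast trunk
 no cdp enable
```

The nonegotiate setting stops DTP negotiation with the vSwitch (which will never answer it), and portfast trunk skips the STP forwarding delay on the trunk when the link flaps.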
Thanks. VMware did mention the first two settings previously -- I'll double-check to make sure these are still in place after moving to the core switches.
I could be wrong, but I'm guessing that there are two problems here. I'm slightly inclined to think that the "selective subnet" issue I describe above might somehow be a VLAN issue....still working on this with our network team....Thanks everyone for the ideas so far....
When I have it all figured out (or at least part of it) I'll follow up here.
We already had all of these settings, it seems, except for "no cdp enable", which hasn't made any difference.
It's getting much more difficult not to point the finger at VMware.
We've ruled out switch hardware by moving to the 6500 core switches, and nowhere else in our datacenter do we see such symptoms -- only in the ESX environment.
We have plenty of virtual ports, we disabled load balancing, and the NICs are set to 1000/Full. There's not much else to do on the VMware side -- yet the problems seem to be unique to VMware.
The problem with pointing the finger at VMware is that it would mean most of us would be having the same problems -- and since most of us are using Cisco core-level switches (4500+), we would all be running into them.
Agreed! I made this exact point originally to the network team to solicit their involvement. The only problem is that 3 weeks later we still don't know the root cause, nor can I explain why ARP loops and other such issues would be exclusive to ESX and not the rest of the datacenter.
Just getting desperate, I guess.....if there are no settings on the ESX side to tweak at this point, and the problem is only seen on ESX servers -- I'm just having a hard time answering this question.