We are having a problem with some of our virtual machines intermittently losing communication with each other, and I’m at a loss as to the source.
We have about 250 VMs running on about 20 HP BL465c blades installed in two HP c7000 chassis, using the HP Virtual Connect interconnect modules. The blade chassis are connected to our core Cisco 6500 switches. The VMware hosts are at ESXi 5.0; the guest VMs are a mix of Windows 2003, 2008, and 2008 R2.
What’s going on is that everything seems to be OK, but then out of nowhere we get communication failures between specific machines. It looks like an ARP issue: ping works fine in one direction, but we get an “unreachable” error going the other way, unless we first ping from the target back to the source.
For example: we have servers “A” and “B”. Pinging A to B fails with “unreachable”; pinging B to A works fine. However, after pinging B to A, we can now ping A to B, at least for a while, until the entry falls out of the ARP cache. If we go into server A and set a static ARP entry (“arp -s”) for server B, everything works OK. Through all this, both server A and server B have no issues communicating with any other machines.
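For anyone trying to picture why the one-way ping “primes” the path: a host that receives an ARP request also caches the sender’s MAC/IP mapping, so B pinging A teaches A who B is. Here is a minimal Python sketch of that cache behavior (a toy model, not real networking; the host names, MACs, and timeout value are made up for illustration):

```python
class Host:
    """Toy model of a host's ARP cache, enough to show the one-way-ping symptom."""
    def __init__(self, name, mac, arp_timeout=120):
        self.name, self.mac = name, mac
        self.cache = {}            # peer name -> (mac, time learned)
        self.arp_timeout = arp_timeout

    def learn(self, peer, mac, now):
        self.cache[peer] = (mac, now)

    def resolve(self, peer, now):
        entry = self.cache.get(peer)
        if entry and now - entry[1] < self.arp_timeout:
            return entry[0]
        return None

def ping(src, dst, now, request_delivered=True):
    """src pings dst. If src must ARP first and the fabric drops the
    broadcast/reply (request_delivered=False), the ping fails with
    'unreachable'. A delivered ARP request also teaches dst who src is."""
    if src.resolve(dst.name, now) is None:
        if not request_delivered:
            return "unreachable"
        src.learn(dst.name, dst.mac, now)
        dst.learn(src.name, src.mac, now)   # dst caches the sender mapping too
    return "reply"

a = Host("A", "00:50:56:aa:aa:aa")
b = Host("B", "00:50:56:bb:bb:bb")

print(ping(a, b, now=0, request_delivered=False))    # A -> B: unreachable
print(ping(b, a, now=1))                             # B -> A: works, and A learns B
print(ping(a, b, now=2, request_delivered=False))    # A -> B now works from cache
print(ping(a, b, now=300, request_delivered=False))  # entry aged out: unreachable again
```

This reproduces the symptom pattern exactly: the failure returns once the cached entry expires, and a static entry (which never expires) masks it indefinitely.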
We tried using vMotion to move the servers to a different host, a different blade chassis, etc. Nothing worked until we put both VMs on the same host; then everything was OK. As soon as we moved one of the servers to a different host, the problem came back.
It seems like either the ARP broadcast from the one server, or the reply back from the target, isn’t making it through. However, according to our networking group, there are no issues showing up on the Cisco switches.
Early this year, we had an issue where this happened on about a third of our machines at the same time (it caused significant outages to production systems!). It seemed to be limited to machines on one chassis (though not all of the machines on that chassis). At that time, we opened tickets with VMware and HP. Neither found anything wrong with our configuration, but somewhere in the various server moves, configuration resets, etc., everything started working again.
Since that time we’ve seen it very intermittently on a few machines, but then it seems to go away after a few days.
The issue we found today was that the server we’re using for Microsoft WSUS hadn’t been receiving updates from a couple of the member servers. We could ping from the WSUS server to the member server, but not back from the member server unless we put a static ARP entry in the member server. The member servers are otherwise working fine, talking to other machines OK, etc. This is a production environment, so we’re limited in the testing we can do.
Also, when it has happened, it seems to have always been between machines on the same subnet. However, most of our servers are on the same subnet, so that might just be coincidence.
I’ve done a lot of internet searching and have found some postings with similar issues, but haven’t found any solution. I don’t know whether it’s a VMware, HP, Cisco, or Windows issue.
Any assistance would be appreciated.
I never did find a clear cause or solution.
However, I just completed a firmware update on the chassis, the Virtual Connect modules, and the blades. As part of the process, I also reconfigured Virtual Connect to use VLAN tunneling instead of the Shared Uplink Sets configuration we had originally set up.
As I understand it, using the shared uplink set method, Virtual Connect strips off the VLAN tag. Then if you send “multiple networks” to the blade NIC, it basically reassembles the packet with the VLAN tags.
Using VLAN tunneling, it sends all the packets coming into the VC module straight through to the blade and lets the OS on the blade split out the different VLANs.
I had come across an old posting stating that since the ARP packet was so small to begin with, once Virtual Connect stripped off the VLAN tag the packet was sometimes too small and got discarded. Supposedly an earlier version of the VC firmware fixed the issue.
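For what it’s worth, the arithmetic behind that theory checks out, assuming standard Ethernet numbers (64-byte minimum frame including FCS, 28-byte ARP payload, 4-byte 802.1Q tag). A quick sketch of the sizes:

```python
# Standard Ethernet sizes; the scenario (a device stripping the tag without
# re-padding) is the theory from the old posting, not something I've confirmed.
ETH_HEADER = 14      # dst MAC + src MAC + EtherType
DOT1Q_TAG  = 4       # 802.1Q VLAN tag inserted after the source MAC
FCS        = 4       # frame check sequence
MIN_FRAME  = 64      # minimum Ethernet frame on the wire, including FCS
ARP_PAYLOAD = 28     # IPv4-over-Ethernet ARP message

# An untagged ARP frame is well under minimum, so the sending NIC pads it.
untagged = ETH_HEADER + ARP_PAYLOAD + FCS            # 46 bytes before padding
padded_untagged = max(untagged, MIN_FRAME)           # 64 on the wire

# The tagged version is 4 bytes bigger, still under minimum, still padded to 64.
tagged = ETH_HEADER + DOT1Q_TAG + ARP_PAYLOAD + FCS  # 50 bytes before padding
padded_tagged = max(tagged, MIN_FRAME)               # 64 on the wire

# If an intermediate device strips the 4-byte tag without re-padding, the
# frame drops below the 64-byte minimum and a strict receiver may discard
# it as a runt.
after_strip = padded_tagged - DOT1Q_TAG
print(padded_tagged, after_strip, after_strip < MIN_FRAME)  # 64 60 True
```

So a tag-stripping bug really could selectively kill tiny frames like ARP while leaving full-size traffic untouched, which would match the symptoms.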
Since we were sending all the VLANs to the blade NICs anyway, using the “multiple networks” config, it seems more efficient to just let Virtual Connect pass the packets through without splitting/recombining them.
I’m hoping that will take care of the ARP issue, but since it’s so intermittent, it’s hard to tell. I tested it on two machines that showed the issue a couple of weeks ago (when I posted the original message). The ping/ARP worked OK, but then moving the machines back to some blades that I hadn’t updated showed it still working…
If it does come back, I’ll open a case with VMWare and HP. Part of the reason for the firmware updates was so that if I did need to open a case, at least we’re running the current releases.
Seeing how you used to have this problem and are having it again now, check that there are no static ARP mappings on the switches. Sometimes, when people are desperately trying to troubleshoot something, they keep making changes trying to fix it but don’t record and roll back the changes that are unsuccessful.
Do you have any nested ESXi hosts on this subnet? They can be very annoying, as their VM config can report one address while their actual management MAC is different; you need to check the actual ESXi host in vCenter, not its VM.
Log into each switch and inspect its ARP table. There are three scenarios on each:
1) The ARP entry points in the correct direction
2) The ARP entry points in the wrong direction suggesting duplicate MAC or incorrect static ARP
3) There is no entry
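If you want to automate that check across a lot of entries, here’s a rough Python sketch of the comparison; the IPs, the MACs, and the idea of an “expected” table pulled from the VMs’ settings in vCenter are all hypothetical, and real `show ip arp` output would need parsing first:

```python
# Hypothetical expected mappings, e.g. taken from each VM's vNIC in vCenter.
expected = {
    "10.0.1.10": "0050.56aa.aaaa",   # server A (example values)
    "10.0.1.11": "0050.56bb.bbbb",   # server B
}

# What the switch actually reports (made-up sample data for illustration).
arp_table = {
    "10.0.1.10": "0050.56aa.aaaa",
    "10.0.1.11": "0050.56cc.cccc",   # wrong MAC: duplicate or stale static entry?
}

def classify(ip):
    """Sort an IP into one of the three scenarios above."""
    actual = arp_table.get(ip)
    if actual is None:
        return "no entry"
    if actual == expected[ip]:
        return "correct"
    return "wrong direction (duplicate MAC or static ARP?)"

for ip in expected:
    print(ip, "->", classify(ip))
```

Running the same comparison on every switch in the path makes it easy to spot which hop holds the bad entry.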
I came across this article and will give it a try. It sounds like a reasonable explanation. The source of the problem has been difficult to track down as it is sporadic. I also have a case open for this issue.
Thanks for the comment, but I don’t believe that’s the issue for us. Our blades are using the NC553i, which is the ServerEngines chipset, not the 5xx NetXen series.
Also, the HP advisory is talking about losing network connectivity, but even when we had the ping/ARP issue on a server, it was only between specific ones. The servers could both communicate with other machines with no issues. Also, once the ARP entry was put in, both servers would work fine with each other.
The workaround for this appears to be using “Originating Port” teaming rather than “Physical NIC Load”. Switch all dvSwitch port groups to “Originating Port” teaming and the problem goes away. Also be aware that the October SPP installs a newer version of the NC553i firmware than the “October HP Recipe” calls for.
The “October2012VMwareRecipe3.0.pdf” lists the correct firmware version as 4.1.450.16; however, the SPP installs 4.1.450.7. You need to ensure the earlier version is loaded to comply with the recipe.
You need to change the teaming with either of these versions.
Not a fix, but a workaround nonetheless.
In my case all port groups are already set to originating port and we have the problem. We are currently going through firmware and driver updates to see if that fixes the issue. So far so good on half the hosts.
We’re not using distributed virtual switches (at least not yet); each blade has its “own” vSwitch. The teaming on those is set to “route based on the originating virtual port ID”.
The firmware on our NICs is the 4.1.450.7 version. About two weeks ago, I updated them using the SPP so that it would bring the other parts (P4xx, BIOS, etc.) up to the current levels. I saw that 4.1.450.16 was out, but the “resolved issues” didn’t seem to address anything that applied to us.
That’s kind of what I’m hoping too: that getting the firmware and drivers up to the current levels will help, along with switching the Virtual Connect configuration for the VMware blades to use VLAN tunneling instead of the shared uplink set/“multiple networks” server profile NICs.
All of the blades now have the current drivers and 4.1.450.7 firmware. One of the two chassis has also been reconfigured to use VLAN tunneling. So far, I haven’t come across the ping/ARP issue. The second chassis has a few non-VMware blades in it, so I can’t change its Virtual Connect configuration until our maintenance window next weekend.
Our ARP problem did not go away until we back-revved the firmware to 4.1.450.16, as per the October recipe. I have seen this before in VC upgrades: one version breaks teaming or “Smart Link”, the next version fixes it, and then they seem to forget and the very next version breaks it again. Enterprise networking? Virtual Connect? What a nightmare. For the other guy who is talking about VLAN mapping vs. VLAN tunneling: mapping adds absolutely no value to a VMware solution, just another layer of abstraction (it tags and untags every packet with another header). I would also recommend reverting every enclosure to use the factory-assigned MAC addresses, as once again there is no need to move blades between device bays if they are all part of a vSphere cluster; the redundancy is provided by HA.
I would strongly recommend that anyone who is looking at this and considering an HP VC or FlexFabric solution run a mile. Use standard Cisco or Brocade switch modules.
Have you had this issue at all since the firmware update? I am currently on a call with HP trying to troubleshoot this exact same issue. The VMware tech pointed at the edge switch; the networking team points to the chassis. I am at a loss and have been working this for a couple of weeks now. I just wonder if you have seen this come back up since the updates. We are currently running OA 3.70, VC 3.51, and ESXi 5.1.
Thanks for your time.
Having seen this problem several times, I believe it is as simple as the gratuitous ARP not reaching the external switches when a VM is migrated. The vSwitch (i.e., the target host involved in the migration) is supposed to send an RARP packet to the external switch. Given HP’s abysmal record with VC firmware updates, I would bet the farm that the VC layer is dropping the RARP packet. Get VMware support to prove to you, firstly, that the RARP is being sent from the target host to notify the switches. Then get your network guys to see if they are receiving the RARP on the external switch ports. Either ESXi is not sending the “Notify Switches” (RARP) packet at all, or the VC layer is not passing it on to the blade uplinks (external switch ports). We have VC configured for tunneling (trunk mode), so this should not happen; however, we are still using VC-assigned MAC addresses, which have caused several major outages in the past. Every time networking breaks, it is a combination of VC, the Emulex NICs, or both.
The RARP packet theory is supported by the fact that if a VM is pinging a host on a different subnet, or the gateway, the problem never occurs during migration. Let me know what you find.
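If you do capture on the uplinks during a vMotion, what you are looking for is a broadcast frame with EtherType 0x8035, sourced from the VM’s MAC. Here is a Python sketch of the standard RARP frame layout so you know what to match on (the exact frame ESXi emits may differ in detail; the MAC below is made up):

```python
import struct

def notify_switches_frame(vm_mac: bytes) -> bytes:
    """Sketch of a 'Notify Switches'-style RARP frame: a broadcast sourced
    from the VM's MAC with EtherType 0x8035, so the physical switches
    relearn which port that MAC now lives behind."""
    dst = b"\xff" * 6                                 # broadcast destination
    eth = dst + vm_mac + struct.pack("!H", 0x8035)    # RARP EtherType
    # RARP body per RFC 903: htype=1 (Ethernet), ptype=0x0800 (IPv4),
    # hlen=6, plen=4, opcode=3 (request reverse);
    # sender/target hardware address = VM MAC, protocol addresses zeroed.
    body = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)
    body += vm_mac + b"\x00" * 4 + vm_mac + b"\x00" * 4
    return eth + body

frame = notify_switches_frame(bytes.fromhex("005056aabbcc"))
print(len(frame), frame[12:14].hex())   # 42 8035
```

A tcpdump/Wireshark filter on EtherType 0x8035 at the external switch port, run during a test vMotion, should settle whether the VC layer is passing the notification through.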
Just an update. The only way we have been able to “solve” this issue is to configure all hosts to use active/standby uplinks. The VMs then don’t lose connectivity when they migrate. I’m not sure why, but it appears the active/active options no longer work (from ESXi 5 onwards); we have tried both load-based teaming and originating port ID in the active/active configuration, but neither seems to update the switches when a VM migrates. Both of these options worked previously; something broke this in version 5. Now we only have half the available bandwidth being used by each host.
Also, surprisingly, in this instance it is not HP Virtual Connect at fault (a first!), as our IBM x3650 hosts are doing exactly the same thing.
Any feedback on why this is happening would be very much appreciated.
In our case the issue was finally resolved when HP sent us new NICs of the same model but with a newer revision. They eventually said that there was a hardware problem related to the Qlogic chipset on the HP NC375T adapters. Prior to that we had tried several OS patches related to ARP and different combinations of drivers and firmware as instructed, to no avail. We have not had the problem since, nor have we seen it on the other NIC models we use.
As for the issue being the NC375T: we’re not using the QLogic-based NICs; our blades have the onboard NC551i Emulex chipset.
So far it hasn't come back since we updated all the firmware and re-configured the Virtual connect to use tunneling mode.
I may have spoken too soon when I said we weren’t having the issue any more. Earlier today I found one VM that was unable to ping a different server. Both are in the same subnet, but on different blades. I could ping from the target to the source, and then pinging from source to target worked OK for a while, then stopped working again. I vMotioned the target to a blade in a different chassis and everything worked OK.
I can’t find it on anything else, but that doesn’t mean it’s not happening. We have something like 300 servers, and most of the communication is NOT server-to-server, so short of having every server test-ping every other server, it’s not really going to show up.
NV1, you mentioned that a workaround was to use active/standby NICs. Do you mean to have one NIC on the virtual switch as active and the other on standby? We are doing ours with all the NICs on the virtual switch active. I’d really hate to lose the 10Gb of bandwidth by putting one NIC on standby.
That is correct. I have had to turn off active/active teaming on all dvSwitch port groups, as it is the only way I can “resolve” the problem. It is a major step backwards for the ESXi platform. The config I am now using is:
All VM traffic port groups use vmnic0 active and vmnic1 standby, and
all vmkernel (management and vMotion) port groups use vmnic1 active and vmnic0 standby. At least this way both 10Gb NICs are being used, however not as efficiently as aggregating 20Gb and letting NIOC do its thing.
This is major but I have yet to find a real "fix" for this in the VMware KB.
I will also post links to serious bugs in the console network settings check and the dvSwitch health check that have cost me days and days of lost time chasing my tail on problems that don’t exist, because of buggy code.
Stay tuned and please update me if you make any progress.
While I am at it here is another beauty from the big V.
The new distributed switch “Health Check” had me all excited once I had finished the 5.1 upgrade. It identifies misconfigurations between the virtual network ports and the physical network ports, except once again there is a bug that throws a serious error (randomly, on different hosts at different times, it would appear).
Unfortunately, this makes a very good new feature somewhat unreliable. The VLAN and MTU checks appear to work well, but I have turned off the other check (the teaming and failover check) until this bug is resolved by VMware.
Currently, neither issue has a patch, which is annoying, especially this one, which has been around since the product was released last year.
Pass it on. Might save folks some time if they are not already aware of it.
Also, another one:
Basically, when the “Test Network Settings” function is run from the ESXi console once, all is good. But if there actually is a misconfiguration (incorrect hostname, DNS, or IP settings) and you run it again after fixing the issue, the test never completes properly. That is what happened with the first host I saw the problem on. That led me to re-test the other 490c G7s I was building at the same time (the first test had been successful on those). I had already built the G6 490c hosts, and all their tests had only been run once (successfully). The chase then started: the more hosts I tested, the bigger the problem appeared to be, throwing me off the real problem (the host or hosts dropping pings and disconnecting from vCenter).
When I went back today and started testing hosts that had not had the problem previously, they all failed to resolve the ESXi hostname in DNS as described by the article (even though pings and nslookup always work). I still have at least one host that is dropping its connection to vCenter intermittently, but I will go back and confirm that it has a separate problem.
Unfortunately, VMware support was not aware of the test utility fault when I started seeing the problem back in early February, so I was chasing my tail trying to solve a connectivity problem when it was just a bug in the “testing” utility.
Release quality and problem rectification are starting to become a real problem with the vSphere platform. Is it perhaps because we are now dealing with EMC, not VMware? The core technical team that invented VMware is now all gone (Diane, Mendel, and Steve), replaced by Cisco and EMC folks. Both are excellent hardware companies, but in my experience often dreadful software companies when they try to be all things to all people (as VMware is also now doing). Sadly, most of the vSphere features that have been released since the folks above left the company were in fact working in alpha code before they left. Unfortunately, the issues we have been discussing here are the start of things to come, i.e., “get the feature to market ASAP”.
I have spent the last 10 years specializing in this platform, with great results. At the end of the day, the only features that truly matter to my customers are:
Quality, reliability and performance. They are certainly paying for this!
Hopefully someone will give this feedback to Pat.