VMware Cloud Community
MikeOD
Enthusiast

VM Ping/ARP issue

We are having a problem with some of our virtual machines intermittently losing communication with each other, and I’m at a loss as to the source.

We have about 250 VMs running on about 20 HP BL465c blades installed in two HP c7000 chassis, using the HP Virtual Connect interconnect modules.  The blade chassis are connected to our core Cisco 6500 switches.  The VMware hosts are at 5.0; the guest VMs are a mix of Windows 2003, 2008, and 2008 R2.

What’s going on is that everything seems to be OK, but then, out of nowhere, we get communication failures between specific machines.  It looks like an ARP issue: ping works fine in one direction, but we get an “unreachable” error going the other way, unless we ping from the target back to the source first.

For example, take servers “A” and “B”.  Ping A to B fails with “unreachable”; ping B to A works fine.  However, after pinging B to A, we can ping A to B again, at least for a while until the entry ages out of the ARP cache.  If we go into server “A” and set a static ARP entry (“arp -s”) for server “B”, everything works OK.  Through all of this, both server “A” and server “B” have no issues communicating with any other machines.
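In case it helps anyone reproduce this, here is roughly the sequence we run from a Windows command prompt on server “A” to confirm it is ARP and to apply the workaround (the IP and MAC addresses below are just placeholders, not our real ones):

ping 10.1.1.20                          (A to B - fails with "unreachable")
arp -a 10.1.1.20                        (no entry, or an incomplete entry, for B in A's ARP cache)
arp -s 10.1.1.20 00-50-56-aa-bb-cc      (static entry for B - pings from A to B now work)
arp -d 10.1.1.20                        (remove the static entry again once things are back to normal)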

We tried using vMotion to move the servers to a different host, different blade chassis, etc.  Nothing worked except putting both VMs on the same host; then everything worked OK.  Moving one of the servers to a different host brought the problem back.

It seems like either the ARP broadcast from the one server, or the reply back from the target, isn't making it through.  However, according to our networking group, there are no issues showing up on the Cisco switches.

Early this year, we had an incident where it happened on about a third of our machines at the same time (it caused significant outages to production systems!).  It seemed to be limited to machines on one chassis (but not all of the machines on that chassis).  At that time, we opened tickets with VMware and HP.  Neither found anything wrong with our configuration, but somewhere in the various server moves, configuration resets, etc., everything started working again.

Since that time we’ve seen it very intermittently on a few machines, but then it seems to go away after a few days.

The issue we found today was that our Microsoft WSUS server hadn't been receiving updates from a couple of the member servers.  We could ping from the WSUS server to a member server, but not back from the member server unless we put a static ARP entry on it.  The member servers are working fine otherwise, talking to other machines OK, etc.  They are in a production environment, so we're limited in the testing we can do.

Also, when it has happened, it seems like it has always been between machines on the same subnet.  However, most of our servers are on the same subnet, so that might just be coincidence.

I’ve done a lot of internet searching and have found some postings describing similar issues, but haven’t found any solution.  I don’t know if it’s a VMware, HP, Cisco, or Windows issue.

Any assistance would be appreciated.

Mike O'Donnell

54 Replies
MikeOD
Enthusiast

That's what I thought you meant.  We're not using the distributed switch settings; each blade has its own configuration.

About how many VMs are you running per host?  I realize 10Gb is a large pipe, and most servers don't even push a 1Gb NIC, but we're running about 20 VMs per host.  I'm really uneasy about running all that through a single NIC, even if it is 10Gb.

geeaib824
Contributor

I was seeing the same issue, but we decided to update our Virtual Connect firmware from version 3.51 to version 3.70 based on the HP recipe (http://vibsdepot.hp.com/hpq/recipes/December2012VMwareRecipe4.0.pdf).  I then cabled up two ports per interconnect bay (X3 and X4 on bays 1 and 2) on my c7000 enclosure, with each pair going to a single switch, and we set up LACP on them, which allowed the ports in the Shared Uplink Set to go Active/Active.  We are tagging the VLANs on the Ethernet Networks in VC, and I set my LOMs to Active/Active on the dvSwitches in ESXi 5.  We haven't had the issue since.
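For anyone setting this up, the switch end of each uplink pair is just an ordinary LACP port-channel trunk; on a Cisco IOS switch, for example, it would look roughly like this (interface names, VLAN IDs, and the channel-group number are placeholders, not our actual config):

interface TenGigabitEthernet1/1
 description Uplink to VC bay 1, port X3
 switchport
 switchport mode trunk
 switchport trunk allowed vlan 100,200,300
 channel-group 10 mode active

The second uplink port (X4) gets the same settings, and interface Port-channel10 carries the same trunk config; "mode active" is what makes the channel negotiate LACP rather than run as a static channel.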

NV1
Contributor

Averaging around 15 on our smaller blades (490c) and around 30 on our big BL720c blades (2x 10-core, 256GB).  Sometimes it is over 50 per host (during patching, upgrades, etc., when we do multiple remediations at the same time).

Only 8 simultaneous vMotions (the default for 10Gb on 5.x) ever seems to push the pipe, but I am still not happy with the active/standby arrangement, as it is a major step backwards in functionality that had been working seamlessly since version 2.5.

Regards

Nick

NV1
Contributor

geeaib824

This makes sense, as the problem started around the time LACP support was introduced in ESXi.  Unfortunately it has broken the simple originating port ID and LBT teaming when the VC modules are not using LACP.  We have each module connected to 1 x 10Gb port on the same switch.  Redundancy is provided by a standby 1Gb link on each VC module to a separate switch.  ESXi has no visibility of this; the failover is handled at the VC layer.  VC is on the latest firmware, by the way.

All ESXi is aware of is the two 10Gb uplinks to the same switch.  A very simple config that has always worked seamlessly in the past.  We are also seeing exactly the same symptoms on our rack hosts that don't use VC.  I suspect that if we turn LACP on, the problem will disappear.  We are, however, deliberately trying to keep things as simple as possible given the nightmares we have experienced with Virtual Connect over the last 3 years.

Fundamentally, it appears that in introducing LACP to ESXi directly, the other options have broken.  Ironically, we stopped using LACP at the VC modules because it previously broke the ESXi teaming!

MikeOD
Enthusiast

It seems like (at least for us), after any kind of major Virtual Connect change, it's about 5 or 6 months before the issue comes back.

Our structure has a four-port LACP group going to each interconnect (1 and 2), with the NICs in VMware set to active/active.  When this first came up, we thought it was a firmware issue, so we made sure we were up to date.  When it came back, since we originally had Virtual Connect splitting up the VLANs and re-combining them, we updated the firmware again and converted the connections to tunneling mode.  That seemed to take care of the issue, but I found one case today.

The next thing I might try is what NV1 suggested: only having one NIC active in VMware and the other on standby.  I would hate to lose half our bandwidth, but if that's what it takes...

NV1
Contributor

Hi Mike

I feel your pain.  It is very frustrating that problems fixed in one version of VC break again in the next version.  From what you are saying, LACP at the VC layer does not work for you the way it appears to have worked for geeaib824.  That does not surprise me.  This is why I have spent the last 2 years trying to dumb down VC as much as possible and keep both the VC and dvSwitch uplinks as simple as possible.  However, we are now struggling with bugs at both the VC and ESXi networking layers, so it is very difficult to know where to start.

One thing we did about a year ago was implement 2 rack hosts for the Management Cluster, which at least protects vCenter, AD, DNS, etc. whenever Virtual Connect falls over.  It makes it easier to troubleshoot and get everything back online.  In this instance it has proven to me that the problem is at the ESXi or Cisco layer, as both the blade hosts and the rack hosts are having the same problem.

A massive investment in Rolls-Royce blade infrastructure (6 x c7000 enclosures), and we have to put in workarounds like the rack hosts to deal with the kind of flaky networking issues I have not seen in over 10 years (since before working with HP blades).

Now it is even worse, with VMware forgetting about release management, regression testing, and after-sales support.  Not good.  It certainly looks like this is a VMware/Cisco problem, however, and not an HP one.

MikeOD
Enthusiast

I've been doing some more testing and I might be on to something...

In our setup, each blade has four NICs: the two onboard Emulex NICs, and two add-on Intel NICs on the mezzanine card.

Each NIC goes to its "own" interconnect (I'm using interconnects 1, 2, 5, and 6).  Our data center has two "core" 6500 switches.  On the back of the chassis, the horizontal interconnects go to opposite switches.  I'm using LACP for the uplink from the interconnect to the 6500, but each LACP group is on its own interconnect; I don't have the LACP groups spanning the interconnects.

The end result is that in VMware I have this structure:

NIC0 - 6500A

NIC1 - 6500B

NIC2 - 6500A

NIC3 - 6500B

In VMware, I had the virtual switch set with all four NICs active, with load balancing set to "Route based on the originating virtual port ID".

After trying different active/standby configurations, what I have now is:

Active:

NIC0 - 6500 A

NIC1 - 6500 B

Standby

NIC2 - 6500 A

NIC3 - 6500 B

Before the configuration change, I was able to find some consistent, repeatable cases of VMs that couldn't ping certain other ones.

I've set this configuration on a couple of the blades, and after moving VMs over and giving it a day or so, the pings are working OK.
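For anyone who wants to make the same change from the command line instead of the vSphere Client, this is roughly the esxcli equivalent on 5.0 (the vSwitch name and vmnic numbers below are placeholders; ours just happen to line up with the NIC0-NIC3 list above):

esxcli network vswitch standard policy failover set -v vSwitch0 --active-uplinks vmnic0,vmnic1 --standby-uplinks vmnic2,vmnic3

esxcli network vswitch standard policy failover get -v vSwitch0

The "get" afterwards just confirms the active/standby lists and that load balancing is still "srcport" (originating virtual port ID).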

Could the problem have been that the original configuration had two active NICs going to the same 6500 switch?

On another note, I just found out about a week ago that the Cisco 6500 switches are about 3 years behind on their firmware updates.  They're doing the updates this weekend; maybe that will have some effect on this.

NV1
Contributor

Hi Mike

I agree there appears to be a problem using Originating Port ID or LBT teaming when both ports are connected to the same external switch.  Our configuration currently has that, as we are using one Cisco Nexus for the 10Gb connectivity.  Our redundancy is at the Virtual Connect modules, where we have standby 1Gb uplinks on each module should the 10Gb switch fail.

We have the Nexus switches on close to the latest firmware, so I don't think it is a Cisco firmware problem.  It seems more like something to do with the vSphere load balancing, without LACP, on the same Cisco switch.

I will ask our network guys to move one of the 10Gb uplinks to a different Nexus and then see if this resolves the problem. We have an engineering test enclosure I can do this with.

Will get back to you with the results.

reca42
Contributor

Exactly the same problem here: two VMs on two different ESX hosts, connected with a Nexus1K and a 2x10GbE port channel (LACP) to a Nexus 5548.

I tried a different Nexus 5548; the problem still exists.

If someone has an idea, I have a small lab to test various scenarios.

Rene Caspari

Network Engineering

MikeOD
Enthusiast

We haven't seen this issue for several months.  We've done several updates/reconfigurations, so I'm not sure what (if anything) fixed it.

What we did was:

-Ensured chassis and individual blades are at the latest firmware

-Configured Virtual Connect to use VLAN tunneling mode

-Limited the number of "active" NICs on each vSwitch; set redundant NICs as "standby"

-Network group updated firmware on the 6500s

Hopefully we'll never see the ARP issue again...

reca42
Contributor

I found an ESX cluster (one out of six) that doesn't have this problem.  So far, the only difference is the NICs: the clusters with the bug use Broadcom, while this one uses Intel.
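If anyone wants to check the same thing on their own hosts, the NIC model and driver for each vmnic can be listed from the ESXi shell (typically something like bnx2x for Broadcom 10GbE adapters and ixgbe for Intel, though the exact driver depends on the card):

esxcli network nic list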

We finally opened a call; let's see if this helps.

Kind regards,

Rene

patelf
Contributor

Hi all, was there ever a firm cause/resolution surrounding this issue?

emenius77
Contributor

We were seeing very similar behavior to that of the original poster, and we also have a very similar design: two blade enclosures with Virtual Connect modules and two uplink sets connected to a Cisco 6500 core.  We were seeing one VLAN in particular where traffic could pass ingress into the environment, but egress traffic would drop between the LOM and the Virtual Connect uplink.  The weird part was that other VLANs on the port-channel trunk were not affected, and even traffic on the problematic VLAN would occasionally work both ways.

After having Cisco review the 6500 configuration, a call to HP support uncovered the problem.  The LOMs for a blade are mapped to particular interconnect bays, which is important to note (this can be seen by looking at the server profile in Virtual Connect Manager).  When a network is mapped to a LOM, that network needs to be associated with a shared uplink set that is on the same interconnect bay as the LOM, if the uplink is in the same enclosure as the blade.  That way, if the VC module in that interconnect bay fails, it takes down the uplink and the LOM together.  In most cases the blade will also have a redundant LOM connected to the other interconnect bay, which should then map to an uplink on that interconnect bay, or possibly to an uplink on an interconnect bay on a different Virtual Connect module in a different enclosure (if you have your VC modules stacked like we have).

Once we sorted out the network-to-uplink and network-to-LOM mappings, the problem was resolved.  Traffic flowed in both directions on the problematic VLAN (which, by the way, had been added later, after the initial VC config was done).  Why the problem was sporadic and didn't affect all VLANs still blows my mind.

JonRudolphVZW
Contributor

Like most of you, we were baffled at first: DNS not resolving, can't ping.  We finally traced it down to ARP.  We are fortunate in that we have two "identical" sites with c7000 chassis, with Windows and RHEL on the blades as well as on VMs.  We have firewalls disabled and fresh OS installs and still get these issues.  This allowed us to test a variety of scenarios, and we discovered issues on both physical blades and VMs.  Our latest theory blames bad spanning-tree configs: one of our sites does not have these issues and one does.  The site which experiences these issues is configured with:

switchport

switchport mode trunk

switchport trunk native vlan XXX

switchport trunk allowed vlan AA,BBB-CCC,DDD,EEEE,FFFF,GGGG

spanning-tree port type edge

In our other site, the spanning-tree config is a little more in depth:

spanning-tree port type edge trunk

spanning-tree guard root

The "guard root" is just a little extra config, but the spanning-tree port type is huge: it makes spanning tree realize that this is a trunk and not an access port (so it knows to expect multiple MACs).

We are going to add the above configs whenever the network team can schedule the change, and I will update!
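For anyone wanting to see it in context, this is roughly what we will be asking them to push on the trunk ports facing the blade enclosures, in the Nexus-style syntax of the configs above (the interface number is just a placeholder):

interface Ethernet1/10
  spanning-tree port type edge trunk
  spanning-tree guard root

show spanning-tree interface ethernet 1/10 can then be used to check the port type afterwards.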

Jon

JonRudolphVZW
Contributor

I have excellent news.  For us, updating our Cisco switch configs to include the following command fixed our problem.

spanning-tree port type edge trunk

We have not seen any issues on the network since making this change. We have confirmed this across RHEL and 2008R2 as well as VMs and physical servers.

I know this is too late for some, but I hope this fixes the issue for other people.  If you're strictly a sysadmin and don't know what the above Cisco command means, go back to your network admins and tell them to fix the spanning-tree configs on your switchports.