VMware Cloud Community
MikeOD
Enthusiast

VM Ping/ARP issue

We are having a problem with some of our virtual machines intermittently losing communication with each other, and I’m at a loss as to the source.

We have about 250 VMs running on about 20 HP BL465C blades installed in two HP C7000 chassis, using the HP Virtual Connect interconnect modules.  The blade chassis are connected to our core Cisco 6500 switches.  The VMware hosts are at 5.0, and the guest VMs are a mix of Windows 2003, 2008, and 2008 R2.

What’s going on is that everything seems to be OK, but then out of nowhere, we will get communication failures between specific machines.    It looks like it’s an ARP issue.  Using PING, it works fine in one direction, but we get an “unreachable” error when going the other way, unless we ping from the target back to the source first.

For example, we have two servers, “A” and “B”.   Ping from “A” to “B” fails with “unreachable”. Ping from “B” to “A” works fine.   However, after pinging from “B” to “A”, we can ping from “A” to “B”, at least for a while until the entry falls out of the ARP cache.  If we go into server “A” and set a static ARP entry (“arp -s”) for server “B”, everything works OK.  Through all this, both server “A” and server “B” have no issues communicating with any other machines.
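The behaviour described above is consistent with how an ARP cache is normally populated: when “B” sends an ARP request for “A”, “A” records B's address mapping from that request, so “A” can reach “B” without resolving it until the entry expires. The following is only a toy model of that reasoning, not a diagnosis. The host names, addresses, the 30-second lifetime, and the `broken` set (pairs for which ARP resolution never completes) are all assumptions made for illustration.

```python
class Host:
    """A host with a simple ARP cache: ip -> (mac, expiry time)."""

    def __init__(self, name, ip, mac, ttl=30.0):
        self.name, self.ip, self.mac, self.ttl = name, ip, mac, ttl
        self.arp_cache = {}

    def learn(self, ip, mac, now):
        self.arp_cache[ip] = (mac, now + self.ttl)

    def lookup(self, ip, now):
        entry = self.arp_cache.get(ip)
        if entry is not None and entry[1] > now:
            return entry[0]
        return None


def ping(src, dst, now, broken):
    """Return True if src can ping dst at time `now`.

    `broken` holds (requester, target) name pairs whose ARP resolution
    never completes (hypothetical fault, cause unknown).
    """
    if src.lookup(dst.ip, now) is None:
        if (src.name, dst.name) in broken:
            return False  # "Destination host unreachable"
        # The target records the requester's mapping from the ARP request,
        # and the requester records the target's mapping from the reply.
        dst.learn(src.ip, src.mac, now)
        src.learn(dst.ip, dst.mac, now)
    # The echo reply path is not modelled separately in this sketch.
    return True


a = Host("A", "10.0.0.1", "00:50:56:00:00:0a")
b = Host("B", "10.0.0.2", "00:50:56:00:00:0b")
broken = {("A", "B")}

results = [
    ping(a, b, 0, broken),    # A cannot resolve B
    ping(b, a, 1, broken),    # B resolves A; A learns B as a side effect
    ping(a, b, 2, broken),    # works from the cached entry
    ping(a, b, 100, broken),  # entry expired, fails again
]
print(results)
```

A static ARP entry corresponds to an entry in `arp_cache` that never expires, which is why it hides the symptom without explaining why resolution fails in one direction.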

We tried using vMotion to move the servers to a different host, different blade chassis, etc.  Nothing worked except putting both VMs on the same host.  Then everything worked OK.  When we moved one of the servers to a different host, the problem came back.

It seems like either the ARP broadcast from the one server, or the reply back from the target, isn't making it through.  However, according to our networking group, there are no issues showing up on the Cisco switches.

Early this year, we had an issue where it happened on about a third of machines at the same time (it caused significant outages to production systems!).   It seemed like it was limited to machines on one chassis (but not all of the machines on that chassis).  At that time, we opened up tickets with VMWare and HP.  Neither found anything wrong with our configurations, but somewhere in the various server moves, configuration resets, etc., everything started working.

Since that time we’ve seen it very intermittently on a few machines, but then it seems to go away after a few days.

The issue we found today was that the server we’re using for the Microsoft WSUS server hadn’t been receiving updates from a couple of the member servers.  We could ping from the WSUS to the member server, but not back from the member server unless we put a static ARP entry in the member server.  The member servers are working fine otherwise, talking to other machines OK, etc.   They are a production environment, so we’re limited on the testing we can do.

Also, when it has happened, it seems to have always been between machines on the same subnet.  However, most of our servers are on the same subnet, so it might just be coincidence.

I’ve done a lot of internet searching, and have found some postings with similar issues, but haven’t found any solution.  I don’t know if it’s a VMWare issue, HP, Cisco, or Windows issue.

Any assistance would be appreciated.

Mike O'Donnell

54 Replies
deemee1988
Enthusiast

Hello MikeOD,


Could you please let me know whether the problem exists on all three of the operating systems mentioned above?

I had the same kind of problem with Windows Server 2008/R2, where the cause was Windows Firewall. The issue was fixed by enabling 'File and Printer Sharing (Echo Request - ICMPv4-In)'.

To do this, go to Start -> type 'Windows Firewall with Advanced Security' -> Inbound Rules -> enable 'File and Printer Sharing (Echo Request - ICMPv4-In)'.

Now try pinging between the systems in both directions.


Regards,

deemee1988

Nxt Gen Guy
MikeOD
Enthusiast

Thanks for the response.

This time it seems to only be 2008 R2, but that's the majority of our servers anyway. I don't recall if the incident earlier this year involved any other O/S, although I believe it did.

However, the firewall is disabled on our servers, since they're internal on our domain.

Also, it's not blocking all PINGs, just between certain servers, and then only intermittently. Each server can send and reply to other servers.

It looks like it's an issue with the ARP responses either not making it back from the target server, or being ignored by the sending server. I just can't figure what's causing it.

MikeOD
Enthusiast

Anybody??

SG1234
Enthusiast

Mike -- is the firmware on the VCM up to date? Is there anything in the OA or VCM logs?

Also, are there any standalone blades in the chassis? If so, can we isolate this problem to VMs only?

~Sai Garimella

Mangz
Contributor

We have exactly the same problem as you do, but we are running Debian/Ubuntu on our VMs. I tried migrating them to the same machine, and there it works with no problems. I found another person who had the same problem, but there was no solution there either (http://communities.vmware.com/thread/345288). So either put them on the same machine or add static NAT. Let me know if you find a solution.

MikeOD
Enthusiast

If I put both VM’s on the same host, they ping/ARP fine.

As for putting in a static MAC, that’s what we’ve done on some of the others, but another odd twist on this is that when I tried to add the entry with “arp -s”, I got an “Access Denied”. However, if I added a static entry for a different IP/MAC address, it worked OK.

I am running it from an admin command prompt, and I’m admin on the server, so I’m not sure what’s causing that. The server will be rebooted this weekend; I’m hoping that may fix it.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mike O’Donnell

Department of Technology

(614) 645-6353 (voice)

(614) 645-5444 (fax)

MikeOD
Enthusiast

Nothing is showing up in the logs in VC or OA.

VC is at 3.60, OA is at 3.56. Both of those are one release back, but the release notes for VC 3.70 and OA 3.60 don’t show anything fixed that would account for this. Besides, we can’t go to VC 3.70, since we’re using some of the 1/10Gb Ethernet modules; those aren’t supported past 3.60.

vbacon
Contributor

We just began seeing the same issue, physical or virtual, and have narrowed it down to any flavor of Windows 2008. Clearing the ARP cache on the affected servers only briefly fixes the issue. A lot of people report success with the hotfix available at http://support.microsoft.com/kb/2582281 though I just found it today and plan to test it in a maintenance window. According to the description, there are hotfix versions for Vista through Windows 2008 R2 SP1. In the meantime, adding a static ARP entry is a temporary workaround. If you cannot add static entries with the arp -s command, use the netsh int ipv4 add neighbor command instead, which works when arp -s does not.
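As a rough illustration of the netsh workaround, the sketch below builds the argument list for `netsh interface ipv4 add neighbors`, the long form of the command mentioned above. The interface name, IP address, and MAC address are placeholders, and the helper names `normalize_mac` and `netsh_add_neighbor` are hypothetical. The command is only constructed here, not executed.

```python
import ipaddress
import re


def normalize_mac(mac):
    """Return a MAC address as lowercase hyphen-separated pairs."""
    digits = re.sub(r"[-:.]", "", mac.strip())
    if not re.fullmatch(r"[0-9A-Fa-f]{12}", digits):
        raise ValueError("not a 48-bit MAC address: %r" % mac)
    digits = digits.lower()
    return "-".join(digits[i:i + 2] for i in range(0, 12, 2))


def netsh_add_neighbor(interface, ip, mac):
    """Build the argument list for a static neighbor (ARP) entry."""
    ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    return ["netsh", "interface", "ipv4", "add", "neighbors",
            interface, ip, normalize_mac(mac)]


# Placeholder values; substitute the real interface name and addresses.
cmd = netsh_add_neighbor("Local Area Connection", "10.0.0.5",
                         "00:50:56:AB:CD:EF")
print(" ".join(cmd))
```

On the affected server, the resulting list could be passed to `subprocess.run(cmd, check=True)` from an elevated prompt. Building the command as a list rather than a single string avoids quoting problems with interface names that contain spaces.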

Other info: When we run a wireshark trace on an affected server and filter for arp entries we see that no arp broadcasts are sent or received and they appear to be filtered by the tcp stack. When we clear the arp cache we see one arp broadcast get sent, and one directed reply. Then it stops working again until the arp cache is cleared again or a static arp entry is added.

Hope something here helps.

MikeOD
Enthusiast

I had come across this before, but from the support article it seemed like it only related to clustering, so I didn’t think it applied.

Also, you mention that the problem goes away briefly after clearing the ARP cache. Do you mean clearing it through netsh command?

vbacon
Contributor

The problem comes down to not processing gratuitous ARP, at least in our case. The hotfix is supposed to address that issue.

To clear the arp cache you can use a netsh command though arp -d * is easier. Keep in mind that it will delete static as well as dynamic entries, if you have added any. Do some testing, but that is what we have found so far. We're hoping the hotfix works as adding static arp entries all over the place is not desirable.

MikeOD
Enthusiast

Thanks for the info. We have an outage window this weekend, I’ll apply it on some of the ones we’ve been having the issue with. I’ll post a message next week with the results.

If this does fix it, that would be great. Of course, we have probably 100+ Windows 2008/2008 R2 servers, so it may take a while to apply it to them all.

MikeOD
Enthusiast

Unfortunately that hotfix didn't solve the issue.  I applied it on both the source and target servers that were having the ping/ARP issue, but I still get the same results, "Destination host unreachable" from one direction, but it works from the other.

It does sound like it's not specific to the server, since if I put both machines on the same host, the ping works OK.  It's got to be something with the networking portion in the vSwitch, the HP chassis Virtual Connect, or the external Cisco switches.

Some more information:

It seems like we're only having this issue between machines that are both on the same subnet.  However, it's not ALL machines on that subnet, just some.   I don't know if the issue is limited to just the one subnet, though, since most of our machines are on that subnet.  The ones I'm seeing with issues are production servers, so I can't move them both to a different subnet to see if that fixes it.

When we've seen this before, we've been able to go ahead and put in a static ARP entry using the ARP -s command.  However, on this one, when I try to do that, I get "The ARP entry addition failed: Access is denied".  I am running this from an Administrative command prompt, and I can add other ARP static entries, pointing to machines on other subnets, but any static I add in the same subnet gives me "access denied". 

vbacon
Contributor

I did see another thread here in the communities forum where there was mention of a similar symptom, but due to a Broadcom driver. I do not have the link handy but the thread talked about VMware having a request open with Broadcom to update the driver. Perhaps an updated driver is available.

Thank you for the feedback on your experience with the hotfix.

As to the access denied issue adding static arp entries, use the netsh command instead. That worked for us when arp -s gave us the access denied error when entering certain IP addresses.

MikeOD
Enthusiast

The hosts are not using the Broadcom drivers. They are HP 465G7 blades that have two HP NC551i (Emulex) ports and two Intel 82571 ports. The drivers are current. The NIC, blade and chassis firmware were updated about a month ago and are one version back from current, but the release notes on the current versions don’t reference any fixes that seem to apply to this issue.

It does seem like it’s in the HP Virtual Connect and/or external Cisco switches. If I eliminate those factors by putting both servers on the same blade host (that puts them in the same vSwitch), they work OK. It’s only when the ping/ARP has to leave the host that the “unreachable” shows up on some.

I’ve seen some references to a similar issue when you end up with duplicated MAC addresses. I’m not seeing that on those two machines, but could one have a duplicate MAC with some other VM somewhere? Pretty much all of our VM’s are using the automatically generated MAC addresses. With multiple VMWare hosts, do the VMWare hosts communicate with each other to ensure that they don’t assign duplicate MACs to VMs?

vbacon
Contributor

To my knowledge, the hosts do not communicate with each other to identify duplicate MACs when VMs are first created. I've seen duplicate MACs between VMs years ago, but that was back before vCenter even existed. I think the algorithm was changed to reduce the possibility, but it is still technically possible. The only way to know for sure is to dump the MAC address of every VM on every host. I am very much interested in knowing what you find.

MikeOD
Enthusiast

I dug a little more into it, and I don’t think it’s a duplicate MAC within VMWare. I did an export from PowerCLI of all the VM MAC addresses and didn’t see any duplication.
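A check like the one described above can be automated once the MAC addresses are exported. The sketch below assumes the export is CSV text with hypothetical column names `VM` and `MacAddress`; the actual PowerCLI export used in this thread is not shown, so the columns and the sample rows are illustrative only.

```python
import csv
import io
from collections import defaultdict


def find_duplicate_macs(csv_text):
    """Return {mac: [vm names]} for MACs that appear on more than one row."""
    owners = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Normalize case and separator so 00-50-56... and 00:50:56... compare equal.
        mac = row["MacAddress"].strip().lower().replace("-", ":")
        owners[mac].append(row["VM"].strip())
    return {mac: vms for mac, vms in owners.items() if len(vms) > 1}


# Made-up sample data, not taken from the environment in this thread.
sample = """VM,MacAddress
web01,00:50:56:aa:00:01
web02,00:50:56:aa:00:02
db01,00-50-56-AA-00-01
"""

duplicates = find_duplicate_macs(sample)
print(duplicates)
```

This only covers MAC addresses known to the export. A duplicate held by a physical machine or a VM managed elsewhere would not show up in it.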

I did some more checking on the server using the netsh and arp commands. On the machine that gets “unreachable”, the netsh “neighbors” list shows the target with a MAC of all zeros and type “unreachable”. As I said earlier, I get “access denied” when I try to create a static ARP entry with arp -s, but if I use the “add neighbors” command in netsh, it DOES let me add the MAC address of the target machine, and everything works OK.

If I do a “delete neighbors” or a “delete arpcache” command in NetSH, it removes all the entries EXCEPT the “unreachable” ones, and shows the addresses as zeros again.

Is there a way to remove the “unreachable” entries in Netsh? I’m thinking that the issue might be that somehow certain IP’s have been marked as “unreachable” by the server, and then it won’t remove them, even if they are reachable later.

MikeOD
Enthusiast

I still haven't found a solution to this. I have done some more testing and research. I tried the Microsoft hotfix related to a "gratuitous arp" issue in Windows 2008. However, that didn't resolve it.

I ran a script on all of our machines, doing a "netsh interface ip show neighbors" and searching for anything that had an "unreachable" entry.
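The script itself is not included in the thread. As a minimal sketch of the filtering step, the code below parses text in the usual three-column layout of "netsh interface ip show neighbors" (internet address, physical address, type) and returns the entries whose type is "Unreachable". The sample text and the function name `unreachable_neighbors` are illustrative assumptions, not output captured from the servers discussed here.

```python
import re

# One data row: IPv4 address, hyphen-separated MAC, then the entry type.
NEIGHBOR_LINE = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"([0-9A-Fa-f]{2}(?:-[0-9A-Fa-f]{2}){5})\s+"
    r"(\S.*?)\s*$"
)


def unreachable_neighbors(output):
    """Return (ip, mac) pairs for rows whose type is 'Unreachable'."""
    found = []
    for line in output.splitlines():
        match = NEIGHBOR_LINE.match(line)
        if match and match.group(3).lower() == "unreachable":
            found.append((match.group(1), match.group(2)))
    return found


# Illustrative sample only.
SAMPLE = """
Interface 11: Local Area Connection

Internet Address                              Physical Address   Type
--------------------------------------------  -----------------  -----------
10.0.0.1                                      00-11-22-33-44-55  Reachable
10.0.0.5                                      00-00-00-00-00-00  Unreachable
10.0.0.9                                      00-11-22-33-44-66  Stale
"""

hits = unreachable_neighbors(SAMPLE)
print(hits)
```

Running the real command on each server, for example through `subprocess.run` with `capture_output=True`, and feeding its standard output to `unreachable_neighbors` would reproduce the kind of survey described above.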

The issue did show up on multiple subnets, but in each source/target pair, the servers were in the same subnet, with no router.

The ones with "unreachable" all had at least one of the servers in the VMWare environment, passing through the blade chassis Virtual Connect.

If the two VM's were on the same host, the "unreachable" issue went away. Moving them back to different machines, and the "unreachable" came back.

There were some repeats in the "targets", but most other servers could ping those target servers OK.

We are using VLANs, and having Virtual Connect separate the networks before sending them to the blades. I recall seeing something about a Virtual Connect issue where, when the VC environment stripped the VLAN tag from an ARP packet, the resulting packet would be too small and would be dropped by other networking equipment. However, I thought that was fixed in a later firmware release. Also, wouldn't it affect all ARPs going through Virtual Connect, and not just some?
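The size arithmetic behind that recollection can be worked through with standard Ethernet figures: a 14-byte header, a 28-byte ARP payload for IPv4, a 4-byte frame check sequence, a 4-byte 802.1Q tag, and a 64-byte minimum frame. The sketch assumes the sender pads a tagged frame only up to 64 bytes and that the tag is then removed without re-padding. Whether any Virtual Connect firmware actually behaves this way is not confirmed by anything in this thread.

```python
ETH_HEADER = 14   # destination MAC + source MAC + EtherType
ARP_PAYLOAD = 28  # ARP request/reply for IPv4 over Ethernet
FCS = 4           # frame check sequence
VLAN_TAG = 4      # 802.1Q tag
MIN_FRAME = 64    # minimum Ethernet frame size, including FCS


def padded_frame_size(tagged):
    """Size on the wire of an ARP frame after padding to the minimum."""
    raw = ETH_HEADER + (VLAN_TAG if tagged else 0) + ARP_PAYLOAD + FCS
    return max(raw, MIN_FRAME)


def size_after_naive_untag():
    """Size if the tag is stripped from a minimum-size tagged frame
    without adding padding back (assumed behaviour, for illustration)."""
    return padded_frame_size(tagged=True) - VLAN_TAG


untagged = padded_frame_size(tagged=False)
tagged = padded_frame_size(tagged=True)
stripped = size_after_naive_untag()
print(untagged, tagged, stripped, stripped < MIN_FRAME)
```

Under these assumptions the stripped frame is 60 bytes, below the 64-byte minimum, so a receiving device could discard it as a runt. As the question above points out, a fault of this kind would be expected to affect every ARP frame on a tagged network, so on its own it does not explain why only some pairs of servers fail.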

vbacon
Contributor

I cannot speak to HP VirtualConnect as we do not use it, but I can confirm that the garp hotfix does not fix our issue either, at least on the first server we put it on. Only some Windows 2008 servers are affected, and they are all plugged into the same pair of switches and firewalls, but across three different subnets and firewall interfaces (So far. Two more occurrences on a third subnet today).

Our pain stems more from traffic suddenly not routing through the firewall due to missing arp entries there - that are normally learned dynamically - so our temporary patch is to add static arp entries on the ASA firewall which fixes the problem. This is not a viable long term solution for us though.

There may be more arp entries missing for servers on the same subnet, but we do not have a lot of server to server traffic at layer two so I'm not sure.

The big question is why this seemingly came up out of nowhere, with no change that we can find. We're still looking.

vbacon
Contributor

I am curious as to whether you have found a solution to this problem. This problem is now seen on a Windows 2008 server on another vDS, and I just opened a support case to help me figure out whether something between the VM and the physical network is an issue. I think it is within the OS itself though as traces today show that the OS is not making ARP broadcasts for anything other than the default gateway.
