Re: Can someone please explain why RARP and not GA...

sgadsby · ‎02-28-2008

ESX sends RARPs at various times, particularly when a device is VMotioned to another host.

Whilst this successfully updates forwarding tables, it does not update ARP tables. As a result, switches lose contact with a VMotioned VM until their ARP cache is updated. This only happens when a Layer 3 IP packet egresses the VM into the switch for whatever reason.

If a Gratuitous ARP were sent, wouldn't the packet still reach all L2 switches indicating that they should update their forwarding tables? And in addition to that it would update the ARP tables of any switches in the L2 broadcast realm.

If the problem is that the ESX server is the one sending the RARPs on behalf of the VM, then surely VM Tools installed inside the VM could be configured to immediately send GARPs on a VMotion event?

Is there some other technical reason why RARPs and not GARPs?

There have been various threads on this topic over the last few years but no definitive answers that I can see. I guess everyone doing HA is relying on one of two things for HA to work:

a) a packet egresses the switch port that the VM is connected to;

b) the switch successfully picks up the new ARP by broadcast on first IP request.

Frankly I'm not sure why b) doesn't consistently work for me; I had thought that it was a problem with the Broadcom NIC drivers in ESX because there have been some recent patches, however I still see the problem. I am running latest Broadcom firmware for BCM5708S, and ESX is latest 3.5.

Rgds.

--

I need to read "Xen and the art of VMware sales"


RenaudL · ‎02-29-2008

ESX sends RARPs at various times, particularly when a device is VMotioned to another host.

Yes.

Whilst this successfully updates forwarding tables, it does not update ARP tables.

That's all we want: update the forwarding tables of the physical switches. Why would we need to update the ARP tables? The <MAC:IP> pair of a VM doesn't change after it is VMotioned...

As a result, switches lose contact with a VMotioned VM until their ARP cache is updated. This only happens when a Layer 3 IP packet egresses the VM into the switch for whatever reason.

AFAIK, switches don't have ARP caches. They are L2 devices, they don't care (and they shouldn't) about L3+.

If a Gratuitous ARP were sent, wouldn't the packet still reach all L2 switches indicating that they should update their forwarding tables? And in addition to that it would update the ARP tables of any switches in the L2 broadcast realm.

A gratuitous RARP (which is a broadcast packet) also reaches all the physical switches.

If the problem is that the ESX server is the one sending the RARPs on behalf of the VM, then surely VM Tools installed inside the VM could be configured to immediately send GARPs on a VMotion event?

Again, we don't need GARP requests. Secondly, we don't want to rely on the VMware Tools being installed on the guest: VMotion should work with any guest.
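To make the difference between the two packets concrete, here is a minimal sketch that builds both frames as raw bytes. The field layout follows the standard ARP/RARP packet format; the exact field values ESX puts in its RARP (opcode 3, zeroed IP fields) are an assumption for illustration, and the function names are hypothetical. The point is simply that the RARP carries the VM's MAC but no IP, whereas a gratuitous ARP carries both.

```python
import socket
import struct

BROADCAST = b"\xff" * 6
ETHERTYPE_ARP = 0x0806
ETHERTYPE_RARP = 0x8035


def mac_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_frame(ethertype, opcode, sha, spa, tha, tpa) -> bytes:
    """Broadcast Ethernet header followed by a 28-byte ARP/RARP body."""
    eth = BROADCAST + sha + struct.pack("!H", ethertype)
    # htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4, then opcode
    body = struct.pack("!HHBBH", 1, 0x0800, 6, 4, opcode)
    return eth + body + sha + spa + tha + tpa


def rarp_announce(mac: str) -> bytes:
    """Assumed shape of the RARP sent on behalf of a VM: MAC only, no IP."""
    m = mac_bytes(mac)
    zero_ip = b"\x00" * 4
    return build_frame(ETHERTYPE_RARP, 3, m, zero_ip, m, zero_ip)


def gratuitous_arp(mac: str, ip: str) -> bytes:
    """Gratuitous ARP request: sender IP and target IP are both the VM's IP."""
    m = mac_bytes(mac)
    addr = socket.inet_aton(ip)
    return build_frame(ETHERTYPE_ARP, 1, m, addr, b"\x00" * 6, addr)
```

Both frames are broadcast with the VM's MAC as the Ethernet source, so either one lets a switch relearn the MAC on the new port. Only the gratuitous ARP needs the sender to know the VM's IP address.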

oreeh · ‎02-29-2008

AFAIK, switches don't have ARP caches. They are L2 devices, they don't care (and they shouldn't) about L3+.

Unfortunately modern switches know about L3+ and in fact have ARP caches and most of the time these ARP caches use problematic expire timings.

sgadsby · ‎03-02-2008

Thanks for bringing that to my attention -- yes, the issues will only be apparent if the switches are also routing (L3) switches.

ARP tables on switches are not just mac:ip but also include a port reference as per oreeh.

With VMotion the mac:ip doesn't change but the port does. If the port is on the same switch then sometimes arp timeouts and failure to invalidate the old arp entry result in loss of comms.

I realise that RARPs go everywhere as well, but the upshot is that they do not do all the work, whereas a GARP would do all the work.

What if you sent RARPs as normal and made GARPs an option when VM Tools is installed?

Or perhaps ESX could learn all the IP addresses used on a particular VM and GARP on its behalf?

In my case the switch is misbehaving by not invalidating the ARP entry after it receives a RARP, but ESX should be able to prevent this kind of confusion because VMotion is somewhat of a special switching scenario.

--

I need to read "Xen and the art of VMware sales"


RenaudL · ‎03-02-2008

What is the make/model of the switches you're using? Because that's exactly what our RARPs should do: update the <MAC:Port> pairs in the forwarding tables of each physical switch in the path. It kind of puzzles me that you have switches which ignore them.

sgadsby · ‎03-02-2008

Renaud, RARPs do successfully update MAC:Port; however, they do not update IP:MAC:Port! This is the problem.

The ARP table that contains the IP:MAC:Port mapping is different from the Switch Forwarding table that contains the MAC:Port mapping.

The forwarding table is successfully updated by the RARP, but the ARP table is not.
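The two-table behaviour described above can be sketched as a toy model. This is not the Nortel firmware, just an illustration under the assumption that an L3 switch keeps a MAC:Port forwarding table and a separate IP:MAC:Port ARP table, and that only frames carrying a sender IP touch the latter. The class name, addresses, and port numbers are made up.

```python
class L3Switch:
    """Toy model: a MAC forwarding table plus a separate ARP table."""

    def __init__(self):
        self.forwarding = {}  # MAC -> port
        self.arp = {}         # IP -> (MAC, port)

    def receive(self, port, src_mac, sender_ip=None):
        # Any frame teaches the switch where the source MAC lives.
        self.forwarding[src_mac] = port
        # Only a frame that carries a sender IP (ARP/GARP) updates the ARP table.
        if sender_ip is not None:
            self.arp[sender_ip] = (src_mac, port)


sw = L3Switch()
vm_mac, vm_ip = "00:50:56:aa:bb:cc", "10.0.0.5"

sw.receive(port=1, src_mac=vm_mac, sender_ip=vm_ip)  # VM first seen on port 1
sw.receive(port=7, src_mac=vm_mac)                   # RARP after VMotion, no IP
after_rarp = (sw.forwarding[vm_mac], sw.arp[vm_ip][1])  # (7, 1): ARP entry is stale

sw.receive(port=7, src_mac=vm_mac, sender_ip=vm_ip)  # a GARP would carry the IP
after_garp = (sw.forwarding[vm_mac], sw.arp[vm_ip][1])  # (7, 7): both tables agree
```

In this model the stale port in the ARP entry persists until something carrying the VM's IP arrives from the new port or the entry times out.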

The switch should really invalidate or update the ARP entry that refers to the MAC address in the RARP if the port is different from the one in the ARP table. In my case it doesn't, which is probably an oversight in the switch firmware; however, I still argue that ESX should be intelligent enough to work around this kind of limitation by sending a GARP instead (or as well).

My switches are Nortel Layer 3 Gigabit Switch Modules in an IBM BladeCenter H.

Imagine if you had your two switches connected to two routers with no cross-links like this:

ESX1 -- Switch -- Router A -- Router B -- Switch -- ESX2

In this case, when a machine is VMotioned from ESX1 to ESX2, then the RARP tells the switches to update their forwarding tables and this works successfully. However the Routers are Layer 3 devices and assuming they ignore the RARPs, they now do not know how to route to the virtual machine, and comms are lost.

RARPs are not enough! They only work if a) the Router successfully updates its arp cache (Layer 3); or b) there is another link between the L2 Switches.

I suspect L3 Routers are meant to update their arp caches, however I also suspect that not all of them do, and therefore it would help if ESX sent GARPs instead of (or as well as) RARPs.

Does this make sense?

--

I need to read "Xen and the art of VMware sales"


RenaudL · ‎03-02-2008

Simon,

Thanks for the nice explanation, I'm beginning to understand the issue now: your router has a cache to quickly associate an IP to an output port, so that it doesn't have to do an ARP cache lookup on each packet of a conversation. It only fetches the <MAC:Port> of a peer when the corresponding entry in the cache expires (or is missing), which is the moment where it actually receives the correct, updated routing information.

At least that's what I'm thinking is happening, please tell me if I'm mistaken.

We already agree that this is primarily an issue with the router: when receiving any kind of ARP / RARP packet, it should take care of immediately invalidating all the cache entries related to the information contained in the packet.
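As a rough sketch of the invalidation being suggested here (an assumption about how a device could behave, not a description of any vendor's firmware): whenever a frame shows a known MAC on a new port, drop every cached ARP entry that still points that MAC at the old port, so the next IP packet triggers a fresh ARP lookup. The function name and table shapes are hypothetical.

```python
def learn(forwarding, arp_cache, port, src_mac):
    """Update MAC -> port and invalidate ARP entries that contradict it.

    forwarding: dict mapping MAC -> port
    arp_cache:  dict mapping IP -> (MAC, port)
    Returns the list of IPs whose cached entries were dropped.
    """
    forwarding[src_mac] = port
    # Collect first, then delete, so the dict is not modified while iterating.
    stale = [
        ip
        for ip, (mac, cached_port) in arp_cache.items()
        if mac == src_mac and cached_port != port
    ]
    for ip in stale:
        del arp_cache[ip]
    return stale
```

With this rule a RARP alone would be enough: it does not supply the IP, but it does reveal that the cached IP:MAC:Port entry can no longer be trusted.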

However in your case, our RARPs are more or less ignored by your physical hardware. Hmmm, I see no straightforward way to fix this.

Let me grab some information, somebody may have already thought about that.

RenaudL · ‎03-03-2008

I got information from somebody who has a better knowledge of the internals of switches, here's what I can tell you:

First of all, the configuration you describe is actually unsupported: we want all the ESX hosts within the cluster to be located on the same L2 broadcast domain, otherwise things may break like you described.

Secondly, yes, ARPs are the proper solution and we may generate them one day in order to support a broader range of network topologies. Getting the IP of the VM is the real problem here. Believe me, it's far trickier than it sounds and using the VMware Tools is not a popular solution (for example: which virtual interface should we pick and what is the naming convention of the guest OS?).

Anyway we're aware of this issue and ESX may one day solve it.

sgadsby · ‎03-03-2008

Thanks for following up, Renaud,

I didn't say it explicitly but of course in my example it IS all in one broadcast domain. The routers would be configured with a VLAN across both ports, such that RARPs and the like successfully broadcast all the way from one ESX to the other. Broadcast domain is not the issue in my case.

And glad to hear it re ARPs. Undoubtedly there is a deal of trickiness in capturing addresses for VMs, however as an interim I still think an option in VM Tools is a very simple thing. I assume it would be easy for ESX to notify VM Tools of a VMotion event (if it indeed doesn't already), and it should be easy for the software to generate a Gratuitous ARP on demand. I understand it's not the best long-term solution, but it would work correctly in maybe 90% of cases, and alleviates the immediate pain whilst a longer term solution is being generated.

Thanks for taking the time to chew through this issue; I hope there is a solution soon.

--

I need to read "Xen and the art of VMware sales"


Can someone please explain why RARP and not GARP?