The setup: I have two VMs on the same subnet. Both VMs are on the same ESX host. The VMs are connected to the same virtual switch. The switch has one virtual NIC which is uplinked to a physical DMZ switch. That switch has a Cisco PIX firewall attached.
The problem: When pinging between the two VMs, connectivity is sporadic. Sometimes the pings get replies, sometimes they don't. There's no apparent pattern to when this issue occurs.
Troubleshooting: Long story short, removing the NIC uplink to the physical switch allows the ping to run continuously with 0% packet loss. Reconnect the virtual NIC to the virtual switch and the pings become sporadic again.
Ultimate solution: After working with a level 3 engineer for a few hours, we discovered that two devices were responding to the ARP requests for the pings. One was the VM, the other was an unknown MAC address which we later determined to be that of the PIX firewall. After disabling Proxy ARP on the PIX, we no longer had this issue.
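For anyone who wants to check a capture for the same symptom, here is a minimal sketch that flags any IP address claimed by more than one MAC in a list of ARP replies. The addresses and the `find_conflicting_arp_replies` helper are made up for illustration; they are not taken from our capture.

```python
from collections import defaultdict


def find_conflicting_arp_replies(replies):
    """Return {ip: sorted MACs} for every IP claimed by more than one MAC.

    `replies` is an iterable of (sender_ip, sender_mac) pairs, for example
    pulled by hand from the ARP replies in a packet capture.
    """
    macs_by_ip = defaultdict(set)
    for sender_ip, sender_mac in replies:
        macs_by_ip[sender_ip].add(sender_mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical ARP replies seen by VM1 (all addresses are placeholders).
observed = [
    ("192.0.2.20", "00:50:56:aa:bb:02"),  # reply from the real VM2
    ("192.0.2.20", "00:11:22:33:44:55"),  # second reply, e.g. a proxy ARP device
    ("192.0.2.10", "00:50:56:aa:bb:01"),  # unrelated host, single MAC
]

conflicts = find_conflicting_arp_replies(observed)
print(conflicts)
```

Any IP that shows up in the result has two or more responders, which is what we eventually saw for the destination VM.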
Question: So here's what I don't understand. VMware claims that when you have VMs on the same subnet and on the same host, traffic doesn't go out over the NIC uplink, which clearly isn't true. The ARP broadcasts from one VM are received by the other VM, and they also go out over the NIC uplink, through the physical switch, and are processed by the PIX. The destination VM should get that ARP request and reply with a broadcast. But for some reason, that reply isn't being seen by the PIX (at least not consistently), so it replies on behalf of that device. The originating VM will get either one or both ARP replies, with two different MAC addresses. We saw this in the Wireshark capture we did on the VM. Can anyone explain this behavior? The PIX should clearly see the ARP replies instead of replying on its own.
Firstly, I'm not sure what you mean by "When pinging between the two VMs, connectivity is sporadic. Sometimes the pings receive reply, sometimes they don't." Do you mean the ping loses some packets, or that the ping gets no replies at all and then starts getting replies again after a while?
Secondly, about what you said: "when you have VMs on the same subnet and on the same host, then traffic doesn't go out over the NIC uplink, which clearly isn't true." Here is my thought: if the traffic is unicast, it does not go out over the NIC uplink; if the traffic is broadcast, it does go out over the NIC uplink.
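To illustrate the unicast versus broadcast distinction, here is a rough sketch of the forwarding decision. It is a simplified model I made up for this post, not VMware's actual implementation; the `forward` function, the port names, and the MAC addresses are all hypothetical.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"
UPLINK = "uplink"


def forward(dst_mac, ingress_port, local_ports):
    """Return the set of ports a frame is sent out of (simplified model).

    `local_ports` maps the MAC of each VM attached to the virtual switch
    to its port name. Assumption: the switch knows its local MACs.
    """
    dst = dst_mac.lower()
    if dst == BROADCAST:
        # Broadcast: every other local port, plus the uplink.
        out = {port for port in local_ports.values() if port != ingress_port}
        if ingress_port != UPLINK:
            out.add(UPLINK)
        return out
    if dst in local_ports:
        # Unicast to a local VM: delivered only to that VM's port.
        port = local_ports[dst]
        return set() if port == ingress_port else {port}
    # Unicast to a MAC that is not local: send it to the uplink.
    return set() if ingress_port == UPLINK else {UPLINK}


ports = {"00:50:56:aa:bb:01": "vm1", "00:50:56:aa:bb:02": "vm2"}

# VM1's ARP request is a broadcast, so it also leaves via the uplink.
print(forward(BROADCAST, "vm1", ports))
# VM2's ARP reply is a unicast to VM1, so it stays inside the host.
print(forward("00:50:56:aa:bb:01", "vm2", ports))
```

In this model only the broadcast ever reaches the physical switch, which matches what VMware says about unicast traffic between VMs on the same host.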
Suppose you ping VM2 from VM1, where VM1 and VM2 are on the same subnet and the same host, and VM1's ARP cache is empty. VM1 will send an ARP request as a broadcast, and when VM2 receives it, VM2 will answer with a unicast ARP reply, not a broadcast (you can confirm this by capturing the packets with Wireshark). I think this is where you made a mistake. Because VM2 sends its ARP reply as a unicast, it won't go out over the NIC uplink (as VMware said), so the PIX cannot see it, and the PIX sends an ARP reply of its own. I think this is why your capture shows replies to the ARP request from two different MAC addresses.
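As a small illustration of the request and reply framing, here is a sketch that builds both Ethernet frames with the standard ARP layout and reports whether each one is addressed to the broadcast MAC. The helper names and all addresses are placeholders I chose for the example.

```python
import socket
import struct

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"
ZERO_MAC = "00:00:00:00:00:00"


def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_arp_frame(oper, eth_dst, sender_mac, sender_ip, target_mac, target_ip):
    """Build an Ethernet frame carrying an IPv4-over-Ethernet ARP packet."""
    eth = mac_bytes(eth_dst) + mac_bytes(sender_mac) + struct.pack("!H", 0x0806)
    # htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4, then the opcode.
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, oper)
    arp += mac_bytes(sender_mac) + socket.inet_aton(sender_ip)
    arp += mac_bytes(target_mac) + socket.inet_aton(target_ip)
    return eth + arp


def describe(frame):
    """Return (opcode name, 'broadcast' or 'unicast') for an ARP frame."""
    oper = struct.unpack("!H", frame[20:22])[0]  # opcode sits at bytes 20-21
    kind = {1: "request", 2: "reply"}.get(oper, "other")
    cast = "broadcast" if frame[0:6] == mac_bytes(BROADCAST_MAC) else "unicast"
    return kind, cast


vm1_mac, vm1_ip = "00:50:56:aa:bb:01", "192.0.2.10"  # placeholder addresses
vm2_mac, vm2_ip = "00:50:56:aa:bb:02", "192.0.2.20"

# VM1 asks "who has 192.0.2.20?" and sends it to everyone.
request = build_arp_frame(1, BROADCAST_MAC, vm1_mac, vm1_ip, ZERO_MAC, vm2_ip)
# VM2 answers directly to VM1's MAC.
reply = build_arp_frame(2, vm1_mac, vm2_mac, vm2_ip, vm1_mac, vm1_ip)

print(describe(request))
print(describe(reply))
```

The only frame with the broadcast destination is the request; the reply is addressed to VM1 alone.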
If you used two physical machines instead of two VMs, I think you would have the same issue.
So I think the key point is that the ARP reply is a unicast, not a broadcast. If I have something wrong, please tell me.
1) What I mean is that there are replies, then there are no replies. It was sporadic.
2) Agreed. Broadcast traffic goes out over the uplink. My point was that VMware's statements were misleading.
3) I agree with you on how ARP is supposed to work. VM1 sends a broadcast. VM2 replies with a unicast. But the issue is that Proxy ARP was answering on behalf of VM2 when it shouldn't have. Therefore, VM1 gets the wrong MAC address and is no longer actually talking to VM2. Proxy ARP should only have answered if it had a route to VM2 in its table. There seems to be something weird with how the packets are formed in a VM.