1 Reply Latest reply on Jun 18, 2018 11:17 AM by kalto

    Problems running nested ESXi on top of NSX

    kalto Lurker



  I am running a lab made of 3 ESXi hosts with vSAN and NSX. Everything works fine; I'm running a whole bunch of VMs in there for testing purposes and all is well.


  Recently, I ran into a weird problem when trying to nest three ESXi hosts (I'll call them ESX1, ESX2 and ESX3) on an NSX logical switch that is connected to an edge gateway. Forged transmits and MAC address changes are enabled on the dvSwitch port group created for the logical switch. While trying to set them up, I realized that I am losing a whole bunch of packets when trying to communicate with these VMs from the outside (layer 3) and between them (layer 2). I lose up to 50-70% of the packets, but not all.


      And the whole thing is behaving weirdly:


  - All the VMs were supposed to run on separate hosts (one on each physical host) with an anti-affinity DRS rule

  - The VMs are reachable through an edge gateway and a physical router/firewall (Fortigate). Basically, from my workstation to one of these VMs, traffic goes like this: Workstation --> Fortigate --> Transit subnet --> Edge Gateway --> VMs. All of my labs are set up like that, without any problems so far. Even though there is a firewall, there isn't any filtering between my workstation and the labs

  - The edge gateway is running on another management cluster which is part of the same vCenter and NSX installation. All of my edges (4 of them) for all of my labs run that way from that cluster and everything is running fine. As far as I know, there is no problem with my NSX installation.

  - When pinging the nested ESX hosts from the outside (my workstation, L3), ESX3 always has only 5-10% packet loss while the other two are around 50-70%. That VM always has the same level of packet loss, no matter which host it is running on.

  - When pinging between them (L2), there is always at least 50-70% packet loss, no matter where the VMs are located.

  - If you place another VM (ESX1 or ESX2) on the same host as ESX3, the one with only 5-10% packet loss, that VM (ESX1 or ESX2) automatically starts losing only 5-10% of its packets, same as ESX3

  - If you place all the VMs on the same host, you get absolutely no packet loss when pinging them from the outside (L3), no matter which of the three hosts is running them. But you still get around 50-70% packet loss between them (L2)

  - If I also ping from ESX1 or ESX2 to my workstation while pinging from the workstation to the ESX, I get much lower packet loss (5-10%), no matter where the VM is located or whether it is on the same host as ESX3. Packet loss goes back to 50-70% as soon as I stop the ping from the ESX to the workstation.
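  In case anyone wants to reproduce the numbers above, this is roughly how I gathered them: ping each nested host from the workstation and read the loss percentage from ping's summary line. A minimal sketch; the IP addresses are hypothetical placeholders and the summary format is the usual Linux/macOS "X% packet loss" line:

```python
# Measure per-VM packet loss by parsing ping's summary line.
import re
import subprocess

def parse_loss(summary: str) -> float:
    """Extract the loss percentage from a ping summary; 100.0 if absent."""
    m = re.search(r"(\d+(?:\.\d+)?)% packet loss", summary)
    return float(m.group(1)) if m else 100.0

def ping_loss(host: str, count: int = 100) -> float:
    """Ping `host` `count` times and return the measured loss percentage."""
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    return parse_loss(out)

# e.g. for ip in ("10.0.0.11", "10.0.0.12", "10.0.0.13"):
#          print(ip, ping_loss(ip))
```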


      I tried a lot of things to pinpoint the source of the problem:


      - Tried some captures using the available NSX tools; didn't get much, except that when the ping doesn't work at L2, the VM is actually still searching for the MAC address of the neighbour host it is trying to ping

      - Tried removing the IP discovery functionality from the logical switch, thinking it might have caused an ARP problem with forged transmits. Didn't change anything

      - Moved the VMs around and tried every possible placement combination between the hosts to arrive at the conclusions above
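  For reference, the placement sweep in that last point is small enough to enumerate exhaustively: with three VMs and three physical hosts there are only 3^3 = 27 combinations. A quick sketch (host names are placeholders, not my lab's real names):

```python
# Enumerate every way to place the three nested ESXi VMs on the
# three physical hosts (placeholder names).
import itertools

vms = ["ESX1", "ESX2", "ESX3"]
hosts = ["host-a", "host-b", "host-c"]

placements = [dict(zip(vms, combo))
              for combo in itertools.product(hosts, repeat=len(vms))]

print(len(placements))  # 27
print(placements[0])    # {'ESX1': 'host-a', 'ESX2': 'host-a', 'ESX3': 'host-a'}
```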


      I also found two posts describing relatively similar problems with nested ESXi on top of NSX:


      From the dept of the knowledge arcane: NSX-v with nested ESXi | Telecom Occasionally


      NSX and nested ESXi environments: caveats and layer-2 troubleshooting – vLenzker


      The problem is, both of these articles talk about running nested ESXi AND nested NSX on top of NSX, which causes a problem with the physical ESXi hosts dropping packets from the VXLAN. The point is, I am not running NSX inside the nested ESXi hosts, only plain vSwitches.


      So, I'm kind of out of ideas about where to look next, and I am also running out of possible correlations.


      Anyone with a suggestion?


      Thank you!

        • 1. Re: Problems running nested ESXi on top of NSX
          kalto Lurker

          Ok, after some more troubleshooting, I can pinpoint something that looks like an ARP ageing problem on the logical switch.


          Here is what I discovered:


          - I added a simple Windows 7 VM to that logical switch and reproduced exactly the same problem with that VM that I am having with the ESX VMs

          - I installed Wireshark on that VM and then ran the same tests (ping L2 and L3) I did with the ESX hosts, while capturing traffic from inside the VM


          As with the other VMs, I am losing around 50-70% of packets. When I start pinging at L2, I clearly see the ARP request go out on the network interface, it receives an answer, and things go smoothly for 5-10 pings. After that, the target stops responding and all I see going out of the interface is the echo request, with no echo reply coming back. Then, at some point, the ARP cache entry for the ESX VM I am pinging expires, I see another ARP request being sent, and suddenly the ESX VM starts responding again. This goes on and on.


          When the pings stop getting replies, I can see that the ARP entry for the ESX VM is still in the cache on the Windows 7 VM. To further prove my point: as soon as the ESX VM stops responding, if I clear the ARP cache of the Windows 7 VM, it automatically sends another ARP request and the ESX VM starts responding again.
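          For anyone reproducing this, the check is easy to script: grab the `arp -a` output on the Windows VM while the ping is failing and confirm the target's entry is still cached. A small parser sketch; the output layout is assumed from Windows 7's `arp -a` and the addresses below are hypothetical:

```python
# Check whether a given IP still has a (possibly stale) entry in
# `arp -a` style output from the Windows 7 VM.
import re

def arp_entry(arp_output: str, ip: str):
    """Return the cached MAC for `ip`, or None if no entry is cached."""
    for line in arp_output.splitlines():
        m = re.match(r"\s*(\d+\.\d+\.\d+\.\d+)\s+([0-9a-fA-F-]{17})", line)
        if m and m.group(1) == ip:
            return m.group(2)
    return None

sample = """Interface: 192.168.10.50 --- 0xb
  Internet Address      Physical Address      Type
  192.168.10.11         00-50-56-aa-bb-cc     dynamic
"""
# A still-cached entry while pings fail is what pointed at the
# logical switch, not the guest, as the culprit.
```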


          Also, while capturing traffic on the Windows 7 VM, I realized that the ESX3 VM (the one with only 5-10% packet loss instead of 50-70%) is continuously sending ARP requests for an IP address that doesn't exist, probably a misconfiguration on that host. This has the effect of keeping its MAC address active on the logical switch, which would be consistent with my hypothesis.
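          If that hypothesis holds, ESX3's stray ARP requests are effectively an accidental keepalive. A hedged workaround sketch along the same lines, generating periodic traffic from each nested host's address often enough to keep its MAC entry active (the addresses and interval are assumptions, not values from my lab, and this is a band-aid rather than a fix):

```python
# Keep nested hosts' MAC addresses "active" on the logical switch
# by generating periodic traffic, mimicking what ESX3 does by accident.
import subprocess
import time

NESTED_HOSTS = ["10.0.0.11", "10.0.0.12"]  # hypothetical ESX1/ESX2 addresses
INTERVAL = 5                               # seconds; keep below the ageing time

def one_round(hosts, send=lambda h: subprocess.run(
        ["ping", "-c", "1", "-W", "1", h], stdout=subprocess.DEVNULL)):
    """Send one keepalive to each host; `send` is injectable for testing."""
    for h in hosts:
        send(h)

def keepalive(hosts=NESTED_HOSTS, interval=INTERVAL):
    """Loop forever, refreshing every host's MAC entry each interval."""
    while True:
        one_round(hosts)
        time.sleep(interval)
```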


          So, for some reason, at some point, the logical switch seems to flush the MAC address (source or destination? not sure yet) from its table, and then it doesn't seem to know where to send the packets it receives.


          It looks like it might be an ARP suppression problem. I don't know exactly where NSX gets the VM's MAC address from, but I would guess from its inventory. Since the vmkernel port has its own MAC address, different from the one on the vnic that the physical ESXi host sees, there might be something there.


          Anyway, this is now resolved, as I have just updated NSX to 6.4.1 and it now works as it should. I don't know if it is a known problem with 6.4.0; I'll have to take a look.