I am running a lab made of 3 ESXi hosts with vSAN and NSX. Everything works fine; I'm running a whole bunch of VMs in there for testing purposes and all is well.
Recently, I ran into a weird problem when trying to nest three ESXi hosts (I'll call them ESX1, ESX2 and ESX3) on an NSX logical switch that is connected to an edge gateway. Forged transmits and MAC address changes are enabled on the dvs port group created for the logical switch. While setting them up, I realized that I am losing a lot of packets when communicating with these VMs from the outside (layer 3) and between them (layer 2). I lose 50-70% of the packets, but not all of them.
And the whole thing is behaving weirdly:
- All the VMs were supposed to run on separate hosts (one on each physical host) with an anti-affinity DRS rule
- The VMs are reachable through an edge gateway and a physical router/firewall (Fortigate). Basically, from my workstation to one of these VMs, traffic goes like this: Workstation --> Fortigate --> Transit subnet --> Edge Gateway --> VMs. All of my labs are set up like that, without any problems so far. Even though there is a firewall, there isn't any filtering between my workstation and the labs
- The edge gateway is running on another management cluster which is part of the same vCenter and NSX. All of my edges (4 of them) for all of my labs are running that way from that cluster and everything is running fine. As far as I know, there is no problem with my NSX installation.
- When pinging the nested ESX hosts from the outside (my workstation, L3), ESX3 always loses only 5-10% of packets while the other two lose around 50-70%. That VM always has the same level of packet loss, no matter which host it is located on.
- When pinging between them (L2), there is always at least 50-70% packet loss, no matter where the VMs are located.
- If you place another VM (ESX1 or ESX2) on the same host as ESX3, which has only 5-10% packet loss, that VM (ESX1 or ESX2) automatically starts losing only 5-10% of packets, same as ESX3
- If you place all the VMs on the same host, you get absolutely no packet loss when pinging them from the outside (L3), no matter which of the three hosts is running them. But you still get around 50-70% packet loss between them (L2)
- If I also ping from ESX1 or ESX2 to my workstation while pinging from the workstation to that ESX, I get much lower packet loss (5-10%), no matter where the VM is located or whether it is on the same host as ESX3. Packet loss goes back to 50-70% as soon as I stop the ping from the ESX to the workstation.
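For anyone who wants to reproduce the loss figures above in a repeatable way, here is a minimal Python sketch that could be run from the workstation. It assumes a Linux or macOS style `ping` (the `-c` flag and a "packets transmitted" summary line); the function names `parse_loss` and `measure_loss` are purely illustrative and are not part of any NSX or ESXi tooling.

```python
import re
import subprocess

# Matches the summary line printed by Linux/macOS ping, for example:
# "10 packets transmitted, 4 received, 60% packet loss, time 9013ms"
# "10 packets transmitted, 4 packets received, 60.0% packet loss"
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def parse_loss(ping_output: str) -> float:
    """Return the packet loss in percent computed from a ping summary line."""
    match = SUMMARY.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        raise ValueError("no packets were transmitted")
    return 100.0 * (sent - received) / sent


def measure_loss(target: str, count: int = 100) -> float:
    """Ping `target` `count` times and return the loss in percent.

    Assumes a ping binary that accepts `-c <count>` (Linux/macOS).
    """
    result = subprocess.run(
        ["ping", "-c", str(count), target],
        capture_output=True,
        text=True,
        check=False,  # ping exits non-zero when packets are lost
    )
    return parse_loss(result.stdout)
```

Calling `measure_loss` with the management address of each nested host, once per VM placement, gives one number per combination, which makes it easier to compare the placements listed above side by side.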
I tried a lot of things to pinpoint the source of the problem:
- Tried some captures using the available NSX tools; didn't get much, except that when the ping doesn't work at L2, the VM is actually searching for the MAC address of the neighbour host it is trying to ping
- Tried disabling the IP discovery functionality on the logical switch, thinking it might have caused an ARP problem with Forged transmits. Didn't change anything
- Moved the VMs around and tried every possible combination of placement across the hosts to arrive at the conclusions above
I also found two posts describing a relatively similar problem with nested ESX on top of NSX:
The problem is, both of these articles talk about running nested ESXi AND nested NSX on top of NSX, which causes a problem with the physical ESXi dropping packets from the VXLAN. The point is, I am not running NSX inside the nested ESXi, only a simple vSwitch.
So, I'm kind of out of ideas on where to look next, and I am also running out of possible correlations to test.
Anyone with a suggestion?