I'm sorry to say this, but you need to do your homework to isolate this performance issue.
1. What latency are you seeing when users report the issue? How did you test the latency? Are there any strict latency requirements for those apps?
2. What is the design of this setup?
3. For what kind of workloads do we have performance issues?
4. What type of traffic is showing performance issues?
5. Have these issues been present from the beginning?
6. Was there any recent change in the setup?
7. Do the issues occur in a specific time frame, or are they intermittent?
8. Do we have any performance monitoring tools or software in this setup?
Please also watch "VMworld 2017 US - NET1343BU - NSX Performance Deep Dive" on YouTube, and never ignore the vSphere design; it can be a contributing factor as well.
Answers to your questions are inline below.
1. What latency are you seeing when users report the issue? How did you test the latency? Are there any strict latency requirements for those apps? We tested with httperf, with the rate set to 10000, installed on the source VM in the LAN cluster.
2. What is the design of this setup? Three ESGs in ECMP mode connect down to one DLR. Separate ESGs in one-arm mode are used as load balancers for the backend servers.
There are only two clusters under the same datacenter at the vCenter level: one LAN cluster (VXLAN not configured) and one VXLAN cluster (VXLAN configured). The source VM is in the LAN cluster and the target VMs are in the VXLAN cluster (microsegmentation is used to allow traffic, DFW rules are in place, and the target VMs sit behind the separate ESGs in one-arm mode).
3. For what kind of workloads do we have performance issues? For all applications hosted in the VXLAN cluster.
4. What type of traffic is showing performance issues? TCP traffic, most of the time.
5. Have these issues been present from the beginning? No. We upgraded NSX from 6.3.4 to 6.4.5 in Oct-Nov 2019, and after that the customer started reporting these issues on the platform. I can't find any related bug reported by VMware on the internet.
6. Was there any recent change in the setup? No, except for the NSX upgrade in Oct-Nov 2019.
7. Do the issues occur in a specific time frame, or are they intermittent? They show up in every test the customer runs to validate the platform.
8. Do we have any performance monitoring tools or software in this setup? Apart from httperf, no other tool is being used to monitor the latency. Any advice?
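On the monitoring question, one low-effort option alongside httperf is to record per-request latency and compare percentiles before and after each change, so that "slow" becomes a number. The following is only a minimal Python sketch: `fake_request` is a hypothetical stand-in for one real request to a target VM, and the sample counts are arbitrary assumptions, not values from this setup.

```python
import statistics
import time


def measure(request_fn, samples=200):
    """Time repeated calls to request_fn and return latencies in milliseconds."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        request_fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms):
    """Return median, p95, p99 and max for a list of latency samples."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "median": statistics.median(latencies_ms),
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(latencies_ms),
    }


def fake_request():
    # Hypothetical placeholder: replace with one real request to a target VM.
    time.sleep(0.001)


print(summarize(measure(fake_request, samples=50)))
```

Running the same measurement from the source VM against a target in the VXLAN cluster and against a target outside it would give a baseline to compare with.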
A quick hint: are they using the Applied To option in the NSX DFW, or did they leave it at the default?
The Applied To field defines the scope in which a rule is enforced, which reduces the number of rules applied per VM network adapter.
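To make the effect of Applied To concrete, here is a minimal sketch in Python. It is a toy model, not NSX code: the `Rule` class, the `DFW_WIDE` marker, and the vNIC names are all hypothetical, and it simply counts rules per vNIC without modelling how NSX actually compiles or expands rules on the host.

```python
from dataclasses import dataclass

DFW_WIDE = "DFW"  # hypothetical marker for the default, platform-wide scope


@dataclass(frozen=True)
class Rule:
    name: str
    applied_to: frozenset  # vNIC names, or {DFW_WIDE} for the default scope


def rules_per_vnic(rules, vnics):
    """Count how many rules each vNIC has to evaluate in this toy model."""
    counts = {vnic: 0 for vnic in vnics}
    for rule in rules:
        if DFW_WIDE in rule.applied_to:
            targets = set(vnics)  # default scope: the rule lands on every vNIC
        else:
            targets = rule.applied_to & set(vnics)
        for vnic in targets:
            counts[vnic] += 1
    return counts


vnics = ["web-01", "web-02", "app-01", "db-01"]

# Every rule left at the default scope: each vNIC receives all three rules.
default_scope = [
    Rule("web-to-app", frozenset({DFW_WIDE})),
    Rule("app-to-db", frozenset({DFW_WIDE})),
    Rule("mgmt-ssh", frozenset({DFW_WIDE})),
]

# Same rules, scoped to the vNICs that are actually involved.
scoped = [
    Rule("web-to-app", frozenset({"web-01", "web-02", "app-01"})),
    Rule("app-to-db", frozenset({"app-01", "db-01"})),
    Rule("mgmt-ssh", frozenset({DFW_WIDE})),
]

print(rules_per_vnic(default_scope, vnics))
print(rules_per_vnic(scoped, vnics))
```

With the default scope every vNIC carries all three rules; with scoping, only `app-01` still carries three and the others carry two. On a platform with well over a thousand rules, the same effect is what keeps the per-vNIC count manageable.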
Yes. The problem is that the customer created all the DFW rules with "Applied To" set to DFW, so every rule has been applied to every vNIC of the VMs hosted on the platform. Although there are only around 1500-1700 firewall rules, the per-vNIC rule count has exceeded the supported number (3500 max, per VMware). In my case it is over 5700. This is what the VMware support team concluded after I raised the case with them, and it is the root cause of the performance issues.
I don't have visibility into what each rule is used for. Has anyone faced this situation before, and if so, how did you go about rewriting the existing rules?
To handle this kind of problem, you have to do a global assessment of all your firewall rules and try the following:
- Check whether any rules can be merged.
- Look for conflicts between rules.
- Most importantly, use the Applied To option properly. This article explains the option in more detail: "Distributed Firewall (DFW) in NSX for vSphere, and “Applied To:”" (Telecom Occasionally).
- Use vRLI (vRealize Log Insight) and vRNI (vRealize Network Insight) to check the logs and traffic flows, analyze the different kinds of traffic, and customize your DFW rules based on what you find.
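The merge and conflict checks in the list above can be partly automated once the rules are exported into some structured form. The sketch below is a simplified Python model under loud assumptions: the `Rule` fields, the `"any"` wildcard, and the sample rules are hypothetical, each field holds a single value rather than groups or IP sets, and nothing here calls an NSX API. It flags later rules that are fully covered by an earlier rule (redundant if the action matches, conflicting if it differs) and lists rules that share source, destination, and action as merge candidates.

```python
from collections import defaultdict
from dataclasses import dataclass

ANY = "any"


@dataclass(frozen=True)
class Rule:
    rule_id: int
    src: str
    dst: str
    service: str
    action: str  # "allow" or "block"


def covers(a, b):
    """True if field value a matches everything field value b matches."""
    return a == ANY or a == b


def shadows(earlier, later):
    """True if every packet matching `later` already matches `earlier`."""
    return (covers(earlier.src, later.src)
            and covers(earlier.dst, later.dst)
            and covers(earlier.service, later.service))


def audit(rules):
    """Return (redundant, conflicting, mergeable) for an ordered rule list."""
    redundant, conflicting = [], []
    for i, later in enumerate(rules):
        for earlier in rules[:i]:
            if shadows(earlier, later):
                pair = (earlier.rule_id, later.rule_id)
                if earlier.action == later.action:
                    redundant.append(pair)
                else:
                    conflicting.append(pair)
                break  # the first covering rule is the one that takes effect
    groups = defaultdict(list)
    for rule in rules:
        groups[(rule.src, rule.dst, rule.action)].append(rule.rule_id)
    mergeable = [ids for ids in groups.values() if len(ids) > 1]
    return redundant, conflicting, mergeable


rules = [
    Rule(1, "web", "app", "tcp/8443", "allow"),
    Rule(2, "web", "app", "tcp/8080", "allow"),
    Rule(3, ANY, "db", "tcp/3306", "block"),
    Rule(4, "app", "db", "tcp/3306", "allow"),   # never reached: rule 3 blocks first
    Rule(5, "web", "app", "tcp/8443", "allow"),  # duplicate of rule 1
]

redundant, conflicting, mergeable = audit(rules)
print("redundant:", redundant)
print("conflicting:", conflicting)
print("mergeable:", mergeable)
```

Treat the output as a list of candidates to review by hand, not as a safe automatic rewrite: merging rules that are not adjacent can change behaviour if other rules sit between them.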