Good day mate,
I'm currently having an issue with NSX-V 6.4.x.
Let's say we currently have 2 vCenters working in linked mode (vCenter-A and vCenter-B).
Then we have Cloud Director working on top of these linked vCenters.
NSX-V 6.4.11 is configured on both vCenters and on Cloud Director.
We've configured the basic components (NSX Manager and NSX Controller) and deployed NSX edge gateways for most of our customers.
The problem we're facing right now is that on vCenter-A, edge gateways randomly hang. The symptoms are as follows:
Let's say we have 3 VMs here:
1. Edge gateway VM = 192.168.1.1
2. VM-A = 192.168.1.2
3. VM-B = 192.168.1.3
Note that this only happens on vCenter-A; vCenter-B has no issues at all.
What we've done so far: we upgraded NSX on vCenter-A from 6.4.11 to 6.4.14 (it did not help; the issue still persists after the upgrade).
We do have a workaround. When the issue happens, we get a trigger that the public IP is unreachable; the workaround is listed below:
We have a hundred edge gateway VMs on vCenter-A, but this happens to one Edge at a time (the others remain stable; only one has the issue at a time, and it is a different, random Edge gateway each time).
A few more things to know: the physical hosts and switches for vCenter-A and vCenter-B are in the same chassis and rack. Most of them are mixed together and use the same hardware and configuration, but this never happens on vCenter-B.
vCenter version: 7.0.3 (Build 20990077)
ESXi version: 7.0.3 (Build 20842708)
Random NSX-V Edge Gateway hangs that occur on vCenter-A but not on vCenter-B can be challenging to diagnose. Here are some steps and considerations to help you troubleshoot and narrow down the problem:
1. **Verify Compatibility:**
- Ensure that all components, including ESXi hosts, NSX-V, vCenter, and physical infrastructure, are on the VMware Compatibility Guide for your specific versions. Incompatible hardware or software versions can lead to unexpected issues.
2. **Collect Logs and Diagnostics:**
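- When the issue occurs, note the exact time and collect logs and diagnostics from the affected Edge Gateway VM (for example, its tech support log bundle) and from the ESXi host it is running on. Analyzing these logs around the time of the hang can show what the edge and the host were doing when it stopped responding.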
3. **Review Edge Gateway Configuration:**
- Verify the configuration of the affected Edge Gateway VMs. Pay attention to firewall rules, routing, NAT, and any custom configurations. Ensure that they align with your network design and requirements.
4. **Monitor Resource Utilization:**
- Monitor the resource utilization (CPU, memory, and network) on the affected ESXi host and Edge Gateway VMs. High resource utilization can lead to performance issues and hangs.
5. **Check for Network Issues:**
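- Examine the physical network infrastructure for potential problems such as packet loss, congestion, or switch issues. Since both environments share the same rack and hardware, compare the network configuration of vCenter-A against vCenter-B (for example distributed switch settings, MTU, and teaming policy) and look for anything that differs.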
6. **Review VMware KB Articles:**
- Search VMware's Knowledge Base for any known issues or solutions related to NSX-V Edge Gateway hangs for your specific version. VMware often publishes articles with troubleshooting steps and fixes for common issues.
7. **Check for ESXi Host Isolation:**
- Ensure that the ESXi hosts where Edge Gateway VMs are running are not experiencing isolation events or network issues. Host isolation can lead to communication problems.
8. **Check for VMware Tools and ESXi Updates:**
- Ensure that VMware Tools inside the VMs and ESXi hosts are up-to-date. Outdated or incompatible versions can lead to communication issues.
9. **Consider NSX-V Version Compatibility:**
- You mentioned upgrading NSX on vCenter-A; check the VMware Product Interoperability Matrix to confirm that NSX-V 6.4.14 is fully compatible with your vCenter 7.0.3 and ESXi 7.0.3 builds.
10. **Engage VMware Support:**
- If the issue persists and you can't identify the root cause, consider opening a support case with VMware. They have the expertise and tools to diagnose and resolve complex issues.
11. **Performance Monitoring:**
- Implement performance monitoring and alerting for your NSX-V environment. Tools like vRealize Operations Manager can help you proactively detect performance issues.
12. **HA and Fault Tolerance:**
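- Consider enabling NSX Edge High Availability (an active/standby appliance pair) on critical Edge Gateways, together with vSphere HA on the cluster, to provide redundancy and minimize downtime in case of issues. Confirm with VMware whether vSphere Fault Tolerance (FT) is supported for edge appliances before relying on it.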
13. **Regular Maintenance:**
- Schedule regular maintenance windows to perform updates and patches on your VMware infrastructure components. This can help ensure that you have the latest bug fixes and security updates.
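
For step 5, one way to find what differs between the two environments is to export the relevant host and switch settings from each vCenter and diff them. The following is a minimal Python sketch, assuming the settings have already been exported into plain dictionaries. The keys (`mtu`, `teaming_policy`, `vmnic_driver`) and their values are hypothetical placeholders, not output from any VMware API.

```python
from typing import Any, Dict, Tuple


def diff_settings(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ.

    A key that is missing on one side is reported with None for that side.
    """
    differences = {}
    for key in sorted(set(a) | set(b)):
        left, right = a.get(key), b.get(key)
        if left != right:
            differences[key] = (left, right)
    return differences


# Hypothetical exported settings; replace these with your real exports.
vcenter_a = {"mtu": 1600, "teaming_policy": "policy-x", "vmnic_driver": "1.2.3"}
vcenter_b = {"mtu": 1600, "teaming_policy": "policy-x", "vmnic_driver": "1.2.4"}

# Print only the settings that are not identical on both sides.
for key, (val_a, val_b) in diff_settings(vcenter_a, vcenter_b).items():
    print(f"{key}: vCenter-A={val_a!r} vCenter-B={val_b!r}")
```

Because the hardware is shared, even a small difference (a driver version, an MTU value, a teaming setting) is worth following up, since it is one of the few things that can explain why only vCenter-A is affected.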
Given the complexity of your environment and the intermittent nature of the issue, it's essential to thoroughly investigate each potential cause and monitor the situation closely. VMware support may be your best resource for diagnosing and resolving this issue, especially if it's specific to vCenter-A and not reproducible on vCenter-B.
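
To make "monitor the situation closely" more concrete: recording exactly when an edge stops and resumes responding makes it much easier to line the hang up with the edge and ESXi logs. The following is a minimal Python sketch, assuming a Linux-style `ping` binary on the monitoring machine. The function names and the 5-second interval are illustrative choices, not part of any VMware tooling.

```python
import subprocess
import time
from datetime import datetime
from typing import Iterable, List, Optional, Tuple

Sample = Tuple[datetime, bool]  # (time of probe, whether the target replied)


def probe(ip: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo using the system ping (Linux-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def outage_windows(samples: Iterable[Sample]) -> List[Tuple[datetime, Optional[datetime]]]:
    """Collapse probe samples into (first_failure, first_recovery) windows.

    The end of a window is None if the target was still unreachable
    at the last sample.
    """
    windows = []
    start = None
    for when, reachable in samples:
        if not reachable and start is None:
            start = when  # outage begins
        elif reachable and start is not None:
            windows.append((start, when))  # outage ends
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


def monitor(ip: str, interval_s: float = 5.0) -> None:
    """Probe forever and print a timestamped line whenever reachability changes."""
    last = None
    while True:
        ok = probe(ip)
        if ok != last:
            state = "reachable" if ok else "UNREACHABLE"
            print(f"{datetime.now().isoformat()} {ip} {state}")
            last = ok
        time.sleep(interval_s)
```

Calling `monitor("192.168.1.1")` (the example edge address from the question) prints a line each time the edge changes state, and `outage_windows` can summarize stored samples afterwards. Running one monitor per edge gateway would also show whether the hangs cluster on particular hosts or times of day.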
