VMware Networking Community
nookzzz
Contributor

Guest VMs unpredictably lose connection when using NSX-V edge gateway

Good day, mate,

I'm currently having an issue with NSX-V 6.4.x.

Let's say we currently have 2 vCenters working in linked mode (vCenter-A and vCenter-B).

Then we have Cloud Director working on top of these linked vCenters.

NSX-V 6.4.11 is configured on both vCenters and on Cloud Director.

We've configured the basic components, which are NSX Manager and NSX Controller, and deployed an NSX edge gateway for most customers.

 

The problem we are facing right now is that on vCenter-A an edge gateway randomly hangs. The symptoms are as follows:

Let's say we have 3 VMs here:

1. Edge gateway VM = 192.168.1.1

2. VM-A = 192.168.1.2

3. VM-B = 192.168.1.3

  1. VMs behind this edge gateway lose their connection to the internet (their public IPs are not pingable from my laptop), and the VMs cannot ping the edge gateway VM
  2. On the edge gateway VM, the ARP entries for the VMs using this edge (VM-A and VM-B) are missing from the ARP output
  3. On the edge gateway VM, we log in to the console and can still reach the internet (8.8.8.8 for testing)
  4. From the edge gateway VM, we can't connect to VM-A and VM-B (ping to 192.168.1.2 and 192.168.1.3 from the edge VM is unreachable)
  5. VMs behind this edge gateway can't reach the edge gateway (ping to 192.168.1.1 is unreachable) and can't reach the internet
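
To make the guest-side symptoms (1 and 5) easy to re-check when the issue comes back, here is a minimal sketch of a reachability check that could be run from VM-A or VM-B. It assumes a Linux guest with the iputils `ping` flags `-c` and `-W`; the function names and the status labels are only illustrative, not anything from NSX.

```python
import subprocess
from typing import Callable


def ping(host: str, count: int = 2, timeout_s: int = 2) -> bool:
    """Return True if `host` answers ICMP echo (assumes Linux iputils ping)."""
    completed = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode == 0


def classify_from_guest(
    gateway_ip: str,
    internet_ip: str,
    probe: Callable[[str], bool] = ping,
) -> str:
    """Classify what a guest VM sees, using the two pings from the post."""
    gateway_ok = probe(gateway_ip)
    internet_ok = probe(internet_ip)
    if gateway_ok and internet_ok:
        return "healthy"
    if not gateway_ok and not internet_ok:
        # Matches symptoms 1 and 5: no gateway, no internet.
        return "gateway-unreachable"
    if gateway_ok:
        # Gateway answers but the internet target does not.
        return "upstream-problem"
    # Internet works although the gateway does not answer ping.
    return "gateway-icmp-blocked"
```

On VM-A this would be called as `classify_from_guest("192.168.1.1", "8.8.8.8")`. The `probe` parameter is there so the logic can be tried with a fake ping instead of real network access.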

Note that this only happens on vCenter-A; vCenter-B has no issue at all.

What we've done so far is upgrade NSX on vCenter-A from 6.4.11 to 6.4.14 (it did not help; the issue still persists after the upgrade).

 

We do have workarounds for when the issue happens and we get an alert that the public IP is unreachable. They are listed below:

  1. Redeploy the edge gateway from Cloud Director, which fixes the issue (this is not permanent: some edge gateways have had the issue again, while others have not so far)
  2. Migrate the edge gateway VM to the same ESXi host as its VMs and create a rule to keep them together at all times (192.168.1.1-3 stay on the same host; this is our permanent fix right now, but I know it is not a good idea)
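
For the "public IP is unreachable" trigger, a rough sketch of the idea is below: count consecutive failed probes per public IP and raise the alert only when the count reaches a threshold, so a single lost ping does not start a redeploy. The class name, the threshold of 3, and the example address are assumptions for illustration; this is not our actual monitoring setup.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FailureTracker:
    """Track consecutive probe failures per IP address."""

    threshold: int = 3  # assumed value; tune to the probe interval
    _streaks: Dict[str, int] = field(default_factory=dict)

    def record(self, ip: str, reachable: bool) -> bool:
        """Record one probe result.

        Returns True only on the probe where the failure streak
        first reaches the threshold, so each outage alerts once.
        """
        if reachable:
            self._streaks[ip] = 0
            return False
        self._streaks[ip] = self._streaks.get(ip, 0) + 1
        return self._streaks[ip] == self.threshold
```

A monitoring loop would call `record(ip, result)` after every probe and start the workaround only when it returns True.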

 

We have about a hundred edge gateway VMs on vCenter-A, but this happens to one edge at a time (the others remain stable; only one has the issue at a time, but it is a different, random edge gateway each time).

One more thing to know: for vCenter-A and vCenter-B, the physical hosts and switches are in the same chassis and rack. Most of them are mixed together and use the same hardware and configuration. But this never happens on vCenter-B.

 

vCenter version

7.0.3 (Build 20990077)

 

ESXi version

7.0.3 (20842708)

 

 

 

1 Reply
andrewassis
Contributor

The symptoms you're experiencing with random NSX-V Edge Gateway hang issues on vCenter-A but not on vCenter-B can be challenging to diagnose, but I can provide some steps and considerations to help you troubleshoot and potentially resolve the problem:

1. **Verify Compatibility:**
- Ensure that all components, including ESXi hosts, NSX-V, vCenter, and physical infrastructure, are on the VMware Compatibility Guide for your specific versions. Incompatible hardware or software versions can lead to unexpected issues.

2. **Collect Logs and Diagnostics:**
- When the issue occurs, collect logs and diagnostics from the affected Edge Gateway VM and the corresponding ESXi host. Analyzing these logs can provide insights into what's happening at the time of the hang.

3. **Review Edge Gateway Configuration:**
- Verify the configuration of the affected Edge Gateway VMs. Pay attention to firewall rules, routing, NAT, and any custom configurations. Ensure that they align with your network design and requirements.

4. **Monitor Resource Utilization:**
- Monitor the resource utilization (CPU, memory, and network) on the affected ESXi host and Edge Gateway VMs. High resource utilization can lead to performance issues and hangs.

5. **Check for Network Issues:**
- Examine the physical network infrastructure for any potential problems such as packet loss, congestion, or switch issues. Ensure that the network configuration on vCenter-A matches vCenter-B.

6. **Review VMware KB Articles:**
- Search VMware's Knowledge Base for any known issues or solutions related to NSX-V Edge Gateway hangs for your specific version. VMware often publishes articles with troubleshooting steps and fixes for common issues.

7. **Check for ESXi Host Isolation:**
- Ensure that the ESXi hosts where Edge Gateway VMs are running are not experiencing isolation events or network issues. Host isolation can lead to communication problems.

8. **Check for VMware Tools and ESXi Updates:**
- Ensure that VMware Tools inside the VMs, and the ESXi hosts themselves, are up to date. Outdated or incompatible versions can lead to communication issues.

9. **Consider NSX-V Version Compatibility:**
- While you mentioned upgrading NSX on vCenter-A, ensure that NSX-V version 6.4.14 is fully compatible with your vCenter and ESXi versions.

10. **Engage VMware Support:**
- If the issue persists and you can't identify the root cause, consider opening a support case with VMware. They have the expertise and tools to diagnose and resolve complex issues.

11. **Performance Monitoring:**
- Implement performance monitoring and alerting for your NSX-V environment. Tools like vRealize Operations Manager can help you proactively detect performance issues.

12. **HA and Fault Tolerance:**
- Consider enabling High Availability (HA) for Edge Gateway VMs (NSX Edge HA deploys an active/standby pair) and vSphere HA or Fault Tolerance (FT) for critical workload VMs, to provide redundancy and minimize downtime in case of issues.

13. **Regular Maintenance:**
- Schedule regular maintenance windows to perform updates and patches on your VMware infrastructure components. This can help ensure that you have the latest bug fixes and security updates.
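
For step 5, one way to compare the two sides is to write down the relevant network settings for a vCenter-A host and a vCenter-B host and diff them. The sketch below is only an illustration: the setting names and values are hypothetical placeholders that you would fill in by hand or from your own export, not output from any VMware tool.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"


def diff_settings(
    a: Dict[str, Any], b: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {setting: (value_in_a, value_in_b)} for every difference."""
    differences: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(a) | set(b)):
        value_a = a.get(key, MISSING)
        value_b = b.get(key, MISSING)
        if value_a != value_b:
            differences[key] = (value_a, value_b)
    return differences


# Hypothetical example values, not taken from the thread.
host_a = {"vds_mtu": 1600, "teaming_policy": "failover", "vtep_vlan": 100}
host_b = {"vds_mtu": 1600, "teaming_policy": "src-id"}

for name, (val_a, val_b) in diff_settings(host_a, host_b).items():
    print(f"{name}: vCenter-A={val_a} vCenter-B={val_b}")
```

Any setting that shows up in the diff is a candidate for closer inspection, since the issue only appears on vCenter-A.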

Given the complexity of your environment and the intermittent nature of the issue, it's essential to thoroughly investigate each potential cause and monitor the situation closely. VMware support may be your best resource for diagnosing and resolving this issue, especially if it's specific to vCenter-A and not reproducible on vCenter-B.