VMware Cloud Community
jconlin2010
Contributor

Need some help troubleshooting network problems - ESXi 5.5

We are having some very odd networking problems, and after working with the network team we are running out of ideas.

The problem:
VMs on standard vSwitches are having problems talking to other systems on the same VLAN, resulting in dropped packets and RPC errors, even when both systems are on the same VM host.

Here is a quick and dirty diagram of how the network is laid out: http://imgur.com/a/iJIFJ

ESXi NICs are hooked up to Nexus1 and Nexus2. When VMs attempt to communicate, the path will often go to the wrong physical address and fail; then the ARP tables update and the traffic goes to the other NIC.

The vSwitches are configured for load balancing with "Route based on IP hash", and because we are using standard vSwitches we do NOT have LACP enabled.

We have noticed this problem only exists on the Dell cluster, which is hooked directly into the Nexus environment. The UCS cluster, which is plugged into FIs, does not share the problem. I suspect this is because the UCS FIs share ARP tables and the traffic never makes it back to the Nexus 5ks.

My suspicion is that there is a problem between the 2 Nexus 5k switches and they are not sharing ARP tables properly, but the network team is insisting that the problem lies with either the ESXi or Windows OS layer within the VMs. I'm not versed enough in low level network operations to argue this, but I'm at a loss for how to troubleshoot this further and get a definitive answer.

Some solutions we've tried:
1. Setting up a dvSwitch on a test box and enabling LACP. This saw no change.
2. Dropping 1 NIC on the vSwitch and forcing paths to go up Nexus1. This caused the problem to stop, but is unacceptable as a solution as it removes our path redundancy.

Some things network team wants us to try but we haven't done yet:
1. Manually changing the VMs' "reachable time" on the NICs to a lower value.
2. Changing out the VMXNET3 interfaces for E1000
3. Enabling LACP on the Standard vSwitches (This isn't supported)
4. Upgrading to ESXi 6 (we're not ready for this migration yet)

We're really pulling our hair out over this one...has anyone ever encountered these problems before?

12 Replies
UmeshAhuja
Commander

Hi

Can you cross-check with the network team whether both Nexus 5K switches have ARP entries for the same MAC address?

Thanks n Regards
Umesh Ahuja

If your query resolved then please consider awarding points by correct or helpful marking.
jconlin2010
Contributor

There are not. Network team is seeing ARP entries for the VM's MAC address on 1 nexus device but not the other.

UmeshAhuja
Commander

Then this is the problem. They need to create an ARP entry on both switches with the same MAC address and IP address of the ESXi hosts.

Thanks n Regards
Umesh Ahuja

jconlin2010
Contributor

That's what I thought too, but they're telling me that that's not the case. According to the network team the ARP entry should only be on the switch where the physical connection is. Once again, I'm not versed enough in networking to dispute this with them, but I did think that that was the way it was supposed to work with both 5k switches syncing their ARP tables.

UmeshAhuja
Commander

Hi,

Second thought!

Have you tried rebooting the ESXi host? It might be that the ARP entries in the cache are stale or missing from the ARP table.

Troubleshooting network connectivity issues using Address Resolution Protocol (ARP) (1008184) | VMwa...

Thanks n Regards
Umesh Ahuja

jconlin2010
Contributor

Oh yeah, we've rebooted a bunch. I don't think the ARP tables on the ESXi management network are the issue. This seems to be a problem with the VMs and the Nexus gear not handling moving MAC addresses properly for some reason. The ARP tables on the VMs keep pointing at Nexus 1 and then the MAC moves to Nexus 2 (or vice versa) and it tries to go down the wrong path...resulting in dropped packets until it figures it out.
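
To make the "moving MAC" symptom concrete, here is a minimal sketch of how a learning table reacts when the same VM MAC shows up on alternating ports. Everything in it is an assumption for illustration: the MAC address, the port names, and the idea of treating both Nexus switches as one table are hypothetical and are not taken from our actual switch output.

```python
from collections import Counter

def count_mac_moves(frames):
    """Replay (source_mac, ingress_port) observations through a simple
    learning table and count how often each MAC changes port."""
    table = {}         # MAC -> port where it was last seen
    moves = Counter()  # MAC -> number of times it changed port
    for mac, port in frames:
        previous = table.get(mac)
        if previous is not None and previous != port:
            moves[mac] += 1
        table[mac] = port
    return table, moves

# Hypothetical frames from one VM alternating between the two uplinks.
frames = [
    ("00:50:56:aa:bb:cc", "nexus1-eth1/10"),
    ("00:50:56:aa:bb:cc", "nexus2-eth1/10"),
    ("00:50:56:aa:bb:cc", "nexus1-eth1/10"),
]
table, moves = count_mac_moves(frames)
print(moves["00:50:56:aa:bb:cc"])  # 2
```

Every move means that, for a moment, frames for that MAC can be forwarded to the port where it was seen last rather than where it is now, which would match the short bursts of dropped packets.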

Calyps0Craig
Enthusiast

Did you ever get a resolution to this? We are experiencing what seems to be the same situation and, like you and your network team, we are pulling our hair out!

dineshgoundar
Enthusiast

  1. What's your network setup? How are the ESXi uplinks connected, and are you using VSS or vDS?
  2. What load balancing algorithm are you using within ESXi?
  3. What are the uplink switch ports configured as?
Calyps0Craig
Enthusiast

1. vDS

2. Route based on originating virtual port

3. spanning-tree port type edge trunk

Calyps0Craig
Enthusiast

FYI, I have resolved this issue by upgrading the i40e NIC driver to version 2.0.6.

dineshgoundar
Enthusiast

Nice. Good to hear you have resolved the issue.

dhanarajramesh

The vSwitches are configured to load balancing with "Route based on IP hash" and because we are using standard vSwitches we do NOT have LACP enabled.

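Since a port channel is not enabled on the physical switch side, you cannot use Route based on IP hash. (On a standard vSwitch this has to be a static port channel, because LACP is not supported there.) The reason is that IP hash uses both the source and destination IP addresses to pick an uplink, so traffic from the same VM can leave on either NIC, and the physical switches have to treat those ports as one bundle.

So the solution would be one of:

1) Configure a port channel on the physical switch side and keep using IP hash.

2) Change the load balancing from IP hash to Route based on the originating virtual port.

The links below should give you more info: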

Understanding IP Hash load balancing (2006129) | VMware KB

Configure NIC Teaming, Failover, and Load Balancing on a vSphere Standard Switch or Standard Port Gr...
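
To illustrate why IP hash without a port channel makes the VM's MAC jump between switches, here is a rough sketch. It assumes a simplified model (XOR of the two IPv4 addresses as integers, modulo the number of uplinks), which is in the spirit of the KB's description but is not ESXi's actual code. The function name and the IP addresses are made up for the example.

```python
import ipaddress

def ip_hash_uplink(src_ip, dst_ip, uplink_count):
    """Simplified model of IP hash teaming: XOR the two IPv4 addresses as
    integers and take the result modulo the number of uplinks.
    This is an assumption for illustration, not ESXi source code."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % uplink_count

# Hypothetical VM talking to two peers over a two-uplink team.
vm = "10.0.0.21"
for peer in ("10.0.0.30", "10.0.0.31"):
    print(peer, "-> uplink", ip_hash_uplink(vm, peer, 2))
# 10.0.0.30 -> uplink 1
# 10.0.0.31 -> uplink 0
```

The same VM sends through a different uplink depending on who it is talking to, so without a port channel each physical switch keeps relearning the VM's MAC on a different port. With Route based on the originating virtual port, the uplink depends only on the VM's virtual port, so the MAC stays on one uplink until a failover.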
