VMware Cloud Community
david2009
Contributor

ESXi 4.1 and Linux VM GuestOS keep losing packets

First of all, I am a network guy and I know just enough about ESXi 4.1 and 5.0 to be dangerous. I know how to set up ESX and connect it to my Cisco environment (by the way, I am not using the 1000V anywhere), so here we go:

I have an ESX host running on a Dell PowerEdge R910 with the latest BIOS and firmware. It has a dual-port Intel 10Gig NIC that carries the traffic for the VM guests on this host; the two ports are configured as Active/Standby and are connected to my Cisco Nexus switches for redundancy. There are other NICs on the host, but they are for vMotion, backup and other things that I will not bore you with. Everything is certified by VMware Professional Services.

On that ESX host, I have a Red Hat Linux 5.6 VM guest, fully patched, running Apache Web Server. Very often, I see the TCP 3-way handshake fail between client hosts and Apache. I ran tcpdump on the Linux guest and also placed a sniffer on the ESX 10Gig interface to track the traffic, and here is what I am seeing:

- The client sends a TCP SYN. I see it on the ESX 10Gig NIC trunk port and also on the Linux VM guest OS.

- The Linux VM guest OS sends back a TCP SYN-ACK. I see it on the Linux VM guest OS and also on the ESX 10Gig NIC trunk port.

- The client sends a TCP ACK to complete the 3-way handshake. I see it on the ESX 10Gig NIC trunk port, but I do not see it in tcpdump on the Linux VM guest OS.

Because of this, the Linux VM guest OS keeps resending the SYN-ACK to the client and the client keeps sending duplicate ACKs back to the server, but the server never receives the ACK, so the 3-way handshake never completes. What makes this so difficult to troubleshoot is that the issue is intermittent. I am confident that this is a VMware issue and not a network issue, because we have multiple physical servers on the same network as the Linux VM guest OS that do not have this problem. And no, we do NOT have a duplicate IP address or MAC address in this network.
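
For anyone wanting to repeat the comparison described above, here is a minimal Python sketch of the idea: treat the uplink capture and the guest capture as two lists of TCP segments and report whatever was seen on the uplink but has no counterpart in the guest. The `Segment` record, the `missing_in_guest` helper, and all addresses and sequence numbers are made up for illustration; they are not taken from the captures in this thread, and getting real tcpdump output into this shape is left out.

```python
from collections import Counter
from typing import NamedTuple


class Segment(NamedTuple):
    """One TCP segment as it might be extracted from a capture (hypothetical schema)."""
    src: str
    sport: int
    dst: str
    dport: int
    flags: str  # "S" = SYN, "SA" = SYN-ACK, "A" = ACK
    seq: int
    ack: int


def missing_in_guest(uplink, guest):
    """Return segments seen on the uplink that have no match in the guest capture.

    A multiset difference is used so that retransmissions (identical
    segments appearing more than once) are counted individually.
    """
    remaining = Counter(guest)
    missing = []
    for seg in uplink:
        if remaining[seg] > 0:
            remaining[seg] -= 1  # matched one copy in the guest capture
        else:
            missing.append(seg)  # on the wire, never reached the guest
    return missing


# Made-up addresses and sequence numbers, for illustration only.
client, server = "192.0.2.10", "192.0.2.20"
syn = Segment(client, 40000, server, 80, "S", 1000, 0)
syn_ack = Segment(server, 80, client, 40000, "SA", 5000, 1001)
ack = Segment(client, 40000, server, 80, "A", 1001, 5001)

uplink_capture = [syn, syn_ack, ack, syn_ack, ack]  # SYN-ACK and ACK both repeat
guest_capture = [syn, syn_ack, syn_ack]             # the ACK never shows up

lost = missing_in_guest(uplink_capture, guest_capture)
for seg in lost:
    print(seg)
```

The match is on the full tuple including sequence numbers, so both captures would need absolute sequence numbers (for example `tcpdump -S`) for the comparison to line up.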

ESX server CPU and memory utilization is very low. Throughput on the Intel 10Gig NIC is less than 300Mbps, so the load is light everywhere. The NIC type set on the Linux VM guest OS is E1000. I looked at the guest using iotop (Linux utility) and its utilization is less than 30Mbps.

Has anyone seen this issue before? Please advise.

Thanks,

iw123
Commander

Hi,

Just a question: do you have VMware Tools installed and working in the VM? Have you considered using the VMXNET3 adapter?

MKguy
Virtuoso

Try the vmxnet3 NIC as mentioned. What happens if you switch the active/standby uplinks of the port group?

Also, what physical NIC do you have in the server? There have been a couple of issues lately which required manual NIC driver updates.

A 10Gb Intel NIC sounds like the ixgbe driver; check with ethtool -i vmnicX or esxcfg-nics -l on the ESXi shell. The latest ixgbe driver is available here:

https://my.vmware.com/group/vmware/details?downloadGroup=DT-ESXI50-INTEL-IXGBE-31132&productId=285
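
If you want to record the driver details before and after an update, a small Python sketch like the one below can turn the "key: value" lines that ethtool -i prints into a dictionary. The `parse_ethtool_info` helper is hypothetical, and the sample text is a placeholder with made-up values, not output from the host in this thread.

```python
def parse_ethtool_info(text):
    """Parse 'key: value' lines, as printed by `ethtool -i`, into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as PCI
        # addresses that contain colons stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# Placeholder text with made-up values, for illustration only.
sample = """driver: ixgbe
version: X.Y.Z
firmware-version: N/A
bus-info: 0000:00:00.0
"""

info = parse_ethtool_info(sample)
print(info["driver"], info["version"])
```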

But I don't really see why the host would block specifically the ACK frames of a TCP handshake while still forwarding other packets at the same time.

-- http://alpacapowered.wordpress.com