Re: ESXi 5.0 and network packet size / loss issues...

cxo · ‎11-11-2011

We recently upgraded some hosts to ESXi 5.0 (fresh build).

All appears to have gone well. Well, for the most part.

We have an internal proprietary application that needs to talk to other systems on its own protocol. Odd thing, this particular problem only occurs when the VMtools AND VMhardware are upgraded to the latest and greatest. If the VM has latest VMtools and VMhardware 7 or not latest tools and VMhardware 8 (yeah, I still don't understand how this can be!), the issue is not noticed.

The issue is that this application starts by making a UDP request to another server. This request is initiated with a 10-byte datagram. Watching TCPDUMP on the sending system shows the datagram being sent, but it is never received at the destination. It looks like, somewhere along the way, the combination of new tools, new VMhardware, and ESXi 5.0 causes datagrams of between 0 and at least 23 bytes to be lost (we see 24-byte datagrams get through).

Tried using VMXNET2, VMXNET3 as well. Other VMs using this application do not have the issue, again, if one (or none) of the tools/hardware combination is used.

Anyone have thoughts or ideas or a link to an article articulating network characteristics/limitations with new VMs in ESXi 5.0 release?

VM is a fully patched CentOS 5.7 i386 system.

Thanks!

rickardnobel · ‎11-11-2011

cxo wrote:
The issue is this application starts by making a UDP request to another server. This request is initiated with a 10 byte datagram.

Has this worked before? A packet with only a 10-byte UDP payload should not actually work, since it would be below the Ethernet minimum frame size of 64 bytes.

Ethernet header = 14 bytes

IP header = 20 bytes

UDP header = 8 bytes

UDP payload = XXX

Ethernet checksum = 4 bytes

To reach the 64-byte minimum frame size, XXX would need to be at least 18 bytes. However, it is very strange that your 23-byte payload packet does not get through.
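A small Python sketch of the arithmetic above. It also shows how many padding bytes a shorter payload would need; Ethernet senders normally add that padding themselves, so this is only a way to check the numbers. The constant and function names are illustrative.

```python
# Sizes in bytes, as listed above (IPv4 header without options).
ETH_HEADER = 14
IP_HEADER = 20
UDP_HEADER = 8
ETH_FCS = 4
ETH_MIN_FRAME = 64


def min_unpadded_udp_payload() -> int:
    """Smallest UDP payload that reaches the minimum frame size without padding."""
    overhead = ETH_HEADER + IP_HEADER + UDP_HEADER + ETH_FCS
    return ETH_MIN_FRAME - overhead


def padding_needed(payload_len: int) -> int:
    """Padding bytes required to bring a frame with this payload up to 64 bytes."""
    return max(0, min_unpadded_udp_payload() - payload_len)


if __name__ == "__main__":
    print("minimum unpadded payload:", min_unpadded_udp_payload())
    for size in (10, 18, 23, 24):
        print(size, "byte payload needs", padding_needed(size), "padding bytes")
```

By this calculation the 23-byte and 24-byte payloads are both above the minimum, so the frame size limit alone does not explain a threshold between them.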

My VMware blog: www.rickardnobel.se

cxo · ‎11-11-2011

Yes, this application has been running here for 25+ years, actually. On a variety of hardware and OSes, including VMware guests from the 3.0.2 ESX days to ESXi 5.0.0 with VMhardware 7.

Not sure of the size threshold yet. I just know that datagrams greater than 23 bytes work, and that 10 bytes does NOT work. Not sure about sizes of 0-9 and 11-23 bytes.
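One way to narrow down the threshold would be to send one datagram of every size and record which sizes arrive. A minimal Python sketch, assuming you can run a small receiver on the destination; the function names are illustrative and have nothing to do with the proprietary application. The self-test at the bottom only uses loopback; for a real test the receiver and sender halves would run on the two machines with an agreed port.

```python
import socket


def send_probes(sock, addr, sizes):
    """Send one UDP datagram per requested payload size (0 bytes is valid)."""
    for size in sizes:
        sock.sendto(b"x" * size, addr)


def collect_sizes(sock, timeout=1.0):
    """Return the set of payload lengths received until the socket goes quiet."""
    seen = set()
    sock.settimeout(timeout)
    while True:
        try:
            data, _ = sock.recvfrom(2048)
        except socket.timeout:
            return seen
        seen.add(len(data))


if __name__ == "__main__":
    # Loopback self-test: bind to an ephemeral port and probe sizes 0..30.
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sizes = range(0, 31)
    send_probes(sender, receiver.getsockname(), sizes)
    missing = sorted(set(sizes) - collect_sizes(receiver))
    print("missing payload sizes:", missing)
    sender.close()
    receiver.close()
```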

OSes on physical counterparts have never shown this issue. I am trying the OS driver instead of VMXNET3 to see if there is a correlation there.

Thanks for the info.

rickardnobel · ‎11-11-2011

With TCPDUMP inside the virtual machine, can you see that the small packet leaves the VM?

The receiver of these packets, is that a virtual machine inside the host or a physical machine?

Could you set up another test VM on another portgroup, on the same VLAN but with Promiscuous Mode = Accept, and then capture traffic in promiscuous mode on that VM? That would show whether the frame is delivered inside the vSwitch or is rejected as soon as it reaches the "first vSwitch port".

My VMware blog: www.rickardnobel.se

irvingpop2 · ‎11-11-2011

I just finished filing support requests for two similar issues, both on an ESX 5.0 cluster (with latest updates) on fully-certified HP hardware.

1. Packet corruption issues on CentOS/Redhat Linux

Verified with two guest OSes: CentOS 5.7 32-bit and CentOS 6.0 64-bit. Both using VMware tools 8.6.0.

Verified with two NICs: vmxnet3 and vmxnet

During large/high-bandwidth file transfers, data is corrupted. SCP is the easiest way to verify this, because it will abort any transfer with "corrupted MAC on input" error message. HTTP and HTTPS large files won't match the original MD5sum.

Appears most frequently when transfer bandwidth >= 1MB/s and file size >= 500MB.

Workaround: Disable RX-checksum and TX-checksum, just like the issue described here: https://bugzilla.redhat.com/show_bug.cgi?id=503288

Note: Ubuntu 10.04 64-bit guests on the same cluster, but using open-vm-tools, don't have this issue.
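For reference, the checksum workaround above comes down to switching RX and TX checksum offload off inside the Linux guest with ethtool. A minimal Python sketch that builds the commands; the interface name eth0 is an assumption, ethtool must be installed, applying the change needs root, and the setting does not survive a reboot unless it is also added to the distribution's network configuration.

```python
import subprocess


def offload_command(interface: str, enable: bool = False) -> list:
    """Build the ethtool command that sets RX/TX checksum offload on or off."""
    state = "on" if enable else "off"
    return ["ethtool", "-K", interface, "rx", state, "tx", state]


def show_command(interface: str) -> list:
    """Build the ethtool command that lists the current offload settings."""
    return ["ethtool", "-k", interface]


def apply_offload(interface: str, enable: bool = False) -> None:
    """Run the command for real (requires root and an existing interface)."""
    subprocess.run(offload_command(interface, enable), check=True)


if __name__ == "__main__":
    # Only print the commands here; call apply_offload("eth0") to run them.
    print(" ".join(show_command("eth0")))
    print(" ".join(offload_command("eth0")))
```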

2. Windows 2003 32-bit OS: 5-10% packet loss during moderate activity.

We are noticing a significant amount of packet loss on our Windows 2003 guests, even when sending or receiving only 100KB/s of traffic.

Tested with VMXNET3 and E1000 NICs. Tried disabling all Offloading, no improvement.

Cannot reproduce with Windows 2008 (R2 64-bit) or Linux guests.

cxo · ‎11-14-2011

TCPDUMP shows that the packet is leaving. Receiving systems (virtual - ESX(i) 4.1, ESX 4.0, ESXi 5.0 & physical) do not see the packets. Again, this only seems to occur with guests running on ESXi 5.0 with up-to-date VMtools and hardware revision 8. If any of these conditions is not met (i.e. only one or two of the three), the issue does not manifest itself.

I have tried to use the normal OS NIC driver (pcnet32), but that failed (I need to investigate it more however).

Odd thing: with some of my guest upgrades I used the non-interactive VMtools upgrade and then, when that completed, I upgraded the VMhardware. After doing so, vCenter reports that the VMtools are not up to date. I am going to try upgrading a VM that hasn't been touched (tools/hardware), running the tools upgrade in interactive mode and then doing the hardware upgrade. Will see if that has any effect.

Thanks for the input.

cxo · ‎11-14-2011

Did some more tests. Two VMs on the same ESXi host, with the same VMtools/VMhardware characteristics and the same OS patch level: one had this problem, the other did not. Maddening. Looking more closely at the TCPDUMP output (with the -vv option!) showed a bunch of UDP checksum errors. That bit of information helped with the searching, and this came up:

http://www.linuxquestions.org/questions/linux-networking-3/help-needed-disabling-tcp-udp-checksum-of...

(applied to CentOS also).

Doing this seems to have solved the problem. I don't have a reason (yet) for why it is needed on one guest and not on a similarly set up guest, but this client/server app is apparently working now.
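For anyone wanting to double-check the checksum errors that tcpdump -vv reports, here is a rough Python sketch of the UDP checksum calculation over IPv4 (RFC 768: a ones' complement sum over a pseudo-header, the UDP header, and the payload). The addresses, ports, and function names are made up for illustration. Keep in mind that with TX checksum offload enabled, a capture taken on the sending guest can show bad checksums simply because the NIC fills them in later, so a capture on the receiving side is the more telling one.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit ones' complement sum, padding odd-length data with a zero byte."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def _segment(src_ip, dst_ip, src_port, dst_port, payload, checksum):
    """Pseudo-header + UDP header + payload, with the given checksum field."""
    length = 8 + len(payload)
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, socket.IPPROTO_UDP, length))
    header = struct.pack("!HHHH", src_port, dst_port, length, checksum)
    return pseudo + header + payload


def udp_checksum(src_ip, dst_ip, src_port, dst_port, payload: bytes) -> int:
    """Checksum a sender should place in the UDP header."""
    data = _segment(src_ip, dst_ip, src_port, dst_port, payload, 0)
    checksum = ~ones_complement_sum(data) & 0xFFFF
    return checksum or 0xFFFF  # a computed 0 is transmitted as 0xFFFF


def verify_udp_checksum(src_ip, dst_ip, src_port, dst_port, payload, checksum):
    """True if the checksum is consistent with the datagram contents."""
    data = _segment(src_ip, dst_ip, src_port, dst_port, payload, checksum)
    return ones_complement_sum(data) == 0xFFFF


if __name__ == "__main__":
    # Hypothetical 10-byte request between two made-up hosts.
    args = ("192.0.2.10", "192.0.2.20", 40000, 50000, b"0123456789")
    value = udp_checksum(*args)
    print(hex(value), verify_udp_checksum(*args, value))
```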

rickardnobel · ‎11-14-2011

cxo wrote:
but this client/server app is apparently working now.

Nice to see that you got your systems running again. Strange phenomenon really.

My VMware blog: www.rickardnobel.se
