silverline
Contributor

Advanced Network Settings & Best Practices

Ever since my company began its strong push towards virtualization, I have struggled with a number of networking issues. Some of the specific problems I still see regularly are:

- Consistent receive discards on my ESXi hosts' vmnics

- Consistent transmit discards on my Cisco Nexus 1000v veth interfaces for the VMs

- ASP.NET errors indicating that connections sometimes time out. One example:

"[COMException - Incorrect function. (Exception from HRESULT: 0x80070001)]

[HttpException - An error occurred while communicating with the remote host. The error code is 0x80070001.]"

- Sometimes the NICs will spontaneously lose their default gateway or other configured settings.

I have read many online guides and tutorials from Cisco and VMware and have had several support cases open regarding these issues, but have never resolved any of them. I have also tweaked many of the settings on my own in hopes of alleviating the problem, but with the information that's out there it feels a lot like guessing at which values should be changed under the advanced network settings of the host. I've also rebuilt hosts from scratch several times just to be sure I didn't mess anything up with all of my tweaking.

My hope for this thread is to get other people's opinions on the best ESXi host and guest OS NIC adapter configuration to reduce packet loss as much as possible in a high-speed 10G environment. I would also like to share my specific setup and see if anyone has any recommendations for tweaks I could make to optimize traffic forwarding behavior.

In my primary VMware environment I have twelve ESXi hosts spread across two Cisco UCS B-Series blade chassis connected to Fabric Interconnects. For the sake of this discussion let's focus on our main production cluster, which is where I am most concerned. It is composed of six B230 blades, each with two 10-core processors, 128GB of RAM, and an M81KR "Palo" Cisco NIC. Currently I am running approximately thirty VMs across these six hosts. 95% of these are Windows Server 2008 R2 and the rest are some sort of Linux virtual appliance.

During my troubleshooting I have tried changing the LRO settings for vmxnet3 (net.Vmxnet3HwLRO and others) as well as many of the offloading settings in the adapter configuration in Windows. I have not been able to come to any conclusive result as to which is the optimal setting. Some of Cisco's documents for their voice applications indicate that LRO should be disabled on the host, but other documents I have read indicate that disabling LRO will cause CPU spikes. VMware's networking document indicates that a lack of CPU cycles can be a cause of the vmnic receive discards I've been seeing, so obviously I'd like to keep CPU spikes down if possible, but not if it leads to the erratic traffic forwarding behavior Cisco describes.
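For anyone who wants to try the same experiments, the host-side LRO toggles can be read and changed from the ESXi shell. Here is a minimal sketch of the commands I mean (the 0 below simply shows how to disable hardware LRO for vmxnet3, not a recommendation either way):

# show the current vmxnet3 LRO settings on the host
esxcli system settings advanced list -o /Net/Vmxnet3HwLRO
esxcli system settings advanced list -o /Net/Vmxnet3SwLRO

# disable hardware LRO for vmxnet3 guests (use -i 1 to turn it back on)
esxcli system settings advanced set -o /Net/Vmxnet3HwLRO -i 0

I believe affected VMs need a power cycle to pick the change up, but verify on a non-production host first.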

So I am somewhat at a loss as to how to proceed and am turning to the community for help.

Does anyone have any recommendations for how to configure all of the settings under Net in the ESXi host configuration? Does anyone have a setup similar to mine with a configuration that is working for them? Has anyone seen similar dropped packets before and found a root cause? Are regularly occurring dropped packets just normal with VMware and something I have to learn to live with? And what about these ASP.NET errors, which were never present in the physical world?

Any advice appreciated.

Thanks!

7 Replies
silverline
Contributor

Bump. Sorry if this was a long-winded post, but I am certain others have to be seeing similar behavior and I really don't know where else to turn.

Would really appreciate everyone's thoughts on the matter.

Six9s
Contributor

I'm not experiencing all of those symptoms, but we do see millions of receive discards daily on one of the two management NICs on our ESXi (5.0 U1) BL490c G7 hosts behind FlexFabric interconnects. We have another chassis with Flex-10 interconnects and G6 hosts that does not experience this phenomenon. Upstream of our FlexFabric modules are two Cisco 6513s with 1Gb Ethernet blades. The ESXi hosts use standard vSwitches.

I'd appreciate it if anyone would share insight into this.

Thanks!

vIBMer
Contributor

Kinsei, re:

"This is composed of six (Cisco) B230 blades each with two 10-core processors, 128GB of RAM, and an M81KR "Palo" Cisco NIC."

I have a client who is seeing this exact same problem on IBM blades with Emulex NICs. Another thread shows the problem occurring with HP and Emulex as well.

http://communities.vmware.com/message/1872894#1872894

Have you solved the problem yet?

Best regards

Josh26
Virtuoso

Hi,

I can certainly say that, with regard to LRO, disabling it within the guest OS has become standard on our Linux installations.

Refer here:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=102751...

I can't conclusively say anything about later versions, but ESXi 5.0 RTM definitely suffered from this issue on several of our servers, even though the KB article states it was fixed there.

Following those instructions has become standard practice on everything I build, and the number of networking issues has dropped significantly since.
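For reference, the change itself is done inside the Linux guest with ethtool; roughly like this, where eth0 is just a placeholder for whichever interface is backed by vmxnet3 (you still need to persist it through your distro's network scripts):

# check whether LRO is currently enabled on the vmxnet3 interface
ethtool -k eth0 | grep -i large

# turn LRO off for the running system
ethtool -K eth0 lro off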

Edit: I have often found "disappearing default gateways" to be related to the removal of legacy NICs. Example:

1. Build a VM using an e1000 NIC.

2. Remove that NIC and add a vmxnet3 NIC.

3. Ignore the warning from Windows about duplicate IP addresses on a legacy NIC.

Go through this sequence and, for some reason, at some point Windows is likely to drop the NIC settings. The fix is to remove the disconnected e1000 NIC from the "hidden devices" view in Device Manager after removing it from the VM.
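If it helps anyone, this is the usual trick for exposing the ghost adapter so it can be uninstalled (run from an elevated command prompt inside the guest):

:: make Device Manager show devices that are no longer present
set devmgr_show_nonpresent_devices=1
start devmgmt.msc
:: then View > Show hidden devices, and uninstall the greyed-out e1000 adapter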

None of this is official from VMware - it's just my experience with these issues.

silverline
Contributor

I have not solved the problem yet.

I recently upgraded our Nexus environment to the latest VNMC, 1000v, VSG, and VEM code for our ESX hosts.

This seems to have made a dramatic improvement in perceptible performance, and certain errors have gone away.

The interface errors continue to show up every day, so there is still some unknown problem.

I really wish that VMware would focus some energy on explaining this and helping their customers diagnose why it happens.

Also, in response to the above post: for non-Linux VMs, my perception is that the network is faster with LRO enabled and fewer packets are lost/dropped. But that is just based on some basic testing I performed. More testing needs to be done before I could make a general recommendation.

VMware, are you there?

silverline
Contributor

I finally solved this for my environment.

The solution was twofold...

The first thing I needed to do was edit the VMware adapter policy on my Cisco UCS to increase the ring 1 buffer size from 256 to 4096 (the maximum value) and enable RSS (Receive Side Scaling).

This immediately stopped all of the discards happening on the vmnic interfaces of the ESX servers and drastically reduced the overall drops I was seeing across my entire environment. It seems that Cisco's default adapter policy has a very small buffer, so bursty traffic can overload it and cause drops. Changing this value, along with enabling RSS to allow more cores to handle the workload, seemed to remove that bottleneck and push packets to the actual VMs as quickly as possible.

After a little while I did notice a side effect of this change: I was still seeing some periods of dropped-packet spikes, but now the drops were showing up as Tx drops on our Nexus 1000v ports and Rx drops on the OS adapter interface. What this told me was that the OS was not processing the incoming traffic quickly enough.

To address this I logged into the OS and changed the advanced settings for the VMXNET3 adapter to the following:

RSS - Enabled

Rx Ring #1 Size - 4096

Small Rx Buffers - 8192

After tweaking these settings, all discards and dropped packets in my environment have almost entirely ceased. Where I was sometimes seeing tens of thousands of drops each day, I am now seeing fewer than 50.
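For what it's worth, on Server 2008 R2 I set these by hand on the adapter's Advanced tab. On Server 2012 and later the same properties can be scripted; a rough PowerShell sketch, assuming the vmxnet3 driver exposes the same display names (the adapter name "Ethernet0" is just a placeholder):

# list the advanced properties first to confirm the exact display names for your driver
Get-NetAdapterAdvancedProperty -Name "Ethernet0"

# enable RSS and apply the ring/buffer values I ended up with
Enable-NetAdapterRss -Name "Ethernet0"
Set-NetAdapterAdvancedProperty -Name "Ethernet0" -DisplayName "Rx Ring #1 Size" -DisplayValue "4096"
Set-NetAdapterAdvancedProperty -Name "Ethernet0" -DisplayName "Small Rx Buffers" -DisplayValue "8192"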

These are some of the key takeaways I learned during this process:

- If you are seeing drops on the vmnic interfaces, it is most likely due to a buffering issue somewhere in the path. In my case it was at the hardware level. Working with VMware support we found a counter called rx_no_buffs on the vmnic interface which perfectly correlated with the drops we were seeing. We traced this to an rq_drops counter on the Cisco Palo interface, which allowed us to see that it was indeed the hardware adapter that was dropping the packets (the commands I used to check this are sketched after this list).

- RSS allows more than one core in the system to process the TCP stack. In a VM that has a lot of traffic to handle, this can really help clear the buffers more quickly. Even if you have huge buffers, they will still fill up if the OS is not clearing them quickly enough. This is also something to be conscious of when you are using CPUs that have many cores with low clock rates, since without RSS even less CPU is dedicated to processing the network traffic.

- Packet drops of this magnitude are NOT normal. If you see this in your environment, keep fighting until you figure out what is causing it. Several issues in our network have improved since this discovery. VMware should be able to tell you what is causing the drops if you stay on them long enough to make them examine the adapters.
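For anyone chasing the same counters, these are roughly the commands I mean. On the ESXi host (vmnic2 is just a placeholder, and counter names vary by driver; on the Cisco enic driver the one that matched my drops was rx_no_bufs):

# dump the physical NIC's driver statistics and look for buffer-exhaustion counters
ethtool -S vmnic2

# esxtop's network view (press n) shows %DRPRX / %DRPTX per port, which makes it easy
# to see whether the drops line up with traffic bursts
esxtop

And inside the Windows guest, RSS has to be allowed at the TCP stack level as well as on the vmxnet3 adapter itself; this is how I checked and enabled it:

:: show the current global TCP settings, then enable receive-side scaling
netsh int tcp show global
netsh int tcp set global rss=enabled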

silverline
Contributor

One more point - I had an issue with the network not coming up after a reboot. I am unsure whether it is the same as the disappearing default gateway issue you list or not.

But I was able to find my problem and solution here:

https://social.technet.microsoft.com/Forums/en-US/winservergen/thread/8acb7cd1-7028-4ffe-86c9-eb4304...

http://lyngtinh.blogspot.com/2011/12/how-to-disable-autoconfiguration-ipv4.html
