1q2w3e4r
Contributor

Troubleshooting massive vmnic interface discards

I've just checked my SolarWinds Top 10 and found that each of my five ESXi 5 servers is generating up to 500 million (!) receive discards per day on its vmnic interfaces (and 0 transmit discards).

I have HP blades with Flex Fabric connected to Nexus (all 10Gb), although each blade's 10Gb is shared between 3 Ethernet NICs and 1 HBA.

Can anyone recommend where to begin to figure out why this is happening?

29 Replies
silverline
Contributor

I have posted a lot about my setup and what I've already tried in this thread a while ago:  http://communities.vmware.com/message/2044097#2044097

If you like I could copy and paste it here but don't want to spam.

I have updated drivers and versions countless times with the problem never going away.  That fnic driver is new but I am updating my hosts to it as we speak.  I'll let you know if I see any difference but I don't have much hope.  Especially since the enic interfaces are where I am seeing the drops.

kastlr
Expert

Hi NattyG,

my question regarding Cisco Drivers was addressed to Kinsei.

But it would be really helpful if you could provide more details about what exactly solved your problem.

E.g., VMware patch xxxx or Emulex driver/FW/BIOS yyyy.

This would at least help other VMware users with a similar environment.

Regards,

Ralf



Hope this helps a bit.
Greetings from Germany. (CET)

silverline
Contributor

Just wanted to update that the updated drivers had no effect on the problem.

I still receive discards on all of my hosts without much reason.

Any other suggestions people have?

vIBMer
Contributor

@NattyG,

Re: "...problem to be with the Emulex drivers, we have currently a driver on our environment and after installing this on 3 servers we have NO discards, the other 5 have hundreds of millions, even if nothing on them."

Which driver did you use to fix this?  Where did you obtain it?

Thanks in advance!

joerockt
Contributor

I too would like to know what driver you are using, NattyG.  I have the same setup (c7000 with Flex Fabric modules) and just recently added all my new ESXi 5 hosts to Solarwinds and am seeing these receive discards as well.

NattyG
Contributor

Hi,

We are using HP Blades BL460c G7.

Driver is: 4.1.324.48 - release date 22-08-2011

Firmware is: 4.1.402.8

This driver is causing us problems. VMware gave us a debug driver, which is not available for production environments, and it showed that there is a bug in the driver we are using. We are currently waiting on Emulex to release a new driver to fix this problem.

Hope this helps.

silverline
Contributor

I finally solved this for my environment.

The solution was twofold.

The first thing I needed to do was edit the VMWare Adapter Policy on my Cisco UCS to increase the ring 1 buffer size from 256 to 4096 (the maximum value) and enable RSS (Receive Side Scaling).

This immediately stopped all of the discards on the vmnic interfaces of the ESX server and drastically reduced the overall drops I was seeing in my entire environment. It seems that Cisco's default adapter policy has a very small buffer, so bursty traffic can overload it and cause drops. Changing this value, along with enabling RSS to allow more cores to handle the workload, seemed to remove that bottleneck and push the packets as quickly as possible to the actual VMs.

After a little while I did notice a side effect of this change: I was still seeing occasional spikes of dropped packets, but now these drops were showing as Tx drops on our Nexus 1000v ports and Rx drops on the OS adapter interface. What this meant to me was that the OS was not processing the incoming traffic quickly enough.

To improve this, I logged into the OS and changed the advanced settings of the VMXnet3 adapter to the following:

RSS - Enabled

RX Ring #1 Size - 4096

Small Rx Buffers - 8192

After tweaking these settings, discards and dropped packets in my environment have almost entirely ceased. Where I was sometimes seeing tens of thousands of drops each day, I am now seeing fewer than 50.

These are some of the key takeaways I learned during this process:

- If you are seeing drops on the vmnic interfaces, it is most likely due to a buffering issue somewhere in the path. In my case it was at the hardware level. Working with VMware support, we found a counter, rx_no_buffs, on the vmnic interface which correlated perfectly with the drops we were seeing. We traced this to the rq_drops counter on the Cisco Palo interface, which showed that it was indeed the hardware adapter dropping the packets.

- RSS allows more than one core in a system to process the TCP stack. In a VM that has a lot of traffic to handle, this can really help to clear the buffers more quickly. Even if you have huge buffers, they will still fill up if the OS is not clearing them quickly enough. This is also something to be conscious of when you are using CPUs that have many cores with low clock rates, since without RSS even less CPU is dedicated to processing the network traffic.

- Packet drops of this amount are NOT normal.  Keep fighting until you figure out what is causing it if you see this in your environment.  Several issues in our network have improved since this discovery.  VMware should be able to tell you what is causing these drops if you keep on them long enough to make them examine the adapters.
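
For anyone who wants to look for the same kind of counter on their own hosts, here is a rough sketch of a shell filter for `ethtool -S` (per-NIC statistics) output. The function name `filter_drop_counters` is made up for illustration, and counter names differ between NIC drivers, so treat the match pattern as an assumption and adjust it for your adapter.

```shell
# Hypothetical helper: reads `ethtool -S <nic>` output on stdin and prints
# only the non-zero counters whose name looks drop-related.
# The pattern (drop | discard | no_buf) is an assumption; drivers name
# their counters differently.
filter_drop_counters() {
    awk -F: '
        tolower($1) ~ /drop|discard|no_buf/ && ($2 + 0) > 0 {
            sub(/^[ \t]+/, "")   # trim the indentation ethtool adds
            print
        }'
}

# Usage on the ESXi host (vmnic0 is just an example NIC name):
#   ethtool -S vmnic0 | filter_drop_counters
```

Running it twice a few minutes apart and comparing the values shows whether a counter is still climbing or is left over from before a change.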

joerockt
Contributor

Thank you for posting this info, kinsei. So the question for me is how I'm going to make those changes to the Emulex adapters in my Bladecenter.

Do all of your VMs have VMXnet3 adapters? I have a mix at the moment, as I migrated everything to ESXi 5 just several months ago. I'm wondering if the same changes can be applied to the E1000 and VMXnet2 adapter types.

silverline
Contributor

On your first question -- I would contact IBM since I assume you are using the Bladecenter configuration product to configure your server environment.  They should be able to tell you where the Network Adapter hardware configurations are made.  Ask them about buffer sizes and RSS.  Sorry I cannot offer any more help but I've never worked with BladeCenter before.

For your second point - every single VM that we have which supports VMXNet3 is configured for it. I would highly recommend doing this in your environment as well. It is as simple as taking a quick outage to delete the current adapter, add a new one, and reconfigure the IP address.

VMXNet3 brings MANY enhancements. RSS is not even supported with VMXNet2. Nor is LRO in Linux. I think anything which supports VMXNet2 will also support VMXNet3, since both use the drivers from VMware Tools.

Using an E1000 adapter will limit you to a gigabit link, which increases the likelihood of contention and buffer overflows. This is sometimes required depending on the type of VM, but I have not experienced issues with the VMs that require it.

Smilyanski
Contributor

Had the same problem on HP blades BL460 Gen8 with 554FLB adapter a.k.a. HP Emulex 10GbE CNA. Upgrading driver to be2net version 4.2.327.0 fixed the issue for me. Download it here:

http://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=21422&deviceCat...

Running previous version 4.1.334.0 caused discards.

While updating the driver, it makes sense to update the firmware to 4.1.450.16 as well (although the driver is definitely the fix, not the firmware):

http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareIndex.jsp?lang=en&cc=us&prodNameId=5215390&...

To check what driver and FW you are running, ssh into the host and type:

ethtool -i vmnic0
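
If you have several uplinks to check, a small helper can condense that output to one line per NIC. This is only a sketch: `nic_versions` is a hypothetical name, and it assumes the usual `driver:`, `version:` and `firmware-version:` lines that `ethtool -i` prints.

```shell
# Hypothetical helper: reads `ethtool -i <nic>` output on stdin and prints
# "<driver> <driver version> <firmware version>" on a single line.
nic_versions() {
    awk -F': *' '
        $1 == "driver"           { drv = $2 }
        $1 == "version"          { ver = $2 }
        $1 == "firmware-version" { fw  = $2 }
        END { print drv, ver, fw }'
}

# Usage on the ESXi host (the NIC names are examples, list your own):
#   for nic in vmnic0 vmnic1; do ethtool -i "$nic" | nic_versions; done
```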

Hope it will help someone
