Enthusiast

Very High pNIC inbound packet loss rate counters... host issue or switch issue?

Hi,

I've just rebuilt my all-flash vSAN cluster on version 6.7 and swapped to 10Gb NICs and switches, and I am seeing very high pNIC inbound packet loss rate counters on all my hosts (under Host > Monitor > Performance > vSAN - Physical Adapters).

Packet loss counters range from 40% to 600% :smileycry:

I've struggled to find anything during my searches to help point me in the direction of where to look.

This is a 4 node home lab so unfortunately I don't have the luxury of logging a support ticket.

Can anyone help point me in the direction of where to troubleshoot further? Hosts? Switches?...etc.

Thanks!

In case the specs of my environment help, I've listed them below... all items are on the HCL.

2 x Dell 8024F switches stacked with 4 ports

2 x 10Gb DAC cables from NICs to switches (1 cable in each NIC goes to each switch)

4 x HP DL20 Gen9, each of which has:

HPE 546SFP+ Dual Port 10Gb NIC

HPE H240 HBA

HPE 400GB SAS 12G SSD cache disk (HPE branded HGST HUSMMxxxx)

HPE 600GB SATA SSD capacity disk (HPE branded Intel S3500)

vDS/VMkernel configuration for vSAN:

Dedicated port group, with own VLAN

MTU 9000 (tried 1500 to start with... my environment supports both; a quick way to verify this end to end is sketched after this list)

Route based on physical NIC load

Both NIC 1 and 2 are active uplinks
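
For anyone wanting to sanity-check a setup like this from the ESXi shell, one quick test is to ping a neighbouring host's vSAN VMkernel port with don't-fragment set. This is only a minimal sketch - the vmk name and peer IP below are placeholders, substitute your own:

# Confirm which VMkernel interface carries vSAN traffic
esxcli vsan network list

# Test MTU 9000 end to end with don't-fragment set
# (8972-byte payload = 9000 minus IP and ICMP headers;
#  vmk1 and 192.168.10.12 are examples only)
vmkping -I vmk1 -d -s 8972 192.168.10.12

If the large ping fails but a default-size vmkping works, something in the path (vmk, vDS, or switch port) is not actually configured for jumbo frames.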

11 Replies
Contributor

I just had the exact same experience as you with an almost identical hardware configuration.  I found the only way to stop the high ingress packet loss was to change the default distributed port group teaming and failover settings from "Route based on originating virtual port" to "Use explicit failover order" with a defined Active uplink and all others as Standby.

VMware Employee

Which server brand, switches, and NICs are you using, camealy?

Contributor

HP servers, multiple brands of switches (tried during testing/troubleshooting), and multiple brands of NICs (ending with Intel X520/X540), so it didn't seem to be hardware/driver related.

Enthusiast

Thanks, I'll give it a try and see how it goes.

I know it has to be a switch/hardware thing, as I have the exact same config VMware-wise at work for our massive vSAN on VxRail clusters, and the counters flatline at 0% day in, day out!

(The only difference is Dell server kit vs the HPE I have, and Cisco network gear versus my Dell)

Surely there has to be a logical solution, or some simple setting we've forgotten to configure on the switches?

Contributor

Were you experiencing actual issues such as latency or user complaints? We see packet loss percentages of 600% as well but do not know whether it correlates with actual problems.

VMware Employee

Hi!

Is it 600% (per cent) or 600‰ (per mille)?

What is your virtual/physical network configuration?

Contributor

Got exactly the same issue here. I've set the vSAN dSwitch to "Use explicit failover order", but there was no change.

Using Intel X552 (onboard on a Supermicro X10 motherboard) on a Netgear 8-port 10GbE switch.

VMware Employee

The first step I would take is looking at the firmware AND driver for the NICs. There is no NIC category on the vSAN VCG, so you should be looking at the vSphere VCG for this. It is VERY important to have the firmware AND driver at matching levels. This is known as the FW/driver combination, and a mismatch here will most likely cause you grief (I see this almost weekly... sadly). What you need to avoid is having the latest driver paired with firmware that is two years old. The vSphere VCG will point you to the recommended combination.

VMware Compatibility Guide - I/O Device Search
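
If it helps, the installed driver and firmware versions can be read straight from the ESXi shell; a minimal sketch, with vmnic0 as an example uplink (repeat for each 10Gb port):

# List all physical NICs with their driver names
esxcli network nic list

# Show the driver version and firmware version for one uplink
# (compare both against the combination recommended on the vSphere VCG)
esxcli network nic get -n vmnic0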

The other problem I've been running into lately is the switch. Remember that you will be pushing storage traffic through whatever switch you have: if you wouldn't put traditional storage traffic through the switch you have vSAN on, you probably shouldn't be using it for vSAN traffic either. Be aware of switches with high port-to-port latency and low buffers. I understand that networking is often a black box for vSphere/storage admins, but please take it into consideration and size properly according to your needs.

Contributor

After upgrading my home lab recently, I also have high inbound packet loss counters and TCP retransmits.

NICs: Intel X520 10Gb

Switch: MikroTik 10Gb

Troubleshooting so far: upgraded the firmware and driver on the NIC for one of the hosts, which does not seem to have helped.

I have a colleague who has the same problem with his home lab, and he is running HP ProLiant servers.

NICs: Some Broadcom variant, I think.

Switch: HP

Did any of you guys solve it? I also found this thread: https://www.reddit.com/r/vmware/comments/93fbuw/massive_vsan_latency_increase_on_upgrade_to_67/

Any help with where to continue troubleshooting is appreciated.

Champion

I'm also seeing the same issue on HPE hardware and X710 NICs:

[screenshot of the vSAN physical adapter packet loss counters]

Lars

VMware Employee

Hello,

Just an update on this: it is a known issue with a fix incoming. I'm also fairly sure public documentation is in the works, and I will keep an eye on that for updates.

So that you are aware: this is cosmetic and non-impactful. If you were actually experiencing that rate of loss, it would be very apparent from performance issues (and if you were getting packet drops due to network issues/misconfiguration, they likely wouldn't be observed in just one place, e.g. pNicRxPortDrops).

Reliable Network Connectivity in Hyper-Converged Environments - Virtual Blocks
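
For anyone who wants to double-check on their own hosts that the loss really is cosmetic, the physical NIC's own counters can be read from the ESXi shell; a minimal sketch, with vmnic0 as an example uplink:

# Raw statistics for one physical uplink
# (if the loss were real, "Receive packets dropped" would be climbing
#  in proportion to "Packets received"; exact field names can vary by driver)
esxcli network nic stats get -n vmnic0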

Bob