I've just rebuilt my all-flash vSAN cluster on version 6.7 and swapped to 10Gb NICs and switches, and I am now seeing very high pNIC inbound packet loss rate counters on all my hosts (under Host > Monitor > Performance > vSAN - Physical Adapters).
Packet loss counters range from 40% to 600%.
I've struggled to find anything during my searches to help point me in the direction of where to look.
This is a 4 node home lab so unfortunately I don't have the luxury of logging a support ticket.
Can anyone point me toward where to troubleshoot further? Hosts? Switches? Something else?
In case the specs of my environment help I've got them below... all items are listed on the HCL.
2 x Dell 8024F switches stacked with 4 ports
2 x 10Gb DAC cables from NICs to switches (1 cable in each NIC goes to each switch)
4 x HP DL20 Gen9, each of which has:
HPE 546SFP+ Dual Port 10Gb NIC
HPE H240 HBA
HPE 400GB SAS 12G SSD cache disk (HPE branded HGST HUSMMxxxx)
HPE 600GB SATA SSD capacity disk (HPE branded Intel S3500)
vDS/VMkernel configuration for vSAN:
Dedicated port group, with own VLAN
MTU 9000 (Tried 1500 to start with... my environment supports both)
Route based on physical NIC load
Both NIC 1 and 2 are active uplinks
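Since the port group runs at MTU 9000, one thing worth ruling out is a jumbo frame mismatch somewhere on the path. The usual check is a `vmkping` with the don't-fragment flag and the largest payload that still fits in one frame. Here is a rough Python sketch of that arithmetic; the `vmk2` interface name and the `192.0.2.11` address are placeholders, not values from this environment.

```python
# Largest ICMP echo payload that fits in one IPv4 packet without fragmentation.
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Return the biggest ICMP payload that fits in a single packet of size `mtu`."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def vmkping_command(vmk: str, target_ip: str, mtu: int) -> str:
    """Build a vmkping command line with don't-fragment (-d) set.

    `vmk` and `target_ip` are placeholders for your own vSAN vmknic
    and the vSAN IP of another host in the cluster.
    """
    return f"vmkping -I {vmk} -d -s {max_ping_payload(mtu)} {target_ip}"


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(vmkping_command("vmk2", "192.0.2.11", mtu))
```

If the 8972-byte ping fails while the 1472-byte one works, something between the two hosts is not passing jumbo frames.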
I just had the exact same experience as you with an almost identical hardware configuration. I found the only way to stop the high ingress packet loss was to change the default distributed port group teaming and failover settings from "Route based on originating virtual port" to "Use explicit failover order" with a defined Active uplink and all others as Standby.
My setup: HP servers, multiple brands of switches (tried during testing/troubleshooting), and multiple brands of NICs (ending with Intel x520/x540). The problem didn't seem to be hardware/driver related.
Thanks, I'll give it a try and see how it goes.
I know it has to be a switch/hardware thing, as I have the exact same VMware config at work for our massive vSAN on VxRail clusters, and the counters flatline at 0% day in, day out!
(The only difference is Dell server kit vs the HPE I have, and Cisco network gear versus my Dell)
Surely there has to be a logical solution, or some simple setting we've forgotten to configure on the switches?
The first step I would take is to look at the firmware AND driver for the NICs. There is no NIC option on the vSAN VCG, so you should be looking at the vSphere VCG for this. It is VERY important to have the firmware AND driver at matching levels. This is known as the FW/driver combination, and a mismatch here will most likely cause you grief (I see this almost weekly, sadly). What you need to avoid is having the latest driver with firmware that is 2 years old. The vSphere VCG will point you to the recommended combination.
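To make the FW/driver comparison concrete, here is a minimal Python sketch that pulls the driver name, driver version, and firmware version out of text shaped like the `Driver Info` section of `esxcli network nic get -n vmnicX` and compares them to a recommended pair. The sample output values and the `RECOMMENDED` table are made up for illustration; the real pair has to come from the vSphere VCG entry for your NIC.

```python
# Illustrative text in the shape of `esxcli network nic get -n vmnicX`.
# The values are invented for this example.
SAMPLE_OUTPUT = """\
   Driver Info:
         Bus Info: 0000:03:00:0
         Driver: ixgben
         Firmware Version: 0x800007f4
         Version: 1.7.1.16
   Link Detected: true
"""

# Hypothetical table: fill this in from the vSphere VCG listing for your NIC.
RECOMMENDED = {"ixgben": {"driver": "1.7.1.16", "firmware": "0x80000835"}}


def parse_driver_info(text):
    """Collect 'Key: value' lines and return the driver name and versions."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:  # skip section headers such as 'Driver Info:'
            fields[key] = value
    return {
        "name": fields.get("Driver"),
        "driver": fields.get("Version"),
        "firmware": fields.get("Firmware Version"),
    }


def check_combo(info, recommended):
    """Return a list of mismatches between installed and recommended versions."""
    expected = recommended.get(info["name"])
    if expected is None:
        return [f"no recommended entry for driver {info['name']}"]
    problems = []
    for part in ("driver", "firmware"):
        if info[part] != expected[part]:
            problems.append(
                f"{part}: installed {info[part]}, recommended {expected[part]}"
            )
    return problems


if __name__ == "__main__":
    for problem in check_combo(parse_driver_info(SAMPLE_OUTPUT), RECOMMENDED):
        print(problem)
```

An empty list means both halves of the combination match the table; anything else tells you which side is out of step.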
The other problem I've been running into lately is the switch. Remember that you will be pushing storage traffic through whatever switch you have. If you wouldn't put traditional storage traffic through the switch, you probably shouldn't be using it for vSAN traffic either. Be wary of switches with high port-to-port latency and small buffers. I understand that networking is often a black box for vSphere/storage admins, but please take it into consideration and size properly according to your needs.
After upgrading my home lab recently, I also have high inbound packet loss counters and TCP retransmits.
NICs: Intel x520 10Gb
Switch: Mikrotik 10Gb
Troubleshooting so far: I upgraded the firmware and driver on the NIC of one of the hosts, which does not seem to have helped.
I have a colleague who has the same problem with his home lab, and he is running HP ProLiant servers.
NICs: Some Broadcom variant, I think.
Did any of you solve it? I also found this thread: https://www.reddit.com/r/vmware/comments/93fbuw/massive_vsan_latency_increase_on_upgrade_to_67/
Any help with where to continue troubleshooting is appreciated.
Just an update on this: it is a known issue with a fix incoming. I'm also fairly sure public documentation is in the works, and I will keep an eye on that for updates.
So that you are aware: this is cosmetic and non-impactful. If you were really experiencing that rate of loss, it would be very apparent from performance issues (and if you were getting packet drops due to network issues or misconfiguration, they likely wouldn't be observed in just one place, e.g. pNicRxPortDrops).
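As a quick sanity check on the numbers reported in this thread: a loss rate is dropped packets divided by total packets, so it can never exceed 100%, and a reading of 600% cannot be a real measurement. A minimal Python sketch of that reasoning follows; the function and parameter names are my own for illustration, and this is not the formula the vSAN performance service actually uses.

```python
def rx_loss_percent(rx_packets: int, rx_dropped: int) -> float:
    """Share of inbound packets that were dropped, in percent.

    Assumes `rx_packets` counts delivered packets only, so the total
    seen by the NIC is delivered + dropped.
    """
    total = rx_packets + rx_dropped
    if total == 0:
        return 0.0
    return 100.0 * rx_dropped / total


def is_plausible(reported_percent: float) -> bool:
    """A real loss rate has to fall between 0% and 100%."""
    return 0.0 <= reported_percent <= 100.0


if __name__ == "__main__":
    print(rx_loss_percent(990, 10))  # 10 drops out of 1000 packets seen
    for reported in (40.0, 600.0):
        print(reported, is_plausible(reported))
```

A value that fails `is_plausible` points at how the counter is computed or displayed rather than at the network itself.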