I've just rebuilt my all-flash vSAN cluster on version 6.7 and swapped to 10Gb NICs and switches. I am now seeing very high pNIC inbound packet loss rate counters on all my hosts (under Host, Monitor, Performance, vSAN - Physical Adapters).
Packet loss counters range from 40% to 600%.
I've struggled to find anything during my searches to help point me in the direction of where to look.
This is a 4 node home lab so unfortunately I don't have the luxury of logging a support ticket.
Can anyone help point me in the direction of where to troubleshoot further? Hosts? Switches?...etc.
In case the specs of my environment help I've got them below... all items are listed on the HCL.
2 x Dell 8024F switches stacked with 4 ports
2 x 10Gb DAC cables from NICs to switches (1 cable in each NIC goes to each switch)
4 x HP DL20 Gen9, each of which has:
HPE 546SFP+ Dual Port 10Gb NIC
HPE H240 HBA
HPE 400GB SAS 12G SSD cache disk (HPE branded HGST HUSMMxxxx)
HPE 600GB SATA SSD capacity disk (HPE branded Intel S3500)
vDS/VMkernel configuration for vSAN:
Dedicated port group, with own VLAN
MTU 9000 (Tried 1500 to start with... my environment supports both)
Route based on physical NIC load
Both NIC 1 and 2 are active uplinks
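A quick way to sanity-check the MTU 9000 setting end to end is a don't-fragment vmkping between the vSAN VMkernel ports. The sketch below only works out the payload size to use (MTU minus the 20-byte IPv4 header and the 8-byte ICMP header) and builds the command string. The interface name vmk1 and the address 192.0.2.12 are made-up placeholders, not values from this environment.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_icmp_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def vmkping_command(vmk: str, target_ip: str, mtu: int) -> str:
    """Build a vmkping command line for a don't-fragment test at the given MTU.

    -I selects the VMkernel interface, -s sets the payload size,
    -d sets the don't-fragment bit.
    """
    return f"vmkping -I {vmk} -s {max_icmp_payload(mtu)} -d {target_ip}"
```

Calling vmkping_command("vmk1", "192.0.2.12", 9000) gives a command with -s 8972. If that ping fails while the 1500 equivalent (-s 1472) works, something in the path is probably not passing jumbo frames.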
I just had the exact same experience as you with an almost identical hardware configuration. I found the only way to stop the high ingress packet loss was to change the default distributed port group teaming and failover settings from "Route based on originating virtual port" to "Use explicit failover order" with a defined Active uplink and all others as Standby.
This was with HP servers, multiple brands of switches (tried during testing/troubleshooting), and multiple brands of NICs (ending with Intel X520/X540), so it didn't seem to be hardware/driver related.
Thanks, I'll give it a try and see how it goes.
I know it has to be a switch/hardware thing, as I have the exact same config VMware-wise at work for our massive vSAN on VxRail clusters and the counters flatline at 0% day in, day out!
(The only difference is Dell server kit vs the HPE I have, and Cisco network gear versus my Dell)
Surely there has to be a logical solution, or some simple setting we've forgotten to configure on the switches?
The first step I would take is looking at the firmware AND driver for the NICs. There is no NIC option on the vSAN VCG, so you should be looking at the vSphere VCG for this. It is VERY important to have the firmware AND driver at the same level. This is known as the FW/driver combination, and a mismatch here will most likely cause you grief (I see this almost weekly... sadly). What you need to avoid is having the latest driver and a firmware that is 2 years old. The vSphere VCG will point you to the recommended combination.
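To make the FW/driver combination idea concrete, here is a minimal sketch of the comparison, assuming you have copied the supported pairs for your NIC from the vSphere VCG by hand. The version strings below ("driver-A", "fw-1" and so on) are placeholders, not real VCG entries, and the function names are hypothetical.

```python
from typing import NamedTuple


class NicVersions(NamedTuple):
    driver: str
    firmware: str


# Placeholder values only: replace with the real pairs listed on the
# vSphere VCG entry for your NIC model and ESXi release.
APPROVED_COMBINATIONS = {
    NicVersions(driver="driver-A", firmware="fw-1"),
    NicVersions(driver="driver-B", firmware="fw-2"),
}


def check_combination(installed: NicVersions,
                      approved=APPROVED_COMBINATIONS) -> str:
    """Classify an installed driver/firmware pair against the approved list."""
    if installed in approved:
        return "ok"
    # The driver is listed, but not together with this firmware.
    if any(c.driver == installed.driver for c in approved):
        return "firmware mismatch"
    # The firmware is listed, but not together with this driver.
    if any(c.firmware == installed.firmware for c in approved):
        return "driver mismatch"
    return "unlisted"
```

For example, check_combination(NicVersions("driver-B", "fw-1")) returns "firmware mismatch": each version appears on the list, but not as a pair, which is exactly the situation to avoid.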
The other problem I've been running into lately is the switch. Remember you will be pushing storage through whatever switch you have. If you wouldn't place traditional storage traffic through the switch you have vSAN on, you probably shouldn't be using it for vSAN traffic either. Be aware of switches with high port-to-port latency and low buffers. I understand that networking is often a black box for vSphere/storage admins, but please take it into consideration, and size properly according to your needs.
After upgrading my home lab recently I also have high inbound packet loss counters and TCP retransmits.
NICs: Intel x520 10GB
Switch: Mikrotik 10GB
Troubleshooting so far: I upgraded the firmware and driver on the NIC for one of the hosts, which has not seemed to help.
I have a colleague who has the same problem with his homelab, and he is running HP ProLiant servers.
NICs: Some Broadcom variant, I think.
Did any of you solve it? I also found this thread: https://www.reddit.com/r/vmware/comments/93fbuw/massive_vsan_latency_increase_on_upgrade_to_67/
Any help with where to continue troubleshooting is appreciated.
Just an update on this: this is a known issue with a fix incoming. I'm also fairly sure public documentation is in the works, and I will keep an eye on that for updates.
So that you are aware: this is cosmetic and non-impactful. If you were experiencing that rate of loss, it would be very apparent from performance issues (and if you were getting packet drops due to network issues/misconfiguration, they likely wouldn't be observed in just one place, e.g. pNicRxPortDrops).
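As a rough illustration of why figures like 600% point to a reporting problem rather than real loss: the thread doesn't say how the UI derives its counter, so the sketch below just uses the generic definition of a loss rate (dropped packets as a share of everything that arrived), which can never exceed 100%. The function names are hypothetical.

```python
def loss_percent(dropped: int, delivered: int) -> float:
    """Share of inbound packets that were dropped, as a percentage."""
    total = dropped + delivered
    if total == 0:
        return 0.0  # no traffic seen, so nothing was lost
    return 100.0 * dropped / total


def is_plausible(reported_percent: float) -> bool:
    """A genuine loss share always lies between 0% and 100%."""
    return 0.0 <= reported_percent <= 100.0
```

With this definition, even dropping every single packet gives exactly 100%, so is_plausible(600.0) is False while is_plausible(40.0) is True. A 40% reading is not ruled out by the arithmetic, but at that rate the performance impact would be obvious.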
I am having the same issue after installing ESXi 7.0U2. There were no obvious problems before that: I had never seen an alert until then, and they started appearing after the update.
First of all, it only appears on the second port of my X540-AT2 cards, on all 3 servers in my home lab. I was getting very high numbers at first. I did find that I had the default/Management network enabled on the physical switch ports but not on any of the portgroups of the vDS. Once I removed it from the physical switch ports, the number of dropped packets decreased to generally less than 1 or 2%. I have only gotten a couple of alerts/alarms since.
But I am still seeing this drop rate on only the second port of the X540-AT2. There are no drops on the first port.
Looking at the Intel website, I see no firmware updates available for this card, none at all. Makes me wonder if the firmware can be updated.
Has anyone else seen this issue on 7.0u2?
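For anyone wanting to check for the same kind of VLAN mismatch mentioned in this post (a network allowed on the physical switch ports but not used by any portgroup on the vDS), the comparison is just a set difference. This is a minimal sketch with hand-entered data; the VLAN IDs in the usage note are invented examples and the function name is hypothetical.

```python
def unused_switch_vlans(switch_port_vlans, portgroup_vlans):
    """Return VLAN IDs allowed on the switch port but used by no portgroup.

    Traffic on these VLANs still reaches the host uplink, where it has
    no portgroup to be delivered to.
    """
    return sorted(set(switch_port_vlans) - set(portgroup_vlans))
```

For example, unused_switch_vlans({1, 20, 30}, {20, 30}) returns [1], meaning VLAN 1 is a candidate to remove from the switch port configuration.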
I started having this issue on one particular host after upgrading to 7 U2. I use two X540-AT2 cards per chassis, and when looking at the switches, there are no errors at all.
The alert that this one host has pNIC errors comes back every once in a while. The other hosts have never reported this error so far.
Unfortunately, I have a problem with the new "physical adapters" view. I select a random host in the U2 cluster, then select "Monitor -> vSAN Performance -> Physical Adapters". The "Alien robot with a magnifying glass" appears and the donut just spins and spins forever.
All the other views like "VM", "Backend", "Disks", "Host Network" etc. work fine. It's only the "Physical adapters" pane that acts like this. Rebooting vCenter did not help.
When I look at the regular, classic "advanced performance" charts, select Network and the correct physical NICs, and display all errors from the past day, no errors show up. So somehow vCenter believes this particular host has pNIC issues, but the statistics don't show them.
vROPS shows no errors on the NICs either.
I have some new information. The only cluster where the "Alien robot with a magnifying glass" appears and the donut spins forever consists entirely of hosts that were upgraded from / U1d to U2 GA using the patch baseline. For some reason, the problem with "Failed to load crypto64.efi / Fatal error: 15 (Not Found)" did not occur there.
All the other vSAN clusters in this vCenter were updated via the ISO baseline and all those hosts DO NOT have the problem. The pNIC diagnostics work fine on all of those hosts.
Conclusion: in my environment, all hosts upgraded to U2 via the ISO have working vSAN pNIC statistics. Only the hosts that were upgraded to U2 via a patch baseline are affected.