Does anybody know about, or see the same behavior with, the "TCP Inbound Loss Rate"?
We use vSAN 6.7U3 and see a small loss rate of 0.1% to 0.2% on all hosts.
Take a look at the screenshot below. It's from the performance view (host network, time range 24 hours) of two vSAN hosts.
It's interesting that we see this small loss rate mostly outside of our business hours.
This is a VDI environment, so there is no real load on the servers before 6 am or at the weekend.
We don't have any vSAN host packet discards or drop rate; we only get this small loss rate. I also don't know whether this has any impact on our environment.
Can anyone clarify?
One thing to note is that this potentially isn't happening only outside normal business hours. It is a percentage-based metric: if you have 100 packets per second during quiet hours, then 1 lost packet will be seen as 1%, but if you have 100,000 packets per second during the day, then that same 1 packet will be 0.001% and thus not observed.
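To make the percentage effect concrete, here is a minimal Python sketch of the arithmetic. The function name `loss_rate_percent` is made up for illustration, and the assumption that the graph is simply lost packets divided by total packets is mine, not something confirmed for the vSAN performance service.

```python
def loss_rate_percent(lost_packets: int, total_packets: int) -> float:
    """Return lost packets as a percentage of all packets (hypothetical helper)."""
    if total_packets <= 0:
        # No traffic at all: report 0 rather than dividing by zero.
        return 0.0
    return 100.0 * lost_packets / total_packets


# The same single lost packet at two very different traffic levels.
quiet = loss_rate_percent(1, 100)        # quiet hours: 100 packets per second
busy = loss_rate_percent(1, 100_000)     # business hours: 100,000 packets per second

print(f"quiet hours: {quiet:.3f}%")
print(f"busy hours:  {busy:.3f}%")
```

The absolute number of lost packets is identical in both cases; only the denominator changes, which is why the loss can look like it only occurs when the hosts are idle.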
As with all performance graphs (not just in vSAN but pretty much any monitoring solution), these are designed to be reliable under normal/expected loads and can potentially be anomalous when there is near-zero IO.
I have seen patterns similar to what you have shared here in the back-end metrics we pull from host bundles (basically a raw form of the stats.db data that we load into Grafana), but I have never correlated them to any issues. It ALWAYS seems to be inbound/Rx, which to me indicates that some packet (e.g. node-membership heartbeats) is being received or registered twice and one copy is then dropped at the logical layers (e.g. vSwitch).
Are you able to narrow this down any further from the other graphs as to where in the logical layers this may be occurring?
The descriptions of what each graph at each level monitors can be found here:
Are there any corresponding incrementing counters on the NICs? The nicinfo.sh script (/usr/lib/vmware/vmware-support/bin/nicinfo.sh), run on a host, will generate this information, but to check for unexpected non-zero values it should be run (and the output saved) at least twice over time so that a comparison can be done.
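As a rough sketch of that comparison, the following Python snippet diffs two saved snapshots of NIC statistics and reports only the counters that changed. It assumes the output consists of "Name: value" lines; the helper names `parse_counters` and `counter_deltas` are hypothetical, and the sample values are illustrative rather than taken from a real pair of runs.

```python
def parse_counters(text: str) -> dict[str, int]:
    """Collect 'Name: <integer>' lines from saved NIC statistics output."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Keep purely numeric values only; skips MAC addresses, booleans, headers.
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def counter_deltas(before: str, after: str) -> dict[str, int]:
    """Return the change of every counter that differs between two snapshots."""
    old, new = parse_counters(before), parse_counters(after)
    return {k: new[k] - old[k] for k in new if k in old and new[k] != old[k]}


# Illustrative snapshots (hypothetical values), e.g. saved one week apart.
before = """NIC statistics for vmnic8:
   Packets received: 16959544308
   Receive packets dropped: 0
   Total receive errors: 251
   Receive length errors: 251
   Receive CRC errors: 0
"""
after = """NIC statistics for vmnic8:
   Packets received: 16960000000
   Receive packets dropped: 0
   Total receive errors: 347
   Receive length errors: 347
   Receive CRC errors: 0
"""

deltas = counter_deltas(before, after)
for name, delta in deltas.items():
    print(f"{name}: +{delta}")
```

Counters that stay at zero (or do not move) are left out, so anything printed besides the normal packet and byte counts is worth a closer look.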
Thanks for that information.
What we have already seen is that something in one of our VM networks is causing "Receive length errors". On all hosts we have a few errors on the "Receive length errors" counter.
The count is not really high, and we are not sure whether it is also the cause of the "TCP inbound loss rate". I don't think so.
Here is the output for one NIC (we have two per host), which carries vMotion traffic, vSAN traffic and also traffic for different VM networks:
Advertised Auto Negotiation: true
Advertised Link Modes: Auto, 40000BaseSR4/Full
Auto Negotiation: true
Cable Type: FIBRE
Current Message Level: 0
Bus Info: 0000:87:00:0
Firmware Version: 7.10 0x800075ec 19.5.12
Link Detected: true
Link Status: Up
PHY Address: 0
Pause Autonegotiate: false
Pause RX: false
Pause TX: false
Supported Ports: FIBRE
Supports Auto Negotiation: true
Supports Pause: true
Supports Wakeon: false
Virtual Address: 00:50:56:58:98:f1
NIC statistics for vmnic8:
Packets received: 16959544308
Packets sent: 14730365567
Bytes received: 41506856918183
Bytes sent: 35450864462457
Receive packets dropped: 0
Transmit packets dropped: 0
Multicast packets received: 133120293
Broadcast packets received: 59140098
Multicast packets sent: 1009653
Broadcast packets sent: 132104
Total receive errors: 251
Receive length errors: 251
Receive over errors: 0
Receive CRC errors: 0
Receive frame errors: 0
Receive FIFO errors: 0
Receive missed errors: 0
Total transmit errors: 0
Transmit aborted errors: 0
Transmit carrier errors: 0
Transmit FIFO errors: 0
Transmit heartbeat errors: 0
Transmit window errors: 0
After one week, the "Receive length errors" counter had increased by 96.
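To put these numbers in proportion, here is a small Python calculation using the figures from the vmnic8 output above. It is only a sketch: it assumes the NIC-wide packet count is a fair denominator, while the TCP inbound loss rate is measured at a different layer, so the two percentages are not directly comparable.

```python
# Figures taken from the vmnic8 output and the observed weekly increase.
packets_received = 16_959_544_308   # "Packets received"
length_errors = 251                 # "Receive length errors"
weekly_increase = 96                # growth of the counter over one week

SECONDS_PER_WEEK = 7 * 24 * 3600

# Share of all received packets that were counted as length errors.
error_share_percent = 100.0 * length_errors / packets_received

# Average time between two length errors during that week.
seconds_between_errors = SECONDS_PER_WEEK / weekly_increase

print(f"length errors as share of received packets: {error_share_percent:.8f}%")
print(f"average gap between errors: {seconds_between_errors / 3600:.2f} hours")
```

On these assumptions the length errors are several orders of magnitude rarer than a 0.1% to 0.2% loss rate, which fits with the suspicion that they are not the cause.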
If I check the performance view of this host again, I can see a persistent small percentage of TCP inbound loss.
This host is a member of our server cluster.
Multiple hosts in this cluster show this kind of loss rate.
But some other hosts look very different. For example, this one...
There is no TCP inbound loss rate, only a TCP outbound retransmission rate for a short time.
All hosts in the cluster have identical components.
We are trying to figure out which physical device or VM is sending the packets that cause these "Receive length errors". This is not so easy, because this network contains very different devices such as printers, scanners, server VMs, etc.
Any ideas what else to check?