Re: vSAN - TCP Inbound Loss Rate

VMHero4Ever · ‎09-21-2020

Hi guys,

Does anybody know about, or see the same behavior with, the "TCP Inbound Loss Rate"?

We use vSAN 6.7U3 and see a small loss rate of 0.1% to 0.2% on all hosts.

Take a look at the screenshot below. It's from the performance view (host network, time range 24 hours) of two vSAN hosts.

It's interesting that we see this small loss rate mostly outside of our business hours.

This is a VDI environment, so there is no real load on the servers before 6 am or on weekends.

We don't have any vSAN host packet discards or drops. We only get this small loss rate. I also don't know whether it has any impact on our environment.

Maybe someone can clarify?

Regards,
VM-Master

depping · ‎09-21-2020

I have never seen this, maybe TheBobkin has.

TheBobkin · ‎09-22-2020

Hello VMHero4Ever,

So one thing to note is that this potentially isn't happening only outside normal business hours. It is a percentage-based metric: e.g. if you have 100 packets per second during quiet hours, then 1 lost packet will be seen as 1%, but if you have 100,000 packets per second during the day, that same 1 packet will be 0.001% and thus not observed.
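To make the arithmetic concrete, here is a minimal Python sketch of a percentage-based loss metric. The function name `loss_rate_percent` is made up for illustration, and this is not how the vSAN performance service actually computes its graph; it only shows why the same single lost packet looks large at low packet rates and vanishes at high ones.

```python
def loss_rate_percent(lost_packets: int, total_packets: int) -> float:
    """Return lost packets as a percentage of all packets in the interval."""
    if total_packets <= 0:
        raise ValueError("total_packets must be positive")
    return 100.0 * lost_packets / total_packets


# The same single lost packet at two very different packet rates.
quiet = loss_rate_percent(1, 100)        # quiet hours: 100 packets per second
busy = loss_rate_percent(1, 100_000)     # business hours: 100,000 packets per second

print(f"quiet hours: {quiet}%")
print(f"busy hours:  {busy}%")
```

The absolute number of lost packets is identical in both cases; only the denominator changes.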

As with all performance graphs (not just in vSAN but pretty much any monitoring solution), these are designed to be reliable under normal/expected loads and can potentially be anomalous when there is near-zero IO.

I have seen something similar to what you have shared here in the back-end metrics we pull from host bundles (basically a raw form of the stats.db data that we load into Grafana), but I have never correlated it with any issue. It ALWAYS seems to be inbound/Rx, which to me indicates that some packet (e.g. a node-membership heartbeat) is being received or registered twice, and one copy is then dropped at the logical layers (e.g. the vSwitch).

Are you able to narrow this down any further from the other graphs as to where in the logical layers this may be occurring?

The descriptions of what each graph at each level monitors can be found here:

VMware Knowledge Base

Are there any corresponding incrementing counters on the NICs? The nicinfo.sh script (/usr/lib/vmware/vmware-support/bin/nicinfo.sh), run on a host, will generate this information. To tell whether any unexpected non-zero values are still incrementing, it should be run (and the output saved) at least twice over time so that a comparison can be done.
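The comparison of two saved runs could be sketched roughly like this in Python. This assumes the saved output contains counter lines of the form "Name: integer" and covers a single NIC (with several NICs in one file the names would collide); the helper names `parse_counters` and `counter_deltas` and the sample values are made up for illustration and are not part of nicinfo.sh.

```python
import re
from typing import Dict

# Matches lines such as "Receive CRC errors: 0" (name, colon, integer).
# Lines with non-integer values (versions, MAC addresses) do not match.
COUNTER_LINE = re.compile(r"^\s*([^:]+):\s*(\d+)\s*$")


def parse_counters(text: str) -> Dict[str, int]:
    """Collect every 'Name: integer' line of one NIC section into a dict."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters


def counter_deltas(before: str, after: str) -> Dict[str, int]:
    """Return only the counters whose value changed between two runs."""
    old, new = parse_counters(before), parse_counters(after)
    return {
        name: new[name] - old[name]
        for name in new
        if name in old and new[name] != old[name]
    }


# Made-up sample snapshots, for illustration only.
first_run = """\
Packets received: 1000
Receive length errors: 10
Receive CRC errors: 0
"""
second_run = """\
Packets received: 1150
Receive length errors: 14
Receive CRC errors: 0
"""

deltas = counter_deltas(first_run, second_run)
for name, delta in deltas.items():
    print(f"{name}: {delta:+d}")
```

Counters that stay at the same value between the two runs are left out, so any error counter that shows up in the result is one that is still incrementing.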

Bob

VMHero4Ever · ‎09-24-2020

Hi Bob,

Thanks for that information.

What we have already seen is that something in one of our VM networks is causing "Receive length errors". On all hosts we have a few errors on the "Receive length errors" counter.

The count is not really high, and we are not sure whether this is also the cause of the "TCP inbound loss rate". I don't think so.

Here is the output for one NIC (we have two per host), which carries vMotion traffic, vSAN traffic and also traffic for several VM networks:

NIC: vmnic8

NICInfo:
   Advertised Auto Negotiation: true
   Advertised Link Modes: Auto, 40000BaseSR4/Full
   Auto Negotiation: true
   Cable Type: FIBRE
   Current Message Level: 0
   Driver Info:
      NICDriverInfo:
         Bus Info: 0000:87:00:0
         Driver: i40en
         Firmware Version: 7.10 0x800075ec 19.5.12
         Version: 1.9.5
   Link Detected: true
   Link Status: Up
   Name: vmnic8
   PHY Address: 0
   Pause Autonegotiate: false
   Pause RX: false
   Pause TX: false
   Supported Ports: FIBRE
   Supports Auto Negotiation: true
   Supports Pause: true
   Supports Wakeon: false
   Transceiver:
   Virtual Address: 00:50:56:58:98:f1
   Wakeon: None

NIC statistics for vmnic8:
   Packets received: 16959544308
   Packets sent: 14730365567
   Bytes received: 41506856918183
   Bytes sent: 35450864462457
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Multicast packets received: 133120293
   Broadcast packets received: 59140098
   Multicast packets sent: 1009653
   Broadcast packets sent: 132104
   Total receive errors: 251
   Receive length errors: 251
   Receive over errors: 0
   Receive CRC errors: 0
   Receive frame errors: 0
   Receive FIFO errors: 0
   Receive missed errors: 0
   Total transmit errors: 0
   Transmit aborted errors: 0
   Transmit carrier errors: 0
   Transmit FIFO errors: 0
   Transmit heartbeat errors: 0
   Transmit window errors: 0

After one week, the "Receive length errors" counter had increased by 96.

If I check the performance view of this host again, I can see a small percentage of TCP inbound loss all the time.
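As a rough sanity check on whether these errors could explain the loss rate, the vmnic8 counters quoted above can be put in proportion with a few lines of Python. This is only an order-of-magnitude sketch: the NIC counters cover all traffic on the uplink since they were last reset, while the TCP inbound loss rate is a vSAN network metric, so the two are not directly comparable. The variable names are made up for illustration.

```python
# Values taken from the vmnic8 statistics and the posts in this thread.
packets_received = 16_959_544_308
receive_length_errors = 251
errors_per_week = 96
reported_loss_percent = 0.1   # lower end of the 0.1% to 0.2% seen in the graph

# Share of all received packets that were counted as length errors.
error_percent = 100.0 * receive_length_errors / packets_received

# Average rate at which the counter grew over the observed week.
errors_per_hour = errors_per_week / (7 * 24)

print(f"length errors as share of received packets: {error_percent:.8f}%")
print(f"length errors per hour (weekly average):    {errors_per_hour:.2f}")
print(f"graph loss rate exceeds that share:         {reported_loss_percent > error_percent}")
```

On these numbers the length errors are several orders of magnitude too rare to account for a 0.1% loss rate on their own, which fits the impression that the two are probably unrelated.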

This host is a member of our server cluster.

Multiple hosts in this cluster show this kind of loss rate.

But some other hosts look very different. For example, this one...

There is no TCP inbound loss rate, only a TCP outbound retransmission rate for a short time.

All hosts in the cluster have identical components.

We are trying to figure out which physical device or VM is sending the frames that cause these "Receive length errors". This is not easy, because this network contains very different devices such as printers, scanners, server VMs, etc.

Any ideas what else to check?

mattladewig · ‎11-20-2020

Having the same issue. Were you able to resolve this?