Contributor

vSAN - TCP Inbound Loss Rate

Hi guys,

Does anybody know about, or see the same behavior with, the "TCP Inbound Loss Rate"?

We use vSAN 6.7U3 and see a small loss rate of 0.1% to 0.2% on all hosts.

Take a look at the screenshots below. They are from the performance view (host network, time range 24 hours) of two vSAN hosts.

Loss Rate.png

2020-09-21 12_44_48-vSphere - fra7-esx203.kvhessen.de - Performance.png

It's interesting that we see this small loss rate mostly outside of our business hours.

This is a VDI environment, so there is no real load on the servers before 6 am or on weekends.

We don't have any vSAN host packet discards or drop rate. We only get this small loss rate. I also don't know whether this has any impact on our environment.

Maybe someone can clarify?

Regards,
VM-Master

4 Replies
VMware Employee

I have never seen this; maybe TheBobkin has.

VMware Employee

Hello VMHero4Ever​,

One thing to note is that this potentially isn't happening only outside business hours. It is a percentage-based metric: if you have 100 packets per second during quiet hours, then 1 lost packet shows up as 1%, but if you have 100,000 packets per second during the day, that same 1 packet is 0.001% and thus not visible.
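To make the arithmetic concrete, here is a small Python sketch. The function name `loss_percent` is made up for illustration, and the packet rates are only the example figures from above, not measurements from any host.

```python
def loss_percent(lost_packets: int, total_packets: int) -> float:
    """Return lost packets as a percentage of all packets in the interval."""
    if total_packets <= 0:
        raise ValueError("total_packets must be positive")
    return 100.0 * lost_packets / total_packets


# The same single lost packet at two different (illustrative) traffic levels.
for label, packets_per_second in (("quiet hours", 100), ("business hours", 100_000)):
    print(f"{label}: {loss_percent(1, packets_per_second):.3f}%")
```

The absolute loss is identical in both cases; only the denominator changes, which is why the graph can look worse when the hosts are nearly idle.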

As with all performance graphs (not just in vSAN but pretty much any monitoring solution), these are designed to be reliable under normal/expected loads and can potentially be anomalous when there is near-zero IO.

I have seen something similar to what you have shared here in the back-end metrics we pull from host bundles (basically a raw form of the stats.db data that we load into Grafana), but I have never correlated it to any issues. It ALWAYS seems to be inbound/Rx, which to me indicates that some packet (e.g. node-membership heartbeats) is being received or registered twice and one copy is then dropped in the logical layers (e.g. the vSwitch).

Are you able to narrow this down any further from the other graphs as to where in the logical layers this may be occurring?

The descriptions of what each graph at each level monitors can be found here:

VMware Knowledge Base

Are there any corresponding incrementing counters on the NICs? The nicinfo.sh script (/usr/lib/vmware/vmware-support/bin/nicinfo.sh), run on a host, will generate this information. To check for unexpected non-zero values, it should be run (and the output saved) at least twice over time so that the counters can be compared.
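As a rough sketch of that comparison, the following Python could diff two saved outputs. It assumes the statistics appear as "NIC statistics for <name>:" followed by "Counter name: value" lines; the function names and the sample values are made up for illustration and are not part of any VMware tooling.

```python
import re


def parse_nic_stats(text: str) -> dict:
    """Map each NIC name to its {counter name: integer value} statistics."""
    stats = {}
    current = None
    for line in text.splitlines():
        if re.match(r"\s*NIC:\s", line):
            current = None  # a new NIC info section starts; stop collecting
            continue
        header = re.match(r"\s*NIC statistics for (\S+):", line)
        if header:
            current = stats.setdefault(header.group(1), {})
            continue
        counter = re.match(r"\s*(.+?):\s*(\d+)\s*$", line)
        if counter and current is not None:
            current[counter.group(1)] = int(counter.group(2))
    return stats


def changed_counters(before: dict, after: dict) -> dict:
    """Return {nic: {counter: delta}} for error/drop counters that increased."""
    changes = {}
    for nic, counters in after.items():
        for name, value in counters.items():
            if "error" not in name.lower() and "dropped" not in name.lower():
                continue
            delta = value - before.get(nic, {}).get(name, 0)
            if delta > 0:
                changes.setdefault(nic, {})[name] = delta
    return changes


# Made-up sample snapshots, only to show the comparison.
first = """NIC statistics for vmnic0:
   Packets received: 1000
   Receive CRC errors: 0
   Receive length errors: 5
"""
second = """NIC statistics for vmnic0:
   Packets received: 9000
   Receive CRC errors: 0
   Receive length errors: 8
"""
print(changed_counters(parse_nic_stats(first), parse_nic_stats(second)))
```

In practice the two strings would be replaced by the contents of the saved output files; only counters whose names contain "error" or "dropped" and that have grown between the runs are reported.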

Bob

Contributor

Hi Bob,

Thanks for that information.

What we have already seen is that something in one of our VM networks is causing "Receive length errors". On all hosts we have a few errors on the "Receive length errors" counter.

The count is not really high, and we are not sure whether it is also the cause of the "TCP inbound loss rate". I don't think so.

Here is the output for one NIC (we have two per host), which carries vMotion traffic, vSAN traffic and also traffic for several VM networks:

NIC:  vmnic8
NICInfo:
   Advertised Auto Negotiation: true
   Advertised Link Modes: Auto, 40000BaseSR4/Full
   Auto Negotiation: true
   Cable Type: FIBRE
   Current Message Level: 0
   Driver Info:
      NICDriverInfo:
         Bus Info: 0000:87:00:0
         Driver: i40en
         Firmware Version: 7.10 0x800075ec 19.5.12
         Version: 1.9.5
   Link Detected: true
   Link Status: Up
   Name: vmnic8
   PHY Address: 0
   Pause Autonegotiate: false
   Pause RX: false
   Pause TX: false
   Supported Ports: FIBRE
   Supports Auto Negotiation: true
   Supports Pause: true
   Supports Wakeon: false
   Transceiver:
   Virtual Address: 00:50:56:58:98:f1
   Wakeon: None

NIC statistics for vmnic8:
   Packets received: 16959544308
   Packets sent: 14730365567
   Bytes received: 41506856918183
   Bytes sent: 35450864462457
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Multicast packets received: 133120293
   Broadcast packets received: 59140098
   Multicast packets sent: 1009653
   Broadcast packets sent: 132104
   Total receive errors: 251
   Receive length errors: 251
   Receive over errors: 0
   Receive CRC errors: 0
   Receive frame errors: 0
   Receive FIFO errors: 0
   Receive missed errors: 0
   Total transmit errors: 0
   Transmit aborted errors: 0
   Transmit carrier errors: 0
   Transmit FIFO errors: 0
   Transmit heartbeat errors: 0
   Transmit window errors: 0

After one week, the "Receive length errors" counter had increased by 96.
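For a rough sense of scale, here is a small Python calculation with the vmnic8 numbers above. Note the assumption: the NIC counters cover all traffic on vmnic8, while the vSAN graph presumably only counts TCP packets of the vSAN network, so the two percentages are not directly comparable. This only shows the order of magnitude.

```python
packets_received = 16_959_544_308  # "Packets received" for vmnic8
length_errors = 251                # "Receive length errors" for vmnic8
weekly_increase = 96               # growth of the counter after one week

# Share of all received packets that were counted as length errors.
error_percent = 100.0 * length_errors / packets_received
# Average number of new length errors per day over that week.
daily_increase = weekly_increase / 7

print(f"Receive length errors: {error_percent:.7f}% of received packets")
print(f"Average new length errors per day: {daily_increase:.1f}")
```

Measured against all received packets, the length errors are several orders of magnitude below the 0.1% to 0.2% shown in the graph.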

If I check the performance view of this host again, I can see a constant small percentage of TCP inbound loss.

Bild1.png

This host is a member of our server cluster.

Multiple hosts in this cluster show this kind of loss rate.

But some other hosts look very different. For example, this one:

Bild2.png

There is no TCP inbound loss rate, only a TCP outbound retransmission rate for a short time.

All hosts in the cluster have identical components.

We are trying to figure out which physical device or VM sends the frames that cause these "Receive length errors". This is not easy, because this network contains very different devices such as printers, scanners, server VMs, etc.

Any ideas what else to check?

Contributor

I'm having the same issue. Were you able to resolve this?
