We have been running an All-Flash vSAN cluster for about 5 months now and noticed some spikes in latency that seemed odd.
We have six hosts with the following configuration (all hardware is on the HCL):
- ESXi 6.5.0 7526125, vSAN 6.6
- SuperMicro 1028U-TR4+
- 2 x Intel E5-2680v4 2.4GHz CPU
- 512GB RAM
- AOC-S3008L-L8I (Supermicro 12Gb/s Eight-Port SAS Controller)
- 2 disk groups with (Cache: 800GB SATA, Capacity: 2x3.84TB SATA)
- 2 X710-DA2 10Gb network adapters (Firmware: 6.01, Driver: 1.5.8 i40en)
When we first started troubleshooting, we noticed spikes in vmknic errors for DupAckRx, DupDataRx, and OutofOrderRx.
Working with VMware support, we updated the drivers/firmware on our X710-DA2 adapters, as the X710s have many known issues. (We specifically looked into the LRO/TSO issues this adapter is known to have.) The change in firmware/drivers has not seemed to make any difference to the latency spikes.
Digging further, we noticed that our switches were discarding packets multiple times every hour, on ALL of our active vSAN interfaces.
VSAN-SWITCH1# sh queuing interface ethernet 1/2
Ethernet1/2 queuing information:
qos-group sched-type oper-bandwidth
0 WRR 100
Mcast pkts dropped : 0
HW MTU: 16356 (16356 configured)
drop-type: drop, xon: 0, xoff: 0
Ucast pkts dropped : 182232
VSAN-SWITCH1# sh interface ethernet 1/2 | grep discard
0 input with dribble 0 input discard
0 lost carrier 0 no carrier 0 babble 182232 output discard
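For anyone wanting to line these discards up against latency spikes, here is a minimal Python sketch that pulls the "output discard" counter out of saved `sh interface` output and computes the difference between two samples. The function names are my own, and how the text is collected from the switch (SSH, a scheduled job, etc.) is left out on purpose; this assumes the output format shown above.

```python
import re


def parse_output_discards(show_interface_text: str) -> int:
    """Return the 'output discard' counter found in 'show interface' text."""
    match = re.search(r"(\d+)\s+output discard", show_interface_text)
    if match is None:
        raise ValueError("no 'output discard' counter found")
    return int(match.group(1))


def discard_delta(earlier_text: str, later_text: str) -> int:
    """Number of packets discarded between two samples of the same port."""
    return parse_output_discards(later_text) - parse_output_discards(earlier_text)


# Line taken from the switch output above.
sample = "0 lost carrier 0 no carrier 0 babble 182232 output discard"
print(parse_output_discards(sample))  # 182232
```

Sampling the counter every minute or so and logging the delta with a timestamp should make it easy to see whether the discards coincide with the vSAN latency spikes.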
Cisco support instructed us to enable Active Buffer Monitoring to check the shared buffer usage on the ports. The switch has 3 groups of buffers, and each group has 4MB of buffer (usually each group has 6MB, but the jumbo frame config on this switch reduces it to 4MB), for 12MB total.
Once we enabled this, we were able to see the buffer usage on each of our ports over the last hour.
VSAN-SWITCH1# show hardware profile buffer monitor interface ethernet 1/2 brief
Maximum buffer utilization detected
1sec 5sec 60sec 5min 1hr
------ ------ ------ ------ ------
Ethernet1/2 384KB 384KB 768KB 4224KB 4224KB
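A small Python sketch for checking this kind of output across many ports: it parses one data row of the brief buffer monitor output and reports which time windows reached the shared buffer group size. The 4MB limit comes from the description above; the function names and the assumption that every cell ends in "KB" are mine.

```python
WINDOWS = ["1sec", "5sec", "60sec", "5min", "1hr"]
BUFFER_GROUP_KB = 4 * 1024  # assumption: 4MB shared buffer group, as described above


def parse_buffer_line(line: str):
    """Split a row such as 'Ethernet1/2 384KB ...' into (port, {window: KB})."""
    port, *cells = line.split()
    if len(cells) != len(WINDOWS):
        raise ValueError(f"expected {len(WINDOWS)} columns, got {len(cells)}")
    peaks = {}
    for window, cell in zip(WINDOWS, cells):
        if not cell.endswith("KB"):
            raise ValueError(f"unexpected unit in {cell!r}")
        peaks[window] = int(cell[:-2])
    return port, peaks


def saturated_windows(peaks, limit_kb=BUFFER_GROUP_KB):
    """Return the windows whose peak usage reached the buffer group size."""
    return [window for window, kb in peaks.items() if kb >= limit_kb]


port, peaks = parse_buffer_line("Ethernet1/2 384KB 384KB 768KB 4224KB 4224KB")
print(port, saturated_windows(peaks))  # Ethernet1/2 ['5min', '1hr']
```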
What we found was that all of our ports were bursting a few times every hour to use the entire 4MB buffer space for that shared buffer group. During that burst, the interfaces in that buffer group would discard packets. These discarded packets also correlate with the odd latency spikes we see.
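To put the 4MB figure in perspective, here is a rough back-of-the-envelope calculation in Python of how quickly a burst can fill the buffer. It is a worst-case sketch under assumptions that are mine, not measurements: several hosts send at full 10Gb line rate to a single 10Gb port, that port can use the whole 4MB group, and frame overhead is ignored.

```python
def buffer_fill_time_ms(buffer_bytes: int, senders: int, link_gbps: float = 10.0) -> float:
    """Milliseconds until the buffer is full when `senders` hosts send at
    line rate to one egress port of the same speed (idealised incast)."""
    if senders < 2:
        raise ValueError("a single sender at line rate does not fill the buffer")
    # The egress port drains one link's worth; the rest accumulates in the buffer.
    excess_bits_per_second = (senders - 1) * link_gbps * 1e9
    return buffer_bytes * 8 / excess_bits_per_second * 1e3


FOUR_MB = 4 * 1024 * 1024

# Five hosts sending to the sixth host in a six-host cluster.
print(round(buffer_fill_time_ms(FOUR_MB, senders=5), 2))  # 0.84
# Only two hosts sending to one.
print(round(buffer_fill_time_ms(FOUR_MB, senders=2), 2))  # 3.36
```

Under these assumptions the buffer is gone in roughly a millisecond or a few, which would be consistent with short bursts a few times an hour being enough to cause discards.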
Cisco support recommended that if we were unable to change the traffic patterns on the switch (all we have on the switch are dedicated vSAN ports), then we would need to look towards getting a switch with deep buffers.
I spoke with VMware support regarding this and all they recommended was 10Gb switches; they did not mention anything about needing bigger buffers. The concern I have is that we are only using 6 ports on this switch so far, and because of these discards we cannot add any more hosts.
Is it normal for vSAN to require deep buffers? Does anyone have recommendations for 10Gb switches with SFP+ ports to use with All-Flash vSAN?
Solutions like vSAN (or any very high performance Ethernet-based storage technology, for that matter) require switches that can handle a continuous beating from multiple hosts really pushing it (heavy I/O). The advice to use switches with deep packet buffers is correct.
It is generally recommended to use separate physical switches and NICs for the vSAN network. As vSAN clusters are often not very large, such switches do not need to have a lot of ports either, which reduces cost.
By isolating the vSAN network, you also protect it against configuration mishaps on the regular switches, which can see several reconfigurations per month in larger organisations. Treat vSAN (or any Ethernet-based storage) as if it were an FC environment: it's isolated, configured once, and rarely touched afterwards, which is nice and stable and safe.
What you could try, by way of a test, is to stop using jumbo frames. I've seen no noticeable difference, and jumbo frames are easily misconfigured end-to-end. I've also seen enough switches that crap their pants in jumbo mode under load, while working fine with normal 1500-byte packets. And with 10G or faster on modern, powerful switches, 9k frames really won't blow you off your feet anyway. So what if a switch needs to handle 6x more packets in 1500-byte mode as opposed to 9k mode? The ASICs of modern switches can deal with that easily, so who cares. 9k frames were a thing in the 1gig days, sometimes really helping performance. But in a 10gig+ world with modern switching hardware, don't bother.
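The "6x more packets" figure can be checked with a quick calculation. This Python sketch uses the payload size only and ignores Ethernet headers, preamble, and inter-frame gap, so the absolute numbers are approximations; the function name is just for illustration.

```python
def packets_per_second(link_gbps: float, frame_bytes: int) -> float:
    """Approximate packet rate at line rate for a given frame size."""
    return link_gbps * 1e9 / (frame_bytes * 8)


standard = packets_per_second(10, 1500)
jumbo = packets_per_second(10, 9000)

print(round(standard))            # 833333 packets/s with 1500-byte frames
print(round(jumbo))               # 138889 packets/s with 9000-byte frames
print(round(standard / jumbo, 1)) # 6.0
```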
We are facing the same issue. When a high I/O load starts, one host loses contact with another and host isolation occurs. We checked everything on the HPE/VMware side, and all is OK. We have now found a lot of discards on the ports of our Cisco Nexus 5548 switches when the failure occurs. We opened a ticket with Cisco and are waiting for their answer. We are looking for the root cause of the discards on the network ports.
4-node stretched cluster (total 8 ESXi hosts)
Have you verified the counters on your Virtual Connect / Flex Fabric modules?
Can you see anything there related to the issue?
We are running the same configuration as you in a vSAN Stretched Cluster design spanning multiple Synergy Frames...
No counters show any errors; the VC statistics are clean.
Which switch model are you using in this environment? My switches are Cisco Nexus 5548, and I see around 30,000 to 50,000 packet discards.
Cisco recommended that we apply QoS based on an ACL (matching the vSAN IP block). We did, but the issue is ongoing.
I think the switch buffer is not large enough to handle this much I/O traffic.
The Nexus 55xx uses a VoQ buffer system. Basically, the (limited) buffer isn't split across all the ports in a shared pool. There are also very few queues per port in general.
It's not a great switch for high-performance operations, and I believe Cisco abandoned that VoQ system in the 56xx.
If you want a deep-buffer Cisco switch, get a C36180YC-R, which has 8GB of buffer allocation for the ports.
The 55xx has a puny 470KB for the default traffic class (it's 640KB, but some of it is reserved by default for FCoE and other things).
The 3548 is interesting. It can achieve some crazy low port-to-port latency, but that's really only a big deal with RDMA traffic. It might help in the future, but not now.
Thanks for your switch recommendation. We will consider them for our next investment.
By the way, Cisco's second recommendation for my case is to increase the number of ports connected to the backbone switches. My servers are connected to the Nexus, and the Nexus is connected to a Catalyst 6500 with 2 ports in a port-channel. These 2 ports are congested, so they are applying back pressure to the end devices.
We will increase the number of ports or replace the switches.
The NIC model installed in my servers is:
Model : Synergy 3820C 10/20Gb CNA
Firmware : 18.104.22.168 BC: 7.15.24
Driver : qfle3 - 22.214.171.124
8 nodes are running in a stretched vSAN cluster.
I am currently planning to update them to 6.7 U2 and the latest HPE Synergy SPP bundle.