Very little I/O, no congestion,but high latency

Mokii · ‎07-25-2019

My environment is vsan6.7, all flash. Recently, the write latency of a virtual machine is very large, and it will affect the delay of the entire cluster, but the strange thing is that I / Ø is very small whether it is this virtual machine or the entire cluster. And if it is only a write test of a file copy, there will be no congestion and high latency. What is the reason for this? And the latency of a virtual machine is high, why does the delay of the entire cluster increase?

TolgaAsik · ‎07-28-2019

Hello,

Perhaps you did, but even so I wanna ask you.

to specify the case --> Did you try to move virtual machines onto different ESXi hosts? I wanna sure it is happening on same host.

Compatibility --> On the other hand, please check NIC firmware and driver version, maybe NIC could not handle the traffic accurately.

Compatibiliy --> please check snart array controller card driver and firmware version? Are they certified for VMware VSAN?

Health --> Did you check VMware VSAN monitoring tab "health" on VSAN? İs it green?

Network --> What is your Network design for VSAN? Share network on VSAN or dedicated VDS with dedicated uplink. If you use shared network for VSAN/Data/Vmotion, did you enabled NIOC? You should prioritize VSAN and Data traffic on VDS by increasing their share value.

Mokii · ‎07-28-2019

Thanks for the reply, first of all, this phenomenon can occur on any host. The network is configured for each host 2x10GB network, the network delay is normal, and the daily traffic is very small. Vsan health is all green, and disk controllers and drivers are all certified.

TolgaAsik · ‎07-28-2019

Hi,

Did you check physical network site? Switch ports, discard packets, CRC..

I faced this issue, the problem was pyhsical switch side. It couldnt handle mircoburst traffic.

As far as I understand you use one VDS with 2x10 uplink per server. Did you prioritize VSAN traffic via NIOC on VDS? If there is a congestion on server side, it would evaluate VSAN traffic better.

JasonNash · ‎07-31-2019

Sounds similar to this post I've just commented on here.

Write latency and network errors

What switches are you running? Check your switches for output discards.

Do you see network errors in observer? We have duplicate data, acks, retransmits and out of order frames.

JasonNash · ‎07-31-2019

TolgaAsik how did you resolve this in your environment?

Did you replace switches? If so what with and what was the impact/outcome?

Mokii · ‎07-31-2019

Thanks for the reply, I think it is not a network problem, the first choice, continuous monitoring of network delay has been ongoing, no abnormalities have been found. At the same time, continuous copying and pull operation testing of large files between hosts and virtual machines does not cause the above write latency problem.

srodenburg · ‎08-02-2019

Try this: take a VM, apply a Storage Policy to it which has no redundancy (FTT=0), then look at the physical placement of the components (on which host) and vMotion the VM to that host and make sure it stays there. Now, you are forcing data-locality.

Then, check the latency.

If the problem is still there, you can rule out the network as the network is not used when the VM sits on the same host that has the only data (as there is no mirror copy somewhere else it needs to read/write from).

If the latency-problem is gone, you DO have a network issue. To verify, give the VM a policy with FTT=1 and let it finish re-creating the object mirrors. After that, the problem should be back, indicating that there is a delay being introduced when hosts communicate with each other over the network.

Saying "I do not think we have a network issue" is a bit dangerous as many vSAN problems come from the network. I see it all the time. I see people route the traffic between vSAN nodes over many hops and do all kind of things that introduce inter-node latency.

Another thing to try, force the vSAN vmkernel-port to flow over another NIC. This usually means another switch as well (hopefully). See what happens then.

Also, as you have a write-latency issue and not a read-latency issue (at least you don't speak about it), is the cluster stretched? Do you have Site-read-locality activated?

TolgaAsik · ‎08-02-2019

Hi,

Temporarily we solved the issue by migrating VSAN backend traffic onto existing Nexus 5672 UP switches. It is running now very stable now.

Data/vMotion/management traffic still running on Nexus 5548 switches.

Final decision with our network team, we will replace Cisco 5548UP switches in near time with recent Cisco Nexus series switches. Because old Nexus are not aware microburst traffic and doesnt handle high packet transmits.

Mokii · ‎08-03-2019

Thanks for your suggestion, this delay occurs on a certain number of virtual ones, and the delay problem mainly occurs in mysql write and query. After we have recently deployed the mysql database to another virtual machine, the problem is solved. We tried the Windows and centos7.6 versions of the operating system no longer have a delay. The virtual machine we had problems with was the centos7.3 version. Is this related to the operating system?

All

Very little I/O, no congestion,but high latency