Hi,
We have a strange problem with storage latency.
Here is roughly our system:
Host:
- ESXi 7.0.3, build 21424296
- Intel Ethernet Controller 10 Gigabit X540-AT2
- ixgben 1.15.1.0
Storage:
Infortrend DS 1024REB2
9x SSD, Raid 6
Connected with iSCSI, one LUN, Jumbo frames not activated.
The storage itself is connected to a switch with a 20 Gbit port channel (2x 10 Gbit).
The host was planned to be connected to the switch with a 20 Gbit port channel as well, but the Essentials license does not allow port aggregation (surprise!), so two separate 10 Gbit links are used on the host side instead.
After initial tests we migrated one VM to the new storage and noticed the latency for the first time.
I tried to copy files from a share to the VM and noticed severe lag.
I tried to copy files directly via the host (WinSCP) to the new storage volume and again severe lag.
I then added a new drive to the VM on the new storage and copied some 56GB of data there.
Then I copied the 56 GB of data to a separate drive on the same VM, and something really strange (to me) happened:
The copy was running fine, at around 800 MB/s and 6000 IOPS combined on the storage.
But then _after_ the copy action, severe lag occurred, lasting many minutes. During this period the VM was nearly unusable.
I repeated that test a few times, same result.
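For reference, here is a quick back-of-the-envelope calculation from the numbers above (56 GB copied, around 800 MB/s, around 6000 IOPS). It is only a rough sketch and assumes decimal units and a constant transfer rate, which is a simplification:

```python
# Rough arithmetic from the figures reported in this post.
# Assumption: decimal units (1 GB = 1000 MB) and a constant transfer rate.
data_gb = 56
throughput_mb_s = 800
iops = 6000

copy_seconds = data_gb * 1000 / throughput_mb_s   # time needed to move the data
avg_io_kb = throughput_mb_s * 1000 / iops         # average size of a single I/O

print(f"copy time: {copy_seconds:.0f} s")
print(f"average I/O size: {avg_io_kb:.0f} KB")
```

So the copy itself should be done in a little over a minute with fairly large I/Os, while the lag afterwards lasted many minutes.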
See picture:
I checked with esxtop, and after the copy action, the DAVG values spiked too.
Looks like an "echo" of the previous operation.
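To put numbers on the DAVG spikes, one option is to record esxtop in batch mode (for example `esxtop -b -d 5 -n 120 > esxtop.csv`) and scan the CSV afterwards. The sketch below is only an illustration: the column text "Average Driver MilliSec/Command" is my assumption for how DAVG shows up in the batch output, the 20 ms threshold is just a rule of thumb, and the sample data is made up.

```python
import csv
import io

def latency_spikes(csv_text, column_hint="Average Driver MilliSec/Command",
                   threshold_ms=20.0):
    """Return (timestamp, column name, value) for every sample above threshold_ms.

    column_hint is an assumed substring of the DAVG column headers; adjust it
    to whatever the header row of your own capture actually contains.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if column_hint in name]
    spikes = []
    for row in reader:
        for i in wanted:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip empty or non-numeric cells
            if value > threshold_ms:
                spikes.append((row[0], header[i], value))
    return spikes

# Made-up sample with a simplified header, only to show the shape of the data.
sample = (
    '"time","vmhba64 Average Driver MilliSec/Command"\n'
    '"10:00:00","1.2"\n'
    '"10:00:05","85.0"\n'
)
print(latency_spikes(sample))
```

Comparing the timestamps of the spikes with the end of the copy would show how long the "echo" really lasts.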
What is going on?
And how do I fix it?
Thanks for your help. 🙂
Hello
Problems like this can be caused by firmware. Are you sure that the firmware of both the server and the storage is up to date?
Hello,
With iSCSI storage there are some best practices to follow. The most important are:
Enabling jumbo frames
iSCSI traffic should remain on the same switch and should not pass through multiple switches or a firewall.
Check the link below for more details :
https://core.vmware.com/resource/best-practices-running-vmware-vsphere-iscsi
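To give a feeling for what jumbo frames change, here is a rough calculation of how much of each Ethernet frame is actual TCP payload at MTU 1500 versus MTU 9000. It is a simplified sketch: it assumes IPv4 and TCP without options and ignores the iSCSI headers.

```python
# Per-frame overhead on the wire, in bytes.
ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble, Ethernet header, FCS, inter-frame gap
IP_TCP_HEADERS = 20 + 20         # IPv4 + TCP headers, assuming no options

def payload_efficiency(mtu):
    """Fraction of the bytes on the wire that carry TCP payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)

for mtu in (1500, 9000):
    print(mtu, f"{payload_efficiency(mtu):.1%}")
```

The bandwidth gain is only a few percent. The larger effect is that roughly six times fewer frames have to be processed for the same amount of data, and jumbo frames must be enabled end to end (host, switch and storage) to have any benefit.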
KUDO if you find my answer useful.
Have you checked whether your datastores are using the Round Robin path policy?
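Also, you can check the throughput per channel, i.e. the Round Robin IOPS limit (the number of I/Os sent down one path before switching to the next). The default is 1000, and a much lower value (for example 10, or even 1, depending on the storage vendor) is often recommended.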
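Assuming "throughput per channel" refers to the Round Robin IOPS limit, the following small simulation shows why the value matters. It is a toy model, not how ESXi is implemented, and the path names are made up. As far as I know the ESXi default for this limit is 1000 I/Os per path.

```python
def distribute(io_count, paths, iops_limit):
    """Count the I/Os each path receives when the active path
    changes after every iops_limit I/Os (simple round robin)."""
    counts = {p: 0 for p in paths}
    current = 0
    sent_on_current = 0
    for _ in range(io_count):
        counts[paths[current]] += 1
        sent_on_current += 1
        if sent_on_current == iops_limit:
            current = (current + 1) % len(paths)
            sent_on_current = 0
    return counts

paths = ["path-A", "path-B"]           # hypothetical names for the two 10 Gbit links
print(distribute(1500, paths, 1000))   # high limit: uneven, one path gets 1000
print(distribute(1500, paths, 1))      # limit 1: I/Os alternate, 750 each
```

On the host, the limit is normally changed per device with `esxcli storage nmp psp roundrobin deviceconfig set --type=iops --iops=<n> --device=<naa id>`. Check with the storage vendor for the value they support before changing it.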
Thanks for the answers.
I have contacted the storage vendor, and there is a known problem with latency and round robin.
As soon as I have resolved that, I will see whether it really was the solution.
"Throughput per channel":
I'm not sure what that is.
Have to look it up, but thanks anyway.