vrugaitis
Contributor

ESXi freezes when storage over iSCSI to Synology times out during high load

Hello,

I have a Fujitsu RX 300 S7 running ESXi 6.0 U3 and a Synology RS2416+ with a RAID 10 of 4x 4 TB WD Red HDDs. I created a file-based LUN on the Synology and connected the target to ESXi over 3 NICs with round-robin multipathing. This means that all three NICs are used for the connection between the ESXi host and the Synology (I have followed this tutorial).

I have run several tests with VMware Iometer and CrystalDiskMark and the results are great (350 Mb/s and ~50k read IOPS), but during the tests the ESXi hypervisor freezes. If I disable the target on the Synology and enable it again, ESXi and the test VMs resume working as if nothing had happened.

In the vmkernel.log I can find several entries containing this line: "Waiting for timed out HB" (please find a screenshot attached).
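For reference, this is how I'm pulling those entries out of the log. The two sample lines below are only illustrative of the message format, not copied from my host; on the ESXi host itself the log is at /var/log/vmkernel.log:

```shell
# Count VMFS heartbeat-timeout entries in a vmkernel.log excerpt.
# The sample lines are made up to show the message format only.
cat > /tmp/vmkernel_sample.log <<'EOF'
2017-10-12T09:14:03Z cpu4:33290)HBX: 270: Waiting for timed out HB [HB state abcdef02 offset 3723264 gen 36]
2017-10-12T09:14:11Z cpu4:33290)HBX: 270: Waiting for timed out HB [HB state abcdef02 offset 3723264 gen 36]
EOF

# Count how often the heartbeat-timeout message appears:
grep -c "Waiting for timed out HB" /tmp/vmkernel_sample.log
```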

What is the cause of the error? How can I fix the timeouts? I really need the higher throughput of all 3 NICs for better performance.

Thank you in advance!

Kind regards,

vrugaitis

7 Replies
daphnissov
Immortal

What version of DSM are you running on the Synology? You should make sure you're always on the latest. I have two in my lab, and they routinely release updates that address "iSCSI stability issues" and other things that can cause similar behavior.

vrugaitis
Contributor

Hey,

We are running the newest available version, DSM 6.1.3-15152 Update 3.

It seems that ESXi loses the iSCSI connection if the benchmarks are too intensive. In the current setup, two LUNs are attached to a VM; the VM runs CrystalDiskMark on one LUN while doing some Windows operations (copying a larger folder) on the second LUN. After a while, the VM freezes.

Is there no way to limit the storage traffic to prevent such behaviour? I mean, we need good throughput, but also some threshold so that we don't overload ESXi, or rather the Synology.
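For example, I've read that ESXi supports a per-virtual-disk IOPS cap via the .vmx file. This is only a sketch under my assumptions: the disk node names scsi0:1 / scsi0:2 and the value 5000 are placeholders for wherever the two LUN-backed disks are actually attached, and I haven't tested whether this helps with the timeouts:

```
# Hypothetical per-disk IOPS caps in the VM's .vmx file.
# Adjust the scsi node names to your own disk layout;
# "off" removes the cap again.
sched.scsi0:1.throughputCap = "5000"
sched.scsi0:2.throughputCap = "5000"
```

As far as I understand, the same thing can also be set as an IOPS limit on the virtual disk in the vSphere client, but I'm not sure either approach addresses the underlying heartbeat timeouts.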

Kind regards,

vrugaitis

daphnissov
Immortal

This could be due to your use of a file-based LUN. While that option is attractive because it offers the richest VMware data services, it comes at a price: in my tests it performed worst and was the least stable. You may want to repeat your tests using the block-level LUN option instead; I've found it produces lower latency and higher throughput.

vrugaitis
Contributor

Thank you for the fast reply! My results are basically the opposite: the file-based LUN is outperforming the block-based LUN.

Furthermore, the file-based LUN reached ~50,000 read IOPS, whereas the block-based LUN only managed ~10,000. Does Synology have some kind of internal caching? Is the RAM used for that? To what extent can the Synology cache: gigabytes of data, or rather only small amounts, as in the IOPS benchmark? Because the HDDs run at only 5,400 rpm.

----

Just to be sure: you have a well-performing setup? Could you please provide some information about it? Perhaps I can transfer some settings to my own setup.

Kind regards,

vrugaitis

daphnissov
Immortal

Well, to be fair, you're using a RackStation model with more CPU horsepower and more RAM, but it's odd you're seeing that much of a difference between file-based and block-based LUNs. There is a write cache setting in Storage Manager under HDD/SSD -> General. I don't know to what extent they can use RAM as a read or write cache buffer. But if you only have 4 x 4 TB drives at 5,400 RPM and you're seeing those numbers even from sequential reads, then that's pretty damn good. You might try removing one of the uplinks and going with two rather than three; some systems have a hard time coping with an odd number of links under a round-robin MPIO setting.
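If you want to experiment from the ESXi shell, something along these lines should show and tune the round-robin policy. The naa identifier below is a placeholder for your LUN's actual device ID, and --iops 1 (switch paths on every I/O instead of every 1,000) is just a commonly suggested starting point, not a guarantee:

```shell
# List devices and their current multipathing / PSP settings:
esxcli storage nmp device list

# For the iSCSI LUN (replace the naa identifier with your own device ID),
# make round-robin rotate paths every I/O instead of every 1000 I/Os:
esxcli storage nmp psp roundrobin deviceconfig set \
  --device naa.XXXXXXXXXXXXXXXX --type iops --iops 1
```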

vKopp
Contributor

Hi Vrugaitis,

1) Were you able to fix the ESXi freeze under Synology high load? If yes, how?

2) Is a file-based LUN really faster than a block-based LUN?

Thanks in advance for your feedback!

Regards,

Cop

dogdaynoon
Enthusiast

Were you able to fix this issue? If so, how did you go about it?

Thanks,
James
