I have a Fujitsu RX 300 S7 with ESXi 6.0 U3 and a Synology RS2416+ containing a RAID 10 with 4x 4 TB WD Red HDDs. I have created a file-based LUN on the Synology and connected the target to the ESXi host over three NICs using round-robin multipathing. This means that all three NICs are used for the connection between the ESXi host and the Synology (I followed this tutorial).
I have run several tests with Iometer and CrystalDiskMark and the results are great (350 MB/s and ~50k read IOPS), but during the tests the ESXi hypervisor freezes. If I disable the target on the Synology and re-enable it, the ESXi host and the test VMs resume working as if nothing had happened.
In the vmkernel.log I can find several entries containing this line: "Waiting for timed out HB" (please find a screenshot attached).
What is causing this error? How can I fix the timeouts? I really need the higher throughput of all three NICs for better performance.
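In case the screenshot is hard to read, the entries can also be pulled straight from the log on the host over SSH (on ESXi 6.x the live log sits under /var/log):

```shell
# On the ESXi host (SSH enabled), show the heartbeat-timeout entries.
grep "Waiting for timed out HB" /var/log/vmkernel.log

# Include a couple of surrounding lines for context (device, LUN ID).
grep -B 2 -A 2 "timed out HB" /var/log/vmkernel.log
```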
Thank you in advance!
What version of DSM are you running on the Synology? You should make sure you're always on the latest. I have two units in my lab, and Synology routinely releases updates that address "iSCSI stability issues" and other things that can cause similar behavior.
We are running the newest available version, DSM 6.1.3-15152 Update 3.
It seems that ESXi somehow loses the iSCSI connection when the benchmarks are too intensive. In the current setup, two LUNs are attached to a VM; the VM runs CrystalDiskMark against one LUN and performs some Windows operations (copying a larger folder) on the second LUN. After a while, the VM freezes.
Is there any way to limit the network traffic to prevent this behaviour? I mean, we need good throughput, but also some threshold so we don't overload the ESXi host or the Synology.
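One lever I know of is not a true network throttle, but it does cap how much I/O the host will queue against the LUN at once: the per-device outstanding-request limit. A sketch, assuming the Synology LUN's naa ID (the one below is a placeholder):

```shell
# Find the naa ID of the Synology LUN; the display name line contains it.
esxcli storage core device list | grep "SYNOLOGY"

# Cap the number of outstanding requests ESXi issues to that LUN
# (32 is a common default; lowering it eases pressure on the array).
esxcli storage core device set \
  --device naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
  --sched-num-req-outstanding 16
```

Whether that actually stops the freezes on your setup I can't say; it's just the knob I'd try first before touching the networking.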
This could be due to your use of the file-based LUN. While that option is attractive because it offers the richest set of VMware data services, it comes at a price: in my tests it performed worst and was less stable. You may want to repeat your tests using the block-level LUN option instead. I've found it produces lower latency and higher throughput.
Thank you for the fast reply! My results are basically the complete opposite: the file-based LUN is out-performing the block-based LUN.
Furthermore, the file-based LUN managed to reach ~50,000 read IOPS, whereas the block-based LUN only reached ~10,000. Does Synology have some kind of internal caching? Is the RAM used for that? To what extent can the Synology cache? Gigabytes of data, or rather only small amounts, like in the IOPS benchmark? Because the HDDs only spin at 5,400 rpm.
Just to be sure: you have a well-performing setup yourself? Could you please provide some information about it? Perhaps I can transfer some settings to my own setup.
Well, to be fair, you're using a RackStation model with more CPU horsepower and more RAM, but it's odd that you're seeing that much of a difference between file-based and block-based LUNs. There is a write-cache setting in Storage Manager under HDD/SSD -> General. I don't know to what extent they can use RAM as a read or write cache buffer. But if you only have 4x 4 TB drives at 5,400 RPM and you're seeing those numbers even from sequential reads, then that's pretty damn good. You might try removing one of the uplinks and going with two rather than three. Some systems have a hard time coping with an odd number of links in a round-robin MPIO setup.
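If you want to experiment before pulling a NIC, the round-robin path-switching behaviour can also be inspected and tuned from the CLI. A sketch (the naa ID is a placeholder for your LUN):

```shell
# Show the current path selection policy and round-robin settings.
esxcli storage nmp device list --device naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Switch paths after every I/O instead of the default 1000 IOPS per path;
# this often spreads load more evenly across the iSCSI uplinks.
esxcli storage nmp psp roundrobin deviceconfig set \
  --device naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
  --type iops --iops 1
```

No promises it fixes the freezes, but it's a cheap experiment and easy to revert.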
1) Were you able to fix the ESXi freeze under high Synology load? If yes, how?
2) Is the file-based LUN really faster than the block-based LUN?
Thx in advance for your feedback!