herta_vde
Contributor

questions regarding VM guest file system corruptions

We have a VMware environment which runs off a SAN over iSCSI.  The SAN is also used by UNIX servers which are direct-attached over FC.  The VMware cluster runs 30-some VM guests on 5 ESXi hosts.  We use the Standard Edition.  3 hosts in the cluster are at ESXi 6.5; 2 are still at 6.0.

Recent (past 3 months) changes:

- addition of extra VM guests, one of which is fairly I/O-intensive

- move to Veeam backup

  on the whole this is far less I/O-intensive than our previous backup software, which ran at the VM guest file system level, but we do have a few large vmdk's which take well over 48 hours for a full backup (taken once per month).  Previously, the full backup was split up into groups of directories and spread out over several days, with no single backup lasting over 10 hours.

- move from ESXi 6.0 to 6.5

- possibly others that do not come to mind right now

Within a span of 3 months, we had 3 file system corruptions on 3 different VM guests.  We're not ruling out 3 unrelated events at the VM guest level, but that seems unlikely, as the 3 VM guests are unrelated: 2 run different versions of Windows (NTFS), the 3rd runs Linux (ext4).  They all run on different datastores.  Two of these datastores sit on the same pool in the SAN; the third sits on a different one.

So far, only VM guests are affected.  In the 6 years I've worked at this company, we only had 1 other file system corruption, which was due to a DFS misconfiguration.

Our SAN vendor examined the SAN and hardware-wise it got a clean bill of health.

They did recommend limiting the ESXi queue depth per LUN to 32, and per port to 512 / (number of LUNs on the port) / (number of ESXi hosts), capped at 32.
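
To illustrate the port formula with hypothetical numbers (these are not our actual counts): with 4 LUNs presented on a SAN port and 5 ESXi hosts logging into it, the recommendation works out to 512 / 4 / 5 = 25.6, rounded down to 25, which is under the cap.  With only 2 LUNs on that port, 512 / 2 / 5 = 51.2 would be capped at 32.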

On the ESXi hosts, there are 2 parameters that more or less correspond to this:

iscsivmk_HostQDepth (Maximum Outstanding Commands Per Adapter)

iscsivmk_LunQDepth (Maximum Outstanding Commands Per LUN)

More or less, because the one software iSCSI adapter connects to two of the SAN ports (2 NICs, each connected to a different member of a stacked switch).

We found MaxCommands, which corresponds to iscsivmk_HostQDepth and has a default value of 128.

I cannot find any parameter that might translate to the iscsivmk_LunQDepth, nor did I find what its default value might be.
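
One possible way around that (an assumption on my part, based on how ESXi reports per-device limits, so please verify): the effective queue depth of each device shows up as "Device Max Queue Depth" in the output of

  esxcli storage core device list

or, for a single device,

  esxcli storage core device list -d <device id>

For a software iSCSI device, that value should reflect the effective iscsivmk_LunQDepth; I've seen 128 mentioned as the default on 6.x, but I can't confirm it.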

You can check the current queue length from esxtop, but queue length is not the same as queue depth.
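
For reference, esxtop shows both side by side in the disk device view (field names from memory, so treat this as an assumption): press u, then look at

  DQLEN   device queue depth
  ACTV    commands currently active on the device
  QUED    commands held in the VMkernel queue because DQLEN was reached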

You can set both parameters using the command:

"esxcli system module parameters set -p "iscsivmk_HostQDepth=n iscsivmk_LunQDepth=m" -m iscsi_vmk"

where n and m would be the values, but as I do not know how to unset them, I'd like to know the current value of iscsivmk_LunQDepth, in case changing the default makes things worse.

(Setting the parameters requires a reboot.)
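
(On unsetting: my understanding, which I have not tested, is that passing an empty parameter string reverts the module to its built-in defaults after the next reboot:

  esxcli system module parameters set -m iscsi_vmk -p ""

If someone can confirm or correct this, please do.)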

A rather lengthy introduction to a number of concrete questions:

  • How can I find the current value for iscsivmk_LunQDepth? 
    ("esxcli system module parameters list -m iscsi_vmk" does not list default values)
  • Given that the iSCSI adapter addresses 2 SAN ports, should we multiply the recommended setting for port-level queue depth by two and assign that value to iscsivmk_HostQDepth?
  • If we set the queue depth at ESXi-level, should we also make changes at the VM guest level?
  • The current queue depth settings cause latency.  Can latency cause file system corruptions at the level of VM guests?  (See the esxtop note after this list.)
  • Can you think of other things we could investigate to get to the root cause of these corruptions?
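
On the latency question specifically, esxtop can help narrow down where the latency comes from (again, field names from memory, so double-check): in the disk views, DAVG/cmd is latency attributable to the device/array, KAVG/cmd is time spent in the VMkernel (mostly queuing), and GAVG/cmd, roughly DAVG + KAVG, is what the guest sees.  High KAVG with modest DAVG would point at the queue depth settings; high DAVG would point back at the SAN or fabric.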