Solved: vSAN 6.2 AF cluster extreme IO latency when host placed into maintenance mode

jarrodbradford · ‎09-11-2017

Hey Everyone.

Had a severe vSAN problem in IBM SoftLayer over the weekend and I'd like some thoughts on what happened. First, some details on the environment:

- IBM SoftLayer provided hosts
- 6 hosts in a vSphere 6.0 U2 / vSAN 6.2 all-flash cluster
- Each host has 1 disk group comprising:
  - Cache device: 1.2TB write-intensive SSD
  - Capacity devices: 3x 1.8TB general-purpose SSD
- All VMs use FTT=1 with RAID-5 erasure coding
- Plenty of CPU and RAM available
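
For a rough sense of scale, the raw and RAID-5 usable capacity implied by this layout can be worked out as below. This is only a back-of-the-envelope sketch: it assumes vSAN's 3+1 RAID-5 layout for FTT=1 and ignores slack space and other overheads, and the variable names are made up for illustration.

```python
# Back-of-the-envelope capacity figures for the cluster described above.
HOSTS = 6
CAPACITY_DISKS_PER_HOST = 3
CAPACITY_DISK_TB = 1.8

# Assumption: vSAN RAID-5 (FTT=1) uses 3 data + 1 parity components,
# so roughly 3/4 of raw capacity holds data.
RAID5_DATA_FRACTION = 3 / 4

raw_per_host_tb = CAPACITY_DISKS_PER_HOST * CAPACITY_DISK_TB
raw_cluster_tb = HOSTS * raw_per_host_tb
usable_cluster_tb = raw_cluster_tb * RAID5_DATA_FRACTION

print(f"Raw capacity per host:   {raw_per_host_tb:.1f} TB")
print(f"Raw capacity in cluster: {raw_cluster_tb:.1f} TB")
print(f"RAID-5 usable (approx):  {usable_cluster_tb:.1f} TB")
```

The per-host figure is only an upper bound on what a full data migration of one host has to move; the real amount depends on how full that disk group is.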

What happened is that when I went to put one host into maintenance mode, I accidentally chose the "Full data migration" option instead of the "Ensure availability" option. Within 30 minutes of doing this, I was getting log congestion warnings. After a couple of hours I had hosts disconnecting from vCenter and was seeing VM write latency numbers around 450ms. This resulted in VMs crashing and application data loss.

We make use of vR Ops, so I have lots of stats available for the workload. Right before I placed the host into maintenance mode, the cluster was generating around 500 IOPS which is nowhere near what I'd consider to be high for an all flash vSAN cluster. Even a small one like this. It took nine hours for vSAN to finally migrate that data onto the five other hosts in the cluster and get the host into maintenance mode.

I had engaged VMware support on this issue as it was happening and was told the following:

- No safe way exists to abort a running evacuation. You just have to wait and deal with it.
- They reduced the "copy to write" value from 50 to 5 about 6 hours into the evacuation, and said that this is something you tune after having a problem like this.
- I was told that performing a full data migration will always cause problems like this. That seems to be a serious problem if true.
- I asked whether I'd have the same problem if the cache device in any one disk group failed and it took me more than an hour to replace it. I was told that this would not happen, which frankly confuses me based on how I understand vSAN rebuild operations. Why would a full data evacuation of a disk group be more or less impactful than vSAN recovering (after an hour) from a disk group failure?

Any comments from the community on this situation?

TheBobkin · ‎09-13-2017

Hello Jarrod,

"Right before I placed the host into maintenance mode, the cluster was generating around 500 IOPS which is nowhere near what I'd consider to be high for an all flash vSAN cluster"

How much would be typical for this set-up from your metrics/tests during normal function?

Is it possible that congestion issues were occurring before you evacuated this host?

Congestion logging only starts in vmkernel.log once it reaches dangerous levels (200) so you may or may not be able to find this out from the logs.

Hosts and VMs can disconnect when a congested host reaches its congestion limit (255). If it was log congestion, then this was likely caused by LLOG consumption using the whole default capacity (24GB); this assigned capacity should be temporarily increased to avoid this occurring.

What build of vSAN 6.2 are you using? Congestion issues started abating after 6.0 U2 P04 (and were more easily managed when they did occur) and don't appear to affect 6.2 U3/6.5 clusters under normal circumstances (like evacuating a host).

If you are pre build-4600944, then find out if your RTQ with SoftLayer allows for updating to this build, a later build of 6.2 or 6.0 U3.

Bob

jarrodbradford · ‎09-13-2017

Thanks Bob. That's helpful feedback. We are running an older 6.0 U2 build, specifically 4192238. We still have a ticket open with VMware via SoftLayer and the Cork, Ireland team is working on it. I'll see if we have the option of updating the hosts and staying in support. This is in the IBM SoftLayer bare metal side of things and isn't IBM Cloud, so I think I may have more version control here.
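
As a quick sanity check, the build comparison Bob suggests can be scripted. This is a minimal sketch, not an official tool: the cut-off is the build number from his post, the sample string is an assumed format modelled on what `vmware -v` prints on an ESXi host, and the function names are hypothetical.

```python
import re

# Build number Bob gives as the cut-off in his post.
CUTOFF_BUILD = 4600944


def parse_build(version_line: str) -> int:
    """Extract the integer that follows 'build-' in a version string."""
    match = re.search(r"build-(\d+)", version_line)
    if match is None:
        raise ValueError(f"no build number found in {version_line!r}")
    return int(match.group(1))


def is_pre_cutoff(version_line: str, cutoff: int = CUTOFF_BUILD) -> bool:
    """Return True if the host's build is older than the cut-off build."""
    return parse_build(version_line) < cutoff


# Assumed output format; the build number is the one we are running.
sample = "VMware ESXi 6.0.0 build-4192238"
print(is_pre_cutoff(sample))  # prints True
```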

Here is a link to a screenshot from our vR Ops stats on this cluster so you can see what we ran into. The command to enter maintenance mode with full migration was almost exactly at 8PM. There was an uptick in IOPS on the cluster that starts then. Congestions rapidly start climbing and peak at 132 at 9PM. Within a few minutes of that, write latency jumps to over 180ms. There is a sudden drop in congestions from 128 to 17 at 2:18 AM. Almost immediately, there is a massive jump in write latency up to 453ms which I believe lines up with when our app team lost their database VM.
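
To put those numbers next to the thresholds Bob quoted (logging at 200, limit at 255), here is a small sketch. It assumes the vR Ops congestion metric is on the same 0-255 scale as the values Bob refers to, which I have not confirmed; the function name and labels are made up for illustration.

```python
# Thresholds as quoted earlier in this thread; treat them as assumptions.
LOG_THRESHOLD = 200      # congestion starts being logged in vmkernel.log
CONGESTION_LIMIT = 255   # hosts and VMs can disconnect at this level


def classify_congestion(value: int) -> str:
    """Bucket a congestion reading against the two quoted thresholds."""
    if value >= CONGESTION_LIMIT:
        return "at limit"
    if value >= LOG_THRESHOLD:
        return "logged in vmkernel.log"
    return "below logging threshold"


# Readings taken from the vR Ops timeline described above.
readings = [
    ("9PM peak", 132),
    ("just before the 2:18 AM drop", 128),
    ("just after the 2:18 AM drop", 17),
]

for label, value in readings:
    print(f"{label}: {value} -> {classify_congestion(value)}")
```

If the scales really do match, none of these readings would have crossed the logging threshold, so vmkernel.log may not show the congestion at all.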

Thanks again for taking the time to look at this. Your assistance is very much appreciated.

jarrodbradford · ‎09-17-2017

VMware and IBM SoftLayer support finally got back to me on what we've run into. There are two major things that conspired to cause our problems.

NIC driver and firmware - When ordering a bare metal host from SoftLayer with vSphere 6.0 licensed and installed, they are currently deploying SuperMicro 10Gb NICs (Intel X540-AT2) and use version 4.1.1.4-iov of the ixgbe driver. This version is not on the HCL and they recommend upgrading to 4.4.1. Support indicated that there are features that vSAN needs that are not activated in the version that SoftLayer deploys.

vSphere / ESXi version - When ordering a bare metal host from SoftLayer with vSphere 6.0 licensed and installed, they are currently deploying 6.0 U2 Patch 3. This release, which as far as I can tell is the one that SoftLayer supports, has a known vSAN bug called the "Zero drain" bug. It seems that this is resolved in 6.0 U3, but so far I have been unable to get clarification from the IBM SoftLayer "Hardware Solutions Group" on whether we can deploy this version of code.

Ultimately, it seems that our problem is due to old code, both for the NICs and the version of ESXi that is deployed by SoftLayer. I will update this thread again if and when I get an updated support statement from SoftLayer. Do note that this is in the bare metal, "build your own" SoftLayer environment and not what IBM deploys if you are using VMware Cloud Foundation on IBM Cloud.
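
For anyone wanting to check their own hosts against the driver version support recommended, a version comparison can be sketched like this. It is a hedged example only: it assumes the part after the first "-" (such as "iov") is a build suffix that can be ignored for ordering, and the function names are hypothetical.

```python
def version_key(version: str) -> tuple:
    """Turn '4.1.1.4-iov' into (4, 1, 1, 4), dropping any '-suffix'."""
    numeric_part = version.split("-", 1)[0]
    return tuple(int(part) for part in numeric_part.split("."))


def is_older(installed: str, recommended: str) -> bool:
    """Return True if the installed driver sorts before the recommended one."""
    return version_key(installed) < version_key(recommended)


# Versions taken from the support response above.
print(is_older("4.1.1.4-iov", "4.4.1"))  # prints True
```

Tuples compare element by element, so 4.1.x sorts before 4.4.x regardless of how many trailing components each version has.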
