I had a severe vSAN problem in IBM SoftLayer over the weekend and I'd like some thoughts on what happened. First, some details on the environment:
- IBM SoftLayer provided hosts
- 6 hosts in a vSphere 6.0 U2 / vSAN 6.2 all-flash cluster
- Each host has 1 disk group made up of:
  - Cache device: 1.2TB write-intensive SSD
  - Capacity devices: 3x 1.8TB general purpose SSD
- All VMs are using FTT=1 with RAID-5 erasure coding
- Plenty of CPU and RAM availability
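For scale, here is a rough back-of-the-envelope sketch of the capacity this layout implies. The 4/3 overhead is the usual 3+1 figure for FTT=1 RAID-5 erasure coding. It ignores slack space, any dedup/compression, and on-disk format overhead, so treat the output as approximate.

```python
# Back-of-the-envelope capacity for the cluster described above.
# Cache devices do not count toward capacity in an all-flash disk group.
HOSTS = 6
CAPACITY_DEVICES_PER_HOST = 3
CAPACITY_DEVICE_TB = 1.8
RAID5_OVERHEAD = 4 / 3  # 3 data + 1 parity components (FTT=1, RAID-5)


def raw_capacity_tb(hosts: int) -> float:
    """Raw capacity-tier size in TB for the given number of hosts."""
    return hosts * CAPACITY_DEVICES_PER_HOST * CAPACITY_DEVICE_TB


def usable_capacity_tb(hosts: int) -> float:
    """Approximate usable TB if everything is stored as RAID-5."""
    return raw_capacity_tb(hosts) / RAID5_OVERHEAD


if __name__ == "__main__":
    # 6 hosts is normal operation; 5 is with one host in maintenance mode.
    for hosts in (HOSTS, HOSTS - 1):
        print(f"{hosts} hosts: raw {raw_capacity_tb(hosts):.1f} TB, "
              f"approx usable {usable_capacity_tb(hosts):.1f} TB")
```

With one host out, the remaining five still satisfy the four-host minimum for RAID-5, so a full evacuation is at least possible in this layout.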
What happened is that when I went to put one host into maintenance mode, I accidentally chose the "Full data migration" option instead of the "Ensure accessibility" option. Within 30 minutes of doing this, I was getting log congestion warnings. After a couple of hours, hosts were disconnecting from vCenter and I was seeing VM write latency of around 450ms. This resulted in VMs crashing and application data loss.
We make use of vR Ops, so I have lots of stats available for the workload. Right before I placed the host into maintenance mode, the cluster was generating around 500 IOPS, which is nowhere near what I'd consider high for an all-flash vSAN cluster, even a small one like this. It took nine hours for vSAN to finish migrating the data onto the five other hosts in the cluster and get the host into maintenance mode.
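To put the nine hours in perspective, here is a rough sketch of the average copy rate that implies. I don't have the exact amount of data that was on the host, so the fill levels below are placeholder assumptions. The upper bound is the 5.4 TB of capacity devices (3x 1.8TB) in the host's disk group.

```python
# Average evacuation rate implied by a nine-hour full data migration.
HOST_CAPACITY_TB = 3 * 1.8   # capacity devices in one disk group
EVACUATION_HOURS = 9


def avg_rate_mb_s(data_tb: float, hours: float) -> float:
    """Average rate in MB/s (decimal units) needed to move data_tb in hours."""
    return data_tb * 1_000_000 / (hours * 3600)


if __name__ == "__main__":
    # Fill levels are assumptions for illustration, not measured values.
    for fill in (0.25, 0.50, 1.00):
        data_tb = HOST_CAPACITY_TB * fill
        rate = avg_rate_mb_s(data_tb, EVACUATION_HOURS)
        print(f"{fill:.0%} full ({data_tb:.2f} TB): {rate:.0f} MB/s average")
```

Even if the host had been completely full, that works out to roughly 167 MB/s on average, which seems modest for all-flash hardware.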
I engaged VMware support on this issue while it was happening and was told the following:
- No safe way exists to abort a running evacuation. You just have to wait and deal with it.
- About 6 hours into the evacuation, they reduced the "copy to write" value from 50 to 5. They said this is something you tune after having a problem like this.
- I was told that performing a full data migration will always cause problems like this. That seems like a serious problem if true.
- I asked whether I'd hit the same problem if the cache device failed on any one disk group and it took me more than an hour to replace it. I was told that this would not happen, which frankly confuses me based on how I understand vSAN rebuild operations. Why would a full data evacuation of a disk group be more or less impactful than vSAN recovering (after an hour) from a disk group failure?
Any comments from the community on this situation?