First of all - the good news is that this is a lab/dev environment - so no production loss if something terminal has happened here. However, we do operate a similar setup in production so I'm very keen to understand what caused the failure as I never want to have to deal with this 'for real'.
This is a 4 host cluster, running 6.5.0 build 10175896, with an all-flash vSAN comprising one cache and one capacity disk per host. The hardware is a bit old hence running 6.5 (I did say it was the lab) and I will immediately confess that the SSDs (Samsung Evo 850s) are not on the HCL - they don't have high enough performance or endurance. However, a quick check of SMART indications on the disks during a shutdown this morning showed no signs that they were having any issues. vCenter also has no complaints about them and they are all appearing without error. I'll also add that this lab setup has worked pretty flawlessly for over a year now.
So on to my two days of "fun" ...
The first indication I had of a problem was when I tried to change a VM's storage policy to one that included encryption. This failed due to lack of space on the vSAN datastore, so I attempted to delete some old test VMs to free up disk space, and this just hung. At this point, things started to go downhill very quickly. Further excitement was added by DRS vMotioning machines around whilst I was trying to debug, so DRS was swiftly switched to manual-only mode.
vmkernel logs were showing complaints like:
2019-06-11T20:51:30.651Z cpu2:65675)WARNING: FS3J: 2113: Error freeing journal block (returned 0) <type 1 addr 68816> for 5cb8ea9e-e3c7361e-5dcb-0022195d7320: No space left on device
2019-06-11T20:51:30.652Z cpu2:65675)Vol3: 3016: Error closing the volume: No space left on device. Eviction fails.
2019-06-12T07:52:00.025Z cpu7:68393)WARNING: HBX: 274: No space for the journal on volume 5cb62da9-52994f44-0e50-14feb5d7d8e1 ("a92db65c-e880-df20-845e-14feb5d7d8e1"). Volume will remain in read-only metadata mode with limited write support unti$
I had a number of VMs showing up as orphaned in vCenter by now, which didn't look encouraging.
Looking around other posts and KB articles, it seems that I'm stuck in a situation where there isn't enough space to create the journal on the filesystem, which (sensibly) forces it into read-only mode; but because of that, I can't delete anything to free up space. The vSAN datastore showed around 96% use (output of df -h on one of the hosts). I also looked in the vSAN config around this point and could see all eight disks; from memory, four were cache and four were capacity - as expected.
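That 96% figure fits the journal errors above. As a back-of-envelope check (the per-disk sizes come from the esxcli output further down; the script itself is just illustrative arithmetic), the remaining slack is tiny compared with the roughly 25-30% free space VMware recommends keeping on a vSAN datastore:

```shell
# Back-of-envelope slack check: 4 hosts x 1 x 220.56 GB capacity disks ~= 882 GB raw
raw_gb=882
used_pct=96          # what df -h reported
awk -v r="$raw_gb" -v u="$used_pct" \
    'BEGIN { printf "free: %.0f GB (%.0f%% slack)\n", r*(100-u)/100, 100-u }'
```

That works out to about 35 GB of slack across the whole cluster - not much room for journals, rebuilds or metadata once anything goes wrong.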
At this point, I shut down the four hosts from their consoles (forcibly stopping the VMs) and did the quick SMART disk check I mentioned earlier.
Bringing them back online, I was greeted with vCenter showing every VM as either orphaned, inaccessible, or - in the case of four of them - unable even to show the VM name (vCenter shows only a vSAN path and a UUID, marked as inaccessible).
I now have a catalogue of vSAN health alerts. It is the sort of screen that makes you very glad this is the lab:
vSphere cluster members do not match vSAN cluster members
Stats master election
Stats DB object
Current cluster situation
Hosts disconnected from VC
vSAN object health
vCenter is also claiming that two of the hosts (cc03 and cc04) are not responding (although I can happily reach them). I took a brief diversion here to check the network - each host has 2x 10G vSAN connections operating as active/standby. The network switches are behaving as expected and everything looks fine there. A ping check from one host to the other three showed that 9000 byte packets were getting through to the vmkernel adapters used for vSAN, and management connectivity is fine (I am logged in via SSH to all four hosts on the same IP that vCenter uses). So no clear finger of blame at the network here either.
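For reference, the jumbo-frame check was along these lines (the vmk interface name and peer address are placeholders for my actual ones; 8972 bytes of ICMP payload plus 28 bytes of headers makes a 9000-byte packet, and -d sets don't-fragment so the MTU path is genuinely exercised):

```
vmkping -I vmk2 -d -s 8972 <peer vSAN vmk IP>
```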
Whilst I've been typing this, I have been waiting for the output of 'esxcli vsan cluster get' to appear in all four SSH windows. Two hosts (cc01 and cc02) replied immediately with sensible-looking output showing one as master and the other as an agent, each listing four member UUIDs. About 15 minutes later cc04 responded; 30 minutes on, I'm still waiting for cc03. vCenter now shows cc04 as reachable again, so things are potentially improving.
Running df -h again, I now see that the datastore is at 99% capacity, so it is clearly doing something in the background.
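To see whether that background activity is resync traffic, the resync commands in the same debug namespace should work on this build (I'm going from memory on the exact sub-commands, so treat these as a pointer rather than gospel):

```
esxcli vsan debug resync summary get
esxcli vsan debug resync list
```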
The output of 'esxcli vsan debug disk overview' currently differs from what I remember seeing in vCenter earlier. It now thinks I have three cache disks and five capacity disks:
[root@cc01:~] esxcli vsan debug disk overview
UUID                                  Name                                       Owner                    Ver  Disk Group                            Disk Tier  SSD    Metadata  Ops    Congestion  CMMDS  VSI    Capacity   Used       Reserved
------------------------------------  -----------------------------------------  -----------------------  ---  ------------------------------------  ---------  -----  --------  -----  ----------  -----  -----  ---------  ---------  ---------
52ecabbb-0cb5-6c75-e079-f63c3be45c7f  naa.5002538d41d892e1                       cc03.lab1.xxxxxxx.xx.xx    5  52ecabbb-0cb5-6c75-e079-f63c3be45c7f  Cache      false  N/A       N/A    No          true   false  N/A        N/A        N/A
526f0487-00fc-a4d2-0e52-183b1cc9e80a  naa.5002538d41d22299                       cc03.lab1.xxxxxxx.xx.xx    5  52ecabbb-0cb5-6c75-e079-f63c3be45c7f  Capacity   false  N/A       N/A    No          true   false  220.56 GB  220.56 GB  190.40 GB
52129b93-ab5b-bac3-8a9c-bebb137abb38  naa.5002538d41acade5                       cc02.lab1.xxxxxxx.xx.xx    5  52129b93-ab5b-bac3-8a9c-bebb137abb38  Cache      true   green     green  No          true   true   N/A        N/A        N/A
52d0a078-e5e6-c68b-1cf8-fef0d722e4ca  naa.5002538d41ad15c1                       cc02.lab1.xxxxxxx.xx.xx    5  52129b93-ab5b-bac3-8a9c-bebb137abb38  Capacity   true   green     green  No          true   true   220.56 GB  220.56 GB  198.24 GB
5269ba46-c950-0310-2665-c728611f77ad  naa.5002538d425bea90                       N/A                       -1  N/A                                   Capacity   true   red       red    No          false  false  N/A        N/A        N/A
52371ee6-cd84-5ff6-0633-51f1683956e1  vsan:52371ee6-cd84-5ff6-0633-51f1683956e1  N/A                       -1  N/A                                   Capacity   true   green     red    No          false  false  N/A        N/A        N/A
528cc02c-8e77-b354-e486-93d4a3a5295d  naa.5002538d41d89b24                       cc04.lab1.xxxxxxx.xx.xx    5  528cc02c-8e77-b354-e486-93d4a3a5295d  Cache      false  N/A       N/A    No          true   false  N/A        N/A        N/A
520174c9-f800-88aa-ed61-03e26a2db850  naa.5002538d41e057b9                       cc04.lab1.xxxxxxx.xx.xx    5  528cc02c-8e77-b354-e486-93d4a3a5295d  Capacity   false  N/A       N/A    No          true   false  220.56 GB  215.46 GB  164.96 GB
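Given the inaccessible objects, the obvious next step looks to be checking per-object state from the same debug namespace (again from memory, so the exact output shape may differ on this build):

```
esxcli vsan debug object health summary get
esxcli vsan debug object list
```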
It looks like I've somehow lost cc01's disk group (1x cache and 1x capacity) as well. I see the same output if I run this on another host, so at least there is consistency. Quite why that has happened, I have no idea.
After all of that rambling - I guess I have two questions:
1) Is there a way that I can deal with the logjam that I seem to have here?
2) Is this expected behaviour? If so, I am wondering whether setting Object Space Reservation to 100% is the right way to ensure that, in future, a sudden failure doesn't put pressure on the remaining free space.
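For context on question 2, my understanding (my own reasoning, not from any VMware doc) is that 100% Object Space Reservation thick-reserves every replica up front, so with the default FTT=1 mirroring each provisioned GB pins two GB of raw capacity:

```shell
# Raw capacity pinned by a thick-reserved, FTT=1 mirrored VM (hypothetical sizes)
vm_gb=100     # provisioned VMDK size
ftt=1         # failures to tolerate; RAID-1 mirroring keeps ftt+1 copies
awk -v v="$vm_gb" -v f="$ftt" 'BEGIN { printf "raw reserved: %d GB\n", v*(f+1) }'
```

So it would guarantee headroom for the VMs themselves, but at the cost of permanently consuming the very slack space I was short of - which is why I'm not sure it's the right answer.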