VMware Cloud Community
paulthornton
Contributor

vSAN datastore suddenly totally unavailable

Hi all

First of all - the good news is that this is a lab/dev environment - so no production loss if something terminal has happened here.  However, we do operate a similar setup in production so I'm very keen to understand what caused the failure as I never want to have to deal with this 'for real'.

This is a 4 host cluster, running 6.5.0 build 10175896, with an all-flash vSAN comprising one cache and one capacity disk per host.  The hardware is a bit old hence running 6.5 (I did say it was the lab) and I will immediately confess that the SSDs (Samsung Evo 850s) are not on the HCL - they don't have high enough performance or endurance.  However, a quick check of SMART indications on the disks during a shutdown this morning showed no signs that they were having any issues.  vCenter also has no complaints about them and they are all appearing without error.  I'll also add that this lab setup has worked pretty flawlessly for over a year now.

So on to my two days of "fun" ...

The first indication I had of a problem was when I tried to change a VM's storage policy to one that included encryption.  This failed due to lack of space on the vSAN datastore, so I attempted to delete some old test VMs to free up disk space, and this just hung.  At this point, things started to go very quickly downhill.  Further excitement was added by DRS vMotioning machines around whilst I was trying to debug, so DRS was swiftly put into manual-only mode.

vmkernel logs were showing complaints like:

2019-06-11T20:51:30.651Z cpu2:65675)WARNING: FS3J: 2113: Error freeing journal block (returned 0) <type 1 addr 68816> for 5cb8ea9e-e3c7361e-5dcb-0022195d7320: No space left on device

2019-06-11T20:51:30.652Z cpu2:65675)Vol3: 3016: Error closing the volume: No space left on device. Eviction fails.

and

2019-06-12T07:52:00.025Z cpu7:68393)WARNING: HBX: 274: No space for the journal on volume 5cb62da9-52994f44-0e50-14feb5d7d8e1 ("a92db65c-e880-df20-845e-14feb5d7d8e1"). Volume will remain in read-only metadata mode with limited write support unti$
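For anyone sifting a large vmkernel log for the same symptom, here is a minimal Python sketch that pulls out the out-of-space messages. The regular expression is only an assumption based on the layout of the lines quoted above (timestamp, cpu:world, optional WARNING, module, source line, message), and the function name is made up for illustration.

```python
import re

# Layout assumed from the quoted lines:
#   <timestamp> cpu<N>:<world>)[WARNING: ]<module>: <line>: <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+cpu(?P<cpu>\d+):(?P<world>\d+)\)"
    r"(?P<warning>WARNING: )?(?P<module>\w+): (?P<line>\d+): (?P<msg>.*)$"
)

# Two of the lines quoted in the post, verbatim.
SAMPLE = [
    "2019-06-11T20:51:30.651Z cpu2:65675)WARNING: FS3J: 2113: Error freeing journal block "
    "(returned 0) <type 1 addr 68816> for 5cb8ea9e-e3c7361e-5dcb-0022195d7320: "
    "No space left on device",
    "2019-06-11T20:51:30.652Z cpu2:65675)Vol3: 3016: Error closing the volume: "
    "No space left on device. Eviction fails.",
]


def space_errors(lines):
    """Return (timestamp, module, message) for lines that mention 'no space'."""
    events = []
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if match and "no space" in match.group("msg").lower():
            events.append((match.group("ts"), match.group("module"), match.group("msg")))
    return events


for ts, module, msg in space_errors(SAMPLE):
    print(ts, module, msg)
```

Grouping the results by module (FS3J, Vol3, HBX) gives a quick view of which layers are complaining and when the first complaint appeared.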

I had a number of VMs showing up as orphaned in vCenter by now, which didn't look encouraging.

Looking around other posts and KB articles, it seems that I'm stuck in a situation where there's not enough space to create the journal on the filesystem, which (sensibly) forces it to be read-only; but because of that, I can't delete anything to free up space.  The vSAN filesystem showed around 96% use (output of df -h on one of the hosts).  I looked in the vSAN config around now and could see all eight disks, and from memory, four were cache and four were capacity - as expected.
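To catch this sort of thing before the journal runs out of room, the Use% column of df -h can be checked on a schedule. The sketch below is a rough Python illustration only: the sample text is hypothetical (not captured from these hosts), the column layout is assumed to be the usual "Filesystem Size Used Available Use% Mounted on", and the 90% threshold is an arbitrary value chosen for the example.

```python
# Hypothetical df -h output, for illustration only (not taken from the lab hosts).
SAMPLE_DF = """\
Filesystem   Size   Used Available Use% Mounted on
VMFS-5     100.0G  20.0G     80.0G  20% /vmfs/volumes/local-example
vsan       882.2G 846.9G     35.3G  96% /vmfs/volumes/vsanDatastore
"""


def parse_df(text):
    """Parse df -h style text into dicts; skips the header and malformed rows."""
    rows = []
    for line in text.strip().splitlines()[1:]:
        parts = line.split()
        if len(parts) < 6 or not parts[4].endswith("%"):
            continue
        rows.append({
            "filesystem": parts[0],
            "use_pct": int(parts[4].rstrip("%")),
            "mount": " ".join(parts[5:]),
        })
    return rows


def nearly_full(rows, threshold_pct=90):
    """Return the rows whose usage is at or above the (assumed) threshold."""
    return [row for row in rows if row["use_pct"] >= threshold_pct]


for row in nearly_full(parse_df(SAMPLE_DF)):
    print(f"{row['mount']} is at {row['use_pct']}%")
```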

At this point, I shut down the four hosts from their consoles (forcibly stopping the VMs) and did the quick SMART disk check I mentioned earlier.

Bringing them back online, I was greeted with vCenter showing all VMs as either: orphaned, inaccessible or - in the case of four of them - not even able to show the VM name (it is showing a vSAN path and a UUID, marked as inaccessible).

I now have a catalogue of vSAN health alerts.  It is the sort of screen that makes you very glad this is the lab:

vSphere cluster members do not match vSAN cluster members

Stats master election

Stats DB object

Current cluster situation

Disk capacity

Hosts disconnected from VC

Operation health

vSAN object health

vCenter is also claiming that two of the hosts (cc03 and cc04) are not responding (although I can happily reach them).  I took a brief diversion here to check the network - each host has 2x 10G vSAN connections operating as active/standby.  The network switches are behaving as expected and everything looks fine there.  A ping check from one host to the other three showed that 9000 byte packets were getting through to the vmkernel adapters used for vSAN, as well as management being possible (I am logged in via ssh to the four hosts on the same IP that vCenter uses).  So no clear finger of blame at the network here either.
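As a note on the jumbo-frame check: with a 9000-byte MTU the largest ICMP payload that fits in one unfragmented IPv4 packet is 9000 minus the 20-byte IPv4 header and the 8-byte ICMP header. A small Python sketch of that arithmetic follows, together with the vmkping argument list it implies. The interface name and target address in the usage lines are hypothetical, and the header sizes assume IPv4 with no options.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption)
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_icmp_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def vmkping_args(interface, target, mtu=9000):
    """Build a vmkping command: -I picks the vmkernel interface,
    -d sets don't-fragment, -s sets the payload size."""
    return ["vmkping", "-I", interface, "-d", "-s", str(max_icmp_payload(mtu)), target]


print(max_icmp_payload(9000))
# "vmk1" and "192.0.2.12" are placeholders, not values from this cluster.
print(" ".join(vmkping_args("vmk1", "192.0.2.12")))
```

If the don't-fragment ping at that size succeeds between every pair of vSAN vmkernel adapters, an MTU mismatch on the path is unlikely to be the cause.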

Whilst I've been typing this, I have been waiting for the output of 'esxcli vsan cluster get' to appear in all four ssh windows.  Two hosts (cc01 and cc02) replied immediately with a sensible-looking output showing one as master and the other as an agent, and with four member UUIDs.  About 15 minutes later, cc04 responded and I'm still waiting for cc03 about 30 minutes later.  vCenter now shows cc04 as being reachable again, so things are potentially improving.
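Since the same command took anywhere from seconds to over half an hour depending on the host, it can help to run it from a script with a timeout so that a slow host is flagged rather than blocking everything else. This is only a sketch using Python's standard subprocess module; the ssh invocation in the comment is a hypothetical example, and the timeout value is arbitrary.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (status, stdout); status is 'ok', 'failed' or 'timeout'."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return ("timeout", "")
    status = "ok" if done.returncode == 0 else "failed"
    return (status, done.stdout)


# Hypothetical usage from an admin workstation (host names as in the post):
# for host in ("cc01", "cc02", "cc03", "cc04"):
#     status, out = run_with_timeout(
#         ["ssh", f"root@{host}", "esxcli", "vsan", "cluster", "get"], 60)
#     print(host, status)
```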

Performing my df -h again, I now see that the datastore is at 99% capacity, so something is evidently going on in the background.

The output of 'esxcli vsan debug disk overview' currently differs from what I remember seeing in vCenter earlier.  It now thinks I have three cache disks and five capacity disks:

[root@cc01:~] esxcli vsan debug disk overview
UUID                                  Name                                       Owner                    Ver  Disk Group                            Disk Tier    SSD  Metadata  Ops    Congestion  CMMDS    VSI  Capacity   Used       Reserved
------------------------------------  -----------------------------------------  -----------------------  ---  ------------------------------------  ---------  -----  --------  -----  ----------  -----  -----  ---------  ---------  ---------
52ecabbb-0cb5-6c75-e079-f63c3be45c7f  naa.5002538d41d892e1                       cc03.lab1.xxxxxxx.xx.xx    5  52ecabbb-0cb5-6c75-e079-f63c3be45c7f  Cache      false  N/A       N/A    No           true  false  N/A        N/A        N/A
526f0487-00fc-a4d2-0e52-183b1cc9e80a  naa.5002538d41d22299                       cc03.lab1.xxxxxxx.xx.xx    5  52ecabbb-0cb5-6c75-e079-f63c3be45c7f  Capacity   false  N/A       N/A    No           true  false  220.56 GB  220.56 GB  190.40 GB
52129b93-ab5b-bac3-8a9c-bebb137abb38  naa.5002538d41acade5                       cc02.lab1.xxxxxxx.xx.xx    5  52129b93-ab5b-bac3-8a9c-bebb137abb38  Cache       true  green     green  No           true   true  N/A        N/A        N/A
52d0a078-e5e6-c68b-1cf8-fef0d722e4ca  naa.5002538d41ad15c1                       cc02.lab1.xxxxxxx.xx.xx    5  52129b93-ab5b-bac3-8a9c-bebb137abb38  Capacity    true  green     green  No           true   true  220.56 GB  220.56 GB  198.24 GB
5269ba46-c950-0310-2665-c728611f77ad  naa.5002538d425bea90                       N/A                       -1  N/A                                   Capacity    true  red       red    No          false  false  N/A        N/A        N/A
52371ee6-cd84-5ff6-0633-51f1683956e1  vsan:52371ee6-cd84-5ff6-0633-51f1683956e1  N/A                       -1  N/A                                   Capacity    true  green     red    No          false  false  N/A        N/A        N/A
528cc02c-8e77-b354-e486-93d4a3a5295d  naa.5002538d41d89b24                       cc04.lab1.xxxxxxx.xx.xx    5  528cc02c-8e77-b354-e486-93d4a3a5295d  Cache      false  N/A       N/A    No           true  false  N/A        N/A        N/A
520174c9-f800-88aa-ed61-03e26a2db850  naa.5002538d41e057b9                       cc04.lab1.xxxxxxx.xx.xx    5  528cc02c-8e77-b354-e486-93d4a3a5295d  Capacity   false  N/A       N/A    No           true  false  220.56 GB  215.46 GB  164.96 GB

It looks like I've somehow lost cc01's disk group (1x cache and 1x capacity) as well.  I see the same output if I run this on another host so there is consistency there.  Quite why

After all of that rambling - I guess I have two questions:

1) Is there a way that I can deal with the logjam that I seem to have here?

2) Is this expected behaviour?  If so, I am wondering if reserving 100% object space is the right way to go to ensure that in future a sudden failure doesn't result in pressure on available remaining space.

Thanks,

Paul.

paulthornton
Contributor

Reformatting the debug output to make it slightly more readable:

esxcli vsan debug disk overview

UUID                                  Name                                       Owner                    Ver  Disk Group                            Disk Tier    SSD  Metadata  Ops    Congestion  CMMDS    VSI  Capacity   Used       Reserved
------------------------------------  -----------------------------------------  -----------------------  ---  ------------------------------------  ---------  -----  --------  -----  ----------  -----  -----  ---------  ---------  ---------
52ecabbb-0cb5-6c75-e079-f63c3be45c7f  naa.5002538d41d892e1                       cc03.lab1.xxxxxxx.xx.xx    5  52ecabbb-0cb5-6c75-e079-f63c3be45c7f  Cache      false  N/A       N/A    No           true  false  N/A        N/A        N/A
526f0487-00fc-a4d2-0e52-183b1cc9e80a  naa.5002538d41d22299                       cc03.lab1.xxxxxxx.xx.xx    5  52ecabbb-0cb5-6c75-e079-f63c3be45c7f  Capacity   false  N/A       N/A    No           true  false  220.56 GB  220.56 GB  190.40 GB
52129b93-ab5b-bac3-8a9c-bebb137abb38  naa.5002538d41acade5                       cc02.lab1.xxxxxxx.xx.xx    5  52129b93-ab5b-bac3-8a9c-bebb137abb38  Cache       true  green     green  No           true   true  N/A        N/A        N/A
52d0a078-e5e6-c68b-1cf8-fef0d722e4ca  naa.5002538d41ad15c1                       cc02.lab1.xxxxxxx.xx.xx    5  52129b93-ab5b-bac3-8a9c-bebb137abb38  Capacity    true  green     green  No           true   true  220.56 GB  220.56 GB  198.24 GB
5269ba46-c950-0310-2665-c728611f77ad  naa.5002538d425bea90                       N/A                       -1  N/A                                   Capacity    true  red       red    No          false  false  N/A        N/A        N/A
52371ee6-cd84-5ff6-0633-51f1683956e1  vsan:52371ee6-cd84-5ff6-0633-51f1683956e1  N/A                       -1  N/A                                   Capacity    true  green     red    No          false  false  N/A        N/A        N/A
528cc02c-8e77-b354-e486-93d4a3a5295d  naa.5002538d41d89b24                       cc04.lab1.xxxxxxx.xx.xx    5  528cc02c-8e77-b354-e486-93d4a3a5295d  Cache      false  N/A       N/A    No           true  false  N/A        N/A        N/A
520174c9-f800-88aa-ed61-03e26a2db850  naa.5002538d41e057b9                       cc04.lab1.xxxxxxx.xx.xx    5  528cc02c-8e77-b354-e486-93d4a3a5295d  Capacity   false  N/A       N/A    No           true  false  220.56 GB  215.46 GB  164.96 GB
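The tier counts and the missing host can be read off this output mechanically. Below is a minimal Python sketch that splits each row on runs of two or more spaces (which keeps "Disk Group", "Disk Tier" and values like "220.56 GB" intact), then counts disks per tier, lists disks not present in CMMDS, and lists the owning hosts. The rows are the ones above with the padding reduced; the splitting rule is an assumption that holds for this output but is not a documented format.

```python
import re
from collections import Counter

OVERVIEW = """\
UUID  Name  Owner  Ver  Disk Group  Disk Tier  SSD  Metadata  Ops  Congestion  CMMDS  VSI  Capacity  Used  Reserved
----  ----  -----  ---  ----------  ---------  ---  --------  ---  ----------  -----  ---  --------  ----  --------
52ecabbb-0cb5-6c75-e079-f63c3be45c7f  naa.5002538d41d892e1  cc03.lab1.xxxxxxx.xx.xx  5  52ecabbb-0cb5-6c75-e079-f63c3be45c7f  Cache  false  N/A  N/A  No  true  false  N/A  N/A  N/A
526f0487-00fc-a4d2-0e52-183b1cc9e80a  naa.5002538d41d22299  cc03.lab1.xxxxxxx.xx.xx  5  52ecabbb-0cb5-6c75-e079-f63c3be45c7f  Capacity  false  N/A  N/A  No  true  false  220.56 GB  220.56 GB  190.40 GB
52129b93-ab5b-bac3-8a9c-bebb137abb38  naa.5002538d41acade5  cc02.lab1.xxxxxxx.xx.xx  5  52129b93-ab5b-bac3-8a9c-bebb137abb38  Cache  true  green  green  No  true  true  N/A  N/A  N/A
52d0a078-e5e6-c68b-1cf8-fef0d722e4ca  naa.5002538d41ad15c1  cc02.lab1.xxxxxxx.xx.xx  5  52129b93-ab5b-bac3-8a9c-bebb137abb38  Capacity  true  green  green  No  true  true  220.56 GB  220.56 GB  198.24 GB
5269ba46-c950-0310-2665-c728611f77ad  naa.5002538d425bea90  N/A  -1  N/A  Capacity  true  red  red  No  false  false  N/A  N/A  N/A
52371ee6-cd84-5ff6-0633-51f1683956e1  vsan:52371ee6-cd84-5ff6-0633-51f1683956e1  N/A  -1  N/A  Capacity  true  green  red  No  false  false  N/A  N/A  N/A
528cc02c-8e77-b354-e486-93d4a3a5295d  naa.5002538d41d89b24  cc04.lab1.xxxxxxx.xx.xx  5  528cc02c-8e77-b354-e486-93d4a3a5295d  Cache  false  N/A  N/A  No  true  false  N/A  N/A  N/A
520174c9-f800-88aa-ed61-03e26a2db850  naa.5002538d41e057b9  cc04.lab1.xxxxxxx.xx.xx  5  528cc02c-8e77-b354-e486-93d4a3a5295d  Capacity  false  N/A  N/A  No  true  false  220.56 GB  215.46 GB  164.96 GB
"""


def parse_overview(text):
    """Split the table on runs of 2+ spaces and return one dict per disk row."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header = re.split(r"\s{2,}", lines[0])
    rows = []
    for line in lines[1:]:
        if set(line) <= {"-", " "}:  # skip the dashed separator line
            continue
        cells = re.split(r"\s{2,}", line)
        if len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows


rows = parse_overview(OVERVIEW)
tiers = Counter(row["Disk Tier"] for row in rows)
not_in_cmmds = [row["Name"] for row in rows if row["CMMDS"] == "false"]
owners = sorted({row["Owner"] for row in rows if row["Owner"] != "N/A"})

print(dict(tiers))
print(not_in_cmmds)
print(owners)
```

On this data the two rows with no owner are the ones reported as Capacity with CMMDS false, and no row is owned by cc01, which matches the three-cache, five-capacity picture described in the original post.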
