VMware Cloud Community
vHaridas
Expert

All Flash vSAN - Multiple Failures, VMs inaccessible

Hello Everyone,

I have spent 12+ hours with VMware Support on P1 SR#17666784912, and the SR has finally been escalated to the VMware Engineering team.

I just want to share my experience with the community. Maybe some vSAN geek can help me here.

Due to multiple issues, I may now lose VMs/data, and I really don't want to lose VMs on vSAN.

Cluster Setup -

All-flash 6-host vSAN cluster on UCS C240 M4. Call them ESXi Hosts 1 to 6.

Disk groups - Each host has 2 disk groups: one with 5 capacity disks (900 GB) and 1 cache disk (600 GB), and a second with 4 capacity disks and 1 cache disk.

Deduplication and compression are enabled.

VM Storage Policy - RAID 5/6, FTT=2, Number of disk stripes per object = 2. All 6 VMs have this policy except one big disk of 13.5 TB, which has the default storage policy.
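To put the setup above in numbers, here is a rough back-of-the-envelope sketch in Python. It assumes the capacity disks in the second disk group are also 900 GB (the post only states the size for the first group), ignores dedup/compression savings and vSAN's own overhead, and treats the 13.5 TB disk as fully written. The multipliers are the usual vSAN ones: 2x for a RAID-1 FTT=1 mirror (what an unmodified default policy uses) and 1.5x for RAID-6 with FTT=2.

```python
# Rough raw-capacity estimate for the cluster described above.
# Assumption: every capacity disk is 900 GB, including those in disk group 2.
CAPACITY_DISK_GB = 900
HOSTS = 6
CAPACITY_DISKS_PER_HOST = 5 + 4  # disk group 1 + disk group 2

raw_per_host_gb = CAPACITY_DISK_GB * CAPACITY_DISKS_PER_HOST
raw_cluster_tb = raw_per_host_gb * HOSTS / 1000


def raw_footprint_tb(logical_tb: float, multiplier: float) -> float:
    """Raw space consumed by an object of logical_tb under a policy multiplier."""
    return logical_tb * multiplier


BIG_VMDK_TB = 13.5
mirrored_tb = raw_footprint_tb(BIG_VMDK_TB, 2.0)  # RAID-1, FTT=1 (assumed default policy)
raid6_tb = raw_footprint_tb(BIG_VMDK_TB, 1.5)     # RAID-6, FTT=2

print(f"Raw capacity per host : {raw_per_host_gb} GB")
print(f"Raw capacity, cluster : {raw_cluster_tb:.1f} TB")
print(f"13.5 TB disk, mirrored: {mirrored_tb:.2f} TB raw")
print(f"13.5 TB disk, RAID-6  : {raid6_tb:.2f} TB raw")
```

Under these assumptions the single mirrored disk alone could claim more than half of the cluster's raw capacity, which would fit with the policy change on that VMDK being refused for lack of resources.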

Warning of Disaster -

On Host-2, the second disk group had been running at 90% physical disk usage for a month.

I ran disk balancing many times without any success.

Then I opened SR#17630294511 with VMware and got a workaround: apply a new VM policy with the number of disk stripes per object set to 2.

I applied the new VM policy to all VMs except the one big 13.5 TB VMDK. I couldn't apply the new policy to this VMDK because it complained that not enough resources were available.

I could not get downtime from the end user to apply another workaround: migrating this VM to other temporary SAN storage and then migrating it back to the vSAN datastore with the new RAID 5/6 policy.

So the vSAN cluster continued to run with 90% disk usage in one disk group for a month, and on 29th Dec 2017 disaster struck.
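A disk group sitting at 90% for a month is the kind of condition worth alerting on early. Below is a minimal Python sketch of such a check. It assumes you already have per-disk-group usage percentages from somewhere (for example, copied from vsan.disks_stats); the sample values, the `classify` helper and the 80%/90% thresholds are illustrative assumptions, not figures from the SR.

```python
from typing import Dict, List, Tuple

# Illustrative thresholds (assumptions, tune to your environment).
WARN_PCT = 80.0
CRIT_PCT = 90.0

DiskGroup = Tuple[str, int]  # (host name, disk group number)


def classify(usage: Dict[DiskGroup, float]) -> List[Tuple[str, int, float, str]]:
    """Return (host, disk_group, used_pct, level) for every group over a threshold."""
    findings = []
    for (host, group), pct in sorted(usage.items()):
        if pct >= CRIT_PCT:
            level = "CRITICAL"
        elif pct >= WARN_PCT:
            level = "WARNING"
        else:
            continue  # healthy, nothing to report
        findings.append((host, group, pct, level))
    return findings


# Hypothetical sample data; only Host-2 / disk group 2 at 90% comes from the post.
sample_usage = {
    ("Host-1", 1): 61.0,
    ("Host-2", 1): 72.0,
    ("Host-2", 2): 90.0,
    ("Host-3", 1): 81.5,
}

findings = classify(sample_usage)
for host, group, pct, level in findings:
    print(f"{level}: {host} disk group {group} is {pct:.1f}% used")
```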

Multiple Failures -

Suddenly, the second disk group on Host-2 hit 100% disk usage on 29th Dec.

I got an alarm email notification that Host-2 had been disconnected from vCenter.

After 10 minutes, I received another email notification that Host-3 and Host-4 had been disconnected from vCenter.


Troubleshooting -

I logged in to vCenter and ESXi and did basic checks of the hosts, networking, vSAN host and object status, VMkernel logs, vsan.disks_stats, etc.

I realized that this was not a regular failure and called VMware Support to open a P1 incident.

First Engineer -

He tried all possible ways to restore the vSAN cluster and reconnect the ESXi hosts, but no luck.

Finally we decided to restart ESXi Hosts 4, 3 and 2 in sequence, and that reconnected the hosts to vCenter.

Issue -

We noticed that the second disk group of Host-2 was 100% full.

Some VM object resyncs were in progress, but suddenly Host-3 and Host-4 were also disconnected; because of that, VM objects were marked Absent - resyncing.

So we tried to manually repair the inaccessible objects, but that didn't work, so we decided to give vSAN some time and see whether it would resync on its own.


Day 2

Second Engineer -

Next morning, I again noticed that Host-4 was disconnected and the vSAN cluster status was still as it was, with VMs inaccessible.

I called VMware Support; this time we realized that the disk group at 100% usage was causing other hosts to disconnect from vCenter.

So we decided to unmount disk group 2 on Host-2, and then Host-4 automatically reconnected to vCenter.

Again, we tried all possible ways to resync the objects.

As his shift was over, this engineer passed the SR to the next engineer.

Third Engineer -

Again, this engineer did some checks and decided to restart all 6 hosts at once.

We restarted all hosts, and then he told me to power off Host-2, which was causing the other hosts to fail.

Once all hosts had reconnected to vCenter, he checked the VM objects, cluster, disk status, etc.

New trouble - this time he noticed that Host-6's disks also had a problem: the disks of one of its disk groups were not mounted. It was reporting "In CMMDS: false".

Then we powered on Host-2 too.

Again, after trying all possible ways to mount and unmount the disks and disk groups, he discussed the issue with other team members, but no luck.

So finally he collected logs, and the SR was escalated to the VMware Engineering team. I am expecting a call back from the VMware team on 2nd Jan 2018.

I get the error below when I try to manually mount a disk -

Unable to mount: Disk with VSAN uuid 52ed7f26-8608-3c7a-a005-24fc30b2db32 failed to appear in LSOM
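To see which physical device that UUID belongs to, the error text can be matched against the VSAN UUID field of `localcli vsan storage list`. Below is a minimal Python sketch of that lookup. The listing in it is a hand-made, abbreviated sample: the device names `naa.EXAMPLE0001`/`naa.EXAMPLE0002` and the second UUID are hypothetical, and it assumes the usual indented `Key: value` layout with `Device`, `VSAN UUID` and `In CMMDS` fields.

```python
import re
from typing import Dict, List, Optional

# Error text as quoted in the post.
ERROR = ("Unable to mount: Disk with VSAN uuid "
         "52ed7f26-8608-3c7a-a005-24fc30b2db32 failed to appear in LSOM")

# Hypothetical, abbreviated sample of `localcli vsan storage list` output.
SAMPLE_LISTING = """\
naa.EXAMPLE0001
   Device: naa.EXAMPLE0001
   VSAN UUID: 52ed7f26-8608-3c7a-a005-24fc30b2db32
   In CMMDS: false
naa.EXAMPLE0002
   Device: naa.EXAMPLE0002
   VSAN UUID: 52aaaaaa-0000-0000-0000-000000000000
   In CMMDS: true
"""

UUID_RE = re.compile(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}", re.IGNORECASE)


def parse_records(text: str) -> List[Dict[str, str]]:
    """Split the listing into one dict per disk (unindented line starts a disk)."""
    records: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            records.append({})
            continue
        key, _, value = line.strip().partition(":")
        if records:
            records[-1][key.strip()] = value.strip()
    return records


def find_disk(error: str, listing: str) -> Optional[Dict[str, str]]:
    """Return the disk record whose VSAN UUID appears in the error message."""
    match = UUID_RE.search(error)
    if match is None:
        return None
    wanted = match.group(0).lower()
    for record in parse_records(listing):
        if record.get("VSAN UUID", "").lower() == wanted:
            return record
    return None


disk = find_disk(ERROR, SAMPLE_LISTING)
print(disk)
```

On a real host you would feed the full command output into `find_disk` instead of the sample string.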

All the engineers tried their best to fix the issue, but they couldn't, as this is a very strange issue.

Day 3

Current Status -

5 VMs out of 6 are inaccessible, as half of the VMs' objects are in the Absent or Absent - resyncing state.

Host-4 is disconnected.

Disk group 2 on host-2 is 100% used.

All hosts' disks report "In CMMDS: true" except on Host-6.

# localcli vsan storage list | grep -i cmmds
   In CMMDS: true
   In CMMDS: false
   In CMMDS: false
   In CMMDS: false
   In CMMDS: true
   In CMMDS: true
   In CMMDS: true
   In CMMDS: true
   In CMMDS: true
   In CMMDS: false
   In CMMDS: false
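The grep output above is easy to tally with a few lines of Python. This is a minimal sketch that counts the In CMMDS values from exactly the lines pasted in this post; it only assumes the `In CMMDS: true|false` line format shown here, and the `tally_cmmds` helper name is made up for illustration.

```python
from collections import Counter

# The grep output exactly as pasted in the post.
GREP_OUTPUT = """\
   In CMMDS: true
   In CMMDS: false
   In CMMDS: false
   In CMMDS: false
   In CMMDS: true
   In CMMDS: true
   In CMMDS: true
   In CMMDS: true
   In CMMDS: true
   In CMMDS: false
   In CMMDS: false
"""


def tally_cmmds(text: str) -> Counter:
    """Count 'true' and 'false' values on 'In CMMDS:' lines."""
    counts: Counter = Counter()
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key.strip().lower() == "in cmmds":
            counts[value.strip().lower()] += 1
    return counts


counts = tally_cmmds(GREP_OUTPUT)
print(f"In CMMDS: {counts['true']} disks, not in CMMDS: {counts['false']} disks")
```

For the output in this post that is 6 disks in CMMDS and 5 not, out of 11 listed.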

After googling for a couple of hours, I found this post with a similar error message during disk mounting.

http://vroomblog.com/vsan-overall-disks-health-et-software-state-health-errors

But I cannot try removing disks from the disk group, as this is an all-flash setup with dedup and compression enabled. This will be my last option to try.

I will wait for the Engineering team to provide a solution for this.

Thanks,

Haridas

https://vprhlabs.blogspot.in

Please consider awarding points for "Correct" or "Helpful" replies. Thanks....!!! https://vprhlabs.blogspot.in/
2 Replies
SureshKumarMuth
Commander

I think almost all recovery steps have been tried and, as you said, the last-resort option is to remove and re-add the disks to make them available in CMMDS. But right now two hosts are in a disconnected state, so to avoid any risk it is better to wait for the Engineering response. Check the resync state often to know what is happening at the backend. Also, from a support perspective, you need not wait one more day for the Engineering team to come back: they provide 24x7 support for production-down situations, or at least they should come back as soon as they have an action plan, not the next day.

Regards,
Suresh
https://vconnectit.wordpress.com/
vHaridas
Expert

No data resync or rebuild has happened since the first failure occurred on 29 Dec.

This is a preprod setup, all SQL VMs.

Due to the vacation period no one is asking me for these VMs, so I am not pushing much.

I have been told that the Engineering team will get back to me on 2nd Jan.

Hopefully the VMware Engineering team will provide a solution.

Thanks,

Haridas
