VMware Cloud Community
yemyat14
Contributor

Hot-adding hard disks to the hosts caused a disk group to go missing

Hi all with respect,


We have 3 nodes; each host contains 2x 400GB SSDs, 4x 1TB HDDs and 2x 300GB HDDs. We created the vSAN cluster with three disk groups that contain 2x SSDs and 3x 1TB HDDs.

It was running fine before this happened. I removed the 2x 300GB HDDs and added 2x 1TB HDDs to increase capacity. I unplugged them without entering maintenance mode or shutting down the host.

When I went back to configure the cluster, the disk group on this host was missing. The error said "permanent device loss" for a specific UUID. The HDDs I removed were not participating in this cluster. All my VMs with FTT=0 are gone.

Can I recover these VMs? There are no files for these VMs in the vSAN datastore.

Why did that happen? Please also guide me on how to troubleshoot this.

Accepted Solution
TheBobkin
Champion

Hello yemyat14,

"Yes, we have another two hosts with RAID0 SSDs for the cache tier."

I would advise remediating these sooner rather than later. Given the Dell hardware, I am going to assume you are using something like an H730 controller, for which RAID0 disks have not been supported on vSAN for a long time (as there were a multitude of issues).

"No Bob, we have 3 Disk-Groups with 3 SSDs."

Thanks for clarifying; originally you said that you had 2 SSDs.

"No Bob, I thought these disks were not contributing to the disk group. I just hot-plugged and hot-swapped them, and then the disk group disappeared."

Do you have auto-claim enabled for disks? Or did you re-use the SSD to create a new Disk-Group?

"The flash devices are missing their partitions on the ESXi host."

Do you have more than 1 SSD per host? If you only have 1 SSD per host, then the Absent disk you see referenced by UUID (not naa) is the PDL leftover of the SSD, caused by pulling out those drives.

If you remove all capacity-drives from a Disk-Group you are essentially destroying the Disk-Group as a Disk-Group cannot persist with just a cache-device. *Potentially* it could have been recreated by remounting the SSD and original HDDs.
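The point that a Disk-Group cannot persist with just a cache-device can be sketched as a small decision function. This is a hypothetical model for illustration, not a vSAN API: `disk_group_after_removal` and the device names are made up, it assumes deduplication and compression are disabled (with them enabled, losing one capacity device takes down the whole Disk-Group), and it ignores any rebuild activity.

```python
def disk_group_after_removal(cache: str, capacity: list[str], pulled: set[str]) -> str:
    """Hypothetical helper: describe what is left of one disk group after devices are pulled.

    Assumes dedup/compression is off. A disk group needs its cache device
    plus at least one capacity device to exist at all.
    """
    if cache in pulled:
        return "lost: cache device removed, the whole disk group goes offline"
    remaining = [d for d in capacity if d not in pulled]
    if not remaining:
        return "lost: only the cache device is left, the disk group cannot persist"
    missing = len(capacity) - len(remaining)
    if missing:
        return f"partial: {missing} capacity device(s) missing, data on them unavailable"
    return "intact"


dg_cache, dg_capacity = "ssd0", ["hdd0", "hdd1", "hdd2"]
# Pulling every capacity device leaves a cache device with nothing behind it.
print(disk_group_after_removal(dg_cache, dg_capacity, {"hdd0", "hdd1", "hdd2"}))
# Pulling a drive that is genuinely not in the group changes nothing.
print(disk_group_after_removal(dg_cache, dg_capacity, {"hdd3"}))
```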

But if there is only 1 SSD and you have re-used it for the new Disk-Group, then unfortunately the original Disk-Group and whatever data it contained are gone for good.

"I didn't remove the wrong storage devices. I swear."

Apologies if my comments above seemed in any way accusatory; there is absolutely no judgement here.

Bob

TheBobkin
Champion

Hello yemyat14,

"We created the vSAN cluster with three disk groups that contain 2x SSDs and 3x 1TB HDDs."

Each Disk-Group needs an SSD, and a Hybrid Disk-Group can only have 1 SSD, so this makes no sense (unless you are marking an HDD as Flash, which is a bad idea and unsupported).
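A minimal sketch of the per-host rule above, assuming the commonly documented limits for original-architecture vSAN: one flash cache device per disk group, 1 to 7 capacity devices per disk group, and at most 5 disk groups per host. The `DiskGroup` class, the `layout_problems` function and the device names are hypothetical, for illustration only; this is not a VMware API.

```python
from dataclasses import dataclass, field

# Assumed vSAN (OSA) limits: one cache device per disk group,
# 1-7 capacity devices per disk group, at most 5 disk groups per host.
MAX_CAPACITY_PER_GROUP = 7
MAX_GROUPS_PER_HOST = 5


@dataclass
class DiskGroup:
    """Hypothetical model of one disk group on a single host."""
    cache: str                                          # the single cache device
    capacity: list[str] = field(default_factory=list)   # capacity devices behind it


def layout_problems(flash_devices: list[str], groups: list[DiskGroup]) -> list[str]:
    """Return human-readable problems with one host's disk-group layout.

    flash_devices lists the devices that really are SSDs on this host.
    """
    problems = []
    if len(groups) > MAX_GROUPS_PER_HOST:
        problems.append(f"{len(groups)} disk groups exceeds the limit of {MAX_GROUPS_PER_HOST}")
    caches = [g.cache for g in groups]
    # Each disk group needs its own cache device; two groups cannot share one.
    if len(set(caches)) != len(caches):
        problems.append("a cache device is shared between disk groups")
    for g in groups:
        if g.cache not in flash_devices:
            problems.append(f"cache device {g.cache} is not a flash device")
        if not 1 <= len(g.capacity) <= MAX_CAPACITY_PER_GROUP:
            problems.append(f"disk group on {g.cache} has {len(g.capacity)} capacity devices")
    return problems


# Three disk groups but only two SSDs: the third group must use a non-flash cache.
ssds = ["ssd0", "ssd1"]
groups = [
    DiskGroup("ssd0", ["hdd0"]),
    DiskGroup("ssd1", ["hdd1"]),
    DiskGroup("hdd2", ["hdd3"]),
]
print(layout_problems(ssds, groups))  # ['cache device hdd2 is not a flash device']
```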

"I unplugged these without entering maintenance or shutting down host."

Why?

"When I went back to configure the cluster, the disk group on this host was missing. The error said 'permanent device loss' for a specific UUID. The HDDs I removed were not participating in this cluster. All my VMs with FTT=0 are gone."

"Why did that happen?"

Because you improperly removed the capacity-drives that this data resided on: obviously the disks and any data on them are not going to be accessible once you physically remove them from the host.

"Can I recover these VMs? There are no files for these VMs in the vSAN datastore."

If you haven't done anything other than swap the HDDs (e.g. you haven't removed or recreated the broken Disk-Group, or re-used the SSD that was part of it), then swapping the original disks back in and remounting the Disk-Group may make the data accessible again.

In future, please use the correct procedure for replacing disks: properly decommission the disks you are replacing if you care about your data (especially with FTT=0, as there is nothing to rebuild from if you lose it).
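Why FTT=0 objects disappear while mirrored ones survive can be sketched with a simplified form of the vSAN accessibility rule: an object stays accessible only while more than 50% of its component votes and at least one full data replica remain available. The sketch assumes one vote per component and no striping (real vSAN can assign extra votes and split replicas into stripes); `Component`, `accessible` and the device names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical model of one component of a vSAN object."""
    kind: str       # "data" (a full replica, no striping assumed) or "witness"
    disk: str       # device the component lives on
    votes: int = 1  # simplification: one vote per component


def accessible(components: list[Component], removed_disks: set[str]) -> bool:
    """Simplified rule: more than 50% of votes and at least one data replica must survive."""
    total_votes = sum(c.votes for c in components)
    surviving = [c for c in components if c.disk not in removed_disks]
    surviving_votes = sum(c.votes for c in surviving)
    has_replica = any(c.kind == "data" for c in surviving)
    return surviving_votes * 2 > total_votes and has_replica


pulled = {"hostA-hdd1"}
# FTT=0: a single data component and no witness.
ftt0 = [Component("data", "hostA-hdd1")]
# FTT=1 (RAID-1): two replicas on different hosts plus a witness on a third host.
ftt1 = [
    Component("data", "hostA-hdd1"),
    Component("data", "hostB-hdd2"),
    Component("witness", "hostC-hdd1"),
]
print(accessible(ftt0, pulled))  # False: the only copy was on the pulled disk
print(accessible(ftt1, pulled))  # True: 2 of 3 votes and one replica remain
```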

Bob

yemyat14
Contributor

Thanks for your reply, TheBobkin.

We configured the 2x SSDs as RAID-0. Each host has a disk group with ~740 GB of SSD for the cache tier and 3TB for capacity. I am sure I marked the flash correctly.
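For reference, two 400 GB SSDs striped together come to about 745 GiB, which matches the ~740 GB cache figure if the size is being shown in binary units (an assumption here). A small worked calculation, assuming decimal units on the drive labels and ignoring any space the RAID controller reserves for its own metadata:

```python
# Assumptions: drive labels use decimal units (1 GB = 10**9 bytes) and the
# RAID controller's own metadata overhead is ignored.
GB = 10**9
TB = 10**12
GiB = 2**30
TiB = 2**40

cache_bytes = 2 * 400 * GB      # two 400 GB SSDs striped into one RAID-0 virtual disk
capacity_bytes = 3 * 1 * TB     # three 1 TB HDDs as capacity devices

cache_gib = cache_bytes / GiB
capacity_tib = capacity_bytes / TiB

print(f"cache tier: {cache_gib:.1f} GiB")        # cache tier: 745.1 GiB
print(f"capacity tier: {capacity_tib:.2f} TiB")  # capacity tier: 2.73 TiB
```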

The 300GB HDDs are not participating in these disk groups. That's why I removed them. Could witness data have been stored on them?

Yes, in future I will follow the proper steps and use FTT=1. Is there a reference guide for troubleshooting?

TheBobkin
Champion

Hello yemyat14,

"We configured the 2x SSDs as RAID-0."

What controller are you using here? I ask because almost no controllers nowadays support RAID0 disks in vSAN.

Do you have multiple SSDs RAID0'd together and presented? If so, then this is an accident waiting to happen.

"I am sure I marked the flash correctly."

As I said, you can't have 3 Disk-Groups with only 2 SSDs available, so you have either marked an HDD as an SSD (bad idea) or have some weird RAID0 'Virtual Disk' configuration to carve up the SSDs before they are presented to vSAN, which is also a very bad idea.
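Why a RAID0 cache device is risky can be put in numbers: a RAID0 set has no redundancy, so it fails if any member fails, and losing a disk group's cache device takes the whole disk group offline. Assuming independent drive failures with the same probability p over a given period, a set of n members fails with probability 1 - (1 - p)^n. The 2% figure below is an arbitrary illustrative value, not a vendor failure rate, and `raid0_failure_probability` is a hypothetical helper.

```python
def raid0_failure_probability(p_member: float, members: int) -> float:
    """Probability that a RAID0 set fails, assuming each of `members` drives
    fails independently with probability `p_member` over the same period.
    RAID0 has no redundancy, so losing any single member loses the whole set."""
    if not 0.0 <= p_member <= 1.0:
        raise ValueError("p_member must be between 0 and 1")
    if members < 1:
        raise ValueError("members must be at least 1")
    return 1.0 - (1.0 - p_member) ** members


# Illustrative only: 2% is a made-up per-drive failure probability.
single = raid0_failure_probability(0.02, 1)
striped = raid0_failure_probability(0.02, 2)
print(f"single SSD: {single:.4f}, 2-disk RAID0: {striped:.4f}")  # single SSD: 0.0200, 2-disk RAID0: 0.0396
```

Striping two SSDs roughly doubles the chance of losing the cache device, and with it every component stored in that disk group.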

"The 300GB HDDs is not participating in these diskgroups. That's why I removed them."

If they are not participating in the Disk-Groups, exactly what process (and option, e.g. 'Full Data Evacuation') did you use to remove them from the Disk-Groups before physically removing them from the hosts?

"Could witness data have been stored on them?"

FTT=0 data doesn't use witness components: each Data-Object consists of a single data-component.
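As a rough companion to the point above: with RAID-1 mirroring, the textbook layout for a given FTT uses FTT+1 full replicas plus FTT witness components and needs at least 2*FTT+1 hosts, while FTT=0 is a single data component with no witness. The helper below is a hypothetical sketch of that arithmetic; actual component placement can differ, for example when vSAN assigns extra votes instead of adding witnesses, or when objects are striped.

```python
def raid1_requirements(ftt: int) -> dict[str, int]:
    """Simplified vSAN RAID-1 (mirroring) requirements for a given FTT.

    Assumes the textbook layout with no striping: FTT+1 full replicas,
    FTT witness components, and at least 2*FTT+1 hosts so that a majority
    of components survives FTT host failures.
    """
    if not 0 <= ftt <= 3:
        raise ValueError("vSAN RAID-1 supports FTT values from 0 to 3")
    return {
        "replicas": ftt + 1,
        "witnesses": ftt,
        "min_hosts": 2 * ftt + 1,
    }


for ftt in range(3):
    print(ftt, raid1_requirements(ftt))
# 0 {'replicas': 1, 'witnesses': 0, 'min_hosts': 1}
# 1 {'replicas': 2, 'witnesses': 1, 'min_hosts': 3}
# 2 {'replicas': 3, 'witnesses': 2, 'min_hosts': 5}
```

With the three hosts in this cluster, FTT=1 mirroring fits exactly; FTT=2 would need five hosts.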

If you pulled some disks out of the hosts and now your FTT=0 data is missing, then it would appear you pulled disks that had this data on them.

Bob

yemyat14
Contributor

Hi TheBobkin,

"Do you have multiple SSDs RAID0'd together and presented?"

Yes, we have another two hosts with RAID0 SSDs for the cache tier.

"As I said, you can't have 3 Disk-Groups with only 2 SSDs..."

No Bob, we have 3 Disk-Groups with 3 SSDs. It worked fine before I upgraded the disks. The thing I did wrong was the RAID-0 on the SSDs.

"...did you use to remove them from the Disk-Groups before physically removing them from the hosts?"

No Bob, I thought these disks were not contributing to the disk group. I just hot-plugged and hot-swapped them, and then the disk group disappeared. I have attached some pictures. The flash devices are missing their partitions on the ESXi host. Why? I didn't remove the wrong storage devices. I swear.

a_p_
Leadership

This may be related, but I'm not sure. What I've learned about Dell servers (the RAID controllers in such servers) is that they support "Hot-Add" but not "Hot-Swap" (i.e. hot-removing disks). To properly remove disks from a system, it is highly recommended to set them offline before physically removing them. You can find an ESXi command-line tool for this on Dell's "VMware PERCCLI Utility For All PERC Controllers" page.

André

yemyat14
Contributor

Hello a.p.,

Thanks for your reply. I will take care of this.

yemyat14
Contributor

The controller is an H710 Mini.

Thank you so much, TheBobkin. Much appreciated.

Sorry for my words. Thanks for your answers and your time. ^_^
