VMware Cloud Community
woter
Contributor

Adding a second disk group to a 2-node witness appliance

Hi,

I've just created a new vSAN cluster with the new ESXi 6.7 configuration type "Two host vSAN cluster" option. The "wizard" has done a few things differently from how I set up my old 6.5 cluster, and one such thing is that I notice the witness host has a disk group. The wizard only allows for one disk group, but my combination of SSDs and HDDs means I need two disk groups.

Do I need to create the second disk group on the witness appliance?

I've gone ahead and attached two new virtual disks, a 10GB SSD and a 15GB HDD, to match what the wizard did. My disk groups now look like this:

[screenshot of the resulting disk groups]
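
(For reference, a rough pyVmomi sketch along the lines below can dump the same layout as text; the vCenter name and credentials are placeholders and the property access is my assumption, so treat it as a starting point rather than a ready-made script.)

# List each host's vSAN disk groups (cache device + capacity devices) so the
# two data nodes can be compared against the witness appliance.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcsa.lab.local", user="administrator@vsphere.local",
                  pwd="***", sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        vsan_cfg = host.config.vsanHostConfig
        if not vsan_cfg or not vsan_cfg.storageInfo:
            continue  # host has no vSAN disk groups configured
        for i, dg in enumerate(vsan_cfg.storageInfo.diskMapping, start=1):
            capacity = ", ".join(d.displayName for d in dg.nonSsd)
            print(f"{host.name} DG{i}: cache={dg.ssd.displayName} capacity=[{capacity}]")
finally:
    Disconnect(si)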

Many thanks.

W.

4 Replies
TheBobkin
Champion

Hello woter,

So, just to clarify the role of a Witness node in a vSAN Stretched/2-Node cluster: it is only used to store Witness components (16MB per Object, e.g. .vmdk or .vswp), and thus it does not (generally) require a huge amount of space or redundancy, as this metadata can be easily (and quickly) recreated. This is why we allow nested as opposed to physical implementations of Witness Appliances - if you want to use a physical box, that is fine of course (it is unclear here whether that is the case).

Here are our sizing guidelines for this (you can see that yours matches 'Tiny: 750 Witness Components'):

vSAN Witness Appliance Sizing
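
As a rough back-of-the-envelope check (purely illustrative, using the ~16MB-per-component figure above and the 'Tiny: 750 components' profile):

# Worst-case witness capacity needed for the 'Tiny' profile.
WITNESS_COMPONENT_MB = 16      # approximate size of one witness component
TINY_PROFILE_COMPONENTS = 750  # component limit of the 'Tiny' sizing profile
CAPACITY_DISK_GB = 15          # capacity-tier vmdk the 6.7 wizard deployed

required_gb = WITNESS_COMPONENT_MB * TINY_PROFILE_COMPONENTS / 1024
print(f"Worst case for 'Tiny': {required_gb:.1f} GB of witness metadata")  # ~11.7 GB
print(f"Fits on the {CAPACITY_DISK_GB} GB capacity disk: {required_gb <= CAPACITY_DISK_GB}")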

Bob

woter
Contributor

Thanks Bob,

I'm running the witness appliance (OVA) on VMware Workstation.

Using the vSAN "wizard", when I "claimed" my disks, I was only able to create one disk group. The 6.7 "wizard" created the disk groups on the hosts and a disk group on the witness host. As the "wizard" only allows the creation of one disk group per host, I manually added a second disk group on the two hosts. The question is: should I match what the "wizard" did for the witness and create a second disk group on the witness appliance?

(I have done this; I'd just like to know if it is correct.) I never created any disk groups on the old 6.5 cluster, which is probably why I could never get FTT=1 and lost all my data when the cache SSD (an unsupported Samsung NVMe 960 EVO) failed during an evacuation exercise, which was clearly too much for the poor thing and burned it out.

Thanks again.

TheBobkin
Champion

Hello woter,

"The question is, should I match what the "wizard" did for the witness and create the second disk group on the witness appliance?"

If you are running this as a test or small environment then no, it is not necessary to add more than the (yes, it seems tiny) one Disk-Group with a 15GB capacity-tier device - as I said above, Witness components are relatively tiny compared to the average data component, and thus not much is required with regards to capacity (and performance).

"(I have done this, I'd just like to know if it is correct. I never created any disk groups on the old 6.5 cluster, which is probably why I could never get FTT=1 and lost all my data when the cache SSD failed (unsupported Samsung NVMe 960 evo) during an evacuation exercise which was clearly too much for the poor thing and burned it out."

If you want to save some space and remove any extraneous disks/Disk-Groups (DGs) on your Witness, it is merely a case of removing them with 'Full Data Evacuation' via Cluster > Configure > vSAN > Disk Management > select the DG > Delete > select the 'Full Data Evacuation' option - this will move anything on that DG to the other DG on the Witness.
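
(If you prefer to script it, roughly the same operation can be done with pyVmomi. The sketch below is only an outline: the witness DNS name and credentials are placeholders, and you should verify the API names and behaviour against your build before running anything.)

# Remove a chosen disk group from the witness with the equivalent of the UI's
# 'Full Data Evacuation' option. Outline only - verify before use.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcsa.lab.local", user="administrator@vsphere.local",
                  pwd="***", sslContext=ssl._create_unverified_context())
try:
    # Locate the witness host by name (placeholder DNS name).
    witness = si.RetrieveContent().searchIndex.FindByDnsName(
        dnsName="vsan-witness.lab.local", vmSearch=False)

    vsan_sys = witness.configManager.vsanSystem
    disk_groups = vsan_sys.config.storageInfo.diskMapping
    dg_to_remove = disk_groups[-1]  # e.g. the extra DG that was added by hand

    # Equivalent of choosing 'Full Data Evacuation' in the UI.
    spec = vim.host.MaintenanceSpec(
        vsanMode=vim.vsan.host.DecommissionMode(objectAction="evacuateAllData"))
    task = vsan_sys.RemoveDiskMapping_Task(mapping=[dg_to_remove],
                                           maintenanceSpec=spec)
    print("Disk-group removal task started:", task.info.key)
finally:
    Disconnect(si)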

If you never created DGs in your previous setup then you were not really using vSAN as it is intended, as this implies RAID1/RAID5 across numerous nodes to provide redundancy, managed by SPBM.

Out of interest, did you actually manage to kill a 960, and how? Coincidentally, I use an M.2 NVMe 960 EVO for most of my hosts' cache-tier "*vmdks*" for vSAN.

Bob

woter
Contributor

Hi Bob,

I understand the role of the witness and why the disks are so small. I guess I'm thinking along the lines of quorum: do the two hosts and the witness have to match "logically"?

My old 6.5 cluster had disk groups on the hosts; I just never created the witness disk groups, and I guess this is why it always said "Configuration can tolerate 0 failures" and I couldn't set anything other than 0 for "Primary level of failures to tolerate" in the storage policy. Lesson learned.

I was trying to evacuate running VM data to one host in order to patch the other host. I only have two direct-attached 10GbE NICs and the migration was taking ages: ~160GB, and 24 hours later it was still going. I cancelled "enter maintenance mode" and restarted it, at which point the data to be evacuated was down to ~60GB, so it was doing something, and after leaving it for a few days the host actually entered maintenance mode. I applied the patches, then repeated the steps for the second node, which was just as slow.

When the patches had been applied, all the VMs were either orphaned or inaccessible. I discovered that said SSD was marked as "Permanent Device Loss". I found it hard to believe that the one-year-old SSD was really dead and tried all sorts of things to bring it online, including a successful firmware update. I eventually bought a new one, which was detected straight away; however, during my attempts to get the old one working again I had deleted the disk group, and there was no way I could find to get back from that. Creating a new disk group effectively formats the disk.

What I also found strange was that it was only the running VMs (6) that I wanted to move, so why all the data ended up on the failed disk group is beyond me.

I have since read that a failed cache disk effectively destroys the disk group anyway - which I find very odd, as I would have thought it was just used for reads, and copies at that. Another lesson learned.

When I get a Windows device with an NVMe M.2 slot, I'll plug in the SSD and run some tests using Samsung's Magician tool. The warranty is 3 years or 200 TBW, so it will be interesting to see what the TBW figure actually is. Of course, I could have just been unlucky and bought a dodgy SSD. Samsung claim that most warranty claims result in the device being returned, as their tests usually find nothing wrong.

TBH, what with having to rebuild my VCSA four times now due to disk corruption, an ever-dwindling HCL, no PAYG support for anything other than Essentials, and now this, after 14 years using ESX as a consumer and administrator I'll restore my key data, move it to the cloud or a little QNAP, sell off my hardware and close the door on VMware. It's not the bullet-proof product it once was, and I just don't have the time these days. I'm going to put it down to the change in business strategy by VMware's new(ish) owners. It couldn't possibly be my fault that I didn't read the instructions properly :-).

Thanks again.

W.
