VMware Cloud Community
abckdc
Enthusiast

vSAN 3-node consideration.

Hello All,

We are considering setting up a 3-node vSAN cluster. Each host has 10 x 1TB disks. On each node we plan to create 2 disk groups, each with 4 capacity disks and 1 cache disk. We are considering an FTT=1 policy with a RAID-1 setup.

My question is about disk usage. As one host is used as a witness node, will the disk groups of the witness node be used to store anything other than the metadata of the RAID-1 setup? Will I be able to use the disk groups of all three hosts, or only 2 hosts (RAID-1) with the 3rd host acting as witness? If so, I do not see the point in populating my 3rd host with 10TB of disks just to hold metadata. Or will the data be distributed among the disk groups of all 3 hosts so that all my disks get utilized, which would slightly increase my usable capacity compared to using only 2 nodes?

Also, if one node fails, will I be able to add a replacement node to the same cluster, and will the RAID rebuild start?

Sorry if my query has been answered already; I have not been able to find the exact answer to it.

Thank You.

5 Replies
TheBobkin
Champion

@abckdc In a 3-node cluster, all nodes are used for a combination of data and metadata, i.e. there is no dedicated Witness. Now, if you don't need anywhere near 30TB of raw capacity (e.g. if 20TB is more than enough) and have enough compute resources in 2 nodes, you could run a 2-node + Witness setup - the Witness is basically a VM running somewhere else.

Yes, a node can be replaced or another added and data resynced to it.

Tibmeister
Expert

As @TheBobkin mentioned, in a 3-node vSAN cluster there is no dedicated witness node like you would see in a 2-node cluster. The witness role "floats" between the hosts as needed, but when you look at the virtual objects you will see the Witness piece of your components - it is just a small component that doesn't take much space, since it's only metadata.

With your design - two disk groups per node, each with 4 x 1TB capacity disks - and FTT=1, you should have something like 12TB of total usable capacity for the cluster. Also, I would look at 960GB high-endurance disks for the cache; only 800GB of cache can be used per disk group, so the extra capacity is there to absorb wear on the disk. As SSDs are used and wear, cells become unusable, which reduces the usable capacity of the device.
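
In case it helps to see the arithmetic behind that 12TB figure, here is a rough back-of-the-envelope sketch (Python; the node and disk counts are taken from this thread, and the formula simply divides raw capacity by the number of RAID-1 copies, ignoring witness and filesystem overheads):

```python
# Capacity math for the proposed layout (assumed values from this thread:
# 3 nodes, 2 disk groups per node, 4 x 1TB capacity disks per group,
# FTT=1 with RAID-1 mirroring). Not a sizing tool, just the arithmetic.

nodes = 3
disk_groups_per_node = 2
capacity_disks_per_group = 4
disk_tb = 1.0
ftt = 1
mirror_copies = ftt + 1  # RAID-1 keeps FTT+1 full copies of every object

raw_tb = nodes * disk_groups_per_node * capacity_disks_per_group * disk_tb
usable_tb = raw_tb / mirror_copies  # witness components are metadata-sized, ignored here

print(f"raw: {raw_tb:.0f} TB, usable with FTT=1/RAID-1: {usable_tb:.0f} TB")
# prints: raw: 24 TB, usable with FTT=1/RAID-1: 12 TB (before slack and overheads)
```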

So all 3 hosts will contain data, and with FTT=1 it will be mirrored, so each object will have two copies. With FTT=1, a stripe width of 1, and a VMDK smaller than 255GB, the object that represents your VMDK will have a component on a disk group on one node and a copy of that component on a disk group on a different node. In this scenario, if you lost a node, the affected objects would keep running from the mirrored copies on the remaining two nodes (with only two hosts left there is nowhere compliant to rebuild the missing replicas until a third host is back). The thing to watch for is that you plan on the ability to lose a node, so plan your storage max as 66% of the total available, and the compute max also as 66% of normal. In your case, I would not provision more than 7TB of data, which is ~58% of your total, as you will need some unused space for rebuilds, snapshots, etc. This will ensure you can do maintenance on the cluster without issue, and that you can lose a host and continue to operate normally rather than in a degraded state.
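
Expressed as arithmetic (the 66% and 7TB figures are my rule of thumb for this design, not an official VMware sizing formula):

```python
# Headroom rule of thumb: keep enough free space for rebuilds/resyncs,
# snapshots and maintenance-mode evacuations.

usable_tb = 12.0                # from the FTT=1 calculation above
slack_fraction = 1 / 3          # roughly one host's worth of capacity kept free
guideline_tb = usable_tb * (1 - slack_fraction)   # the 66% guideline -> 8 TB
suggested_cap_tb = 7.0                             # the more conservative cap above

print(f"66% guideline: {guideline_tb:.1f} TB")
print(f"suggested cap: {suggested_cap_tb:.0f} TB "
      f"(~{suggested_cap_tb / usable_tb:.0%} of usable)")
# prints: 66% guideline: 8.0 TB / suggested cap: 7 TB (~58% of usable)
```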

Once you repair a failed node, or introduce a new node, the storage policy will continue to take effect, and some components will move to the new node as appropriate.

Quick example: a simple VM with a single 100GB VMDK. There are 3 objects that make up the VM: the VMDK itself, the VM Home (config files and such), and the VM swap object. The VM itself is executing on node2.

The VMDK object has a RAID-1 tree established between node2 and node1, meaning there is a component on node2 and a copy of that component on node1, and the RAID-1 tree mirrors the writes to both components over the network, just like normal RAID-1 would do. The Witness component - remember, just metadata - lives on node3.

The VM Home object is laid out the same as the VMDK object in this case. The VM swap object, on the other hand, is slightly different: its RAID-1 tree is between node2 and node3, with the Witness component on node1.

So in this example, losing node1 takes out one mirror of the VMDK and VM Home objects and the Witness component of the VM swap object. With only two hosts remaining there is no valid place to rebuild those pieces (each object already has a component on both surviving hosts), so the objects keep running from their surviving components. Once I get node1 back in service, the missing components are resynced and the cluster will rebalance itself to ensure that everything's spread back out again.
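
To make that layout concrete, here is a toy sketch of the placement (hypothetical node names and a simplified majority-vote check; real vSAN uses per-component votes and its own placement logic):

```python
# Toy model of the example above: each object has two mirrored data components
# plus a small witness component, each on a different host.

placement = {
    "VMDK":    {"data": ["node1", "node2"], "witness": "node3"},
    "VM Home": {"data": ["node1", "node2"], "witness": "node3"},
    "VM Swap": {"data": ["node2", "node3"], "witness": "node1"},
}

def available_after(obj, failed_host):
    """An object stays accessible while a majority of its components survive."""
    parts = placement[obj]["data"] + [placement[obj]["witness"]]
    surviving = [h for h in parts if h != failed_host]
    return len(surviving) > len(parts) / 2

for obj in placement:
    print(obj, "available after losing node1:", available_after(obj, "node1"))
# All three objects stay available: each keeps 2 of its 3 components.
```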

If you wanted to instead go with a 2-node cluster using a shared Witness, remember that the shared Witness CANNOT run on the 2-node cluster it is servicing, and the Witness is best deployed as the virtual appliance rather than on an expensive piece of hardware. Also, with a 2-node cluster the 66% estimate I gave above becomes 50%, so you are setting aside 50% of your capacity instead of 33%.
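
The capacity trade-off between the two designs, again just my rough arithmetic with the same assumed 1TB disks and 8 capacity disks per data host:

```python
# Rough comparison of provisionable capacity, 3-node vs 2-node + shared Witness.
# Assumed: 8 x 1TB capacity disks per data host, FTT=1 RAID-1 (two copies).

def usable_tb(data_hosts, disks_per_host=8, disk_tb=1.0, copies=2):
    return data_hosts * disks_per_host * disk_tb / copies

for hosts in (3, 2):
    total = usable_tb(hosts)
    headroom = (hosts - 1) / hosts  # fraction left after setting aside one host's worth
    print(f"{hosts} data nodes: {total:.0f} TB usable, "
          f"plan to consume at most ~{headroom:.0%} of it")
# prints roughly: 12 TB / ~67% for 3 nodes, 8 TB / ~50% for 2 nodes
```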

TheBobkin
Champion

@Tibmeister Just a minor clarification - in a non-stretched cluster, NO host has a 'Witness' role; only the Master ('Leader' in later builds), Backup and Agent roles exist from a cluster-role perspective.

My understanding of it is that where witness-components and data-components get placed is based purely on storage space available on each node (e.g. if node1 and node2 have higher disk usage than node3 then node3 will get the smaller witness-component).
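
A trivial way to picture that free-space-driven choice (purely illustrative, hypothetical numbers, and not vSAN's actual placement algorithm):

```python
# Pick the host with the most free capacity for the small witness component,
# excluding the hosts that already hold the object's data components.

free_gb = {"node1": 800, "node2": 950, "node3": 2400}  # hypothetical free space
data_hosts = {"node1", "node2"}                         # hosts holding the mirrors

witness_host = max((h for h in free_gb if h not in data_hosts),
                   key=lambda h: free_gb[h])
print(witness_host)  # node3
```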

Tibmeister
Expert

@TheBobkin yeah, that's a hard one for folks to grasp sometimes. The Witness node in a stretched or 2-node setup is a dedicated node holding metadata to perform tie-breaking, but in other vSAN clusters this isn't needed and the Witness components live on any node, alongside the data.

As for the placement, it does seem random for sure, but I have seen that if I take one of my 3 nodes out for an extended period of time, or do a full data migration when entering Maintenance Mode, a rebalance occurs afterwards. I've always assumed this is due to the vSAN cluster setting for Automatic Rebalance, which is not enabled by default, to ensure that all disks stay within the threshold. Otherwise, isn't there a Skyline Health alert that gets generated when things are too far out of balance?

Both good points for the OP to know when going into a non-stretched cluster, so good talking points.  I will admit, I am still new at vSAN, so there's a lot of in-depth details I am still learning, but answering these posts and looking at other answers definitely helps to expand my knowledge base.

abckdc
Enthusiast

Thanks @Tibmeister and @TheBobkin for the detailed clarification. I am very grateful for the responses and now have a better understanding of the cluster.

Thanks Again.
