VMware Cloud Community
andvm
Hot Shot

Multiple Hosts down evaluation

Hi,

I need to evaluate whether it is possible to bring down 2 hosts in a cluster at a time while keeping vSAN healthy and at the recommended availability (Default storage policy), even if an additional host failure brings the cluster down to just 3 out of 6 hosts.

I tried to run vsan.whatif_host_failures -n2 but got “Only simulation of 1 host failure has been implemented”.

Are there any other commands or considerations I should take into account? I want to upgrade these hosts two at a time.

Thanks

TheBobkin
Champion

Hello andvm

This won't be possible unless you use the 'Full Data Migration' option on at least one host (which will likely take a long time), OR the data is FTT=2 (what you said indicates it is not), OR this is a stretched cluster and both nodes are in the same site (and you have no local-site-only Storage Policies).

Why do you want to bring down 2 hosts at the same time?

Bob

andvm
Hot Shot

Hi TheBobkin

Actually, you are right: I do not need to do both hosts at the same time (not sure why I thought so).

As long as they have no VMs running on them, I can make life simpler by only working on one of them at a time (MM with FDM to keep data always compliant).

Keeping to the same subject, how can I find out exactly how many host failures a specific cluster can support while satisfying data compliance (Default vSAN storage policy)?

Of course these failures will have to be at different times to avoid data loss.

Thanks

TheBobkin
Champion

Hello andvm,

"As long as they have no VMs running on them, I can make life simpler by only working on one of them at a time (MM with FDM to keep data always compliant)."

Entering Maintenance Mode (MM) with either the Ensure Accessibility (EA) or Full Data Migration (FDM) option migrates all VMs off the host as part of the process (including powered-off ones unless specified otherwise).

Depending on what you are doing (e.g. patching a host vs. physically moving it), using the FDM option is potentially overkill, as it will take an order of magnitude longer and put additional wear on your disks. For short downtimes (e.g. updates/patches), the vast majority of vSAN customers ensure backups are current and then use MM with EA.

Those that have a requirement (legal/contract/SLA etc.) to maintain redundancy during maintenance windows tend to have FTT=2 policies in use and thus still have FTT=1 during rolling updates. I cannot recall the last customer that was using FTT=1 and doing MM with FDM for anything relatively simple/short, for the above reasons.

"Keeping to the same subject, how can I find out exactly how many host failures a specific cluster can support while satisfying data compliance (Default vSAN storage policy)?"

This isn't necessarily going to be the same for all data in any one cluster, e.g. you can have a mix of FTT=0, FTT=1 and FTT=2 data in the same cluster. I would advise looking at which Storage Policies are in use in each cluster.
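To illustrate why the answer depends on the mix of policies, here is a minimal sketch. The object names and FTT values are made up for the example, and the helper functions are hypothetical (they are not part of RVC or any VMware tool); in practice the FTT values would come from the Storage Policies actually assigned in the cluster.

```python
from collections import Counter

# Hypothetical inventory: object name -> FTT of the Storage Policy applied to it.
object_ftt = {
    "vm-a.vmdk": 1,
    "vm-b.vmdk": 2,
    "vm-c.vmdk": 0,
    "vm-d.vmdk": 1,
}


def objects_per_ftt(ftt_by_object):
    """Count how many objects use each FTT value."""
    return Counter(ftt_by_object.values())


def cluster_wide_tolerance(ftt_by_object):
    """Concurrent host failures that every object tolerates: the lowest FTT in use."""
    return min(ftt_by_object.values())


def objects_at_risk(ftt_by_object, hosts_down):
    """Objects whose FTT is lower than the number of hosts down at the same time.

    Worst case only: an object is actually affected only if its components
    happen to sit on the hosts that are down.
    """
    return sorted(name for name, ftt in ftt_by_object.items() if ftt < hosts_down)


if __name__ == "__main__":
    print(dict(objects_per_ftt(object_ftt)))
    print(cluster_wide_tolerance(object_ftt))
    print(objects_at_risk(object_ftt, hosts_down=2))
```

With this made-up mix, the single FTT=0 object means the cluster as a whole cannot guarantee availability for even one host failure, even though most objects can.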

"Of course these failures will have to be at different times to avoid data loss."

A data Object can withstand as many concurrent failures as its FTT provides. If it has the resources to rebuild after a failure (e.g. enough fault domains (FDs), and usable space that doesn't violate the Storage Policy (SP)), then it can really withstand as many non-concurrent failures as it has resources to rebuild from.
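As a rough way to reason about non-concurrent failures, the sketch below counts how many one-at-a-time host failures can be followed by a full rebuild. It is a simplified model, not what vSAN or RVC actually computes. It assumes RAID-1 mirroring (which needs 2 x FTT + 1 hosts), identical hosts that are each their own fault domain, and a hypothetical max_fill ceiling on how full the remaining capacity is allowed to get. All function and parameter names are invented for the example.

```python
def min_hosts_raid1(ftt):
    """Hosts (fault domains) a RAID-1 mirrored object needs: 2 * FTT + 1."""
    return 2 * ftt + 1


def rebuildable_failures(hosts, ftt, host_capacity_tb, used_tb, max_fill=0.8):
    """Count one-at-a-time host failures after which data can be rebuilt to compliance.

    used_tb is the raw capacity consumed across the cluster (all replicas included).
    max_fill is a hypothetical limit on the fraction of remaining capacity in use.
    """
    survived = 0
    remaining = hosts
    # Another failure can be absorbed only if enough hosts remain for the policy.
    while remaining - 1 >= min_hosts_raid1(ftt):
        remaining -= 1
        if used_tb > remaining * host_capacity_tb * max_fill:
            break  # not enough usable space left on the surviving hosts to rebuild
        survived += 1
    return survived


if __name__ == "__main__":
    # 6 hosts of 10 TB each, FTT=1, 20 TB of raw capacity in use.
    print(rebuildable_failures(hosts=6, ftt=1, host_capacity_tb=10, used_tb=20))
```

Under these assumptions a 6-host cluster with FTT=1 can rebuild after three separate failures (down to 3 hosts) if space allows, while the same cluster with FTT=2 can only rebuild after one, since RAID-1 with FTT=2 needs 5 hosts.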

Bob

andvm
Hot Shot

OK, only the Default Storage Policy is in use.

Downtime can last a day, so it’s a considerable risk if done without FDM for this setup. It can be longer for testing.

Yes, FTT=2 provides better availability, but at the cost of more storage consumption.
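To put a number on that trade-off, here is a small sketch of the arithmetic. It assumes RAID-1 mirroring, where an object with FTT=n keeps n + 1 full copies, and it ignores witness components and other overheads. Erasure-coding policies have different ratios. The function name and the 10 TB figure are only for illustration.

```python
def raid1_raw_tb(vm_data_tb, ftt):
    """Raw capacity consumed by RAID-1 mirrored data: one full copy per FTT, plus the original."""
    return vm_data_tb * (ftt + 1)


if __name__ == "__main__":
    data_tb = 10  # illustrative amount of VM data
    for ftt in (1, 2):
        print(f"FTT={ftt}: {raid1_raw_tb(data_tb, ftt)} TB raw for {data_tb} TB of data")
```

So under these assumptions, moving 10 TB of data from FTT=1 to FTT=2 raises raw consumption from 20 TB to 30 TB.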

Think you have answered my question/s.

I just would have liked the RVC what-if command for failed hosts to accept any number of hosts, as that would fully answer my question per cluster.

Thanks
