VMware Cloud Community
redsnapper76
Contributor

estimated time to evacuate data on host maintenance mode

Hello,

I'm looking into implementing a 4-host vSAN cluster.

I'm a bit concerned about the time it would take to evacuate data from a node when you need to put it in maintenance mode.

When there is a host issue, the first thing the vendor asks is to update firmware etc., which means the host will be in maintenance mode for longer than 60 minutes.

As a best practice this would mean you need to evacuate the data to other hosts in the cluster.

What would be the estimated time needed to evacuate, for example, 1 TB, 5 TB or 10 TB from one host?

What would the performance impact be on the cluster when you need to evacuate or rebuild all data from one node?

3 Replies
TheBobkin
Champion

Hello redsnapper76

There is no accurate answer to how long an evacuation of x TB will take, as this depends on numerous factors including but not limited to disk type and speed, VM workload on the cluster, network throughput, number of disk groups (DGs) available for component placement, the Fault Tolerance Method applied to the data, whether deduplication & compression is enabled, etc.

But from experience, realistically even a hybrid 4-node cluster with decent hardware shouldn't have an issue with pushing upwards of 1 TB an hour (then again, this could be an underestimate or even a best-case scenario, as it depends on so many of the factors listed above).
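To make that rule of thumb concrete for the sizes asked about, here is a minimal Python sketch. The 1 TB/hour figure is only the rough number quoted above, and the 0.5 and 2.0 TB/hour bounds are arbitrary assumptions added purely to show how wide the range can be; none of these are measured values.

```python
def evacuation_hours(data_tb: float, rate_tb_per_hour: float = 1.0) -> float:
    """Hours needed to move data_tb at an assumed sustained rate."""
    if rate_tb_per_hour <= 0:
        raise ValueError("rate_tb_per_hour must be positive")
    return data_tb / rate_tb_per_hour


for size_tb in (1, 5, 10):
    # Assumed bounds: half and double the ~1 TB/hour rule of thumb.
    slow = evacuation_hours(size_tb, 0.5)
    nominal = evacuation_hours(size_tb, 1.0)
    fast = evacuation_hours(size_tb, 2.0)
    print(f"{size_tb:>2} TB: nominal {nominal:.1f} h (range {fast:.1f} h to {slow:.1f} h)")
```

The real rate would have to come from testing on the actual cluster under normal VM load.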

You don't have to do a 'Full Data Evacuation' when putting a host into Maintenance Mode for longer than 60 minutes. You can instead increase the CLOM repair delay timer and use 'Ensure Accessibility'; only the delta of the data will then resync once the host is added back. This does, however, mean that while the cluster is in this reduced-redundancy state a single hardware failure could result in data unavailability or loss, so taking backups before doing this is a must.

https://kb.vmware.com/s/article/2075456

Bob

redsnapper76
Contributor

Hello Bobkin

Thank you for your answer.

When there is an issue with a host, the time to put it back in production is often > 1 week.

You know how it goes, issue occurs, open ticket, uploading logs, firmware upgrade, eventually hardware repair, etc.

These are often lengthy processes, and from experience the resolution time is > 1 week.

So running 1 week in a reduced-redundancy environment is a risk.

We can reduce that risk by setting the number of host failures to tolerate (FTT) to 2, or by changing maintenance contracts.

So is it safe to say that, with an All-Flash environment and a 10 Gbit backbone, pushing 5 TB would take approx. 4-5 hours?
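As a sanity check on the 4-5 hour figure, this small Python sketch compares it with the theoretical line rate of a single 10 Gbit link. It assumes decimal units (1 TB = 10^12 bytes), one link, and no protocol overhead or competing VM traffic, so it gives a lower bound on time rather than a prediction.

```python
def line_rate_hours(data_tb: float, link_gbit: float = 10.0) -> float:
    """Hours to move data_tb at 100% of the link rate (idealised, no overhead)."""
    total_bytes = data_tb * 1e12
    bytes_per_second = link_gbit * 1e9 / 8
    return total_bytes / bytes_per_second / 3600


def implied_mb_per_second(data_tb: float, hours: float) -> float:
    """Sustained throughput in MB/s implied by moving data_tb in the given hours."""
    return data_tb * 1e12 / (hours * 3600) / 1e6


print(f"5 TB at full 10 Gbit line rate: {line_rate_hours(5):.2f} h")
for hours in (4, 5):
    rate = implied_mb_per_second(5, hours)
    share = rate / 1250 * 100  # 10 Gbit/s is 1250 MB/s
    print(f"5 TB in {hours} h implies {rate:.0f} MB/s ({share:.0f}% of line rate)")
```

In other words, 4-5 hours corresponds to roughly a quarter of what one 10 Gbit link could carry in theory; whether the disks and the live workload allow that is something only testing will show.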

TheBobkin
Champion

Hello redsnapper76,

Yes, I agree that a week is potentially too long, other than for workloads such as easily replaced VDIs or anything that might be run as FTT=0 anyway.

FTT=2 with RAID-1 requires a minimum of a 5-node cluster (and FTT=2 with RAID-6 requires a 6-node cluster), so a 4-node cluster cannot use either.
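The node counts above follow the commonly documented vSAN minimums: 2 x FTT + 1 hosts for RAID-1 mirroring, 4 hosts for RAID-5 (FTT=1) and 6 hosts for RAID-6 (FTT=2). A minimal Python sketch of that rule, with a hypothetical helper name and the planned 4-host cluster as the example:

```python
def min_hosts(ftt: int, raid: str) -> int:
    """Minimum hosts for a vSAN policy, per the commonly documented rules."""
    if raid == "RAID-1":
        if not 1 <= ftt <= 3:
            raise ValueError("RAID-1 mirroring is assumed to support FTT 1 to 3")
        return 2 * ftt + 1  # FTT+1 replicas plus FTT witnesses
    erasure_coding = {("RAID-5", 1): 4, ("RAID-6", 2): 6}
    if (raid, ftt) not in erasure_coding:
        raise ValueError(f"unsupported combination: {raid} with FTT={ftt}")
    return erasure_coding[(raid, ftt)]


cluster_hosts = 4  # the cluster size being planned in this thread
for ftt, raid in ((1, "RAID-1"), (1, "RAID-5"), (2, "RAID-1"), (2, "RAID-6")):
    needed = min_hosts(ftt, raid)
    verdict = "ok" if cluster_hosts >= needed else "not enough hosts"
    print(f"FTT={ftt} {raid}: needs {needed} hosts -> {verdict}")
```

Note that running at exactly the minimum host count leaves no spare host to rebuild onto, which is worth keeping in mind for the long repair windows described earlier.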

With All-Flash (again, so many variables in that alone) a full evacuation should go a lot faster than it would on a Hybrid implementation, but really I wouldn't put any number on it until it's in place and you can start testing. Do also factor in VM workloads when planning/testing, as the data-move performance of a cluster with little data and barely any actively-in-use data may contrast starkly with how fast data will move while the cluster is fully operational.

Bob
