Solved: Limit after 1 additional Host Failure?

Evan5 · ‎03-13-2017

Hi - please help me understand why I am getting the error message in the vSAN Health Check: "Limits > After 1 additional host failure"

From the screen shot below you can see it tells me I have used over 85% of disk space utilization. It says the total is roughly 26.5TB

But if you look at the second screen shot of the capacity of my vSAN you can see the total capacity is 34.57TB

My Hybrid vSAN 6.2 consists of:

4 Hosts

2 disk groups per host

Each disk group consists of:

1x... 800GB SSD

4x... 1.2TB HDD

So in total 32x..1.2 TB drives as capacity drives.

Help me understand why I have errors about disk space when I still have 12.28TB of free disk space?

Thanks,

EDIT: spelling typo

TheBobkin · ‎03-13-2017

Hello,

Firstly, every category in the vSAN Health GUI should be looked at with a bit of background knowledge of how vSAN functions and why it alerts to things such as this:

Some of these 'warnings/alarms' have conservative thresholds, such as this one that is triggered by the fact that "After 1 additional host failure" you will exceed 80% vsandatastore capacity (after all data has been resynced, assuming the failed host never comes back).

80% max utilized capacity is best practice for vSAN clusters for reasons such as this (to allow overhead to protect data in case of failures).

26.5TB is what usable capacity would be available if 1 of your 4 hosts failed.

This alarm is merely informing you that if one host failed (and it never recovered) you would, at that point, be using 85% of the available space after rebuilding the components/objects from the failed host (which are marked as degraded and automatically resynced after 60 minutes with default settings).

If you do require more space on this cluster to account for the increasing usage of the vsandatastore (I am assuming this warning is the result of slowly growing thin-provisioned vmdks and/or you adding more VMs/vmdks to this cluster), consider adding 1 or 2 more capacity drives to each disk-group (the same make+model if possible), assuming you have free disk bays. You will still be well within the best practice 10% cache to capacity ratio per disk-group, so no problem there.

Bob
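As a rough illustration of the suggestion above, the following Python sketch works out how much raw capacity 1 or 2 extra drives per disk-group would add, and what the cache-to-capacity ratio would become. It uses the drive sizes from the original question. Measuring the ratio against raw capacity per disk-group is an assumption made for this example, and the function names are hypothetical.

```python
# Figures from the thread: 4 hosts, 2 disk groups per host,
# each with one 800 GB cache SSD and four 1.2 TB capacity HDDs.
HOSTS = 4
DISK_GROUPS_PER_HOST = 2
CACHE_GB = 800
CAPACITY_DRIVE_GB = 1200


def added_raw_tb(extra_drives_per_group: int) -> float:
    """Raw capacity gained cluster-wide by adding drives to every disk group."""
    groups = HOSTS * DISK_GROUPS_PER_HOST
    return groups * extra_drives_per_group * CAPACITY_DRIVE_GB / 1000


def cache_to_raw_ratio(drives_per_group: int) -> float:
    """Cache size as a fraction of raw capacity in one disk group.

    Assumption: the ratio is taken against raw capacity, which is only
    one way of reading the 10% guideline.
    """
    return CACHE_GB / (drives_per_group * CAPACITY_DRIVE_GB)


for extra in (0, 1, 2):
    drives = 4 + extra
    print(f"{drives} drives/group: +{added_raw_tb(extra):.1f} TB raw, "
          f"cache ratio {cache_to_raw_ratio(drives):.1%}")
```

Under these assumptions the ratio stays above 10% even with six capacity drives per disk-group.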

virtualg_uk · ‎03-13-2017

According to the documentation, this alarm is produced from the results of a very basic health check / simulation that does not take into account certain scenarios. It's probably better for it to err on the side of caution, although I have not personally encountered this, so it could still be that something is wrong. Do you have vROps installed? That might give us a clue.

vSAN Health Service - Limits Health – After one additional host failure (2108743) | VMware KB

Graham | User Moderator | https://virtualg.uk

admin · ‎03-13-2017

You currently have 34.57TB in total.

Current usage is 22.24TB.

Current free space is 12.28TB.

If you lose one host, you lose roughly 8.6~8.7TB of raw space, which brings your total down to 26.5TB.

The vSAN objects on the failed host need to be resynced to satisfy your vSAN storage policy, and that consumes free space.

Let's assume the worst case: you use FTT=1 and there are 8.6TB of objects on the failed host. These need to be resynced to the other hosts to meet your storage policy, so you lose 8.6TB of free space.

12.28 - 8.6 = 3.68TB left. Your usage will then be 26.5 - 3.68 = 22.82TB, which is more than 85% of the 26.5TB total.

That's why you get the warning.
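The worst-case arithmetic in this post can be written out as a short Python sketch. It only mirrors the numbers quoted above (12.28TB free, 8.6TB on the failed host, 26.5TB remaining capacity). It is not the health check's actual algorithm, and the function name is hypothetical.

```python
def worst_case_after_host_failure(free_tb, failed_host_data_tb,
                                  capacity_after_failure_tb):
    """Reproduce the thread's worst-case estimate after one host is lost.

    Assumes all data from the failed host is rebuilt into the current
    free space, as described in the post.
    """
    free_after = free_tb - failed_host_data_tb
    used_after = capacity_after_failure_tb - free_after
    utilization = used_after / capacity_after_failure_tb
    return free_after, used_after, utilization


free_after, used_after, utilization = worst_case_after_host_failure(
    free_tb=12.28, failed_host_data_tb=8.6, capacity_after_failure_tb=26.5
)
print(f"free after resync: {free_after:.2f} TB")
print(f"used after resync: {used_after:.2f} TB")
print(f"utilization:       {utilization:.1%}")
```

With the thread's figures this gives about 3.68TB free, 22.82TB used, and a utilization just over 86%, which is above the 85% mentioned in the warning.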

Evan5 · ‎03-14-2017

Thanks to all for responding, it's much appreciated.

It does indeed make sense now that you have explained it so well. I thought initially that it was telling me I had already used over 85% of the space, but it's more of a "you will be over 85% if one host fails" situation.

The constant warning on my vSAN is quite annoying (I love to see green check marks) but I guess understandable. I suppose I will need to buy more disks to remove the warning and cover myself for future expansion and the accompanying compliance.

Thanks again.

PS: Wish I could mark more than one correct answer.

TheBobkin · ‎03-14-2017

There may be other options than adding additional disks that might be worth checking:

- VMs with 100% Object Space Reservation in their Storage Policy, applying a policy to these Objects with the same rules except for 0% Object Space Reservation may save space.

- VMs with large and/or numerous snapshots which can be consolidated, quick check for snapshots from CLI via SSH is:

# find -iname "*000*"

You won't be able to see how big they are on the vSAN datastore by checking here, though, as these files are basically just pointers to the snapshot Objects.

- Test/unused VMs that can be deleted.

- Lower priority VMs that can be moved to lower-tier or slower storage (such as NFS if available).

- Least priority VMs that are rarely used and can be easily recreated from templates or restored from back-up, which you could apply an FTT=0 Storage Policy to (bear in mind that if you have a physical disk failure and these components reside on the failed disk/disk-group they may be PERMANENTLY GONE and thus why the easily recreated/restored part is very important).

As I said before, if you do have to add more disks, try and get the same model as the ones that are in the disk-groups already and add 1 to each disk-group (you *may* get away with adding a single disk to one of the disk-groups on each host for now, while perfectly homogeneous disk-groups are ideal, having the same capacity per host is more important from a functional aspect).

Evan5 · ‎03-14-2017

Thanks Bobkin, I think I will go the way of adding more disks because I will probably be adding more VMs and data to my vSAN environment soon, so I might as well invest accordingly and remain compliant.

I had planned for future growth when I spec'ed my servers, so I still have a few open disk bays available on each of them; it's just a case of finding the budget to buy more disks 🙂

Makes me smile and shake my head when I think that having over 12TB of free disk space is not enough. Things in IT change so quickly 🙂

Thanks again.

Simonx182 · ‎05-02-2017

One additional clarification on that:

What would happen if he went over this limit? Would vSAN try to satisfy the storage policy or would it remain in this "risky" state?
Would this also happen in maintenance mode, or is there a way to prevent the rebuild?

Thanks
Simon

admin · ‎05-02-2017

The resync will not finish soon enough due to the lack of resources, so some objects might remain in a risky state.

In maintenance mode, yes, it can happen too, because after 60 minutes the vSAN cluster will start the resync.

You can raise this 60-minute value if you are sure a host won't be back online within 60 minutes.

Wadymus · ‎05-05-2020

Hi guys.

I have the following situation.

Why 0 GB ?

TheBobkin · ‎05-05-2020

Hello Wadymus,

Welcome to Communities.

You likely should have asked this as its own question as your screenshot indicates this is in an error state while the topic in this thread is in relation to how this health check works and what it means.

I am going to hazard an educated guess (aside from potential issues with Health and/or incompatible vCenter to host versions) that your vsanDatastore capacity is currently being seen as 0GB in size (or was when Health test was run).

If this is/was the case, there are a few possible causes:

VMware Knowledge Base

Bob
