dvclcs
Contributor

Error putting in maintenance mode

Hi,

I cannot put the hosts into maintenance mode; I have attached the errors and details of the environment.

An insufficient-resources message is displayed. I have 8 hosts.

I do not understand.

Tks!

9 Replies
TheBobkin
Champion

Hello dvclcs

A few questions and points that should help clarify the situation here:

Are any/many of the other hosts in the cluster already in Maintenance Mode (MM)? - If you want to do MM with 'Full Data Evacuation' then you need enough available nodes and/or disks that can be targets for this data while still being in compliance with the applied Storage Policy (SP) (e.g. FTT=1 with RAID-1 FTM requires 3 hosts).

/var/log/clomd.log will tell you precisely why it is not able to enter MM with this option, if you want to take a look - e.g. '3 fault domains needed, 2 available' or '3 disks needed, 2 available'.
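
If it helps, a quick way to pull the relevant lines over SSH is something like the below (just a sketch; 'needed', 'insufficient' and 'decom' are example search strings and the exact wording in clomd.log varies by version):

# grep -iE "needed|insufficient|decom" /var/log/clomd.log | tail -n 20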

Is this a stretched cluster? - These basically work with each data site + Witness being a Fault Domain, so the same applies (e.g. you can't evacuate all the data off one site as it would have nowhere to go without violating the FTT of the SP).

Is the SP you added as screenshot 'rule.jpg' the only SP in use? (It doesn't look right: if you have an FTT, one must also apply at the very least an FTM.)

Do you have any host-affinity rules in place?

Is there a particular reason you are putting hosts in MM with Full Data Migration?

What exactly are you trying to achieve here?

Bob

dvclcs
Contributor

It was a stretched cluster; I disabled it because I will divide the servers into 2 sites (4 hosts in Site A / 4 hosts in Site B);

This is the first server I am putting into maintenance mode;

No affinity rule applied;

The final goal is to remove 4 hosts from the cluster to add to Site B.

rule.jpg is the only rule applied to all VMs.

tks

TheBobkin
Champion

Hello dvclcs​,

"It was an stretched cluster, I disabled it because I will divide the servers into 2 Sites"

How did you disable this? - If you did this correctly it *should* just see the cluster as an 8-node cluster (as opposed to a 4+4+1) and allow data evacuation accordingly.

Is the Witness still a part of the cluster? Did you remove the Fault-Domains per site that are configured as part of a stretched cluster?
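
A quick way to sanity-check the membership from any data host over SSH is the below (a sketch; the exact output fields vary a little by vSAN version):

# esxcli vsan cluster get

With the Witness removed and the Fault Domains cleared you would expect a Sub-Cluster Member Count of 8 and no witness node listed among the members.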

"It's my first server that I put into maintenance mode"

This would strongly indicate that you either have something like a per-site RAID-5 SP, a high amount of striping, or that the stretched-cluster configuration was not removed.

What does the clomd.log say? I can take a look if you would like, PM it to me if you would prefer to not post it here.

"rule.jpg is the only rule applied to all VMs."

Not to mistrust what you say about the SP applied to the Objects, but can you check the SPs in use here? The first one to check would be the Default, though if others are present, check the specifics of these and which Objects/VMs they are applied to, e.g.:

Home > Policies & Profiles > VM Storage Policies > vSAN Default Storage Policy
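
If it is easier, the host-side defaults can also be dumped from the CLI (a sketch; note this shows the per-object-class defaults a host applies when no policy is specified, not what vCenter has assigned to existing VMs):

# esxcli vsan policy getdefault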

Bob

dvclcs
Contributor

"How did you disable this? - If you did this correctly it *should* just see the cluster as an 8-node cluster (as opposed to a 4+4+1) and allow data evacuation accordingly.

Is the Witness still a part of the cluster? Did you remove the Fault-Domains per site that are configured as part of a stretched cluster?"

I followed that step by step:

Convert a Stretched Cluster to a Standard vSAN Cluster

I powered off the Witness. Is that right?

clomd attached, plus screenshots of the stretched cluster and the SP.

TheBobkin
Champion

Hello dvclcs​,

That clomd file you uploaded appears to be more akin to the /etc clomd service .sh script :smileygrin: (or something else).

What I specified was /var/log/clomd.log

Can you check via the CLI (SSH to a host) that the fault domains were properly removed?

#  esxcli vsan faultdomain get

This should just return the Local Node UUID as FD ID and nothing under 'Fault Domain Name' if these have been fully decommissioned.

"Witness I Power OFF. That's right?"

It shouldn't matter if the stretched cluster was decommissioned fully - it should not be a cluster member.

Bob

dvclcs
Contributor

I sent clomd.log in a PM.

esxcli vsan faultdomain get

   Fault Domain Id: 5ade5b72-b9a6-44ac-a95d-e472e2f58422

   Fault Domain Name:

dvclcs
Contributor

I am thinking of setting the SP to the datastore default for all VMs.

TheBobkin
Champion

Hello dvclcs

That clomd log is not making as much sense as expected, but I am seeing a lot of decom states of 'failed' or 'none' for the nodes - can you refresh the decom state on each node:

# esxcli vsan maintenancemode cancel  (this should have no negative impact)
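
Once the cancel has been run on each host you could also retry entering MM from the CLI, for example (a sketch only - the option names below are from memory, so please confirm with 'esxcli system maintenanceMode set --help' on your build; the evacuateAllData mode will sit there until the evacuation completes or times out):

# esxcli system maintenanceMode set --enable true --vsanmode evacuateAllData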

An alternative to try here would be decommissioning the disk-groups on the hosts of one site with the full data evacuation option (Cluster > Configure > Disk Management > Host > DG > Delete with 'Full Data Migration') - sure, this will evacuate onto all available disks (not just the 4 hosts you want to evacuate), but it is worth seeing whether or not this works as a secondary option, as shown below.
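
If you go this route, it may help to first list what each host has claimed so you can match the disk-group shown in the UI (a sketch; run per host over SSH - each device entry should include the disk-group UUID it belongs to):

# esxcli vsan storage list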

Is this a production cluster? If so, I would advise giving my colleagues in GSS a call if you have S&S (as I am AFK until tomorrow).

If the data on this cluster is not crucial and/or is backed up, you could just apply an FTT=0 Storage Policy to it all and then march on, making it FTT=1 afterwards.

Bob

dvclcs
Contributor

This is a production cluster.

"decommissioning disk-groups"  - You have data loss?

"you could just apply an FTT=0 Storage Policy to it all" - Attached. Already FTT=0
