VMware Cloud Community
time81
Contributor

vSAN 6.6: insufficient vSphere HA failover resources

Hi,

We are testing a brand-new vSAN 6.6 cluster, but it instantly reports "insufficient vSphere HA failover resources",

even though all 3 nodes (Dell R730xd with 2x 10-core CPUs) are empty.

"Reconfigure for HA" didn't help.

Turning HA off and on again didn't help.

Admission control is turned off, but even if I set it to tolerate 1 host failure, or to reserve 25% of CPU resources, it still gives the error?

Any suggestions?

9 Replies
jameseydoyle
VMware Employee

Hi,

It would be important to verify that the HA cluster is forming correctly. Can you check Monitor > vSphere HA > Summary and ensure that you have one master with the other two hosts connected to it? Is your vSAN network properly configured? In clusters where vSAN is enabled, HA uses the vSAN-enabled VMkernel port for its traffic, so ensure there are no issues with the vSAN network configuration. Check Monitor > vSAN > Health and ensure that all Network tests are passing.
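As a quick CLI cross-check of the network side, `esxcli vsan network list` on each host shows which VMkernel interface is tagged for vSAN traffic. The snippet below only illustrates the output shape with an abridged, hypothetical sample; on a real host, pipe the actual command's output instead of `$sample`.

```shell
# Abridged, hypothetical `esxcli vsan network list` output
# (the real output carries more fields per interface).
sample='Interface
   VmkNic Name: vmk1
   Traffic Type: vsan'

# Pull out the vmknic carrying vSAN (and therefore HA) traffic.
echo "$sample" | grep -o 'vmk[0-9]*'
```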

TheBobkin
Champion

Hello time,

Check the current cluster membership from a host and make sure all nodes are correctly clustered:

# esxcli vsan cluster get

Check that no hosts are in vSAN Maintenance Mode (they should all show "decomState": 0):

# cmmds-tool find -t NODE_DECOM_STATE -f json
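The decom-state check can be narrowed with a grep so any non-zero state stands out immediately. The sketch below runs against a hypothetical, abridged sample of the JSON output (the real entries carry more fields); on a host, pipe `cmmds-tool find -t NODE_DECOM_STATE -f json` into the same filter instead.

```shell
# Hypothetical abridged NODE_DECOM_STATE output for two hosts.
sample='{"entries":[
 {"uuid":"597f05ca-7472-cc68-de1c-a0369fd8e08c","content":{"decomState":0}},
 {"uuid":"597f1c09-d372-daff-b609-a0369fcc4db4","content":{"decomState":0}}]}'

# If every node is out of maintenance mode, the only distinct value is 0;
# any other value means a node is (partially) decommissioned.
echo "$sample" | grep -o '"decomState":[0-9]*' | sort -u
```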

Bob

time81
Contributor

All decomState 0.

Cluster Information

   Enabled: true

   Current Local Time: 2017-08-02T14:02:27Z

   Local Node UUID: 597f1c09-d372-daff-b609-a0369fcc4db4

   Local Node Type: NORMAL

   Local Node State: MASTER

   Local Node Health State: HEALTHY

   Sub-Cluster Master UUID: 597f1c09-d372-daff-b609-a0369fcc4db4

   Sub-Cluster Backup UUID: 597f05ca-7472-cc68-de1c-a0369fd8e08c

   Sub-Cluster UUID: 52c9da94-4b46-c911-8976-e2888f0c1bde

   Sub-Cluster Membership Entry Revision: 2

   Sub-Cluster Member Count: 4

   Sub-Cluster Member UUIDs: 597f05ca-7472-cc68-de1c-a0369fd8e08c, 597f1c09-d372-daff-b609-a0369fcc4db4, 597f23c4-8c11-3784-1cc7-a0369fcc5184, 59805405-6568-a8c0-5277-000c299d2d36

   Sub-Cluster Membership UUID: c9718059-3d6e-187e-d9cd-a0369fcc4db4

   Unicast Mode Enabled: true

   Maintenance Mode State: OFF

   Config Generation: 11fcd4eb-d081-48e9-8158-157ab295dc95 5 2017-08-01T11:51:46.716

TheBobkin
Champion

Hello time,

I thought this was a 3-node cluster?

Sub-Cluster Member Count: 4

Bob

jameseydoyle
VMware Employee

Can you send a summary of the HA cluster status?

You could also send the contents of the /opt/vmware/fdm/fdm/hostlist file from any of the hosts.

time81
Contributor

Well, it is 3! See the pic below. Does it have to do with the witness host? (VMware Witness Appliance 6.5)

We had a 2-node cluster before, with the 3rd physical host selected as witness, but we destroyed it, and I had to delete all the partitions and reboot the ESXi host.

I can't find the hostlist file:

[root@:/opt/vmware/fdm/fdm] ls -la

total 25380

drwxr-xr-x    1 root     root           512 Aug  1 10:02 .

drwxr-xr-x    1 root     root           512 Aug  1 10:02 ..

-r-xr-xr-x    1 root     root      22846120 Jul  7 14:47 fdm

-r-xr-xr-x    1 root     root           649 Jul  7 14:50 fdm-dump.sh

-r--r--r--    1 root     root       2174892 Jul  7 14:50 libcrypto.so.1.0.2

-r--r--r--    1 root     root        398060 Jul  7 14:50 libssl.so.1.0.2

-r-xr-xr-x    1 root     root           963 Jul  7 14:50 prettyPrint.sh

-r-xr-xr-x    1 root     root        502488 Jul  7 14:47 readCompressed

-r-xr-xr-x    1 root     root         40057 Jul  7 14:50 vpxResultFilter.xml

-r-xr-xr-x    1 root     root          1544 Jul  7 14:50 xmlpp.py

jameseydoyle
VMware Employee

I apologise, I gave you the wrong file path:

/etc/opt/vmware/fdm/hostlist

is the correct location.

However, it does appear that the cluster is expecting 4 hosts, as vSAN lists 4 members. This means your cluster is always one host down!

The cleanest way out of this scenario would be to create a new cluster: remove all the disk groups from the hosts and add the hosts to the new cluster.

Next time you need to remove a witness host, be sure to disable the stretched-cluster configuration first and remove the host cleanly.

TheBobkin
Champion

Hello time,

You can identify 'who' it thinks the 4 cluster members are from any host using these commands (I populated them from your Sub-Cluster Member UUIDs):

# cmmds-tool find -t HOSTNAME -f json -u 597f05ca-7472-cc68-de1c-a0369fd8e08c |grep content

# cmmds-tool find -t HOSTNAME -f json -u 597f1c09-d372-daff-b609-a0369fcc4db4 |grep content

# cmmds-tool find -t HOSTNAME -f json -u 597f23c4-8c11-3784-1cc7-a0369fcc5184 |grep content

# cmmds-tool find -t HOSTNAME -f json -u 59805405-6568-a8c0-5277-000c299d2d36 |grep content
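The four lookups can also be rolled into one loop. `cmmds-tool` only exists on ESXi, so the sketch below stubs it with a placeholder function (hypothetical output shape) so the loop's structure is runnable anywhere; on a real host, drop the stub and call `cmmds-tool` directly as shown in the comment.

```shell
# Stub standing in for the ESXi-only cmmds-tool binary; the output shape
# is hypothetical, for illustration only.
cmmds_hostname() { echo "\"content\": {\"hostname\": \"host-for-$1\"}"; }

# On a real host, replace the cmmds_hostname call with:
#   cmmds-tool find -t HOSTNAME -f json -u "$u" | grep content
for u in \
  597f05ca-7472-cc68-de1c-a0369fd8e08c \
  597f1c09-d372-daff-b609-a0369fcc4db4 \
  597f23c4-8c11-3784-1cc7-a0369fcc5184 \
  59805405-6568-a8c0-5277-000c299d2d36
do
  cmmds_hostname "$u"
done
```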

Once you have identified the member that should not be present, try removing it from the cluster. If that member is the witness that is already gone, then it was not decommissioned properly; in that case (if removing it is not possible), creating a new cluster and adding the nodes to it is the best option. How simple this will be depends on whether you have data on the cluster and/or can move it all off or do a backup and restore.

Bob

time81
Contributor

Thanks guys.

Re-created it with 3 hosts in one fault domain, without a witness. It's showing 3 members now and the error is gone.
