Hi,
We are testing a brand new vSAN 6.6 cluster, but it immediately reports: "Insufficient vSphere HA failover resources",
even though all 3 nodes (Dell R730xd with 2x 10-core CPUs) are empty.
"Reconfigure for vSphere HA" didn't help.
Turning HA off and on again didn't help.
Admission control is turned off, but even if I set it to tolerate 1 host failure, or to reserve 25% of CPU resources, it still gives the error.
Any suggestions?
Hi,
First, verify that the HA cluster is forming correctly: check Monitor > vSphere HA > Summary and ensure you have 1 master and 2 hosts connected to that master. Also verify that your vSAN network is properly configured: in clusters where vSAN is enabled, HA uses the vSAN-enabled VMkernel port for its traffic, so ensure there are no issues with the vSAN configuration. Check Monitor > vSAN > Health and confirm that all Network tests pass.
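If you prefer the CLI, 6.6 runs the health checks on the hosts themselves, so you can query them directly; a minimal sketch (the grep filter is just an assumption about the output layout), wrapped in a function so it can be pasted safely and run later on an ESXi host:

```shell
# Sketch only: intended to run on an ESXi host.
# "esxcli vsan health cluster list" prints the health-check groups and
# their status; we filter for the network-related tests mentioned above.
check_vsan_network_health() {
  esxcli vsan health cluster list | grep -i "network"
}
# On a host, invoke with: check_vsan_network_health
```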
Hello time,
Check the current cluster membership from a host and make sure all nodes are correctly clustered:
# esxcli vsan cluster get
Check that no hosts are in vSAN Maintenance Mode (all should show "decomState": 0):
# cmmds-tool find -t NODE_DECOM_STATE -f json
Bob
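To eyeball the decomState values quickly, something like the following filter works; the sample string below is a made-up stand-in for the real cmmds-tool JSON, just to show the idea (on a host, pipe the command above instead):

```shell
# Hypothetical sample mimicking cmmds-tool NODE_DECOM_STATE output;
# on a real host, pipe "cmmds-tool find -t NODE_DECOM_STATE -f json"
# into the grep chain instead of this variable.
sample='{"content": {"decomState": 0}} {"content": {"decomState": 0}} {"content": {"decomState": 0}}'
# Count entries whose decomState is NOT 0 (non-zero means the node is
# mid-decommission or stuck in vSAN maintenance mode).
stuck=$(printf '%s\n' "$sample" | grep -o '"decomState": [0-9]*' | grep -vc '"decomState": 0$')
echo "nodes not in decomState 0: $stuck"
```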
All decom State 0
Cluster Information
Enabled: true
Current Local Time: 2017-08-02T14:02:27Z
Local Node UUID: 597f1c09-d372-daff-b609-a0369fcc4db4
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 597f1c09-d372-daff-b609-a0369fcc4db4
Sub-Cluster Backup UUID: 597f05ca-7472-cc68-de1c-a0369fd8e08c
Sub-Cluster UUID: 52c9da94-4b46-c911-8976-e2888f0c1bde
Sub-Cluster Membership Entry Revision: 2
Sub-Cluster Member Count: 4
Sub-Cluster Member UUIDs: 597f05ca-7472-cc68-de1c-a0369fd8e08c, 597f1c09-d372-daff-b609-a0369fcc4db4, 597f23c4-8c11-3784-1cc7-a0369fcc5184, 59805405-6568-a8c0-5277-000c299d2d36
Sub-Cluster Membership UUID: c9718059-3d6e-187e-d9cd-a0369fcc4db4
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: 11fcd4eb-d081-48e9-8158-157ab295dc95 5 2017-08-01T11:51:46.716
Hello time,
I thought this was a 3-node cluster?
Sub-Cluster Member Count: 4
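A quick way to catch this kind of drift is to compare the reported member count against what you expect; a sketch against a copy of the line from your output (on a live host you would pipe `esxcli vsan cluster get` instead of hard-coding the line):

```shell
# Expected host count vs. what vSAN reports. The sample line is copied
# from the output above; on a host, get the live value with:
#   esxcli vsan cluster get | grep "Sub-Cluster Member Count"
expected=3
line='   Sub-Cluster Member Count: 4'
actual=${line##*: }   # strip everything up to the last ": "
if [ "$actual" -ne "$expected" ]; then
  echo "mismatch: vSAN sees $actual members, expected $expected"
fi
```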
Bob
Can you send a summary of the HA cluster status?
You could send the contents of the /opt/vmware/fdm/fdm/hostlist file from any of the hosts.
Well, it is 3! See the pic below. Does it have to do with the witness host? (VMware Witness Appliance 6.5)
We had a 2-node cluster before, with the 3rd physical host selected as witness, but we destroyed it; I had to delete all the partitions and reboot the ESXi host.
I can't find the hostlist file:
[root@:/opt/vmware/fdm/fdm] ls -la
total 25380
drwxr-xr-x 1 root root 512 Aug 1 10:02 .
drwxr-xr-x 1 root root 512 Aug 1 10:02 ..
-r-xr-xr-x 1 root root 22846120 Jul 7 14:47 fdm
-r-xr-xr-x 1 root root 649 Jul 7 14:50 fdm-dump.sh
-r--r--r-- 1 root root 2174892 Jul 7 14:50 libcrypto.so.1.0.2
-r--r--r-- 1 root root 398060 Jul 7 14:50 libssl.so.1.0.2
-r-xr-xr-x 1 root root 963 Jul 7 14:50 prettyPrint.sh
-r-xr-xr-x 1 root root 502488 Jul 7 14:47 readCompressed
-r-xr-xr-x 1 root root 40057 Jul 7 14:50 vpxResultFilter.xml
-r-xr-xr-x 1 root root 1544 Jul 7 14:50 xmlpp.py
I apologise, I gave you the wrong filepath:
/etc/opt/vmware/fdm/hostlist
is the correct location.
However, it does appear that the cluster is expecting 4 hosts, as vSAN lists 4 members. This means your cluster is always one host down!
The cleanest way out of this scenario is to create a new cluster: remove all the disk groups from the hosts, then add the hosts to the new cluster.
Next time you need to remove a witness host, make sure to disable the stretched cluster configuration first and remove the host cleanly.
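For reference, a sketch of what pulling a host cleanly out of vSAN could look like from the CLI, wrapped in a function so nothing runs by accident; the disk-group UUID is a placeholder you would take from the `esxcli vsan storage list` output, not a real value:

```shell
# Sketch only: run these steps on the host being removed from vSAN.
# "<diskgroup-uuid>" is a hypothetical placeholder.
vsan_clean_removal() {
  esxcli vsan storage list                          # note the vSAN disk-group UUIDs
  esxcli vsan storage remove -u "<diskgroup-uuid>"  # delete the disk group (destroys its data)
  esxcli vsan cluster leave                         # detach the host from the vSAN cluster
}
```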
Hello time,
You can identify 'who' vSAN thinks the 4 cluster members are from any host using the following (I populated the UUIDs from the Sub-Cluster Member UUIDs line in your output):
# cmmds-tool find -t HOSTNAME -f json -u 597f05ca-7472-cc68-de1c-a0369fd8e08c |grep content
# cmmds-tool find -t HOSTNAME -f json -u 597f1c09-d372-daff-b609-a0369fcc4db4 |grep content
# cmmds-tool find -t HOSTNAME -f json -u 597f23c4-8c11-3784-1cc7-a0369fcc5184 |grep content
# cmmds-tool find -t HOSTNAME -f json -u 59805405-6568-a8c0-5277-000c299d2d36 |grep content
Once you have identified the member that should not be present, try removing it from the cluster. If that member is the Witness that is already gone, then it was not decommissioned properly; in that case (if removing it is not possible), creating a new cluster and adding the nodes to it is the best option. How simple that will be depends on whether you have data on the cluster and/or can move it all off, or back up and restore.
Bob
Thanks guys.
Re-created it with 3 hosts in 1 fault domain, without a witness. It's showing 3 members now and the error is gone.