hp_avik2401
Contributor

vSphere HA Issue ESXi 7.0

I have a 12-node cluster, and most of the nodes don't join the vSphere HA cluster; they get stuck in the "Elect" state. Here are fdm.log extracts from one of the nodes. Any feedback is appreciated.

11119] [Originator@6876 sub=Message opID=SWI-25dea6b0] Destroying connection
2021-06-09T06:35:50.190Z info fdm[2611130] [Originator@6876 sub=Cluster opID=SWI-24af052a] Untrusted thumbprint (D7:71:82:E1:08:9D:35:10:BF:61:8D:4B:A8:04:CB:C9:46:BB:43:95) for host (10.6.203.38)- failing verify
2021-06-09T06:35:50.190Z verbose fdm[2611130] [Originator@6876 sub=Cluster opID=SWI-24af052a] Blacklisting ip address 10.6.203.38 for 60 seconds
2021-06-09T06:35:50.190Z verbose fdm[2611130] [Originator@6876 sub=Cluster opID=SWI-24af052a] IP 10.6.203.38 marked bad for reason Invalid Credentials
2021-06-09T06:35:50.190Z warning fdm[2611130] [Originator@6876 sub=Cluster opID=SWI-24af052a] Failed to verify host (10.6.203.38) - closing connection
2021-06-09T06:35:50.190Z info fdm[2611129] [Originator@6876 sub=Cluster opID=SWI-5e7c1b36] Untrusted thumbprint (9C:80:49:17:AA:A6:D6:4A:E8:35:3E:20:CB:DF:69:DB:8D:0B:67:89) for host (10.6.203.33)- failing verify
2021-06-09T06:35:50.190Z verbose fdm[2611129] [Originator@6876 sub=Cluster opID=SWI-5e7c1b36] Blacklisting ip address 10.6.203.33 for 60 seconds
2021-06-09T06:35:50.190Z verbose fdm[2611129] [Originator@6876 sub=Cluster opID=SWI-5e7c1b36] IP 10.6.203.33 marked bad for reason Invalid Credentials
2021-06-09T06:35:50.190Z warning fdm[2611129] [Originator@6876 sub=Cluster opID=SWI-5e7c1b36] Failed to verify host (10.6.203.33) - closing connection
2021-06-09T06:35:50.190Z verbose fdm[2611130] [Originator@6876 sub=Message opID=SWI-24af052a] Accept completion callback error N5Vmomi5Fault13SecurityError9ExceptionE(Fault cause: vmodl.fault.SecurityError
--> )
--> [context]zKq7AVECAAAAAOllCAESZmRtAADtBdhmZG0AAPJdyAC9o8cAy0bFAASocQCBifMAkrb1ADbN9QDBz/UAVyLEAGIvxAC8DMAAtRPAAJnLygBNE8sA8qXWATt9AGxpYnB0aHJlYWQuc28uMAAC/acObGliYy5zby42AA==[/context]
2021-06-09T06:35:50.190Z info fdm[2611130] [Originator@6876 sub=Message opID=SWI-24af052a] Destroying connection
2021-06-09T06:35:50.190Z verbose fdm[2611129] [Originator@6876 sub=Message opID=SWI-5e7c1b36] Accept completion callback error N5Vmomi5Fault13SecurityError9ExceptionE(Fault cause: vmodl.fault.SecurityError
--> )
--> [context]zKq7AVECAAAAAOllCAESZmRtAADtBdhmZG0AAPJdyAC9o8cAy0bFAASocQCBifMAkrb1ADbN9QDBz/UAVyLEAGIvxAC8DMAAtRPAAJnLygBNE8sA8qXWATt9AGxpYnB0aHJlYWQuc28uMAAC/acObGliYy5zby42AA==[/context]
2021-06-09T06:35:50.190Z info fdm[2611129] [Originator@6876 sub=Message opID=SWI-5e7c1b36] Destroying connection
2021-06-09T06:35:50.435Z verbose fdm[2611134] [Originator@6876 sub=Election opID=SWI-60b7acd9] CheckVersion: Version[0] Other host GT : 231 > 230
2021-06-09T06:35:50.435Z verbose fdm[2611134] [Originator@6876 sub=Cluster opID=SWI-60b7acd9] Version[0] 231 from host-1033,10.6.203.39
2021-06-09T06:35:50.435Z info fdm[2611134] [Originator@6876 sub=Cluster opID=SWI-60b7acd9] Host host-1033 not in host list - aborting fetch for 0
2021-06-09T06:35:50.435Z verbose fdm[2611134] [Originator@6876 sub=E
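
To see at a glance which peers are being rejected, the "Untrusted thumbprint" lines in an excerpt like the one above can be tallied per host. The following is a minimal sketch, not a VMware tool: the regular expression is inferred only from the line format shown in this post, and the function name `untrusted_thumbprints` is hypothetical.

```python
import re
from collections import defaultdict

# Pattern inferred from the "Untrusted thumbprint" lines in this excerpt;
# other fdm.log versions may format the message differently.
UNTRUSTED = re.compile(
    r"Untrusted thumbprint \((?P<thumb>[0-9A-Fa-f:]+)\) for host \((?P<host>[^)]+)\)"
)


def untrusted_thumbprints(log_text):
    """Map each peer host address to the set of thumbprints rejected for it."""
    rejected = defaultdict(set)
    for line in log_text.splitlines():
        match = UNTRUSTED.search(line)
        if match:
            rejected[match.group("host")].add(match.group("thumb").upper())
    return dict(rejected)


# Two lines copied from the excerpt above, plus one unrelated line.
sample = """\
2021-06-09T06:35:50.190Z info fdm[2611130] [Originator@6876 sub=Cluster opID=SWI-24af052a] Untrusted thumbprint (D7:71:82:E1:08:9D:35:10:BF:61:8D:4B:A8:04:CB:C9:46:BB:43:95) for host (10.6.203.38)- failing verify
2021-06-09T06:35:50.190Z verbose fdm[2611130] [Originator@6876 sub=Cluster opID=SWI-24af052a] Blacklisting ip address 10.6.203.38 for 60 seconds
2021-06-09T06:35:50.190Z info fdm[2611129] [Originator@6876 sub=Cluster opID=SWI-5e7c1b36] Untrusted thumbprint (9C:80:49:17:AA:A6:D6:4A:E8:35:3E:20:CB:DF:69:DB:8D:0B:67:89) for host (10.6.203.33)- failing verify
"""

for host, thumbs in sorted(untrusted_thumbprints(sample).items()):
    print(host, sorted(thumbs))
```

Running this over the full fdm.log from each node would show whether every node rejects the same peers or whether the rejections are limited to a few hosts.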

4 Replies
depping
Leadership

Disconnect the host and reconnect it, and see if that solves the problem; it typically does.

hp_avik2401
Contributor

Thanks, depping. I had tried this, but no luck. What's the best method to ensure this is not caused by the certificate? Functionally, everything else is working fine, including SRM. Is it possible that a non-pingable default gateway / isolation network address is the reason for such behavior?
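
One way to check the certificate angle is to compute the thumbprint of the certificate a host currently presents and compare it with the value fdm.log reports as untrusted. The sketch below rests on assumptions: the 20-byte thumbprints in the log look like SHA-1 digests of the DER-encoded certificate, and port 443 is assumed to serve the same certificate that FDM verifies. The function names are hypothetical.

```python
import hashlib
import ssl


def format_thumbprint(der_bytes):
    """Return the SHA-1 digest of DER certificate bytes as AA:BB:...:FF."""
    digest = hashlib.sha1(der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def fetch_thumbprint(host, port=443):
    """Fetch the certificate a host presents and return its thumbprint.

    Port 443 is an assumption; no certificate validation is performed,
    which is what we want when the certificate itself is under suspicion.
    """
    pem = ssl.get_server_certificate((host, port))
    return format_thumbprint(ssl.PEM_cert_to_DER_cert(pem))


# Example use (requires network access to the host, so it is left commented out):
# print(fetch_thumbprint("10.6.203.38"))
```

If the value returned for a host matches the "Untrusted thumbprint" in the log, the host is presenting the certificate that its peers do not trust, which would suggest the trusted thumbprint list held by the cluster is out of date rather than the certificate being wrong.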

 

a_p_
Leadership

Has something been changed in the environment, e.g. has vCenter Server been updated/patched?
Another user reported HA issues after applying the latest patch.

Anyway, from what you said about your environment, I assume that you have an active support contract, so I'd suggest that you consider opening a support case for this.

André

depping
Leadership

No, a non-pingable default gateway shouldn't cause this. I would indeed file a support request to get it fixed.
