Welcome to VMTN. :-)
I don't think the logs on the primary VCSA could be causing this. Still, make sure all VCSA partitions have sufficient free space by running this command: df -h
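If you prefer a scripted check over reading the df -h output by eye, something like the sketch below could flag nearly full partitions. It assumes a Python interpreter is available in the appliance shell; the function names and the 90% threshold are illustrative, and the mount points in the usage comment (/storage/log, /storage/db) are assumptions about the VCSA layout that you should replace with whatever df -h actually lists.

```python
import os


def usage_percent(path):
    """Return the used-space percentage for the filesystem holding `path`.

    Computed as used / (used + available to non-root), which is how
    df derives its Use% column.
    """
    st = os.statvfs(path)
    used = st.f_blocks - st.f_bfree
    denom = used + st.f_bavail
    return 100.0 * used / denom if denom else 0.0


def nearly_full(paths, threshold=90.0):
    """Return (path, percent) pairs for filesystems at or above `threshold`."""
    flagged = []
    for path in paths:
        pct = usage_percent(path)
        if pct >= threshold:
            flagged.append((path, round(pct, 1)))
    return flagged


# Example call (mount points are assumptions, adjust to your appliance):
# print(nearly_full(["/", "/storage/log", "/storage/db"]))
```

An empty list means none of the listed partitions reached the threshold.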
I need the following information before I can suggest anything further.
Q1: Are both VCSAs in the same subnet/VLAN or in different VLANs?
Can you give us the IP segments of the primary VCSA, the secondary VCSA, and the ESXi hosts?
What are the versions of ESXi and VCSA? Please provide the build numbers.
When the host is in the "not responding" state in VC, have you tried accessing the host client at https://<ESXi IP or FQDN> from a browser? How does the host respond at that time?
Did you check the vpxd log and the hostd log to see why the host goes into the unresponsive state?
The vpxd log is located on the vCenter appliance and the hostd log on the ESXi host; these logs can give us some hints about why the host is not responding to VC.
IP Segments are:
primary vcsa: 188.40.xx.xx
secondary vcsa: 46.225.xx.xx
esxi host: 185.81.xx.xx
version and build numbers:
primary vcsa: 6.00 build 5112529
secondary vcsa: 6.00, but I'm not sure about the exact build number; it is something lower than the primary vcsa's (255xxxxx)
esxi host: 6.00 build 3620759
By the way, we also tried updating the vcsa and the esxi host to other build numbers, but nothing changed.
In the current state I can access the esxi host directly with the vSphere Client software without any problem.
I have also attached the last lines of the logs you asked for (vpxd and hostd),
and there is another, possibly useful, screenshot of an error I get when I try to reconnect the host in the primary vcsa.
From the screenshot and the log messages in vpxd.log, it is clear that the host cannot send its heartbeat to VC. This could be due to the network, as the host and VC are in different segments. High network latency between them could be causing the issue.
2019-04-11T07:14:07.458+04:30 error vpxd[7F48EE448700] [Originator@6876 sub=Default] Reading additional bytes from the stream timed out : Read timeout after approximately 305000ms. Closing stream <SSL(<io_obj p:0x00007f48cc4b4700, h:171, <TCP '188.40.xx.xx:53482'>, <TCP '185.81.xx.xxx:443'>>)>
2019-04-11T08:03:55.918+04:30 error vpxd[7F490D08E700] [Originator@6876 sub=MoHost opID=0AA8F7F3-000079DA-fc][HostMo::Reconnect] Got unexpected exception: Server closed connection after 0 response bytes read; <SSL(<io_obj p:0x00007f48d876edf0, h:97, <TCP '188.40.xx.xx:34320'>, <TCP '185.81.xx.xxx:443'>>)> while reconnecting to host 185.81.xx.xxx
--> reason = "Server closed connection after 0 response bytes read; <SSL(<io_obj p:0x00007f48d876edf0, h:97, <TCP '188.40.xx.xx:34320'>, <TCP '185.81.xx.xxx:443'>>)>",
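To see how often and towards which host these read timeouts occur, the vpxd.log lines can be filtered with a short script. The sketch below is only an illustration: the regular expression is written against the single timeout line quoted above, so it assumes other timeout entries follow the same layout, and the function name is made up for this example.

```python
import re

# Pattern modelled on the one "Read timeout" line quoted in this thread.
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\S+) error vpxd\[[0-9A-Fa-f]+\] .*?"
    r"Read timeout after approximately (?P<ms>\d+)ms"
    r".*?<TCP '(?P<src>[^']+)'>, <TCP '(?P<dst>[^']+)'>"
)


def parse_read_timeout(line):
    """Return timestamp, timeout in seconds and both endpoints, or None."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return {
        "timestamp": match.group("ts"),
        "timeout_s": int(match.group("ms")) / 1000.0,
        "vcenter": match.group("src"),
        "host": match.group("dst"),
    }


# Example call (the log path is the one shown later in this thread):
# with open("/var/log/vmware/vpxd/vpxd.log") as log:
#     hits = [h for h in map(parse_read_timeout, log) if h]
```

Grouping the hits by host would show whether only the one problematic host times out or whether the other hosts in the same datacenter do too.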
Is there any specific reason why your vCenter and hosts are in different network segments? Try to bring ESXi and VC onto the same network if possible. The following article gives some hints on this issue and the resolution steps.
Thank you for your assessment. Actually, our primary vCenter and hosts are not only in different subnets but also physically in different countries: we have many hosts across Europe, the USA, and the Middle East, and we didn't want to dedicate a vCenter to every country. Our primary vCenter is in Germany and the problematic hosts are in Iran (Middle East), where internet and network quality is lower than in Europe. As you mentioned, the most likely cause is network latency, and that makes sense to me. But my boss asks: if this is the problem, why don't our other hosts in Iran, in the same subnet and the same datacenter and connected to the same vCenter in Europe, have this issue? We have 8 hosts in Iran. Only one of them has this issue; another one sometimes becomes not responding but comes back by itself; and the other 6 hosts have never had this problem. Do you think there is another underlying cause, or must we set up a dedicated vCenter in Iran for those hosts?
Is there a firewall between the ESXi and vCenter VLAN networks/IPs?
Can you check with your monitoring tools whether there is network packet loss between the affected host and vCenter?
If the affected ESXi host works with one vCenter but not with another, then one of the most probable causes is conflicting or missing firewall rules for the ports required for communication.
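If no monitoring tool is at hand, a rough substitute is to open a series of TCP connections to the host and record how long each one takes and how many fail. The sketch below is an assumption-laden illustration, not a replacement for real packet-loss monitoring: it measures TCP connect time only, and the function name, attempt count, and timeout are arbitrary choices.

```python
import socket
import time


def probe(host, port, attempts=5, timeout=5.0):
    """Try `attempts` TCP connects; return (failures, connect times in ms)."""
    failures = 0
    times_ms = []
    for _ in range(attempts):
        start = time.time()
        try:
            sock = socket.create_connection((host, port), timeout=timeout)
        except socket.error:
            # Covers refused connections, timeouts and name resolution errors.
            failures += 1
        else:
            times_ms.append((time.time() - start) * 1000.0)
            sock.close()
    return failures, times_ms


# Example call, run from the vCenter side (replace with the real host IP):
# failures, times_ms = probe("185.81.xx.xx", 443, attempts=20)
```

Running it against the affected host and against one of the healthy hosts in the same datacenter gives a like-for-like comparison of connect times and failure counts.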
Actually, I tested ports such as 902 and 443 via telnet from both sides and they are open; there is no firewall in between. There is also a weird new development: I installed a new vCenter appliance in the same subnet as my primary vcsa (both are in Germany, within the same subnet), while the problematic esxi is, as before, in another subnet (in Iran). When I try to add that esxi to the primary vcsa I get the error "Request timed out", but when I try to add the host to the new vcsa, which is in exactly the same subnet as the primary vcsa, it is added without any problem. So I strongly suspect that some logs or something similar on the primary vcsa prevent that esxi from being added to the inventory, or maybe there is a log problem on the host that prevents it from being added to the primary vcsa. But after searching through the logs and deleting some of them, it is still not resolved. Does anyone have an idea?
We have to check the logs again, then, to isolate the issue. Try to reproduce it with these steps:
1. Connect the ESXi host to the primary VCSA and capture the vpxd, vpxa, and hostd logs.
2. Connect the same host to the new VCSA and capture the same logs.
Then we can compare them and see what the difference is.
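For the comparison step, one possible approach is to reduce each captured log to the set of (level, subsystem) pairs that logged an error or warning, and then list what appears only in the primary VCSA capture. The sketch below assumes the line layout seen in the vpxd excerpts in this thread (timestamp, level, process, then [Originator@... sub=...]); the function names are invented for this example.

```python
import re
from collections import Counter

# Layout assumed from the vpxd lines quoted in this thread.
LINE_RE = re.compile(
    r"^\S+ (?P<level>error|warning) \S+ "
    r"\[Originator@\d+ sub=(?P<sub>[^\]\s]+)[^\]]*\]"
)


def summarize(lines):
    """Count error/warning lines per (level, subsystem)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[(match.group("level"), match.group("sub"))] += 1
    return counts


def only_in_first(first_lines, second_lines):
    """Return (level, subsystem) pairs seen in the first log but not the second."""
    first = summarize(first_lines)
    second = summarize(second_lines)
    return sorted(key for key in first if key not in second)


# Example call (file names are placeholders for the two captures):
# with open("vpxd-primary.log") as a, open("vpxd-new.log") as b:
#     print(only_in_first(a, b))
```

This is deliberately coarse: it only points at the subsystems worth reading in full, it does not compare the messages themselves.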
Thanks for spending your time on this problem.
I did what you said and collected the logs just after getting the "Request timed out" error on the primary vcsa and after adding the host without problems to the secondary vcsa.
As a reminder, these are my IPs right now:
primary vcsa: 188.40.xx.50
secondary vcsa : 188.40.xx.45 (same subnet as primary)
host : 185.81.xx.125
This time the communication failed due to an SSL issue.
Primary VCSA logs:
vcenter:/var/log/vmware/vpxd # grep "185.81.xx.125" vpxd.log
2019-04-25T07:50:27.028+04:30 error vpxd[7F9B2D072700] [Originator@6876 sub=HttpConnectionPool-000001] [ConnectComplete] Connect failed to <cs p:00007f9b3c2dbdd0, TCP:185.81.xx.125:443>; cnx: (null), error: N7Vmacore3Ssl18SSLVerifyExceptionE(SSL Exception: Verification parameters:
--> ExpectedPeerName: 185.81.xx.125
2019-04-25T07:53:42.501+04:30 warning vpxd[7F9B2ECAC700] [Originator@6876 sub=Default] Failed to connect socket; <io_obj p:0x00007f9b20e933e0, h:-1, <TCP '0.0.0.0:0'>, <TCP '185.81.xx.125:443'>>, e: system:125(Operation canceled)
Matching KB - VMware Knowledge Base
I saw this log and that KB before, and regenerated the SSL certificate on the host, but nothing was resolved. Also, if the problem is with the host's SSL, why can it be added to the secondary VCSA right now without any SSL problem? I think the issue is on the primary VCSA. Is there an SSL certificate cache or something like that which must be removed before the host can be added to the primary VCSA? What do you think?
Did you remove the host from the primary VCSA before adding it to the secondary VCSA, or did you add it to the secondary VCSA directly without removing it from the primary?
Try removing the host from the primary VCSA completely and adding it again; this may resolve the SSL issue, as an SSL refresh will occur.
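Before and after re-adding the host, it may help to note the thumbprint of the certificate the host actually presents on port 443, so you can compare it with the thumbprint each vCenter shows for that host. The sketch below is a hedged illustration: it assumes the thumbprint is the colon-separated SHA-1 of the certificate, it fetches the certificate without verifying it, and the function names are made up for this example.

```python
import hashlib
import ssl


def format_thumbprint(der_bytes):
    """Return the SHA-1 of a DER certificate as colon-separated upper-case hex."""
    digest = hashlib.sha1(der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def host_thumbprint(host, port=443):
    """Fetch the certificate served at host:port (unverified) and fingerprint it."""
    pem = ssl.get_server_certificate((host, port))
    return format_thumbprint(ssl.PEM_cert_to_DER_cert(pem))


# Example call (replace with the real host IP):
# print(host_thumbprint("185.81.xx.125"))
```

If the value differs from what the primary VCSA expects for that host, that would be consistent with the SSLVerifyException in the log above; if it matches, the cause is more likely elsewhere.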