ESXi host is not responding

vmisagh · ‎04-10-2019

Hi all

We have a strange problem.

In VCSA 6, many of our ESXi 6 hosts, all of which have valid IP addresses, become "not responding", and after a disconnect and reconnect we get the error "request timed out".

I tried most of the usual solutions: restarting the ESXi host, restarting the VCSA, regenerating the SSL certificate on the host, and checking ports 902 and 443 on both sides (host and vCenter). The ports are open and both sides can ping each other. I even disabled the firewall on the host completely, but nothing resolved it.

There are only two points of note. First, when we change the IP address of the ESXi host to another IP in the same subnet, it is added to vCenter successfully, but after a few days it ends up like the previous IP: not responding, same problem. Second, when we try to add the problematic host to a secondary VCSA, it is added without a problem.

My suspicion is that the main problem lies on our main VCSA. If some logs are causing the problem, which ones are safe to delete on the VCSA? Can anyone kindly help us with this issue? Thanks in advance.

pragg12 · ‎04-10-2019

Hi,

Welcome to VMTN. 🙂

I don't think logs on primary VCSA could be causing this. Still, make sure all VCSA partitions have sufficient free space by running this command: df -h
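
If the free-space check needs to run repeatedly rather than by hand, it can be scripted. The sketch below is only an illustration: the 10% threshold and the helper names are assumptions, not VMware recommendations, and it presumes a Python 3 interpreter is available on the machine where it runs.

```python
import shutil


def free_fraction(path):
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def partitions_below(paths, min_free=0.10):
    """Return the paths whose free fraction is under `min_free` (assumed threshold)."""
    return [p for p in paths if free_fraction(p) < min_free]


if __name__ == "__main__":
    # "/" is only an example; pass the mount points that `df -h` lists.
    for mount in partitions_below(["/"]):
        print(f"low on space: {mount}")
```

An empty result means every listed mount point is above the chosen threshold.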

I need the following information before I suggest anything further.

Q1: Are both VCSAs in the same subnet/VLAN or in different VLANs?

Consider marking this response as "Correct" or "Helpful" if you think my response helped you in any way.

vmisagh · ‎04-10-2019

Hi pragg12, thank you

This is a screenshot of the (df -h) output from the primary VCSA.

The answer to your Q1 is no: the two VCSAs are on completely different subnets (in different countries).

SureshKumarMuth · ‎04-10-2019

Can you give us the IP segments of the primary VCSA, the secondary VCSA and the ESXi hosts?

What are the versions of ESXi and VCSA? Please provide the build numbers.

When the host is in the not responding state in VC, have you tried accessing the host client at https://<ESXi IP or FQDN> from a browser? How does the host respond at that time?

Did you check the vpxd log and the hostd log to see why the host goes into the unresponsive state?

The vpxd log is located on the vCenter appliance and the hostd log is located on the ESXi host. These logs can give us some hint as to why the host is not responding to VC.

Regards,
Suresh
https://vconnectit.wordpress.com/

vmisagh · ‎04-10-2019

IP Segments are:

primary vcsa: 188.40.xx.xx

secondary vcsa: 46.225.xx.xx

esxi host: 185.81.xx.xx

-------------------------------------------

version and build numbers:

primary vcsa: 6.00 build 5112529

secondary vcsa: 6.00, but I am not sure about the exact build number; it is something lower than the primary vcsa (255xxxxx)

esxi host: 6.00 build 3620759

By the way, we also tried updating the vcsa and the esxi host to other build numbers, but nothing changed.

-------------------------------------------------

In the current state I can access the esxi host directly with the vSphere Client software without any problem.

---------------------------------------------------

I have also attached the last lines of the logs you asked for, vpxd and hostd.

There is also a possibly useful screenshot of the error I get when I try to reconnect the host in the primary vcsa.

SureshKumarMuth · ‎04-11-2019

From the screenshot and the log messages in vpxd.log, it is clear that the host cannot send its heartbeat to VC. This could be due to the network, as the host and VC are in different segments. High latency on the network could be causing the issue.

2019-04-11T07:14:07.458+04:30 error vpxd[7F48EE448700] [Originator@6876 sub=Default] Reading additional bytes from the stream timed out : Read timeout after approximately 305000ms. Closing stream <SSL(<io_obj p:0x00007f48cc4b4700, h:171, <TCP '188.40.xx.xx:53482'>, <TCP '185.81.xx.xxx:443'>>)>

2019-04-11T08:03:55.918+04:30 error vpxd[7F490D08E700] [Originator@6876 sub=MoHost opID=0AA8F7F3-000079DA-fc][HostMo::Reconnect] Got unexpected exception: Server closed connection after 0 response bytes read; <SSL(<io_obj p:0x00007f48d876edf0, h:97, <TCP '188.40.xx.xx:34320'>, <TCP '185.81.xx.xxx:443'>>)> while reconnecting to host 185.81.xx.xxx--> reason = "Server closed connection after 0 response bytes read; <SSL(<io_obj p:0x00007f48d876edf0, h:97, <TCP '188.40.xx.xx:34320'>, <TCP '185.81.xx.xxx:443'>>)>",

Is there any specific reason why your vCenter and hosts are in different network segments? Try to bring ESXi and VC onto the same network if possible. The following article gives some hints on this issue and resolution steps:

https://sflanders.net/2013/02/01/host-is-not-responding/
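
To see how often these read timeouts occur and which endpoints are involved, the relevant fields can be pulled out of vpxd.log with a small script. This is a rough sketch based only on the line format quoted above; the function name is hypothetical, and treating the first TCP endpoint as the vCenter side and the second as the host side is an assumption drawn from the IPs in this thread.

```python
import re

# Patterns modelled on the vpxd.log line quoted in this thread.
TIMEOUT_RE = re.compile(r"Read timeout after approximately (\d+)ms")
ENDPOINT_RE = re.compile(r"<TCP '([^']+):(\d+)'>")


def parse_timeout_line(line):
    """Return (timeout_seconds, first_endpoint, second_endpoint) or None.

    Each endpoint is an (address, port) tuple of strings.
    """
    timeout = TIMEOUT_RE.search(line)
    endpoints = ENDPOINT_RE.findall(line)
    if timeout is None or len(endpoints) < 2:
        return None
    return int(timeout.group(1)) / 1000, endpoints[0], endpoints[1]


def timeouts_in(path):
    """Yield parsed timeout entries from a log file at `path`."""
    with open(path, errors="replace") as handle:
        for line in handle:
            parsed = parse_timeout_line(line)
            if parsed is not None:
                yield parsed
```

Counting the entries per remote address would show whether the timeouts cluster on the one problematic host or are spread across all hosts in the same datacenter.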

Regards,
Suresh
https://vconnectit.wordpress.com/

vmisagh · ‎04-12-2019

Thank you for your assessment. Actually, our primary vCenter and the hosts are not only in different subnets but also physically in different countries. We have many hosts across Europe, the USA and the Middle East, and we didn't want to dedicate a vCenter to every country. Our primary vCenter is in Germany and the problematic hosts are in Iran (Middle East), where internet and network quality is lower than in Europe.

As you mentioned, the most likely cause is network latency, and that makes sense to me. But my boss asks: if this is the problem, why don't our other hosts in Iran, in the same subnet and the same datacenter and connected to the same vCenter in Europe, have this issue? We have 8 hosts in Iran. Only one of them has this issue; another sometimes becomes not responding but comes back by itself; the other 6 hosts have never had this problem so far.

Do you think there must be another underlying cause? Or must we set up a dedicated vCenter in Iran for those hosts?

pragg12 · ‎04-16-2019

Is there a firewall between the ESXi and vCenter VLAN networks/IPs?

Can you check through your monitoring tools whether there is network packet drop between the affected host and vCenter?

If the affected ESXi host works with one vCenter but not with another, then one of the most probable causes would be conflicting or missing firewall rules for the ports required for communication.
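
Where no monitoring tool is at hand, repeated TCP connection attempts can give a rough picture of loss and latency on a given port. The sketch below is an assumption-laden stand-in for a real monitoring tool: it measures TCP connect time only, not packet drop directly, and the sample count, pause and timeout are arbitrary values chosen for illustration.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port, timeout=5.0):
    """Return the TCP connect time in milliseconds, or None if the attempt fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:  # covers refused connections and timeouts
        return None


def summarize(samples):
    """Return (failure_rate, median_ms) for a list of probe results."""
    ok = [s for s in samples if s is not None]
    failure_rate = 1.0 - len(ok) / len(samples) if samples else 0.0
    median_ms = statistics.median(ok) if ok else None
    return failure_rate, median_ms


def probe(host, port, count=20, pause=1.0):
    """Probe `host:port` `count` times, pausing between attempts."""
    samples = []
    for _ in range(count):
        samples.append(tcp_connect_ms(host, port))
        time.sleep(pause)
    return summarize(samples)
```

Running `probe` from the vCenter side against ports 443 and 902 of both a healthy host and the affected host would allow a like-for-like comparison, since both sit in the same datacenter.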

Consider marking this response as "Correct" or "Helpful" if you think my response helped you in any way.

vmisagh · ‎04-17-2019

Actually, I tested ports 902 and 443 via telnet from both sides and they are open; there is no firewall in between.

There is also a new, weird update to my problem. I installed a new vCenter appliance in the same subnet as my primary vcsa (both are in Germany, within the same subnet), while the problematic esxi is, as before, in another subnet (in Iran). When I try to add that esxi to the primary vcsa I get the error "Request timed out", but when I try to add the host to the new vcsa, which is in exactly the same subnet as the primary vcsa, it is added without any problem.

So I strongly suspect that some logs or something similar on the primary vcsa are preventing that esxi from being added to the inventory, or maybe there is a log problem on the host that prevents it from being added to the primary vcsa. But after searching through the logs and deleting some of them, it is still not resolved. Does anyone have an idea?

SureshKumarMuth · ‎04-24-2019

We have to check the logs again, then, to isolate the issue. Try to reproduce it with these steps:

1. Connect the ESXi host to the primary VCSA and capture the vpxd, vpxa and hostd logs.

2. Connect the same host to the new VCSA and capture the same logs.

We can then compare them and check what the difference is.
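
Comparing two captures by eye is hard because timestamps, thread IDs and pointers differ on every line. One possible approach, sketched below, is to strip those volatile fields and list the error and warning lines that appear in only one capture. The noise patterns are assumptions based on the vpxd.log lines quoted in this thread and would likely need adjusting for vpxa and hostd logs.

```python
import re

# Volatile fields that differ between runs and hide the real difference.
NOISE = [
    (re.compile(r"^\d{4}-\d{2}-\d{2}T\S+\s+"), ""),   # leading timestamp
    (re.compile(r"\[[0-9A-Fa-f]{8,}\]"), "[TID]"),     # thread id
    (re.compile(r"0x[0-9a-fA-F]+"), "0xPTR"),          # pointer values
    (re.compile(r"opID=[^\]\s]+"), "opID=ID"),         # operation id
    (re.compile(r":\d{4,5}'"), ":PORT'"),              # ephemeral source ports
]


def normalize(line):
    """Replace run-specific values in a log line with fixed placeholders."""
    for pattern, replacement in NOISE:
        line = pattern.sub(replacement, line)
    return line.strip()


def only_in_first(first_lines, second_lines, levels=("error", "warning")):
    """Return normalized lines at the given levels seen only in the first log."""
    def wanted(lines):
        normalized = (normalize(line) for line in lines)
        return {n for n in normalized if n.split(" ", 1)[0] in levels}
    return sorted(wanted(first_lines) - wanted(second_lines))
```

Feeding it the primary capture as `first_lines` and the new VCSA capture as `second_lines` leaves only the messages unique to the failing attempt.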

Regards,
Suresh
https://vconnectit.wordpress.com/

vmisagh · ‎04-24-2019

Thanks for spending your time on this problem.

I did what you said and collected those logs just after getting the "request timed out" error on the primary vcsa and adding the host without a problem on the secondary vcsa.

As a reminder, these are my IPs right now:

primary vcsa: 188.40.xx.50

secondary vcsa : 188.40.xx.45 (same subnet as primary)

host : 185.81.xx.125

vmisagh · ‎04-24-2019

The maximum number of file attachments allowed is 5, so these are the remaining logs.

SureshKumarMuth · ‎04-25-2019

This time the communication failed due to an SSL issue.

Primary VCSA logs:

vcenter:/var/log/vmware/vpxd # grep "185.81.xx.125" vpxd.log

2019-04-25T07:50:27.028+04:30 error vpxd[7F9B2D072700] [Originator@6876 sub=HttpConnectionPool-000001] [ConnectComplete] Connect failed to <cs p:00007f9b3c2dbdd0, TCP:185.81.xx.125:443>; cnx: (null), error: N7Vmacore3Ssl18SSLVerifyExceptionE(SSL Exception: Verification parameters:

--> ExpectedPeerName: 185.81.xx.125

--> "185.81.xx.125"

2019-04-25T07:53:42.501+04:30 warning vpxd[7F9B2ECAC700] [Originator@6876 sub=Default] Failed to connect socket; <io_obj p:0x00007f9b20e933e0, h:-1, <TCP '0.0.0.0:0'>, <TCP '185.81.xx.125:443'>>, e: system:125(Operation canceled)

Matching KB - VMware Knowledge Base
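
Because the exception is an SSL verification failure, it may help to check which certificate the host actually presents to each vCenter. The sketch below fetches the certificate without verifying it and prints a SHA-1 thumbprint in colon-separated form. It is an illustrative check, not a VMware procedure: the function names are hypothetical, and running it from each VCSA assumes a Python 3 interpreter is available there.

```python
import hashlib
import ssl


def sha1_thumbprint(der_bytes):
    """Format the SHA-1 digest of a DER certificate as AA:BB:CC:..."""
    digest = hashlib.sha1(der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def host_thumbprint(host, port=443):
    """Fetch the certificate `host` presents (unverified) and fingerprint it."""
    pem = ssl.get_server_certificate((host, port))
    return sha1_thumbprint(ssl.PEM_cert_to_DER_cert(pem))


if __name__ == "__main__":
    # Replace the placeholder with the ESXi host's address before running.
    print(host_thumbprint("esxi-host.example"))
```

If the value seen from the primary VCSA differs from the one seen from the secondary VCSA, or from the thumbprint shown on the host itself, that would point at something in the network path or a stale stored thumbprint rather than at the host's certificate.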

Regards,
Suresh
https://vconnectit.wordpress.com/

vmisagh · ‎04-27-2019

I saw this log and that KB before, and I regenerated the SSL certificate on the host, but it resolved nothing. Also, if the problem is with the host's SSL, why can it be added to the secondary VCSA right now without any SSL problem? I think the issue is on the primary VCSA. Is there any SSL certificate cache or something similar that must be removed before that host can be added to the primary VCSA? What do you think?

SureshKumarMuth · ‎04-30-2019

Did you remove the host from the primary VCSA before adding it to the secondary VCSA, or did you add it to the secondary VCSA directly without removing it from the primary?

Try removing the host from the primary VCSA completely and adding it again. It may resolve the SSL issue, as an SSL refresh will occur.

Regards,
Suresh
https://vconnectit.wordpress.com/

vmisagh · ‎04-30-2019

I take it you meant it the other way around in your last reply, because the host cannot be added to the primary VCSA. But yes, I removed it from the secondary VCSA, and when I try to add it to the primary VCSA I get the "request timed out" error, and you know the rest...

SureshKumarMuth · ‎05-01-2019

If we check the logs we may see the same entries as before, such as connection timeouts due to latency. Still, it is strange that the host connects to the secondary VCSA successfully every time without any issues. We could narrow it down by comparing the connectivity between the ESXi host and each VCSA in depth, but that is tedious and time-consuming; the end-to-end connectivity needs to be analyzed. Have you raised an SR with VMware to check the setup? They can analyze it and give some hints.

Regards,
Suresh
https://vconnectit.wordpress.com/

vmisagh · ‎05-01-2019

What is an SR? And what should I check for?

SureshKumarMuth · ‎05-01-2019

A Service Request (SR) is basically a ticket with VMware support to analyze the problem and provide a solution, if you have a support contract with VMware. Check with your VMware TAM or sales representative for more details.

Regards,
Suresh
https://vconnectit.wordpress.com/
