krjhitch
Contributor

Issues with ESX hosts not responding WHEN SELECTED in VI Client

From the VI client, if I select one of my hosts, it changes to 'not responding' and becomes italic, along with all its running VMs.

When I select another host, the 'not responding' one starts working again. If I select it again, it immediately goes back to not responding. This is a problem if I want to enter maintenance mode or rescan the SAN.

However, the ESX host is still working from all other perspectives. The VMs are fine, I can SSH into the host, and it doesn't log out of the SAN fabric. It's just annoying, and it causes the virtual machine admins to freak out when 20 or so VMs become italic.

Anyone have an idea where to start? I had this issue in the past, and I think I solved it via reinstalling ESX, but it's happening again now, and I suspect there's an easier solution, like unjoining/rejoining the DRS cluster, or reinstalling the vpxagent somehow, rejoining the farm, etc. The only thing we've done was restart VirtualCenter (to solve another issue) and it didn't seem to help with the 'intermittent not responding' issues.

Any ideas?

15 Replies
weinstein5
Immortal

Very odd. I would start by looking at the logs on your VC server, because that is an indication VC has lost communication with your ESX host. Check the VC log and the Windows Event Viewer, and when the host is in that not responding state, see what sort of network connectivity you are getting from VC to the ESX host (pings and the like).
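
The connectivity check suggested above can be scripted so it is easy to repeat while a host is showing 'not responding'. What follows is a minimal Python sketch, not something from the original reply: it only tests whether a TCP connection can be opened from the machine it runs on. The function names `tcp_port_open` and `probe` are made up for illustration, and ports 443 and 902 are an assumption about which ports VirtualCenter uses to reach an ESX host, so adjust them for your environment.

```python
import socket
import sys


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def probe(host, ports, timeout=3.0):
    """Check several ports on one host and return {port: reachable}."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}


if __name__ == "__main__":
    # Hosts are taken from the command line; 443 and 902 are assumed ports.
    for host in sys.argv[1:]:
        print(host, probe(host, [443, 902]))
```

Running it from the VC server against a host address while the host is italic, and again once it has recovered, would show whether plain TCP reachability changes at all. A ping or TCP probe says nothing about UDP traffic or about SSL problems, so a clean result does not rule those out.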

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful
mike_laspina
Champion

Hi,

I have some ideas of where to start, but it does sound like it is possibly on the VC side.

Is your VC dual-homed? Binding order becomes important.

Have you reset the SSL certificate on the ESX host side? Client initiated fails, host initiated is OK. Just delete the cert on the host and it will be recreated automatically on a mgmt-vmware service restart.

Have you connected directly from the VC server using the VI client? It may be something outside the VC code.

Have you tried to manage it using the vimsh shell? GUI problem maybe.

http://blog.laspina.ca/ vExpert 2009
krjhitch
Contributor

VirtualCenter is not dual-homed. But the vpxd logs on VirtualCenter do seem to indicate that it's a cert issue with the host.

-- BEGIN task-internal-60678 -- host-9655 -- VpxdInvtHostSyncHostLRO.Synchronize

Synchronizing host: hqgtnesx23 (10.23.1.83)

Failed to send request. Retrying. Error: class Vmacore::Ssl::SSLException(SSL Exception: error:00000001:lib(0):func(0):reason(1))

SSLVerifyCertAgainstSystemStore: The remote host certificate has these problems:

  • A certificate in the host's chain is based on an untrusted root.

SSLVerifyCertAgainstSystemStore: Certificate verification is disabled, so connection will proceed despite the error

Retrieved host update to 8233

When you say 'cert on host', do you mean the ESX host (in this case, 23)? How do I delete the cert? I don't know where it's located. I could try deleting it and then restarting the mgmt-vmware service as you suggested.

mike_laspina
Champion

That error does not usually stop the communication. However, it is easy to recreate the cert as follows.

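# Back up the current certificate and key
cp /etc/vmware/ssl/rui.* /root

# Remove them so that they get regenerated
rm /etc/vmware/ssl/rui.*

# Restart the management service, which recreates the cert
service mgmt-vmware restart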

krjhitch
Contributor

A few updates for those interested.

I've removed and re-added hosts in VirtualCenter, and it hasn't fixed anything. I've rebuilt one of the ESX hosts, and it hasn't fixed anything. I did, however, notice that the issue I'm having currently isn't 'when selected' as I put in the title. It just seemed that way. It happens pretty much whenever it wants, on as many hosts as it wants, but only ever for a few seconds. Comparing the times it happens against hostd.log doesn't show much, but the VirtualCenter log does.

Queuing 10.23.1.91:790 (host-13521)

502c3559-cecb-ec0d-81ef-24445bacea25:790 (host-13521)

Need inventory sync for: host-13521

Queuing sync LRO for: /vpx/host/#13521/

-- BEGIN task-internal-172795 -- host-13521 -- VpxdInvtHostSyncHostLRO.Synchronize

Synchronizing host: hqgtnesx31 (10.23.1.91)

Failed to send request. Retrying. Error: class Vmacore::Ssl::SSLException(SSL Exception: error:00000001:lib(0):func(0):reason(1))

SSLVerifyCertAgainstSystemStore: The remote host certificate has these problems:

  • A certificate in the host's chain is based on an untrusted root.

SSLVerifyCertAgainstSystemStore: Certificate verification is disabled, so connection will proceed despite the error

Got vmacore exception: SSL Exception: error:00000001:lib(0):func(0):reason(1)

Backtrace:

backtrace[00] eip 0x0122b526 ?GenerateCoreDump@System@Vmacore@@YAXXZ

backtrace[01] eip 0x01176caa ?CreateBacktrace@SystemFactoryImpl@System@Vmacore@@UAEXAAV?$Ref@VBacktrace@System@Vmacore@@@3@@Z

backtrace[02] eip 0x0115192e ??0Throwable@Vmacore@@QAE@ABV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@@Z

backtrace[03] eip 0x01141b38 ??4Throwable@Vmacore@@QAEAAV01@ABV01@@Z

backtrace[04] eip 0x0117e254 ?InitServer@Authd@Vmacore@@YAXABV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@0@Z

backtrace[05] eip 0x011d7e1f ?CreateSSLContext@Ssl@Vmacore@@YAXPAVKeyStore@Crypto@2@W4SupportedVersion@12@AAV?$Ref@VSSLContext@Ssl@Vmacore@@@2@@Z

backtrace[06] eip 0x011d7f3b ?CreateSSLContext@Ssl@Vmacore@@YAXPAVKeyStore@Crypto@2@W4SupportedVersion@12@AAV?$Ref@VSSLContext@Ssl@Vmacore@@@2@@Z

backtrace[07] eip 0x011d8364 ?CreateSSLContext@Ssl@Vmacore@@YAXPAVKeyStore@Crypto@2@W4SupportedVersion@12@AAV?$Ref@VSSLContext@Ssl@Vmacore@@@2@@Z

backtrace[08] eip 0x011d55c6 ?CreateSSLContext@Ssl@Vmacore@@YAXPAVKeyStore@Crypto@2@W4SupportedVersion@12@AAV?$Ref@VSSLContext@Ssl@Vmacore@@@2@@Z

backtrace[09] eip 0x0118ce06 ?CreateHttpConnectionPool@Http@Vmacore@@YAXHAAV?$Ref@VHttpConnectionPool@Http@Vmacore@@@2@@Z

backtrace[10] eip 0x0115bbff ?CreateElementNode@Xml@Vmacore@@YAXABV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@0AAV?$Ref@VElementNode@Xml@Vmacore@@@2@@Z

backtrace[11] eip 0x013a618a ?CreateSoapSerializationVisitor@Vmomi@@YAXPAVVersion@1@PAVWriter@Vmacore@@AAV?$Ref@VSerializationVisitor@Vmomi@@@4@PBD_N@Z

backtrace[12] eip 0x013a6332 ?CreateSoapSerializationVisitor@Vmomi@@YAXPAVVersion@1@PAVWriter@Vmacore@@AAV?$Ref@VSerializationVisitor@Vmomi@@@4@PBD_N@Z

backtrace[13] eip 0x013a81ff ?CreateSoapStubAdapter@Vmomi@@YAXABV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@HPAVSSLContext@Ssl@Vmacore@@00PAVLogger@Service@6@AAV?$Ref@VStubAdapter@Vmomi@@@6@@Z

backtrace[14] eip 0x013a9051 ?CreateSoapStubAdapter@Vmomi@@YAXABV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@HPAVSSLContext@Ssl@Vmacore@@00PAVLogger@Service@6@AAV?$Ref@VStubAdapter@Vmomi@@@6@@Z

backtrace[15] eip 0x00510d11 (no symbol)

backtrace[16] eip 0x0051169f (no symbol)

backtrace[17] eip 0x00511af1 (no symbol)

backtrace[18] eip 0x013779ce ?_Invoke_Task@StubImpl@Vmomi@@UAEXPAVManagedMethod@2@AAV?$RefVector@VAny@Vmomi@@@Vmacore@@AAV?$Ref@VAny@Vmomi@@@5@@Z

backtrace[19] eip 0x008630a2 (no symbol)

backtrace[20] eip 0x0053b0e5 (no symbol)

backtrace[21] eip 0x0053bad8 (no symbol)

backtrace[22] eip 0x006d5df6 (no symbol)

backtrace[23] eip 0x006dad6e (no symbol)

backtrace[24] eip 0x006e0e04 (no symbol)

backtrace[25] eip 0x00a1c06a (no symbol)

backtrace[26] eip 0x77e64829 GetModuleHandleA

Host sync on hqgtnesx31 in progress, waiting..

pool:0, 9 out of 10 connections are available

Recording (285c1c8:0) UPDATE VPX_HOST SET BOOT_TIME = ? , MAINTENANCE_MODE = ? , POWER_STATE = ? WHERE ID = ?

host connection state changed to

I haven't tried the cert rebuilding as was suggested before because I've built new ESX hosts, and they are having the same issue. The certs aren't really important to me; that's why they're disabled. Is it possible that it's an issue with the cert on the VirtualCenter server? Should I try recreating that? Would it affect anything?
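
To see whether the SSL exceptions cluster on particular hosts, the vpxd log can be tallied with a short script. The sketch below is illustrative only and relies on assumptions: it keys on the two message texts quoted in this thread ("Synchronizing host:" and "Failed to send request ... SSLException") and attributes each failure to the most recently named host. A real vpxd log interleaves several threads, so that attribution is a rough heuristic. The function name `count_ssl_errors` is hypothetical.

```python
import re
import sys
from collections import Counter

# Matches e.g. "Synchronizing host: hqgtnesx31 (10.23.1.91)"
SYNC_RE = re.compile(r"Synchronizing host: (\S+) \(([\d.]+)\)")


def count_ssl_errors(lines):
    """Count SSL send failures per host.

    Each failure is attributed to the host named in the most recent
    'Synchronizing host:' line (a heuristic, see the note above).
    """
    counts = Counter()
    current_host = None
    for line in lines:
        match = SYNC_RE.search(line)
        if match:
            current_host = match.group(1)
            continue
        if "Failed to send request" in line and "SSLException" in line:
            counts[current_host or "unknown"] += 1
    return counts


if __name__ == "__main__":
    # Log file paths are passed on the command line.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for host, total in count_ssl_errors(handle).most_common():
                print(f"{host}: {total}")
```

If every host shows a similar count, that points toward something on the VirtualCenter side rather than one bad host certificate.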

krjhitch
Contributor

Well, it didn't seem to work. I recreated the cert on the ESX hosts, as suggested, but it didn't affect anything. I also added the cert from VirtualCenter to the trusted root authorities, and that didn't affect anything either. However, I keep seeing the same messages in hostd.log on the ESX servers at the exact time that the issues occur, but I don't know what they mean:

Stream ended abruptly: No data left to read

Exception while processing request: EndofStream

Any ideas?

krjhitch
Contributor

I added the certs from each host to the trusted stores on virtualcenter. Even though they were the generic mock certs and there shouldn't have been a problem, it fixed it.

silicoon
Contributor

Hi

I am having the same problem as you.

I have added the ESX certificates to the trusted root store on the VC server, but the hosts are still dropping off.

Are you still getting the "2008-06-03 23:28:10.792 'BaseLibs' 208 warning SSLVerifyCertAgainstSystemStore: Certificate verification is disabled, so connection will proceed despite the error" error after adding the certificate? And what about the stream ended error?

Any help would be appreciated

krjhitch
Contributor

Both errors disappeared when I added the 'mock' certs from the ESX hosts to the Trusted Root Authorities cert store on the VirtualCenter server.

That fixed the issue for me. The other thing I looked into was the 'ProxySVC' errors that I was getting on the ESX hosts. As it turns out, if you set the VirtualCenter logging level to 'trivial', you capture ProxySVC events in the VirtualCenter logs. That may help.

silicoon
Contributor

So did you just use https to each ESX server to add the certificates, or did you get them straight from /etc/vmware/ssl/ on the ESX servers and import them that way?

I added them using https, but I'm still getting the errors.

Thanks for the help

admin
Immortal

silicoon, did you manage to fix this?

Someone else is having the same problem on this thread: http://communities.vmware.com/message/978424

They've opened an SR with VMware Support.

silicoon
Contributor

This issue is still happening. Even though I have re-added the certs to the VC server and restarted everything, the SSL error and the stream ended error are still occurring. It seems the certs are not being picked up for some reason.

Thanks for the other post! I have an SR open with VMware at the moment, though it's not helping much.

krjhitch
Contributor

I added the certs from /etc/vmware/ssl.

I took the rui.crt from each ESX server and put them on the VirtualCenter server.

Then I opened the MMC, added Certificates as a snap-in, and selected 'this computer' when it asked me which cert store I wanted to use. I then selected 'personal certificates', right-clicked on it, and imported each of the certificates.

At that point, when I double-clicked the rui.crt on the hard drive of the VirtualCenter server, I no longer got a trusted cert chain error. Instead, it was trusted.

Then the logs with the SSL exception stopped appearing.
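
One way to double-check that the certificate imported on the VirtualCenter server is the same one a host is currently using (hosts that were rebuilt, or had their cert recreated, will have a new one) is to compare thumbprints. The sketch below is an illustration under assumptions, not part of the steps above: it assumes each rui.crt is a plain PEM file and computes the SHA-1 hash of its DER encoding, which is the value the Windows certificate dialog shows as "Thumbprint". The helper names `pem_sha1_thumbprint` and `normalize` are hypothetical.

```python
import hashlib
import ssl
import sys


def pem_sha1_thumbprint(pem_text):
    """Return the SHA-1 thumbprint (lowercase hex) of a PEM-encoded certificate."""
    der_bytes = ssl.PEM_cert_to_DER_cert(pem_text.strip())
    return hashlib.sha1(der_bytes).hexdigest()


def normalize(thumbprint):
    """Drop spaces, colons and case so thumbprints copied from a GUI compare cleanly."""
    return "".join(ch for ch in thumbprint.lower() if ch in "0123456789abcdef")


if __name__ == "__main__":
    # Pass one or more copied rui.crt files on the command line.
    for path in sys.argv[1:]:
        with open(path) as handle:
            print(path, pem_sha1_thumbprint(handle.read()))
```

If the value printed for a host's rui.crt does not match the thumbprint of the entry in the certificate store (after `normalize`), the store holds an older or different cert for that host.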

cdillon
Contributor

I just now had a similar problem, and I thought it was related to me changing SSL certs on the VC server. I could connect my hosts and work with them for a little while but they would eventually change status to "Not Responding". Looking at vpxa.log on the host showed many repeated attempts at sending heartbeat packets to the VC server. It turns out that VC server had been removed from the Windows Firewall by the Server 2003 Security Configuration Wizard I ran, and I guess I hadn't noticed that the VC service was disallowed incoming network connections in the SCW. I re-ran SCW and made sure that the VC service was allowed to accept incoming network connections and the hosts are now staying connected to VC.

erickmiller
Enthusiast

Funny how old problems still occur. :) I just had the same thing happen after joining a vCenter to a domain. We had changed a couple of other things and had thought these changes were the cause of hosts "not responding", but it ended up being the Windows Firewall being turned on for domain networks, but no others. So, it was actually the joining of the machine to the domain that turned on the firewall indirectly, causing our ESX host issue. :)

Eric

Eric K. Miller, Genesis Hosting Solutions, LLC http://www.genesishosting.com/ - Lease part of our ESX cluster!