From the VI client, if I select one of my hosts, it changes to 'not responding' and becomes italic, along with all of its running VMs.
When I select another host, the 'not responding' one starts working again. If I select it again, it immediately goes not responding. This is a problem when I want to put the host into maintenance mode or rescan the SAN.
However, the ESX host is still working from every other perspective. The VMs are fine, I can SSH into the host, and it doesn't log out of the SAN fabric. It's just annoying, and it causes the virtual machine admins to freak out when 20 or so VMs turn italic.
Anyone have an idea where to start? I had this issue in the past, and I think I solved it by reinstalling ESX, but it's happening again now, and I suspect there's an easier solution, like unjoining/rejoining the DRS cluster, reinstalling the vpxa agent somehow, rejoining the farm, etc. The only thing we've done is restart VirtualCenter (to solve another issue), and it didn't help with the intermittent 'not responding' issue.
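One lighter-weight thing worth trying before a full rebuild is restarting the VC agent on the host itself. A minimal sketch, assuming ESX 3.x, where the agent runs as the vmware-vpxa service:

# restart the VirtualCenter agent on the host, then the management service
service vmware-vpxa restart
service mgmt-vmware restart

Restarting mgmt-vmware can briefly disconnect the host from VC, so do it when a short blip is acceptable.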
Very odd. I would start by looking at the logs on your VC server, because that is an indication VC has lost communication with your ESX host. Check the VC log and the Windows event viewer, and while the host is in that not-responding state, see what sort of network connectivity you are getting from VC to the ESX host: pings and the like.
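To make that concrete, from the VC server while the host is showing not responding, something like this (the hostname here is a placeholder):

ping esx23.example.com
telnet esx23.example.com 443
telnet esx23.example.com 902

443 and TCP 902 are the usual management ports. Keep in mind the host-to-VC heartbeats ride UDP 902, which telnet can't exercise, so a clean telnet doesn't fully rule out a firewall in that direction.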
I have some ideas of where to start, but it does sound like it is possibly on the VC side.
Is your VC dual-homed? Binding order becomes important.
Have you reset the SSL certificate on the ESX host side? With a bad cert, client-initiated connections fail while host-initiated ones are OK. Just delete the cert at the host and it will get recreated automatically on a mgmt-vmware service restart.
Have you connected directly to the host with the VI client, bypassing VC? That would tell you whether the problem is outside the VC code.
Have you tried to manage it using the vimsh shell? It could be a GUI problem.
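For example, from the service console (a sketch, assuming ESX 3.x, where vimsh is normally driven through the vmware-vim-cmd wrapper):

# ask hostd directly, bypassing VC and the GUI entirely
vmware-vim-cmd hostsvc/hostsummary
vmware-vim-cmd vmsvc/getallvms

If those respond normally while VC shows the host as not responding, the hostd side is probably healthy and the problem is in the VC-to-host channel.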
VirtualCenter is not dual-homed. But the vpxd logs on VirtualCenter do seem to indicate that it's a cert issue with the host:
A certificate in the host's chain is based on an untrusted root.
When we say the 'cert on host', do we mean the ESX host (in this case, 23)? How do you delete the cert? I don't know where it's located. I could try deleting it, then restarting the mgmt-vmware service like you suggested.
That error is not usually an issue that will stop communication. However, it is easy to recreate the cert as follows:
# move the old cert and key out of the way (backed up to /root);
# hostd generates a fresh rui.crt/rui.key on restart once they are gone
mv /etc/vmware/ssl/rui.* /root
service mgmt-vmware restart
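To confirm a fresh pair was actually minted after the restart (plain openssl, nothing ESX-specific):

openssl x509 -in /etc/vmware/ssl/rui.crt -noout -subject -dates
ls -l /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.key

The file timestamps and validity dates should reflect the restart, not the original install.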
A few updates for those interested.
I've removed/re-added hosts from VirtualCenter, and it hasn't fixed anything. I've rebuilt one of the ESX hosts, and it hasn't fixed anything. I did, however, notice that the issue I'm having currently isn't 'when selected' as I put in the title; it just seemed that way. It happens pretty much whenever it wants, on as many hosts as it wants, but only ever for a few seconds. Looking at the times it happens against hostd.log doesn't show much, but VirtualCenter does:
A certificate in the host's chain is based on an untrusted root.
backtrace eip 0x0122b526 ?GenerateCoreDump@System@Vmacore@@YAXXZ
backtrace eip 0x01176caa ?CreateBacktrace@SystemFactoryImpl@System@Vmacore@@UAEXAAV?$Ref@VBacktrace@System@Vmacore@@@3@@Z
backtrace eip 0x0115192e ??0Throwable@Vmacore@@QAE@ABV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@@Z
backtrace eip 0x01141b38 ??4Throwable@Vmacore@@QAEAAV01@ABV01@@Z
backtrace eip 0x0117e254 ?InitServer@Authd@Vmacore@@YAXABV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@0@Z
backtrace eip 0x011d7e1f ?CreateSSLContext@Ssl@Vmacore@@YAXPAVKeyStore@Crypto@2@W4SupportedVersion@12@AAV?$Ref@VSSLContext@Ssl@Vmacore@@@2@@Z
backtrace eip 0x011d7f3b ?CreateSSLContext@Ssl@Vmacore@@YAXPAVKeyStore@Crypto@2@W4SupportedVersion@12@AAV?$Ref@VSSLContext@Ssl@Vmacore@@@2@@Z
backtrace eip 0x011d8364 ?CreateSSLContext@Ssl@Vmacore@@YAXPAVKeyStore@Crypto@2@W4SupportedVersion@12@AAV?$Ref@VSSLContext@Ssl@Vmacore@@@2@@Z
backtrace eip 0x011d55c6 ?CreateSSLContext@Ssl@Vmacore@@YAXPAVKeyStore@Crypto@2@W4SupportedVersion@12@AAV?$Ref@VSSLContext@Ssl@Vmacore@@@2@@Z
backtrace eip 0x0118ce06 ?CreateHttpConnectionPool@Http@Vmacore@@YAXHAAV?$Ref@VHttpConnectionPool@Http@Vmacore@@@2@@Z
backtrace eip 0x0115bbff ?CreateElementNode@Xml@Vmacore@@YAXABV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@0AAV?$Ref@VElementNode@Xml@Vmacore@@@2@@Z
backtrace eip 0x013a618a ?CreateSoapSerializationVisitor@Vmomi@@YAXPAVVersion@1@PAVWriter@Vmacore@@AAV?$Ref@VSerializationVisitor@Vmomi@@@4@PBD_N@Z
backtrace eip 0x013a6332 ?CreateSoapSerializationVisitor@Vmomi@@YAXPAVVersion@1@PAVWriter@Vmacore@@AAV?$Ref@VSerializationVisitor@Vmomi@@@4@PBD_N@Z
backtrace eip 0x013a81ff ?CreateSoapStubAdapter@Vmomi@@YAXABV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@HPAVSSLContext@Ssl@Vmacore@@00PAVLogger@Service@6@AAV?$Ref@VStubAdapter@Vmomi@@@6@@Z
backtrace eip 0x013a9051 ?CreateSoapStubAdapter@Vmomi@@YAXABV?$basic_string@DU?$char_traits@D@std@@V?$allocator@D@2@@std@@HPAVSSLContext@Ssl@Vmacore@@00PAVLogger@Service@6@AAV?$Ref@VStubAdapter@Vmomi@@@6@@Z
backtrace eip 0x00510d11 (no symbol)
backtrace eip 0x0051169f (no symbol)
backtrace eip 0x00511af1 (no symbol)
backtrace eip 0x013779ce ?_Invoke_Task@StubImpl@Vmomi@@UAEXPAVManagedMethod@2@AAV?$RefVector@VAny@Vmomi@@@Vmacore@@AAV?$Ref@VAny@Vmomi@@@5@@Z
backtrace eip 0x008630a2 (no symbol)
backtrace eip 0x0053b0e5 (no symbol)
backtrace eip 0x0053bad8 (no symbol)
backtrace eip 0x006d5df6 (no symbol)
backtrace eip 0x006dad6e (no symbol)
backtrace eip 0x006e0e04 (no symbol)
backtrace eip 0x00a1c06a (no symbol)
backtrace eip 0x77e64829 GetModuleHandleA
I haven't tried the cert rebuilding that was suggested before, because I've built new ESX hosts and they are having the same issue. The certs aren't really important to me; that's why verification is disabled. Is it possible that it's an issue with the cert on the VirtualCenter server? Should I try recreating that? Would it affect anything?
Well, it didn't seem to work. I recreated the cert on the ESX hosts, as suggested, but it didn't change anything. I also added the cert from VirtualCenter to the trusted root authorities, and that didn't change anything either. However, I keep seeing the same errors in hostd.log on the ESX servers at the exact time the issues occur, but I don't know what they mean.
I am having the same problem as you
I have added the ESX certificates to the trusted root store on the VC server, but I'm still getting the hosts dropping off.
Are you still getting the "
2008-06-03 23:28:10.792 'BaseLibs' 208 warning SSLVerifyCertAgainstSystemStore: Certificate verification is disabled, so connection will proceed despite the error" error after you have added the certificate? and also the stream ended error?
Any help would be appreciated
Both errors disappeared when I added the 'mock' (self-signed) certs from the ESX hosts to the Trusted Root Certification Authorities cert store on the VirtualCenter server.
That fixed the issue for me. The other thing I looked into was the 'ProxySVC' errors I was getting on the ESX hosts; as it turns out, if you turn the VirtualCenter logging level up to 'trivial', you capture ProxySVC events in the VirtualCenter logs. That may help.
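If you'd rather not click through the GUI, the level can also be set in vpxd.cfg on the VC server, followed by a restart of the VMware VirtualCenter Server service. Treat this snippet as an assumption from memory of VC 2.x, not gospel:

<config>
  <log>
    <!-- 'trivia' is the most verbose vpxd level, if memory serves -->
    <level>trivia</level>
  </log>
</config>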
So did you just browse to https:// on each ESX server to grab the certificates, or did you take them straight from /etc/vmware/ssl/ on the ESX servers and import them that way?
I added them using https, but I'm still getting the errors.
Thanks for the help
This issue is still happening. Even though I have re-added the certs to the VC server and restarted everything, the SSL error and the 'stream ended' error are still occurring; it seems the certs are not being picked up for some reason.
Thanks for the other post! I have an SR open at the moment with VMware... it's not helping too much, though.
I added the certs from /etc/vmware/ssl.
I took the rui.crt from each ESX server and put them on the VirtualCenter server.
Then I opened up the MMC, added Certificates as a snap-in, and selected 'this computer' when it asked which cert store I wanted to manage. I then selected 'Personal' certificates, right-clicked on it, and imported each of the certificates.
At that point, when I double-clicked on a rui.crt on the hard drive of the VirtualCenter server, I didn't get a trusted-cert-chain error; instead it was trusted.
Then the logs with the SSL exception stopped appearing.
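A command-line equivalent of the MMC dance, assuming certutil is available on the VC server (it ships with Windows Server 2003), targeting the machine-wide Trusted Root store rather than Personal:

rem import a host's rui.crt into the machine Trusted Root store
certutil -addstore Root rui.crt
rem then check that the chain now verifies
certutil -verify rui.crt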
I just now had a similar problem, and I thought it was related to me changing SSL certs on the VC server. I could connect my hosts and work with them for a little while, but they would eventually change status to "Not Responding". Looking at vpxa.log on the host showed many repeated attempts at sending heartbeat packets to the VC server. It turns out the VC server's exception had been removed from the Windows Firewall by the Server 2003 Security Configuration Wizard I ran; I hadn't noticed that the VC service had been disallowed incoming network connections in the SCW. I re-ran SCW, made sure the VC service was allowed to accept incoming network connections, and the hosts are now staying connected to VC.
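For anyone hitting the same wall, the quick checks from a Server 2003 command prompt look something like this (UDP 902 is the default host-to-VC heartbeat port; adjust if yours differs):

rem show the current firewall state and open ports
netsh firewall show state
rem re-open the heartbeat port for incoming traffic
netsh firewall add portopening UDP 902 vpxa-heartbeat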
Funny how old problems still occur. I just had the same thing happen after joining a vCenter to a domain. We had changed a couple of other things and thought those changes were the cause of hosts "not responding", but it ended up being the Windows Firewall being turned on for domain networks and no others. So it was actually joining the machine to the domain that indirectly turned on the firewall, causing our ESX host issue.