ESX 4.1U1 host unresponsive

jcrowland · ‎05-11-2011

We've seen this only once, on a server that has been up for about 60 days (since we installed U1). Although we could ping the host and its Windows 2008 R2 guests, we could not administer the host through vCenter (4.1.0 Build 345043) or the vSphere Client (4.1.0 Build 345043), nor could we RDP into any of the guest servers. We've checked the server hardware and the LAN for anomalies but did not turn up anything. (We have other ESX 4.1U1 hosts sharing the same LAN and storage infrastructure, so far without this problem.)

We SSH'ed into the host console and gracefully restarted the entire host, which resolved the issue. Within minutes of the host and guest servers becoming generally unresponsive, we recorded the following in /var/log/messages:

May 11 00:15:12 localhost cimslp: --- SLP Agent got error code 1 while doing enumInstances. Trying again (attempt 1, sleeping 15) ---
May 11 00:17:29 localhost sfcb-CIMXML-Processor[1025]: Timeout (or other socket error) waiting for response from provider
May 11 00:25:27 localhost sfcb-CIMXML-Processor[2218]: Timeout (or other socket error) waiting for response from provider
May 11 00:25:28 localhost cimslp: --- SLP Agent got error code 1 while doing enumInstances. Trying again (attempt 2, sleeping 60) ---
May 11 00:27:29 localhost sfcb-CIMXML-Processor[1025]: Timeout (or other socket error) waiting for response from provider
May 11 00:35:27 localhost sfcb-CIMXML-Processor[2218]: Timeout (or other socket error) waiting for response from provider
May 11 00:36:28 localhost sfcb-CIMXML-Processor[4002]: Timeout (or other socket error) waiting for response from provider
May 11 00:36:29 localhost cimslp: --- SLP Agent got error code 1 while doing enumInstances. Trying again (attempt 3, sleeping 405) ---
May 11 00:43:16 localhost cimslp: --- HTTP-Daemon no longer active. Deregistering service with slp
May 11 00:43:16 localhost cimslp: Callback Code -3
May 11 00:43:16 localhost cimslp: --- Error deregistering service with slp (0) ... it will now timeout
May 11 00:43:16 localhost cimslp: Error retrieving SLP info. Will try again next interval.
May 11 00:46:28 localhost sfcb-CIMXML-Processor[4002]: Timeout (or other socket error) waiting for response from provider
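
For anyone comparing their own logs against the excerpt above, here is a minimal Python sketch that tallies the sfcb provider timeouts per process ID and lists the cimslp retry attempts. It assumes the log lines follow the same syslog-style layout as the ones quoted here; the function name `summarize` and the regular expression are illustrative, not part of any VMware tool, and the script is meant to be run against a copy of the log on a workstation.

```python
import re
from collections import Counter

# Matches lines shaped like: "May 11 00:17:29 localhost name[pid]: message"
# (the [pid] part is optional, as in the cimslp lines).
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"(?P<proc>[\w.-]+)(?:\[(?P<pid>\d+)\])?:\s+(?P<msg>.*)$"
)


def summarize(lines):
    """Count sfcb timeouts per PID and collect cimslp retry attempts."""
    timeouts = Counter()
    retries = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not fit the assumed layout
        if m["proc"] == "sfcb-CIMXML-Processor" and "Timeout" in m["msg"]:
            timeouts[m["pid"]] += 1
        elif m["proc"] == "cimslp":
            attempt = re.search(r"attempt (\d+), sleeping (\d+)", m["msg"])
            if attempt:
                retries.append((m["ts"], int(attempt[1]), int(attempt[2])))
    return timeouts, retries


# A few lines copied from the excerpt above.
sample = [
    "May 11 00:15:12 localhost cimslp: --- SLP Agent got error code 1 while doing enumInstances. Trying again (attempt 1, sleeping 15) ---",
    "May 11 00:17:29 localhost sfcb-CIMXML-Processor[1025]: Timeout (or other socket error) waiting for response from provider",
    "May 11 00:25:27 localhost sfcb-CIMXML-Processor[2218]: Timeout (or other socket error) waiting for response from provider",
    "May 11 00:25:28 localhost cimslp: --- SLP Agent got error code 1 while doing enumInstances. Trying again (attempt 2, sleeping 60) ---",
    "May 11 00:27:29 localhost sfcb-CIMXML-Processor[1025]: Timeout (or other socket error) waiting for response from provider",
]

timeouts, retries = summarize(sample)
for pid, count in sorted(timeouts.items()):
    print(f"sfcb-CIMXML-Processor[{pid}]: {count} timeout(s)")
for ts, attempt, sleep_s in retries:
    print(f"{ts}: cimslp retry attempt {attempt}, sleeping {sleep_s}s")
```

A steady count of timeouts spread across several PIDs, as in the excerpt, would suggest the CIM provider itself stopped answering rather than one request handler misbehaving, though that is an inference and not something the log states directly.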

Upon researching this, I found some references to an issue where SFCB consumes all the TCP/IP ports in 4.1U1, but only for ESXi, not ESX. Has anyone encountered something similar to this?

Thanks for any help.

--John

ICT-Freak · ‎05-26-2011

Hi John,

I have the exact same issue on one of my Dell R710 vSphere hosts. In my case, though, the Service Console ran out of memory and started killing random processes. We managed to get the host back into vCenter, but it responded very slowly, and the only option left was to reset the host to get it back to normal. Did you contact VMware support? If so, what is your case number, so I can reference it in my case once I finish collecting the log files?
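
To catch the Service Console running low on memory before processes start getting killed, something like the following Python sketch could be used. It assumes the Service Console exposes a Linux-style /proc/meminfo; the helper names, the 10% threshold, and the sample numbers are all hypothetical and were not taken from the affected host.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def low_memory(info, threshold_pct=10.0):
    """Return (free_pct, is_low), counting buffers and cache as reclaimable."""
    total = info["MemTotal"]
    free = info["MemFree"] + info.get("Buffers", 0) + info.get("Cached", 0)
    free_pct = 100.0 * free / total
    return free_pct, free_pct < threshold_pct


# Illustrative values only; on a live system, read the text from /proc/meminfo.
sample = """MemTotal:       800000 kB
MemFree:         20000 kB
Buffers:         10000 kB
Cached:          30000 kB
"""

pct, low = low_memory(parse_meminfo(sample))
print(f"{pct:.1f}% of Service Console memory available, low={low}")
```

Logging this figure at regular intervals would show whether memory drains gradually, which would point to a leak, or drops suddenly.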

Which hardware do you use and did you find a solution yet?

afokkema

jcrowland · ‎05-26-2011

We did not open a ticket because we thought we had found a reference to this issue in ESX 4.1 (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1035564). We applied the fix before realizing the article covers only ESXi; on further examination, it does not appear relevant at this point.

If this recurs, we will open a ticket with VMware. It does sound similar to your issue; did you find similar entries in your logs? Are you opening a case?

We have not seen this issue before, and we have identical hardware and host configurations in place, using IBM 3650M3 servers connected to a DS3500 via fibre.

--John

ICT-Freak · ‎05-27-2011

I have seen this issue twice on the same host in less than 30 days, so I am going to open a support case. The host is isolated and is not running in production for now. If I find a solution or have news about the support case, I will post it here.

-- afokkema

vmNIU · ‎06-16-2011

Hello,

We have been getting a lot of these errors lately, and the ESX hosts seem to have problems communicating with each other. Additionally, some VMs have trouble communicating with other VMs. This started happening after we upgraded the Cisco 5K and 1000V components in our environment.

We use ESX 4.1 with QLogic 8152s. We just updated the QLogics to the current firmware, SNIA API, and drivers, and the errors seem to have gone away.

Hope this helps.

Thanks

-Fred

MauroBonder · ‎06-16-2011

Have you tried restarting the management agents? See the KB for how to restart them.

Please don't forget to award points for "helpful" and/or "correct" answers.

Mauro Bonder - Moderator


vmNIU · ‎06-16-2011

Yes, we tried restarting the agents, rebooting, etc. Restarting the agents did not work, and a reboot worked only temporarily. Only the driver/firmware updates seemed to fix it.

Thanks

vmNIU · ‎06-16-2011

I don't think you understand. I am not seeking help; I was letting others know what we did to fix the issue described in the first post, just in case it could also help them.

thanks

Flight1234 · ‎08-18-2011

Hi jcrowland,

You mentioned that you found some references to an issue where SFCB consumes all the TCP/IP ports in 4.1U1 for ESXi. Can you share them with us? Thanks.

--Yew
