VMware Cloud Community
Gilles29
Contributor

Unstable Management Agent on ESX Server 3i, 3.5.0, 123629

Hello

We run ESXi on 10 HP ProLiant DL360 G5 servers, and every day we have to restart the Management Agent through the iLO console port.

Any ideas?

Thanks

Gilles

1 Solution

Accepted Solutions
marcelo_soares
Champion

You are obviously running out of memory on your host. I suggest you disable Pegasus on your server by going into Advanced Settings and setting the Misc.CIMEnabled option to 0. Reboot the server and monitor it.

Marcelo Soares

VMWare Certified Professional 310

Technical Support Engineer

Linux Server Senior Administrator

16 Replies
DSTAVERT
Immortal

Have you considered updating to a newer build of ESXi 3.5? VMware fixes problems in later builds. The current build number is 176892. I would also check what firmware updates are available from HP.

-- David -- VMware Communities Moderator
bulletprooffool
Champion

HP firmware updates are definitely worth checking. At the same time, it would be worth running a network trace overnight to see whether something happens each night that kills the network.

Are backups or something silly running on the same segment and flooding it, leaving the SC unable to communicate for a prolonged time?

Also, check that your NIC speeds are being negotiated correctly with your switches.

Finally, if all else fails... blame the network guys ;-)

One day I will virtualise myself . . .
Gilles29
Contributor

Hello

Thank you DSTAVERT and bulletprooffool for your quick response.

It's not the session that fails; it is the Management Agent process that randomly stops on the server.

Can you confirm that the network and interface configuration could affect the Management Agent process on the VMware server?

Thanks

marcelo_soares
Champion

One question: how do you determine that the agents are stopped? A VC disconnection? Or do you look directly in the COS?

Marcelo Soares

VMWare Certified Professional 310

Technical Support Engineer

Linux Server Senior Administrator

Gilles29
Contributor

On the VMware servers we have remote iLO access and can restart the Management Agent when it abnormally freezes or stops.

In the log files, during the incident, I find this PAM error: 'Hostd: PAM unable to dlopen(/lib/security/pam_stack.so)'. This server should be a 64-bit installation.

/var/log # uname -a

VMkernel xxxxxxx 3.5.0 #1 SMP Release build-153875 Mar 13 2009 17:29:00 i686 unknown

Aug 20 21:20:30 Hostd: Event 51961 : User cacti@127.0.0.1 logged in

Aug 20 21:20:33 root: sfcbd-watchdog:sfcbd has exited

Aug 20 21:20:45 root: sfcbd-watchdog:stopping sfcbd

Aug 20 21:20:45 root: sfcbd Stopping sfcbd

Aug 20 21:20:45 root: sfcbd-watchdog:starting sfcbd

Aug 20 21:20:45 root: sfcbd Starting sfcbd

Aug 20 21:20:45 vmkernel: 70:10:42:17.657 cpu7:22273680)WARNING: LinuxFileDesc: 3789: Unrecoverable exec failure: Failure during exec while original state already lost

Aug 20 21:20:45 Hostd: Event 51962 : User cacti logged out

Aug 20 21:20:49 Hostd: Event 51963 : User cacti logged out

Aug 20 21:20:58 vmkernel: 70:10:42:30.516 cpu6:18106003)WARNING: LinuxFileDesc: 3789: Unrecoverable exec failure: Failure during exec while original state already lost

Aug 20 21:21:06 Hostd: PAM unable to dlopen(/lib/security/pam_stack.so)

Aug 20 21:21:06 Hostd: PAM

Aug 20 21:21:06 Hostd: PAM adding faulty module: /lib/security/pam_stack.so

Aug 20 21:21:06 Hostd: PAM unable to dlopen(/lib/security/pam_deny.so)

Aug 20 21:21:06 Hostd: PAM

Aug 20 21:21:06 Hostd: PAM adding faulty module: /lib/security/pam_deny.so

Aug 20 21:21:07 Hostd: Event 51964 : Failed login attempt for cacti@127.0.0.1

Aug 20 21:21:07 Hostd: Rejected password for user cacti from 127.0.0.1
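
The excerpt above shows a recognizable sequence: sfcbd-watchdog reports that sfcbd has exited, the vmkernel logs exec failures, hostd fails to load its PAM modules, and the next login is rejected. A minimal Python sketch for counting those markers in a copy of the log might look like the following. It assumes the log has been copied off the host into a text file; the function name, the marker names, and the sample data are illustrative only and are not part of any VMware tool.

```python
import re
from collections import Counter

# Patterns taken from the log lines quoted in this thread.
# The marker names on the left are hypothetical labels chosen for this sketch.
PATTERNS = {
    "sfcbd_exited": re.compile(r"sfcbd-watchdog:\s*sfcbd has exited"),
    "exec_failure": re.compile(r"Unrecoverable exec failure"),
    "pam_dlopen_failed": re.compile(r"PAM unable to dlopen\("),
    "login_failed": re.compile(r"Failed login attempt"),
}


def count_incident_markers(lines):
    """Count how often each marker appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = """\
Aug 20 21:20:33 root: sfcbd-watchdog:sfcbd has exited
Aug 20 21:20:45 vmkernel: 70:10:42:17.657 cpu7:22273680)WARNING: LinuxFileDesc: 3789: Unrecoverable exec failure: Failure during exec while original state already lost
Aug 20 21:21:06 Hostd: PAM unable to dlopen(/lib/security/pam_stack.so)
Aug 20 21:21:06 Hostd: PAM unable to dlopen(/lib/security/pam_deny.so)
Aug 20 21:21:07 Hostd: Event 51964 : Failed login attempt for cacti@127.0.0.1
""".splitlines()

if __name__ == "__main__":
    for name, n in sorted(count_incident_markers(SAMPLE).items()):
        print(f"{name}: {n}")
```

To run it against a real log, pass an open file object (for example `open("messages.txt")`, a hypothetical file name) instead of `SAMPLE`. Comparing the counts across hosts and days may help show whether the agent freezes always follow an sfcbd exit.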

s1xth
VMware Employee

This is very interesting. I have seen this occur randomly on hosts running U4, but I could never find anything in the logs that would explain it. Interesting to see someone else with a similar issue.

http://www.virtualizationimpact.com http://www.handsonvirtualization.com Twitter: @jfranconi
DSTAVERT
Immortal

I would still look at the fact that you have an old version. Try turning off your monitoring tool. Perhaps the cacti interaction causes issues.

-- David -- VMware Communities Moderator
Gilles29
Contributor

This issue occurred before I created the cacti user and started checking the VMware server with the 'cacti vmware template' (CPU, memory, VMFS and network).

Is there a conflict between the 32-bit and 64-bit PAM libraries?

marcelo_soares
Champion

You are obviously running out of memory on your host. I suggest you disable Pegasus on your server by going into Advanced Settings and setting the Misc.CIMEnabled option to 0. Reboot the server and monitor it.

Marcelo Soares

VMWare Certified Professional 310

Technical Support Engineer

Linux Server Senior Administrator

Gilles29
Contributor

Perfect; I have disabled the Misc.CIMEnabled parameter and am now monitoring the 3 VMware servers. I will report the result of this change in 2 or 3 days.

Thank you for your help.

Gilles

Gilles29
Contributor

Anyway, it looks OK after 4 days of monitoring; no alarms like 'Login failed due to a bad username or password'.

Thank you for your help.

Gilles29
Contributor

After 10 days of monitoring, the Management Agent froze yesterday on 5 ESX servers, on the same day but not at the same hour!

s1xth
VMware Employee

You are running ESXi, correct? You said ESX in the last post. If you are running ESXi, do you have SSH enabled on the hosts that are losing connectivity (by editing inetd.conf)? If you have enabled it, disable it. This caused my hosts' management network to lose connectivity (the agent would drop out). I disabled the service again, as it was originally, and everything has been perfect since. Having that service enabled can cause the kernel to fail when it doesn't have enough memory to support it under high load.


http://www.virtualizationimpact.com http://www.handsonvirtualization.com Twitter: @jfranconi
Gilles29
Contributor

If I disable the SSH service, do I permanently lose command-line access to the ESXi host?

It is working fine on 10 other ESXi servers (same OS, CentOS 5, same software and hardware) with SSH enabled; I am not really convinced it will fix the fault.

I am now testing this solution on one of my ESXi servers to see.

Gilles

Gilles29
Contributor

I followed your recommendations: I set Misc.CIMEnabled to 0 and disabled SSH. But these actions did not fix the instability of the Management Agent.

And now I have permanently lost command-line access to the ESXi host.

I will try to give you more logs to push the investigation further.

Gilles

Elwappo
Contributor

I'm noticing that this issue of slow or failed connections to VMware ESXi servers seems to be all over the forums, in slightly different forms. Just to add my two cents: we were having very similar issues, where we suddenly could not connect to the VMware ESXi server with the Infrastructure Client. It would report a failed login due to an invalid password. We also noticed that the web service would eventually go down on the ESXi host. Rebooting fixed the issue, but it slowly creeps back. I read that someone advised turning off the unsupported console as a possible cause. My view is that this is probably not the issue, as we did not enable it on ours until after we had the problem. And the only reason we did that was to have a way to access the logs from the host, as we do not have Virtual Center. Without Virtual Center, if you lose your Infrastructure Client access you're rather dead in the water.

I'm starting to think the issue has something to do with CIM and the way it collects data from the host. Disabling it is good troubleshooting advice, but you're taking away a very important part of monitoring a system for hardware stability. So if turning it off works, it would be nice to hear why. There is a patch for full ESX that refers to a problem in an older version of Pegasus that was running as part of the CIM collection process. Could this issue exist in ESXi as well?

Do any VMware tech engineers answer posts on these forums? It seems to me the only people I hear from are the community.
