Hello
We run ESXi on 10 HP ProLiant DL360 G5 servers, and every day we have to restart the Management Agent through the iLO console port.
Any ideas?
Thanks
Gilles
Have you considered updating to a newer build of ESXi 3.5? VMware makes changes to fix things. The current build number is 176892. I would also check to see what firmware updates are available from HP.
HP firmware updates are definitely worth checking. At the same time, it would be worth running a trace on your network overnight to see whether something occurs each night that kills the network.
Are backups or something silly running on the same segment and flooding it, making the SC unable to communicate for a prolonged time?
Lastly, check that your NIC speeds are being negotiated correctly with your switches.
Finally, if all else fails... blame the network guys.
Hello
Thank you DSTAVERT and bulletprooffool for your quick response.
It's not the session that fails but the Management Agent process, which randomly stops on the server.
Can you confirm that the network and interface configuration could affect the Management Agent process on the VMware server?
Thanks
One question: how do you figure out that the agents are stopped? By a VC disconnection? Or do you look directly into the COS?
Marcelo Soares
VMWare Certified Professional 310
Technical Support Engineer
Linux Server Senior Administrator
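One way to notice a stopped agent without waiting for a VC disconnection is to probe the host's management port from another machine. The following is only a minimal sketch: it assumes the management agent is reached over HTTPS on TCP port 443, and the host names in the usage comment are hypothetical placeholders. A successful TCP connect only shows that something is listening; it does not prove that logins will be accepted.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


def report(hosts, port=443):
    """Return a {host: bool} map of which hosts answered on the given port."""
    return {host: port_open(host, port) for host in hosts}


# Usage (hypothetical host names, replace with your own ESXi hosts):
#   for host, ok in report(["esxi01.example.com", "esxi02.example.com"]).items():
#       print(host, "reachable" if ok else "NOT reachable")
```

Running this from cron every few minutes and logging the result would at least give a timestamp for when each host stopped answering.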
On the VMware servers we have remote iLO access and can restart the Management Agent when it abnormally freezes or stops.
In the log files, during the incident, I find this PAM error: 'Hostd: PAM unable to dlopen(/lib/security/pam_stack.so)'. This server should be installed as 64-bit.
/var/log # uname -a
VMkernel xxxxxxx 3.5.0 #1 SMP Release build-153875 Mar 13 2009 17:29:00 i686 unknown
Aug 20 21:20:30 Hostd: Event 51961 : User cacti@127.0.0.1 logged in
Aug 20 21:20:33 root: sfcbd-watchdog:sfcbd has exited
Aug 20 21:20:45 root: sfcbd-watchdog:stopping sfcbd
Aug 20 21:20:45 root: sfcbd Stopping sfcbd
Aug 20 21:20:45 root: sfcbd-watchdog:starting sfcbd
Aug 20 21:20:45 root: sfcbd Starting sfcbd
Aug 20 21:20:45 vmkernel: 70:10:42:17.657 cpu7:22273680)WARNING: LinuxFileDesc: 3789: Unrecoverable exec failure: Failure during exec while original state already lost
Aug 20 21:20:45 Hostd: Event 51962 : User cacti logged out
Aug 20 21:20:49 Hostd: Event 51963 : User cacti logged out
Aug 20 21:20:58 vmkernel: 70:10:42:30.516 cpu6:18106003)WARNING: LinuxFileDesc: 3789: Unrecoverable exec failure: Failure during exec while original state already lost
Aug 20 21:21:06 Hostd: PAM unable to dlopen(/lib/security/pam_stack.so)
Aug 20 21:21:06 Hostd: PAM adding faulty module: /lib/security/pam_stack.so
Aug 20 21:21:06 Hostd: PAM unable to dlopen(/lib/security/pam_deny.so)
Aug 20 21:21:06 Hostd: PAM adding faulty module: /lib/security/pam_deny.so
Aug 20 21:21:07 Hostd: Event 51964 : Failed login attempt for cacti@127.0.0.1
Aug 20 21:21:07 Hostd: Rejected password for user cacti from 127.0.0.1
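To compare incidents across several hosts, it can help to pull just the failure messages out of the log. The sketch below is an illustration only: the patterns are copied from the messages in the excerpt above, the names `PATTERNS`, `SAMPLE` and `scan` are made up for this example, and nothing here is a VMware tool.

```python
import re
from collections import Counter

# Patterns taken from the messages quoted in the log excerpt.
PATTERNS = {
    "sfcbd_exit": re.compile(r"sfcbd-watchdog:sfcbd has exited"),
    "exec_failure": re.compile(r"Unrecoverable exec failure"),
    "pam_dlopen": re.compile(r"PAM unable to dlopen\((?P<module>[^)]+)\)"),
    "login_failed": re.compile(r"Failed login attempt for (?P<user>\S+)"),
}

# Syslog-style prefix such as "Aug 20 21:20:33".
TIMESTAMP = re.compile(r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) ")

# A few lines copied from the excerpt, used as sample input.
SAMPLE = [
    "Aug 20 21:20:33 root: sfcbd-watchdog:sfcbd has exited",
    "Aug 20 21:21:06 Hostd: PAM unable to dlopen(/lib/security/pam_stack.so)",
    "Aug 20 21:21:06 Hostd: PAM unable to dlopen(/lib/security/pam_deny.so)",
    "Aug 20 21:21:07 Hostd: Event 51964 : Failed login attempt for cacti@127.0.0.1",
]


def scan(lines):
    """Return (counts, events) for the known failure messages."""
    counts = Counter()
    events = []
    for line in lines:
        stamp = TIMESTAMP.match(line)
        if not stamp:
            continue  # skip lines without a syslog timestamp
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                events.append((stamp.group("ts"), name))
    return counts, events


if __name__ == "__main__":
    counts, events = scan(SAMPLE)
    for ts, name in events:
        print(ts, name)
    print(dict(counts))
```

Feeding it the full messages file from each host would show whether the sfcbd exit always precedes the PAM errors, as it does in this excerpt.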
This is very interesting. I have had this occur randomly on U4 hosts, but I could never find anything in the logs that would explain the reason. Interesting to see someone else with a slightly similar issue.
I would still look at the fact that you have an old version. Try turning off your monitoring tool. Perhaps the cacti interaction causes issues.
This issue occurred before I created the cacti user and started checking the VMware servers with the 'cacti vmware template' (CPU, memory, VMFS and network).
Is there a conflict between the 32-bit and 64-bit PAM libraries?
You are obviously running out of memory on your host. I suggest you disable Pegasus on your server: go to Advanced Settings and set the Misc.CIMEnabled option to 0. Reboot the server and monitor it.
Marcelo Soares
VMWare Certified Professional 310
Technical Support Engineer
Linux Server Senior Administrator
Perfect; I have disabled the Misc.CIMEnabled parameter and am now monitoring the 3 VMware servers. I will give you the result of this change in 2 or 3 days.
Thank you for your help.
Gilles
Anyway, it looks OK after 4 days of monitoring; no alarm like 'Login failed due to a bad username or password'.
Thank you for your help.
After 10 days of monitoring, the Management Agent froze yesterday on 5 ESX servers, on the same day but not at the same hour.
You are running ESXi, correct? You said ESX in the last post. If you are running ESXi, do you have SSH enabled on the hosts that are losing connectivity (by editing inetd.conf)? If you have enabled it, disable it. This caused my hosts' management network to lose connectivity (the agent would drop out). I disabled the service, as it originally was, and everything has been perfect since. Having that service enabled can cause the kernel to fail when it doesn't have enough memory to support it under high load.
If I disable the SSH service, do I permanently lose command-line access to the ESXi host?
It is working fine on 10 other ESXi servers (same OS, CentOS 5, same software and hardware) with SSH enabled; I am not really convinced it will fix the fault.
I am now testing this solution on 1 of my ESXi servers to see.
Gilles
I followed your recommendations: I set Misc.CIMEnabled to 0 and disabled SSH. But these actions did not fix the instability of the Management Agent.
And now I have permanently lost command-line access to the ESXi host.
I will try to give you more logs to push the investigation further.
Gilles
I'm noticing that this issue of slow or failing connections to VMware ESXi servers seems to be all over the forums, in slightly different forms. Just to add my two cents: we were having very similar issues where we suddenly could not connect to the VMware ESXi server with the Infrastructure Client. It would give us a "login failed due to invalid password" error. We would also notice that the web service would eventually go down on the ESXi host. Rebooting fixed the issue, but it slowly creeps back. I read that someone advised turning off the unsupported console as a possible cause. I think this is probably not the cause, as we did not enable it on ours until after we had the problem. And the only reason we did that is so we would have a way to access the logs from the host, as we do not have Virtual Center. Without Virtual Center, if you lose your Infrastructure Client access you're rather dead in the water.
I'm starting to think the issue has something to do with CIM and the way it's collecting data from the host. Disabling it is good troubleshooting advice, but you're taking away a very important part of monitoring a system for hardware stability. So if turning it off works, it would be nice to hear why. There is a patch for full ESX that refers to a problem in an older version of Pegasus that was running as part of the CIM collection process. Could it be that this issue exists in ESXi as well?
Do any VMware tech engineers answer posts on these forums? It seems to me the only people I hear from are the community.