This worked for us. In my /var/log/messages I was only seeing entries that correspond to the "Daily System Identification Task" in HP SIM. Fortunately ours only runs weekly, so there are not too many entries yet on our ESX hosts. I just disabled the scan task until HP/VMware comes up with a long-term solution. Good catch on this one, MalcO.
I don't really use SIM (it's managed by someone else), so I turned the port off on the firewall so I don't get any more defunct processes. So far it's been about 24 hours with no additional processes:
esxcfg-firewall -d CIMHttpsServer && service pegasus restart
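If you want to confirm that the defunct processes really have stopped piling up, something like the following might help. It is only a rough sketch: the function name count_defunct is made up for illustration, and it assumes the ps on the service console accepts `-eo stat,comm` and marks zombies with a leading Z in the STAT column.

```shell
# Hypothetical helper (not from VMware or HP): counts zombie cimservera
# entries in `ps -eo stat,comm` style output read from stdin.
count_defunct() {
    # Column 1 is the process state (zombies start with Z),
    # column 2 is the command name.
    awk '$1 ~ /^Z/ && $2 ~ /cimservera/ { n++ } END { print n + 0 }'
}

# Usage on a host (run it a few times over an hour and compare):
#   ps -eo stat,comm | count_defunct
```

If the number stays flat after closing the port, the workaround is holding.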
I've been following this thread for quite some time and have been working with our HP rep. He sent the following email with the fix. I have not yet been able to test it in our environment. For those using SIM, hopefully this helps.
Just got word from an HP team that there is a patch kit for HP SIM that is supposed to include a fix to prevent queries to the ESX cimservera process to solve this issue. The cause seems to be a change in the cim providers of the latest ESX 3.5 U2 release and this patch changes the way the Insight Agents and SIM communicate with it. I tested this on my lab server and it took about 2 minutes but did re-start SIM.
Link to patch: http://h18013.www1.hp.com/products/servers/management/hpsim/dl_windows52sp1.html?jumpid=reg_R1002_US...
Will this apply if you are already running HP SIM 5.2 SP2? I ask because when I run the patch kit it says "HP SIM 5.2 not detected". I know 5.2 is installed because the version info says:
|Build Version C.05.02.02.00|
|Build date:|2008-07-04 10:23|
I'm working with HP on this now, hopefully I'll have an answer soon.
Just for added info (and resolution/workaround clarification), Troy steered me to this thread from another thread. A big help. I am on Insight Manager 5.1 (Insight Addition) with every plug-in HP has ever released for the darn thing. Upgrading to 5.2 will be a pain, so I went in and disabled the scheduled jobs. The zombie/defunct processes have stopped. The only scheduled jobs I have remaining are Initial Data Collection and Initial Hardware Status polling. Not a big issue, since I am using SNMP and I will still receive hardware failure notifications via SNMP traps. I'll leave it this way until either VMware releases a patch or I can get the Insight Manager update scheduled (after someone else confirms this does indeed resolve the issue).
I also have a case open with VMware and will push for a patch.
Thanks for the help guys.
Here's some more fuel to add to the fire regarding the Insight Manager Agents on ESX:
http://kb.vmware.com/kb/1004771
To us... it's now clear, bye bye IM agents
FYI, I've now updated to 5.2 and also ran the hotfix patch kit successfully and... still no resolution. I forced a Daily Inventory scan and then checked my /var/log/messages, and the same error appeared. So any download you see on the HP SIM website will NOT fix the issue. We will have to wait for a new HP or VMware update to solve the Daily Inventory Scan task issue; I've disabled mine to stop the errors from being logged.
I have exactly the same problem.
I've tried several things: I've upgraded the HP Management Agents to the latest version, 8.1.0, and I've patched my ESX servers to the latest build (from 110268 to 113339).
Unfortunately, no luck with this. I also have several entries in my /var/log/messages that indicate there is an authentication problem somehow.
I don't understand why, but the account it says failed to authenticate is our HP Insight Manager service account. So I'm guessing it's the user that lets the ESX server talk to the SNMP service.
Sep 30 05:11:08 vm-srv01 cimservera[15644]: user "our-service-account" failed to authenticate
Sep 30 05:11:10 vm-srv01 cimservera[15645]: user "our-service-account" failed to authenticate
Sep 30 05:11:13 vm-srv01 cimservera[15646]: user "our-service-account" failed to authenticate
Sep 30 05:11:14 vm-srv01 cimservera[15647]: user "our-service-account" failed to authenticate
Sep 30 05:11:16 vm-srv01 cimservera[15648]: user "our-service-account" failed to authenticate
Sep 30 05:11:18 vm-srv01 cimservera[15649]: user "our-service-account" failed to authenticate
Sep 30 05:11:20 vm-srv01 cimservera[15650]: user "our-service-account" failed to authenticate
Sep 30 05:11:22 vm-srv01 cimservera[15651]: user "our-service-account" failed to authenticate
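For anyone wanting to see how many of these entries they have and which account they are logged against, a quick tally along these lines may be useful. This is just a sketch: the function name count_auth_failures is invented here, and it assumes the log has one entry per line in the format shown in this post.

```shell
# Hypothetical helper: reads syslog lines on stdin and prints
# "<count> <user>" for each account that cimservera rejected.
count_auth_failures() {
    grep 'cimservera\[[0-9]*]: user "[^"]*" failed to authenticate' |
        sed 's/.*user "\([^"]*\)" failed to authenticate.*/\1/' |
        sort | uniq -c | awk '{ print $1, $2 }'
}

# Usage on a host:
#   count_auth_failures < /var/log/messages
```

Comparing the timestamps of the matching lines against the SIM task schedule should show whether a particular scheduled task is what triggers them.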
Any update on this issue? This hit me this morning and my entire cluster failed. I basically stopped the pegasus service and kept it from starting on a restart.
I tried the HP SIM updates (I have DL3805's), but no change. I also updated to VirtualCenter 2.5 Update 3.0, but again no change. I have a ticket open with VMware.
Commands to stop the service and disable it on restart:
service pegasus stop
chkconfig --level 5 pegasus off
chkconfig --level 3 pegasus off
Level 3 as well
Hey,
This will stop the service from running and keep "Pegasus" from starting if the servers reboot. This is just a stopgap until the real fix comes out. Basically, the Pegasus service cannot handle certain messages correctly and fails, leaving a defunct process behind each time. In my network (80 VMs across 6 servers) I see about 10-15 defunct processes per hour; eventually they cause the servers to hang, and the servers must be restarted. VMotion does not even work. Below are the exact commands to run on each server to keep this from occurring until the fix comes out.
service pegasus stop
chkconfig --level 5 pegasus off
chkconfig --level 3 pegasus off
Good luck,
Dave
We have this too - have disabled the WBEM discovery in HP SIM for now - had to reboot two hosts yesterday grrrrrr ... is it me or are they getting a little lax nowadays?
Yeah, this is a very big issue. I had to reboot 2 of our 6 servers; HA and DRS were freaking out and VMotion was failing. 25 VMs were rebooted at 9:40am, not good.
Support from VMware was great, but it seems this issue should have been resolved by now via a patch instead of these workarounds.
Can someone post the current suggested workaround for us all? Saves calling support.
Thanks
VMware Rocks!
I don't know that anyone has received an official workaround from VMware. I know that I have been disabling the firewall port for the CIMHttpsServer and this has stopped all the defunct processes and has been working well for a few months. Disabling pegasus I think will break the hardware monitoring ( not that it's super awesome right now ). The only side effect to disabling the http port on the firewall is that some tools like the EMC ECC Agent don't work, though it turns out it barely works anyway.
esxcfg-firewall -d CIMHttpsServer && service pegasus restart
I have a call with a VMware engineer today at 2:00 CST and will provide any update on this issue.
1) Stop the pegasus service and disable it from starting if the ESX server is restarted:
service pegasus stop
chkconfig --level 5 pegasus off
chkconfig --level 3 pegasus off
2) Disable WBEM discovery in HP SIM (Note: This can be done as well, but if you perform #1 you don't need to do this)
HPSIM -> Options -> Protocol Settings -> Global Protocol Settings .... Uncheck WBEM
Oh yeah... Step #1 was recommended by a VMware engineer, but I have not seen any "official" workaround posted by VMware.
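After running step #1, it may be worth double-checking that pegasus really is off for runlevels 3 and 5. A rough sketch follows; the function name pegasus_off_at_boot is made up, and it assumes `chkconfig --list pegasus` prints the usual single line of `N:on`/`N:off` fields.

```shell
# Hypothetical check: reads one line of `chkconfig --list pegasus`
# output on stdin. Exit status 0 means runlevels 3 and 5 are both off,
# 1 means at least one of them is still on.
pegasus_off_at_boot() {
    awk '{ for (i = 2; i <= NF; i++)
               if ($i == "3:on" || $i == "5:on") bad = 1 }
         END { exit (bad ? 1 : 0) }'
}

# Usage on a host:
#   chkconfig --list pegasus | pegasus_off_at_boot && echo "pegasus disabled at boot"
```

This only checks the boot configuration; `service pegasus stop` still has to be run to stop the instance that is currently running.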
Are there any negatives to turning this service off?