VMware Cloud Community
dominic7
Virtuoso

ESX 3.5.0 Update 2 [cimservera <defunct>]

Is anyone else getting a ton of defunct processes from cimservera after installing ESX 3.5.0 Update 2?

I've got a cluster, rebuilt entirely on Update 2, that is generating a lot of these. Unfortunately the output doesn't format well in here.

0 Kudos
124 Replies
stick2golf
Contributor

I updated my test environment (3 DL380 G5s) with ESX 3.5 U3 and the HP-released version 8.1.1 of the Management Agents, and I am still seeing the problem. I rebooted the servers after the installation of 8.1.1 as well. To make matters worse, the defunct processes now seem to appear more quickly. Below shows over 30 defunct processes about 2 minutes after I started the pegasus service and enabled WBEM again.

I am calling VMware back on this issue to see what can be done. I don't think they fully understand the "time-bomb" this issue creates: systems that are not monitored for defunct processes will eventually lock up, and once a host locks up neither VMotion nor HA will work. Regardless of whether this is an HP issue, the software should handle these errors better IMO, and I know we all have enough work besides this one issue... the pegasus service stays off for now :)

# ps -ef | grep def
root 15452 13116 0 11:31 ? 00:00:00
root 15453 13116 0 11:31 ? 00:00:00
root 15454 13116 0 11:31 ? 00:00:00
root 15455 13116 0 11:31 ? 00:00:00
root 15463 13116 0 11:31 ? 00:00:00
root 15806 13116 0 11:31 ? 00:00:00
root 15815 13116 0 11:31 ? 00:00:00
root 15816 13116 0 11:31 ? 00:00:00
root 15847 13116 0 11:31 ? 00:00:00
root 15851 13116 0 11:31 ? 00:00:00
root 15852 13116 0 11:31 ? 00:00:00
root 15855 13116 0 11:31 ? 00:00:00
root 15856 13116 0 11:31 ? 00:00:00
root 15857 13116 0 11:31 ? 00:00:00
root 15858 13116 0 11:31 ? 00:00:00
root 15859 13116 0 11:31 ? 00:00:00
root 15862 13116 0 11:31 ? 00:00:00
root 16206 13116 0 11:31 ? 00:00:00
root 16211 13116 0 11:31 ? 00:00:00
root 16212 13116 0 11:31 ? 00:00:00
root 16213 13116 0 11:31 ? 00:00:00
root 16244 13116 0 11:31 ? 00:00:00
root 16248 13116 0 11:31 ? 00:00:00
root 16249 13116 0 11:31 ? 00:00:00
root 16250 13116 0 11:31 ? 00:00:00
root 16253 13116 0 11:31 ? 00:00:00
root 16263 13116 0 11:32 ? 00:00:00
root 16598 13116 0 11:32 ? 00:00:00
root 16607 13116 0 11:32 ? 00:00:00
root 16654 13116 0 11:32 ? 00:00:00
root 17085 3713 0 11:32 pts/1 00:00:00 grep def
#
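
In case it's useful to anyone watching their hosts for the same symptom, a quick way to just get a running count of the zombies instead of the full listing (only a sketch, nothing official) is:

ps -ef | grep defunct | grep -v grep | wc -l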

0 Kudos
Dollar
Enthusiast

Yep... the defunct processes continue for me as well. For me, turning Pegasus off is not a valid option. I have over 100 host servers (all HP blades), and I need to know when a drive fails, memory fails, something overheats, etc. Restarting the HPASM service clears out the defunct processes, returning the zombie count to zero, and I was doing this on (at least) a daily basis via automation. The problem with that is that immediately following one of those service restarts across all hosts (last week), 15 hosts experienced an ASR and rebooted, taking all of their hosted VMs down with them. So I now have to turn off ASR on all of the devices as well, and am only restarting the HPASM service once per week (during an off hour).
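
If anyone wants to do the same, the automation is nothing fancy; a root crontab entry along these lines would do it (the hpasm service name and the schedule are just my assumptions here; adjust them to whatever your agent version actually installs):

# restart the HP agents early Sunday morning to clear accumulated zombies
0 5 * * 0 /sbin/service hpasm restart > /dev/null 2>&1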

I agree. It's a mess, and neither HP nor VMware has come to the realization of how significant the issue is. Certainly, neither company has stepped up to the plate thus far with a resolution. I am (and have been) attempting resolution with both companies via our support agreements. Thus far there has not been a lot of acknowledgement that this is a significant issue... even though it is affecting every HP customer on the planet that has upgraded to U2 (or later). I think part of the problem is that many of those "HP customers" just have "lock ups" without knowing what causes them (they are not monitoring for defunct processes) and simply attribute the issue to "accepted" VMware instability. Either that, or they are taking the route you are and disabling all hardware monitoring.

0 Kudos
PaulB2
Enthusiast

I am not experiencing the issue and I am wondering what might be different in my environment.

ESX 3.5.0 build 110268 (U2)

HP agent 8.1.0

HP SIM 5.1 SP1 C.05.01.00.02 and hotfix51_1,2,3,5,8,11,12,15,16

pb

0 Kudos
Dollar
Enthusiast

I would assume, if the HP agents are not resulting in defunct processes, that you have the Insight Manager tasks (the ones that cause the defunct processes) disabled (or not installed), and/or you have WBEM disabled under Global Protocol Settings. Or possibly your HP servers are not fully registered in Insight Manager... or Pegasus is not running properly on your hosts. There is (was) an issue with some of the post-3.5 U1 patches that results in a failed Pegasus upgrade (look in the /var/pegasus/vmware/install_queue folder to see if any folders exist).
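
A quick way to check that last item from the service console (the path is the one mentioned above):

# any leftover directories here would suggest the Pegasus upgrade never completed
ls -la /var/pegasus/vmware/install_queue/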

In my environment, I have several Insight Manager Servers running different "modules". My primary server has the following modules installed:

  • HP BladeSystem Integrated Manager

  • HP Insight Power Manager

  • HP Performance Management Pack

  • HP Server Migration Pack

  • HP Virtual Machine Management Pack

  • ProLiant Essentials Vulnerability and Patch Management Pack

I have another Insight Manager Server that maintains the "Support Contract" information for my servers (HP Asset Maintenance) and acts as the "HP Phone Home" server whenever there is a hardware failure (automatically sends ticket to HP to repair the hardware).

It may be that one of the many different modules I have installed is causing the problem and you do not have that particular module installed... although I have disabled almost everything except hardware SNMP trap acceptance (and notification) and still have the problem. I have not attempted to "upgrade" all of my Insight Manager installations and modules, because others in this thread have tried the same with no results.

0 Kudos
Dollar
Enthusiast

Just to give folks an update, escalations with HP have occurred and this issue is currently being worked in an L3 lab within HP. It's going to take a while, but at least we have gotten beyond the denial phase, and beyond the notion that the workarounds are the resolution, which means that, eventually, a real fix will be released.

On a side note... I believe this thread was helpful in the escalation. It helps when you can demonstrate that the impact is being noticed in more than one location.

0 Kudos
plin
Contributor

Thanks, Dollar and others, for bringing this to our attention. Engineering teams from both VMware and HP are investigating this issue. As soon as we have more detail, we'll provide an update to you all.

Regards -

VMware Infrastructure Product Management

0 Kudos
johnchaneyVM
Contributor

Okay everyone, I may have a slightly different issue in that I'm getting the exact same cimserver symptoms but I'm running on Sun servers. My CPU has been pegged since upgrading to U2 a month ago, but I only just had time to look into it. It's my lab environment and I was out of the office a lot, so I wasn't as concerned.

I just tried these steps:

service pegasus stop
chkconfig --level 5 pegasus off
chkconfig --level 3 pegasus off

The defunct processes went away but the cpu was still pegged so I tried rebooting.

After the reboot I can login with the VI client or to the server with PuTTY. That also means I can start my VMs. At the moment I don't have a monitor on the server so I can't see what it is doing, but I thought I'd run this by you guys to see what you think the issue might be.

I can ping the server with no problem.

Do I just use the on parameter to re-enable pegasus? I'm going to try it.
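
I'm assuming it's just the reverse of the commands above, i.e. something like:

chkconfig --level 3 pegasus on
chkconfig --level 5 pegasus on
service pegasus start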

Please keep us posted on any updates to the issue.

Thanks,

John Chaney

0 Kudos
johnchaneyVM
Contributor

It's amazing how you find typos after hitting send.

I can NOT log in with VI client or putty.

John

0 Kudos
johnchaneyVM
Contributor

New update.

After waiting about 30 minutes I can now login.

John

0 Kudos
dominic7
Virtuoso

It's probably the same problem. As I stated in the original post, I don't actually use any of the HP SIM agents. I'm guessing that you have a SIM server somewhere in your environment though.

0 Kudos
johnchaneyVM
Contributor

Thanks for the reply Dominic.

I do all of the IT as well and I don't know of any SIM servers. We're a pretty small shop. I'll do some investigating to see if something was installed that I don't know about.

John Chaney

Senior Systems Engineer

Soccour Solutions

jchaney@soccour.com

www.soccour.com

214.708.4358 mobile

0 Kudos
Dollar
Enthusiast

The inability to login via PuTTY or the console is exactly what I've experienced whenever a server has accumulated somewhere in excess of 4,000 "zombies"... although I've never had one recover on its own once it reaches that stage.

Without HP Insight in the mix, I cannot be certain your issue is the same as what others are experiencing. As far as I am aware, this has (thus far) been an issue unique to HP. Check your /var/log/messages log and see if you have a grouping of "failed to authenticate" messages similar to the following:

Nov 16 17:04:23 vm02102 cimservera[3400]: user "" failed to authenticate
Nov 16 17:04:23 vm02102 wbem(pam_unix)[3401]: bad username []
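
A quick way to check for those from the service console (adjust the path if your syslog setup differs):

grep -i "failed to authenticate" /var/log/messages | tail -20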

0 Kudos
Saturnous
Enthusiast

I saw a system where a backup administrator had logged on and produced defunct processes too; it seems that ESXRanger uses WBEM as well.

With working LDAP authentication they go away. Just read the man page for esxcfg-auth.
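
For reference, the LDAP setup is roughly along these lines; the exact flags vary by build, so check the man page (the server and base DN below are obviously placeholders):

esxcfg-auth --enableldap --enableldapauth --ldapserver=ldap.example.com --ldapbasedn="dc=example,dc=com"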

0 Kudos
jawad
Contributor

We have had this same problem on both Dell and HP blades. Let's see if any of the new updates help.

In the shadows...
0 Kudos
Troy_Clavell
Immortal

An email I received from my HP rep yesterday... thought I would share:

Info from one of our internal con calls with VMware.

So there is a PR (Problem Report / Bug) open.

As you can see, U4 should fix it. No KB or advisory published yet.

• HP Case 3603759836 / (several SRs / PR336780) – Defunct processes from the cimservera process running on ESX 3.5 U3. Engineering teams on both sides are investigating this issue. The bug is triggered by an inadequately configured HP SIM server (CIM client) requesting a login to the Pegasus CIMOM (cimserver) with incorrect credentials. These failed logins cause the cimserver to shut down, resulting in defunct processes.

STATUS: An independent fix is required on both the HP and VMware sides. The VMware fix has been identified and will be included as part of the U4 GA. We are currently working on publishing a KB. On the HP side, engineering is still investigating and is also working on publishing a separate advisory (to be published after the Thanksgiving holiday).

0 Kudos
martin_schmidt
Enthusiast

KB released:

0 Kudos
msemon1
Expert

We had problems with version 7.x of the HP agents, and VMware advised upgrading to version 8.1. We were seeing strange performance on one of our hosts, but so far we have not had any other problems. Are most people not installing the HP agents on their hosts?

Mike

0 Kudos
vmproteau
Enthusiast

We install the HP Management Agents on all of our HP hosts. Until there is another method that provides the same hardware-level details and alerting, I don't see an alternative.

Besides this "defunct process" issue, we have not had any problems. The only thing I've done and you might consider is disabling ASR. On our standard Windows server I've often found it a bit flaky causing random reboots with little or no explanation. On those servers it's annoying enough but, with some of our ESX Hosts having 20+ VMs, this would be a bit more distressing.

0 Kudos
stick2golf
Contributor

I was told the same thing, and 8.1 still has the problem with performance and the defunct processes. I basically had to disable the pegasus service to ensure the HP agent does not create the defunct processes and eventually crash the server(s). I think VMware and HP are pointing fingers at each other while the end users run into this problem when their server farms crap out.

There are a couple of ways to handle this issue as noted on the previous posts.

Good Luck.

0 Kudos
stick2golf
Contributor

Has anyone tested this yet to see if this fixes the issue?

0 Kudos