ESX 3.5.0 Update 2 [cimservera <defunct>] - Page 4

dominic7 · ‎07-31-2008

Is anyone else getting a ton of defunct processes from cimservera after installing ESX 3.5.0 Update 2?

I've got a cluster that is generating a lot of these; it was all rebuilt at Update 2. Unfortunately the output doesn't format well in here.

stick2golf · ‎10-15-2008

The Pegasus task controls the CIMOM software used by vendors to obtain details of VMs and the ESX server. There are no issues with VMWare or VirtualCenter management, but services like HP SIM will not be able to obtain information from the ESX servers concerning the VMWare environment.

Below are some details of the Pegasus application and if anyone wants more details they can read:

"The goal of the VMware® CIM SDK is to provide independent software vendors (ISVs) and the enterprise storage management industry a CIM-compliant object model for virtual machines and their related storage devices. The SDK also includes a Pegasus CIMOM installed with VMware ESX Server, as well as sample client code, to allow ISVs to explore virtual machine resources and to incorporate them into their management applications. VMware, Inc. considers the first version of the CIM SDK to be experimental. The interface may change in future releases to align it more closely with evolving standards."

Good luck and hope this helps..:)

stick2golf · ‎10-15-2008

I just posted the answer to your questions-:)

larden · ‎10-18-2008

Would setting a cron job to restart the pegasus service each day or every few hours have the same effect?

VMware Rocks!

stick2golf · ‎10-18-2008

Restarting the Pegasus task will clear the defunct processes, but I would recommend having cron run the restart 3-4 times daily, because the defunct processes can fill up the process table and cause a system restart.
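A scheduled restart along those lines could look roughly like the sketch below. It only prints a cron.d entry; nothing is installed until the output is redirected. The file name `/etc/cron.d/pegasus-restart` is hypothetical, and it is an assumption that the pegasus init script accepts `restart` and that the service console reads `/etc/cron.d`.

```shell
# Print a cron.d entry that restarts Pegasus every 6 hours (4 times a day).
# Assumptions: the init script accepts "restart"; the file name is hypothetical.
print_pegasus_cron() {
    cat <<'EOF'
# /etc/cron.d/pegasus-restart
0 */6 * * * root /sbin/service pegasus restart >/dev/null 2>&1
EOF
}
```

To use it, run `print_pegasus_cron > /etc/cron.d/pegasus-restart` as root. Changing `*/6` to `*/8` gives three restarts a day instead of four.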

Dollar · ‎10-28-2008

Is anyone aware of anything going on inside HP on this? They have an entire VMWare Support Infrastructure that can be purchased for Insight Manager. You would think that in order to sell such a thing the agents would have to work fully without blowing up Hosts.

I know that VMWare has washed their hands of it and I can find nothing that indicates HP is doing anything about it.

stick2golf · ‎10-28-2008

I can't get anywhere with HP. They are blaming the problem on VMWare Update 2, and VMWare is not working on this issue since they believe it is HP's issue with their open interface.

From my perspective, regardless of the HP interface problem, VMWare needs to be more proactive about fixing this issue. If I had an open interface that someone was not using correctly on a mission-critical application such as the ESX hypervisor, I would not allow the entire system to crash, which is what will happen if you do not catch this problem soon enough. I do have a case open with VMWare and am trying to get some type of resolution, but so far no luck. HP needs to fix their problem and VMWare needs to provide better error handling, in my opinion.

Below is a discussion link at HP.

Dollar · ‎10-30-2008

Ditto: I thought I had the problem resolved and moved on. I came in this Monday and had seven host servers in a "not responding" mode. You could not even log into the console. All of my hosts had over 4,000 zombies. I stopped the HP Agents and the defunct processes disappeared. The "not responding" hosts had to be powered off. They could not be shut down cleanly.

One thing VMWare needs to understand, when the business community was walking into my office Monday morning to find out why all of their Virtual Machines were down they were not asking me "What's wrong with the HP Insight Manager Agents". They were asking me what's wrong with VMWare.

Has anyone developed a cron job to monitor "defunct processes" (zombies) and alert when it reaches a certain number?
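One possible shape for such a check, as a rough sketch rather than something tested on an ESX service console: count the processes whose state starts with `Z` and write a syslog warning once the count reaches a limit. The script name, the default threshold of 100 and the `zombie-watch` log tag are all made up for illustration, and it assumes the console's `ps` supports `-o stat=`.

```shell
#!/bin/sh
# zombie-watch.sh (hypothetical name): warn when defunct processes pile up.

THRESHOLD=${THRESHOLD:-100}   # assumed alert level; tune per host

# Count lines on stdin whose process state starts with "Z" (zombie).
count_zombies() {
    awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Succeeds (exit 0) when the given count has reached the threshold.
over_threshold() {
    [ "$1" -ge "$THRESHOLD" ]
}

# Only touch the live system when called with --check, e.g. from cron.
if [ "${1:-}" = "--check" ]; then
    count=$(ps -e -o stat= | count_zombies)
    if over_threshold "$count"; then
        logger -t zombie-watch "WARNING: $count defunct processes (threshold $THRESHOLD)"
    fi
fi
```

A cron entry such as `*/15 * * * * root /usr/local/sbin/zombie-watch.sh --check` (path hypothetical) would run it every 15 minutes. The `logger` call could be swapped for a mail command if email alerts are preferred.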

stick2golf · ‎10-30-2008

That is what happened to our servers as well. Not good... I have yet another call in to VMWare, but I have found a couple of items of interest. While debugging an ESX Server reset issue on my DL360 (it occurs when I power on/off more than 10 VMs), I noticed a BIOS update for the DL360 and an HP Management Agent update. I am trying these out now on my DL360 and will see whether I still get the defunct processes.

It's funny how these fixes are released with only a small blurb about a fix that resolves a critical issue. HP is getting really weak on this....

HP Management update link for DL360...

Help pages now exist for the NIC section in System Management Home Page
System Management Homepage (SMH) might show a 'connection-timeout' error message if we try to refresh it a couple of times quickly. You might have to restart the hpsmhd service to be able to log in to SMH again. This usually happens when iLO/iLO2 is being reset and the SMH is refreshed.
The hpsmhd service might show the following error messages while restarting it after a 'connection-timeout' error is displayed by the SMH, but still the hpsmhd service will be restarted successfully:
Problem receiving data: Connection reset by peer (104)
Connected to 127.0.0.1:2301....
sent /proxy/reconnect...counter=1, status=104
Under Home->Storage-><Controller Link>->Physical Drive, there is a button named Identify Drives. When this button is clicked, the LEDs on the physical drive(s) start blinking; when the stop button is clicked, the LEDs on the physical drives stop blinking but the drive status on the web page still shows blinking.
It is no longer necessary to stop the Pegasus service before performing "service hpasm start" or "service hpasm stop"
On the NIC display for the System Management Homepage when the NIC being shown is a 10G NIC, the speed is now displayed correctly as 10Gbps

Dollar · ‎10-30-2008

I'm running Proliant BL460s, Proliant BL465s, Proliant BL680s, Proliant BL685s, Proliant BL45Ps, Proliant BL25Ps, and Proliant BL20Ps. All are on the latest BIOS, Firmwares, VMWare Patches, and HP Management Agents (8.1)..... and the problem exists universally.

stick2golf · ‎10-30-2008

Yep, still an issue... I am keeping the service stopped and setting it to never start at any run level.

ridizy · ‎10-30-2008

@Dollar. That is spot-on as far as where the business believes the blame lies. Convincing the business that virtualization is beneficial to them is already an uphill battle. Any hiccups in the virtual environment, regardless of their true cause, are blamed squarely on VMware.

I'm keeping my agents disabled for now.

Schorschi · ‎10-31-2008

We have also just seen this, about a week or so ago. We have a case open with HP. Do you by chance have the HP case or VMware SR numbers? It would be a good idea to coordinate them with ours, or at least to point out to HP and VMware that this issue is not unique.

DGI_Drift · ‎11-11-2008

Hi

Has anyone updated ESX to Update 3 yet, and does it resolve this problem?

Or has any new patch been released that fixes this problem?

j_dubbs · ‎11-11-2008

After my experiences with update 2 I think we will pass on that for now and see how it goes. We have just removed our agents for now unfortunately.

stick2golf · ‎11-11-2008

No update on this issue. Still going back and forth between HP and VMWare.

Also, I am updating the ESX Servers in my test environment to Update 3 this week. I am going to run it through my test plan, heavily focused on HA and DRS, since this is where I have had the most problems lately. I updated VirtualCenter to Update 3 about 2 weeks ago in both environments. There were a couple of issues with the installation, but overall it is working well. For VirtualCenter Update 3 the database schema changed, so make sure you do a backup first and uninstall Update Manager before the installation, or it will fail.

mronsman · ‎11-11-2008

I just went through the wringer yesterday with this issue and was on the phone with VMware for 5 hours diagnosing it. They indicated that neither Update 3 nor HP PSP 8.1 for VMware will fix this issue. The workaround I used was to stop the Pegasus service and the HPASM services. Not cool.

Thanx,

Matt

stick2golf · ‎11-11-2008

You may also want to ensure that the service does not start back up after a shutdown or restart by issuing the following commands:

service pegasus stop    # already done

chkconfig --level 5 pegasus off

chkconfig --level 3 pegasus off

Dollar · ‎11-12-2008

FYI: HP Released Version 8.1.1 of the Management Agents today. While the release notes do not mention anything specific about resolution of this issue it does specify "Support for VMWare ESX 3.5 U3"...... so I am taking one of my servers that is experiencing this issue, updating to the new IM Agents, and installing U3. I'll post the results when completed.