dominic7
Virtuoso

ESX 3.5.0 Update 2 [cimservera <defunct>]

Is anyone else getting a ton of defunct processes from cimservera after installing ESX 3.5.0 Update 2?

I've got a cluster, rebuilt entirely at Update 2, that is generating a lot of these. Unfortunately the output doesn't format well in here.

124 Replies
dominic7
Virtuoso

I'll reiterate this a bit: the defunct processes are not a result of the SIM agents; they exist regardless of whether the agents are installed. The problem is with the SIM management console talking to the CIM provider. Having spoken to VMware at length about this issue, I can assure you they are aware of it and are working on a fix. If you have problems with defunct processes, I suggest you open an SR and talk to them as well.

ie1e0955
Contributor

I have specifically avoided installing any updated agents and later versions of the ESX hypervisor because of the issues listed in this thread.

ESX 3.5.0 Update 3 does not fix the issue either, and it has a host of other problems if your VMware solution is SAN-attached via iSCSI or Fibre Channel, which ours is. We currently have enough capacity that if a host fails due to a cimserver issue or hardware failure, we can migrate guests to another host. This is not a long-term solution, though; we need the same functionality you do, to effectively monitor the server hardware for any issues.

Looks like we will have to wait for U4 later this year and any developments from HP.

Saturnous
Enthusiast

Update:

Yes, U3 also produces cimservera defuncts under the circumstances described.

The circumstances are still:

- Any ESX 3.5 U2 or higher

AND

- Something from outside tries to authenticate with a domain account

AND

- No one has properly configured domain authentication on the ESX host (read the man page for esxcfg-auth).

If you eliminate any one of these circumstances, you will not have the problem.

It is not directly connected to the HP agents; it's just that people who install the HP agents mostly use HP SIM too. And the Service Console crashes earlier when you have forgotten to configure the agents properly (deactivate the unused agents and increase the polling intervals to minimize SCSI reservations) and the agents consume most of the available SC memory. I saw the same issue without agents, with a badly configured Verizon product that also left failed authentications in the /var/log/messages log.

You can't blame HP for doing something wrong (except that they supported the OpenPegasus project with money and manpower, which VMware has used in ESX 3.5 U2+). Now the good news: in the ESX 4.0 beta, VMware no longer uses the Pegasus package as the foundation of their CIM provider; they use sfcb instead. So you will definitely have no cimservera defunct processes in ESX 4. I would suggest they make directory integration a bit easier by providing a wizard in VC.

The solutions are (see the sketch after this list):

- find out what is trying to connect (check /var/log/messages) and replace the offending account with root there

or

- get your ESX host into Active Directory; be aware that you have to run adduser for any user that might try to connect (ESX does not accept AD authentication for a username it does not know)

or

- create a cron job that restarts the pegasus service
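
A minimal sketch of the messages check, the AD join, and the cron job, run from the Service Console; the domain, DC, and username below are placeholders, so adjust them to your environment (and read the esxcfg-auth man page first):

# see which account is failing to authenticate
grep -i "authentication failure" /var/log/messages | tail

# join Active Directory (example.com and dc1.example.com are placeholders)
esxcfg-auth --enablead --addomain=example.com --addc=dc1.example.com
adduser someaduser    # ESX must know the username locally

# or restart Pegasus nightly at 03:00 via an /etc/cron.d entry:
# 0 3 * * * root service pegasus restart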

Dollar
Enthusiast

The patch for this has been released and is available as a component of the 01/30/09 patches (available on the patch downloads page).

This specific patch is detailed here:

Who wants to guinea pig this for the rest of us? :)

stick2golf
Contributor

I have a 3-host ESX test environment that I can use for testing (three DL380 G5s). I will try this later this evening and provide results. This is very easy to reproduce, so it should not take long to test.

DGI_Drift
Contributor

I want to try this myself. :)

But I have another problem now as well.

My Update Manager hasn't downloaded the newest patches from January 30.

I have restarted the server that Update Manager runs on, but it's still not working.

I know this is a little off topic, but I hope someone can help me...

Thanks

Rabie
Contributor

Hi,

The update service is not yet very reliable; I have found the best way is to periodically check for updates on www.vmware.com.

Then, when a new set of updates is available:

Wait :) then check the forums, and if all is OK (this step is advisable but not required):

Go into Update Manager, change the settings, go to the schedule, and schedule an immediate update. That should then pop you an email that it has found updates and is busy downloading them.

Done

Regards

Rabie

EPL
Contributor

I was having problems with Update Manager as well. I called in a tech support case and was informed that they were having update issues with their update manager... go figure! They've since emailed me to try a manual sync, and it did download the updated patches. I've installed the Pegasus patch on one of my dev boxes. I've had it running for about 16 hours with no defunct processes so far.

I'm running ESX 3.5 U3 with all the latest patches.

One thing to note after the patch, which I didn't notice before in the messages log:

watchdog-cimserver: Begin '/var/pegasus/bin/cimserver daemon=false', min-uptime = 60, max-quick-failures = 5, max-total-failures = 1000000

then a few hours later:

watchdog-cimserver: '/var/pegasus/bin/cimserver daemon=false' exited after 22844 seconds

watchdog-cimserver: Executing '/var/pegasus/bin/cimserver daemon=false'

DGI_Drift
Contributor

I have had scheduled tasks doing immediate updates for some days now, but it wouldn't work.

But suddenly today it updated Update Manager, so I shall update one host now.

Just disable the cron job first. :-)

Troy_Clavell
Immortal

bump

Can anyone confirm if this patch has indeed fixed this issue?

http://kb.vmware.com/kb/1006657

MKguy
Virtuoso

GENTLEMEN,

It seems like this is finally fixed. I patched two ESX 3.5 U3 hosts yesterday and today with this, and I don't see zombie cimservera processes yet.

Man, they really took their sweet damn time to fix that issue and feast on our impatient banter.

-- http://alpacapowered.wordpress.com
Dollar
Enthusiast

That's a BIG concur!!!!!!

It took two months to get someone to acknowledge there was a problem. It took an additional month to get the cause of the problem pinpointed as a VMware issue, not an HP issue. Then it took an additional two months to get a patch released.

ccastaneda
Contributor

What's the general consensus on using the HP management agents? I know this wasn't actually a problem with them, but then again, they're the only reason our SIM servers are pointed at our systems, which triggered and exposed the Pegasus issue. I guess I'm just concerned about the whole combo (SIM, agents, and Pegasus).

I've had issues with the agents on a couple of occasions that resulted in crashes, whether it was having to uninstall a prior version to get to a newer version, a recent upgrade of the iLO firmware to 1.70, or just unexplainable crashes, not to mention the early version 7 agents that were known to crash ESX (too long a story for me to even cite). I've had more incidents caused by the agents, or suspicion directed at them, than value they have actually provided: an alert on a failed hard drive here and there. So this leads me to question whether it's better to just run my HP blades (BL465) blind and do visual inspections, or continue to take my chances with the whole combo (SIM, agents, Pegasus).

Anyone? Maybe someone can shed some light on their implementation, tweaks, and best practices. I can't believe this would be necessary, and it would suggest that the agents are problematic, but I'm all ears.

Thanks,

martin_schmidt
Enthusiast

Hi all,

I know that many people blame the Insight Agents for various problems.

Here are some best practices that should help improve things.

1. Disable ASR in ProLiant BIOS.

Otherwise a busy ESX Service Console could be misinterpreted as a hung ESX server, triggering an ASR reset.

2. Make sure you use the latest Insight Agent version. A compatibility list can be found on http://h71028.www7.hp.com/enterprise/cache/505363-0-0-0-121.html.

Normally the latest agent version runs on any 3.x release.

Make sure you uninstall the older version before you install the latest one.

3. The cimservera defunct issue was not an HP problem but a VMware one.

More details and patch link in .

4. Not directly related to the Insight Agents, but to all 3rd-party software installed in the Service Console: make sure that your SC memory is not at the default of 272 MB, but much higher. Give it at least 512 MB, or better, the maximum of 800 MB.

You can monitor whether your system is swapping with "free -m", "less /proc/meminfo", and "top".
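
For example, a quick check from the Service Console; treating a growing swap "used" figure as trouble is a rule of thumb, not an official threshold:

free -m
# in the "Swap:" row, a steadily growing "used" column means the SC is swapping
grep -i swap /proc/meminfo
# comparing SwapTotal and SwapFree gives the same picture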

Note: Many ESX servers are swapping when they run some VMs and have HA enabled.

Several issues were seen in the past whenever the Service Console on the ESX host was swapping. This can lead hostd (on classic ESX) to hang, causing different kinds of issues:

- The ESX host is grayed out (dimmed) in VC.

- There are intermittent disconnects of the host in VC.

- The Service Console is no longer controllable.

- Sometimes HA failures occur in combination with the previous issues.

- vMotion issues occur in combination with the previous issues.

272 MB is too low for most configurations, so please increase this value. Unfortunately, a reboot is needed.
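
A quick way to check the currently configured value is the sketch below, assuming the classic ESX COS keeps its boot options in /boot/grub/grub.conf; the supported way to change the value is the VI Client (Configuration -> Memory), per KB 1003501 below:

# the COS memory size shows up as a mem= kernel parameter
grep "mem=" /boot/grub/grub.conf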

References:

http://kb.vmware.com/kb/1003496 -> Checking for resource starvation of the ESX Server service console

http://kb.vmware.com/kb/1003501 -> Increasing the amount of RAM assigned to the ESX Server service console

http://kb.vmware.com/kb/1002325 -> ESX host randomly disconnects from VirtualCenter

http://kb.vmware.com/kb/1003313 -> Troubleshooting an ESX host that intermittently disconnects from VirtualCenter

http://kb.vmware.com/kb/1002713 -> hostd and vpxa exceed hard limit for memory use during VMware Consolidated Backup

5. Optimize the agent configuration.

The idea is to disable unneeded agents and to lower the polling frequency for others.

Here is just an example. Please decide for yourself which agents are needed in your environment:

How to lower polling frequency with Insight Agents on ESX?

Stop agents: "service hpasm stop"

First decide which agents make sense in your environment. For example:

cmaidad -> used for SmartArray monitoring (hard disks and defects on the controller).

cmaeventd -> creates traps and IML entries on ProLiant servers (for RAM, coolers, power supplies, and other board facilities).

cmafcad -> collects data from FC HBAs. Normally only useful for MSA arrays.

It should be disabled for unsupported storage subsystems and 3rd-party arrays.

Please read: http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&objectID=c01519875&jumpi...

Please edit /opt/compaq/cma.conf with nano and exclude agents:

cmaCloseCcissHandle OFF # Agent no longer opens and closes disk block requests every 15 seconds.

exclude cmafcad # Fiber Devices

exclude cmaided # IDE devices

exclude cmasasd # SAS devices (Serial Attached SCSI - tapes etc)

exclude cmascsid # pure SCSI devices and SCSI card

exclude hprackd # intelligent Rack monitor

Increase the polling interval for the agents. 15 seconds is the default; 60 seconds is our recommendation.

Change polling interval in /opt/compaq/storage/etc/cmaeventd file:

Original settings: PFLAGS="-p 15"

Change to: PFLAGS="-p 60"

Change polling interval in /opt/compaq/storage/etc/cmaidad file:

Original settings: PFLAGS="-p 15 -s OK"

Change to: PFLAGS="-p 60 -s OK"

Change polling interval in /opt/compaq/foundation/etc/cmathreshd file:

Original settings: PFLAGS="-p 5 -s OK"

Change to: PFLAGS="-p 120 -s OK"

Change polling interval in /opt/compaq/foundation/etc/cmahostd file:

Original settings: PFLAGS="-p 5 -s OK"

Change to: PFLAGS="-p 120 -s OK"

Start agents again: "service hpasm start"

I hope this helps a little bit. All suggestions are made without any guarantee.

Regards, Martin

ccastaneda
Contributor

Martin, thanks for your feedback, I'll give those tweaks a try.

Oddly enough, the matrix doesn't list the 8.1.1 agents at all.

-C

Saturnous
Enthusiast

You speak about iLO-related crashes... I don't believe they can be triggered by the agents directly.

Did you get kernel dumps? Use vmkdump -l to extract the last entries.

Does this point to iLO (port 0x61 contains 0xb1)?

ccastaneda
Contributor

The agents do have direct communication with iLO; this is how resetting and performing some minor iLO changes are possible through the System Management Homepage (Home -> Management Processor -> Integrated Lights-Out 2).

Here's the output for services with "hp" in the name: Status of HP Lights-Out Drivers and Agents (hprsm): cpqci cpqriisd cmasm2d cmarackd

At the time of the crash we were updating the iLO firmware. 2 out of 12 blades crashed, so IMO there was a direct correlation pointing to the agents being at fault.

Saturnous
Enthusiast

Hmm, I don't get the logic here: "Because the agents are running and communicating, they MUST be the cause of the crash."

Why are you so sure that the plain VMware CIM providers (older compared with the agents), which assume they are talking to a plain Intel IPMI address, didn't cause the crash? At least they access the kernel, not just the SC.

The openipmi driver is more suspect to me here... maybe it was not updated or failed while starting.

When you talk about a "crash", do you mean SC only or the kernel?

Do you have a dump, or a log from a dump?

You have to prove the chain... did the agents do something that led the hardware to act in a wrong way, or did the kernel crash because an agent sent it some strange request?

Schorschi
Expert

Guys, are we not missing the point? Neither HP nor VMware supports 'in-band' flashing on the ESX OS, be it iLO or mainboard. We, as a hard rule, never flash a server from the active OS. Furthermore, the VMkernel does not support device-level access from the COS outside of the VMware API/command set, so even if the COS can somehow communicate with the hardware, this is not valid. I am very surprised: what, you said only 2 failed out of 10 or 12? It should have failed 100% of the time per HP and VMware best-practice recommendations. We have never had a flash failure, be it mainboard or iLO device, using out-of-band methods. In fact, we cannot say this of Dell and IBM. We have extensive HP, later Dell, and now a growing base of IBM servers, and HP, in reference to firmware flashing, is still the most stable and consistent, but again, this is always out-of-band, never in-band. In fact, if using ESXi, you have no option but to use out-of-band methods.

ie1e0955
Contributor

Hi, more in response to the posts on 5th Feb and the patch released by VMware: has anyone updated to the latest version of HP SIM, v5.3?

There is an updated MIB list for the management server, available as a download. I couldn't see anything related to ESX directly; however, I am hoping it addresses some of the issues with cimservera processes as well, from an HP perspective.

I've downloaded it, but it's as yet untested.
