I have been noticing an issue lately with two of our ESX host servers. Over the period of about a day, one of the cimserver processes eventually starts using 100% of CPU 0 on the ESX hosts. While this is occurring, a reboot of the ESX host stalls, displaying backtrace messages from the cimserver process. I have been able to manually restart pegasus (service pegasus restart) and clear up the hog process, but the issue returns the next day.
ESX hosts are Sun x4140 servers.
ESX version is 3.5.0 U3 Build 123630
I have run all the updates on both our ESX host servers using Update Manager. Both ESX hosts are now at ESX 3.5.0 build 143128. One of the hosts has already developed the cimserver process issue again after the updates completed this morning. Core 0 is at 100% usage and top shows that it is a cimserver process again. Once more, a service pegasus restart got rid of the cimserver process, but it looks like the issue will come back even with the updates.
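For anyone who wants to catch the runaway process before it pins the core for hours, here is a rough Python sketch of the check described above. It assumes you feed it text in the three-column layout that `ps -eo pid,pcpu,args` prints (PID, %CPU, command). The function name `find_cpu_hogs`, the 90% threshold, and the sample text are all made up for illustration and are not taken from any of the hosts in this thread. Note also that ps reports CPU averaged over the life of the process, so a snapshot from top may react faster.

```python
def find_cpu_hogs(ps_output, name="cimserver", threshold=90.0):
    """Return (pid, cpu_percent) for processes whose command contains `name`
    and whose CPU column is at or above `threshold`."""
    hogs = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)  # pid, %cpu, rest of the command line
        if len(fields) < 3:
            continue
        pid_text, cpu_text, command = fields
        try:
            pid = int(pid_text)
            cpu = float(cpu_text)
        except ValueError:
            continue  # not a process row
        if name in command and cpu >= threshold:
            hogs.append((pid, cpu))
    return hogs


# Hypothetical sample text, not real output from the hosts discussed here.
SAMPLE = """\
  PID %CPU COMMAND
 1201  0.3 /usr/lib/vmware/hostd/vmware-hostd
 2417 99.8 cimserver
 2420  0.1 cimserver
"""

print(find_cpu_hogs(SAMPLE))
```

If the list comes back non-empty, that is the point at which a service pegasus restart would be the manual workaround.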
I have the same situation on 2 ESX servers, 3.5.0 build 143128.
The patch you mention is applied:
Installed software bundles:
--- Name --- --- Install Date --- --- Summary ---
3.5.0-64607 21:06:12 02/21/08 Full bundle of ESX 3.5.0-64607
ESX350-200802305-SG 09:47:07 03/11/08 openssl security update
ESX350-200802303-SG 09:47:07 03/11/08 util-linux security update
ESX350-200802408-SG 09:47:07 03/11/08 Security Updates to the Python Package.
ESX350-200803209-UG 09:34:34 02/04/09 Update to the ESX Server Service Console
ESX350-200810201-UG 09:39:44 02/04/09 Updates VMkernel, Service Console, hostd
ESX350-200803212-UG 09:50:31 02/04/09 Update VMware qla4010/qla4022 drivers
ESX350-200803213-UG 09:51:50 02/04/09 Driver Versioning Method Changes
ESX350-200803214-UG 09:52:53 02/04/09 Update to Third Party Code Libraries
ESX350-200804405-BG 09:53:40 02/04/09 Update to VMware-esx-drivers-scsi-megara
ESX350-200805504-SG 09:54:28 02/04/09 Security Update to Cyrus SASL
ESX350-200805505-SG 09:55:12 02/04/09 Security Update to unzip
ESX350-200805506-SG 09:55:58 02/04/09 Security Update to Tcl/Tk
ESX350-200805507-SG 09:56:46 02/04/09 Security Update to krb5
ESX350-200805514-BG 09:57:35 02/04/09 Update to VMware-esx-drivers-net-e1000
ESX350-200808206-UG 09:59:07 02/04/09 Update to vmware-hwdata
ESX350-200808210-UG 09:59:49 02/04/09 Update to VMware-esx-drivers-net-ixgbe
ESX350-200808211-UG 10:00:31 02/04/09 Update to the tg3 Driver
ESX350-200808212-UG 10:01:20 02/04/09 Update to the MegaRAID SAS Driver
ESX350-200808215-UG 10:02:08 02/04/09 Update to the Emulex SCSI Driver
ESX350-200808218-UG 10:03:13 02/04/09 Security Update to Samba
ESX350-200808406-SG 10:05:02 02/04/09 Security Update to Perl
ESX350-200808407-BG 10:05:48 02/04/09 Updates Software QLogic FC Driver
ESX350-200808409-SG 10:06:37 02/04/09 Security Update to BIND
ESX350-200810203-UG 10:08:13 02/04/09 Updates MPT SCSI Driver
ESX350-200810204-UG 10:09:00 02/04/09 Updates bnx2x Driver for Broadcom
ESX350-200810205-UG 10:10:32 02/04/09 Updates CIM and Pegasus
ESX350-200810208-UG 10:13:10 02/04/09 Updates esxupdate documentation
ESX350-200810209-UG 10:14:01 02/04/09 Updates bnx2 Driver for Broadcom
ESX350-200810210-UG 10:24:53 02/04/09 Updates HP Storage Component Drivers
ESX350-200810212-UG 10:25:46 02/04/09 Updates VMkernel iSCSI Driver
ESX350-200810214-UG 10:26:43 02/04/09 Updated Time Zone Rules
ESX350-200810215-UG 10:28:10 02/04/09 Updates Web Access
ESX350-Update-02 10:28:24 02/04/09 ESX Server 3.5.0 Update 2
ESX350-Update01 10:28:34 02/04/09 ESX Server 3.5.0 Update 1
ESX350-Update03 10:28:44 02/04/09 ESX Server 3.5.0 Update 3
ESX350-200901402-SG 20:40:51 02/05/09 Security Update to ESX Scripts
ESX350-200811401-SG 20:43:48 02/05/09 Updates VMkernel, hostd, and Other RPMs
ESX350-200811406-SG 20:44:38 02/05/09 Security Update to bzip2
ESX350-200901406-BG 20:47:29 02/05/09 Updates Kernel Source and VMNIX
ESX350-200811408-BG 20:48:25 02/05/09 Updates QLogic Software Driver
ESX350-200901401-SG 20:50:16 02/05/09 Updates VMkernel, VMX, hostd etc
ESX350-200901404-BG 20:51:35 02/05/09 Updates VMware Tools
ESX350-200901405-BG 20:52:21 02/05/09 Updates lnxcfg
ESX350-200901407-BG 20:53:47 02/05/09 Updates Pegasus
ESX350-200901408-BG 20:54:45 02/05/09 Updates SATA Drivers
ESX350-200901409-SG 20:55:55 02/05/09 SNMP Security Update
ESX350-200901410-SG 20:56:48 02/05/09 Security Update for libxml2
Running service pegasus restart ends the process, but it will be back at 100% within a few days.
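A listing that long is easy to misread, so here is a small Python sketch that pulls out the bundles whose summary mentions a keyword such as Pegasus. It assumes each row has the shape shown above (bundle name, HH:MM:SS, MM/DD/YY, summary). The helper names `parse_bundles` and `bundles_mentioning` are hypothetical, and the sample rows are copied from the listing above.

```python
import re

# Bundle name, time (HH:MM:SS), date (MM/DD/YY), then the free-text summary.
BUNDLE_LINE = re.compile(
    r"^\s*(\S+)\s+(\d{2}:\d{2}:\d{2})\s+(\d{2}/\d{2}/\d{2})\s+(.+?)\s*$"
)


def parse_bundles(listing):
    """Return (name, time, date, summary) tuples for rows that look like bundles."""
    bundles = []
    for line in listing.splitlines():
        match = BUNDLE_LINE.match(line)
        if match:  # header and title lines do not match and are skipped
            bundles.append(match.groups())
    return bundles


def bundles_mentioning(bundles, keyword):
    """Return the names of bundles whose summary contains `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    return [name for name, _time, _date, summary in bundles
            if keyword in summary.lower()]


# A few rows copied from the listing in this post.
LISTING = """\
Installed software bundles:
3.5.0-64607 21:06:12 02/21/08 Full bundle of ESX 3.5.0-64607
ESX350-200810205-UG 10:10:32 02/04/09 Updates CIM and Pegasus
ESX350-200810208-UG 10:13:10 02/04/09 Updates esxupdate documentation
ESX350-200901407-BG 20:53:47 02/05/09 Updates Pegasus
"""

print(bundles_mentioning(parse_bundles(LISTING), "pegasus"))
```

Run against the full listing, the same search would show whether both Pegasus-related bundles are present on a host.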
Say hello to running VMware ESX 3.5 on the Sun x64 Opteron family.
We have 7 X4600 M2s in the field, and we've gone from 3.0.2 and stepped through 3.5, U1, U2, U3 and soon U4. Throughout the whole adventure, 3.5 has given us cimserver/pegasus problems, to the point where we have turned off the pegasus service altogether. I would highly advise doing the same. We've had so many support issues with VMware and these servers; at one point they told us we weren't on the HCL, though that turned out to be a documentation issue on their end. VMware has said this problem won't be fixed until U5, and there's a partial fix in U4. We've also had many PSODs which VMware couldn't explain. We are looking at maybe moving to the X4140s, since we're nowhere near using all the bays on the X4600 and they have a newer motherboard chipset.
A side note about the X4600 M2: please make sure you monitor your local hard drives through syslog if you're not booting from SAN, because neither the ILOM nor the VI3 Health Status will monitor them for you. Sun told me they don't have the LSI SAS controller talking to the ILOM yet; when I last talked to them, at VMworld Vegas 2008, they said maybe by the end of this year. They also said that applies to their whole x64 family, but I haven't checked into that.
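As a starting point for the syslog monitoring mentioned above, here is a minimal Python sketch that flags log lines containing disk-related error phrases. The phrases in `DISK_ERROR_PHRASES` are assumptions and would need to be adjusted to whatever your controller driver actually logs. The sample lines are invented for illustration and are not taken from a real X4600.

```python
# Phrases that often accompany disk trouble in Linux kernel logs.
# This list is an assumption, not an exhaustive or vendor-confirmed set.
DISK_ERROR_PHRASES = ("i/o error", "medium error", "hardware error")


def suspicious_lines(log_lines, phrases=DISK_ERROR_PHRASES):
    """Return the log lines that contain any of the given phrases (case-insensitive)."""
    hits = []
    for line in log_lines:
        lowered = line.lower()
        if any(phrase in lowered for phrase in phrases):
            hits.append(line)
    return hits


# Invented sample lines for illustration only.
SAMPLE_LOG = [
    "Feb  5 10:00:01 esx01 kernel: SCSI error: host 0 channel 0 id 1 lun 0: Medium Error",
    "Feb  5 10:00:02 esx01 sshd[411]: session opened for user root",
    "Feb  5 10:00:03 esx01 kernel: end_request: I/O error, dev sda, sector 12345",
]

for hit in suspicious_lines(SAMPLE_LOG):
    print(hit)
```

In practice you would read the lines from the log file, or from a central syslog server, and send the hits to whatever alerting you already use.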
Thanks for the info. I have been leaning towards disabling the Pegasus service for a while now. It has taken Sun more than a month to get our support on VMware licenses worked out and it's still not resolved. Otherwise I would have logged a support ticket on the issue.
Luckily no PSOD for us yet with the x4140. There have been some quirks but nothing major yet. Occasionally the keyboard controller (probably triggered from the ILOM console) freaks out and displays a lot of messages on the system's console until it calms down. In the beginning, the x4140 with ESX installed would not talk to the 4 NVIDIA NICs in the server when a device was in PCIe slot one (LSI adapter). Windows Server talked to the NICs just fine no matter what slot the LSI adapter was in. Very weird behavior, but a BIOS update resolved this issue. Other than that, never try to update the SP BIOS while the server is running. Probably something I should have learned beforehand. Dell servers have taught me bad habits about updating the BIOS while the system is running.
Good info on the LSI controller and the ILOM. I'll have to look into that some.
Thanks for this post.
We are currently using the same hardware and are having the same problem with the pegasus service.
So we are also waiting for VMware to fix this problem.
We contacted Sun on this issue. They said a partial update is in U4 of ESX and a full fix should be in U5. After increasing the memory given to the service console from 200MB to 800MB and going to U4, we have not seen this issue return.