After upgrade to VCSA 6: High cpu load after a few hours?

blackcopy · ‎03-25-2015

Hi!

I upgraded our VCSA appliance from 5.5U2 to 6 and everything seemed to go fine, but after it had run for about 12 hours, the appliance began to eat up both vCPUs almost completely, rendering the web client as well as any SOAP interfaces pretty useless.

Diagnosing the cause is a little bit difficult for me because even the SSH daemon refuses to accept any connections during those high load periods.

A few times I managed to get vimtop output, which basically told me that the "CIM health service" was taking up more than 100% CPU time. After this, I noticed that the health stats for my HP ProLiant G5 (still using a 5.5U2/HP image) were missing in the web client, although they are still accessible using WBEM directly on the host. I then simply deactivated the CIM server on that specific host ... but it didn't help.

When I look at the VCSA's dmesg output, there are dozens of messages like this one:

IPfilter Dropped: IN=eth OUT= MAC:ff:ff:ff:ff:ff:ff:ff:ff (...) SRC=(varies) DST=255.255.255.255 LEN=68 TOS=0x00 PREC=0x00 TTL=128 ID=26090 PROTO=UDP SPT=58980 DPT=1947 LEN=48

Is this somehow related?
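One way to judge whether these drops matter is to tally them by source address and destination port. The sketch below assumes every line carries SRC= and DPT= fields as in the sample above; summarize_drops is a made-up name for illustration, not a VMware tool. For what it's worth, IANA lists port 1947 as SentinelSRM (a license manager), so broadcasts like this are probably unrelated network noise, but a tally makes that easier to confirm.

```shell
#!/bin/bash
# Tally "IPfilter Dropped" lines by source address and destination port.
# Assumes SRC=<addr> and DPT=<port> fields as in the sample line above.
# summarize_drops is an illustrative name, not part of the appliance.
summarize_drops() {
    grep 'IPfilter Dropped' |
    awk '{
        src = ""; dpt = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^SRC=/) src = substr($i, 5)   # text after "SRC="
            if ($i ~ /^DPT=/) dpt = substr($i, 5)   # text after "DPT="
        }
        if (src != "" && dpt != "") count[src " " dpt]++
    }
    END {
        # One line per (source, port) pair: <count> <source> <port>
        for (k in count) print count[k], k
    }' | sort -rn
}

# Usage on the appliance:
#   dmesg | summarize_drops
```

If a single source and port account for nearly all drops, the firewall is just discarding one chatty broadcaster, which by itself should not cost much CPU.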

So my question is: does anyone have an idea how to troubleshoot this issue and track down the root cause?

Nubje · ‎04-01-2015

Hi,

I am having the same problem: the CPU load keeps climbing and finally reaches 100%. I don't know what is causing it, but I guess it's the same problem you have. I also upgraded from v5.5u2 to v6.0 recently.

Did you fix it already? How can I check whether I have the same cause as you?

Best regards,

Roel.

blackcopy · ‎04-01-2015

I don't have a solution yet. After upgrading all hosts to ESXi 6, the CIM communication seems to be working again but still, those 100% periods reoccur every few hours.

Sometimes, vCenter reports that the vPostgres service is causing this heavy workload ... which sounds curious to me because my cluster only contains three hosts with about 15 VMs (most of them are powered down).
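To see whether a single table explains the vPostgres load, one option is to list the largest tables in the database. This is only a sketch: largest_tables_sql is a hypothetical helper that prints a query built from standard PostgreSQL catalog functions, and the psql invocation in the usage comment is the one quoted later in this thread, whose path may differ on VCSA 6.

```shell
#!/bin/bash
# Print a query that lists the N largest tables (data plus indexes).
# largest_tables_sql is an illustrative helper, not a VMware tool.
largest_tables_sql() {
    local limit="${1:-10}"
    case "$limit" in
        ''|*[!0-9]*) echo "limit must be a non-negative integer" >&2; return 1 ;;
    esac
    cat <<SQL
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT ${limit};
SQL
}

# Usage (assumed path and credentials, taken from the steps in this thread):
#   largest_tables_sql 10 | /opt/vmware/vpostgres/1.0/bin/psql -d VCDB vc
```

An event table sitting far above everything else in this list would be a hint that the database is busy with event data rather than with inventory.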

michaelgioia · ‎10-05-2015

Same.. it's a mess.

I can't even log into :5480 to enable shell and so I can't get into bash, issue a 'top', and see what is going on....

The load is so all-consuming that the appliance is just unresponsive.

I'm running the latest VCSA, 6.0.0-3040890.
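Since the shell is unreachable during the spikes, it can help to leave a small logger running beforehand so the evidence is already on disk afterwards. A minimal sketch, assuming the appliance ships the usual procps ps; busy_procs and the log path are made-up names for illustration.

```shell
#!/bin/bash
# Filter "ps -eo pcpu,pid,comm" output down to processes at or above a
# CPU threshold (percent). The header line is skipped.
# busy_procs is an illustrative name, not an appliance command.
busy_procs() {
    local threshold="${1:-50}"
    awk -v t="$threshold" 'NR > 1 && $1 + 0 >= t { print }'
}

# Usage: record a timestamped snapshot every 60 seconds (stop with Ctrl-C).
# The log path is an assumption; pick any partition with free space.
#   while true; do
#       date
#       ps -eo pcpu,pid,comm | busy_procs 50
#       sleep 60
#   done >> /var/tmp/cpu-snapshots.log
```

Once the appliance settles down again, the log shows which process names were above the threshold during the spike.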

michaelgioia · ‎10-05-2015

There.. just settled down..

I can get a horde of /var/log collateral if anyone has a support contract to raise this with VMware. I currently do not.

chaithu4u · ‎10-06-2015

Apply the 5.5.5.190 patch to reduce the number of events created. For more information, see VMware vSphere Data Protection 5.5.x appliance /space partition fills up due to Postgres events (206....

To work around the issue, you must delete the large number of vpx_event entries from the VCSA database:

1. Take a backup of the VCSA database. For more information, see Backing up and restoring the vCenter Server Appliance vPostgres database (2034505).
2. Start an SSH session to the VCSA. When you are prompted to log in, enter root as the username; the default password is vmware.
3. Stop the vmware-vpxd service. For more information, see Stopping, starting, or restarting vCenter Server Appliance services (2054085).
4. Log in as the postgres user by running this command:

   su - postgres

5. Log in to the database using the password from the earlier command:

   /opt/vmware/vpostgres/1.0/bin/psql -d VCDB vc

6. Connect to VCDB by running this Postgres command:

   \c VCDB

7. Delete the vpx_event entries:
   - Run this command to delete all entries:

     DELETE FROM vpx_event;

   - To delete a specific timeframe from the vpx_event table:
     - Run this command to delete entries from the past 5 minutes:

       DELETE FROM vpx_event WHERE create_time > now() - interval '5 minutes';

     - Run this command to delete entries from the beginning until 7 days ago:

       DELETE FROM vpx_event WHERE create_time < now() - interval '7 days';

     - Run this command to delete entries in a time range:

       DELETE FROM vpx_event WHERE create_time BETWEEN '01 jun 2013' AND '01 jul 2013';

   - Verify the number of entries by running queries similar to:

     SELECT count(*) FROM vpx_event;
     SELECT count(*) FROM vpx_event WHERE event_type in ('vim.event.UserLogoutSessionEvent','vim.event.UserLoginSessionEvent');

8. Restart the vpxd service by running this command:

   service vmware-vpxd start
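The age-based delete and the count check from the steps above can be combined so that the effect is visible before anything is made permanent. This is a sketch, not part of the VMware procedure: purge_events_sql is a hypothetical helper that only prints SQL, and it ends the transaction with ROLLBACK unless COMMIT is passed explicitly.

```shell
#!/bin/bash
# Print a transaction that counts vpx_event rows, deletes those older than
# N days, and counts again. Ends with ROLLBACK (a dry run) by default.
# purge_events_sql is an illustrative helper, not a VMware tool.
purge_events_sql() {
    local days="$1" finish="${2:-ROLLBACK}"
    case "$days" in
        ''|*[!0-9]*) echo "days must be a non-negative integer" >&2; return 1 ;;
    esac
    case "$finish" in
        COMMIT|ROLLBACK) ;;
        *) echo "second argument must be COMMIT or ROLLBACK" >&2; return 1 ;;
    esac
    cat <<SQL
BEGIN;
SELECT count(*) AS before_delete FROM vpx_event;
DELETE FROM vpx_event WHERE create_time < now() - interval '${days} days';
SELECT count(*) AS after_delete FROM vpx_event;
${finish};
SQL
}

# Usage, as the postgres user with vmware-vpxd stopped (psql path as above):
#   purge_events_sql 7        | /opt/vmware/vpostgres/1.0/bin/psql -d VCDB vc   # dry run
#   purge_events_sql 7 COMMIT | /opt/vmware/vpostgres/1.0/bin/psql -d VCDB vc   # for real
```

The dry run shows how many rows would remain without changing anything, which is a cheap check that the interval is the intended one. It is not a substitute for the backup in step 1.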

SteveGalbincea · ‎07-28-2016

Were any of you able to find a solution for this? I am experiencing the exact same issue at a client today. Please let me know if so, thanks!

Steve
