My lab instance, a vSphere 5.1 vCenter Server Appliance with 2 clusters, 8 hosts (all vSphere 5.1), and 80 VMs, exhibits these exact same symptoms. It uses the built-in database, not an external one. I can say for certain that the VCSA time zone has not been touched since the initial install, so that is not the issue. My issues started a couple of weeks ago or more: the vSphere Client would crash every 15 minutes or so, but I could log back in, so I just did that and worked on other, more pressing matters. The lab is important, but it would have to wait. It progressed to much worse this week, with neither the vSphere Client nor the Web Client able to stay logged in for more than a couple of seconds. I haven't had time to deal with this until now, so here we go.
You know you are in trouble when, on the VCSA management page, vCenter Server > Summary > Storage > Coredumps reads 100%! The core dumps or "coredumps" are stored on the VCSA in /storage/core, in files named core.vpxd-worker.xxxx, where 'x' is a number. I emptied that directory via SSH with the rm command, but of course it just started filling up again with the same files. Time to break out the toolbox.
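For reference, this is roughly what I ran on the VCSA over SSH to confirm and clear the dumps (paths as on my 5.1 appliance; your mileage may vary):

```shell
# Check how full the core dump partition is
df -h /storage/core
# List the offending vpxd worker dumps
ls -lh /storage/core/core.vpxd-worker.*
# Remove them to free space (they come right back while the underlying problem persists)
rm -f /storage/core/core.vpxd-worker.*
# Verify usage dropped
df -h /storage/core
```

Clearing the files only buys you room to work; the point of the rest of this post is finding what keeps generating them.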
After 20 minutes of research and an easy fix, I made a change that has let me log in and perform operations in the vSphere Client. So far, so good: no more core dump files two hours later. It may be best to perform a basic, surface-level health check (or use LucD's) before rebuilding a VCSA from scratch. That can be a lot of work and has consequences for VDS, clusters, policies, etc. Did you document everything?
These are the steps and actions I took to get the lab back online:
From the VCSA management page, I stopped and started the vCenter and Inventory services, but it had no effect.
From the VCSA management page, I rebooted the appliance, but again no effect.
I stopped the vCenter Server and Inventory services to keep them off during troubleshooting.
I logged into each host with the vSphere Client first to check all the basics: DNS name, time, verifying that IPv6 is disabled (I had problems with it enabled before, even though it wasn't in use), etc.
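A quick sketch of the same basic checks from an SSH session on each host (hostname and date are standard busybox commands on ESXi 5.x; I verified the IPv6 setting in the client rather than from the shell):

```shell
# The name the host thinks it has -- should match what vCenter and DNS say
hostname
# Time in UTC -- the hosts and the VCSA should agree (check NTP config if they drift)
date -u
```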
Then (still in the vSphere Client) I checked the host logs under Home > Administration > System Logs for odd patterns or excessive errors in any of the logs (there are several; click the drop-down). The first 4 hosts I checked all looked normal, but one was logging what looked like an excessive number of errors related to delta disks from snapshots. That rang a bell: there was a VM I noticed I had trouble with recently that has a bunch of snapshots and delta disks from some failed backups. I didn't have time to look into it a month or so ago when I first saw it (it is a lab, and we never have the time, do we?).
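If you prefer the shell, the same pattern shows up in the host logs; a rough sketch from SSH on the host (standard ESXi 5.x log locations):

```shell
# Rough count of delta-disk related messages in the kernel log
grep -ci "delta" /var/log/vmkernel.log
# Most recent snapshot-related entries from the host agent log
grep -i "snapshot" /var/log/hostd.log | tail -20
```

One noisy host standing out against quiet siblings is what pointed me at the problem VM.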
The VM was already shut down, so I removed it from the host inventory, then SSH'd to the host and ran /sbin/services.sh restart to restart the management agents (you can also use the DCUI to accomplish the same).
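For anyone who wants to do the removal from the shell too, a sketch using vim-cmd, ESXi's inventory CLI (the VM name MyBrokenVM is a made-up placeholder, and the awk line assumes that name is unique in the inventory):

```shell
# List the inventory; the first column is the numeric Vmid
vim-cmd vmsvc/getallvms
# Grab the Vmid for the problem VM by name (placeholder name, assumed unique)
VMID=$(vim-cmd vmsvc/getallvms | awk '/MyBrokenVM/ {print $1}')
# Remove the powered-off VM from this host's inventory (does not delete its files)
vim-cmd vmsvc/unregister "$VMID"
# Restart the management agents (same effect as the DCUI option)
/sbin/services.sh restart
```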
I started the vCenter Server and Inventory services and, voila, I could log in and manage my lab again.
Side notes:
I got this message on all of the VMs that were running on that host. The VMs never went down, but an HA failover was attempted:
vSphere HA virtual machine failover failed
I still have to work on the snapshot problem, which I suspect happened during backups by the VMware Data Recovery appliance, but I haven't confirmed the root cause.
Message was edited by: aSmoot