vCenter Server Appliance (SLES) out of disk space, service cannot start

benbradley · ‎07-15-2013

Hi everyone

We had a scheduled maintenance window over the weekend while some work was carried out elsewhere in the building. As a precaution we shut down the infrastructure that we still keep onsite (the rest is in colo). So all we have is a 2-node ESXi cluster, managed by vCenter Server Appliance (Novell SUSE Linux Enterprise 11 64-bit). It only has 6 VMs running on it, as most of them have been migrated offsite.

This morning we started up the OpenFiler NFS storage box and the ESXi nodes. All VMs started ok, including the vCenter Server Appliance.

However I was unable to connect to the vCenter Appliance using vSphere Client. I could connect to the ESXi nodes directly without problems. DNS hostnames all resolve to the IP addresses without problems.

Looking at the vSphere Client logs I see the following message each time I tried to connect:

System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it 10.10.0.205:443

So I connected to the vCenter Appliance over SSH and it looks like it's run out of disk space. Here's the df and mount output:

vlevcenter01:/ # df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       9.8G  9.8G     0 100% /
devtmpfs        4.0G  104K  4.0G   1% /dev
tmpfs           4.0G  4.0K  4.0G   1% /dev/shm
/dev/sda1       130M   18M  105M  15% /boot
/dev/sdb1        20G  3.8G   15G  21% /storage/core
/dev/sdb2        20G  717M   18G   4% /storage/log
/dev/sdb3        20G   19G     0 100% /storage/db

vlevcenter01:/ # mount -l
/dev/sda3 on / type ext3 (rw,acl,user_xattr)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
debugfs on /sys/kernel/debug type debugfs (rw)
devtmpfs on /dev type devtmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,mode=1777)
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
/dev/sda1 on /boot type ext3 (rw,acl,user_xattr)
/dev/sdb1 on /storage/core type ext3 (rw)
/dev/sdb2 on /storage/log type ext3 (rw)
/dev/sdb3 on /storage/db type ext3 (rw)

So the database and root partitions have run out of space. Looking in /var/log/localmessages:

Jul 15 10:10:51 vlevcenter01 startproc: startproc: Empty pid file /var/run/slapd/slapd.pid for /usr/lib/openldap/slapd
Jul 15 10:10:51 vlevcenter01 startproc: startproc: exit status of parent of /usr/lib/openldap/slapd: 1
Jul 15 10:10:52 vlevcenter01 checkproc: checkproc: Empty pid file /var/run/slapd/slapd.pid for /usr/lib/openldap/slapd
It looks like it can't create the pid files for the services because there is no disk space.
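
The check done by eye on the df output above can be turned into a small filter. This is a minimal sketch, not something posted in this thread: the function name full_filesystems and the 90% default are assumptions, and it expects df-style columns on standard input.

```shell
# Print mount points whose usage is at or above a threshold (default 90%).
# Reads df-style output on stdin: 5th column is Use%, 6th is the mount point.
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)          # strip the percent sign
            if (pct + 0 >= limit) print $6, $5
        }'
}

# Usage (on the appliance): df -P | full_filesystems 95
```

Fed the df output from this post, it would report / and /storage/db, the two filesystems at 100%.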

We already have 60GB of vmdks allocated to this VM though we do have space on our datastore to add more. But I'm concerned that the vCenter Appliance will just grow and grow.

1) Is there any scheduled maintenance that runs on the vCenter Appliance to keep disk usage under control? This is what I would expect from an enterprise product; I'm very surprised to see such a problem happen.

2) Where should I start looking to clear some disk space so we can get the vCenter service back up and running?

I'd like to find a longer-term solution to this, otherwise we might have to just keep adding vmdks until our datastore runs out of space.
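
For question 2, a common starting point is to rank directories by size on the full filesystem. The following is a minimal sketch; the function name biggest_dirs and its defaults are assumptions for illustration, not a VMware-provided tool.

```shell
# List the N largest directories under DIR, staying on DIR's filesystem (-x).
# Sizes are in KiB, largest first.
biggest_dirs() {
    dir=${1:-/}
    count=${2:-15}
    du -xk "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Usage (as root on the appliance):
#   biggest_dirs / 20
#   biggest_dirs /storage/db 10
```

The -x flag keeps du from crossing into other mounted filesystems, so a run against / is not inflated by /storage.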

Cheers, B

julienvarela · ‎07-15-2013

Hi,

Did you always have only 2 hosts and 6 VMs in your environment? Maybe it is the log files that take up all the space and not the DB; can you check?

What is the type of database? DB2 or postgres?

If you need to increase the FS, you can check here: "VMware vCenter Appliance (VCVA) /dev/sdb3 100% full" (HiperLogic).

Julien.

Regards, J.Varela http://vthink.fr

benbradley · ‎07-15-2013

Hi, thanks for the reply!

I don't believe we always had 6 VMs, may have had 10-15 initially but that was before I joined. Only ever 2 hosts though. In the last 12 months we have only had 6 VMs.

I believe the database is DB2, judging by the size of the /storage/db/db2 directory (19GB)

Doing a du -shc * from /...

vlevcenter01:/ # du -shc *
7.1M    bin
13M     boot
108K    dev
375M    etc
80K     home
88M     lib
9.9M    lib64
16K     lost+found
4.0K    media
8.0K    mnt
1.2G    opt
0       proc
396K    root
8.3M    sbin
4.0K    selinux
20K     srv
23G     storage
0       sys
504K    tftpboot
15M     tmp
2.3G    usr
5.7G    var
33G     total

/var/log/faillog which looks like a binary file is reporting a size of 41GB.

It would be good to prune some logs down or trim the database if it's storing archived performance data.

I'm not that happy with just adding more storage to this, as it doesn't really solve the actual problem.

Cheers, B

julienvarela · ‎07-15-2013

Ok so, the DB is not the issue.

I am not familiar with the faillog, but you can try this: "Local account lockout in ESX4.0 ??"

But I think you have another issue that is generating a lot of failures. Can you post a tail -f 100 of /var/log/faillog?

Thank you.

Julien.

Regards, J.Varela http://vthink.fr

benbradley · ‎07-15-2013

vlevcenter01:/var/log # tail -f 100 faillog

tail: cannot open `100' for reading: No such file or directory

==> faillog <==

The output from df -h I posted above shows that the /storage/db partition is at 100% usage, so are you sure that's not the problem?
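
The error above comes from the command itself: -f means "follow", so tail treated 100 as a file name. The line count is given with -n. Since faillog is a binary file, tail will not print anything readable from it in any case; the faillog utility from the shadow suite is the usual reader, assuming it is installed on the appliance. A minimal sketch (the helper name last_lines is made up for illustration):

```shell
# Show the last N lines of a text log (default 100).
last_lines() {
    tail -n "${2:-100}" "$1"
}

# Usage on the appliance:
#   last_lines /var/log/messages 100
#   faillog -a      # binary faillog records, if the faillog tool is present
```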

julienvarela · ‎07-15-2013

Oh sorry, I hadn't seen that the DB FS was at 19GB.

So you have multiple issues, on / and on /storage/db, but what I don't understand is how and why faillog has a size of 41GB.

I will try to find more info.

Julien.

Regards, J.Varela http://vthink.fr

ScreamingSilenc · ‎07-15-2013

Check this thread http://communities.vmware.com/thread/301932?start=0&tstart=0 which discusses a similar problem.

Please consider marking this answer "correct" or "helpful" if you found it useful.

benbradley · ‎07-15-2013

I restarted the vCenter Appliance and was able to connect using vSphere Client. Then I modified the following settings under vCenter Server Settings:

Database Retention Policy... Tasks and Events retained for 120 days. Previously they were unticked, presumably allowing for unlimited storage.

But disk usage still remains at 100% on / and /storage/db partitions.

Will the database retention settings reduce that disk usage? Do I need to trigger a database clean-up manually like I did on SQL Server Express?

When should this take effect?

config.log.maxFileNum is 30

config.log.maxFileSize is 52428800

Should I change those options to lower values?
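
For scale, the two values above can be multiplied out. This is a rough sketch that assumes config.log.maxFileNum and config.log.maxFileSize (in bytes) together cap one rotating set of vCenter log files; the thread does not confirm that, and the function name is invented for illustration.

```shell
# Upper bound, in MiB, for a rotating log set of NUM files of SIZE bytes each.
max_log_mib() {
    echo $(( $1 * $2 / 1024 / 1024 ))
}

max_log_mib 30 52428800    # prints 1500
```

Under that assumption the current settings allow roughly 1.5GB for that log set, so lowering them would recover at most that much.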

Thanks, B

julienvarela · ‎07-15-2013

You can reduce the statistics level to 1 if it isn't already set.

Julien.

Regards, J.Varela http://vthink.fr

benbradley · ‎07-15-2013

The statistics options are already set to those values.

I'm going to attempt to remove the 41GB /var/log/faillog file to see if that temporarily clears some space. Though I'm not sure how it's reporting that size as that partition should only be 20GB. I will of course be taking a snapshot first.
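
One possible explanation, not confirmed anywhere in this thread, is that faillog is a sparse file: it holds fixed-size records indexed by numeric UID, so a single very high UID can push the apparent size far beyond the blocks actually allocated on disk. A minimal sketch for comparing the two numbers (the helper name size_report is an assumption):

```shell
# Print the apparent size (bytes, as ls reports it) and the allocated size
# (KiB, as du reports it). A large gap between the two suggests a sparse file.
size_report() {
    apparent=$(ls -ln -- "$1" | awk '{ print $5 }')
    allocated=$(du -k -- "$1" | awk '{ print $1 }')
    echo "apparent_bytes=$apparent allocated_kib=$allocated"
}

# Usage: size_report /var/log/faillog
```

If allocated_kib turns out to be far smaller than the apparent size, deleting the file would free very little space.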

benbradley · ‎07-15-2013

Well, deleting the faillog file hasn't cleared much space. But I tarred two log files that weren't compressed on the last rotate (messages and warn, 2.8GB each) into a bz2 archive, which freed up 5GB.
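
Finding large log files that were left uncompressed, as described above, can be sketched like this. The function name, the size threshold and the list of compressed suffixes are illustrative assumptions; the sketch only lists candidates and changes nothing.

```shell
# List regular files under DIR larger than MIN_BYTES that do not already
# carry a compressed suffix. Nothing is modified.
large_uncompressed() {
    find "$1" -type f -size +"$2"c \
        ! -name '*.gz' ! -name '*.bz2' ! -name '*.xz' -print
}

# Usage: large_uncompressed /var/log 1073741824     # files over 1 GiB
# A file that no process is still writing to could then be compressed
# with bzip2 or gzip.
```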

I still need to get the /storage/db partition down though as it's only showing 2.8MB available.

I'm finding /var/log/ldapmessages filling up with this every 20-30 seconds:

Jul 15 16:32:20 vlevcenter01 slapd[4055]: conn=1002 op=1137 SRCH base="ou=Privileges,dc=virtualcenter,dc=vmware,dc=int" scope=1 deref=0 filter="(objectClass=*)"
Jul 15 16:32:20 vlevcenter01 slapd[4055]: conn=1002 op=1137 SRCH attr=cn vmw-vc-PrivIsOnParent vmw-vc-PrivGroup isDeleted modifyTimestamp entryUUID lastKnownParent
Jul 15 16:32:20 vlevcenter01 slapd[4055]: conn=1002 op=1137 SEARCH RESULT tag=101 err=0 nentries=246 text=
Jul 15 16:32:20 vlevcenter01 slapd[4055]: conn=1002 op=1138 SRCH base="ou=UserRoles,dc=virtualcenter,dc=vmware,dc=int" scope=1 deref=0 filter="(objectClass=*)"
Jul 15 16:32:20 vlevcenter01 slapd[4055]: conn=1002 op=1138 SRCH attr=cn vmw-vc-RoleName vmw-vc-PrivilegeList isDeleted modifyTimestamp entryUUID lastKnownParent
Jul 15 16:32:20 vlevcenter01 slapd[4055]: conn=1002 op=1138 SEARCH RESULT tag=101 err=0 nentries=6 text=
Jul 15 16:32:20 vlevcenter01 slapd[4055]: conn=1002 op=1139 SRCH base="ou=Licenses,ou=Licensing,dc=virtualcenter,dc=vmware,dc=int" scope=1 deref=0 filter="(objectClass=*)"
Jul 15 16:32:20 vlevcenter01 slapd[4055]: conn=1002 op=1139 SRCH attr=cn revision isDeleted modifyTimestamp entryUUID lastKnownParent
Jul 15 16:32:20 vlevcenter01 slapd[4055]: conn=1002 op=1139 SEARCH RESULT tag=101 err=0 nentries=128 text=
Jul 15 16:32:20 vlevcenter01 slapd[4055]: conn=1002 op=1140 SRCH base="ou=LicenseEntities,ou=Licensing,dc=virtualcenter,dc=vmware,dc=int" scope=1 deref=0 filter="(objectClass=*)"
Jul 15 16:32:20 vlevcenter01 slapd[4055]: conn=1002 op=1140 SRCH attr=cn revision isDeleted modifyTimestamp entryUUID lastKnownParent
Jul 15 16:32:20 vlevcenter01 slapd[4055]: conn=1002 op=1140 SEARCH RESULT tag=101 err=0 nentries=3 text=
Jul 15 16:32:20 vlevcenter01 slapd[4055]: conn=1002 op=1141 SRCH base="ou=Licenses,ou=Licensing,dc=virtualcenter,dc=vmware,dc=int" scope=1 deref=0 filter="(objectClass=*)"
Jul 15 16:32:20 vlevcenter01 slapd[4055]: conn=1002 op=1141 SRCH attr=vmw-vc-LicenseFileName cn vmw-vc-LicenseFileContent objectClass isDeleted modifyTimestamp entryUUID lastKnownParent
Jul 15 16:32:20 vlevcenter01 slapd[4055]: conn=1002 op=1141 SEARCH RESULT tag=101 err=0 nentries=128 text=

If there's some sort of auth problem then that might explain the massive faillog... even though the faillog didn't appear to be that big after all as it didn't release the space.
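
To judge how fast ldapmessages is growing, the entries can be counted per minute. A minimal sketch that assumes the syslog timestamp layout shown above (month, day, HH:MM:SS); the function name is made up for illustration:

```shell
# Count syslog lines per minute. Reads the log on stdin and prints
# "<count> <month> <day> <HH:MM>", sorted by timestamp text.
lines_per_minute() {
    awk '{ n[$1 " " $2 " " substr($3, 1, 5)]++ }
         END { for (k in n) print n[k], k }' | sort -k2
}

# Usage: lines_per_minute < /var/log/ldapmessages | tail -n 5
```

A steady count every minute would point to routine polling rather than a burst of failures.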

Anyone got any idea what's going on here as this is driving me insane.

Cheers, B

benbradley · ‎07-15-2013

Just thought I would try editing the size of the VMDK files attached to the vCenter Server Appliance VM but the Provisioned Size box is greyed out in Edit Settings.

How about this for a plan:

Add a new larger VMDK

Copy over the files currently mounted in /storage/db

Edit fstab to mount the new VMDK as /storage/db on startup and then reboot the vCenter Server Appliance.

Is there any reason why that might not work?
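
The plan above could look roughly like the following. This is a sketch under assumptions, not a tested procedure: /dev/sdc1 and /mnt/newdb are hypothetical names for an already created partition on the new VMDK and a temporary mount point, the fstab options are a guess, and the vCenter and database services should be stopped before copying. The function only prints the commands so they can be reviewed first.

```shell
# Print, without running, the commands for moving /storage/db to a new disk.
plan_db_move() {
    new_dev=$1    # hypothetical, e.g. /dev/sdc1 (partition on the new VMDK)
    tmp_mnt=$2    # hypothetical, e.g. /mnt/newdb (temporary mount point)
    cat <<EOF
mkfs.ext3 $new_dev
mkdir -p $tmp_mnt
mount $new_dev $tmp_mnt
cp -a /storage/db/. $tmp_mnt/
umount $tmp_mnt
# in /etc/fstab, replace the /dev/sdb3 line for /storage/db with (options assumed):
$new_dev /storage/db ext3 defaults 1 2
EOF
}

plan_db_move /dev/sdc1 /mnt/newdb
```

Copying with cp -a keeps ownership and permissions, which the database files are likely to depend on.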

julienvarela · ‎07-15-2013

No reason, if you are doing this, it should work.

For more safety, you can create a clone of your VM.

Julien.

Regards, J.Varela http://vthink.fr

benbradley · ‎07-16-2013

I added some extra space to the /storage/db partition so hopefully the DB can grow a bit if it needs to.

Is there any way I can check that maintenance is running on the DB2 database correctly? I don't really want to have to keep adding 30GB each time.

Cheers, B

DavidSilva77 · ‎06-06-2014

Hi bro, I had the same problem and fixed it by following this KB:

VMware KB: Increase the disk space in vCenter Server Appliance

azuziel · ‎06-16-2017

I had the same issue. The easiest way to fix it is to log in as root, cd to /var/log and delete all the ldapmessage-date log files. Problem solved.
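
A more cautious variant of that cleanup lists the rotated files before removing anything. The pattern ldapmessages-* is an assumption about how the rotated files are named (the active log is /var/log/ldapmessages according to the earlier posts); check it against a real directory listing first.

```shell
# List rotated ldapmessages files under DIR (default /var/log) without
# matching the active log itself.
rotated_ldap_logs() {
    find "${1:-/var/log}" -type f -name 'ldapmessages-*' -print
}

# Review the list, then delete:
#   rotated_ldap_logs | while IFS= read -r f; do rm -- "$f"; done
```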
