I just noticed an issue that I reproduced twice in a row. If I make a bunch of changes via the VI Client on ESXi 6 (adding a vSwitch, creating a couple of new VMs, etc.) and then yank the power on the server, all of the changes are gone when I power the host back up. If I shut the host down in an orderly fashion, changes are saved, but pulling the plug results in all changes being lost.
If you're wondering why I keep pulling the plug on my host, it's because I am playing with a bridged pfSense cluster, and it gets unhappy if I forget to kill the bridge on one of the nodes and fire them both up at the same time with both sides of the bridge connected on both nodes. Instant network loop, unresponsive host, and the need to just force the shutdown.
Anyway, I can see this being a big problem if you have to reboot the host every time a change is made to make sure it's committed to disk. Surely this isn't the way it's meant to be. I don't believe ESXi 5.5 and prior did this - at least, I never noticed it.
Has anyone else run into this? Is there a setting I need to tweak to tell it to flush to disk more regularly (much more - the first time I did this, the host had been up for 2 days with changes made, and all were lost after the power "outage").
Could you run ./sbin/autobackup.sh and see if the changes remain after a power pull? This is a cron job that should run every hour; you can check by running cat /var/spool/cron/crontabs/root
Where is your scratch location?
The command was /sbin/auto-backup.sh on my host. It's set to run every minute in the crontab:
#min hour day mon dow command
1 1 * * * /sbin/tmpwatch.py
1 * * * * /sbin/auto-backup.sh
0 * * * * /usr/lib/vmware/vmksummary/log-heartbeat.py
*/5 * * * * /sbin/hostd-probe ++group=host/vim/vmvisor/hostd-probe
Looks like the command ran fine when I executed it.
I'm not sure how to check where my scratch location is. Looking at the script, it looks like it might be /tmp/auto-backup.$$?
All that being said, I did notice that vCenter (which I'd cloned to an SSD before one of these memory losses) was completely corrupted on the old RAID1 spindle array where ESX is installed. I noticed it when I rebooted and went to shut down vCenter: it was complaining about corrupted disks and wouldn't start. Then I realized it was trying to start the old copy, so I turfed that from inventory and reinventoried the one on the SSD. I did find it sort of odd that the old copy was damaged, though. Maybe I have a disk issue that's leading to config changes not saving regularly - though the cron script seemed to be happy when I ran it manually. I wonder if cron is borked?
To test, I created a new vSwitch, ran the auto-backup script, then yanked the plug and rebooted to see if my new vSwitch was still there, and it was, so the script is working - just not running as scheduled, it seems.
Ok, at least we have some success.
You should be able to see the cron jobs in the syslog at /var/log/syslog.log.
Tail the log and you should see output like this:
2015-06-24T03:01:01Z crond: crond: USER root pid 146297 cmd /sbin/auto-backup.sh
This should mean you can confirm if it was running earlier.
For the scratch location, if you have the VI client, go to Configuration, then Advanced Settings under Software, then ScratchConfig, and check that the location is going to a datastore (I assume it will be). There is a way to see this from SSH but it escapes me right now.
I don't seem to have a /var/log/syslog.log. Maybe it's somewhere else on ESXi 6? I don't seem to have a software entry under Advanced, either.
I did find the scratch directory, though - it's in /scratch (which maps to a spot on a VMFS volume, /vmfs/volumes/5585ce88-c2bf9bea-18bc-001517af1ce4). In there is a log directory, and in there is my syslog.log file. I did a cat syslog.log | grep backup and got this:
2015-06-20T21:01:01Z crond: crond: USER root pid 35983 cmd /sbin/auto-backup.sh
2015-06-21T20:01:01Z crond: crond: USER root pid 35638 cmd /sbin/auto-backup.sh
2015-06-21T21:01:01Z crond: crond: USER root pid 33726 cmd /sbin/auto-backup.sh
2015-06-21T22:01:01Z crond: crond: USER root pid 38142 cmd /sbin/auto-backup.sh
2015-06-21T23:01:01Z crond: crond: USER root pid 40674 cmd /sbin/auto-backup.sh
2015-06-22T00:01:01Z crond: crond: USER root pid 43079 cmd /sbin/auto-backup.sh
2015-06-22T01:01:01Z crond: crond: USER root pid 46609 cmd /sbin/auto-backup.sh
2015-06-22T02:01:01Z crond: crond: USER root pid 49948 cmd /sbin/auto-backup.sh
2015-06-22T03:01:01Z crond: crond: USER root pid 53823 cmd /sbin/auto-backup.sh
2015-06-22T04:01:01Z crond: crond: USER root pid 57597 cmd /sbin/auto-backup.sh
2015-06-23T21:01:01Z crond: crond: USER root pid 36841 cmd /sbin/auto-backup.sh
2015-06-24T01:01:01Z crond: crond: USER root pid 36783 cmd /sbin/auto-backup.sh
2015-06-24T02:01:01Z crond: crond: USER root pid 36964 cmd /sbin/auto-backup.sh
It looks like once per hour (the host hasn't always been on, as you can probably tell). I'll make a change, then leave the host on for a bit, then check the log again for new entries. If I spot one, I'll yank the power and see what we've got.
I could have sworn the host was on for well over an hour before I had to yank the plug, and my config changes were gone, but the log seems to imply the backup script is running.
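As an aside, here's a quick way to confirm the hourly cadence from log lines in that format (a minimal Python sketch, not anything ESXi provides; the sample lines are copied from the grep output above):

```python
from datetime import datetime

# Sample lines in the same format as the syslog grep output above
log_lines = [
    "2015-06-22T00:01:01Z crond: crond: USER root pid 43079 cmd /sbin/auto-backup.sh",
    "2015-06-22T01:01:01Z crond: crond: USER root pid 46609 cmd /sbin/auto-backup.sh",
    "2015-06-22T02:01:01Z crond: crond: USER root pid 49948 cmd /sbin/auto-backup.sh",
]

# Parse the leading ISO-8601 timestamp from each line
stamps = [datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%SZ")
          for line in log_lines]

# Print the gap between consecutive runs; an hourly schedule shows 1:00:00
for earlier, later in zip(stamps, stamps[1:]):
    print(later - earlier)  # -> 1:00:00 for back-to-back hours
```

Any gap bigger than an hour (like the jump from 04:01 to 21:01 above) just means the host was off in between.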
That's interesting; you should definitely have a /var/log folder - most of the files under there are symbolic links to files in the /scratch/log/ folder.
[root@Thor:/var/log] ls -l
lrwxrwxrwx 1 root root 21 Jun 22 04:48 Xorg.log -> /scratch/log/Xorg.log
lrwxrwxrwx 1 root root 21 Jun 22 04:48 auth.log -> /scratch/log/auth.log
-rw-rw-rw- 1 root root 35982 Jun 22 04:48 boot.gz
lrwxrwxrwx 1 root root 22 Jun 22 04:48 clomd.log -> /scratch/log/clomd.log
-rw-r--r-- 1 root root 23626 Jun 22 04:48 configRP.log
lrwxrwxrwx 1 root root 25 Jun 22 04:48 dhclient.log -> /scratch/log/dhclient.log
lrwxrwxrwx 1 root root 20 Jun 22 04:48 epd.log -> /scratch/log/epd.log
-rw-r--r-- 1 root root 3995 Jun 22 04:48 esxcli.log
lrwxrwxrwx 1 root root 26 Jun 22 04:48 esxupdate.log -> /scratch/log/esxupdate.log
This is on version 6.0.0 2159203.
I wonder if you have something wrong at a deeper level?
It's possible. The host is a fresh install from a few days ago, so you wouldn't think so - but I've learned to expect just about anything from IT related stuff these days *grin*.
This is my /var/log folder:
[root@localhost:/var/log] ls -l
-rw-r--r-- 1 root root 22932 Jun 24 03:00 configRP.log
-rw-r--r-- 1 root root 1796 Jun 24 03:00 esxcli.log
drwxr-xr-x 1 root root 512 Jun 24 03:02 ipmi
-rw-r--r-- 1 root root 4563 Jun 24 03:00 jumpstart-stdout.log
-rw------- 1 root root 3363 Jun 24 03:02 smbios.bin
-rw------- 1 root root 9362 Jun 24 03:00 sysboot.log
-rw------- 1 root root 64 Jun 24 04:04 tallylog
drwxr-xr-x 1 root root 512 Jun 24 03:00 vmware
Mine is 6.0.0 2494585:
[root@localhost:/var/log] vmware -vl
VMware ESXi 6.0.0 build-2494585
VMware ESXi 6.0.0 GA
Well, it's 9:09pm now, and the logs were indicating the backup script was running 1 minute past the hour - and no new log entry. Hmm. Maybe something deeper broken, like you say.
Looking at that folder, you have files in there, but no symbolic links at all.
Can you try running
ps | grep vmsyslogd
to see if the process is running.
This is the only KB I can see that looks similar to your issue. Have you configured a syslog server? Bit of a long shot, though.
ps | grep vmsyslogd returns a bunch of nothing. I don't seem to have one of those.
I think it's this Intel server's RAID controller. Something weird is happening to data on the spindles, which the Intel controller is responsible for (and that's where ESX is installed, as well as where that corrupted old vCenter instance was. I can't even delete all the files from the old vCenter off the spindles now - there's half a dozen that refuse to go).
Fortunately, I moved all my VMs to the Samsung Pro 850 SSD previously, and those seem to be happy. The Samsung SSD is connected to a 16-port LSI HBA I had hanging around.
At this point, my little Gigabyte BRIX is the reliable host. Sadly, it's too small for me to stage an Exchange 2013 cluster on, thus me using this Intel thing. I think I shall stop and wait for the new Supermicro server to arrive before I bother going forward, just so I can be more sure that I'm not wasting my time and ending up with slightly screwed up installs of stuff that will haunt me later.
I appreciate your help, Rich. This one looks like corruption-town, and is beyond repair.
Just to solve one of the mysteries - why the cron job runs once per hour and not every minute. Your cron config said (from above):
1 * * * * /sbin/auto-backup.sh
That's a common misunderstanding: it does not mean the script runs every minute, but at 1 minute past the hour (hh:01).
You have to use the following syntax to run it every minute:
*/1 * * * * /sbin/auto-backup.sh
or every 5 minutes with */5 for example.
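To illustrate the difference, here's a minimal Python sketch of how a cron minute field is interpreted (this is just an illustration, not how crond itself is implemented, and it only handles the three forms discussed here):

```python
def minute_field_matches(field: str, minute: int) -> bool:
    """Check whether a cron minute field matches a given minute (0-59).

    Handles only: '*' (every minute), a bare number (that exact minute),
    and '*/N' step syntax (every N minutes).
    """
    if field == "*":
        return True
    if field.startswith("*/"):        # step: fires every N minutes
        return minute % int(field[2:]) == 0
    return minute == int(field)       # bare number: fires at that exact minute

# '1' matches only hh:01 -- once per hour, not every minute
print([m for m in range(60) if minute_field_matches("1", m)])     # [1]
# '*/5' matches every 5 minutes: hh:00, hh:05, hh:10, ...
print([m for m in range(60) if minute_field_matches("*/5", m)])   # [0, 5, 10, ..., 55]
# '*/1' matches all 60 minutes of the hour
print(len([m for m in range(60) if minute_field_matches("*/1", m)]))  # 60
```

So `1 * * * *` fires once per hour at hh:01, which matches the timestamps in the syslog output exactly.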
Ah ha! I wondered what that */5 in my cron config meant. I figured out the 1 meant 1 minute past the hour when I looked at the logs, but the */5 made no sense to me at all until now. Thanks for the heads up!