jsbattig
Contributor

MacPro 6,1 fails to boot with state.tgz CRC check error

We have two Mac Pro 6,1 machines at the company. One is being prepared for use by our engineering team, while the other was loaded with ESXi 7.0.2.

 

The one with ESXi 7.0.2 has shown problems on every kind of reboot (including a reboot when coming back from maintenance mode): it fails while attempting to decompress the state.tgz file. Essentially the machine fails to boot and the only solution is to reinstall ESXi.

We were able to reinstall ESXi while keeping the datastore (and all the VMs on it), then reconfigure the machine and re-register the VMs. Very painful when you take into account that this machine is in a datacenter 30 minutes to 2 hours' drive from the nearest IT person.

At one point we ran some tests with 7.0.0 on a Mac Mini and didn't hit this CRC check of death, so we tried another Mac Pro system with 7.0.0 rather than 7.0.2 (the latest at the time of this post). We tried multiple times with reboots, shutdowns, config changes in between, installing new VMs, etc. and could not reproduce the state.tgz CRC check of death. A fluke? Don't know yet, we will continue trying.

I'm reporting this here because this definitely smells like something is broken with 7.0.2 and Mac Pros.

The bad behavior has been reproduced on a Mac Pro with 12 cores and on another Mac Pro with 4 cores, always with 7.0.2. So far we could not reproduce it with 7.0.0.

4 Replies
jsbattig
Contributor

After multiple reboots it finally happened with 7.0.0 as well.

VMware, this is so annoying! Every reboot is literally a coin toss as to whether ESXi comes back in shape on the Mac Pro.

The last thing I'm trying is moving all the VMs off the local storage onto an iSCSI datastore to see if reducing the data traffic makes this problem less prevalent.

jsbattig
Contributor

Came back to 7.0.2. I think I finally got to the crux of the problem. 

The issue appears to be a bug with ESXi and Mac Pro 2013 where the server shutdown sequence is cut short before local storage buffers are fully flushed to disk. 

When ESXi is in its shutdown sequence, one of the operations it performs is calling /sbin/auto-backup.sh to re-create the package with the base configuration used in the next boot cycle. This effectively re-creates the state.tgz file, which is the one that always fails to unarchive properly on the next boot.

In at least two cases I've also seen VMDKs get corrupted to the point that the VM was unusable and unrecoverable.

Watching the monitor of the ESXi server while it shuts down or reboots, I noticed that the shutdown sequence seems to be cut short. There's a progress bar that doesn't move far to the right, and the machine usually shuts down or reboots during the "shut down of drivers". My hypothesis is that during the shutdown of some of these drivers the machine is simply halted, and any dirty buffers are never written to disk. Maybe there's some forced flushing of buffers later in the sequence, but the process never gets there.

Now going back to the state.tgz file. This file is also rebuilt by an hourly cron job that calls /sbin/auto-backup.sh (see article https://kb.vmware.com/s/article/2001780). There's some logic within this script that decides whether or not to save the state.tgz file. Every time I've run it after making changes following a fresh reboot, it has saved the config. If run right away, it won't. But calling /sbin/backup.sh will always save the changes immediately.

People have been complaining that Mac Pros randomly fail to boot with the dreaded error opening state.tgz. Some people say "50/50". I guess most people reboot after some config change, maybe changing hardware pass-through or some other setting that requires a reboot. Well, in those specific cases a reboot without first running /sbin/auto-backup.sh is guaranteed to end up corrupting state.tgz (I tested it).

The workaround I've been testing, which so far seems to work well, is to enable SSH access, execute /sbin/auto-backup.sh, wait about 60 seconds to let all buffers flush to disk, and finally shut down or reboot.
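
For anyone who wants to script that sequence, here is a minimal Python sketch of it, not a tested fix. Only the /sbin/auto-backup.sh path and the 60-second wait come from the description above. The function names are made up for illustration, and the final `reboot` command is an assumption that should be replaced with whatever you normally use to restart or power off the host. It defaults to a dry run that only prints the steps.

```python
import subprocess
import time

# Path taken from the post; everything else here is an illustrative assumption.
AUTO_BACKUP = "/sbin/auto-backup.sh"
REBOOT_CMD = ["reboot"]  # assumption: substitute your usual reboot/shutdown command


def build_plan(wait_seconds=60, final_cmd=None):
    """Return the ordered steps: save the config, wait, then reboot or shut down."""
    if wait_seconds < 0:
        raise ValueError("wait_seconds must be non-negative")
    return [
        ("run", [AUTO_BACKUP]),
        ("sleep", wait_seconds),
        ("run", list(final_cmd or REBOOT_CMD)),
    ]


def execute(plan, dry_run=True, run=subprocess.check_call, sleep=time.sleep):
    """Carry out the plan in order; with dry_run=True only report the steps."""
    log = []
    for kind, arg in plan:
        if kind == "run":
            log.append("run " + " ".join(arg))
            if not dry_run:
                # check_call raises on a non-zero exit, so a failed backup
                # stops the sequence before the reboot step is reached.
                run(arg)
        else:
            log.append("sleep %d" % arg)
            if not dry_run:
                sleep(arg)
    return log


if __name__ == "__main__":
    for line in execute(build_plan(), dry_run=True):
        print(line)
```

Stopping on a failed backup is a deliberate choice here: if the config could not be saved, rebooting anyway would defeat the purpose of the workaround.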

Of course you don't want any VMs running, and you want as quiet an environment as possible. My guess is that ESXi has a bug with Mac Pros that makes the shutdown process extremely brittle and dangerous. If there's any data not yet saved to the local disk, it will likely cause corruption on the next reboot.

Hopefully VMware takes notice of this problem and addresses it. I can't believe it's that hard to see that the shutdown sequence seems to be cut short and wreaks havoc with these machines.

continuum
Immortal

This also happens with other hardware.

It may be a good idea to practise how to inject a healthy copy of state.tgz into the bootbank partition with a Linux LiveCD.

Yes - that suggestion is ridiculous - that's exactly why I mention it.
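
If you do keep a spare copy of state.tgz around for that purpose, it is worth confirming that the copy itself decompresses cleanly before you rely on it. The Python sketch below is one way to do that. The function name is hypothetical, and it only checks the gzip layer (deflate data, CRC-32 and length trailer), not whether the contents are a usable ESXi configuration.

```python
import gzip
import zlib


def gzip_crc_ok(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses to the end.

    Reading the stream through to EOF makes the gzip module compare the
    CRC-32 and size stored in the trailer against the decompressed data.
    """
    try:
        with gzip.open(path, "rb") as stream:
            while stream.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers a bad header or CRC mismatch (gzip.BadGzipFile),
        # EOFError a truncated file, zlib.error damaged compressed data.
        return False
```

Run it against the spare copy on the machine where you store it, for example `gzip_crc_ok("state.tgz")`, and only keep copies for which it returns True.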

 

Ulli

Do you need support with a recovery problem ? - send a message via skype "sanbarrow"
jsbattig
Contributor

This feels like basic stuff to me. Looking through the forums, you see this has been reported for quite a while. I don't get how VMware has not solved this, given what an annoyance it is to have to rebuild the machine over and over.

It can't be that hard. Change the sequence of events, make sure you know when you write stuff to disk, and flush everything afterwards during the shutdown sequence.
