VMware Cloud Community
jsbattig
Enthusiast

MacPro 6,1 fails to boot with state.tgz CRC check error

We have two Mac Pro 6,1 machines at the company. One is being prepared for use by our engineering team, while the other was loaded with ESXi 7.0.2.

 

The one with ESXi 7.0.2 has shown problems on reboot (for any reason, including a reboot when coming out of maintenance mode): it fails while decompressing the state.tgz file. Essentially the machine fails to boot and the only solution is to reinstall ESXi.

We were successful reinstalling ESXi while keeping the datastore (and all the VMs on it), then reconfiguring the machine and re-registering the VMs. This is very painful when you take into account that this machine is in a datacenter 30 minutes to 2 hours' drive from the nearest IT person.

At one point we ran some tests with 7.0.0 on a Mac Mini and didn't experience this CRC check of death, so we tested another Mac Pro with 7.0.0 rather than 7.0.2 (the latest at the time of this post). We tried multiple times with reboots, shutdowns, config changes in between, installing new VMs, etc., and could not reproduce the state.tgz CRC check of death. A fluke? We don't know yet; we will continue trying.

I'm reporting this here because this definitely smells like something is broken with 7.0.2 and Mac Pros.

The bad behavior has been reproduced on a Mac Pro with 12 cores and on another Mac Pro with 4 cores. Always with 7.0.2 and so far could not reproduce with 7.0.0.

13 Replies
jsbattig
Enthusiast

After multiple reboots it finally happened with 7.0.0 too.

VMware, this is so annoying! Every reboot is literally a coin toss as to whether ESXi comes back in good shape on the Mac Pro.

The last thing I'm trying is moving all the VMs off the local storage onto an iSCSI store to see if reducing the data traffic makes this problem less prevalent.

jsbattig
Enthusiast

Came back to 7.0.2. I think I finally got to the crux of the problem. 

The issue appears to be a bug with ESXi and Mac Pro 2013 where the server shutdown sequence is cut short before local storage buffers are fully flushed to disk. 

When ESXi is in the shutdown sequence, one of the operations it performs is calling /sbin/auto-backup.sh in order to re-create the package with the base configuration used in the next boot cycle. This effectively recreates the state.tgz file, which is the one that always fails to unarchive properly on the next boot sequence.

In at least two cases I've also seen VMDKs get corrupted to the point that the VM was unusable and unrecoverable.

When looking at the screen monitor of the ESXi server shutting down or rebooting, I noticed that the shutdown sequence seems to be cut short. There's a progress bar and it doesn't move far to the right and usually the machine shuts down or reboots during the "shut down of drivers". My hypothesis is that during the shutdown of some of these drivers the machine is simply halted and if there were any dirty buffers, they are not written to disk. Maybe there's some force flushing of buffers later in the sequence but the process never gets there.

Now going back to the state.tgz file. This file is rebuilt as part of an hourly cron job that calls /sbin/auto-backup.sh (see article https://kb.vmware.com/s/article/2001780). There's some logic within this command that decides whether or not to save the state.tgz file. Certainly every time I've run it after making changes following a fresh reboot, it saves the config. If run again right away, it won't. But calling /sbin/backup.sh will always save the changes immediately.

People have been complaining that Mac Pros randomly fail to boot with the dreaded error opening state.tgz. Some people say "50/50". I guess most people reboot after some config changes, maybe changing the hardware passthrough or some other setting that requires a config change. Well, in those specific cases a reboot without running /sbin/auto-backup.sh first is guaranteed to end up corrupting state.tgz (I tested it).

The workaround I've been testing, which so far seems to work well, is to enable SSH access, execute /sbin/auto-backup.sh, wait about 60 seconds to let all buffers flush to disk, and finally shut down or reboot.
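A minimal sketch of that workaround as a shell function, in case it helps someone script it over SSH. The names safe_restart, run_step and the DRY_RUN switch are made up for this example and are not part of ESXi; only /sbin/auto-backup.sh, the 60 second wait and the final reboot or poweroff come from the steps above. DRY_RUN defaults to 1 so that nothing is executed by accident.

```shell
#!/bin/sh
# Sketch only: save the config, wait for buffers to settle, then restart.
# DRY_RUN is a hypothetical safety switch for this example (default 1 = print only).

run_step() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

safe_restart() {
    action="${1:-reboot}"       # "reboot" or "poweroff"
    wait_seconds="${2:-60}"     # wait time suggested in the post above

    # Rebuild state.tgz before the shutdown sequence gets a chance to cut it short.
    run_step /sbin/auto-backup.sh || {
        echo "config backup failed; not restarting" >&2
        return 1
    }

    # Give the host time to flush the freshly written file to disk.
    run_step sleep "$wait_seconds" || return 1

    run_step "$action"
}
```

Calling `safe_restart reboot` prints the three commands without running them; `DRY_RUN=0 safe_restart reboot` would actually run them, ideally with all VMs already shut down.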

Of course you don't want any VMs running, and you want as quiet an environment as possible. My guess is that ESXi has a bug with Mac Pros that makes the shutdown process extremely brittle and dangerous. If there's any data not yet saved to the local disk, it will likely cause corruption on the next reboot.

Hopefully VMware takes notice of this problem and addresses the issue. I can't believe it is hard to see that the shutdown sequence seems to be cut short and wreaks havoc with these machines.

continuum
Immortal

This also happens with other hardware.

It may be a good idea to practise how to inject a healthy copy of state.tgz into the bootbank partition with a Linux LiveCD.

Yes - that suggestion is ridiculous - that's exactly why I mention it.
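For anyone who wants to practise this, here is a rough sketch of the copy step only, assuming the bootbank partition is already mounted read-write from the LiveCD and a known-good state.tgz was saved somewhere beforehand. The function name restore_state and both paths are placeholders for this example; which partition holds the bootbank depends on the install and is deliberately left out.

```shell
#!/bin/sh
# Sketch only: restore_state GOOD_COPY MOUNTED_BOOTBANK_DIR
# Both arguments are placeholders; mount the bootbank partition yourself first.

restore_state() {
    good="$1"
    bootbank="$2"

    # Refuse to inject a copy that itself fails the gzip integrity test.
    gzip -t "$good" 2>/dev/null || {
        echo "backup copy is not a valid gzip file: $good" >&2
        return 1
    }

    # Keep the damaged file next to the good copy for later inspection.
    if [ -f "$bootbank/state.tgz" ]; then
        cp "$bootbank/state.tgz" "$good.damaged" || return 1
    fi

    cp "$good" "$bootbank/state.tgz" || return 1
    sync

    # Confirm the file on the bootbank now matches the known-good copy.
    cmp -s "$good" "$bootbank/state.tgz" && echo "restored $bootbank/state.tgz"
}
```

This only helps if a good copy exists, so it pairs with saving state.tgz somewhere off the boot disk after each config change.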

 

Ulli



jsbattig
Enthusiast

This feels like basic stuff to me. Looking through the forums, you see this has been reported for quite a while. I don't get how VMware has not solved this, given how annoying it is to have to rebuild the machine over and over.

It can't be that hard. Change the sequence of events, make sure you know when you write stuff to disk, and flush everything afterwards during the shutdown sequence.

jsbattig
Enthusiast

Hey VMware, any news on fixing this problem?

Every shutdown/restart is a gamble and a flip of a coin.

GAGENCY-Lille
Contributor

Hi,

I feel less alone after reading your message. Same problem on all my Mac Pros.

When I install ESXi on the Mac Pro SSD and reboot (and maybe unplug the power cord before the reboot), there is a 99% chance of getting a CRC error: "Module content is likely corrupt". Like you, state.tgz / error 28.

 

The full error message at boot is:

Module content is likely corrupt
isize 78591, hdrlen: 10, recdCRC: 346, size: 0
gzip_extract failed for state.tgz (size 78591) : CRC error
Error 28 (CRC error) while loading module: state.tgz
Decompressed MD5: 0000000000000000000000000 (a lot of 0)
Fatal error : 28(CRC error)
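
The "gzip_extract failed ... CRC error" line means the archive did not pass its gzip integrity check. A small sketch of running the same kind of check by hand with `gzip -t`, for example on a copy of the file from a Linux LiveCD or before a reboot. The helper name check_tgz is made up for this example, the path in the usage note is an assumption, and whether `gzip -t` is available in the ESXi shell itself has not been confirmed in this thread.

```shell
#!/bin/sh
# Sketch only: report whether a gzip-compressed file passes gzip's own integrity test.
# check_tgz is a hypothetical helper name for this example.

check_tgz() {
    file="$1"
    if [ ! -f "$file" ]; then
        echo "MISSING: $file"
        return 2
    fi
    # gzip -t decompresses without writing output and verifies the stored checksum.
    if gzip -t "$file" 2>/dev/null; then
        echo "OK: $file"
    else
        echo "CORRUPT: $file"
        return 1
    fi
}
```

Usage would look like `check_tgz /bootbank/state.tgz` (path assumed). A passing check before shutdown does not prove the file survives the shutdown, but a failing one means the next boot is unlikely to succeed.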

 

And I have no idea how to fix the corrupted file. The only "solution" is to erase everything and retry...

Same problem if you deactivate SIP and AMFI before the install.

 

The only "fix" is to install the ESXi OS on an external USB key and boot from this key.

 

jsbattig
Enthusiast

Same boat. I guess the only challenge with booting from an external device is that you have to be physically present, with a keyboard hooked up to the Mac Pro, for it to boot from the pen drive. Is there a way to compel the Mac Pro to always boot from the pen drive? And you haven't had the same corruption problem when using the pen drive as the boot drive?

What I've done is document the entire procedure to rebuild these Mac Pros and move all VM storage off the onboard SSD (using iSCSI). I've had cases where the actual datastore on the SSD gets badly damaged on reboot or shutdown, and the VMs on it have to be repaired or are busted forever.

GAGENCY-Lille
Contributor

The ESXi OS is installed on the USB key and the Mac automatically boots from it, since it is the only disk that contains a bootable system. I don't need to be present locally when I restart the Mac.

All VMs are on the SSD and I never had any degradation on this part.

This is a solution, but it's very annoying that VMware (or Apple?) still hasn't fixed this problem.

GAGENCY-Lille
Contributor

I must add that, concerning the corrupt state.tgz problem, I think it occurs only if I disconnect the Mac from power (for example, to move it to another spot in the room once it is set up!).
Maybe a clue...

jsbattig
Enthusiast

This is good. I haven't tried this approach indeed. I guess it should be easy to test without causing any trouble by using two pen drives, one with the installer and another one as a target. 

I didn't know the Mac Pro would automatically boot from the pen drive if it found no suitable OS on the onboard SSD.

jsbattig
Enthusiast

In my case I was able to cause this error either by rebooting the machine or shutting it down. No need to pull the cord to make it corrupt the boot drive.

jsbattig
Enthusiast

Well, I finally rebuilt a Mac Pro using your approach of installing the OS on a USB pen drive. Tested reboot and shutdown with no issue. It always came back to life with no problem.

The only operation that required a bit of work was removing all the partitions from the onboard SSD on the Mac Pro (so it boots automatically from the pen drive). For that I had to create a Linux LiveCD, boot from it, and identify the SSD (which happened to be the sda device). To wipe the onboard SSD clean I had to run:

sudo wipefs -a /dev/sda

On the next reboot after that, the Mac Pro started from the ESXi USB pen drive with no problem.
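
Since wipefs -a is destructive and the device name can differ from machine to machine, here is a cautious sketch of wrapping that command. The function name wipe_signatures and the DRY_RUN switch are made up for this example; the only real command it runs is the same `sudo wipefs -a` shown above. It asks for the device twice and, by default, only prints what it would do.

```shell
#!/bin/sh
# Sketch only: guard around "wipefs -a" so the wrong disk is harder to wipe.
# wipe_signatures and DRY_RUN are hypothetical names for this example.

wipe_signatures() {
    dev="$1"
    confirm="$2"

    # Require the device to be typed twice as a basic confirmation.
    if [ -z "$dev" ] || [ "$dev" != "$confirm" ]; then
        echo "refusing: pass the same device twice to confirm" >&2
        return 1
    fi

    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: wipefs -a $dev"
    else
        sudo wipefs -a "$dev"
    fi
}
```

From the LiveCD, something like `lsblk -o NAME,SIZE,MODEL` can help confirm which device is the onboard SSD before running `DRY_RUN=0 wipe_signatures /dev/sda /dev/sda`.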

JRBv3
Contributor

You can force a Mac to boot from any device by default.  Hold down Option while it's booting, which takes you to the Startup Manager.  Use the keyboard to select the device you want and then hover the mouse over the up arrow icon below it.  When you do this the up arrow will change to a circle arrow.  Click this circle arrow and the Mac will use that device to boot from then on, even if there are other bootable drives/devices.
