I have a 2-node vSAN cluster with Dell R730xd hosts running ESXi 7.0. Today I put one of the hosts into maintenance mode to start updating both to 7.0b. When I ran the update from vCenter, it stopped with an error, and the esxupdate log suggests the bootbank could not be found. The host has ESXi installed on SD card via the Dell Dual SD Module. iDRAC shows both SD cards in the mirror as healthy and working normally.
I rebooted the host. It boots okay and comes back online in vCenter, but now shows "No coredump target has been configured. Host core dumps cannot be saved." Trying the update again produces the same error in the esxupdate log. I am still learning vSphere, but I would deduce that something is wrong with the bootbank partition on the SD card, even though the system still boots okay... I also tried to upgrade ESXi from a 7.0b ISO USB installer, but when the installer gets to scanning devices, it quickly fails with a "UnicodeDecodeError" about not being able to "decode byte 0x99 in position 5" because of "Invalid Start Byte". I'm guessing that when the installer scans the SD cards for an existing ESXi install, whatever is wrong with the bootbank makes it error out.
Either way, what can I do to check into why I am suddenly having what seems to be a bootbank issue and can I repair it somehow?
Post Title was edited: After further review of event logs, it appears the /bootbank symlinks only went missing after the attempted update to 7.0b via Lifecycle Manager in vCenter. See below posts.
EDIT: For anyone looking in the future, this was only resolved by re-installing ESXi and copying the state.tgz file from the bootbank of the old install to the new one. But first, to make sure the latest info was in state.tgz, I manually created the symlink /bootbank pointing to /vmfs/volumes/BOOTBANK1. This allowed "auto-backup.sh" to run and bring the stale state.tgz up to date before I copied it off the host for the new install. After the host was back up as normal, I tried the update from 7.0 to 7.0b again, this time by booting the 7.0b ISO, and it worked. I am not attempting it with Lifecycle Manager again, since it caused this issue the first time.
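For anyone repeating the workaround above, here is a rough sketch of the symlink step as a small shell function. It is only a sketch under assumptions: `relink_bootbank` is a made-up helper name, the `/sbin/` location of auto-backup.sh and the copy destination in the comments are assumptions, and a manually created symlink was reported in this thread not to survive a reboot.

```shell
# relink_bootbank ROOT TARGET
# Recreate ROOT/bootbank as a symlink to TARGET, but only if it is missing.
# (Hypothetical helper, not part of ESXi.)
relink_bootbank() {
    root="${1%/}"            # strip a trailing slash so "/" becomes ""
    target="$2"
    link="$root/bootbank"

    # Never overwrite something that is already there.
    if [ -L "$link" ] || [ -e "$link" ]; then
        echo "exists: $link"
        return 0
    fi

    # Refuse to create a dangling link.
    if [ ! -d "$target" ]; then
        echo "missing target: $target" >&2
        return 1
    fi

    ln -s "$target" "$link" && echo "created: $link -> $target"
}

# On the host, the sequence from the post would look roughly like this
# (paths other than /bootbank and BOOTBANK1 are assumptions):
#   relink_bootbank / /vmfs/volumes/BOOTBANK1
#   /sbin/auto-backup.sh                      # refresh the stale state.tgz
#   cp /bootbank/state.tgz /some/safe/place/  # placeholder destination
```

The function takes the root directory as an argument so it can be tried against a scratch directory before touching "/" on a real host.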
If you try to "cd" into /bootbank are you able to list the files? Also have you checked the space?
Please run these two commands ("df -h" and "vdf -h") to identify whether one partition is full:
If I do "cd /bootbank" it returns no such directory. The only place I see bootbank files is under "/vmfs/volumes/BOOTBANK1" and "/vmfs/volumes/BOOTBANK2". There are files in both, with BOOTBANK1 having more recently modified files and BOOTBANK2 having files last modified a couple of months ago.
If I run "df -h" I get:
|Filesystem|Size|Used|Available|Use%|Mounted on|
|VMFS-6|558.2G|5.3G|553.0G|1%|/vmfs/volumes/esxi-03-LDS01|
|VMFS-L|12.2G|1.6G|10.6G|13%|/vmfs/volumes/LOCKER-5jjcbe6e-96115ffe-50e1-72g11c1cu7q6|
|vfat|1023.8M|173.5M|850.3M|17%|/vmfs/volumes/BOOTBANK1|
|vfat|1023.8M|173.6M|850.2M|17%|/vmfs/volumes/BOOTBANK2|
|vsan|16.4T|8.1T|8.3T|49%|/vmfs/volumes/VSAN01-Datastore|
If I run "vdf -h" I get:
|Ramdisk|Size|Used|Available|Use%|Mounted on|
|root|32M|4M|27M|13%|--|
|etc|28M|812K|27M|2%|--|
|opt|32M|8K|31M|0%|--|
|var|48M|752K|47M|1%|--|
|tmp|256M|9M|246M|3%|--|
|iofilters|32M|0B|32M|0%|--|
|shm|1024M|0B|1024M|0%|--|
|crx|1024M|0B|1024M|0%|--|
|configstore|32M|52K|31M|0%|--|
|configstorebkp|32M|52K|31M|0%|--|
|hostdstats|1479M|4M|1474M|0%|--|
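To answer the "is a partition full?" question without reading every row, the raw command output can be piped through a small filter. This is only a sketch: `flag_full` is a hypothetical helper, the 90% default threshold is arbitrary, and it assumes awk is available in the shell you run it from.

```shell
# flag_full [LIMIT]
# Read df-style output on stdin and print "name use%" for every row whose
# Use% column is at or above LIMIT (default 90). Hypothetical helper.
flag_full() {
    limit="${1:-90}"
    awk -v limit="$limit" '
        {
            # Find the field that looks like a percentage, e.g. "17%".
            # The header field "Use%" does not match, so it is skipped.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9]+%$/) {
                    pct = $i
                    sub(/%/, "", pct)
                    if (pct + 0 >= limit + 0) print $1, $i
                }
            }
        }'
}

# Usage on the host:
#   df -h  | flag_full 90
#   vdf -h | flag_full 90
```

With the numbers posted above, nothing comes close to a 90% threshold: the highest local value is 17% on the two bootbank volumes, and the vSAN datastore is at 49%.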
Also, the event log on the host shows a warning about every hour that says "Bootbank cannot be found at path '/bootbank'." However, I can navigate to /vmfs/volumes/BOOTBANK1 and /vmfs/volumes/BOOTBANK2 and the files are there, as mentioned. If I look in "/", bootbank and altbootbank are missing, whereas they do exist on my other hosts.
Also, looking at the event log, it appears that the /bootbank missing warnings only started after I rebooted the host following the original failed update attempt yesterday. All I did was try to remediate with the 7.0b baseline, which failed with this error. I tried applying the update one more time, which failed with the same error, then rebooted the host, and after that /bootbank was missing.
It seems there is a whole discussion about this issue with SD cards on Dell and HPE servers. Not finding the SD card could point to a hardware issue, but since iDRAC shows both SD cards as healthy, we can assume they are okay.
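A quick way to compare hosts is to check both symlinks in one go. The sketch below assumes only that the links are expected to be named bootbank and altbootbank in "/", as described in this thread; `check_bootbank_links` is a made-up helper name.

```shell
# check_bootbank_links ROOT
# Report where ROOT/bootbank and ROOT/altbootbank point, or that they are
# missing. The return code is the number of missing links (0, 1 or 2).
# Hypothetical helper, not an ESXi command.
check_bootbank_links() {
    root="${1%/}"            # strip a trailing slash so "/" becomes ""
    missing=0
    for name in bootbank altbootbank; do
        link="$root/$name"
        if [ -L "$link" ]; then
            echo "$name -> $(readlink "$link")"
        else
            echo "$name: missing"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Usage on the host:
#   check_bootbank_links /
```

On a host in the state described here, both lines would be expected to read "missing", while a healthy host should print a target for each link.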
I have a few questions to see if we can find the issue:
If it is solved, then it is something related to the version you are upgrading to, and it could correlate with the SD card, the drivers you are using, the firmware, etc.
I saw that discussion too, but it seemed to be different, since the hardware checks out okay AND, even though /bootbank is missing, the host still boots up okay when I reboot it... The VMware KB articles I found on the missing /bootbank error indicate the symlink goes missing when the boot device actually can't be accessed. But I can get to it via the long path under /vmfs/volumes, so is it possible I am just missing the symlinks in "/" to the path of each bootbank?
1.) vCenter shows the boot SD card as up on that host, it reports the size, the partitions (4 total), and path with a green icon next to it.
2.) The SD card itself is supported as far as I know. I am using the same SD cards on the other hosts and there seem to be no issues there.
3.) Drivers and firmware are on the same versions between the two hosts and are listed on HCL as supported for ESXi 7.0.
To confirm, I can't try rolling back the update to 7.0b because it didn't actually complete. I'm still on 7.0, the first public build of it. I am afraid to try updating on the other host before I know what happened with this one since if there is a problem/bug with the 7.0b update, I don't want to also cause the issue on the other host and then both hosts in the vSAN cluster could be at risk of a boot issue.
Oh well, you are running ESXi 7, so there are some changes to the storage structure inside ESXi: vSphere 7 - ESXi System Storage Changes - VMware vSphere Blog. However, if you list the root path you should see something like this:
What is your output there?
Here is what I get. Since the symlink is not there, yet bootbanks show up under /vmfs/volumes and the system does indeed boot, and the boot sd card checks out okay, is it possible the only issue here is the symlinks just being missing? If so, it's strange the attempt to update to 7.0b would cause that. Is there a script that runs on boot that creates those symlinks in "/" that maybe got affected? As you can see in previous post, the event log error that comes up when updating fails mentions VIBs regarding the bootbank, I wonder if there is a bug in the update process?
It is particularly weird that bootbank and altbootbank disappeared. There is an issue where the bootbank points to /tmp, but that is not your case. From here, the only thing I can suggest is to try the following procedure: remount Esxi boot bank – Nick's computer on the cloud, or to re-install ESXi.
You can backup and then restore the configurations following the steps in the next KB: VMware Knowledge Base
Of course, you can also wait for somebody else in this community, as they may have the solution for your issue.
That localcli command to restore bootbanks returns "Error, Invalid Parameter: bootbanks". I can't backup the config via that method unfortunately, running it returns:
faultCause = (vmodl.MethodFault) null,
faultMessage = <unset>,
reason = "Internal error"
msg = "Received SOAP response fault from [<cs p:000000fd472457f0, TCP:localhost:8307>]: backupConfiguration
A general system error occurred: Internal error"
I tried recreating the bootbank and altbootbank symlinks manually, which works to create them in "/" but that backup command still doesn't work. And then when I reboot esxi, the symlinks are gone again.
Is there a solution to this? I've raised it with VMware but they were not able to provide solution yet (case is still open).
Frustrating, as the upgrade from ESXi 6.7 to 7.0.3 went fine on 3 out of 4 of the same hosts (Dell R730) in the same cluster.