bo_busillo
VMware Employee

SD Boot issue Solution in 7.x

Issue

The host becomes unresponsive with the error "Bootbank cannot be found at path '/bootbank'" and the boot device is in an APD (All Paths Down) state.

This happens when the boot device stops responding and enters the APD state. In some cases the host becomes unresponsive and shows as disconnected in vCenter.

As of 7.0 Update 1, the format of the ESX-OSData boot data partition has been changed. Instead of FAT, it now uses a new format called VMFS-L. This format allows much more, and much faster, I/O to the partition. The resulting level of read and write traffic overwhelms and corrupts many less capable SD cards.

We have come across a lot of customers reporting bootbank errors (on hosts booting from SD cards) and hosts going into an unresponsive state on ESXi version 7.

Our VMware engineering team is gathering information for a fix, and there is a new vmkusb driver version available for testing. The current workaround is to install version 2 of the vmkusb driver and monitor the host.

The long-term action plan is to replace the SD card(s) with a capable device or disk, per the best practices in the Installation and Setup Guide.

The version 7.0 Update 2 VMware ESXi Installation and Setup Guide, page 12, specifically says that the ESX-OSData partition "must be created on high-endurance storage devices".

https://docs.vmware.com/en/VMware-vSphere/7.0/vsphere-esxi-702-installation-setup-guide.pdf

You can also refer to the below KB:

Reference: https://kb.vmware.com/s/article/83376?lang=en_US

Resolution

VMware engineering has a fix that will be included in the next release, 7.0.2 P03, which is planned for sometime in July 2021.

162 Replies
JailBreak
Hot Shot


@sbd27 wrote:

@PatrickDLong So you are correct. I would not recommend replacing any current embedded ESXi solution mainly because, at least with Dell, you can't! When you purchase a diskless server from Dell without a Perc card and drive cages, they do not support installing them afterwards, you are stuck.

What makes matters even worse for me (and I have to assume other customers) is that I have some R730s that are diskless with only the IDSDM solution, and the R730 (which is still an ESXi supported server) does not support the BOSS card. If this fix does not work I have to replace servers I did not budget for in my upgrade project.

However, all new servers that I purchase will no longer utilize a diskless config. I can easily have a non-technical person replace a bad swappable SSD RAID drive, but replacing a BOSS card or its attached SSD requires downtime and opening the hood of the server. No Thanks!


Fortunately, all my Dell servers were acquired with SSDs, so I don't have this issue with the Dells, only with HPE.


Luciano Patrao
ICT Senior Infrastructure Engineer
Tech Lead for VMware / Virtual Backups
________________________________
If helpful Please award points
Thank You
Blog: https://www.provirtualzone.com | Twitter: @Luciano_PT
vivithemage
Contributor

Is the update out yet to resolve this? 

lukaslang
Enthusiast

Seems it got delayed until end of August (only rumors). But since there are no official statements, it is really difficult to tell an exact release date.

JailBreak
Hot Shot

My previous occurrence was on 14/07 with 2 hosts, and today was a nightmare, with 5 hosts affected.

I have never had these numbers in the same cluster. In a cluster of 12 ESXi hosts, 5 had the issue today (or during the weekend). One of them was running the vCenter, so I had double the trouble: until I recovered the ESXi host where the vCenter was running, everything was chaotic and unstable.

PS: If you leave an ESXi host with the issue for a long time (10 to 12 hours), the VMs start to be affected, with 100% CPU usage and performance problems.


sysadmin84
Enthusiast

I managed to get my hands on a BOSS card for one of our hosts and moved all the VMs to that host. That will hold me over. I feel bad for people with bigger environments where it's not an option to replace the boot device for dozens or hundreds of hosts.

lukaslang
Enthusiast

Status of our SD card test server: 33 days of uptime with no problems. It runs along with 19 other servers in a cluster.

Quick question to the HPE owners: Have you updated the firmware with the latest SPP for Gen9 (2021.05.0) and installed ESXi with the U2a customized image? Maybe this prevents or slows down the issue?

A13x
Hot Shot

Patch release due next month to resolve this and also support secure boot.

PatrickDLong
Enthusiast

@A13x Do you care to share either your source or confidence level in your statement "due next month"? I will point out that the OP (employee) statement of "recommending the install of P3 in July sometime." was clearly either incorrect from the outset or invalidated as the date approached, and Duncan clarified this in his response to complaints on this thread after nothing was released on July 15 as had been widely speculated here and elsewhere.

https://communities.vmware.com/t5/ESXi-Discussions/SD-Boot-issue-Solution-in-7-x/m-p/2857776/highlig...

"release dates are typically not shared, mainly as they change based on various aspects. In this case your source was/is wrong."

It seems exceedingly clear to me that VMware is not going to make any official statement regarding release date for this patch, and speculation on release dates only serves to improperly set expectations, justified or not.

A13x
Hot Shot

The source is VMware, via an SR. I obtained the patch before, but it never supported secure boot. I opened a new case to ask for an ETA and was told that ESXi patches for 6.7 and 7 will be released next month.

They also confirmed it several times. This SD card patch would be included and also support secure boot.

vmrulz
Hot Shot

We too have hit this issue with HP G9 BL460s in a dev cluster on 7.02. We asked to be put on the pre-release of the patch from VMware, which is supposedly in U3 in mid-August. Sounds like VMware needs to validate this fix and release it ASAP. Sorry for those that have this issue in production!

vmrulz
Hot Shot

Apparently downgrading to 7.01 is an option to get around this issue. Anybody know if you can do that with VUM? I've been doing vmware for a million years and never had to downgrade a host.  I suppose we could just install a fresh copy of U1 on each.

esxcfg-rescan -d vmhba32 just hangs and hangs

https://www.provirtualzone.com/vsphere-7-update-2-loses-connection-with-sd-cards-workaround/

ls -al on server this morning.. still hanging 8 hours later.. server hasn't disconnected but it's basically useless other than hosting vms.

[Attached screenshot: vmrulz_0-1626994938597.png]

einstein-a-go-g
Hot Shot

@vmrulz 

Does recovery mode work for you? Shift+R at boot to roll back?

vivithemage
Contributor

Even the workaround does not resolve the issue.

JailBreak
Hot Shot


@vivithemage wrote:

Even the workaround does not resolve the issue.


The workaround is only temporary, mainly to recover the ESXi host so that it can be rebooted properly.


JailBreak
Hot Shot


@vmrulz wrote:

Apparently downgrading to 7.01 is an option to get around this issue. Anybody know if you can do that with VUM? I've been doing vmware for a million years and never had to downgrade a host.  I suppose we could just install a fresh copy of U1 on each.

esxcfg-rescan -d vmhba32 just hangs and hangs

https://www.provirtualzone.com/vsphere-7-update-2-loses-connection-with-sd-cards-workaround/

ls -al on server this morning.. still hanging 8 hours later.. server hasn't disconnected but it's basically useless other than hosting vms.

[Attached screenshot: vmrulz_0-1626994938597.png]


If the host has the issue, you can't run ls, df -h, or other shell commands; they will hang. You first need to fix the issue with esxcfg-rescan -d vmhba32 and reboot. Then you can run other commands normally.
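
As a rough illustration of the symptom described above, the following Python sketch runs a command with a timeout so that a monitoring script does not hang along with it. The helper names probe and bootbank_responsive and the 10-second timeout are assumptions made for illustration and are not part of any VMware tooling; /bootbank is the path from the error message in the original post.

```python
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd and return its exit code, or None if it did not finish in time.

    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Signal the child but do not wait for it afterwards: a process
        # blocked on unresponsive storage may not exit, and waiting on it
        # would hang this script as well.
        proc.kill()
        return None


def bootbank_responsive(path="/bootbank", timeout_s=10.0):
    """Return True only if listing the path finishes in time and succeeds."""
    return probe(["ls", path], timeout_s) == 0
```

A call such as bootbank_responsive() returns False when ls /bootbank hangs past the timeout or fails. It only detects the symptom; it does not recover the host.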


vivithemage
Contributor

I thought the cache tools workaround was the fix?
A13x
Hot Shot

The fix is a later version of the VIB from VMware, which you can request via a support SR. The release is hopefully due next month with the rest of the VMware host and vCenter patches.

vivithemage
Contributor

Ah, so what was that workaround for? It's in their fix bulletin.

I only use the free version, so I have no support contract.

JailBreak
Hot Shot


@vivithemage wrote:
I thought the cache tools workaround was the fix?

On some of my ESXi hosts it did fix the issue. On others it reduced how often I get the issue: instead of every 24 to 48 hours, I get it once a week.

So it is not a silver bullet.


JailBreak
Hot Shot


@A13x wrote:

The fix is a later version of the VIB from VMware, which you can request via a support SR. The release is hopefully due next month with the rest of the VMware host and vCenter patches.


Unfortunately, upgrading or even downgrading the vmkusb VIB did not fix all systems; only a couple were fixed. Many customers have reported that this solution did not fix the issue and that they still get the U2a issue on their ESXi hosts.

