bo_busillo
VMware Employee

SD Boot issue Solution in 7.x

Issue: The host goes into an unresponsive state with the error "Bootbank cannot be found at path '/bootbank'" and the boot device is in an APD state.

This issue occurs when the boot device fails to respond and enters an APD (All Paths Down) state. In some cases, the host becomes unresponsive and shows as disconnected from vCenter.

As of 7.0 Update 1, the format of the ESX-OSData boot data partition has been changed. Instead of FAT, it uses a new format called VMFS-L. This new format allows much more and much faster I/O to the partition. The resulting level of read and write traffic is overwhelming and corrupting many less capable SD cards.

We have come across a lot of customers reporting bootbank errors (on hosts booting from SD cards) and hosts going into an unresponsive state on ESXi version 7.

Our VMware engineering team is gathering information for a fix, and a new vmkusb driver version is available for testing. The current workaround is to install version 2 of the vmkusb driver and monitor the host.

The action plan for a future resolution would be to replace the SD card(s) with a capable device/disk, per the best practices in the Installation guide.

The version 7.0 Update 2 VMware ESXi Installation and Setup Guide, page 12, specifically says that the ESX-OSData partition "must be created on high-endurance storage devices".

https://docs.vmware.com/en/VMware-vSphere/7.0/vsphere-esxi-702-installation-setup-guide.pdf

You can also refer to the below KB:

Reference: https://kb.vmware.com/s/article/83376?lang=en_US

Resolution

VMware engineering has a fix that will be in the next release, 7.02 P03, which is planned for sometime in July 2021.

162 Replies
bo_busillo
VMware Employee

You need to verify that your SD cards are "High Endurance".

The fastest UHS-I microSD cards are the U3-rated Extreme PLUS line, which offer maximum read speeds of 100 MB/s and maximum write speeds of 90 MB/s, and are available in capacities of 32GB, 64GB, and 128GB. 

UHS Speed Class

The next speed class up is the UHS (Ultra-High Speed) Speed Class and it’s denoted with the “U” symbol. There are two ratings within the UHS Speed Class:

  • U1 (UHS Speed Class 1): minimum write speed of 10MB/s
  • U3 (UHS Speed Class 3): minimum write speed of 30MB/s

The UHS Speed Class is more commonly used nowadays than the Speed Class, and many high-end cameras require at least a U3-rated memory card for many of their functions, such as recording high-resolution videos. The UHS Speed Class mainly refers to the minimum sustained write performance for recording videos and came about because 4K-capable video recording devices needed faster write speeds. As a rule of thumb, 4K-capable recording cameras will usually require at least a U3-rated SD card.

What makes the U1 and U3 memory cards more advanced than those in the Speed Class is that they use one of two UHS bus interfaces:

  • UHS-I: theoretical maximum transfer speeds up to 104MB/s
  • UHS-II: theoretical maximum transfer speeds up to 312MB/s

Both U1 and U3 memory cards can utilise the UHS-I bus interface, but are not compatible with the UHS-II bus interface.

These UHS bus interfaces indicate the theoretical maximum read and write speeds, unlike the sustained write speeds of speed classes. The UHS bus interfaces are denoted by a Roman numeral “I” or “II” symbol on the front of the card. The bus speeds refer to the theoretical data transfer rate of the interface itself while a U3-rated SD card has its own sustained write speed of 30MB/s. For example, a UHS-I U3-rated card guarantees a write speed of 30MB/s but has the potential for a read and write speed of up to 104MB/s if used with a device that supports a UHS-I bus interface.

A UHS-II compatible card has a potential read and write speed of up to 312MB/s. The UHS bus interfaces are backwards compatible so you can use a UHS-II card in a device that supports UHS-I, but you won’t see the speed benefits of UHS-II as the card will default back to the lower specs of UHS-I. Both the card and bus interface must be fully compatible to experience the speed benefits.
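
To make the difference between the two ratings concrete, here is a minimal Python sketch using only the figures quoted above. The function names are hypothetical and only illustrate the reasoning: the speed class gives a guaranteed floor for sustained writes, while the bus interface gives a theoretical ceiling that is limited by the slower of the card and the device.

```python
# Figures taken from the description above (MB/s).
UHS_SPEED_CLASS_MIN_WRITE = {"U1": 10, "U3": 30}   # guaranteed sustained write
UHS_BUS_MAX_TRANSFER = {"UHS-I": 104, "UHS-II": 312}  # theoretical maximum


def guaranteed_write(speed_class: str) -> int:
    """Minimum sustained write speed promised by the UHS Speed Class."""
    return UHS_SPEED_CLASS_MIN_WRITE[speed_class]


def bus_ceiling(card_bus: str, device_bus: str) -> int:
    """Theoretical maximum transfer speed for a card/device pairing.

    The interfaces are backwards compatible, so the pairing falls back
    to the slower of the two.
    """
    return min(UHS_BUS_MAX_TRANSFER[card_bus], UHS_BUS_MAX_TRANSFER[device_bus])


if __name__ == "__main__":
    print("U3 floor:", guaranteed_write("U3"), "MB/s")
    print("UHS-II card in UHS-I device:", bus_ceiling("UHS-II", "UHS-I"), "MB/s")
    print("UHS-II card in UHS-II device:", bus_ceiling("UHS-II", "UHS-II"), "MB/s")
```

Note that neither number says anything about endurance (how many writes the flash survives), which is a separate property of the card.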

 

PatrickDLong
Enthusiast

@bo_busillo this is fantastic!  Are you aware of any way to programmatically retrieve this information (or in fact any information at all: manufacturer, serial number, model number, etc.) from the SD or microSD media installed in a server without physically removing the card to look at it?  The lsusb -v command only provides details about the USB hub/reader device, not the media inserted in it, and I'm unaware of any other commands that might return this type of information. It's not visible in iLO, and my quotes for BL460c Gen9 servers purchased in 2016 and SY660 Gen10 servers just purchased in 2021 show the same vendor P/N, although they are most certainly NOT the same physical card.  Last time I worked on one of my HPE Synergy SY660 compute modules I did take a picture of the card. Thanks to your info, now I know what all the symbols and numbers mean ; - ) Thank you!

HPE_32GB_microSD.jpg

einstein-a-go-g
Hot Shot

Thanks for the SD specifications. As a video photographer, I'm aware of them.

BUT, what date did this become a requirement?

How does any client, architect or installer know whether a server purchased from a vendor, e.g. Dell EMC or HPE, for use with ESXi 6.5, 6.7 or 7.0 and pre-installed on an IDSDM, meets this requirement? Are we supposed to remove the server from the rack and open it up to expose the SD cards in use?

And how would we know that a Dell- or HPE-branded SD card with ESXi pre-installed meets these requirements?

Does this also mean that an upgrade to ESXi 7.0 is OFF THE TABLE, e.g. an in-place upgrade of any system that does not meet these requirements? And when is VMware going to publish a VMware KB (unless they already have) for me to distribute to ALL clients tomorrow, reminding them they will have to purchase:

1. New servers with BOSS or SATADOM, M2, NVMe

2. Upgrades required for existing HCL based hardware, because the solution in place no longer meets the SD requirements, although those servers will still be on the HCL.

68% of our clients are using SD cards, because that, along with ESXi Embedded installations, is what has been sold since 2004!

einstein-a-go-g
Hot Shot

@PatrickDLong

This is a good read on identifying SD/microSD cards:

 

https://www.bunniestudios.com/blog/?p=2297
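
As a partial illustration of what "identifying" a card can mean in software: every SD card carries a 128-bit CID (Card Identification) register holding the manufacturer ID, OEM ID, product name, revision, serial number and manufacturing date. The sketch below decodes such a register in Python. It rests on several assumptions: the field layout is the standard SD CID layout, the hex string is something you have already obtained (for example, Linux exposes it under sysfs for cards attached to a native SD/MMC host controller), and the sample value is made up for illustration. It is not an ESXi command, and cards behind a USB reader generally do not expose the CID at all, which matches the lsusb limitation described above.

```python
from dataclasses import dataclass


@dataclass
class SdCid:
    manufacturer_id: int   # MID, 8 bits
    oem_id: str            # OID, 2 ASCII characters
    product_name: str      # PNM, 5 ASCII characters
    revision: str          # PRV, BCD "major.minor"
    serial: int            # PSN, 32 bits
    manufactured: str      # MDT, formatted as "YYYY-MM"


def parse_sd_cid(cid_hex: str) -> SdCid:
    """Decode a 128-bit SD card CID register given as 32 hex characters."""
    cid_hex = cid_hex.strip()
    if len(cid_hex) != 32:
        raise ValueError("expected 32 hex characters (128 bits)")
    value = int(cid_hex, 16)

    mid = (value >> 120) & 0xFF
    oid = ((value >> 104) & 0xFFFF).to_bytes(2, "big").decode("ascii", "replace")
    pnm = ((value >> 64) & 0xFFFFFFFFFF).to_bytes(5, "big").decode("ascii", "replace")
    prv = (value >> 56) & 0xFF
    psn = (value >> 24) & 0xFFFFFFFF
    mdt = (value >> 8) & 0xFFF          # 8-bit year offset from 2000, 4-bit month

    return SdCid(
        manufacturer_id=mid,
        oem_id=oid,
        product_name=pnm,
        revision=f"{prv >> 4}.{prv & 0xF}",
        serial=psn,
        manufactured=f"{2000 + (mdt >> 4)}-{mdt & 0xF:02d}",
    )


if __name__ == "__main__":
    # Made-up CID value, for illustration only.
    print(parse_sd_cid("035344535533324780123456780107ff"))
```

Even when the CID can be read, it identifies the card but does not state its endurance rating, so it only helps with matching a card against a vendor's data sheet.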

bo_busillo
VMware Employee

I am not sure of the exact date; I will see if I can dig up any docs/details. In the meantime, here is the "recommendation" from this KB (updated Feb 2021):

What is the recommendation if you already have these older devices?

  • We recommend installing larger boot media. You should consider moving away from USB/SD card devices because high-performance devices are required for predictable application behavior, with some applications requiring larger and more reliable storage. Work with server OEM vendors to ensure the device meets the required endurance parameters provided in the guidance documentation.
  • https://kb.vmware.com/s/article/82515 
bo_busillo
VMware Employee

I am checking on commands or some other way to validate this, but in the meantime check out this link:

http://partnerweb.vmware.com/programs/server_docs/Approved%20Flash%20Devices.pdf

 

LeoKurzKDA
Contributor

Hello,

replacing all boot devices in all servers definitely is not an option for us and also for many others, I think. I need a quick solution and I don't like to experiment with any old drivers or work-arounds in a production environment. I wonder if relocating scratch (KB 1033696) and ProductLocker (KB 2129825) to a shared "capable" disk/LUN would solve the problem for "normal" SD boot devices.

__Leo

JailBreak
Hot Shot


@LeoKurzKDA wrote:

Hello,

replacing all boot devices in all servers definitely is not an option for us and also for many others, I think. I need a quick solution and I don't like to experiment with any old drivers or work-arounds in a production environment. I wonder if relocating scratch (KB 1033696) and ProductLocker (KB 2129825) to a shared "capable" disk/LUN would solve the problem for "normal" SD boot devices.

__Leo


Unfortunately, the workaround is all we have for now, and it is the only way to recover the server without having to do a hard reboot and restart all the VMs running on it. I also will not apply any old or beta drivers in my production environment.

Regarding your question: first, that should already be set if you are using SD cards. Scratch and locker should not run on SD cards. The best practice is to run them on a datastore (local or SAN; iSCSI is not recommended).

Moving them will not fix your problem, but it will reduce how often this bug can be triggered.
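
For anyone scripting the scratch relocation across many hosts, here is a small Python sketch that only builds the command strings; it does not connect to any host. The `vim-cmd` form and the `.locker-<hostname>` directory convention are what I understand KB 1033696 to describe, so treat them as assumptions and check them against the KB before use. The function names, datastore name and host name are hypothetical, and a host reboot is expected before the new scratch location takes effect.

```python
import posixpath
import shlex


def scratch_location(datastore: str, hostname: str) -> str:
    """Per-host scratch directory on a shared datastore (one directory per host)."""
    return posixpath.join("/vmfs/volumes", datastore, f".locker-{hostname}")


def scratch_commands(datastore: str, hostname: str) -> list:
    """Return the shell commands to run on the ESXi host (assumed per KB 1033696)."""
    path = shlex.quote(scratch_location(datastore, hostname))
    return [
        # Create the per-host directory on the datastore.
        f"mkdir -p {path}",
        # Point the advanced setting at it; applied at next reboot.
        "vim-cmd hostsvc/advopt/update "
        f"ScratchConfig.ConfiguredScratchLocation string {path}",
    ]


if __name__ == "__main__":
    # "datastore1" and "esx01" are placeholder names.
    for command in scratch_commands("datastore1", "esx01"):
        print(command)
```

Giving each host its own directory matters because several hosts sharing one scratch directory would overwrite each other's files.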


Luciano Patrao
ICT Senior Infrastructure Engineer
Tech Lead for VMware / Virtual Backups
________________________________
If helpful Please award points
Thank You
Blog: https://www.provirtualzone.com | Twitter: @Luciano_PT
lukaslang
Enthusiast

It is a shame that such important things are not published properly (though from VMware's side, this is understandable. Imagine the rumors IF they made an official announcement that SD cards are no longer supported).

At the moment we have a test host running on an HPE SD card with redirected ProductLocker, scratch and syslog. It has been running fine for 7 days now. Hope it stays that way.

But this mess does not end there: we found out that a bunch of our hosts use a disk controller that is no longer supported with vSphere 7, the B140i.

So we HAD hosts with "High Endurance Storage" but had to insert SD cards to be able to install vSphere 7 (clearly HPE's fault). And now VMware is telling us that, with U2a, SD cards are no longer best practice? For the hypervisor with the smallest possible footprint? What in God's name is so important that you have to read it a million times from local media instead of loading this c*** into the memory of the host, since we have hundreds of GB of RAM? Get your things together.

jzbv73
Contributor

Based on VMware's info, is it safe to say the best course is to use a read-intensive SSD?  Or would mixed-use be the better bet?

kurtd
Enthusiast

Even if this problem goes away, it seems dual SD modules are no longer recommended, so is it best to move to a BOSS card with two M.2 drives, or to add two SSDs to the RAID controller supporting my datastore?

Kahonu84
Hot Shot

Our solution was a bit more radical but long lasting. We bought hard drives and reinstalled ESX on six hosts and blow-torched the SDs. They were nothing but misery. Dell took us down a dark alley.

einstein-a-go-g
Hot Shot

SD-Failure.PNG

Two servers have failed now, after 46 hours, using High Endurance microSD cards as per the specification!

sysadmin84
Enthusiast

Our second host just failed as well after ~2 months (Dell R740 with IDSDM (Dell SD cards)). I have now ordered a couple of BOSS cards since I can't keep waiting on a patch anymore.

johnmcc22
Contributor

So, these SD cards / flash devices are certified to run ESXi 7.02 and beyond?  If so, we'll look into purchasing something from this list.

kurtd
Enthusiast

Dell won't let me order a BOSS card because it's on backorder.  They blame it on a chip shortage, but it's probably because of this SD card bug.  Going to open a support ticket with Dell and complain.  7.02 should have been pulled.  All it does is brick systems, yet they still have it out there to download.  Time to migrate off VMware??

sysadmin84
Enthusiast

Got the same info from my seller: 3 months lead time on BOSS cards. Since our servers are diskless, we'll have to set up boot from SAN.

I'm still wondering, though: will SD cards be OK again with the newest patch, or will it just slow down the problem? VMware will hopefully make this clear.

e_espinel
Expert

Hello.
I have been following this post and others because of the serious problems with using SD cards as boot devices, which are apparently more critical in version 7.
I remember that back in version 4, people started using USB keys (4 GB or 8 GB) as boot devices without problems. Maybe this can be an alternative to solving the problem with a patch and/or replacing the SD card with internal mechanical disks.

 

Enrique Espinel
Senior Technical Consultant IBM, Lenovo and VMware.
VMware VSP-SV 2018, VTSP-SV 2018 VMware Technical Solutions Professional Hyper-Converged Infrastructure (VTSP-HCI 2018)
VMware Technical Solutions Professional (VTSP) 4 / 5.
Please mark my comment as the Correct Answer/Kudos if this solution resolved your problem Thank you.
einstein-a-go-g
Hot Shot

You should have got Dell to replace them with BOSS cards free of charge (F.O.C.); we did that for many clients.

 

Under UK LAW, Not fit for FECKING PURPOSE!!!!

einstein-a-go-g
Hot Shot

It's very vague and a grey area. VMware just states "high endurance flash", whatever that means.

And we've had that fail!

Is it speed, or is it that the writes overwhelm the media?
