VMware Cloud Community
bo_busillo
VMware Employee

SD Boot issue Solution in 7.x

Issue: The host goes into an unresponsive state with the error "Bootbank cannot be found at path '/bootbank'", and the boot device is in an APD state.

This issue occurs when the boot device stops responding and enters an APD (All Paths Down) state. In some cases, the host becomes unresponsive and shows as disconnected in vCenter.

As of 7.0 Update 1, the format of the ESX-OSData boot data partition has been changed. Instead of using FAT it is using a new format called VMFS-L. This new format allows much more and faster I/O to the partition. The level of read and write traffic is overwhelming and corrupting many less capable SD cards.

We have come across a lot of customers reporting bootbank errors (on hosts booting from SD cards) and hosts going into an unresponsive state on ESXi version 7.

Our VMware engineering team is gathering information for a fix, and a new vmkusb driver version is available for testing. The current workaround is to install version 2 of the vmkusb driver and monitor the host.

The long-term action plan is to replace the SD card(s) with a capable device or disk, per the best practices in the Installation and Setup Guide.

The version 7.0 Update 2 VMware ESXi Installation and Setup Guide, page 12, specifically says that the ESX-OSData partition "must be created on high-endurance storage devices".

https://docs.vmware.com/en/VMware-vSphere/7.0/vsphere-esxi-702-installation-setup-guide.pdf

You can also refer to the below KB:

Reference: https://kb.vmware.com/s/article/83376?lang=en_US

Resolution

VMware engineering has a fix that will be in the next release, 7.02 P03, which is planned for sometime in July 2021.

175 Replies
tomas_strand
Enthusiast

I went from 7.0.1 to 7 update2a using VMware-VMvisor-Installer-7.0.0.update02-17867351.x86_64-DellEMC_Customized-A04.iso

Then I used Lifecycle Manager to patch to Update 2c.

The only strange thing that happened to me, on two hosts, was that after upgrading to U2a, ESXi booted with all vmknics missing (but it still saw the ordinary NICs). I reverted to 7.0.1 using Shift+R while booting, then ran the upgrade again and it worked.

And because I'm using Skyline it triggered this https://kb.vmware.com/s/article/83851

Patching to u2c fixed it.

 

mbartle
Enthusiast

I am running the Dell custom ISO : DEL-ESXi-701_17551050-A05 . My concern about using Lifecycle Manager is the Dell addons won't be upgraded.  What version were you on prior to updating via LM ? 

I am hoping Dell EMC releases a custom ISO for this, but VMware hasn't even yet. They released a 7GB ISO for vCenter 7.0.2Uc but the ESXi available to download is still 7.0.2Ua

cbs44
Contributor

I was using build 2a.  I noticed also that the 2c wasn't available for ESXi...just vCenter.  However, with this issue we were having with the SD cards I had to apply the update asap.  Also, from my understanding, the 2c update is more of a patch than an upgrade so I went with the Lifecycle Manager method.  Update went smooth and I haven't had any issues since then.  I also have DRS and HA enabled with no issues.  Having DRS and HA enabled with the SD card issue was a nightmare.

Additionally, I am using Skyline and it doesn't report any issues regarding this...

LucianoPatrão


@mbartle wrote:

I am running the Dell custom ISO : DEL-ESXi-701_17551050-A05 . My concern about using Lifecycle Manager is the Dell addons won't be upgraded.  What version were you on prior to updating via LM ? 

I am hoping Dell EMC releases a custom ISO for this, but VMware hasn't even yet. They released a 7GB ISO for vCenter 7.0.2Uc but the ESXi available to download is still 7.0.2Ua


There is no ISO from VMware yet, so no customized ISO will be available for now. If VMware has not provided one, third-party vendors will not either.

We should see a new ISO version soon; until then we can only apply the patch. If you cannot get the patch through vCenter Lifecycle Manager, import it or apply it manually from the ESXi console.

Check my blog post for more details
https://www.provirtualzone.com/vmware-finally-launched-esxi-7-0-update-2c/

 

Luciano Patrão

VCP-DCV, VCAP-DCV Design 2023, VCP-Cloud 2023
vExpert vSAN, NSX, Cloud Provider, Veeam Vanguard
Solutions Architect - Tech Lead for VMware / Virtual Backups

Blog: https://www.provirtualzone.com | Twitter: @Luciano_PT
A13x
Hot Shot

LCM has the Dell add-on images, the same ones that are built into the Dell ISO. You just create a baseline and patch.

mbartle
Enthusiast

Luciano.

Something does not make sense here.  I ran these in my test environment. vCenter was 7.0.2 Build 17958471 , I ran the upgrade to .400 from the appliance update section. Patched successfully and brought vCenter to Build 18356314.

The build number for ESXi 7.0.2Uc is : 18426014 , which is higher than vCenter. I can't ever remember in all my years of working with these products that the ESXi will have a build higher than vCenter.  My understanding has always been vCenter must be equal to or higher in build numbers than ESXi.  I am not sure I want to patch production without some clarification here.

I did the patch using the Host Security Patches and Critical Host Patches baseline in test, however the build is the same if I were to create an image for my production cluster - the vCenter will still be lower than ESXi.  

Is this ok ?

pkvmw
VMware Employee

@mbartle As for your concern, the build number is not relevant. The ESXi offline bundle (the ZIP file) was simply built AFTER the vCenter ISO, hence it has a higher build number.

You don't need to check build numbers (at least I don't know of any scenario where they are relevant). Checking the public version name and supportability via https://interopmatrix.vmware.com is enough.

Regards,
Patrik
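
To illustrate the point above, here is a minimal Python sketch of comparing the public version instead of the build number. The `Release` type and the `vcenter_version_covers_host` helper are hypothetical names made up for this example; the build numbers are the ones quoted earlier in the thread. The "vCenter version equal to or newer than the host" rule is a simplification, and the interoperability matrix remains the authoritative check.

```python
from typing import NamedTuple, Tuple


class Release(NamedTuple):
    """Hypothetical record for one product release."""
    product: str
    version: Tuple[int, ...]  # public version, e.g. (7, 0, 2)
    build: int                # build number, shown only for contrast


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as "7.0.2" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def vcenter_version_covers_host(vcenter: Release, host: Release) -> bool:
    """Compare public versions only; build numbers are deliberately ignored."""
    return vcenter.version >= host.version


# Build numbers as quoted in the thread.
vcenter = Release("vCenter", parse_version("7.0.2"), 18356314)
esxi = Release("ESXi", parse_version("7.0.2"), 18426014)

print(vcenter_version_covers_host(vcenter, esxi))  # True: same public version
print(vcenter.build >= esxi.build)                 # False: ESXi bundle was built later
```

The second comparison shows why the build number is misleading here: it only reflects when each artifact was built, not which release it belongs to.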

einstein-a-go-g
Hot Shot

It does not make any difference; what matters is the MAJOR version number, e.g. 7.

If you go back far enough, some minor versions caused issues with SSL!

vbabic
Enthusiast

Well this is great... Now there's a new article saying that having only SD card (or USB) is unsupported.

Now I am doubting even more that they actually fixed the issue (maybe just lessened the impact), obviously they changed the design too much without thinking or testing the currently supported hardware and it would be too much trouble for them to fix it now. So...dear customers, suddenly you're unsupported, tough luck! Oh wait, there's a simple workaround, just reinstall all your servers on new hardware...I am sure VMware will cover the cost...

https://kb.vmware.com/s/article/85615?lang=en_US

Great job!

lukaslang
Enthusiast

@vbabic 

Well, you have to say that this article only mentions the problem with non-persistent storage of /scratch, which on SD card installations always led to this message. If you redirect the scratch location to persistent storage (like a LUN), the message will not pop up.

The article also says "A system with only a SD-Card/USB boot device is operating in an unsupported state with the potential for premature corruption", which in most cases will not apply.

vbabic
Enthusiast

Hi,

If it is as you say, just about the scratch partition, like it always was, why the new article and new warnings - "Please move installation to persistent storage"? *installation*, not /scratch.

They clearly say that the only supported configuration is having a local persistent storage device that is not a SD-Card/USB boot device.

That is new. Maybe some people design their servers with extra disks lying around in the servers not being used, we don't. If I have local disks, they are for vSAN.

If I was buying local disks just for the /scratch partition, why would I even have the SD Card, I would use those disks for booting...

/scratch partition is of course redirected to a remote datastore, but this is about all other parts of the shiny new much improved ESX-OSData partition which they found out doesn't support SD Cards, even though it did until 7.0 U2

einstein-a-go-g
Hot Shot

this was inevitable based on the evidence we've seen between versions at the electrical logging data level !!!!!

We never believed for one minute they attempted to fix the issue!

Once HPE and DELL issued statements of not supported/not recommended for IDSDM modules!

Smoke and Mirrors !

 

einstein-a-go-g
Hot Shot

just to clear up....

we've seen corruption issues even with redirected scratch partitions!

and we can reproduce the issue in 13-26 minutes!

vbabic
Enthusiast

Exactly, it's not about the scratch. This is the biggest screw-up I have seen from VMware... and the sad part is they don't care, and they try to minimize it and blame OEMs and customers for using supposedly low-endurance cards (now it's obvious no card has a high enough endurance for their brilliant design).

As I said, I haven't yet started upgrading to 7.0 because I don't consider it a stable version (I don't think anyone can argue about that in this thread)...what am I supposed to do now? Stay on 6.7 until my servers are old enough for replacement in 3-4 years?

lukaslang
Enthusiast

The biggest problem is that, until now, there has been no 100% clear statement. If you have to plan for the future, this course is unacceptable.

The strangest thing is that my test host with an 8 GB HPE SD card has now been running for almost two months without a problem on U2a.

For the other hosts, since the B140i controller is unsupported with ESXi 7, I am thinking of disabling the RAID functionality and using the Wellsburg SATA controller. Has anyone done that yet on HPE hardware? I was not aware that this is possible.

einstein-a-go-g
Hot Shot

This was a discussion (or argument) I was having with fellow vExperts, who also happen to be VMware employees!!!!

When this broke, imagine the disarray this is causing our clients, who have been convinced by VMware and VMware partners to install ESXi (the hypervisor) in "Embedded Mode" ever since ESX "i" was invented, claiming as "benefits" over Hyper-V a smaller footprint, no need for spinning disks for a mirrored installation, fewer security issues, less overhead...

and we all bought into that, now to be told....

USB/SD cards, even though supplied by OEM vendors, NOT SUPPORTED!!!!

Oh! Since 2004 we've been installing onto USB/SD cards; even the vendors have been doing it, supplying IDSDM and pre-installed SD cards.

It has been said that VMware advised OEMs of this change in 2018, and people poke fun at SD/microSD card installations. We always thought it was not about USB/SD type cards: we've seen high-endurance industrial SD cards FAIL that are used in more serious, life-threatening applications and systems than an ESXi server!!!!

Something in the latest ESXi 7.x is foo-barred!!!

It does leave an awkward situation for upgrades from 6.5 and 6.7 to 7.0, where there is no easy upgrade path other than a full install onto something you can get fitted in your servers; as I understand it, BOSS/SATADOM may not fit some servers, like the Dell IDSDM module.

It is a **bleep**-up! 

(not to be confused with a vSAN installation, which does have a requirement for BOSS/SATADOM because of the vSAN trace logs!)

I do feel for everyone now.... abandoned by VMware !

A13x
Hot Shot

@einstein-a-go-g I really do not see what the issue is; times have changed and so have recommendations. It also baffles me why anyone would upgrade to major releases and not do a complete full rebuild of a host. On the next host refresh, just spec in some disks rather than SD cards.

The errors and log spew would have been detected after upgrading, making you stop the rollout.

If you can afford to run VMware, I am sure a few disks cost next to nothing. ESXi rebuilds can be done in minutes.

vbabic
Enthusiast

Even for vSAN, another local device is not a requirement; trace logs can be redirected to remote storage, (partly) to a syslog server, or limited in size.

Anyway, even the vSAN Ready Nodes (still on the HCL) have the option to boot from SD card (with no additional local disks other than vSAN disks). Imagine you buy a bunch of them today, trusting they are a sure thing to be checked and tested, and by the time they arrive, they are unsupported.

At least now I know what I'll say to the VMware sales team trying to sell Tanzu to me...sorry, unsupported...

vbabic
Enthusiast

Yes, the recommendation changed abruptly without prior notice. When things like this are done, the responsible thing to do is to announce, for example, that this is the last major version to support this configuration and the next one will not. VMware usually does this, but this was obviously not planned; it was the consequence of a major screw-up in planning and/or development.

When are you available to come and rebuild all our hosts free of charge? Of course, bring a bunch of boot devices with you... 😄

Really, rebuild all hosts instead of an upgrade? Is this a Microsoft forum? That used to be the benefit of using VMware, same host going through 4-5 major releases without problems during their lifetime (that was when releases were every year)

sbd27
Contributor

@A13x Almost my entire environment has diskless server configurations from Dell (you know, the parent company of VMware). So my servers do not have PERC or disk cages, and some are R730s that do not support BOSS cards. Dell also does not support installing PERCs into servers that came from the factory without them.

So now my VMware upgrade, which should have had almost zero cost, is costing me over $100k because I need to replace the R730 hosts and purchase new BOSS cards.

Also, BOSS cards have a 90-day turnaround right now.

And by the way, I have never had to do a complete host rebuild to do an upgrade.