bo_busillo
VMware Employee

SD Boot issue Solution in 7.x

Issue

The host goes into an unresponsive state due to "Bootbank cannot be found at path '/bootbank'" and the boot device is in an APD state.

This issue occurs when the boot device fails to respond and enters an APD (All Paths Down) state. In some cases, the host becomes unresponsive and shows as disconnected in vCenter.

As of 7.0 Update 1, the format of the ESX-OSData boot data partition has been changed. Instead of FAT, it uses a new format called VMFS-L. This new format allows much more and faster I/O to the partition. The level of read and write traffic is overwhelming and corrupting many less capable SD cards.

We have come across a lot of customers reporting bootbank errors (hosts booting from SD cards) and hosts going into an unresponsive state in ESXi version 7.

Our VMware engineering team is gathering information for a fix, and a new vmkusb driver version is available for testing. The current workaround is to install version 2 of the vmkusb driver and monitor the host.
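
For the "monitor the host" part of the workaround, a rough sketch of scanning collected log text for the bootbank message quoted above could look like this. It is only an illustration in Python: the function name `find_bootbank_errors` and the sample lines are assumptions made up for the example, not a VMware tool, and the exact wording in real logs may differ.

```python
from typing import Iterable, List

# Text taken from the error quoted in this post. Matching is case-insensitive
# and stops before the path, so quoting differences do not matter.
BOOTBANK_MARKER = "bootbank cannot be found at path"


def find_bootbank_errors(lines: Iterable[str]) -> List[str]:
    """Return every line that contains the bootbank error text."""
    hits = []
    for line in lines:
        if BOOTBANK_MARKER in line.lower():
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # Hypothetical sample lines, invented for illustration only.
    sample = [
        "host-a: storage path state changed",
        "host-a: Bootbank cannot be found at path '/bootbank'",
    ]
    for hit in find_bootbank_errors(sample):
        print(hit)
```

The function takes any iterable of strings, so an open file handle for an exported log can be passed in directly.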

The action plan for a future resolution would be to replace the SD card(s) with a capable device or disk, per the best practices in the installation guide.

The version 7.0 Update 2 VMware ESXi Installation and Setup Guide, page 12, specifically says that the ESX-OSData partition "must be created on high-endurance storage devices".

https://docs.vmware.com/en/VMware-vSphere/7.0/vsphere-esxi-702-installation-setup-guide.pdf

You can also refer to the below KB:

Reference: https://kb.vmware.com/s/article/83376?lang=en_US

Resolution

VMware engineering has a fix that will be in the next release, 7.02 P03, which is planned for sometime in July 2021.

175 Replies
einstein-a-go-g
Hot Shot

I think we can agree there is a serious amount of ill-feeling in this thread, and if this is a representation of the VMware installations out there, VMware Admins and Organisations feel fecking let down by VMware!

Right or Wrong, New or Old technology.

It's not what I expected, and we've been using their products since their inception!

 

PatrickDLong
Enthusiast

@A13x appreciate your thoughts, but one of your comments IMO highlights part of why so many VMware administrators are quite upset about this issue - WE were the QA.

"the errors and log spew would have been detected after upgrading to make you stop the roll out." ...by VMware QA and stopped the 7.0 U2a release to GA.  There, I fixed it for you. :-)

I honestly don't think most people would have a problem with a major configuration supportability change like this given enough warning and procurement cycle runway to spec replacement hosts with actual disks - but that type of change should only happen at a MAJOR release version.  This issue did not - the VMFS-L formatting change occurred at 7.0 GA, true enough, but no one running 7.0 U1 is having any issues with USB or SD card boot devices that I am aware of; something clearly changed in the I/O profile of the vmkusb driver released with 7.0 U2 that is causing issues with that class of devices.  Thankfully I've seen no issues so far on the diskless hosts I've patched with U2c.

Your statement regarding full and complete rebuild of a host is also a bit confusing.  I happen to agree that reinstalling from scratch is the cleanest method to upgrade between major versions - we've probably both been in this game long enough to see quite a few support issues caused by artifacts left over from previous installations.  But installing 7.0 from scratch would not have saved a diskless host from this issue eventually being triggered by patching to 7.02.  If you mean "rebuild" in the sense of retrofit existing hosts with additional hardware - that might make sense for some smaller implementations, but for larger environments in multiple fully-remote data centers the expense - in new physical equipment, travel, man-hours, and opportunity cost - would be simply staggering.  I should be able to upgrade my current hardware to the latest available version so long as I am compliant with VMware's HCL.  If VMware wants to stop supporting installation on USB/SD card media, they need to have given PLENTY of notice of a change like that coming for a future Major release AND figure out a way to incorporate that information into the HCL when selecting servers from various manufacturers.

After waiting an interminable length of time for the patch I'm trying to move on - I've wasted enough time playing whack a mole with this issue and with U2c I'm rather enjoying not waking up every morning and having to check in on which hosts can no longer see their boot device 😉

JailBreak
Hot Shot


@vbabic wrote:

Well this is great... Now there's a new article saying that having only SD card (or USB) is unsupported.

Now I am doubting even more that they actually fixed the issue (maybe just lessened the impact), obviously they changed the design too much without thinking or testing the currently supported hardware and it would be too much trouble for them to fix it now. So...dear customers, suddenly you're unsupported, tough luck! Oh wait, there's a simple workaround, just reinstall all your servers on new hardware...I am sure VMware will cover the cost...

https://kb.vmware.com/s/article/85615?lang=en_US

Great job!


No, please read carefully. It is not stating that, but rather that you should use persistent storage for .locker, coredump, logs, etc. That is what will not be supported if you do not use it. That is it, not the USB/SD cards.

But this is not new; that was the best practice anyway: if you use a USB/SD card, you should always move these to persistent storage.

The only problem I see here is when using vSAN with USB/SD cards: if the only storage you have is vSAN, then you have an unsupported system. We had the same situation before, but it was not explicitly unsupported, and it seems that in the future it will be.

In the long run? I am pretty sure the path is to remove any possibility of using USB/SD cards; that is what VMware will do in the long run. But that is different from what we have today and will have in the near future.


Luciano Patrao
ICT Senior Infrastructure Engineer
Tech Lead for VMware / Virtual Backups
________________________________
If helpful Please award points
Thank You
Blog: https://www.provirtualzone.com | Twitter: @Luciano_PT
vbabic
Enthusiast

Again, this is not about the scratch partition, that was always the case. Where do you see ".locker, coredump, logs" in the article?

"Please move installation to persistent storage"

"ESXi requires local persistent storage for operating system use, to store system state, configuration, logs, and live data"

"A system with only a SD-Card/USB boot device is operating in an unsupported state with the potential for premature corruption"

You really don't see anything new here? Even if all that was best practice, the sudden change to "unsupported" is the main issue. 

I am sure you followed all of those best practices you mention but I know from your website that you had a lot of PSODs because of this, why if this is nothing new? There are things on the boot disk that can't be redirected (so it couldn't have been a best practice) that previously could be on the SD Cards, but now they can't.

If they said when they released 7.0, SD Cards are deprecated and won't be supported in the next major release, that would be fine. No, they specifically stated they are still supported and what are the minimum sizes for upgrades and for new installs. Otherwise, I would have a lot less SD Cards in my servers by now...

lukaslang
Enthusiast

Now the link to https://kb.vmware.com/s/article/85615?lang=en_US shows pagenotfound. Interesting...

vbabic
Enthusiast

I'm not surprised... Fortunately, a screenshot was made 😉 Just in case someone calls us crazy for thinking VMware would ever do something like that..

Let's just hope they learned something and the new version of the article will be more "customer friendly"

lukaslang
Enthusiast

Well maybe they call it 7.5 instead of 7.0 U3 - "as in earlier versions, non-persistent storage was considered supported, now, systems with only a SD-Card/USB boot devices are considered unsupported" 🙈

Or the upgrade requires a fresh install like they did with the vcsa and for esxi you then need "high endurance fast low latency high iops flash media"

einstein-a-go-g
Hot Shot

I'm waiting for this one to be pulled!!!

Installing ESXi on a supported USB flash drive or SD flash card (2004784)

https://kb.vmware.com/s/article/2004784

which was updated August 2021 !

 

 

depping
Leadership

As far as I can tell the KB article was prematurely published, and I think (from what I have understood in terms of what we are planning) that it wasn't completely accurate either.

vbabic
Enthusiast

Thanks for the info Duncan, we eagerly wait for the new article..

Not that the "main" article about the issue is completely clear, saying that 6.7 is affected by the VMFS-L (!) corruption on SD cards...which also causes Skyline alerts on 6.7 hosts..

https://kb.vmware.com/s/article/83376

 

einstein-a-go-g
Hot Shot

@depping more shambles!!!!

sysadmin84
Enthusiast

I don't mind if the industry moves away from SD cards, but it could have been coordinated better. To be fair, I have now heard that VMware in 2018 did communicate to vendors that SD cards should not be used anymore. Imo two things should have happened that are very easy to implement: Vendors warn customers when configuring a server with SD cards, that it is not supported with ESXi 7 and the same on the ESXi 7 download page. It should have been stated in big red letters to not install it onto SD cards (page 12 of the installation guide isn't good enough imo).

I just configured a Dell server with an IDSDM and ESXi 7 to see what happens and Dell does not allow this configuration (anymore):

2021-09-03 14_51_24-PowerEdge R740 Rack Server _ Dell USA.png

einstein-a-go-g
Hot Shot

@sysadmin84 but that IDSDM statement of "not recommended" only appeared recently when all this started....

after Dell shipped out many R740 and R640 servers to many of our clients with IDSDM modules!!!!

 

 

vbabic
Enthusiast

I'm sorry, but that story sounds like a face saving exercise for VMware, because

1. Vendors obviously haven't heard that because they continued to sell and recommend SD Cards, on VMware's own HCL! Even if they ignored warnings from VMware, wouldn't they (VMware) say something about it publicly then so we, the customers, know about it?

2. VMware itself obviously hasn't heard that, because 2 years later, in 2020, when they released 7.0, they explicitly documented that SD Cards are fully supported in 7.0 the same way as in 6.7. The only change was raising the minimum size to 32 GB, and that only for new installs.

So the next opportunity to do what you say they should have done is the next major release, and even then it should be deprecated, not immediately unsupported (except maybe for new installs)

einstein-a-go-g
Hot Shot

We will have to wait with bated breath to see what the updated KB states.

WuGeDe
Enthusiast

I talked to Dell and HPE recently about this issue; they both referred to the blog entry linked below and told me that installing ESXi 7 on SD cards is still fine 🤔😕

https://core.vmware.com/resource/esxi-system-storage-changes#section1

lukaslang
Enthusiast

Has anyone else experienced U2c slowing the local disk down extremely? Installations of the NSX-T or HA agents take a very long time, even on SSDs connected via the chipset SATA controller (Wellsburg or Lewisburg).

JailBreak
Hot Shot

Just yesterday I deployed NSX-T / vSAN with the same version and did not notice any extra slowness.


A13x
Hot Shot

It sounds like your NSX appliance might need a db cleanup. Our T and V appliances always clog up after some time, and we clear the db down before any upgrade.
