VMware Cloud Community
bo_busillo
VMware Employee

SD Boot issue Solution in 7.x

Issue: The host goes into an unresponsive state due to "Bootbank cannot be found at path '/bootbank'" and the boot device is in an APD state.

This issue occurs when the boot device fails to respond and enters an APD (All Paths Down) state. In some cases, the host becomes unresponsive and shows as disconnected in vCenter.
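
For anyone triaging this, here is a minimal Python sketch of how a saved copy of a host log could be scanned for the symptoms mentioned in this thread. It is only an illustration: the function name `find_boot_device_symptoms` and the sample lines are made up, and the two search strings are simply the messages quoted in this thread, so the exact wording in real logs may differ.

```python
import re
from typing import Iterable, List, Tuple

# Substrings taken from messages quoted in this thread (assumption: real log
# wording may differ between builds, so treat these as starting points).
PATTERNS = {
    "bootbank_missing": re.compile(r"Bootbank cannot be found at path '/bootbank'"),
    "state_in_doubt": re.compile(r"state in doubt", re.IGNORECASE),
}


def find_boot_device_symptoms(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line_number, symptom_name, text) for every matching log line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, name, line.rstrip()))
    return hits


if __name__ == "__main__":
    # Hypothetical sample lines, not copied from a real host.
    sample = [
        "host booted normally",
        "ALERT: Bootbank cannot be found at path '/bootbank'",
        "boot device state in doubt",
    ]
    for number, name, text in find_boot_device_symptoms(sample):
        print(f"line {number}: {name}: {text}")
```

A match only tells you the symptom is present; as noted elsewhere in this thread, similar messages can have different root causes.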

As of 7.0 Update 1, the format of the ESX-OSData boot data partition has been changed. Instead of FAT, it now uses a new format called VMFS-L. This new format allows much more, and much faster, I/O to the partition. The level of read and write traffic is overwhelming and corrupting many less capable SD cards.

We have come across a lot of customers reporting bootbank errors (on hosts booting from SD cards) and hosts going into an unresponsive state on ESXi version 7.

Our VMware engineering team is gathering information for a fix, and there is a new vmkusb driver version available for testing. The current workaround is to install version 2 of the vmkusb driver and monitor the host.

The action plan for a long-term resolution would be to replace the SD card(s) with a capable device/disk, per the best practices mentioned in the Installation guide.

The version 7.0 Update 2 VMware ESXi Installation and Setup Guide, page 12, specifically says that the ESX-OSData partition "must be created on high-endurance storage devices".

https://docs.vmware.com/en/VMware-vSphere/7.0/vsphere-esxi-702-installation-setup-guide.pdf

You can also refer to the below KB:

Reference: https://kb.vmware.com/s/article/83376?lang=en_US

Resolution

VMware engineering has a fix that will be included in the next release, 7.0.2 P03, which is planned for sometime in July 2021.

175 Replies
lukaslang
Enthusiast

Seems related to the AHCI integrated driver of ESXi. Today I reinstalled another server which has a RAID setup and there everything installed with normal speed.

mbartle
Enthusiast

I posted in another thread but for extra visibility, I will add it here as well

Neither ESXi 7.0.2c nor 7.0.2d fixed the SD card bug for us.

We run Dell FC640 Blades, the dual SD card firmware is at 1.15 (all other firmware is up to date). 

We had been running 7.0.1d without a single issue. Within 24 hours of upgrading to 7.0.2c, one node had an SD card die.

A day later, a second node had a card die.

I applied 7.0.2d and it did nothing. So we've rolled back once again to 7.0.1 and have ordered BOSS cards and M.2 drives.

I know many folks have had success with this, but I wanted to let people know that something is still causing cards to die. If anyone from VMware is reading this, I would be happy to provide logs to help diagnose this.

TL;DR: Stay away from 7.0.2 if you value your free time and enjoy stable servers.

einstein-a-go-g
Hot Shot

no surprises there!

let's wait for NVMe, SATADOM, and M.2 to get corrupted!

and then the fun will begin......

people close to/in VMware already know this, hence the recent statements about MOVE AWAY FROM SD/USB flash drives!

einstein-a-go-g
Hot Shot

Also, the original link has been updated, and 7.0 U3 states that the SD/USB configuration is now deprecated!

So maybe it was never fixed!!!!???

pkvmw
VMware Employee

For quite some time, especially since vSphere 7.0 GA, using SD cards hasn't been the preferred way going forward, even though that wasn't clearly stated anywhere. vSphere 7.0 Update 3 now makes the obvious official by deprecating boot from SD cards.

While it's still supported for vSphere 7.x as of today, support for booting from SD cards might be removed in a future major release. (It is unclear whether the ability to boot from such devices might be removed as well.)

I'm not sure where you got the data and evidence from, but from what I know and have seen, the issue introduced in 7.0 U2a was fixed in 7.0 U2c. Even when there are similar messages like "state in doubt", it doesn't necessarily mean it's the exact same underlying root cause. That behavior can be seen in many, many different scenarios, not just with the USB SD card bug.

Regards,
Patrik

LucianoPatrão


@einstein-a-go-g wrote:

Also, the original link has been updated, and 7.0 U3 states that the SD/USB configuration is now deprecated!

So maybe it was never fixed!!!!???


Yes, I'm pretty sure it was fixed. Across more than 100 ESXi servers, none had any issue after the patch was applied. Before that, it happened 2-3 times a week. So of course it is fixed.

But again, the patch will not fix corrupted SD cards, will not fix the use of crap SD cards, and will not fix a lack of best practices.

As for SD/USB cards no longer being supported, that's not how I read it.

"VMware is moving away from the support of SD cards and USB drives as boot media. ESXi Boot configuration with only SD card or USB drive, without any persistent device, is deprecated with vSphere 7 Update 3. In future vSphere releases, it will be an unsupported configuration. Customers are advised to move away from SD cards or USB drives completely. If that is not currently a feasible situation, please ensure a minimum of 8GB SD cards or USB drive is present and an additional minimum of 32 GB locally attached high endurance device available for ESX-OSData Partition"

I don't read here that they are no longer supported, now or in the future. Having only SD/USB, without any local storage for the ESX-OSData partition, will not be supported in the future. That is a different statement from saying that SD/USB will be completely unsupported in the future.
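
To make the distinction concrete, here is a rough Python sketch of the guidance quoted above. It is only my reading of that paragraph, not an official support check: the `BootLayout` class, the `classify` function, and the result strings are hypothetical names made up for illustration. Only the 8 GB and 32 GB minimums come from the quoted text.

```python
from dataclasses import dataclass
from typing import Optional

# Minimums taken from the quoted vSphere 7 Update 3 guidance.
MIN_BOOT_MEDIA_GB = 8       # "minimum of 8GB SD cards or USB drive"
MIN_OSDATA_DEVICE_GB = 32   # "minimum of 32 GB locally attached high endurance device"


@dataclass
class BootLayout:
    """Hypothetical description of a host that boots from SD/USB."""
    sd_usb_boot_gb: float
    # Size of the locally attached high-endurance device used for ESX-OSData,
    # or None when the host has SD/USB only and no persistent device.
    osdata_device_gb: Optional[float] = None


def classify(layout: BootLayout) -> str:
    """Map a layout onto the wording of the quoted guidance (illustrative only)."""
    if layout.osdata_device_gb is None:
        return "deprecated: SD/USB only, no persistent device"
    if layout.sd_usb_boot_gb < MIN_BOOT_MEDIA_GB:
        return "below guidance: boot media smaller than 8 GB"
    if layout.osdata_device_gb < MIN_OSDATA_DEVICE_GB:
        return "below guidance: ESX-OSData device smaller than 32 GB"
    return "meets quoted minimums"


if __name__ == "__main__":
    print(classify(BootLayout(sd_usb_boot_gb=16)))
    print(classify(BootLayout(sd_usb_boot_gb=16, osdata_device_gb=128)))
```

The point of the sketch is that only the first branch (SD/USB with no persistent device) corresponds to the configuration the quote calls deprecated.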

Luciano Patrão

VCP-DCV, VCAP-DCV Design 2023, VCP-Cloud 2023
vExpert vSAN, NSX, Cloud Provider, Veeam Vanguard
Solutions Architect - Tech Lead for VMware / Virtual Backups

________________________________
If helpful Please award points
Thank You
Blog: https://www.provirtualzone.com | Twitter: @Luciano_PT
einstein-a-go-g
Hot Shot

That's the issue: define "crap SD cards". Nobody knows what the vendors supplied!

You would think an enterprise SD card from HPE would be better than a $1 Walmart SD card!

LucianoPatrão


@einstein-a-go-g wrote:

That's the issue: define "crap SD cards". Nobody knows what the vendors supplied!

You would think an enterprise SD card from HPE would be better than a $1 Walmart SD card!


We don't use a single SD card supplied by the server vendors. Not one!

A couple of years ago we had many that broke, about 2-3 per week, so we decided to replace all the SD cards with better ones. Since then, I think I have replaced 1 or 2, nothing more.

LucianoPatrão

But one thing is clear: we will not buy any more servers with SD cards, that is for sure. As we replace servers, the new ones will boot from local disks or NVMe.

Today a 128 GB M.2 is cheap, almost the same price as a good SD card in the past.

vbabic
Enthusiast

Yes, that was clear as soon as these issues started. But what to do with all the current servers... Since somehow I doubt the ones who caused this mess (VMware) will cover the costs of retrofitting servers with new boot devices, is a (not shared) SAN LUN an option for the ESX-OSData partition?

The new blog post and the KB articles mention only locally attached devices, but the "summary" table in the blog post also has "Managed FCoE/iSCSI LUN" as an option, so which is true?

https://blogs.vmware.com/vsphere/2021/09/esxi-7-boot-media-consideration-vmware-technical-guidance.h...

Another useful thing to have would be a supported way of replacing a boot disk (without reinstall), since it's not only a question of buying new devices, but (AFAIK) there's no supported way of replacing a boot disk and a reinstall of all servers would probably be even more expensive (in man hours) than the hardware itself..

LucianoPatrão


@vbabic wrote:

Yes, that was clear as soon as these issues started. But what to do with all the current servers... Since somehow I doubt the ones who caused this mess (VMware) will cover the costs of retrofitting servers with new boot devices, is a (not shared) SAN LUN an option for the ESX-OSData partition?

The new blog post and the KB articles mention only locally attached devices, but the "summary" table in the blog post also has "Managed FCoE/iSCSI LUN" as an option, so which is true?

https://blogs.vmware.com/vsphere/2021/09/esxi-7-boot-media-consideration-vmware-technical-guidance.h...

Another useful thing to have would be a supported way of replacing a boot disk (without reinstall), since it's not only a question of buying new devices, but (AFAIK) there's no supported way of replacing a boot disk and a reinstall of all servers would probably be even more expensive (in man hours) than the hardware itself..


If we are talking about the bug, yes, I agree it was a **bleep** show. But if you are talking (like others) about new versions requiring a reinstall on something other than SD/USB cards, I don't see where the scandal is.

We have seen this time and time again with other products, and with ESXi regarding CPU support, etc. We used to have a lot of G5, then G6, then G7 servers, and now none of those are supported and some can't run the new versions. On those servers you can't just change the CPU; you need completely new servers, and no upgrade was possible, only new installations.

So why all the fuss now because we need to change servers, or add local disks or NVMe (almost the price of SD cards), and do a fresh install? It is the same thing, just with different hardware.

Anyone who has seen my comments and blog posts about this stupid issue and the way VMware handled it knows I was, and still am, very, very critical of something that was unacceptable. But criticizing these changes just because?

Hardware changes, systems change, operating systems change, hypervisors change. That is why this is a multibillion-dollar industry for everyone.

vbabic
Enthusiast

Hi,

I'm sorry, but there's a big difference between a new major release not supporting a server/CPU generation that is literally 10 years old and not sold for 7-8 years and suddenly (in an update) deprecating something that was until "yesterday" fully supported and is still being sold today! Even today, go look at the VMware's own HCL (!), vsan ready nodes for example and you will find nodes with SD cards fully supported even for version 7.0 Update 3! That means people are still buying them if they are not reading every blog and KB.

And, like us with a 1-year-old server (so expected to be in production for at least 4 more years), "tomorrow" they will not be able to upgrade to the next version. And not because they didn't check the HCL and bought 10-year-old hardware, but because of a software fiasco. It's normal to do a fresh install when you retire a 5-year-old server; it's not normal to be forced to do a HW upgrade and reinstall a 2-year-old server...

If you think that is just the way it always was and should be, I respectfully disagree...maybe it's expected for free software, it was never normal for VMware..

einstein-a-go-g
Hot Shot

@vbabic I have to agree, and we are facing the same discussions with clients at present. It is very difficult for implementers, consultants, and VMware partners right now!

The takeaway here is how a major change was implemented in an update, not in a major version change, e.g. 8.0!

LucianoPatrão


@vbabic wrote:

Hi,

I'm sorry, but there's a big difference between a new major release not supporting a server/CPU generation that is literally 10 years old and not sold for 7-8 years and suddenly (in an update) deprecating something that was until "yesterday" fully supported and is still being sold today! Even today, go look at the VMware's own HCL (!), vsan ready nodes for example and you will find nodes with SD cards fully supported even for version 7.0 Update 3! That means people are still buying them if they are not reading every blog and KB.

And, like us with a 1-year-old server (so expected to be in production for at least 4 more years), "tomorrow" they will not be able to upgrade to the next version. And not because they didn't check the HCL and bought 10-year-old hardware, but because of a software fiasco. It's normal to do a fresh install when you retire a 5-year-old server; it's not normal to be forced to do a HW upgrade and reinstall a 2-year-old server...

If you think that is just the way it always was and should be, I respectfully disagree...maybe it's expected for free software, it was never normal for VMware..


Again, SD/USB is and will continue to be supported, only with some changes.

And yes, it is a major change that was released within vSphere 7 (what is wrong here is that they should have made the partition changes in the first release and communicated this SD/USB guidance at the beginning of the release, not in an update).

So as long as we are running vSphere 7, we still have the option to use SD/USB devices without any issues (as long as best practices are in place). If you are talking about a vSphere 7.5 or a v8, there are some years until they launch. And even once they launch, vSphere 7 End of Technical Guidance is in 2027, and EOL will be at least 7-8 years away.

So again, I don't see the big issue here compared to other changes in the past.

That is my view, and I think that, looking at the past, it makes total sense. I'm not trying to win the argument; it is just my view and opinion.

LucianoPatrão


@einstein-a-go-g wrote:

@vbabic I have to agree, and we are facing the same discussions with clients at present. It is very difficult for implementers, consultants, and VMware partners right now!

The takeaway here is how a major change was implemented in an update, not in a major version change, e.g. 8.0!


From 6.0 to 6.5 and then to 6.7, there were big changes and, like I said, CPU support changes. Numbers don't mean anything. So it doesn't need to be a v8 to bring bigger changes.

Again, like I said, this kind of partition change should not be made in an update. On that I agree 100%.
