VMware Cloud Community
bo_busillo
VMware Employee

SD Boot issue Solution in 7.x

Issue: The host goes into an unresponsive state due to "Bootbank cannot be found at path '/bootbank'" errors, and the boot device is in an APD state.

This issue occurs when the boot device stops responding and enters an APD (All Paths Down) state. In some cases the host becomes unresponsive and shows as disconnected from vCenter.
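If you suspect a host is in this state and you still have SSH/ESXi Shell access, a couple of quick checks (a minimal sketch, not an official diagnostic procedure) can help confirm it; the log paths below are the standard ESXi locations:

    # Where does /bootbank point? On a healthy host it resolves to a volume
    # on the boot device; when the boot device is in APD the target is no
    # longer accessible.
    ls -l /bootbank /altbootbank

    # Look for the bootbank / APD messages described above
    grep -i "bootbank" /var/log/vmkernel.log | tail -20
    grep -i "APD" /var/log/vmkernel.log | tail -20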

As of 7.0 Update 1, the format of the ESX-OSData boot data partition has changed. Instead of FAT, it now uses a new format called VMFS-L, which allows much more, and much faster, I/O to the partition. That level of read and write traffic is overwhelming and corrupting many less capable SD cards.
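For anyone who wants to see this on their own hosts, a minimal check from the ESXi Shell (assuming 7.0 U1 or later); the ESX-OSData volume is reported with the VMFS-L filesystem type:

    # List mounted filesystems; on 7.0 U1+ the ESX-OSData volume appears
    # with a VMFS-L type alongside the bootbank volumes.
    esxcli storage filesystem list

    # Identify which physical device is acting as the boot device by
    # filtering the per-device detail output for boot-related attributes.
    esxcli storage core device list | grep -iE "Display Name|Is Boot"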

We have come across a lot of customers reporting bootbank errors (on hosts booting from SD cards) and hosts going into an unresponsive state on ESXi 7.

Our VMware engineering team is gathering information for a fix, and a new vmkusb driver version is available for testing. The current workaround is to install version 2 of the vmkusb driver and monitor the host.
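If support provides the test driver, the standard VIB workflow applies. A minimal sketch, assuming the async vmkusb driver is delivered as an offline bundle (the file name and datastore below are placeholders):

    # Check which vmkusb VIB is currently installed (name, version, vendor)
    esxcli software vib list | grep -i vmkusb

    # Install the offline bundle supplied by support, then reboot the host;
    # the path and bundle name are placeholders.
    esxcli software vib install -d /vmfs/volumes/datastore1/vmkusb-offline-bundle.zip
    reboot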

The action plan for a longer-term resolution is to replace the SD card(s) with a more capable device/disk, per the best practices in the Installation guide.

The version 7.0 Update 2 VMware ESXi Installation and Setup Guide, page 12, specifically says that the ESX-OSData partition "must be created on high-endurance storage devices".

https://docs.vmware.com/en/VMware-vSphere/7.0/vsphere-esxi-702-installation-setup-guide.pdf

You can also refer to the below KB:

Reference: https://kb.vmware.com/s/article/83376?lang=en_US

Resolution

VMware engineering has a fix that will be in the next release, 7.0.2 P03, which is planned for sometime in July 2021.

175 Replies
sysadmin84
Enthusiast

Strange, it was updated like 3 times yesterday, now they deleted it. 🤔 Well, the suggested resolution doesn't fix it anyway.

mbartle
Enthusiast

I just upgraded 2 hosts to v7 that run Dell dual SD cards. I noticed a few things:

1:  As soon as I added VMFS storage to one box, it automatically moved the /scratch or .locker to the HDD on its own

2: My other cluster server already had scratch going to a SAN disk and retained these settings.

I also followed the KB to move the Tools to RAM. One host has been upgraded for a week without any issues at all, and the other was done yesterday (but now takes 45 minutes to boot), so I've pulled it out of the cluster until VMware can figure out why it gets stuck for so long after loading the SATP_ALUA policy at boot.
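For reference, those two mitigations correspond to advanced settings that can be checked or changed from the ESXi Shell. A rough sketch, assuming a datastore named datastore1 and a host named esx01 (both placeholders); UserVars.ToolsRamdisk is the option name I believe the KB uses for keeping Tools in RAM, so verify it against the KB for your build:

    # Check where scratch is currently configured and where it resolves now
    esxcli system settings advanced list -o /ScratchConfig/ConfiguredScratchLocation
    ls -l /scratch

    # Point scratch at persistent storage (takes effect after a reboot);
    # the datastore path is a placeholder.
    esxcli system settings advanced set -o /ScratchConfig/ConfiguredScratchLocation \
        -s /vmfs/volumes/datastore1/.locker-esx01

    # Enable the "Tools in RAM" option (UserVars.ToolsRamdisk)
    esxcli system settings advanced set -o /UserVars/ToolsRamdisk -i 1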

I am a bit concerned hosts will stop working, so I've paused the upgrades on the rest of the servers. If one has performed the mitigations I listed above, is there still a chance the SD cards will stop working?

LucianoPatrão


@mbartle wrote:

...


If this version has a critical bug, why upgrade to it?

Also, I don't understand how VMware is still providing this version, given that it is faulty. I honestly can't understand that.

Luciano Patrão

VCP-DCV, VCAP-DCV Design 2023, VCP-Cloud 2023
vExpert vSAN, NSX, Cloud Provider, Veeam Vanguard
Solutions Architect - Tech Lead for VMware / Virtual Backups

________________________________
If helpful Please award points
Thank You
Blog: https://www.provirtualzone.com | Twitter: @Luciano_PT
sysadmin84
Enthusiast

@mbartle

Yes, I think the bug will still occur. We write our logs to a SAN and didn't upgrade any VMware Tools, and the problem still occurred. One of our hosts got hit immediately; with another one it took 2 months. If you haven't upgraded the VM hardware versions yet, I'd roll back: https://kb.vmware.com/s/article/1033604
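For anyone weighing that rollback option: the previous image stays in the alternate bootbank until the host is patched again, so you can check which build a revert would land on before committing to it. A minimal check from the ESXi Shell (the revert itself is done from the boot loader, as the KB describes):

    # Builds currently held in the two bootbanks; the altbootbank entry is
    # the image a boot-time revert would roll the host back to.
    grep -i build /bootbank/boot.cfg /altbootbank/boot.cfg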

@LucianoPatrão

True, considering how many hosts there are in the wild with SD cards, at the very least there should be a big red box on the download page advising not to upgrade when using SD cards. Better yet, there should've been a hotfix ages ago. I think the only reason this hasn't received more attention yet is that most admins prefer to take things slowly and are probably still on 6.7.

mbartle
Enthusiast

Hi Luciano

I did not know I would have these issues or I would not have upgraded. One host was a test server and it seems to have worked OK. I took my first prod server to 7.0.2 and it now takes 45 minutes to boot, stuck on "vmw_satp_alua loaded successfully".
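Once a host like that finally comes up, the boot-time messages in vmkernel.log usually show what it was actually doing during the wait; the module-load line on the console is often just the last thing printed before a long device or volume scan. A rough sketch, assuming shell access after the boot completes:

    # Find the ALUA module load, then read the timestamps of the entries
    # that follow it to see which step consumed the boot time.
    grep -n -i "vmw_satp_alua" /var/log/vmkernel.log
    less /var/log/vmkernel.log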

Then I happened to see the reports of the SD card problem. I did the HCL compatibility check and even a Skyline check, and never once did either say there are potential issues with SD cards.

I may just wait for 7.0.3. I don't think I want to do the other 8 hosts, because I really don't want to have to build them from scratch and then face potential server loss due to corrupt SD cards.

It seems this version is one gigantic mess. So glad I did not apply the upgrade to the whole cluster and chose to start with only 1 host.

mbartle
Enthusiast

Thank you. Since I only did one host, I may seriously consider rebuilding it back to 6.7 U3 and waiting for further fixes to come from VMware. I have an SR open for my long boot issue. I'll see what they say, but after reading all your posts, I have some concerns with 7.0.2.

einstein-a-go-g
Hot Shot

After long boot times we’ve seen flash fail!

bo_busillo
VMware Employee

The KB link is now active, and it looks like it was updated as of 7-13-21:

https://kb.vmware.com/s/article/83376?lang=en_US

einstein-a-go-g
Hot Shot

still a lot of gobble-de-**bleep**!

I've seen better articles written by an 8 year old playing Fortnite!

"If SD card is lower tolerant devices, we can reduce heavy access to SD cards by following below steps....."

and it makes out that the issue only occurs if you're using VMware Tools; it does not explain HOW an ESXi 7.0.2a server fails after 26 minutes of uptime on high-endurance flash media, as per the KB!

I'll never know. Maybe it's about time for VMware to include this in their Skyline Product Health Assessment tool, which they keep bragging is so good!

 

They could give a Predicted Failure Alert: your ESXi host server is going to crap out at the next reboot!

 

No_Way
Enthusiast

And today is the 15th, and nothing... there were some rumors inside VMware that U3 might possibly be released today.

Maybe the 15th of August or September 😁

depping
Leadership

Release dates are typically not shared, mainly because they change based on various factors. In this case, your source was/is wrong.

sbd27
Contributor

I guess Dell finally got their hooks into how VMware does business. This should be a zero-day, tier-1 issue. 7.0 Update 1 and 2 should have been taken down from VMware's downloads when this problem first crept up. The maddening part is how VMware seems to act like this is a low-tier issue. I had promised my management team that the VMware 7.0 upgrade project would be completed by April 2021; it's now July and I'm only 1/3 of the way done!

And my TAM's solution... "Can you roll back to 6.7?" WHAT? Sure, let's just waste a ton of man-hours getting back to where I am today, because VMware is treating this like a third-level bug.

Also, the embedded VMware solutions are horrible (at least Dell's IDSDMs and BOSS cards); they are not really redundant and are impossible to manage. When this issue killed one of my hosts, I tried to reinstall and the reinstalls still failed, even though the iDRAC says both cards are "Online", and since SD cards don't really re-format themselves, I had to swap the SD cards around, and then the host booted up to a version of 7.0 that was months old!

I have now recommended to my management team, and I advise everyone reading this to do the same, to no longer use embedded ESXi solutions and to buy servers with RAID cards and RAID0 (sorry, I of course meant RAID1) mirrored SSD disks. The cost difference is about 1%.

I really hope this update comes out soon and resolves this issue, it has really put me in a bad spot. 

 

LucianoPatrão


@sbd27 wrote:

...

I have now recommended to my Management Team, and I advise everyone reading this to do the same, to no longer used embedded ESXi solutions and buy servers with RAID cards with RAID0 mirrored SSD disks. The cost difference is about 1%.

I really hope this update comes out soon and resolves this issue, it has really put me in a bad spot. 

 


Yes, I know, I am in that spot. But thankfully I only upgraded some of our environments, not all. If I had upgraded everything to 7, or to 7 U2a, I would be in a really bad spot. With some rollbacks and by stopping upgrades/patches to U2a, I only have around 40 servers with this issue at the moment. Otherwise, I would have 5 times that.

But the servers I had running with vSphere 7 U1 had no issues for months. Only when I patched those to U2a did I start getting these issues. And yes, U2 was when VMware changed the bootbank and partitions for the ESXi OS. So anyone running vSphere 7 U1 should be OK. There were also some issues in U1 that were supposed to be fixed in U2, and they were, but that then triggered another very, very serious one.

And yes, new servers will now always have local disks.

Since the 15th of July was not a real date (I also had the same information that it could be the launch of U3), so when? There are many, many companies suffering from this issue, and systems engineers working hard so that systems continue to run without impact on production. And really, rollback is something that cannot be done for most environments. Some upgrades were planned for months until they were finished, like ours, and now roll everything back again?

Luciano Patrão

VCP-DCV, VCAP-DCV Design 2023, VCP-Cloud 2023
vExpert vSAN, NSX, Cloud Provider, Veeam Vanguard
Solutions Architect - Tech Lead for VMware / Virtual Backups

________________________________
If helpful Please award points
Thank You
Blog: https://www.provirtualzone.com | Twitter: @Luciano_PT
lukaslang
Enthusiast

This is a sad story. On the one hand they show us running ESXi on Raspberry Pis, and on the other we have to use "high endurance" flash media to power the hypervisor.

But in the end, U3 will hopefully solve this mess. If not, I will have to explain to my management that we bought a bunch of useless SD cards because ESXi 7 does not support the RAID controller with the existing SSDs in our existing servers, and because VMware decided to vMotion their QA to the customers.

FYI: Our SD card test blade has now been running for 29 days without issues (with a heavy workload).

einstein-a-go-g
Hot Shot

@lukaslang we would be interested to know which brand of media?

lukaslang
Enthusiast

It's an 8 GB HPE SD card in a BL460c Gen9 blade. The 8 GB card is not officially supported by HPE for ESXi 7, but the 32 GB is, per the QuickSpecs.

All we did was relocate scratch, productLocker and the log dump to the FC SAN (which we had also been doing for many years before). Since we saw many other strange issues during the upgrade process, this host was freshly installed.
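For completeness, those relocations can all be done with esxcli / advanced settings; a minimal sketch, assuming a shared datastore named SAN_DS01 and a host named esx01 (both placeholders). On 7.0 the productLocker location can be set via the UserVars.ProductLockerLocation advanced setting rather than the old manual symlink (verify the option name on your build):

    # Send host logs to persistent shared storage and reload syslog
    esxcli system syslog config set --logdir=/vmfs/volumes/SAN_DS01/esxi-logs/esx01
    esxcli system syslog reload

    # Point the VMware Tools repository (productLocker) at shared storage;
    # takes effect after a reboot.
    esxcli system settings advanced set -o /UserVars/ProductLockerLocation \
        -s /vmfs/volumes/SAN_DS01/productLocker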

einstein-a-go-g
Hot Shot

@lukaslang sorry to be the bearer of bad news, but we've had those fail!!!

And they are not high endurance! Also only Class 1.

Personally, I think the speed / increased I/O / high-endurance angle is a red herring and misleading.

It looks like ESXi is overwriting and writing across a page boundary!!!

e.g. self-destruct mode!!!

BUG!!!!

But for some reason the finger is pointed at the flash not being suitable!!!

PatrickDLong
Enthusiast

At least one of the following statements must be true regarding 7.0U2:
-VMware didn't realize the severity of the impact of the change in I/O profile to USB-based boot devices (likely)
-VMware didn't realize the volume of their install base using USB-based boot devices (unlikely)
-VMware didn't anticipate the blowback that this issue would cause from both clients and hardware vendors and the unbridled anticipation of a forthcoming patch release. (likely)

No serious person would have taken VMware's 7.x change, reclassifying USB-based boot devices as "Legacy" while other high-endurance methods are now "Preferred", to actually mean that USB-based boot devices are now "at risk of catastrophic failure in your vSphere 7.0U2 environment".

I have valued simplicity in my VMware hosts back to the GSX days: the fewer hardware components the better. I worked for YEARS to get all spinning disks out of my hosts to eliminate the most common failure point (aside from the occasional failed DIMM), only to have VMware pull a complete 180 with vSAN, which of course requires plentiful local disks. Oh well, AFA vendors got my $ instead of my host compute vendor and I've never regretted the decision. My entire vSphere environment runs on a large number of top-tier (read: the orange company) FC AFA storage arrays that are unbelievably easy to manage, and I've only replaced one single disk in an AFA over 6+ years. I REALLY don't want to get back in the business of installing controllers and local disks on my hosts unless it's absolutely necessary.

@sbd27 indicated "just buy RAID cards and mirrored SSD disks... cost difference is about 1%", which may be true for the up-front cost, but *certainly* not for the TCO of the entire environment. There would be three additional components (potential points of failure) in every host in my environment, which would all require regular firmware upgrades and break/fix management. I have a 300-node production environment split across remote data centers. 300 RAID cards, plus 600 SSDs, plus the travel costs and labor hours associated with installing all of that hardware, not to mention the labor required to reload ESXi on all of those hosts with their shiny new "Preferred" boot devices - and the opportunity cost of all of the other business projects that will sit idle while my staff and I accomplish all of this rigamarole - the total cost of an effort like this is ASTRONOMICAL.

@lukaslang I'm very interested to see where you read that your 8 GB HPE microSD card in the BL460c "is not officially supported by HPE for ESXi 7", as I have many of the same blades. According to the ESXi 7.0 Hardware Requirements doc, https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.esxi.install.doc/GUID-DEB8086A-306B-4239-BF... 8 GB micro-SD boot devices should meet the requirement, albeit as a "Legacy" storage upgrade scenario requiring use of an additional high-endurance device (which you have already said you do by relocating scratch, productLocker and LogDump to FC-SAN, as do I) per Niels' blog post here: https://blogs.vmware.com/vsphere/2020/07/vsphere-7-system-storage-when-upgrading.html

I'm patiently awaiting the U3 patch like everyone else, but I am HIGHLY skeptical that it will document the precise root causes of this issue and the exact methods the patch uses to mitigate them. Maybe VMware will surprise me with transparency 😉

einstein-a-go-g
Hot Shot

I think those are all true facts!

And they did no testing!

We have had very few failures of USB or SD cards since 2004 with ESXi, but then I cannot remember the last time I changed a spinning-rust disk in a SAN or NAS either; they seem to run for years now without failure!

sbd27
Contributor

@PatrickDLong So you are correct. I would not recommend replacing any current embedded ESXi solution, mainly because, at least with Dell, you can't! When you purchase a diskless server from Dell without a PERC card and drive cages, they do not support installing them afterwards; you are stuck.

What makes matters even worse for me (and I have to assume other customers) is that I have some R730s that are diskless with only the IDSDM solution, and the R730 (which is still an ESXi-supported server) does not support the BOSS card. If this fix does not work, I will have to replace servers I did not budget for in my upgrade project.

However, all new servers that I purchase will no longer use a diskless config. I can easily have a non-technical person replace a bad hot-swappable SSD RAID drive, but replacing a BOSS card or its attached SSD requires downtime and opening the hood of the server. No thanks!