VMware Cloud Community
bo_busillo
VMware Employee

SD Boot issue Solution in 7.x

Issue: The host goes into an unresponsive state due to "Bootbank cannot be found at path '/bootbank'", and the boot device is in an APD state.

This issue occurs when the boot device fails to respond and enters an APD (All Paths Down) state. In some cases the host becomes unresponsive and shows as disconnected from vCenter.
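For anyone hitting this, a quick way to confirm the symptoms is to search the host logs for the bootbank and APD messages. The snippet below is only a rough sketch (the log paths and search strings are assumptions, not an official VMware diagnostic) and can be run from the ESXi shell:

#!/usr/bin/env python
# Rough sketch: scan ESXi logs for the bootbank/APD symptoms described above.
# The log paths and patterns are assumptions; adjust for your environment.
import re

LOG_FILES = ["/var/log/vmkernel.log", "/var/log/vobd.log"]
PATTERNS = [
    re.compile(r"bootbank", re.IGNORECASE),             # e.g. "Bootbank cannot be found at path '/bootbank'"
    re.compile(r"\bAPD\b|all paths down", re.IGNORECASE),
]

for path in LOG_FILES:
    try:
        with open(path) as log:
            for line in log:
                if any(p.search(line) for p in PATTERNS):
                    print("%s: %s" % (path, line.rstrip()))
    except IOError as err:
        print("could not read %s: %s" % (path, err))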

As of 7.0 Update 1, the format of the ESX-OSData boot data partition has changed. Instead of FAT, it uses a new format called VMFS-L, which allows much more, and much faster, I/O to the partition. That level of read and write traffic is overwhelming and corrupting many less capable SD cards.
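To see whether a host is already using the new layout, something like the following sketch lists the mounted filesystems and flags any VMFS-L volumes. It assumes it is run in the ESXi shell where the esxcli binary is available; the parsing is best-effort, not an official check:

#!/usr/bin/env python
# Minimal sketch: list filesystems reported by esxcli and flag VMFS-L volumes
# (the ESX-OSData format discussed above). Assumes the ESXi shell.
import subprocess

output = subprocess.check_output(
    ["esxcli", "storage", "filesystem", "list"]
).decode("utf-8", "replace")
print(output)

vmfsl = [line for line in output.splitlines() if "VMFS-L" in line]
if vmfsl:
    print("VMFS-L (OSData-style) volumes found:")
    for line in vmfsl:
        print("  " + line.strip())
else:
    print("No VMFS-L volumes reported on this host.")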

We have come across a lot of customers reporting bootbank errors (hosts booting from SD cards) and hosts going into an unresponsive state on ESXi 7.

Our VMware engineering team is gathering information for a fix, and a new vmkusb driver version is available for testing. The current workaround is to install version 2 of the vmkusb driver and monitor the host.
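To check which vmkusb VIB a host is running before and after applying the test driver, a quick sketch along these lines can help (it assumes the ESXi shell and that the driver VIB name contains "vmkusb"):

#!/usr/bin/env python
# Sketch only: report the currently installed vmkusb VIB so you can tell
# whether the test driver mentioned above is in place. Assumes the ESXi shell.
import subprocess

vibs = subprocess.check_output(
    ["esxcli", "software", "vib", "list"]
).decode("utf-8", "replace")

matches = [line for line in vibs.splitlines() if "vmkusb" in line.lower()]
print("\n".join(matches) if matches else "No vmkusb VIB found in the list.")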

The action plan for a future resolution is to replace the SD card(s) with a more capable device/disk, per the best practices in the Installation guide.

The version 7.0 Update 2 VMware ESXi Installation and Setup Guide, page 12, specifically says that the ESX-OSData partition "must be created on high-endurance storage devices".

https://docs.vmware.com/en/VMware-vSphere/7.0/vsphere-esxi-702-installation-setup-guide.pdf

You can also refer to the KB below:

Reference: https://kb.vmware.com/s/article/83376?lang=en_US

Resolution

VMware engineering has a fix that will be included in the next release, 7.0 U2 P03, which is planned for sometime in July 2021.

175 Replies
einstein-a-go-g
Hot Shot

Dell have issued a statement that they no longer recommend installing on SD cards, so if they supplied a server with SD cards for use with 7.0.2a - get a refund and send it back!

einstein-a-go-g
Hot Shot

Again, VMware just state "use high-endurance flash"; it's possible the days of SD card/USB installs for ESXi 7.0.2a are over!

einstein-a-go-g
Hot Shot

The issue is the wear level of the flash technology, or the quality of the flash technology.

This would apply to all things flash.

einstein-a-go-g
Hot Shot

We are currently checking ESXi versions old and new to compare read and write cycles to SD/microSD cards; we have connected logic analysers to servers whilst ESXi is running in real time to debug and collect data.

One thing is clear: if ESXi 7.0.2a is wearing out microSD/SD cards which are certified for 4K and 8K video transfer at 60fps, then ESXi *MUST* be doing some very heavy writing to the media. Considering we've always been sold the "i" as standing for Embedded, with the OS going memory resident, it's doing some serious writing, and it would not be long before SATA M.2 SSDs and BOSS cards are also worn out!

They have better wear, but the lifetime will be reduced significantly.

[Image: MicroSD card connected into server and logic analyser]

[Image: logic analyser decoding the protocol in real time with ESXi]

And before anyone in the thread gets funky asking WHY!!!! Because we can!!!!! That's what we do best: Embedded Electronic Debugging!!!!

sysadmin84
Enthusiast


@einstein-a-go-g wrote:

One thing is clear: if ESXi 7.0.2a is wearing out microSD/SD cards which are certified for 4K and 8K video transfer at 60fps, then ESXi *MUST* be doing some very heavy writing to the media. Considering we've always been sold the "i" as standing for Embedded, with the OS going memory resident, it's doing some serious writing, and it would not be long before SATA M.2 SSDs and BOSS cards are also worn out!


Agreed. If this is expected behavior with ESXi 7, why is VMware working on a patch? We have a small environment: we write our logs to our SAN, ESXi runs from RAM, and we use Dell-branded SD cards. How much I/O can there be to corrupt them? From what I've read, the majority is from the clustering service (vCLS) VMs.
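For what it's worth, one rough way to put a number on that I/O is to sample the boot device's write counters over an interval. The sketch below does this via esxcli; the device ID shown and the "Blocks Written" field label are assumptions you would need to verify on your own host:

#!/usr/bin/env python
# Hedged sketch: sample the boot device's write counters twice and print the
# delta. The device ID is a placeholder (check `esxcli storage core device list`)
# and the "Blocks Written" label is assumed from esxcli's output format.
import re
import subprocess
import time

DEVICE = "mpx.vmhba32:C0:T0:L0"   # hypothetical USB/SD boot device ID
INTERVAL = 60                     # seconds between samples

def blocks_written():
    stats = subprocess.check_output(
        ["esxcli", "storage", "core", "device", "stats", "get", "-d", DEVICE]
    ).decode("utf-8", "replace")
    match = re.search(r"Blocks Written:\s*(\d+)", stats)
    return int(match.group(1)) if match else 0

before = blocks_written()
time.sleep(INTERVAL)
after = blocks_written()
print("Blocks written to %s in %ds: %d" % (DEVICE, INTERVAL, after - before))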

PatrickDLong
Enthusiast

@einstein-a-go-g this debug analysis looks really interesting, although I admit I have no idea what I'm looking at on your real-time graph - can you explain a bit about the visualization, and have you been able to run old/new versions of ESXi to compare?  I'm also interested in the statement regarding I/O to the boot device - that the "majority is from the clustering service (vCLS) VMs".  I have suspected vCLS played a role as well - in fact I even suggested this very thing in one of my comments on @LucianoPatrão 's blog article about this subject - but I haven't seen that written anywhere else. Can you point to your sources for that information?  I'd like to read more.

sysadmin84
Enthusiast

Patrick, the info about the cluster service VMs is unfortunately only anecdotal, I have seen multiple people point to this across the threads I've been following. Here's one comment from a Reddit thread: "In environments with HA the little heartbeat vms write a lot so it kills SD cards. Ask me how I know....."

https://www.reddit.com/r/vmware/comments/nn1src/careful_when_upgrading_to_702_if_you_have_your/

lukaslang
Enthusiast

But why would the vCLS machines write to the SD cards? 8 or 32 GB cards do not provide a datastore because they do not have enough space. These little annoying things spread across all datastores, but I have never seen one running on SD, since that should be impossible.

Our test host with 7.0 U2a on SD has now run for 24 days without issues.

einstein-a-go-g
Hot Shot

We have no idea what is causing the issue at present; we will add the cluster services to our list of investigations.

einstein-a-go-g
Hot Shot

@PatrickDLong 

Sure, I'll explain: we can inspect the data in real time as it is read from and written to the media, either a microSD or SD card in the server, whilst ESXi is live and running.

The logic analyser can decode the data in real time and show it to us as hex or ASCII. We can compare and contrast what the OS is doing, reads and writes, with the media, and we can then compare this with other versions of ESXi. At present we are comparing:

ESXi-7.01 Build 17551050 HPE

ESXi-7.0U2a Build 17867351 HPE

We are using identical servers, and the same high endurance MLC media which was used when these servers failed after 46 hours!

We've used fresh new media and re-installed both of the above, and we are now testing and comparing logs. At present these hosts are standalone, i.e. not connected to vCenter Server and not running any VMs, but given the references in this thread to vCLS, we will examine that as well.

einstein-a-go-g
Hot Shot

@PatrickDLong 

I've just had a look at that link to the Reddit thread and blog; I think there are mixed issues here. The missing BOOTBANK issue seems different from corruption, or are they one and the same?

This is not what we have seen. We have seen the ESXi OS damaged and SD cards failing, e.g. due to bad hardware sectors and wear, resulting in a corrupt OS that will not boot.

My understanding, and I've demonstrated this many times, is that you should be able to remove the SD/USB from an ESXi host and the host will continue running, BUT if the MEDIA is disappearing as per that Reddit thread/blog, any heavy writes cause the OS to crash!

 

Why is ESXi now writing a lot? A change of function in the OS?

And the latency warning could be because the media has disappeared, not necessarily because the media is too slow.

 

And what exactly counts as one of these "high-endurance storage devices"?

 

 

LucianoPatrão


@einstein-a-go-g wrote:

@PatrickDLong 

Sure I'll explain, we can inspect the data in real time being read and written to the media, either microSD or SD card in the server, whilst ESXi is live and running.

The logic analyser can decode the data in real time and show it to us as hex or ASCII. We can compare and contrast what the OS is doing, reads and writes, with the media, and we can then compare this with other versions of ESXi. At present we are comparing:

ESXi-7.01 Build 17551050 HPE

ESXi-7.0U2a Build 17867351 HPE

We are using identical servers, and the same high endurance MLC media which was used when these servers failed after 46 hours!

We've used fresh new media, and re-installed both of the above, and we are now testing and comparing logs, at present these are standalone, e.g. not connected to vCenter Server or have any VMs running, but with info in this thread reference to vCLS, we will examine this nature as well.


Hi Andrew,

First of all, it's awesome that you're taking the time to do this testing and troubleshooting - great work.

But I have some doubts that you will get the information that is needed, particularly because you are not running any VMs and the ESXi hosts are not connected to a vCenter.

If you have a standalone ESXi host with vSphere 7.0 U2a, without any VMs on it, running on an SD card, it is very rare that you will see the issues. I have 2, and after almost 2 months there have been no issues to date.

Like I said before, what triggers the issue faster is importing OVF/OVA files to vCenter, upgrading VMware Tools, and also having a vCD instance running on that vCenter and using ESXi resources - then you get the issue triggered in 2-3 days max.

With the tests I did with VMware Tools upgrades, 24h was enough to trigger the issue on a particular host where VMs were upgraded.

Regarding vCLS, I don't have data to say whether it is the root cause or just another process that also triggers the issue. But yes, vCLS is doing some r/w of data on the partitions.

Also, running my Veeam backups can trigger this. I tested by disabling the backups for a couple of days, and no issues were found (or in some cases fewer than usual). After I enabled them again, the issues returned with the same frequency.

At the moment I have only 4 environments with 7.0 U2a running on SD cards: 2x HPE (DL360 G9/G10) clusters with 10 ESXi hosts each, 1 with 8x HPE (BL460c G10) hosts running vCloud Director, and 1 with 6x HPE (BL460c G10) hosts using vSAN.

In the first one I get issues 1-2 times a week, not very often. No VMware Tools upgrades or OVF/OVA imports are allowed on it, but it is a very active cluster with a lot of new VMs per week, removed VMs, a lot of snapshots, etc.

In the second one (vCD) I get it 5-6 times per week, sometimes more.

In the third environment (vSAN) it is very rare that I get the issues - maybe 2-3 times in a couple of weeks, sometimes none.

Besides the HPE server model, all have the same SD cards, etc.

All the rest of my vSphere 7.0 U2a hosts are running on local disks, so the issue does not apply there.

So this is an example of different environments and how they trigger the issue.

I would like to have more time to troubleshoot and test this issue properly, but I don't work for VMware 😉

Luciano Patrão

VCP-DCV, VCAP-DCV Design 2023, VCP-Cloud 2023
vExpert vSAN, NSX, Cloud Provider, Veeam Vanguard
Solutions Architect - Tech Lead for VMware / Virtual Backups

________________________________
If helpful Please award points
Thank You
Blog: https://www.provirtualzone.com | Twitter: @Luciano_PT
einstein-a-go-g
Hot Shot

@LucianoPatrão

Thanks for the additional info, we will include it in our test plan.

We need to walk before we can run and collect baselines; these are the first baseline collections in the first part of the test plan, and we have been capturing data from ESXi 4.x to ESXi 7.x. When we have collected enough data, we will move on with our test plan and connect the standalone hosts to vCenter Server. Slowly slowly catchy monkey.....

This will be the exact same condition under which our hosts failed after 46 hours on 7.0.2a.

"Like I said before, what will trigger faster the issue is Importing OVF/OVA files to vCenter, upgrading VMware Tools, and also if you have a vCD running on that vCenter and using ESXi resources, you get the issue triggered in 2/3 days max."

Again, are we looking at different issues here? We did not see any bootbank loss on either host; one host failed at 46 hours, the other at 58 hours. On host restart the SD cards were corrupted and throwing media errors, e.g. the VIBs were corrupted, and completing surface scans showed physical media errors.

We still have those cards and they are undergoing forensic testing.

Again, with our failure: two hosts connected to vCenter Server, with no VMs, failing after 46 and 58 hours.

No OVFs imported, NO VMware Tools, nothing; this is the condition we are trying to reproduce.

No VMs no backups either.

Standard HPE installs with no changes to configuration or redirection of logs.

We don't work for VMware either, BUT whatever the issue is, we will detect it and see it in the reads and writes to the media via the logic analyser. If we cannot reproduce the issue as we have seen it in our environment, we will look at:

 

1. vCLS

2. Import OVA

3. VMware Tools

But why should 1, 2 and 3 cause this? What is being excessively written to, and why?

 

With the above, did you see log files being written a lot, or excessive swapping to the SD card?

 

Hence why we are capturing reads and writes at the logic level, outside of ESXi, at the physical hardware.

We will incorporate your findings into our test plans. There is much data gathering to perform.

LucianoPatrão

@einstein-a-go-g  thanks for the update.

When I say I don't work for VMware, I mean I don't have much time to spend troubleshooting and testing this. I have a team to manage (spending stupid amounts of time in meetings these days) and hundreds of ESXi hosts to manage, so not much time to go deeper or build test labs to test these issues properly.

So it is good that you are doing this and will share your findings with the community.

Most of the log entries (a lot, and I mean a lot) are the ESXi host trying to access the SD cards and getting r/w errors because no SD card was found. There were no physical errors on the SD cards; I don't have a single SD card that was corrupted.

After the workaround and a reboot, the ESXi host is back in production. But the funny thing is that in a cluster with 10 ESXi hosts, I don't get the issue on the same server twice (at least not within weeks).

This week it is hosts A and B, next week C and D, then maybe in the third or fourth week it happens again on the same host.

Again, all coredump, scratch and log locations are stored on a datastore, not on the SD cards. We do this for all our ESXi hosts, regardless of whether they use SD cards or local disks.
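For anyone wanting to do the same redirection, a minimal sketch might look like the following; the datastore path is a placeholder and the esxcli options shown (syslog --logdir and the ScratchConfig.ConfiguredScratchLocation advanced setting) should be verified against your build rather than taken as the official procedure:

#!/usr/bin/env python
# Hedged sketch: point persistent logs and the scratch location at a datastore
# instead of the SD card. TARGET is a hypothetical path; adjust per host.
import subprocess

TARGET = "/vmfs/volumes/datastore1/.locker-esx01"   # placeholder datastore path

# Send persistent logs to the datastore instead of the boot media.
subprocess.check_call(["esxcli", "system", "syslog", "config", "set", "--logdir=" + TARGET])
subprocess.check_call(["esxcli", "system", "syslog", "reload"])

# Point the scratch location at the datastore (takes effect after a reboot).
subprocess.check_call(["esxcli", "system", "settings", "advanced", "set",
                       "-o", "/ScratchConfig/ConfiguredScratchLocation", "-s", TARGET])
print("Log dir and scratch location set to %s (scratch change needs a reboot)." % TARGET)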

einstein-a-go-g
Hot Shot

WELL, WE HAVE A BROKEN 7.0.2a server already, and it's only been 40 minutes!

This was not connected to vCenter Server, no VMs, NO VMware Tools, NO OVF/OVA.

Basically I suspect that any additional I/O just causes more reads/writes to the "system". VMFS-L seems to be the culprit, and I wonder if VMware Engineering did any actual physical testing across media, e.g. HDD/SSD/Flash/NVMe/SD/USB/microSD, before releasing it into the wild, or just introduced a new thing and assumed!

We are using high-endurance 3D NAND flash, rated for 4K and 8K 60fps high data rates, but I suspect that is not the issue!

We need to check the data dumps and compare across builds and servers.

 

einstein-a-go-g
Hot Shot

"When I say, I don't work for VMware, I mean, I have no much time to spend troubleshooting and testing this. I have a team to manage(spending stupid times in meetings these days) and hundreds of ESXi hosts to manage. So no much time to go deeper or build some test labs to test these issues properly."

we do that as well, that's why there are 100 hours in a day!

I think your issue is possibly connected; we are seeing corrupted SDs, but looking at the evidence now, I think I see a connection!

And HOW is VMware going to fix this? Umm, re-engineering, I think!

 

Or they have to issue a real statement.

einstein-a-go-g
Hot Shot

So it does state, in this KB:

There is no resolution for the SD card corruption as the hardware has failed.
An update to alleviate the problem is being planned for a future release.

Alternatively, once the new drive is installed, and ESXi has been reinstalled, you can immediately move the locker partition to a RAMdisk, per directions in High frequency of read operations on VMware Tools image may cause SD card corruption

Source

https://kb.vmware.com/s/article/83376
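For reference, a minimal sketch of that RAMdisk workaround could look like the following. It assumes the UserVars.ToolsRamdisk advanced option described in that KB exists on your build (verify before relying on it), and a reboot is needed afterwards:

#!/usr/bin/env python
# Sketch of the VMware Tools RAMdisk workaround referenced above: enable the
# UserVars.ToolsRamdisk advanced option so the Tools image is served from RAM
# rather than the SD card. Option name assumed from the KB; verify on your build.
import subprocess

subprocess.check_call(["esxcli", "system", "settings", "advanced", "set",
                       "-o", "/UserVars/ToolsRamdisk", "-i", "1"])

# Confirm the new value.
print(subprocess.check_output(
    ["esxcli", "system", "settings", "advanced", "list", "-o", "/UserVars/ToolsRamdisk"]
).decode("utf-8", "replace"))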

So VMware Engineering do recognize this as a fault and a bug!

But there is more going on here with ESXi than just VMware Tools etc.; it's the actual ESXi OS that is buggy.

These recent changes have not been regression tested, which has been confirmed by VMware; they did not know what the results would be on non-high-endurance flash with regard to wear levels!

Considering vendors have been selling SD/microSD solutions for many years, it would seem VMware are blaming the vendor because they told them in 2018 of their plans to change the OS, and the vendor is blaming VMware!

And unfortunately, those of us at the coal face are getting the **bleep**e!

 

einstein-a-go-g
Hot Shot

The technical term is fecked!!!!

We can now reproduce it at will!!!!

It's the ESXi 7.0.2a OS!!!!! (standalone!)


sysadmin84
Enthusiast

VMware has added some new info to the KB article about this problem. They now recommend moving the locker partition as a resolution, and they specifically mention low- and high-endurance SD cards:
https://kb.vmware.com/s/article/83376?lang=en_US

einstein-a-go-g
Hot Shot

@sysadmin84 

I've seen that KB, but interestingly it has gone "Page Not Found!" as of writing this at 00:07 UTC.

Oh, and we are using high-endurance cards; I don't believe that is the issue!

 

For a technical company, this is very vague:

 

  • You could use a better-performing replacement device that can handle the increased I/O

 

What, and how much, increased I/O? V30, V90, Class 3? And are the servers' SD card slots capable of reading and writing at V30/V90?
