danieldunn
Contributor

New ESXI host v7.0.2 - Bootbank cannot be found at path '/bootbank'

Hi, we have set up a new host which had been running OK for a week.

Today we noticed a VM could not be accessed through RDP and was not responding to ping requests.

We tried to view it through the web console, but could not.

We tried to power it off, which did not work.

We also found that all of the VMs on the host could not be viewed through the web console.

When looking in the logs we noticed that these errors started today:

hostd performance has degraded due to high system latency

Bootbank cannot be found at path '/bootbank' 

What do these mean and is there a way to fix this?

Thanks 

22 Replies
SIMACIT
Contributor

We are experiencing the same thing since 04-23-2021 on one of 3 hosts.

Also described here: https://www.reddit.com/r/vmware/comments/mtwgem/702_unresponsive_host/

Hope to find a solution soon ...

Andreas

danieldunn
Contributor

We believe this is an error caused by a faulty SD card reader or SD card, so we are having them replaced.

SIMACIT
Contributor

We did replace one of the two SD cards a few weeks ago on that host. Maybe it's that last SD card acting up.

How did you move running VMs from the faulty host?

danieldunn
Contributor

Yes it probably is.

We just bought the Dell server! It happened in the first week and then again in the second week, so it can probably take a while to occur.

There was nothing we could do in VMware; we couldn't console onto the VMs or shut them down.

I powered off the VMs I could remote desktop onto, restarted the host, then migrated them to another host.

 

SIMACIT
Contributor

So the host came back up after restart?

And you could vMotion as usual?

SIMACIT
Contributor

I can confirm that the solution from the OP in the thread I linked above works. We ran a hostd restart, a vpxa restart and services.sh restart, waited approx. 2 hours, and suddenly the host came back to life in vCenter.
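For reference, that restart sequence can be run from an SSH session on the host. A sketch, using the standard ESXi init scripts; note that services.sh briefly stops all management agents, so run it from SSH or the console rather than through vCenter:

```shell
# Restart the host agent (hostd) - serves the ESXi host client and API
/etc/init.d/hostd restart

# Restart the vCenter agent (vpxa) - handles communication with vCenter
/etc/init.d/vpxa restart

# If the host still does not reconnect, restart all management agents
services.sh restart
```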

danieldunn
Contributor

We actually think this is our issue now - https://kb.vmware.com/s/article/83376 

MiMenl
Enthusiast

We are currently experiencing the same behaviour after upgrading to 7.0 U2.

Observations:

The problem currently occurs on our VDI hosts (HPE DL380) with ESXi installed on SD card.

Judging by the posts, this is something most affected systems have in common (being installed on SD card).
Another weird thing is that after the system becomes unresponsive, you can still access the host through SSH.

When you do, every action that tries to list /bootbank or /altbootbank will time out and hang your SSH session, which is logical since the bootbank cannot be found.

Still, a normal ls shows the links in blue letters, which usually means the link is intact; it should turn red once there is nothing backing the link, so this is something weird.

Then after a hard reset through iLO (power down does not work), the system boots up properly without complaining. In the case of a faulty SD card you would expect unreadable blocks or file corruption, which does not seem to happen.

Another strange thing is that when you SSH to the machine, /bootbank is readable, as is /altbootbank; only, on all hosts we have had the issue on, /altbootbank is empty.
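A quick way to check the link state over SSH without triggering the hang. A sketch: readlink and ls -ld stat the symlink itself rather than the backing filesystem, so they are less likely to block than an ls inside a bootbank whose device is in APD:

```shell
# Show where the bootbank symlinks point, without reading their contents
readlink /bootbank
readlink /altbootbank

# -d stats the link itself instead of descending into the target
ls -ld /bootbank /altbootbank

# List the filesystems the host can actually still see
esxcli storage filesystem list
```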

Speculations:

I'm not too deep into the ESXi architecture, so this is based on assumptions.

/altbootbank being empty is rather strange, because after the update is validated, the backup.sh job should fill /altbootbank with a copy of /bootbank so the system can boot properly in case something happens to the /bootbank integrity. This kind of feels like the upgrade did not entirely finish, or faulted somewhere.

When a host fails, hardware status requests time out. vSAN stays intact, but the host drops out of the vSphere cluster after a long time. And even though vSAN stays intact (overall health report), actions involving vSAN time out and querying/browsing vSAN becomes impossible until the host is reset.

Now the weird thing: after upgrading, some new critical patches became visible. Once those are applied and the system is rebooted, /altbootbank is filled, and it seems in our case an NVMe driver was changed (brcmnvmefc).

We are not using NVMe, but since things are still addressed with FC path IDs, maybe the old driver was causing issues? The old driver was part of the HPE baseline, so it seems rather strange, but this is something we noticed.

We are now upgrading all baselines and monitoring the environment to see if the issue pops up again.

Hope this info helps someone, or sparks some ideas about what is causing this.

The /altbootbank issue might be related to the fact that we updated from 6.7 to 7.0.2, which does not have a rollback. But it should still be filled rather soon after the upgrade finishes successfully, and this also does not seem to happen until the extra patches are applied.

It's not code it's spaghetti, and who doesn't like pasta ?
damoccles
Contributor

Same issue on 3x Dell PowerEdge.

 

come on vmware, release 7.0.3 🙂

danieldunn
Contributor

Having spoken to Dell and VMware specialists, they aren't going to fix it any time soon, if at all.

You will need to install VMware on SSD.

It sounds like from 7.0.1 onwards, SD cards are no longer recommended.

SIMACIT
Contributor

We have been running fine since we replaced the SD cards and redirected the syslog and scratch partitions to the SAN.
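For anyone wanting to do the same, syslog and the scratch location can be redirected from the ESXi shell roughly like this. A sketch: the syslog target and datastore path below are placeholders, and the new scratch location only takes effect after a reboot:

```shell
# Send host logs to a remote syslog server (placeholder address)
esxcli system syslog config set --loghost='udp://syslog.example.local:514'
esxcli system syslog reload

# Point scratch at persistent shared storage (placeholder datastore path)
mkdir -p /vmfs/volumes/shared-datastore/.locker-esx01
esxcli system settings advanced set \
  -o /ScratchConfig/ConfiguredScratchLocation \
  -s /vmfs/volumes/shared-datastore/.locker-esx01

# Reboot the host so ScratchConfig.CurrentScratchLocation picks this up
```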

danieldunn
Contributor

In the VMware KB they mention that, but we had local SSDs so didn't fancy taking the risk.

There is another issue relating to USB devices, but it didn't affect us.

Ink_Global
Contributor

We've had 6x R630 hosts fail with this bug, all within a few days!
We've rebuilt with 7.0.2a but not holding our breath.
VMware's silence on this is deafening.

VGrytsenko
Contributor

Hi

Today we found the same problem on Cisco UCS equipment 😭

santeleco
Contributor

Followed the workaround in the link and it went well 🤞 on an HPE Gen9 server...

https://www.provirtualzone.com/vsphere-7-update-2-loses-connection-with-sd-cards-workaround/ 

TAB405ALZ
Enthusiast

SHD is showing this for one of four identical hosts:

DIAGNOSTICS.Host.KB70788: Bootbank cannot be found at path '/bootbank' and boot device is in APD state

'Issue detected on hostname : Bootbank cannot be found at path '/bootbank' is seen on the client

This issues is seen due to the boot device has failed to respond and entered APD state

In some cases, Host goes to non-responsive state & shows as disconnected from vCenter.

KB Number: 70788

The ESXi 7.0.2 host boots up without issue and I have gone as far as reseating the SD cards as well as the system card that holds the SD(s). 

Since the host is loading and never enters a non-responsive state, does this SHD finding warrant a server vendor call, or is there something else going on with /bootbank?

Any KBs that deep dive into /bootbank?

PatrickDLong
Enthusiast

@TAB405ALZ There are two separate but related issues with ESXi 7.0 U2 and SD-card (or other USB-based) boot device media.

Issue #1 - loss of connectivity to USB-based boot devices, APD to the boot device filesystem:

KB 83450 - ESXi hosts experiences All Paths Down events on USB based SD Cards while using the vmkusb driver https://kb.vmware.com/s/article/83450

KB 83963 - Bootbank cannot be found at path '/bootbank' errors being seen after upgrading to ESXi 7.0 U2 https://kb.vmware.com/s/article/83963  "USB devices have a small queue depth and due to a race condition in the ESXi storage stack, some I/O operations might not get to the device. Such I/Os queue in the ESXi storage stack and ultimately time out."

A host suffering from this condition can usually be brought back under control in order to perform remediation steps using procedures outlined here:  https://www.provirtualzone.com/vsphere-7-update-2-loses-connection-with-sd-cards-workaround/ credit to Luciano Patrao (@JailBreak).

 

Issue #2 - corruption of USB-based boot media devices due to continuous and high-volume I/O

KB 83376 - VMFS-L Locker partition corruption on SD cards in ESXi 7.0 https://kb.vmware.com/s/article/83376

KB 2149257 - High frequency of read operations on VMware Tools image may cause SD card corruption https://kb.vmware.com/s/article/2149257

 

Other generally informative related info/links/KB's:

If you want to continue using SD-card or other USB-based boot media, you can reduce your chances of encountering this issue only by minimizing I/O to that device: ensure that Scratch is set to a SAN datastore or local high-endurance media, and redirect references to the local vmTools bits on the boot device to another location, either by copying vmTools to a RAMdisk or by using a SharedLocker (both linked below). Even so, this will not 100% eliminate the issue; you will need to apply U3 when it is released. It should contain an updated vmkusb driver that hopefully will resolve these issues.

https://blogs.vmware.com/vsphere/2020/05/vsphere-7-esxi-system-storage-changes.html

https://blogs.vmware.com/vsphere/2020/07/vsphere-7-system-storage-when-upgrading.html

KB 2129825 - Installing and upgrading the latest version of VMware Tools on existing hosts https://kb.vmware.com/s/article/2129825

Redirect vmTools to SAN datastore - see section "Steps to set up /productLocker symlink" in  KB 2129825 - Installing and upgrading the latest version of VMware Tools on existing hosts  https://kb.vmware.com/s/article/2129825

Redirect vmTools to RAMdisk - KB 83782 - ToolsRamdisk option is not available with ESXi 7.0.x releases  https://kb.vmware.com/s/article/83782
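Per KB 83782, the RAMdisk redirection mentioned above comes down to a single advanced setting. A sketch, assuming a 7.0.x build where the option has been backported:

```shell
# Check whether the ToolsRamdisk option exists on this build
esxcli system settings advanced list -o /UserVars/ToolsRamdisk

# Serve the VMware Tools image from a RAMdisk at boot instead of
# reading it from the USB/SD boot device (takes effect after reboot)
esxcli system settings advanced set -o /UserVars/ToolsRamdisk -i 1
```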

Leon_Straathof
Contributor

When I read the official VMware response and KB articles, I see that the long-term future for the ESXi boot device is more durable storage, and that SD and other USB-based storage, although still supported, is considered legacy. For much existing hardware without local storage, this would mean using an alternative configuration as in the workarounds (putting parts in a RAMdisk or on the SAN). But I have a serious question: would a small but good-quality USB SSD do the trick as well? The reason I am asking is that for all existing hosts that have an internal USB port, this could be a simple and cheap drop-in hardware solution that would not require any configuration changes at all.

My current flash storage (Samsung) costs 23 euro a piece. Changing those to a USB SSD would only increase the cost to 31 euro a piece.

Of course, I am asking this for hosts without local storage and also without a local storage adapter, so the cost of adding both to get real dedicated local storage is a bit more.

PatrickDLong
Enthusiast

@Leon_Straathof I think the issue (aside from the "endurance" qualities of the physical media connected via USB, whether internal or external) is also related to the throughput, queue depth and transfer mode supported by the USB controller itself. Many of the systems on VMware's compatibility list for 7.x still use USB 2.0 controllers on the motherboard, meaning that their I/O capabilities to those types of boot devices are severely limited by the controller. I can't see why USB3-connected devices would not continue to be supported as boot devices. And while I've seen plenty of documentation describing the change in boot-device preference, I haven't seen any public documentation from VMware on WHY they are pushing this change; in particular, I've not seen anything from VMware that addresses the interface limitations of USB2 versus the capabilities of USB3, other than the small quote I pulled from KB 83963 in a prior post in this thread: "USB devices have a small queue depth". This is a pretty good read: https://www.linkedin.com/pulse/usb-30-compared-20-all-implementations-equal-dennis-mungai/

 
