Issue: The host goes into an unresponsive state with the error "Bootbank cannot be found at path '/bootbank'", and the boot device is in an APD state.
This issue occurs when the boot device stops responding and enters an APD (All Paths Down) state. In some cases, the host becomes unresponsive and shows as disconnected from vCenter.
As of 7.0 Update 1, the format of the ESX-OSData boot data partition has been changed. Instead of using FAT it is using a new format called VMFS-L. This new format allows much more and faster I/O to the partition. The level of read and write traffic is overwhelming and corrupting many less capable SD cards.
We have come across a lot of customers reporting bootbank errors (hosts booting from SD cards) and hosts going into an unresponsive state on ESXi version 7.
Our VMware engineering team is gathering information for a fix, and a new vmkusb driver version is available for testing. The current workaround is to install version 2 of the vmkusb driver and monitor the host.
The long-term action plan is to replace the SD card(s) with a capable device or disk, per the best practices in the Installation and Setup Guide.
The version 7.0 Update 2 VMware ESXi Installation and Setup Guide, page 12, specifically says that the ESX-OSData partition "must be created on high-endurance storage devices".
You can also refer to the below KB:
VMware engineering has a fix that will be in the next release of 7.02 P03 which is planned for sometime in July 2021.
We have partly mitigated the issue by scripting reboots of cluster nodes. We also stopped Turbonomic from managing DRS in the cluster, which had appeared to significantly increase I/O according to the logs. Running esxcfg-rescan -d vmhba32 seems to work on hosts that are not fully disconnected from the cluster.
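For anyone who wants to script the rescan rather than type it by hand, here is a minimal sketch. It assumes a Python 3 interpreter is available wherever you run it, and the function names (rescan_command, rescan) and the adapter-name pattern are my own invention for illustration. The only thing taken from this thread is the command itself, esxcfg-rescan -d vmhba32. It defaults to a dry run so nothing is executed until you ask for it.

```python
import re
import subprocess

# Adapter names in this thread look like "vmhba32"; the pattern is an assumption.
ADAPTER_RE = re.compile(r"^vmhba\d+$")


def rescan_command(adapter):
    """Return the argv list for the rescan command quoted in this thread."""
    if not ADAPTER_RE.match(adapter):
        raise ValueError("unexpected adapter name: {!r}".format(adapter))
    return ["esxcfg-rescan", "-d", adapter]


def rescan(adapter, dry_run=True):
    """Print the command (dry run) or execute it and return its exit code."""
    cmd = rescan_command(adapter)
    if dry_run:
        print(" ".join(cmd))
        return 0
    return subprocess.call(cmd)


rescan("vmhba32")  # dry run: only prints the command
```

Call rescan("vmhba32", dry_run=False) on the host itself to actually run it. Your adapter may not be vmhba32, so check which vmhba your boot device sits on first.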
Here is the communication from support. Note the promise for U3 by mid-August. The clock is ticking, VMware!
Thank you for your time over the course of this SR:21237061007 and thank you for choosing VMware Products!
I will now proceed in placing this Support Request in an archived state. This state means the Support Request can be re-activated by replying to this mail or by calling VMware Customer Support at any stage within the next 21 days.
To ensure clarity on the resolution of your issue and as a record for yourself below is a summary of what we worked on:
ESXi 7 host frequently disconnecting from vCenter
2021-07-07T23:07:41.135Z cpu12:2097520)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...
2021-07-07T23:07:41.135Z cpu12:2097520)ScsiDeviceIO: 4315: Cmd(0x45d95fcd2100) 0x28, cmdId.initiator=0x43079ee36ac0 CmdSN 0x1 from world 4817311 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x5 D:0x0 P:0x0 Cancelled from path layer. Cmd count Active:1
2021-07-07T23:07:41.136Z cpu27:4817311)VFAT: 5144: Failed to get object 36 type 2 uuid 5f525e1a-4f3300a9-443a-36db70100038 cnum 0 dindex fffffffecdate 0 ctime 0 MS 0 :Timeout
2021-07-07T23:07:41.179Z cpu6:4817326)ALERT: Bootbank cannot be found at path '/bootbank'
2021-07-07T23:07:41.770Z cpu22:2097521)ScsiPath: 8058: Cancelled Cmd(0x45b960955000) 0x0, cmdId.initiator=0x45393781bc58 CmdSN 0x0 from world 0 to path "vmhba32:C0:T0:L0". Cmd count Active:0 Queued:2.
2021-07-07T23:07:41.770Z cpu12:4784715)VMW_SATP_LOCAL: satp_local_updatePath:856: Failed to update path "vmhba32:C0:T0:L0" state. Status=Transient storage condition, suggest retrys..
As you are running ESXi 7.0 Update 2 from an SD card, the host becoming unresponsive with the "/bootbank cannot be found" message is a known issue, and an action plan was shared with you regarding it.
The fix for the issue will be released in ESXi 7.0 patch 3, which is due in a couple of days, by mid-August at the latest. In the meantime you can perform the following as a workaround:
1. Reboot the affected host. ESXi then starts talking to the SD card again until the card is once more overwhelmed by the I/O sent by our kernel.
2. If rebooting the ESXi host is not an option and VMs are running, rescan the vmhba using the command: esxcfg-rescan -d vmhba32
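To check whether a host is showing the same signature as the log excerpt in the support mail above, something like the following could be used. It is only a sketch: the function name scan_log and the matching rules (look for the bootbank ALERT text, and pull the vmhba name out of lines mentioning "state in doubt" or a failure) are my assumptions, not anything VMware has published. The sample lines are copied from the excerpt above.

```python
import re

BOOTBANK_ALERT = "ALERT: Bootbank cannot be found at path '/bootbank'"
# Matches quoted device/path names such as "mpx.vmhba32:C0:T0:L0" or "vmhba32:C0:T0:L0".
PATH_RE = re.compile(r'"(?:mpx\.)?(vmhba\d+):C\d+:T\d+:L\d+"')


def scan_log(lines):
    """Return (timestamps of bootbank alerts, adapters named in failing lines)."""
    alerts = []
    adapters = set()
    for line in lines:
        if BOOTBANK_ALERT in line:
            # The timestamp is the first space-separated field of the line.
            alerts.append(line.split(" ", 1)[0])
        match = PATH_RE.search(line)
        if match and ("state in doubt" in line or "failed" in line.lower()):
            adapters.add(match.group(1))
    return alerts, sorted(adapters)


SAMPLE = """
2021-07-07T23:07:41.135Z cpu12:2097520)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...
2021-07-07T23:07:41.179Z cpu6:4817326)ALERT: Bootbank cannot be found at path '/bootbank'
2021-07-07T23:07:41.770Z cpu12:4784715)VMW_SATP_LOCAL: satp_local_updatePath:856: Failed to update path "vmhba32:C0:T0:L0" state. Status=Transient storage condition, suggest retrys..
"""

alerts, adapters = scan_log(SAMPLE.strip().splitlines())
print(alerts, adapters)
```

On a real host you would feed it the lines of the vmkernel log (normally /var/log/vmkernel.log) instead of SAMPLE. The adapter it reports is the one to pass to the rescan command in step 2.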
Test environment patched; will see how it goes before moving on to prod. I hope VMware are not going through a bad patch with the updates/patches again like they did years ago. Patch one thing and introduce another bug just as bad...
Thanks to one of our awesome VMware TAMs (that's probably why every account should have a TAM covering them), who provided me with this Skyline update. It proactively detects vSphere-VMFS-L-SDCard, i.e. potential VMFS-L Locker partition corruption with low-endurance boot devices on ESXi.
get a TAM & get Skyline rolling
it's still a recommendation to use "High Endurance Flash" even with this patch!
it will be interesting to see if Dell/HPE retract their statements about SD cards!
Only time will tell, if it fixes it, whatever it was!!!! There seem to be many different scenarios which occur.
For me, it crapped out 13 minutes after a new install on high-endurance flash media, with no VMware Tools and no VMs! I will try the same situation and see if I can get it to corrupt the install!
It just points you to the KB article.
As it did for us even though we only have 6.7 hosts. I guess because the article states that 6.7 is also affected (with no resolution) even though it also states:
"Potential VMFS-L Locker partition corruption on SD cards in ESXi 7.0"
"Starting in ESXi 7.0, the boot partition is formatted as VMFS-L instead of FAT"
Does anyone read these articles before they publish them?
Unfortunately, it seems that long gone are the days when we only had to wait for U1 to consider a new ESXi version stable. We obviously have to change our policy and consider it beta until at least U3.
Does anyone know if I can install this patch if I'm running the DellEMC customized 7.0U2 version? Usually I would wait until Dell releases their custom ISO/ZIP, but this issue is really annoying...