VMware Cloud Community
bo_busillo
VMware Employee

SD Boot issue Solution in 7.x

Issue: The host becomes unresponsive with the error "Bootbank cannot be found at path '/bootbank'" and the boot device is in an APD state.

This issue occurs when the boot device stops responding and enters an APD (All Paths Down) state. In some cases, the host becomes unresponsive and shows as disconnected in vCenter.

As of 7.0 Update 1, the format of the ESX-OSData boot data partition has been changed. Instead of FAT, it uses a new format called VMFS-L. This format allows more, and faster, I/O to the partition. The resulting level of read and write traffic is overwhelming and corrupting many less capable SD cards.

We have come across a lot of customers reporting bootbank errors (on hosts booting from SD cards) and hosts going into an unresponsive state on ESXi version 7.

Our VMware engineering team is gathering information for a fix, and a new vmkusb driver version is available for testing. The current workaround is to install version 2 of the vmkusb driver and monitor the host.

The longer-term action plan is to replace the SD card(s) with a capable device or disk, per the best practices in the installation guide.

The version 7.0 Update 2 VMware ESXi Installation and Setup Guide, page 12, specifically says that the ESX-OSData partition "must be created on high-endurance storage devices".

https://docs.vmware.com/en/VMware-vSphere/7.0/vsphere-esxi-702-installation-setup-guide.pdf

You can also refer to the below KB:

Reference: https://kb.vmware.com/s/article/83376?lang=en_US

Resolution

VMware engineering has a fix that will be in the next release, 7.0 U2 P03, which is planned for sometime in July 2021.

175 Replies
LarryBlanco2
Expert

Any word on the imminent release of U2P03?  My team is tired of playing Whack-A-Mole. 🙂

Larry

 

bo_busillo
VMware Employee

Sorry, no specific date has been confirmed yet for the GA release. I would assume sometime in August 2021.

vmrulz
Hot Shot

We have sort of mitigated the issue by scripting reboots of cluster nodes. We also stopped Turbonomic from managing DRS in the cluster, which had appeared to significantly increase I/O according to the logs. Running esxcfg-rescan -d vmhba32 seems to work on hosts that are not fully disconnected from the cluster.

 

Here is the communication from support. Note the promise of patch 3 by mid-August. Clock is ticking, VMware!

"

 

Thank you for your time over the course of this SR:21237061007 and thank you for choosing VMware Products!

 

I will now proceed in placing this Support Request in an archived state. This state means the Support Request can be re-activated by replying to this mail or by calling VMware Customer Support at any stage within the next 21 days.

To ensure clarity on the resolution of your issue and as a record for yourself below is a summary of what we worked on:

 

Summary

ESXi 7 host frequently disconnecting from vcenter

 

Cause

2021-07-07T23:07:41.135Z cpu12:2097520)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...

2021-07-07T23:07:41.135Z cpu12:2097520)ScsiDeviceIO: 4315: Cmd(0x45d95fcd2100) 0x28, cmdId.initiator=0x43079ee36ac0 CmdSN 0x1 from world 4817311 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x5 D:0x0 P:0x0 Cancelled from path layer. Cmd count Active:1

2021-07-07T23:07:41.135Z cpu12:2097520)Queued:2

2021-07-07T23:07:41.136Z cpu27:4817311)VFAT: 5144: Failed to get object 36 type 2 uuid 5f525e1a-4f3300a9-443a-36db70100038 cnum 0 dindex fffffffecdate 0 ctime 0 MS 0 :Timeout

2021-07-07T23:07:41.179Z cpu6:4817326)ALERT: Bootbank cannot be found at path '/bootbank'

2021-07-07T23:07:41.770Z cpu22:2097521)ScsiPath: 8058: Cancelled Cmd(0x45b960955000) 0x0, cmdId.initiator=0x45393781bc58 CmdSN 0x0 from world 0 to path "vmhba32:C0:T0:L0". Cmd count Active:0 Queued:2.

2021-07-07T23:07:41.770Z cpu12:4784715)VMW_SATP_LOCAL: satp_local_updatePath:856: Failed to update path "vmhba32:C0:T0:L0" state. Status=Transient storage condition, suggest retrys..

 

Resolution

As you are running ESXi 7.0 Update 2 from an SD card, the host becoming non-responsive with the "/bootbank cannot be found" message is a known issue, and an action plan was shared with you regarding it.

 

The fix for the issue will be released in ESXi 7.0 patch 3, which is due in a couple of days, by mid-August at the latest. In the meantime, you can perform the following as a workaround:

 

1. Reboot the affected host. ESXi then starts talking to the SD card again, until the SD card is once more overwhelmed by the I/O sent by our kernel.

2. If rebooting the ESXi host is not an option and VMs are running, rescan the vmhba using the command: esxcfg-rescan -d vmhba32"
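
For anyone who wants to check their own logs for the same signature, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the patterns are derived solely from the log lines quoted in the support summary above (other builds may word them differently), and it only prints the rescan command from the workaround rather than running anything on a host.

```python
import re

# Patterns derived only from the log excerpt quoted above (assumption:
# other ESXi builds may phrase these messages differently).
BOOTBANK_ALERT = "Bootbank cannot be found at path '/bootbank'"
STATE_IN_DOUBT = re.compile(r'NMP device "(?P<device>[^"]+)" state in doubt')
ADAPTER = re.compile(r"(vmhba\d+):C\d+:T\d+:L\d+")


def scan_vmkernel_lines(lines):
    """Return (devices_in_doubt, bootbank_alert_seen) for an iterable of log lines."""
    devices = []
    alert_seen = False
    for line in lines:
        match = STATE_IN_DOUBT.search(line)
        if match and match.group("device") not in devices:
            devices.append(match.group("device"))
        if BOOTBANK_ALERT in line:
            alert_seen = True
    return devices, alert_seen


def suggested_rescan_commands(devices):
    """Build one 'esxcfg-rescan -d <adapter>' string per adapter (not executed)."""
    adapters = []
    for device in devices:
        match = ADAPTER.search(device)
        if match and match.group(1) not in adapters:
            adapters.append(match.group(1))
    return [f"esxcfg-rescan -d {adapter}" for adapter in adapters]


# Two lines copied from the support summary in this thread.
SAMPLE_LINES = [
    '2021-07-07T23:07:41.135Z cpu12:2097520)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...',
    "2021-07-07T23:07:41.179Z cpu6:4817326)ALERT: Bootbank cannot be found at path '/bootbank'",
]

devices, alert_seen = scan_vmkernel_lines(SAMPLE_LINES)
if alert_seen:
    for command in suggested_rescan_commands(devices):
        print(command)
```

The same functions can be fed the lines of an exported vmkernel log. The rescan is only the temporary workaround described above, not the fix.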

sysadmin84
Enthusiast

Seems like a patch is imminent:

Resolution
This issue is resolved in VMware vSphere ESXi 7.0 U2c. To download go to the Customer Connect Patch Downloads page.

https://kb.vmware.com/s/article/83376?lang=en_US

cbs44
Contributor

It's available within Lifecycle Manager in vCenter.

A13x
Hot Shot

Test environment patched; will see how it goes before moving on to prod. I hope VMware is not going through a bad patch with the updates/patches again like they did years ago. Patch one thing and introduce another bug just as bad...

bo_busillo
VMware Employee

Thanks to one of our awesome VMware TAMs (that's probably why every account should have a TAM covering it), I received this Skyline update, which proactively checks (vSphere-VMFS-L-SDCard) for potential VMFS-L Locker partition corruption with low-endurance boot devices on ESXi.

https://twitter.com/VMwareSkyline/status/1430246999475900417

Get a TAM and get Skyline rolling.

einstein-a-go-g
Hot Shot

And what does it do when it detects the corruption?

Automatically fix it? Email VMware Support, advising you not to reboot ever?

vmrulz
Hot Shot

Has anyone confirmed that P03 fixes the SD IO saturation?

einstein-a-go-g
Hot Shot

It's still a recommendation to use "High Endurance Flash" even with this patch!

It will be interesting to see if Dell/HPE retract their statements about SD cards!

Only time will tell if it fixes it, whatever it was! There seem to be many different scenarios in which it occurs.

For me, it crapped out 13 minutes after a new install on high-endurance flash media, with no VMware Tools and no VMs... I will try the same setup again and see if I can get it to corrupt the install!

 

 

A13x
Hot Shot

All seems fine for the majority of customers; however, I do have a few where Skyline detects possible SD card issues.

Only time will tell, but so far so good.

einstein-a-go-g
Hot Shot

And what does Skyline do or recommend?

vbabic
Enthusiast

It just points you to the KB article.

As it did for us even though we only have 6.7 hosts. I guess because the article states that 6.7 is also affected (with no resolution) even though it also states:

"Potential VMFS-L Locker partition corruption on SD cards in ESXi 7.0"

"Starting in ESXi 7.0, the boot partition is formatted as VMFS-L instead of FAT"

Does anyone read these articles before they publish them?

Unfortunately, it seems the days are long gone when we only had to wait for U1 to consider a new ESXi version stable. We obviously have to change our policy and consider it beta until at least U3.

A13x
Hot Shot

Skyline checks for potential issues with the new VMFS-L on low-endurance SD cards.

einstein-a-go-g
Hot Shot

How does Skyline KNOW that these are low-endurance SD cards?

Does it have some sort of AI?

And what does it do: automatically fix it, or just point you to a useless KB?

vbabic
Enthusiast

It doesn't know. It just sees an SD card and points you to the article; it does nothing else.

einstein-a-go-g
Hot Shot

@vbabic thanks

 

Exactly. I have no idea why everyone thinks this is the best thing since sliced bread!

DarkSider
Contributor

Hi,

Does anyone know if I can install this patch if I'm running the Dell EMC customized 7.0 U2 version? Usually I would wait until Dell releases their custom ISO/ZIP, but this issue is really annoying...

Thanks

einstein-a-go-g
Hot Shot

yes!

cbs44
Contributor

Same here: I'm using customized Dell ISOs and have successfully updated my hosts with the U2c patch using Lifecycle Manager.
