VMware Cloud Community
SLCSam
Contributor

Host Image Remediation fails with error, but after reboot everything is fine

Hello, 

I am doing a minor upgrade of VMware ESXi from 7.0.1 to 7.0 U3c. I already have this image deployed on several of my hosts in other clusters and am trying to roll it out to a couple more clusters.

I know 7.0 U3c is old. I am only going to it because of some underlying hardware compatibility issues, and it is only a stepping stone on my upgrade journey for all of the VMware clusters I just inherited.

Ok, I am able to build the image, and my Lifecycle Manager already has all of the files for this update, including the vendor add-ons for the image. I'm using the HP ProLiant and Synergy add-ons for the two different clusters.

I've deployed this image to 4 hosts at this point, with some odd issues. On 1 of the 4 hosts the deployment worked fine: no errors, and everything is working without issue. For the other 3 I get a generic failure error in the remediation status. For the first two hosts that got this, I just rebooted them, and when they came back up they were on the new version 7.0.3 and passed all compliance, network, etc. tests. I have been running VMs on them for a couple of weeks without any issues. I wasn't able to track down an error; I haven't managed VMware in a while...

I just tried this upgrade again on a different cluster to try to get more information. I ran the upgrade and got the same generic failure to remediate. I checked lifecycle.log and found this error at the bottom: ERROR esximage.Errors.InstallationError: VMware_bootbank_esx-base_7.0.3-0.20.19193900: Failed to update bootloader: [Errno 28] No space left on device
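For anyone who wants to pull lines like this out of a long log, here is a minimal Python sketch. The regular expression is modeled only on the single error line quoted above, so other esximage errors may not match it, and the function name `find_install_errors` is made up for illustration. Errno 28 is the standard ENOSPC ("No space left on device") code.

```python
import re

# Pattern modeled on the one log line quoted in this post (assumption:
# other InstallationError lines follow the same "VIB: message: [Errno N] reason" shape).
INSTALL_ERROR = re.compile(
    r"esximage\.Errors\.InstallationError: (?P<vib>\S+): (?P<message>.*?)"
    r"(?:: \[Errno (?P<errno>\d+)\] (?P<reason>.*))?$"
)


def find_install_errors(log_text):
    """Return one dict per InstallationError line found in log_text.

    Each dict has the keys vib, message, errno and reason; errno and
    reason are None when the line carries no "[Errno N]" suffix.
    """
    errors = []
    for line in log_text.splitlines():
        match = INSTALL_ERROR.search(line)
        if match:
            errors.append(match.groupdict())
    return errors
```

To use it, read the log into a string (for example with `open(path).read()`, where `path` is wherever lifecycle.log lives on your host) and pass that string to `find_install_errors`.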

For this most recent host, I told the host to reboot, and once again it came up as 7.0.3 and everything appears to be working just fine.

Has anyone seen this in the past? Should I be concerned about the failure even though everything looks fine? It will be a while before I can upgrade past 7.0.3, because I need to resolve some other hardware issues first. There's no option to reapply the update or anything like that. The 3 hosts that had the error are all blades, but very different versions: one is in an HP Synergy chassis and the other 2 are in an HP C7000.

Thank you for any input.

 

7 Replies
Octopus_L4
Enthusiast

Hi,
Do you see the same behavior if the update is carried out manually?
"esxcli software profile update -p ESXi-7.0.0-xxxx-standard -d https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml"

Can you also try this: open the host web UI, go to Host > System > Swap, and enable swap on a VMFS datastore?

navina
Enthusiast

Based on the error "No space left on device", I believe some VIB on the ESXi host is taking up too much space.
Can you run the command "esxcli software vib list" and see if there are any VIBs that can be removed?
Otherwise, please provide the output of "ls -ltrh /bootbank".
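If it helps to see the numbers side by side, the free-space check can also be sketched in Python. This is only an illustration: the helper name `free_space_report` is made up, /altbootbank is assumed to sit alongside /bootbank, and the script has not been tested on an ESXi host. It relies only on the standard library's `shutil.disk_usage`.

```python
import os
import shutil


def free_space_report(paths=("/bootbank", "/altbootbank")):
    """Return size and free space in MiB for each path, or None if it is missing.

    The default paths are an assumption about the host layout; pass your own
    if the volumes are mounted elsewhere.
    """
    report = {}
    for path in paths:
        if not os.path.exists(path):
            report[path] = None  # volume not present on this machine
            continue
        usage = shutil.disk_usage(path)
        percent_used = 100 * usage.used / usage.total if usage.total else 0.0
        report[path] = {
            "total_mib": usage.total / 2**20,
            "free_mib": usage.free / 2**20,
            "percent_used": percent_used,
        }
    return report
```

Calling `free_space_report()` with no arguments checks the two default paths; a volume that is nearly full would fit the "No space left on device" error, but it does not by itself show which VIB is responsible.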

Regards,
Navin A
SLCSam
Contributor

I haven't tried a manual update after the failures. The two clusters were so different, my initial failures were in the lab, and 1 of them was a success, so I was optimistically assuming that it was erroring because it's just not a great lab environment. Also, there appear to be no issues, errors, etc. on the hosts, and they are now on the upgraded version...

I do have another cluster to perform an upgrade on in about 2 weeks.  I also found the article related to the datastore swap setting.  I am going to try that before I initiate the upgrade.

I can't find a great explanation about what enabling the datastore swap actually does. My plan is to enable it, try the update and then disable it if the upgrade goes well. 

I will also try the manual update on that host if I see issues. 

Thank you for the reply. 

 

SLCSam
Contributor

Here's the free space on the last host that failed.  Could it be lack of space in the bootbank volumes?  I did look at this after the upgrade and didn't realize the bootbanks were so small. 

The VIB list had maybe 3 that were "installed" before I performed the update yesterday. Maybe I could remove those 3. I can look into checking their size and whether I can remove them.

SLCSam_1-1700427226476.png

SLCSam_2-1700427268120.png

Thank you for the reply

 

SLCSam
Contributor

Well, I still don't have a great solution. I've upgraded two hosts since my last post. I enabled swap on the local datastore on both of them. They were different hardware. One succeeded without the error and the other got the error, so I don't think that's the fix.

Either way, both hosts booted up on the new version just fine and seem to be running.  The error I got on today's upgrade is below. 

All I can say for other people who may see this is that I've now been running on hosts that got this error for around a month or more and have not noticed any issues.

In the remediation status, if you expand it, you'll see errors similar to the ones in the screenshot below, without much detail. If you reboot the host and then run a compliance check against your image, everything looks good.

Remediation of cluster failed

Remediation failed for host xxxxxxx

An unknown error occurred while performing the operation

SLCSam_0-1701536268080.png

 

DanRobinsonHP
Enthusiast

What's your actual boot device?

Hopefully not still an 8 GB USB/SD device.

SLCSam
Contributor

I don't think so. It looks like it's using a local 256 GB flash device. The drives, partitions, and config are the same between my blade host that errored and my rack host that didn't have any issues.
