VMware Cloud Community
JMcDowell
Contributor

Various Simultaneous Failures - Solved

Hello All:

I've been banging my head against this for a few days. As the new guy, I thought it might be best to turn to the experts now.

I'm running ESXi Essentials 7.0U3c on a standalone host. VCSA is installed on the same box so I can use Veeam Community Edition for backups. Everything had been working great for months, till recently.

I had a look at the host a few weeks back and found that the boot USB had disconnected. I shut down the host and used dd to clone the suspect USB drive to a new one. After I booted back up, everything seemed to be fine until the past few days.

My symptoms are as follows:

- If I put the host in Maintenance Mode and reboot, it's not in Maintenance Mode when it restarts.

- I keep having to reconfigure the coredump location and the logs location on persistent storage. Those settings keep getting lost, and the logs location keeps reverting to scratch.

- VCSA keeps disconnecting from the host, so my backups fail. When I try to reconnect I get the "The specified key, name, or identifier already exists" error on the host. I then re-enter all the info and it stays connected for about a day before it disconnects again.
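One way to pin down which of these settings actually revert is to capture them before and after a reboot. A minimal sketch, assuming shell access to the host (SSH or the local console); `settings_snapshot` is a hypothetical helper name, and the three `esxcli` namespaces are the usual ones for maintenance mode, syslog, and the coredump file:

```shell
# Print the three settings that keep reverting so they can be compared
# across a reboot. Only runs where esxcli exists (the ESXi shell).
settings_snapshot() {
    if ! command -v esxcli >/dev/null 2>&1; then
        echo "esxcli not found; run this in the ESXi shell" >&2
        return 127
    fi
    echo "== maintenance mode =="
    esxcli system maintenanceMode get
    echo "== syslog config =="
    esxcli system syslog config get
    echo "== coredump file =="
    esxcli system coredump file get
}
```

Redirect the output to a file on a datastore before rebooting, run it again afterwards, and compare the two. If the values change without anyone touching them, the host is probably not managing to save its configuration to the boot device.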

As for the VCSA disconnects, I had a look at https://kb.vmware.com/s/article/3824568, but seeing as this all started at once, I thought the underlying cause might be something else, especially since VCSA disconnects on its own.

I'm thinking that something may have been wrong or corrupted on the original USB and that I carried it over when I cloned it. That said, I cloned it weeks back, and this has all started in the past few days. What's my best method to back out of this mess? I had thought about running an upgrade install from a 7.0U3f ISO to see if that would fix it, but I don't want to get even deeper into the weeds here if that won't sort it out.

I've read a few things relating to bootbank issues, though I have no idea how to look into fixing that if that is the issue here.
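For anyone wondering how to look at the bootbank side, here is a minimal sketch of a check from the ESXi shell. It only reports where `/bootbank` and `/altbootbank` resolve. `check_bank` is a hypothetical helper, and treating a target under `/tmp` as suspect is an assumption based on commonly reported behaviour when the boot device drops offline, not something confirmed in this thread:

```shell
# Report where a bootbank path resolves.
# Assumption: a target under /tmp suggests the boot device is offline.
check_bank() {
    link="$1"
    if [ ! -e "$link" ]; then
        echo "$link: missing"
        return 1
    fi
    target=$(readlink -f "$link")
    case "$target" in
        /tmp|/tmp/*)
            echo "$link -> $target (suspect: boot device may be offline)"
            return 1
            ;;
        *)
            echo "$link -> $target"
            ;;
    esac
}

# On the host itself:
#   check_bank /bootbank
#   check_bank /altbootbank
```

A healthy host would normally show both paths resolving to volumes under `/vmfs/volumes`.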

3 Replies
robc_yk
Enthusiast

I wonder if any updates were applied while the USB was in a non-functioning state.

If you have another server, even for temporary use, I would set up ESXi on it and move the VMs to it, then redeploy the original server with a fresh install.

If you can get the VCSA and the ESXi host up, then copy out the critical settings: network, storage, etc.

To me it looks like either there is an issue with the USB data or a software update / firmware compatibility issue.

VMware did announce they were pulling support for booting ESXi from USB. I believe that has since been reversed, but at the end of the day I don't use it here. If you still need to, then make sure you are at least using an enterprise-grade USB device.


JMcDowell
Contributor

Thanks so much for the quick response!

I was sitting here thinking about backing up the configuration and doing a clean install, but running

vim-cmd hostsvc/firmware/sync_config

fails with:

(vmodl.fault.SystemError) {
faultCause = (vmodl.MethodFault) null,
faultMessage = <unset>,
reason = "Internal error"
msg = "Received SOAP response fault from [<<io_obj p:0x0000004608bf4348, h:5, <TCP '127.0.0.1 : 60649'>, <TCP '127.0.0.1 : 8307'>>, /sdk>]: syncConfiguration
A general system error occurred: Internal error"
}

I've been reading some posts about this being caused by the /Scratch/Downloads directory missing, but it's there. I've deleted and recreated it and still receive the same error.

I don't have it set for any automatic updating, so I doubt that's the issue.

The whole thing is running on the single box. I could load up another computer, but with an Essentials license I don't have vMotion or any of the other fun stuff available to help me remediate the problem.

Bottom line, even though the current boot USB looks OK and is new, I suppose it could be failing as well. You're absolutely right, I should have spent the money on an industrial-grade USB.

It's not a huge system (10 VMs). I could recreate the whole thing from scratch, but I'd rather not. I'd prefer to learn this stuff in a bit more depth and try to understand the problem.

I manually reconnected VCSA again this morning and manually ran some backups, so it's not panic time yet. Still plugging away to sort it out.

JMcDowell
Contributor

So in the end, from all the research I did, it would seem that the main culprit was booting off the USB.

I managed to get a config backup to run by manually creating the symlinks to BOOTBANK1 and BOOTBANK2 (/bootbank > BOOTBANK1, /altbootbank > BOOTBANK2). These didn't persist across reboots, but I was able to create them and run the config backup right away.
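The workaround above can be sketched roughly as follows. The `/vmfs/volumes/BOOTBANK1` and `/vmfs/volumes/BOOTBANK2` paths are assumptions about where the bank volumes are mounted (check with `ls /vmfs/volumes`), and `relink` and `run_config_backup` are hypothetical helper names. The two `vim-cmd` calls are the standard config sync and backup commands:

```shell
# Point a link at a bank volume, refusing if the volume is not there.
relink() {
    # $1 = link path, $2 = directory it should point to
    if [ ! -d "$2" ]; then
        echo "target $2 not found" >&2
        return 1
    fi
    # -n replaces an existing symlink instead of descending into it
    ln -sfn "$2" "$1"
}

# Recreate both links, then flush and bundle the host configuration.
# Volume paths are assumptions; adjust to what the host actually shows.
run_config_backup() {
    relink /bootbank    /vmfs/volumes/BOOTBANK1 || return 1
    relink /altbootbank /vmfs/volumes/BOOTBANK2 || return 1
    vim-cmd hostsvc/firmware/sync_config || return 1
    vim-cmd hostsvc/firmware/backup_config
}
```

Run `run_config_backup` from the ESXi shell and copy the resulting bundle off the host straight away, since the links do not survive a reboot.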

I did a clean install of 7.0U3c (the version the backed-up config came from) to an SSD, loaded up the backed-up config, crossed my fingers and rebooted. It worked! (nice)

I then updated to 7.0U3f and it seems OK.

I did have some more disconnects after that, caused by a few quirks that needed to be sorted.

VCSA complained that my host box had a duplicate IP (it didn't; it was the same box as before with the config reloaded). That error went away after a few reboots.

I also had disconnects caused by Veeam. It seems I didn't leave enough free space on one of the drives for the snapshots Veeam creates. When that happened, VCSA disconnected as well.

In any event, all good now. Thanks.
