VMware Cloud Community
DaveBowker
Contributor

How to reinstall a failed ESX server into its original cluster

Hi Everyone.

I've tried various searches and have failed miserably. Can someone kindly point me in the right direction with my issue?

I think I've managed to destroy the core file system of an ESX server. (I'm a PC person, and I believe I mistyped an mv command that destroyed some of the root folders on an ESX cluster host machine.)
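To give a flavour of the kind of slip I mean (this is a made-up example, not the exact command I typed, and the paths are invented), a single stray space in mv can do that much damage:

# Intended: move the contents of a scratch folder into a backup directory
mv /tmp/stuff/* /vmfs/volumes/storage1/backup/

# Mistyped with a stray space while the working directory is / :
# the bare * now expands to bin etc home lib ... and mv tries to
# relocate the root-level folders into the backup directory. The host
# keeps running on what is already open, until the next reboot when it
# can no longer find its root tree.
mv /tmp/stuff/ * /vmfs/volumes/storage1/backup/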

I hold my hand up and surrender. Doing this stupid thing meant I could not connect to the host in any way: both SSH and the console locked me out.

However, all congratulations to the developers: the host kept all the guests running. After connecting to each guest and shutting it down, then powering off the host, VMotion/DRS just re-activated all the running guests like a dream.

However, I now have a host which starts to boot, then fails to mount the root file system and drops to BusyBox. From there I can see only a limited number of folders; /home and a few others are missing.

Is there some way to repair an ESX server installation, or is there a recommended way to go about re-installing the software and bringing the server back into the cluster?

It still appears in the Infrastructure Client as (not responding), but I have limited options in the pop-up menu. I do not appear to be able to remove it, and I didn't want to blindly go and re-install the OS.

Thanks in advance (fingers crossed).

Dave.

7 Replies
Lightbulb
Virtuoso

Reinstall the OS. This will take all of 15 minutes (Maybe a bit longer to configure and add to VC). This really is the preferred method for dealing with a failed ESX host once you have the VMs up somewhere else.

DaveBowker
Contributor

Hi

Thanks for the quick reply, but how do I deal with the old entries in the VM Infrastructure Client? The cluster still knows about the original server. Will it just slot back in, or do I have to do something to it first?

I know from my Windows experience that re-installing a Windows server creates a completely new record in the domain. I'm very new to ESX/Linux re-installs.

Dave

Lightbulb
Virtuoso

I assume by "VM Infrastructure Client" you mean the VC server, correct?

If you are having trouble removing the old host from VC, then when you install the new host I would give it a separate FQDN (e.g. somehost.somcompany.com) and IP. This way there will be no issues with conflicts.

What options are available when you right-click on the host in VC?

Cameron2007
Hot Shot

If the VMs are on shared storage, you should be able to re-register them on another host within the cluster and bring them back up there. VMware recommends just rebuilding the server anyway, so re-install and re-add it to the cluster. There is an option to upgrade or install when you are re-installing the software. It may be better to disconnect the HBAs etc. during the install, but it should be OK.
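If you prefer the command line to the VI Client for the re-registration, something along these lines should work from an ESX 3.x service console on a surviving host (the datastore and VM names below are just examples):

# List the VMs already registered on this host
vmware-cmd -l

# Register a VM that lives on the shared datastore (path is an example)
vmware-cmd -s register /vmfs/volumes/san-lun01/myvm/myvm.vmx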

I have attached some screenshots here.

DaveBowker
Contributor

All VMs are on SAN storage, so they moved themselves automatically once I'd powered off the failed host.

My query is that the only options I have in VC for the failed host are Disconnect, Add Alarm, Permissions and the two reports (Summary and Performance).

I'm assuming there is some way to remove this failed cluster node so that it can be re-added?

Dave

Lightbulb
Virtuoso

Disconnect the node from VC, then add the rebuilt host to the cluster as a new node. Configure the same networks and storage and you are good to go. For purposes of experimental troubleshooting you can keep working on the flaky host if you have the time, but given that it is acting up I would think the best option is to get rid of it to prevent a future issue.
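If you would rather script the networking side than click through VC, a rough sketch from the service console looks like this (the vSwitch, port group, NIC and IP values are examples only; match them to the other cluster members):

# Recreate the vSwitch, attach its uplink and add the port groups
esxcfg-vswitch -a vSwitch1
esxcfg-vswitch -L vmnic1 vSwitch1
esxcfg-vswitch -A "VM Network" vSwitch1
esxcfg-vswitch -A "VMotion" vSwitch1

# Give the VMotion port group a VMkernel interface
esxcfg-vmknic -a -i 10.0.0.5 -n 255.255.255.0 VMotion

# Check the result against a healthy host
esxcfg-vswitch -l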

As you noted in your original post, all your VMs are safely on another host, and that is what is important.

DaveBowker
Contributor

Hi all,

Thanks for the help.

I was able to get access to all the VMs that were still running on the flaky host and manually shut them down, leaving the host empty of VMs.

However, VC still thought they were running. I then powered off the flaky host, which allowed VC to restart all the supposedly running VMs on other hosts.

VC still showed the blade as disconnected/not responding, and I wasn't able to remove it. So I disconnected the flaky server in VC, which then allowed me to remove it from VC.

I then re-installed ESX on the now-dead system. (When I had tried to get the flaky server back online by powering it on, it dropped into a BusyBox diagnostics prompt, and from there I could see lots of missing directories.)

I added the re-created server into VC and re-configured all the network switches to match the existing cluster members. I also set up all the iSCSI nodes that the cluster uses and added the server back into the cluster.
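For the record, the iSCSI part can also be done from the service console; this is roughly what it boils down to (the discovery IP and adapter name are examples and will differ per setup):

# Open the service console firewall for the software initiator and enable it
esxcfg-firewall -e swISCSIClient
esxcfg-swiscsi -e

# Add the same send-targets discovery address the other hosts use, then rescan
vmkiscsi-tool -D -a 192.168.10.20 vmhba40
esxcfg-rescan vmhba40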

And everything sprang back to life. I was even surprised to find that the VMs I had defined on the host's local storage re-appeared. This was because during the re-installation I told it not to overwrite any local VMFS volumes.
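If anyone repeats this, it is easy to confirm from the service console that the local VMFS volume really did survive the reinstall before expecting the local VMs to come back (vdf shows the VMFS volumes that plain df misses):

# List datastores and check the local volume is still there
vdf -h
ls /vmfs/volumes/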

The developers have done a really good job with this recovery procedure. I've been very impressed with this whole thing.

Once more thanks to you all for the extra pointers.

Dave
