VMware Cloud Community
Chrissy_boy
Contributor

ESXi hosts unmanageable after datastore rescans

We recently carried out an SRM array failover. While failing back, SRM came up with errors saying there were problems rescanning datastores. Once the rescan had timed out, we noticed that 6 out of 7 ESXi hosts were no longer manageable and their local storage is inactive!

Connected to the servers via iDRAC and tried restarting the management agents, but it just says shutting down! Also logged in via SSH and restarted them manually, but that made no difference. Does anyone have any ideas? This is a production site.

1 Reply
continuum
Immortal

First aid: connect to each host via iLO or iDRAC.
Get a list of the VMs that are still running with localcli vm process list.
Shut down all VMs with localcli - do not attempt to move VMs to the last host that is still alive.
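
The two localcli steps above could be scripted roughly as follows. This is only a sketch: it assumes that "shut down with localcli" means `localcli vm process kill --type=soft`, and that `localcli vm process list` prints one `World ID: <n>` line per VM. The names `world_ids`, `stop_vm` and `DRY_RUN` are made up for illustration. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: list running VMs and request a soft stop for each one.
# Assumption: each VM in the list output has a line "World ID: <number>".

# Read `localcli vm process list` output on stdin, print the world IDs.
world_ids() {
    sed -n 's/^ *World ID: *\([0-9][0-9]*\).*$/\1/p'
}

# Print the stop command (DRY_RUN=1, the default) or run it (DRY_RUN=0).
stop_vm() {
    cmd="localcli vm process kill --type=soft --world-id=$1"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

# Only act on a host that actually has localcli.
if command -v localcli >/dev/null 2>&1; then
    localcli vm process list | world_ids | while read -r id; do
        stop_vm "$id"
    done
fi
```

Review the printed commands first, then rerun with `DRY_RUN=0`. If a VM ignores the soft stop, `--type=hard` is the next step up, but that is a judgment call for each VM.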

If you use shared storage, use the last host that still responds to SSH to back up the partition tables for ALL LUNs. Print the tables to a file with partedUtil. DO NOT SKIP THIS STEP.
Once you have exported all the partitioning info, you should be able to rebuild the partition tables in case you have to.
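
One way to print all partition tables to a file is a loop over the device nodes with `partedUtil getptbl`. This is a sketch under assumptions: `DISK_DIR`, `OUT` and the helper `whole_disks` are illustrative names, and the filter assumes partition nodes end in `:<number>` while `vml.*` entries are aliases of the same devices.

```shell
#!/bin/sh
# Sketch only: dump the partition table of every disk device to one file.
# DISK_DIR and OUT are illustrative defaults, not values from this thread.
DISK_DIR="${DISK_DIR:-/vmfs/devices/disks}"
OUT="${OUT:-/tmp/ptbl-backup.txt}"

# Print whole-disk device nodes in the given directory. Skips partition
# nodes (names ending in ":<n>") and vml.* aliases of the same devices.
whole_disks() {
    for dev in "$1"/*; do
        [ -e "$dev" ] || continue
        case "${dev##*/}" in
            *:[0-9]|*:[0-9][0-9]|vml.*) continue ;;
        esac
        echo "$dev"
    done
}

# Only act on a host that actually has partedUtil.
if command -v partedUtil >/dev/null 2>&1; then
    whole_disks "$DISK_DIR" | while read -r dev; do
        echo "=== $dev"
        partedUtil getptbl "$dev"
    done > "$OUT" 2>&1
fi
```

Copy the resulting file off the host before any reboot. `/tmp` on ESXi is a RAM disk, so a file left there should not be expected to survive.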

Reboot all hosts as soon as possible.

I would first power off the 6 unresponsive hosts and then reboot the 7th, the one that still responds.
When the 7th host is up again, check whether you can access all shared LUNs. If you have to fix partition tables, do that next.
Then start the others one by one.

In case you find corrupted LUNs - feel free to call me via Skype "sanbarrow" before you try any repairs.

Regard the situation as unstable as long as you have not rebooted ALL hosts.
If possible, download the logfiles from any host and search them for hints of bad hardware or unstable paths before you reboot.
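
A first pass over the logs could be a simple grep. This is a rough sketch: the pattern list in `PATTERNS` is an illustrative guess at typical path-trouble messages (APD/PDL events, "state in doubt", non-zero host status codes), not a complete list, and `scan_log` is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: print log lines that hint at storage path trouble.
# PATTERNS is an illustrative, non-exhaustive guess; extend as needed.
PATTERNS='APD|PDL|state in doubt|Lost access to volume|H:0x[1-9a-f]'

# Print every line of the given log file that matches PATTERNS.
scan_log() {
    grep -E "$PATTERNS" "$1"
}

# Scan the usual ESXi log locations, if they exist and are readable.
for f in /var/log/vmkernel.log /var/log/vobd.log; do
    if [ -r "$f" ]; then
        echo "=== $f"
        scan_log "$f"
    fi
done
```

The same function works on logfiles downloaded to another machine, for example `scan_log ./vmkernel.log`.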


________________________________________________
Do you need support with a VMFS recovery problem ? - send a message via skype "sanbarrow"
I do not support Workstation 16 at this time ...
