VMware Cloud Community
carlbeck
Contributor

Several problems caused by ESXi 6 hosts booting faster than SAN

We have a number of ESXi 6 hosts and a SAN, with the VMs, syslog and scratch directory for each host located on the SAN. This works like a charm as long as the SAN stays up, but after a power outage we get problems, because the hosts boot faster than the SAN. (You might say "get a UPS for the SAN, then". That's a good point, but the customer requires that the system be able to start fully from a black start without user input.) The scratch and syslog locations could be moved off the SAN to solve the problem for them, but the VMs cannot.

This can't be too unusual a problem, yet Google doesn't seem to have a really good answer. For earlier ESXi versions, one could just add "delay XX" in the bootloader and the problem was solved. I have also found something about adding sleep XX to the etc/rc.local.d/rc.local file just before exit, but that had no effect at all on the boot time, and I can find very little documentation about it.

An additional problem: even if we could somehow get the hosts to rescan the iSCSI targets after boot, wouldn't we still have to start the VMs manually? When the host tries to autostart them, right after boot, they aren't available because the SAN hasn't booted yet.

5 Replies
pterlisten
Enthusiast

Hello,

have you tried enabling a full memory check at boot? With this setting, the server might take longer to boot up.

krish290785
Enthusiast

I'm not sure whether I understand this correctly... Doesn't the "Virtual Machine Startup and Shutdown" option for delaying VM autostart work in this scenario?

How about adding a 10-minute sleep to rc.local, followed by vim-cmd vmsvc/power.on /vmfs/volumes/CompletePathToVM to start the VMs?

-Bala Krishna Gali If the above info is useful, please mark answer as correct or helpful.
carlbeck
Contributor

Delayed startup of the virtual machines alone does not solve the problem, because the host does not rescan for iSCSI targets after boot. Even if the VMs wait 10 minutes to start, their storage isn't found and hence they fail to start.

However, it might be possible to put a sleep followed by a storage adapter rescan in local.sh, so that the host rescans before starting the VMs?

For example:

# Delay VM startup for X seconds to allow the SAN to boot
sleep 300
# Rescan storage adapters
esxcli storage core adapter rescan --all
vmkfstools -V
exit 0

Even better would be if I could add a check for whether the SAN has been found and, if not, rescan until it is found. Any ideas on that? This would cover cases where the SAN is busy doing disk checks or something else that prolongs its boot time.
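A minimal sketch of such a check, assuming the SAN-backed datastore shows up as a directory under /vmfs/volumes once it has been found. The datastore name san-datastore1, the retry count and the retry delay are hypothetical values for illustration, not taken from this thread. Only the rescan command itself comes from the posts here.

```sh
#!/bin/sh
# Sketch only: rescan until a SAN datastore becomes visible, or give up.
# DATASTORE_PATH, MAX_TRIES and RETRY_DELAY are illustrative assumptions.

DATASTORE_PATH="${DATASTORE_PATH:-/vmfs/volumes/san-datastore1}"  # hypothetical datastore name
MAX_TRIES="${MAX_TRIES:-60}"       # upper bound on rescans, so the wait cannot run forever
RETRY_DELAY="${RETRY_DELAY:-10}"   # seconds to sleep between rescans

# Rescan all storage adapters (same command as used elsewhere in this thread).
rescan_storage() {
    /sbin/esxcli storage core adapter rescan --all
}

# Return 0 as soon as the datastore directory exists,
# or 1 if it has not appeared after MAX_TRIES rescans.
wait_for_datastore() {
    tries=0
    while [ "$tries" -lt "$MAX_TRIES" ]; do
        rescan_storage
        if [ -d "$DATASTORE_PATH" ]; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$RETRY_DELAY"
    done
    return 1
}
```

The functions could be pasted into local.sh and wait_for_datastore called before the final exit 0. With the defaults above the wait is capped at roughly 10 minutes plus the time the rescans themselves take. A bounded loop is used on purpose, so that a SAN that never comes back does not leave the script spinning indefinitely.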

lytledd
Contributor

You'll need to specify the full path of each command. Mine is below:

cat local.sh

#!/bin/sh

# local configuration options

# Note: modify at your own risk!  If you do/use anything in this
# script that is not part of a stable API (relying on files to be in
# specific places, specific tools, specific output, etc) there is a
# possibility you will end up with a broken system after patching or
# upgrading.  Changes are not supported unless under direction of
# VMware support.

# Increased wait time to 300 seconds to allow the Fibre Channel array to cycle up (DL)
/bin/sleep 300
/sbin/esxcli storage core adapter rescan --all

exit 0

Thanks, Doug
nwincey
Enthusiast

We had a very similar "issue", but in our case it came from using a hyperconverged storage solution, where the VM datastores were never available before the ESXi hosts booted. We resolved it, and have used the fix for a few years now, by calling a custom script from local.sh. The script loops for a specified amount of time doing HBA rescans until our expected number of iSCSI connections is found (by issuing /sbin/esxcfg-mpath -L | grep -c "eui") and then proceeds into an automated VM startup procedure (using /bin/vim-cmd vmsvc/power.on $VM_ID for each VM on the system). Since you are just waiting for the SAN to boot, the rescan loop shouldn't have to run for too long. We make quite a few more detailed checks that are specific to our system, but at a high level that is the main concept of the script.
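The approach described above could look roughly like the sketch below. It is not the actual script. The expected path count, retry limits and function names are hypothetical. It also assumes that the VMs are already registered on the host and that vim-cmd vmsvc/getallvms prints one row per VM with the numeric VM ID in the first column.

```sh
#!/bin/sh
# Sketch only: rescan until enough storage paths are visible, then power on VMs.
# EXPECTED_PATHS, MAX_TRIES and RETRY_DELAY are illustrative assumptions.

EXPECTED_PATHS="${EXPECTED_PATHS:-4}"  # hypothetical number of "eui" paths to wait for
MAX_TRIES="${MAX_TRIES:-60}"
RETRY_DELAY="${RETRY_DELAY:-10}"

rescan_storage() {
    /sbin/esxcli storage core adapter rescan --all
}

# Count the storage paths whose listing contains "eui".
count_paths() {
    /sbin/esxcfg-mpath -L | grep -c "eui"
}

# Print the numeric IDs of registered VMs; header and wrapped text lines are skipped.
# A wrapped annotation line that starts with a bare number would be picked up too.
list_vm_ids() {
    /bin/vim-cmd vmsvc/getallvms | awk '$1 ~ /^[0-9]+$/ { print $1 }'
}

power_on_vm() {
    /bin/vim-cmd vmsvc/power.on "$1"
}

# Return 0 once at least EXPECTED_PATHS paths are seen, 1 after MAX_TRIES rescans.
wait_for_paths() {
    tries=0
    while [ "$tries" -lt "$MAX_TRIES" ]; do
        rescan_storage
        # grep -c exits non-zero when the count is 0, so fall back to 0 explicitly.
        found=$(count_paths) || found=0
        if [ "${found:-0}" -ge "$EXPECTED_PATHS" ]; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$RETRY_DELAY"
    done
    return 1
}

# Power on every VM returned by list_vm_ids, one at a time.
start_all_vms() {
    for vm_id in $(list_vm_ids); do
        power_on_vm "$vm_id"
    done
}
```

From local.sh this could be used as wait_for_paths && start_all_vms, so that no VM is powered on unless the expected paths were actually found. A real deployment would probably add logging and a startup order, along the lines of the extra system-specific checks mentioned above.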

Regards,

Nathan
