APlatt
Contributor

ESX delay startup/recovery from power failure


Hello all, well today was one of those days.

4:00 am Lunar Eclipse

4:04 am Power outage in data center

4:41 am UPS batteries die/Generator did not start :(

6:15 am Driving in to work to bring up data center

8:15 am spill salsa on my tie

11:00 am Coke machine eats my dollar...my last dollar, only a 20 left in wallet.

3:59 pm still on phone trying to recover some vm guests that were clustered...

If I see locusts, I'm outta here.

Anyway, my situation is the following:

After the power outage, the 4 ESX hosts came up before the SAN finished booting. That caused all sorts of problems. HA on all four ESX servers was so confused that I eventually had to delete the cluster and create a new one to add the ESX servers back into. I am fighting to get several clustered servers back up and running because the VMware guests don't know that they have raw device mappings...this is bad.

My question is the following:

Is there a way to delay the ESX hosts from powering up until the SAN is completely up and running? Or are the ESX hosts supposed to somehow recognize the storage after an extended time, once the SAN eventually finishes booting? After getting VI up and running, all my VM guests showed as unknown. Only after rebooting each of my hosts (after the SAN was up) did they begin to recognize the guests so that I could start them.

Ultimately this should not happen, as power should be continuous...but Mr. Murphy hates me, so I want to prepare as much as I can.

Regards,


Accepted Solutions
Texiwill
Leadership

Hello,

Using the iLO/Director/DRAC interfaces and Expect, it is possible to script the boot of all your systems. For example, you could have one server that boots independently of the SAN or anything else. It polls the SAN periodically to ask whether it is ready. When the SAN is ready, it scripts the boot of the ESX servers.
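The polling idea can be sketched in Python rather than Expect. This is only an illustration under stated assumptions: the readiness check is a plain TCP connect to a management port on the SAN (an open port does not guarantee that the LUNs are presented yet), and the power-on step assumes the out-of-band controllers accept IPMI-over-LAN via the `ipmitool` CLI. The host names, ports, and credentials are hypothetical.

```python
import socket
import time


def san_is_ready(host, port, timeout=5.0):
    """Return True if a TCP connection to the SAN management port succeeds.

    Assumption: an open management port is used as a stand-in for "SAN is up".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_then_power_on(is_ready, power_on, hosts,
                       poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll is_ready() until it returns True, then call power_on(host) per host.

    Returns the hosts that were powered on, or an empty list if the SAN
    never reported ready within max_polls attempts.
    """
    for _ in range(max_polls):
        if is_ready():
            for host in hosts:
                power_on(host)
            return list(hosts)
        sleep(poll_seconds)
    return []


def ipmi_power_on_command(bmc_host, user, password):
    """Build an ipmitool command line that powers on a chassis over the LAN.

    Assumption: the iLO/DRAC has IPMI-over-LAN enabled.
    """
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host,
            "-U", user, "-P", password, "chassis", "power", "on"]
```

In use, `is_ready` would be something like `lambda: san_is_ready("san-mgmt.example", 443)` and `power_on` would hand the result of `ipmi_power_on_command(...)` to `subprocess.call`. Keeping both as parameters makes the loop easy to dry-run before trusting it with a real outage.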

The option I use is a series of APC Switched PDUs that are all Ethernet controlled. After a power outage, my servers are set to reboot when power comes back on, but I have the PDUs set so that power to each outlet does not come on for a set amount of time, anywhere from 15 seconds to 5 minutes. If I need longer, I set the PDU to never restore power to the port and have another script that polls until the SAN is up and then, using Expect and a little bit of Perl scripting, logs in to the PDU and automatically powers on the servers.
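To plan per-outlet delays before entering them into the PDU, a small helper can lay out a staggered schedule inside the 15 second to 5 minute window mentioned above. This is a rough sketch only: the outlet names, the 30 second step, and the idea of listing storage first are assumptions for illustration, and the actual PDU configuration is vendor-specific and not shown.

```python
def staggered_delays(outlets, first_delay=15, step=30, max_delay=300):
    """Map each outlet name to a power-on delay in seconds.

    Outlets are delayed in list order: the first gets first_delay, each
    later one gets step seconds more, capped at max_delay.
    """
    if first_delay < 0 or step < 0 or max_delay < first_delay:
        raise ValueError("delays must be non-negative and max_delay >= first_delay")
    return {
        outlet: min(first_delay + index * step, max_delay)
        for index, outlet in enumerate(outlets)
    }


# Hypothetical outlet names; storage is listed first so it powers on earliest.
schedule = staggered_delays(["san-shelf", "esx1", "esx2", "esx3", "esx4"])
```

If the SAN needs longer than the cap to finish booting, fixed delays are not enough, which is where the never-restart setting plus a polling script comes in.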

Either method is possible. Some are easier than others. :) Some tools have delays built into them, but they are not always long enough.

Best regards,

Edward

--
Edward L. Haletky
vExpert XIII: 2009-2021,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill

6 Replies
FERC_ESX1
Enthusiast

Sounds like a rough day. :(

As a partial solution, you should be able to go into the Configuration Tab in VC, go to "Storage Adapters" and hit Rescan. This will save you from having to reboot the hosts.

Just be careful with that. If you're still on 3.0.1, a patch was required (maybe just for some hosts) to prevent a rescan of all HBAs from locking up the host. We're using HP BL480c servers and that was a problem for us. Rescanning one HBA at a time worked, though. Once I patched the servers I could rescan all HBAs at once.
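For anyone who would rather script the one-HBA-at-a-time rescan than click through VC, here is a minimal sketch. It assumes the `esxcfg-rescan <adapter>` service console command is on the PATH and that the adapter names (for example `vmhba1`) are known in advance. The Python shipped on an ESX 3.x service console may be too old for this code, so treat it as an outline of the loop rather than something to paste onto a host.

```python
import subprocess


def rescan_one_at_a_time(adapters, run=subprocess.call):
    """Run esxcfg-rescan for each adapter in turn.

    Each rescan finishes before the next one starts. Returns a dict that
    maps the adapter name to the command's exit code.
    """
    results = {}
    for adapter in adapters:
        results[adapter] = run(["esxcfg-rescan", adapter])
    return results
```

The `run` parameter exists so the loop can be exercised with a stand-in function before it touches a real host.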

APlatt
Contributor

I was able to rescan the HBAs, but they would not show the LUNs until after the reboot. It was like once the HBAs did not detect a SAN, they just didn't care about ever finding it.

Thanks for the reply, though...and I'm checking my patch level now on the 3.0.1 as suggested.

Ken_Cline
Champion

8:15 am spill salsa on my tie

There's the root of all the evil! No one responsible for a datacenter should have to wear a tie!

You could set your servers so that, after a hard halt, they do not autostart. While not an optimal solution, it would allow your other infrastructure components to come up and then you could manually power the hosts up one at a time - allowing the environment to stabilize along the way.

Ken Cline VMware vExpert 2009 VMware Communities User Moderator Blogging at: http://KensVirtualReality.wordpress.com/
APlatt
Contributor

lol, yes I agree, and 95% of the year I do not, but today was our 'Award Ceremony' day and of course I was forced to attend.

Yes, I can turn off the autostart after power outage in the BIOS, but I really love that feature, as it saves me from having to come in and turn on 20-some-odd servers. In this case, though, I may have to reevaluate that option as more of my physical servers are virtualized.

Ken_Cline
Champion

...it saves me from having to come in and turn on 20 some odd servers

And that's why there are iLO / Director / DRAC interfaces on most servers - so you can reach out and touch them from anywhere in the world. As your servers become more critical (i.e. host more workloads/VMs), their manageability becomes more and more important. I would encourage you to invest the extra $$$ to ensure that your servers - no matter what vendor you buy them from - have an out-of-band management interface that supports remote power operations, remote BIOS upgrade, remote console interface (KVM), etc.

Ken Cline VMware vExpert 2009 VMware Communities User Moderator Blogging at: http://KensVirtualReality.wordpress.com/