Solved: Re: Help with Planning downtime

EllettIT · ‎09-20-2010

OK, it's not often that I need to shut down all of my VMs and hosts, but it has happened occasionally (a recent example was replacing the iSCSI switches used for SAN connectivity), and the process is never easy. We currently have 3 ESX 4.0 hosts licensed at the Advanced level, and the vCenter server is licensed at the Foundation level. I'm looking for a little help making sure I plan the steps out appropriately so that I don't get caught in the scenario that happened to me recently. Basically, I shut down all my VMs (our AD controllers and DNS servers are VMs too), put the hosts in maintenance mode, did the work I needed to do, then tried to bring everything back up. Since I didn't remove the hosts from the HA cluster first, it took a while for everything to come back up, and I still had to reconfigure each host for HA to make them happy. With this in mind, I'm looking to document a process for bringing down the VMs and hosts and getting them back up again without all the sweating on my part. Something like this:

Step 1 - Shut the VMs down

Step 2 - Put the hosts in maintenance mode

Step 3 - Remove the hosts from the HA cluster (not sure how to do this, or even if it's recommended)

Step 4 - Shut the hosts down

Step 5 - After the work is done, bring the hosts back up

Step 6 - After vCenter sees the hosts again, bring them back into the HA cluster (not sure how to do this)

Step 7 - Take the hosts out of maintenance mode

Step 8 - Power the VMs back up

Very basic, I know, but I'm open to comments and criticisms.

chadwickking · ‎09-20-2010

We're a very large enterprise with 200+ sites, each with a cluster of 3 ESX hosts, and we now have about 500+ hosts in our HO data center. I use scripting for a lot of my pre- and post-outage validations. I use PowerCLI from VMware for a lot of automation; it helps eliminate little manual tasks such as rescanning storage. I would encourage you to look into it as well, since it can save you a lot of time when it comes to automating your infrastructure.

http://www.lucd.info/

http://www.virtu-al.net/ - a lot of well-written scripts.

If you need any help, just post to the PowerCLI forums; we have a lot of great guys who are eager to help.

Have a good night!

Cheers,

Chad King

VCP-410 | Server+

Twitter: http://twitter.com/cwjking | virtualnoob.wordpress.com

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful

sketchy00 · ‎09-20-2010

I tell ya, this is one subject matter that is simply not written enough about. The process can be finicky at best, and scary at worst. The other difficult part is that it is hard to refine the processes for graceful startup, for the same reason why it’s difficult to fully plan for graceful shutdowns off of extended power outages. I’ve had both of them, and had a few things in place to make it easier, and learned a few things along the way.

1. I know the more recent VMware documentation has suggested there is no problem running vCenter in the cluster, but I don’t like it for the very reason you describe. I have an old, reprovisioned system that hooks up to my iSCSI SAN; it runs ESXi and has nothing but the single VM running vCenter on it. The VM itself sits on its own LUN on the SAN. That way, I can always press the power button and get that thing turned up before anything else.

2. Take an old physical box and rebuild it as, at minimum, a DNS server and, at most, an extra AD domain controller. If your ESX hosts refer to this box in their primary or secondary DNS settings, you guarantee that name resolution will work for the ESX hosts. In vCenter, change one of the DNS entries to this standby DNS server.

3. Add some static entries to the hosts file on your ESX servers (for the other ESX servers, time servers, SAN arrays, etc.). In theory, if DNS is redundant enough, you won’t need this, but it doesn’t hurt to be extra safe. Daemon start order is not something totally in your control, so it’s best to just add ‘em in there in case DNS isn’t ready.

4. In cases like the one you describe (planned power-down of the cluster), much of it you can control. If everything goes down from an extended power outage, power-up can sometimes be even more difficult, depending on how your networks connect to each other. It’s not much of a problem if your LAN switchgear does inter-switch VLAN routing. But if you have those networks connected via a firewall/router, such as an ISA/TMG server, those routing services won’t start properly without an AD DC to refer to. I learned this the hard way.

5. Refine your autopower-on VM sequence. You’ll be thankful later.

6. Each time you have to do this, take notes of what went wrong, so that you can refine your “run book.” You have to jump on the opportunity to refine that process, and sometimes the only time this stuff surfaces is when you have to take everything down, or when the UPSs take everything down for you!

Good luck!
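For item 3 above, here is a minimal sketch of how the static entries could be kept in one place and rendered as hosts-file lines, so the same list can be pasted onto every ESX server. Every name and address below is made up for illustration (assumptions, not anything from this thread); substitute your own ESX hosts, time servers, and SAN arrays.

```python
import ipaddress

# Hypothetical names and addresses, for illustration only.
STATIC_ENTRIES = [
    ("192.168.10.11", "esx01.example.local", "esx01"),
    ("192.168.10.12", "esx02.example.local", "esx02"),
    ("192.168.10.13", "esx03.example.local", "esx03"),
    ("192.168.10.5", "ntp01.example.local", "ntp01"),
    ("192.168.20.50", "san01.example.local", "san01"),
]


def render_hosts_lines(entries):
    """Return hosts-file lines in the form '<ip><TAB><fqdn><TAB><short name>'."""
    lines = []
    seen = set()
    for ip, fqdn, alias in entries:
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        if ip in seen:
            raise ValueError(f"duplicate address: {ip}")
        seen.add(ip)
        lines.append(f"{ip}\t{fqdn}\t{alias}")
    return lines


if __name__ == "__main__":
    # Print the block so it can be reviewed before it is added to a hosts file.
    print("\n".join(render_hosts_lines(STATIC_ENTRIES)))
```

The script only prints the lines; it deliberately does not touch any hosts file itself, so the output can be reviewed (and printed for the run book) first.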

chadwickking · ‎09-20-2010

Awesome job Sketchy!

I don't think I could've said it any better.

Just be sure, when you bring those hosts up, that the storage and switching are up first and that the hosts are able to talk to the storage arrays. You might want to add that to the checklist as well, in case it is a full-on outage :-). If your hosts don't see the storage, you can of course do a rescan. Before beginning any planned downtime, ensure you have proper backups of all your data in the event something goes south - been there, done that. Usually, if your vCenter, AD, and DNS are up, you shouldn't have too much of a problem adding or removing hosts from a cluster. On most of our sites we may have to reconfigure HA from time to time. However, we don't run vCenter on the same hosts it manages; we are able to isolate it and keep it apart, which I can say really reduces potential problems. As for the power-up process for your VMs, you definitely need to make AD/DNS a top priority!

Good luck as well!
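The "is the storage and switching up before the hosts?" check above could be sketched as a small pre-flight script run from a laptop or a physical box. This is only a sketch under assumptions: the addresses are hypothetical, it assumes the SAN's iSCSI portal answers on the standard TCP port 3260, and it assumes the standby DNS server accepts TCP on port 53. It tests plain TCP reachability and nothing more.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(targets, timeout=3.0):
    """Check each (label, host, port) and return the labels that failed."""
    return [
        label
        for label, host, port in targets
        if not is_reachable(host, port, timeout)
    ]


# Hypothetical targets, for illustration only.
CHECKS = [
    ("iSCSI SAN portal", "192.168.20.50", 3260),
    ("standby DNS server", "192.168.10.2", 53),
]

if __name__ == "__main__":
    failed = preflight(CHECKS)
    if failed:
        print("Do NOT power on the hosts yet; unreachable:", ", ".join(failed))
    else:
        print("Storage and DNS are reachable; OK to power on the hosts.")
```

A successful TCP connect only shows that the port is answering, not that the LUNs are healthy, so a rescan from the hosts may still be needed afterwards.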

Cheers,

Chad King

sketchy00 · ‎09-20-2010

Thanks Chad. Having had a complete power outage about 6 weeks ago that lasted 10+ hours, some of this is still pretty fresh in my memory. It's far from a complete list though (not to mention the power down lessons). I'm interested to see what others post here, to maybe scavenge an idea or two.

GreatWhiteTec · ‎09-21-2010

I'm there with you. I had a SAN failure recently where I lost half my datastores. Fortunately I had a physical DC/GC/DNS. The biggest issue I had was when I moved the SQL server with a lot of DBs (including VC) to the DR site and brought it up there. Everything was fine until I failed back: since my vCenter and SQL are on VMs, I could not bring vCenter back up, mainly because I'm using vDS and that info is stored in vCenter.

Learned a lot from this. Needless to say, I am building a new SQL server JUST for vCenter and SRM, and leaving it at our site at all times. Keep the physical DC/GC/DNS (that saved my butt). Bring AD up first after downtime, then SQL, then the rest of the servers.

Have redundant NICs. In my case I removed a NIC from the vDS and created a new standard vSwitch; that allowed me to bring SQL and vCenter onto the network, and once they were up, I migrated the NIC back to the vDS.

This week I upgraded all my hosts from ESX 4.0 to ESXi 4.1 and downtime was not an issue. I used vMotion to move servers between hosts and used host profiles to make the upgrade extremely easy.

Hope this helps.
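The power-up order described above (AD first, then SQL, then the rest) can be written down as a dependency map and turned into power-on "waves". This is a rough sketch, not a PowerCLI script: the VM names and dependencies are hypothetical, and it uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) only to work out the order for the run book.

```python
from graphlib import TopologicalSorter

# Hypothetical VM names; each VM maps to the VMs that must be up before it.
DEPENDS_ON = {
    "dc01": set(),                  # AD / DNS comes first
    "sql01": {"dc01"},              # SQL needs AD
    "vcenter01": {"dc01", "sql01"}, # vCenter needs AD and its database
    "file01": {"dc01"},
    "app01": {"dc01", "sql01"},
}


def power_on_waves(depends_on):
    """Group VMs into waves; each VM depends only on VMs in earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable printout
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(power_on_waves(DEPENDS_ON), start=1):
        print(f"Wave {number}: {', '.join(wave)}")
```

Everything in a wave can be powered on together once the previous wave is confirmed healthy; reversing the list gives a reasonable starting point for the shutdown order.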

GreatWhiteTec · ‎09-21-2010

Forgot to mention. I did remove hosts from the HA cluster. Since I use vDS I had to remove the host from vDS first, otherwise it won't let you remove the host from the cluster.

sketchy00 · ‎09-21-2010

Ouch! ...SAN Failure? ...Lost half your datastores? I'm afraid to ask what happened.

EllettIT · ‎09-21-2010

Thanks, guys, for all of your help! I think the end result is that I need to both document the process further and build some redundancy (physical DNS / AD server, etc.) into our environment.

sketchy00 · ‎09-21-2010

Good luck to you. It's oddly comforting to know that a few others have experienced the same thing. Taking care of these things ahead of time will make those pressure cooker moments much more reasonable.

Lastly, I forgot to tell you my most obvious mistake during my recent extended power outage. My fancy run book of shutdown and startup procedures was documented nicely in OneNote (most underrated app for IT, ever. But I digress). The sad part was that it was stored nicely on the SAN that didn't have any power to it. Add one more thing to the checklist: a corkboard in the server room, and a printout of your run book.

chadwickking · ‎09-21-2010

OMG, sketch - HA-lirious!

Sounds like me lol...

Regards,

Chadwick King
