

Manual Automation : August 2008


Break Like the Wind

Posted by Virtual_JTW Aug 26, 2008

References to Spinal Tap's great album aside, it's ironic: I'm working on VMware Site Recovery Manager setup and configuration, and on the day I'm scheduled to fly out to Las Vegas for VMworld a mini-disaster strikes! It's Sunday, September 14th, around noon and all is normal. However, the remnants of Hurricane Ike are heading this way. No big deal - a little rain, maybe a strong thunderstorm, but nothing we haven't seen before.

I'm packing for a week at VMworld and need to hit the road by 3:30PM. Around 2:00PM we start to hear the whirling sound of wind racing across the roof. At about 3:15PM I'm packing up the car and debris is getting blown down the street. Before I leave I have to remove a large piece of cardboard from the front of my car. I've never seen anything like this!

Despite the high winds, I make it to the airport safely and notice planes are still taking off and landing. Listening to the radio on the way there, I learned winds were exceeding 80 MPH and knocking down trees and power lines all across the state of Ohio. Dayton was hit especially hard. I'm not sure how or why, but my plane took off successfully and it was a smooth ride once we were above the storm.

My house was without power for 4 days. Others had it worse, with the outage lasting over 9 days. This kind of weather event hasn't happened in 200 years. Some very special, one-in-a-million conditions came together, thanks in part to Hurricane Ike, to cause extraordinarily high winds in our region that none of us had seen before.

So something that will never happen happened - a disaster struck our data center, causing a multi-day loss of power. We have a natural-gas generator to cover a power outage. It kicked in and life is good, right? Wrong! We also have redundant AC units, but only one works with the generator, and the automatic fail-over didn't work due to a bug in the system (which has since been corrected). The room started heating up and servers started shutting off as the temperature reached 90 degrees Fahrenheit. We reached 95-96F before a co-worker showed up and manually switched the AC units over (I couldn't do it - I was on a plane, remember?). It took him twice as long to get there because of downed power lines and trees that closed roads.

He then started powering up servers again. Luckily the outage for most systems was an hour or less, on a Sunday when most of our users didn't care or were distracted by the tree that had landed in their living room. The ESX hosts and virtual machines all powered up successfully, thanks in part to the hardware sensors on the servers that powered them off before the CPU, memory or I/O components fried in the heat.

While the outage was bad, it brought to light several interesting points:

  1. Test the equipment, but also test the fail-over of the equipment.
    Testing the actual fail-over is the hardest part of disaster recovery because it impacts production. However, regardless of whether it's AC units or virtual machines, this is the only way to be 100% certain your DR plan will work as designed and implemented.
  2. The quality of built-in server hardware sensors has increased dramatically in the last 7 years.
    This is the third time I've had servers in a room that overheated due to an AC outage. The previous two events involved lab servers that did not recover very well. The hardware didn't shut down cleanly. Many systems that were still running had blue-screened. When AC service was restored, some servers wouldn't power back up; others threw strange hardware-related errors months after the fact. Heat does bad things to electronics and I've seen too much of this firsthand.
  3. Additional data center environmental monitoring and sensor devices are critically important.
    I have the good fortune of working for a data center manager who had the foresight to install a Sensaphone remote monitoring device (http://www.sensaphone.com/). I'm sure there are other products on the market but this one works very well for us. It can call a list of numbers and speak the alert condition over the phone. The admin can then enter a code to stop it from calling the next number. It can monitor various conditions but in this case it called us to warn about the temperature. We also have an ADT monitoring unit but it doesn't seem to work as well.
  4. Data center protection is important in a disaster but also consider supporting non-data center work-related processes.
    This "mini-disaster" left us without power for days, yet the business needed to continue to function. We needed to process sales orders, purchase raw materials, process payroll, etc. Have you ever worked for a company that couldn't meet payroll for any reason? To say that employees get upset is an understatement. So when no one has power, where does the accounting staff go to get their job done? Plan to provide facilities for personnel to carry out these kinds of essential functions. After all, what good is making sure the payroll system is running when nobody can access it anyway?
  5. Consider specific disaster scenarios and plan accordingly.
    This may be the hardest thing to accomplish when planning for a disaster. Put two people in a room and they will have very different opinions on which scenario is more important than the other. The bottom line is you'll have to choose some number, say the top three, and plan for those. You should plan for something - define it but don't let it stall the progress of the project.
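
The call-list behavior described in point 3 can be sketched roughly as follows. This is only a minimal Python illustration of the idea, not Sensaphone's actual logic or API: the function name, the `place_call` callback and the 90F threshold (borrowed from the temperature at which our servers began shutting off) are all assumptions made for the example.

```python
from typing import Callable, Iterable, Optional

# Assumed threshold for illustration; in the post, servers began
# shutting off at around 90 degrees Fahrenheit.
ALERT_THRESHOLD_F = 90.0


def escalate(
    temperature_f: float,
    call_list: Iterable[str],
    place_call: Callable[[str, str], bool],
) -> Optional[str]:
    """Call each number in order until someone acknowledges the alert.

    place_call(number, message) is a hypothetical callback that returns
    True when the person answering enters the acknowledgement code.
    Returns the number that acknowledged, or None if no alert was needed
    or nobody acknowledged.
    """
    if temperature_f < ALERT_THRESHOLD_F:
        return None  # below threshold: nothing to report
    message = f"Server room temperature is {temperature_f:.0f} degrees Fahrenheit"
    for number in call_list:
        if place_call(number, message):
            return number  # acknowledged: stop calling further numbers
    return None  # list exhausted without an acknowledgement
```

The point of the design is the early return: once one admin enters the code, nobody further down the list gets woken up.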

The power outage lasted around 72 hours; the service outage lasted less than an hour. Not bad overall! Now I'd better get VMware Site Recovery Manager working - had that generator stopped running...

0 Comments Permalink

Well, nothing much really, but I'll make a connection. Just bear with me...

I was walking through the toy section with my kids at Target yesterday when one of my sons spotted a toy he really wanted - a set of four trucks (they love trucks!). On the front of the package it read, "for ages 5 to 95". Now really, so a 96-year-old shouldn't play with these trucks?

I tend to find discussions on virtualization candidates just about as rational and definitely as funny. The debate on whether application XYZ can/should be virtualized is over. Sure there are still exceptions (unique hardware requirements, for example). And yes it depends on your environment (I wouldn't virtualize 3 Exchange 2003 mailbox servers across 2 ESX hosts sporting Pentium 4 CPUs with 1GB of RAM each). But for Virtual Infrastructure (VI) environments running on modern servers and back-end storage systems, there are very few physical servers that can't be virtualized.

If you buy into this "virtualize your datacenter" principle like I do, then are there really no applications off-limits? What about VMware's own products such as VirtualCenter? I know there are VI administrators out there that still refuse to virtualize the VirtualCenter Management Server (VCMS). I usually hear one of two reasons:

  1. "I'm freeing-up all of these physical servers and have one or two that I have to use for something."
  2. "VirtualCenter is becoming so critical that I can't afford it to go down or lose access."
But that's all wrong - 96-year-olds can play with trucks! You virtualize the VCMS for the very same reasons you virtualize all of the other physical servers in your datacenter: to realize all of the benefits of VI. You know what they are, but if you're not sure, please go to vmware.com to find out more.

To answer the above concerns: deploying a physical server to host a VI component sort of defeats the purpose, doesn't it? Won't deploying yet another physical server increase cooling cost? Power consumption? System maintenance? Etc, etc. And what about availability? I sometimes wonder if these administrators really understand VMware HA or the power of VMotion - virtualizing the VCMS should increase its availability compared to hosting it on a physical server.

Ever since VMware announced full support for running VirtualCenter in a virtual machine with the release of 2.5, I haven't looked back. I've implemented and supported VI environments for two different companies now with the VCMS running in a virtual machine. It's been two years and I have not run into anything I would call a "deal-killer" for this design decision. However, there is a short list of things that you should be aware of:

  • If you need to shut down the entire VI environment, you'll need to save the ESX host(s) that VC and its database server are running on for last. Then you'll need to log on to the hosts directly to complete the shutdown. This doesn't happen too often, but I've had to do this 2 or 3 times, usually due to a storage-related outage.
  • I've experienced brief 1-2 second pauses in the VMware Infrastructure Client (VIC) when the VCMS VM gets VMotioned from one host to another. Again, this rarely happens.
  • And here's a new one: As of Update 2, there's a new feature called Enhanced VMotion Compatibility (EVC). To enable this in my environment, VC requires that all virtual machines in the cluster be powered off. It might be hard to enable this feature in VC if the VCMS is powered off(!). The solution to this isn't too painful, however: temporarily move the ESX server that hosts the VCMS VM out of the cluster, enable the feature, then move it back.
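
The shutdown ordering from the first bullet can be sketched like this. It is a minimal Python illustration under stated assumptions: the host and VM names are hypothetical, and the function only computes an order from an inventory you supply. It does not talk to VirtualCenter or ESX in any way.

```python
from typing import Dict, List, Set


def shutdown_order(vms_by_host: Dict[str, Set[str]], critical_vms: Set[str]) -> List[str]:
    """Return host names in shutdown order, with hosts running any
    critical VM (VirtualCenter or its database) placed last."""
    ordinary = [h for h, vms in vms_by_host.items() if not (vms & critical_vms)]
    critical = [h for h, vms in vms_by_host.items() if vms & critical_vms]
    # Sort each group by name so the result is deterministic.
    return sorted(ordinary) + sorted(critical)


# Hypothetical inventory: which VMs currently run on which ESX host.
placement = {
    "esx01": {"web01", "vcms"},   # hosts the VirtualCenter VM
    "esx02": {"mail01"},
    "esx03": {"sqlvc"},           # hosts the VirtualCenter database VM
}

order = shutdown_order(placement, critical_vms={"vcms", "sqlvc"})
```

With this inventory, `esx02` comes first and the two hosts carrying VC and its database come last. Those last hosts are the ones you would log on to directly to finish the job.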

What if your VCMS VM does crash? If VirtualCenter becomes unavailable, your VMs will continue to run. HA runs as an agent on each host, so that service will continue to run. Since you're probably running the FlexNet licensing service on the same VM as the VCMS, you'll have a grace period of 14 days to get the VM back up and running. If it takes you more than 14 days to get that VM back up and running, it's not very critical in your environment anyway.
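
If you want to put a date on that grace period, the arithmetic is simple. The sketch below is plain Python date math using the 14-day figure above; the function names are made up for illustration and nothing here queries the license server.

```python
from datetime import date, timedelta

GRACE_PERIOD_DAYS = 14  # grace period quoted above


def grace_deadline(license_server_down: date) -> date:
    """Date on which the grace period runs out."""
    return license_server_down + timedelta(days=GRACE_PERIOD_DAYS)


def days_remaining(license_server_down: date, today: date) -> int:
    """Whole days left in the grace period, never less than zero."""
    return max(0, (grace_deadline(license_server_down) - today).days)
```

For example, a license server that went down on September 14 would give you until September 28 to get the VM back.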

For more information on this topic straight from the horse's mouth, please see: http://www.vmware.com/pdf/vi3_vc_in_vm.pdf

Still not convinced?
Leave a comment and let me know why.


Virtual_JTW

Member since: Nov 1, 2004

I am a senior IT professional that has designed, implemented and managed the operations of several VI environments. This blog will detail design rationale, testing results and technical tips with a heavy focus on VI/vSphere, storage and cloud computing.
