References to Spinal Tap's great album aside, it's ironic: I'm working on VMware Site Recovery Manager product setup and configuration, and on the day I'm scheduled to fly out to Las Vegas for VMworld, a mini-disaster strikes! It's Sunday, September 14th, around noon and all is normal. However, the remnants of Hurricane Ike are heading this way. No big deal - a little rain, maybe a strong thunderstorm, but nothing we haven't seen before.
I'm packing for a week at VMworld and need to hit the road by 3:30PM. Around 2:00PM we start to hear the whirling sound of wind racing across the roof. At about 3:15PM I'm packing up the car and debris is getting blown down the street. Before I leave I have to remove a large piece of cardboard from the front of my car. I've never seen anything like this!
Despite the high winds, I make it to the airport safely and notice planes are still taking off and landing. Listening to the radio on the way there, I learned that winds were exceeding 80 MPH and knocking down trees and power lines all across the state of Ohio. Dayton was hit especially hard. I'm not sure how or why, but my plane took off successfully and it was a smooth ride once we were above the storm.
My house was without power for 4 days. Others had it worse, with the outage lasting over 9 days. This kind of weather event hasn't happened in 200 years. Some very special, one-in-a-million conditions came together, thanks in part to Hurricane Ike, to cause extraordinarily high winds in our region that none of us had seen before.
So something that will never happen happened - a disaster struck our data center, causing a multi-day loss of power. We have a natural-gas generator to cover a power outage. It kicked in and life is good, right? Wrong! We also have redundant AC units, but only one works with the generator, and the automatic fail-over didn't work due to a bug in the system (which has since been corrected). The room started heating up and servers started shutting off as the temperature reached 90 degrees Fahrenheit. We reached 95-96F before a co-worker showed up and manually switched the AC units over (I couldn't do it - I was on a plane, remember?). It took him twice as long as usual to get there because of downed power lines and trees that closed roads.
He then started powering the servers back up. Luckily, the outage for most systems was an hour or less, on a Sunday when most of our users didn't care or were distracted by the tree that had landed in their living room. The ESX hosts and virtual machines all powered up successfully, thanks in part to the hardware sensors on the servers that powered them off before the CPU, memory or I/O components fried in the heat.
While the outage was bad, it brought to light several interesting points:
- Test the equipment, but also test the fail-over of the equipment.
Testing the actual fail-over is the hardest part of disaster recovery because it impacts production. However, regardless of whether it's AC units or virtual machines, this is the only way to be 100% certain your DR plan will work as designed and implemented.
- The quality of built-in server hardware sensors has increased dramatically in the last 7 years.
This is the third time I've had servers in a room that overheated due to an AC outage. The previous two events involved lab servers that did not recover very well. The hardware didn't shut down cleanly. Many systems that were still running had blue-screened. When AC service was restored, some servers wouldn't power back up; others threw strange hardware-related errors months after the fact. Heat does bad things to electronics and I've seen too much of this first hand.
- Additional data center environmental monitoring and sensor devices are critically important.
I have the good fortune of working for a data center manager who had the foresight to install a Sensaphone remote monitoring device (http://www.sensaphone.com/). I'm sure there are other products on the market, but this one works very well for us. It can call a list of numbers and speak the alert condition over the phone. The admin can then enter a code to stop it from calling the next number. It can monitor various conditions, but in this case it called us to warn about the temperature. We also have an ADT monitoring unit, but it doesn't seem to work as well.
- Data center protection is important in a disaster, but also consider supporting non-data center work-related processes.
This "mini-disaster" left us without power for days, yet the business needed to continue to function. We needed to process sales orders, purchase raw materials, process payroll, etc. Have you ever worked for a company that couldn't meet payroll for any reason? To say that employees get upset is an understatement. So when no one has power, where does the accounting staff go to get their job done? Plan to provide facilities for personnel to carry out these kinds of essential functions. After all, what good is making sure the payroll system is running when nobody can access it anyway?
- Consider specific disaster scenarios and plan accordingly.
This may be the hardest thing to accomplish when planning for a disaster. Put two people in a room and they will have very different opinions on which scenario is more important than the other. The bottom line is you'll have to choose some number, say the top three, and plan for those. You should plan for something - define it, but don't let it stall the progress of the project.
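For anyone who wants to reason about the call-list behavior described for the monitoring device above, here is a minimal sketch in Python. It is not how the Sensaphone works internally; the `Contact` type, the `place_call` callback, the phone numbers, and the 85F threshold are all hypothetical, chosen only to illustrate "call each number in order until someone acknowledges".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Contact:
    name: str
    number: str  # hypothetical phone number


def escalate(
    alert: str,
    contacts: List[Contact],
    place_call: Callable[[Contact, str], bool],
) -> Optional[Contact]:
    """Call each contact in order; stop at the first one who acknowledges.

    place_call is an assumed callback that speaks the alert and returns
    True if the person entered the acknowledgement code.
    """
    for contact in contacts:
        if place_call(contact, alert):
            return contact
    return None  # nobody acknowledged


def check_temperature(
    temp_f: float,
    threshold_f: float,
    contacts: List[Contact],
    place_call: Callable[[Contact, str], bool],
) -> Optional[Contact]:
    """Start the call list only when the room is at or above the threshold."""
    if temp_f < threshold_f:
        return None
    alert = f"Server room temperature is {temp_f:.0f}F (limit {threshold_f:.0f}F)"
    return escalate(alert, contacts, place_call)


if __name__ == "__main__":
    on_call = [Contact("first admin", "555-0100"), Contact("second admin", "555-0101")]

    # Stand-in for a real dialer: pretend only the second admin picks up.
    def fake_call(contact: Contact, alert: str) -> bool:
        print(f"Calling {contact.name}: {alert}")
        return contact.name == "second admin"

    print(check_temperature(95, 85, on_call, fake_call))
```

The point of the sketch is the ordering: the list keeps being worked until someone acknowledges, which matters when the first person on the list is on a plane.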
The power outage lasted around 72 hours, but the service outage lasted less than an hour. Not bad overall! Now I'd better get VMware Site Recovery Manager working - had that generator stopped running...