Anatomy of an ESX Server Crash (or the Benefit of VMware HA on VI3)

The agency where I work is currently running VMware VirtualCenter 2.0.2 and VMware ESX Server 3.0.2 on twelve HP BL480c G1 blades. Each blade has two dual-core CPUs and 16 GB of RAM.

The servers are configured as two clusters, and each cluster is configured for DRS (Distributed Resource Scheduler) and HA (High Availability). DRS is the service that balances workloads across ESX hosts. HA is the service that monitors ESX hosts: if a host’s heartbeat is lost, the VMs that were running on that host are restarted on other hosts in the cluster.
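The HA behavior described above can be pictured with a toy model. This is only a sketch of the general idea (a missed heartbeat followed by a restart on surviving hosts), not VMware's actual algorithm. The names `Host`, `failed_hosts`, and `restart_vms`, the timeout value, the host and VM names, and the "least-loaded host" placement rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

HEARTBEAT_TIMEOUT_S = 15.0  # hypothetical value, chosen only for this sketch


@dataclass
class Host:
    name: str
    last_heartbeat: float  # seconds on a shared clock
    vms: list = field(default_factory=list)


def failed_hosts(hosts, now, timeout=HEARTBEAT_TIMEOUT_S):
    """Return the hosts whose last heartbeat is older than the timeout."""
    return [h for h in hosts if now - h.last_heartbeat > timeout]


def restart_vms(hosts, now, timeout=HEARTBEAT_TIMEOUT_S):
    """Move VMs from failed hosts to the least-loaded surviving host.

    Returns a dict mapping each restarted VM to its new host name.
    """
    dead_names = {h.name for h in failed_hosts(hosts, now, timeout)}
    dead = [h for h in hosts if h.name in dead_names]
    alive = [h for h in hosts if h.name not in dead_names]
    placements = {}
    if not alive:
        return placements  # no surviving host to restart on
    for host in dead:
        for vm in list(host.vms):
            # Assumed placement rule: pick the host running the fewest VMs.
            target = min(alive, key=lambda h: len(h.vms))
            target.vms.append(vm)
            host.vms.remove(vm)
            placements[vm] = target.name
    return placements


cluster = [
    Host("esx01", last_heartbeat=100.0, vms=["web1", "web2"]),
    Host("esx02", last_heartbeat=100.0, vms=["db1"]),
    Host("esx03", last_heartbeat=60.0, vms=["app1", "app2"]),  # stopped responding
]
placements = restart_vms(cluster, now=101.0)
print(placements)  # {'app1': 'esx02', 'app2': 'esx01'}
```

In the real product, placement also has to respect resource reservations and failover capacity; the sketch ignores all of that and only shows the detect-then-restart flow.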

Tiers describe the service levels for each application (and server) with Return to Operations (RTO) and Recovery Point Objectives (RPO) for each tier. One cluster of ESX hosts is for Tier I and Tier II applications; the other is for Tier III and Tier IV applications.

Service Level and Characteristics

| Criticality Level | RTO (Availability) | RPO (Data Loss) | Characteristics | System / Data Requirement |
|---|---|---|---|---|
| Tier I | ~1 - 8 hours | ~0 - 4 hours | Systems / data that directly support the agency | High integrity and high availability |
| Tier II | ~8 - 24 hours | ~4 - 24 hours | Systems / data that indirectly support the agency | High integrity and medium availability |
| Tier III | ~2 - 5 days | ~1 - 5 days | All other systems / data | Basic integrity and basic availability |
| Tier IV | ~2 - 5 days | ~1 - 5 days | Test / dev | Basic integrity and basic availability |

VirtualCenter is also set up to send an email notification if a host server goes “Red” (goes offline). I had been testing the email notifications, so the alert was set to go only to my email address.

On December 5th, I was out of the office. At approximately 1:30 PM, a Tier III/IV blade suffered a catastrophic failure. VirtualCenter detected that the server was “Red” and sent me an email, but I was nowhere near my email at the time. Meanwhile, HA lost the heartbeat of the server and in short order started bringing up the VMs that had been running on the failed host on other hosts in the cluster.

OpenView detected that the individual VMs were not running and sent out notifications to the Network Engineering team. Upon receiving the message that servers were down, members of the team started investigating the failures. The engineers were able to log on to the servers that had been reported as down and determined that they were up and running properly.

Within four minutes, all of the servers that OpenView had reported as down were being reported as up.

Later in the afternoon, around 3 PM, I checked my email, saw that the server had gone “Red”, and started to investigate remotely using Citrix, Remote Desktop, VirtualCenter, and HP’s web-based Onboard Administrator (OA) interface for the blade enclosure. If the blade could have been restarted, I would have been able to do it from the OA. The blade, however, did not respond to any commands. After a quick call to a Network Engineering team member, an engineer went down to the data center to re-seat the blade in the enclosure. Normally, a blade will power on about 30 seconds after being inserted in an enclosure. This blade did NOTHING: no lights, no beeps, just dead.

At this point the engineer called HP for support, and a tech from HP was dispatched. Let me mention that December 5th was a snowy and icy day. It took several hours for the tech to get onsite to diagnose the problem. It was determined that the problem was a failed motherboard on the blade, and the board was replaced around 11PM.

The total time of the ESX host outage was about 9.5 hours. The RTO for servers in this tier is between two and five days, so even the host outage by itself was well within the stated RTO for the tier. However, the applications that were running on the failed host were down for only about four minutes, which beats the RTO requirement even for Tier I.

If there were no VMware DRS/HA at the agency, the application outage would have lasted as long as the host outage, which is over 140 times as long as it actually was. The four-minute outage was only about 0.06% of what is “acceptable” based on the five-day upper limit of the tier’s RTO. If you know the cost per minute of server downtime, you can see how much money was saved by VMware’s HA service.
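The figures above can be checked with a few lines of arithmetic. This is a minimal sketch using the numbers from this post; the constant names and the `downtime_cost_saved` helper are mine, and the cost per minute is a placeholder you would replace with your own figure.

```python
HOST_OUTAGE_MIN = 9.5 * 60   # the blade was down for about 9.5 hours
VM_OUTAGE_MIN = 4            # the VMs were back within about four minutes
RTO_MAX_MIN = 5 * 24 * 60    # upper end of the Tier III/IV RTO (5 days)
TIER_I_RTO_MIN = 1 * 60      # lower end of the Tier I RTO (~1 hour)

# How many times longer the outage would have been without HA.
ratio = HOST_OUTAGE_MIN / VM_OUTAGE_MIN

# Actual application outage as a percentage of the allowed RTO.
pct_of_rto = VM_OUTAGE_MIN / RTO_MAX_MIN * 100


def downtime_cost_saved(cost_per_minute):
    """Downtime cost avoided, given a (hypothetical) cost per minute."""
    return (HOST_OUTAGE_MIN - VM_OUTAGE_MIN) * cost_per_minute


print(f"{ratio:.1f}x longer without HA")        # 142.5x longer without HA
print(f"{pct_of_rto:.2f}% of the 5-day RTO")    # 0.06% of the 5-day RTO
print(VM_OUTAGE_MIN < TIER_I_RTO_MIN)           # True
```

For example, at a placeholder cost of 100 per minute, `downtime_cost_saved(100)` returns 56600.0, since 566 minutes of downtime were avoided.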

Comments

Very nice writeup!

Jason Boche

VMware Communities User Moderator

Last update: 12-14-2007 11:38 AM