I have a strange problem with an ESX system (3.5 U4) running on 6 HP servers, combined with an active/active NetApp MetroCluster storage system consisting of 2 filers.
Here is the situation:
2 racks, each with 3 ESX servers and 1 filer. Both racks are connected with redundant power, LAN and SAN (Fibre Channel).
Last week I had to run a disaster recovery test on this system, so I decided to cut the power to one of the racks. I know that this test ends in a split-brain situation, and I know what to do with the NetApp and ESX systems in that case.
My expectation for this test was the following:
Cut the power to the rack
Watch several VMs crash (some because they run on the "crashed" ESX servers, others because their datastores are on the "crashed" filer even though they run on the surviving ESX servers)
Watch the HA cluster restart, on the surviving ESX servers, the crashed VMs whose datastores are on the surviving filer
After a while, the admin starts the recovery process on the NetApp and ESX systems
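To make the expected outcome per VM explicit, here is a minimal sketch in Python. It only models the expectation described in the steps above; the `VM` class, the rack labels and the `classify` function are hypothetical names for illustration and have nothing to do with how ESX HA is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    name: str
    host_rack: str       # rack of the ESX server the VM runs on (illustrative)
    datastore_rack: str  # rack of the filer that holds the VM's datastore


def classify(vm, failed_rack):
    """Return the outcome I expected for one VM when failed_rack loses power."""
    host_lost = vm.host_rack == failed_rack
    storage_lost = vm.datastore_rack == failed_rack
    if storage_lost:
        # The datastore is unavailable until the filer recovery is started,
        # so the VM crashes and cannot be restarted automatically.
        return "crashes, waits for filer recovery"
    if host_lost:
        # Only the host is gone; the datastore is still reachable.
        return "crashes, HA restart expected"
    return "keeps running"
```

With one rack labelled "A" and the other "B", cutting power to "A" gives three groups: VMs with their datastore in "A" wait for the manual recovery, VMs that only ran on a host in "A" should be restarted by HA, and VMs entirely in "B" should simply keep running. The last group is the one that surprised me.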
This should be the situation: a power outage occurs and the ESX system automatically tries to recover and keep running the remaining VMs without manual interaction. After a while, the admin is notified of the situation and decides on the next steps. I know that "normal" HA only considers the failure of a single ESX server, but that isn't the problem. The real problem is the following:
When I cut the power to the rack, the surviving ESX servers became nearly inaccessible and unusable.
I couldn't log on with the VI Client. Logging on to the service console worked fine, but every command touching the SCSI/VMFS subsystem hung. I couldn't even check whether the crashed VMs from the other hosts had been restarted.
I waited about 40 minutes for the situation to change, but nothing happened.
The situation improved once I started the recovery process on the surviving NetApp filer. During this process, the LUNs from the crashed filer are brought online again.
The vmkernel log is full of entries about "Path is busy", "SCSI-Reservation-Timeouts" and lots of other stuff.
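To get a rough count of how often these messages occur, a copy of the vmkernel log could be run through something like the following Python sketch on a workstation. The two search patterns are assumptions derived from the phrases quoted above, not exact vmkernel message formats, and `tally` is just an illustrative helper name.

```python
import re
from collections import Counter

# Assumed patterns based on the phrases quoted in this post; the real
# vmkernel messages may be worded differently and need adjusted patterns.
PATTERNS = {
    "path_busy": re.compile(r"path is busy", re.IGNORECASE),
    "scsi_reservation": re.compile(r"scsi[\s-]*reservation", re.IGNORECASE),
}


def tally(lines):
    """Count how many lines match each pattern."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

Passing an open file object for a copied log to `tally` returns a counter per message type, which is easier to hand to vendor support than the raw log.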
Basically, I know this is a very hard test, and I was not surprised that the system was inaccessible for the first few minutes. I took away nearly half of its datastores and the system tried to find them; that is OK. But I was surprised that the system didn't hit a general timeout, stop searching and proceed with other tasks, such as running the surviving VMs...
My question to the community is: was this normal behavior? Just chance? If not, what can I do to solve this problem? Support cases with both hardware vendors are in progress, but so far there are no answers. Is there a parameter for such things?
I think others of you will have run such disaster tests with your ESX clusters as well. Perhaps with other hardware, but what effects did you see? Did you have similar problems with your ESX servers, or did nothing happen?
I don't want to run this test again until I have some answers or hints that give me hope this will never happen again.
Sorry for my English.