Good morning, VMware world!
So... long story short, we were bitten hard by HA's lack of protection against APD. We run a stretched cluster configuration running 5.5 with the latest updates. EMC VPLEX replicates storage between two different EMC arrays and keeps the datastores in sync, creating a true virtual write-anywhere configuration.
As far as I understand, there are two formal classifications of storage failure from an FDM (HA) perspective: APD (All Paths Down) and PDL (Permanent Device Loss).
PDL is determined by SCSI sense codes sent from the array to the host. APD is a connection failure between the array and the host, or another catastrophic failure that didn't produce a SCSI sense code.
We have a HP c7000 enclosure with several 1/2 height blades.
About a month ago the FlexFabric cards hit a fault where only storage connectivity from the enclosure to the SAN failed (networking was not impacted). The VMs essentially lost the ability to complete any storage I/O and went into what I can best describe as a 'zombie' state. Many of the blades ramped CPU up to near 100% after 15 minutes or so as the VMs were unable to complete any storage I/O. We lost about 200 VMs, which caused a major outage for the business.
It was then I learned that HA doesn't protect against APD in any way, shape, or form. We really wish VMware would solve this issue.
Instead, I have created a few scripts and processes to work around this issue until the FDM developers can get it resolved...
Here is how I solved this issue.
I have created a shell script that runs on an ESXi host and has the following high-level logic.
-Check whether the storage attached via FC to the SAN is up and accessible.
-If storage is down and all of the defined datastores are inaccessible, reboot the impacted host, which forces HA to restart its VMs on a surviving host.
-Ensure the script can't restart a host if HA isn't running.
-Ensure the script can't be started multiple times on a host.
-Have a way to collect logs from the script.
-Check against multiple datastores to ensure that paths are really down.
-If paths are down, use esxcli to rescan the adapters several times before killing the host.
-Run the script on the ESXi host itself so that patching and other activities can't keep the script from running.
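The "all defined datastores are inaccessible" check can also be written as a loop over a list of directories instead of four fixed variables. This is only a rough sketch: `any_path_writable` and its arguments are made-up names, not part of the script below, and it assumes a failed write is a good enough signal. I/O against a datastore in APD can block for a while rather than fail right away, and this sketch does nothing about that.

```shell
#!/bin/sh
# Sketch only: any_path_writable and its arguments are hypothetical names.

# Try to write a heartbeat file into every directory given.
# Returns 0 if at least one write succeeded, 1 if all of them failed.
any_path_writable() {
    hbfile="$1"
    shift
    ok=1
    for dir in "$@"; do
        # -d first, so a missing directory is a clean failure
        if [ -d "$dir" ] && date 2>/dev/null > "$dir/$hbfile"; then
            ok=0
        fi
    done
    return "$ok"
}

# Example call (paths are placeholders):
#   any_path_writable "ivisHB$(hostname).txt" /vmfs/volumes/DS1/ivisHA /vmfs/volumes/DS2/ivisHA
```

Adding a fifth test datastore then only means adding one more argument to the call.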
Here we go!
Again, the script runs on the ESXi host itself. It is a Linux-style shell script (I am not a Linux engineer, so this was best effort for me)....
#!/bin/sh
# Only the first line counts as a shebang; esxcli is simply called as a command further down.
echo "$(date) -- Current date : $(date) @ $(hostname)"
echo "$(date) -- IVIS HA APD Issue Script Start!"
#DEFINE TEST DATASTORES
HBFILE=ivisHB$(hostname).txt
TESTDS1=/vmfs/volumes/***DATASTORE_TO_TEST_PATH_1***/ivisHA
TESTDS2=/vmfs/volumes/***DATASTORE_TO_TEST_PATH_2***/ivisHA
TESTDS3=/vmfs/volumes/***DATASTORE_TO_TEST_PATH_3***/ivisHA
TESTDS4=/vmfs/volumes/***DATASTORE_TO_TEST_PATH_4***/ivisHA
LOCKFILELOC=/tmp/ivisHB.lck
# -p keeps mkdir quiet when the directory already exists (e.g. on a restart)
mkdir -p "$TESTDS1"
mkdir -p "$TESTDS2"
mkdir -p "$TESTDS3"
mkdir -p "$TESTDS4"
#DECLARE FUNCTIONS VARS
RUN=1
EchoDates() {
echo "$(date)" > $TESTDS1/$HBFILE
echo "$(date)" > $TESTDS2/$HBFILE
echo "$(date)" > $TESTDS3/$HBFILE
echo "$(date)" > $TESTDS4/$HBFILE
}
TestDatastores() {
if [[ -e "$TESTDS1" || -e "$TESTDS2" || -e "$TESTDS3" || -e "$TESTDS4" ]]
then
EchoDates
local return_value=0
else
echo "$(date) -- $(hostname) lost connectivity to all datastores at $(date)...."
local return_value=1
fi
return "$return_value"
}
LockScript(){
lockfile -r 0 "$LOCKFILELOC" || exit 1 # CREATE THE LOCK; EXIT IF ANOTHER COPY GOT IT FIRST
}
CheckLock(){
[ ! -e "$LOCKFILELOC" ] || exit 1 # LOCK EXISTS, SCRIPT IS ALREADY RUNNING
}
CheckHARunning(){ #FUNCTION TO CHECK IF HA(FDM v5.1+) IS RUNNING
if [ $(ps -Z | grep fdm | wc -l) -gt 0 ]
then
local rt_CheckHARunning=1 #FDM running
else
local rt_CheckHARunning=0 #FDM not running
fi
return "$rt_CheckHARunning"
}
RescanHBAs(){
esxcli storage core adapter rescan --all
}
CheckLockFileExists(){
if [ -e "$LOCKFILELOC" ]
then
local rt_ChkFileDel=1 #CHK NOT DELETED
else
local rt_ChkFileDel=0 #CHK DELETED
fi
return "$rt_ChkFileDel"
}
#MAIN LOOP PROGRAM
CheckLock
LockScript
RescanHBAs
EchoDates
echo "$(date) $(hostname) has started IVISHA!"
while [ "$RUN" -eq 1 ] # [ $RUN ] IS TRUE FOR ANY NON-EMPTY VALUE, EVEN 0
do
#Check and see if lockfile is deleted. If so EXIT GRACEFULLY.
CheckLockFileExists
rt_ChkFileDel=$?
if [ "$rt_ChkFileDel" -eq 0 ] #LOCK FILE DELETED, EXIT GRACEFULLY
then
echo "$(date) -- Lock File Deleted Killing Script... "
RUN=0
break
fi
#Check and see if HA is running if it isn't don't do anything.
CheckHARunning
rt_CheckHARunning=$?
if [ $rt_CheckHARunning == 1 ]
then
TestDatastores
return_value=$?
if [ $return_value == 1 ]
then
echo "$(date) -- Initial Failure Detected $(date) APD Detected..."
ContinueLoop=1
FailureCount=0
while [ "$ContinueLoop" -eq 1 ] # COMPARE THE VALUE SO ContinueLoop=0 REALLY ENDS THE LOOP
do
sleep 5
TestDatastores
return_value=$?
if [ "$return_value" -eq 1 ] && [ "$FailureCount" -eq 3 ] #REBOOT ONLY IF THE LATEST CHECK ALSO FAILED
then
echo "$(date) -- IVISHA has detected APD... REBOOTING HOST NOW!!!!"
reboot -n -f
fi
if [ $return_value == 1 ]
then
let FailureCount=FailureCount+1
RescanHBAs
sleep 5
else
EchoDates
let ContinueLoop=0
fi
done
else
EchoDates
fi
sleep 20
else
echo "$(date) -- IVISHA has detected HA (FDM Agent) is Off-line... Script Sleeping..."
sleep 120
fi
done
-The script is stored on a shared data-store among all of the hosts.
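On the lock file: CheckLock and LockScript rely on the `lockfile` command. If that command is not available on your build, one common alternative is to use a directory as the lock, since `mkdir` either creates it or fails in a single step. A minimal sketch, assuming a lock directory under /tmp; `acquire_lock`, `release_lock` and the path are made-up names, not something the script above uses:

```shell
#!/bin/sh
# Sketch only: acquire_lock / release_lock and the lock path are hypothetical.

# Succeeds only for the first caller; fails if the directory already exists.
acquire_lock() {
    mkdir "$1" 2>/dev/null
}

# Removes the lock directory again (it must be empty).
release_lock() {
    rmdir "$1" 2>/dev/null
}

# Example use at the top of a script:
#   LOCKDIR=/tmp/ivisHB.lck.d
#   acquire_lock "$LOCKDIR" || exit 1          # already running
#   trap 'release_lock "$LOCKDIR"' EXIT        # clean up on normal exit
```

Deleting the directory by hand would still work as the "stop" signal, the same way deleting the lock file does now.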
SCRIPT LOGIC
1. Define static vars.
2. Make directories to write heartbeats to.
3. Check whether the lock file the script creates already exists; if it does, exit (the script is already running).
4. Create the lock file.
5. Scan the HBAs.
6. Echo to the log that the script has started.
7. Start endless loop.
8. Ensure the lock file exists, else exit.
9. Check that HA is running; if not, sleep.
10. Test whether the heartbeat directories on the datastores exist; if any of them does, write a date/time stamp into the heartbeat files.
11. If none exists, go into a sub loop: count up to 3 failed re-checks, rescanning the HBAs after each one, to see whether any one of the datastores is accessible. If the datastores are still inaccessible after that, force a REBOOT.
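The sub loop in step 11 boils down to "re-check a few times before doing anything drastic". Here is that idea on its own, as a sketch under stated assumptions: `confirm_outage` and `check_cmd` are hypothetical names, the check command is expected to return 0 when storage is reachable, and the HBA rescan between attempts is left out to keep it short.

```shell
#!/bin/sh
# Sketch only: confirm_outage and check_cmd are hypothetical names.

# Re-run check_cmd up to max_failures times, pausing before each attempt.
# Returns 0 as soon as one check succeeds (storage is back),
# 1 if every re-check failed (the caller would then reboot the host).
confirm_outage() {
    check_cmd="$1"
    max_failures="$2"
    pause="$3"
    failures=0
    while [ "$failures" -lt "$max_failures" ]; do
        sleep "$pause"
        if "$check_cmd"; then
            return 0
        fi
        failures=$((failures + 1))
    done
    return 1
}

# Example call, with a check function that returns 0 when storage is fine:
#   confirm_outage TestDatastores 3 5 || reboot -n -f
```

Keeping the reboot outside the function makes it easy to dry-run the logic with a harmless command first.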
From the time of pulling the cable from the host to the time of reboot is about 4.5 minutes on my test boxes. It works every time; HA then restarts the VMs on surviving hosts... The script is still very RAW, but again, it works. Please post updates/enhancements here....
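For anyone wondering where the 4.5 minutes go, the sleeps in the script as posted only add up to a small part of it. A quick back-of-the-envelope calculation, using the sleep values from the script (the variable names here are just for illustration):

```shell
#!/bin/sh
# Sleep budget of the script as posted, in seconds.
MAIN_SLEEP=20      # sleep at the end of a healthy main-loop pass
RECHECK_SLEEP=5    # sleep before each re-check in the sub loop
RESCAN_SLEEP=5     # sleep after each HBA rescan
RECHECKS=4         # three counted failures plus the pass that triggers the reboot
RESCANS=3          # one rescan per counted failure

sleep_budget=$((MAIN_SLEEP + RECHECKS * RECHECK_SLEEP + RESCANS * RESCAN_SLEEP))
echo "worst-case scripted sleep: ${sleep_budget}s"   # prints 55s
```

So under a minute is deliberate waiting. The rest of the observed time is presumably spent in the rescans and in file operations waiting on the dead paths, but that split is a guess, not a measurement.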
Now the next problem.... How do you start the script on an ESXi host???
This is where PowerCLI and PLINK come to the rescue.
# IVIS HA FUNCTIONS
Function Start-ESXiHostSSH()
{
param(
[Parameter(
Mandatory=$true,
ValueFromPipeline=$true,
ValueFromPipelineByPropertyName=$true
)]
[string]$HostName
)
if(![string]::IsNullOrEmpty($_)){
$HostName = $_
}
$TargetHost = Get-VMHost -Name $HostName
Start-VMHostService -HostService ($TargetHost | Get-VMHostService | Where { $_.Key -eq "TSM-SSH"})
}
Function Stop-ESXiHostSSH()
{
param(
[Parameter(
Mandatory=$true,
ValueFromPipeline=$true,
ValueFromPipelineByPropertyName=$true
)]
[string]$HostName
)
if(![string]::IsNullOrEmpty($_)){
$HostName = $_
}
$TargetHost = Get-VMHost -Name $HostName
Stop-VMHostService -HostService ($TargetHost | Get-VMHostService | Where { $_.Key -eq "TSM-SSH"}) -Confirm:$false
}
#GLOBAL HA FUNCTIONS
$primaryIvisHADS = "/vmfs/volumes/***LOCATION_OF_SCRIPT_DATASTORE***/ivisHA/"
$primaryIvisHADSScript = "/vmfs/volumes/***LOCATION_OF_SCRIPT_DATASTORE***/ivisHA/ivisHAv2.sh"
$lockFileLocation = "/tmp/ivisHB.lck"
$localIvisHBLog = "/scratch/log/ivisHA.log"
function Start-ivisHA #With -force, a stale lock file is removed first. Use it only when the script has crashed and left its lock file behind, not while it is still running.
{
param(
[Parameter(
Mandatory=$true,
ValueFromPipeline=$true,
ValueFromPipelineByPropertyName=$true
)]
[string]$HostName,
[string]$pass,
[bool]$stopSSH = $true,
[bool]$force = $false
)
Copy "\\NETWORK_LOCATION_OF_PLINK.EXE" "c:\putty\"
Start-ESXiHostSSH -HostName $HostName
if($force){
C:\putty\plink.exe $HostName -l root -pw $pass "rm -f $lockFileLocation; cp $primaryIvisHADSScript /tmp/; chmod 777 /tmp/ivisHAv2.sh; nohup /tmp/ivisHAv2.sh > $localIvisHBLog &"
}
else{
C:\putty\plink.exe $HostName -l root -pw $pass "cp $primaryIvisHADSScript /tmp/; chmod 777 /tmp/ivisHAv2.sh; nohup /tmp/ivisHAv2.sh > $localIvisHBLog &"
}
if($stopSSH){
Stop-ESXiHostSSH -HostName $HostName
}
}
function Stop-ivisHA #Stops the script on the ESXi host by deleting its lock file; the script exits the next time it checks for the lock. (-force is currently unused here.)
{
param(
[Parameter(
Mandatory=$true,
ValueFromPipeline=$true,
ValueFromPipelineByPropertyName=$true
)]
[string]$HostName,
[string]$pass,
[bool]$stopSSH = $true,
[bool]$force = $false
)
Copy "\\NETWORK_LOCATION_OF_PLINK.EXE" "c:\putty\"
Start-ESXiHostSSH -HostName $HostName
C:\putty\plink.exe $HostName -l root -pw $pass "rm -f $lockFileLocation"
if($stopSSH){
Stop-ESXiHostSSH -HostName $HostName
}
}
function Collect-ivisHALogs
{
param(
[Parameter(
Mandatory=$true,
ValueFromPipeline=$true,
ValueFromPipelineByPropertyName=$true
)]
[string]$HostName,
[string]$pass
)
Copy "\\NETWORK_LOCATION_OF_PLINK.EXE" "c:\putty\"
Start-ESXiHostSSH -HostName $HostName
C:\putty\plink.exe $HostName -l root -pw $pass "mkdir -p $primaryIvisHADS/logCollection; cp /scratch/log/ivisHA.log $primaryIvisHADS/logCollection/ivisHA$HostName.log"
Stop-ESXiHostSSH -HostName $HostName
}
APD isn't very common, but it does happen, and when it does strike it has a serious impact... I am available via email / message to help you set this up in your environment and experiment with it if you would like.
Cheers vmware!
Just an FYI: at VMworld, VMware announced that they are working on a solution for APD scenarios for a future release. It was called "component protection". I wrote about it here: INF-BCO2807 - vSphere HA and Datastore Access Outages tech preview
Thanks Duncan,
First off... huge fan...
Second, I had read your article, but the disclaimer had me very worried that it would take VMware a long time to put component protection together.
As I understand it, that solution will be very complex (not nearly as simple as the hack I put together)...
Thanks again, and Cheers!!
It is complex for sure... And I must say nice job on the script!
Thanks Ryan for posting this. Just wanted to ask if you guys know of another way to do this such as executing a smaller script when a vCenter Alarm or vCOPS alert is generated for APD? We are busy deploying an EMC Metro Storage Cluster for our VMware environment and APD is one of the only scenarios that doesn't have an automatic response of some kind.
Thanks for the whitepaper Duncan, it's coming in handy.
Regards
Tom Otomanski
Good Morning Tom!
Firstly, Awesome.
I am sure you will find the metro storage cluster concept extremely beneficial for your organization.
That was indeed my first approach to this script, but with FC storage connectivity I couldn't find any alerts that bubbled up through vCOps / vSphere to trigger anything from. It also hit me that if the vCenter VM and/or the vCOps VM were themselves in an APD state, you wouldn't be able to pull any information from either system.
I did, however, find distinctive entries on the hosts themselves in the vmkernel log. (This could be another approach.)
Since this post I have reworked the script quite a bit to better suit the environment it runs in (14-node cluster / EMC VPLEX / CLARiiON / VNX / FC connectivity), which is fully production. We will be testing it in the coming month by actually pulling the storage fabric from the c7000 (so far the script has been vetted in a virtual test environment I built). I have full confidence that it is going to do what it needs to do.
If you are interested, I can bundle up my script with the PowerCLI functions / modules and email them to you.
Or....
If you are attending vmworld this year (San Francisco) I would be happy to show you the actual running script and functions in person.
Cheers Tom.
Thanks Ryan
Let me know how the testing goes with your updated script. Unfortunately a couple of people from my team attended EMC World earlier this year so the bosses had long drawn out faces when I asked about VMworld (Though we usually attend the Barcelona event).
I've managed to patch together a simpler solution while doing research. vCenter doesn't create an alarm during an APD occurrence; however, events are logged on the hosts to indicate that APD has occurred on a datastore. I found the below post on William Lam's blog, written during the 5.5U1 NFS disconnect fiasco, which helped out:
I created a new alarm on the host and added the below 2 entries to the trigger:
esx.problem.storage.apd.start
esx.problem.storage.apd.timeout
I selected "reboot host" as the action on the alarm. While testing we noticed that running services.sh restart on a host with APD would kick off an HA restart of the VMs, so we tried incorporating this into a script. With the limited amount of time we have before going live we haven't been able to finish that, hence the reboot host action.
In theory this should reboot the host when the alarm is triggered by the APD event. I'm going to test this out next week and also see if I can apply it at the cluster level, as that would be a more convenient place to define the alarm (otherwise you'd have to add the alarm every time you add a new host to vCenter).
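For anyone who would rather watch for the same two events on the host itself, a rough sketch of the idea is below. Only the two event IDs come from the alarm above; `count_apd_events` is a made-up helper name, and which log file to point it at is an assumption you need to verify on your own hosts (I believe these events end up in vobd.log, but check before relying on it).

```shell
#!/bin/sh
# Sketch only: count_apd_events is a hypothetical helper, and the log file
# to scan is an assumption; only the two event IDs come from the alarm above.

# Print how many lines of the given log mention an APD start or timeout event.
count_apd_events() {
    logfile="$1"
    grep -c -e 'esx\.problem\.storage\.apd\.start' \
            -e 'esx\.problem\.storage\.apd\.timeout' "$logfile"
}

# Example call (path is an assumption):
#   count_apd_events /var/log/vobd.log
```

A non-zero count on its own says nothing about whether the datastore has recovered since, so this is a starting point for a check rather than a reboot trigger by itself.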
With regards to running this on the vCenter versus a standalone script, our VMware environment is quite big already and is expanding rapidly (going to hit over 100 hosts, over 4000 VM's) so it's more feasible to run this as a vCenter alarm. I'm also thinking of possibly running our vCenter at our DR site on the host running our Witness server so hopefully that should mitigate the risk of running this on vCenter (Or at least run the vCenter on a Management cluster separate from the production environment).
Enjoy VMworld!
Tom Otomanski
Thanks Tom!
I really like the idea of monitoring and rebooting the host from vCenter, given a bit of time I may explore that approach as well. Thanks!
As for my script, I have tuned and simplified it for our environment. It has been in production for about 3 weeks now without any issues. This coming weekend we are going to cause an APD event on all of the hosts at our second data center (pulling the fabric from the blade enclosure) to fully test that it works with our production hosts (I am confident that it will). Our environment is much smaller: 14 hosts and about 500 VMs. Although that larger environment sounds like a blast!
As far as the number of hosts goes, the PowerShell side of the script actually queries all hosts within a cluster and then starts the script on those hosts (so starting the script on all of the hosts is a one-liner in PowerShell).
I have a standard Windows scheduled task that runs a PowerShell script nightly and ensures all hosts (even if we add more) have the script started... It also sends an email and writes system events for SCOM to pick up and ingest.
The script has been simplified so that all you need to do to use it is change one static variable (the cluster name), and the PowerShell section handles the rest. (With more than 100 hosts you would have multiple clusters, so it would take a bit of modification.)
Can't wait for VMWorld... Should be a blast!
Cheers Tom!
