VMware Cloud Community
RyanI
Contributor

ESXi 5.5 -- APD not causing HA event. --Scripts to monitor / reboot hosts

Good Morning VMWARE world!

So... long story short, we were bitten hard by HA's lack of protection against APD. We run a stretched cluster on 5.5 with the latest updates. EMC VPLEX replicates storage between two different EMC arrays and keeps the datastores in sync, creating a true virtual write-anywhere configuration.

There are two formal classifications of storage failures from an FDM (HA) perspective, as far as I understand: APD (All Paths Down) and PDL (Permanent Device Loss).

PDL is determined by SCSI sense codes sent from the array to the host. APD is a connection failure between the array and the host, or another catastrophic failure that didn't produce a SCSI sense code.

We have a HP c7000 enclosure with several 1/2 height blades.

About a month ago the FlexFabric cards had a failure where only storage connectivity from the enclosure to the SAN was lost (networking was not impacted). The VMs essentially lost the ability to complete any storage I/O and went into what I can best describe as a 'zombie' state. Many of the blades ramped CPU up to near 100% after 15 minutes or so as the VMs were unable to complete any storage I/O. We lost about 200 VMs, which caused a major outage to the business.

It was then I learned that HA doesn't protect against APD in any way, shape or form. We really wish VMware would solve this issue.

Instead I have created a few scripts and processes to fix this issue until developers of FDM can get this resolved...

Here is how I solved this issue.

I have created a shell script that runs on an ESXi host that has the following high level logic.

-Check to see if storage attached via FC to SAN is up and accessible.

-If storage is down and all data stores are inaccessible (defined data stores) reboot the impacted host, which will force HA to reboot VM's on a surviving host.

-Ensure script can't restart a host if HA isn't running.

-Ensure that script can't be started multiple times on a host.

-Have a way to collect logs from the script.

-Check against multiple datastores to ensure that paths are down.

-If paths are down use esxcli to rescan the interface several times prior to killing the host.

-run the script on the ESXi host itself to ensure that patching / other activities doesn't impact the script from running.
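
The "all defined datastores are inaccessible" test from the list above can be sketched as one small POSIX shell function. This is only an illustration: `any_datastore_writable` and its arguments are hypothetical names, and on a real host a probe against a datastore in APD may block rather than fail quickly, which this sketch does not handle.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script).
# Usage: any_datastore_writable HEARTBEAT_FILE_NAME DIR [DIR...]
# Succeeds if at least one directory exists and accepts a heartbeat write;
# fails only if every directory is missing or unwritable.
any_datastore_writable() {
  hb_name=$1
  shift
  hb_status=1
  for hb_dir in "$@"; do
    # Skip paths that are not visible, then try to write a timestamp.
    if [ -d "$hb_dir" ] && { date > "$hb_dir/$hb_name"; } 2>/dev/null; then
      hb_status=0
    fi
  done
  return "$hb_status"
}
```

Requiring every datastore to fail before acting is what keeps a single unmounted or renamed datastore from triggering a reboot.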

Here we go!

The script again runs on the ESXi host itself. It is a linux shell script (I am not a linux engineer so this was best effort for me)....

#!/bin/sh

echo "$(date) -- Current date : $(date) @ $(hostname)"
echo "$(date) -- IVIS HA APD Issue Script Start!"

#DEFINE TEST DATASTORES
HBFILE=ivisHB$(hostname).txt
TESTDS1=/vmfs/volumes/***DATASTORE_TO_TEST_PATH_1***/ivisHA
TESTDS2=/vmfs/volumes/***DATASTORE_TO_TEST_PATH_2***/ivisHA
TESTDS3=/vmfs/volumes/***DATASTORE_TO_TEST_PATH_3***/ivisHA
TESTDS4=/vmfs/volumes/***DATASTORE_TO_TEST_PATH_4***/ivisHA
LOCKFILELOC=/tmp/ivisHB.lck

#Create the heartbeat directories (-p: no error if they already exist)
mkdir -p "$TESTDS1"
mkdir -p "$TESTDS2"
mkdir -p "$TESTDS3"
mkdir -p "$TESTDS4"

#DECLARE FUNCTIONS VARS
RUN=1

#Write a timestamp into the heartbeat file on every test datastore.
EchoDates() {
  echo "$(date)" > "$TESTDS1/$HBFILE"
  echo "$(date)" > "$TESTDS2/$HBFILE"
  echo "$(date)" > "$TESTDS3/$HBFILE"
  echo "$(date)" > "$TESTDS4/$HBFILE"
}

#Return 0 if at least one heartbeat directory is visible, 1 if none are.
TestDatastores() {
  if [ -e "$TESTDS1" ] || [ -e "$TESTDS2" ] || [ -e "$TESTDS3" ] || [ -e "$TESTDS4" ]
  then
    EchoDates
    local return_value=0
  else
    echo "$(date) -- $(hostname) lost connectivity to all datastores at $(date)...."
    local return_value=1
  fi
  return "$return_value"
}

LockScript(){
  lockfile -r 0 "$LOCKFILELOC"
}

CheckLock(){
  lockfile -r 0 "$LOCKFILELOC" || exit 1 # SCRIPT IS ALREADY RUNNING
}

CheckHARunning(){ #FUNCTION TO CHECK IF HA(FDM v5.1+) IS RUNNING
  #[f]dm keeps the grep process itself from being counted as a match
  if [ "$(ps -Z | grep -c '[f]dm')" -gt 0 ]
  then
    local rt_CheckHARunning=1 #FDM running
  else
    local rt_CheckHARunning=0 #FDM not running
  fi
  return "$rt_CheckHARunning"
}

RescanHBAs(){
  esxcli storage core adapter rescan --all
}

CheckLockFileExists(){
  if [ -e "$LOCKFILELOC" ]
  then
    local rt_ChkFileDel=1 #CHK NOT DELETED
  else
    local rt_ChkFileDel=0 #CHK DELETED
  fi
  return "$rt_ChkFileDel"
}

#MAIN LOOP PROGRAM
CheckLock
LockScript
RescanHBAs
EchoDates
echo "$(date) $(hostname) has started IVISHA!"

#Note: [ $RUN ] is true for any non-empty value (including 0), so compare explicitly.
while [ "$RUN" -eq 1 ]
do
  #Check and see if lockfile is deleted. If so EXIT GRACEFULLY.
  CheckLockFileExists
  rt_ChkFileDel=$?
  if [ "$rt_ChkFileDel" -eq 0 ] #0 = LOCK FILE DELETED
  then
    echo "$(date) -- Lock File Deleted Killing Script... "
    RUN=0
    break
  fi

  #Check and see if HA is running if it isn't don't do anything.
  CheckHARunning
  rt_CheckHARunning=$?
  if [ "$rt_CheckHARunning" -eq 1 ]
  then
    TestDatastores
    return_value=$?
    if [ "$return_value" -eq 1 ]
    then
      echo "$(date) -- Initial Failure Detected $(date) APD Detected..."
      ContinueLoop=1
      FailureCount=0
      while [ "$ContinueLoop" -eq 1 ]
      do
        sleep 5
        TestDatastores
        return_value=$?
        if [ "$return_value" -eq 1 ]
        then
          FailureCount=$((FailureCount + 1))
          #Reboot only after 3 consecutive failed re-checks
          if [ "$FailureCount" -ge 3 ]
          then
            echo "$(date) -- IVISHA has detected APD... REBOOTING HOST NOW!!!!"
            reboot -n -f
          fi
          RescanHBAs
          sleep 5
        else
          #Storage is back: heartbeat and leave the sub loop
          EchoDates
          ContinueLoop=0
        fi
      done
    else
      EchoDates
    fi
    sleep 20
  else
    echo "$(date) -- IVISHA has detected HA (FDM Agent) is Off-line... Script Sleeping..."
    sleep 120
  fi
done

-The script is stored on a shared data-store among all of the hosts.

SCRIPT LOGIC

1. Define static vars.

2. Make directories to write heartbeats to.

3. Check to see if the lock file the script creates already exists. If it does, exit the script (as the script is already running).

4. Create the lock file.

5. Scan the HBA's

6. Echo to the log that the script has started.

7. Start endless loop.

8. Ensure lockfile exists, else exit.

9. Check to ensure HA is running if not sleep.

10. Test the datastores to see if the heartbeat directory exists. If it exists on at least one of them, heartbeat a datetime stamp into the file.

11. If it doesn't exist on any of them, go into a sub loop: check 3 more times to see if any one of the datastores is accessible, rescanning the HBAs between checks. If all 3 checks fail, force a REBOOT.
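
The sub loop in step 11 can be sketched on its own as a retry helper. This is a minimal sketch and not the script's own code: `confirm_outage` is a hypothetical name, the probe command is passed in by the caller, and the HBA rescan between attempts is left out to keep it short.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script).
# Usage: confirm_outage RETRIES DELAY_SECONDS PROBE_COMMAND [ARGS...]
# Re-runs the probe up to RETRIES times, DELAY_SECONDS apart.
# Returns 0 if every attempt failed (outage confirmed) and
# 1 as soon as one attempt succeeds (storage came back).
confirm_outage() {
  co_retries=$1
  co_delay=$2
  shift 2
  co_attempt=0
  while [ "$co_attempt" -lt "$co_retries" ]; do
    sleep "$co_delay"
    if "$@"; then
      return 1
    fi
    co_attempt=$((co_attempt + 1))
  done
  return 0
}

# Example wiring, commented out so the sketch has no side effects:
# if confirm_outage 3 5 TestDatastores; then reboot -n -f; fi
```

Returning as soon as one probe succeeds means a short path flap never reaches the reboot; only an outage that persists across every retry does.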

From the time of pulling the cable from the host to the time of reboot is about 4.5 min on my test boxes. It works every time, and HA then reboots the VMs on surviving hosts. The script is still very raw, but again it works. Please post updates/enhancements here.
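
One possible enhancement, offered as a sketch rather than a tested change to the script: the single-instance lock can be taken with `mkdir`, which either creates the directory or fails if it already exists, so only one caller wins. The function names and the `/tmp/ivisHB.lockdir` path in the comments are hypothetical.

```shell
#!/bin/sh
# Hypothetical alternative to the lockfile command used in the script.
# mkdir is atomic: exactly one caller can create a given directory.
acquire_lock() {
  mkdir "$1" 2>/dev/null
}

# Remove the lock directory; fails if it does not exist.
release_lock() {
  rmdir "$1" 2>/dev/null
}

# Typical use at the top of a script, commented out in this sketch:
# acquire_lock /tmp/ivisHB.lockdir || exit 1   # another copy is running
# trap 'release_lock /tmp/ivisHB.lockdir' EXIT
```

Deleting the lock directory from outside could still serve as the stop signal, the same way deleting the lock file does in the script.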

Now the next problem.... How do you start the script on an ESXi host???

This is where PowerCLI and PLINK come to the rescue.

# IVIS HA FUNCTIONS

Function Start-ESXiHostSSH()
{
    param(
        [Parameter(
            Mandatory=$true,
            ValueFromPipeline=$true,
            ValueFromPipelineByPropertyName=$true
            )]
        [string]$HostName
    )
    if(![string]::IsNullOrEmpty($_)){
        $HostName = $_
    }
    $TargetHost = Get-VMHost -Name $HostName
    Start-VMHostService -HostService ($TargetHost | Get-VMHostService | Where { $_.Key -eq "TSM-SSH"})
}

Function Stop-ESXiHostSSH()
{
    param(
        [Parameter(
            Mandatory=$true,
            ValueFromPipeline=$true,
            ValueFromPipelineByPropertyName=$true
            )]
        [string]$HostName
    )
    if(![string]::IsNullOrEmpty($_)){
        $HostName = $_
    }
    $TargetHost = Get-VMHost -Name $HostName
    Stop-VMHostService -HostService ($TargetHost | Get-VMHostService | Where { $_.Key -eq "TSM-SSH"}) -Confirm:$false
}

#GLOBAL HA FUNCTIONS

$primaryIvisHADS = "/vmfs/volumes/***LOCATION_OF_SCRIPT_DATASTORE***/ivisHA/"
$primaryIvisHADSScript = "/vmfs/volumes/***LOCATION_OF_SCRIPT_DATASTORE***/ivisHA/ivisHAv2.sh"
$lockFileLocation = "/tmp/ivisHB.lck"
$localIvisHBLog = "/scratch/log/ivisHA.log"

#Use -force only when the script has crashed and left its lock file behind; do not use it while the script is still running.
function Start-ivisHA
{
    param(
        [Parameter(
            Mandatory=$true,
            ValueFromPipeline=$true,
            ValueFromPipelineByPropertyName=$true
            )]
        [string]$HostName,
        [string]$pass,
        [bool]$stopSSH = $true,
        [bool]$force = $false
    )
    Copy "\\NETWORK_LOCATION_OF_PLINK.EXE" "c:\putty\"
    Start-ESXiHostSSH -HostName $HostName
    if($force){
        #Remove the stale lock file first, then start the script from its absolute path
        C:\putty\plink.exe $HostName -l root -pw $pass "rm $lockFileLocation; cp $primaryIvisHADSScript /tmp/; chmod 777 /tmp/ivisHAv2.sh; nohup /tmp/ivisHAv2.sh > $localIvisHBLog &"
    }
    else{
        C:\putty\plink.exe $HostName -l root -pw $pass "cp $primaryIvisHADSScript /tmp/; chmod 777 /tmp/ivisHAv2.sh; nohup /tmp/ivisHAv2.sh > $localIvisHBLog &"
    }
    if($stopSSH){
        Stop-ESXiHostSSH -HostName $HostName
    }
}

#Removes the lock file, which is the signal for the script on the ESXi host to exit.
function Stop-ivisHA
{
    param(
        [Parameter(
            Mandatory=$true,
            ValueFromPipeline=$true,
            ValueFromPipelineByPropertyName=$true
            )]
        [string]$HostName,
        [string]$pass,
        [bool]$stopSSH = $true,
        [bool]$force = $false
    )
    Copy "\\NETWORK_LOCATION_OF_PLINK.EXE" "c:\putty\"
    Start-ESXiHostSSH -HostName $HostName
    #No redirect here: "> $localIvisHBLog" would truncate the script's log
    C:\putty\plink.exe $HostName -l root -pw $pass "rm $lockFileLocation"
    if($stopSSH){
        Stop-ESXiHostSSH -HostName $HostName
    }
}

function Collect-ivisHALogs
{
    param(
        [Parameter(
            Mandatory=$true,
            ValueFromPipeline=$true,
            ValueFromPipelineByPropertyName=$true
            )]
        [string]$HostName,
        [string]$pass
    )
    Copy "\\NETWORK_LOCATION_OF_PLINK.EXE" "c:\putty\"
    Start-ESXiHostSSH -HostName $HostName
    #mkdir -p: no error if the collection folder already exists; run in the foreground so the copy finishes before SSH is stopped
    C:\putty\plink.exe $HostName -l root -pw $pass "mkdir -p $primaryIvisHADS/logCollection; cp /scratch/log/ivisHA.log $primaryIvisHADS/logCollection/ivisHA$HostName.log"
    Stop-ESXiHostSSH -HostName $HostName
}
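
For reference, the remote command line that Start-ivisHA sends through plink (copy, chmod, nohup, background) can be sketched as a shell function. This is an illustration under assumptions, not the author's code: `start_detached` and its three arguments are hypothetical, it uses mode 700 instead of 777, and it sends stderr to the log as well.

```shell
#!/bin/sh
# Hypothetical helper mirroring the remote command line.
# Usage: start_detached SCRIPT_PATH WORK_DIR LOG_FILE
# Copies the script into WORK_DIR, makes it executable, starts it detached
# with stdout and stderr going to LOG_FILE, and prints the new PID.
start_detached() {
  sd_src=$1
  sd_workdir=$2
  sd_log=$3
  sd_target="$sd_workdir/$(basename "$sd_src")"
  cp "$sd_src" "$sd_target" || return 1
  chmod 700 "$sd_target" || return 1
  # nohup keeps the script alive after the SSH session closes
  nohup "$sd_target" > "$sd_log" 2>&1 &
  echo "$!"
}
```

Starting the copy by its absolute path avoids depending on the working directory of the SSH session.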

APD isn't very common, but it does happen, and when it does strike it has a serious impact. I am available via email / message to help you set this up in your environment and play with it if you would like.

Cheers vmware!

7 Replies
depping
Leadership

Just an FYI: at VMworld, VMware announced that they are working on a solution for APD scenarios for a future release. It was called "component protection". I wrote about it here: INF-BCO2807 - vSphere HA and Datastore Access Outages tech preview

RyanI
Contributor

Thanks Duncan,

First off... Huge Fan...

Second off, I had read your article, but the disclaimer had me very worried that it would take VMware a long time to put together component protection.

As I understand it, this solution will be very complex (not nearly as simple as the hack I put together)...

Thanks again, and Cheers!!

depping
Leadership

It is complex for sure... And I must say nice job on the script!

TomOtto
Contributor

Thanks Ryan for posting this. Just wanted to ask if you guys know of another way to do this such as executing a smaller script when a vCenter Alarm or vCOPS alert is generated for APD? We are busy deploying an EMC Metro Storage Cluster for our VMware environment and APD is one of the only scenarios that doesn't have an automatic response of some kind.

Thanks for the whitepaper Duncan, it's coming in handy.

Regards

Tom Otomanski

RyanI
Contributor

Good Morning Tom!

Firstly, Awesome.

I am sure you will find the metro storage cluster concept extremely beneficial for your organization.

That was indeed my first approach to configuring this script, but while using FC for storage connectivity I couldn't find any alerts that bubbled up through vCOPS / vSphere to trigger anything off of. It also hit me then that if the vCenter VM and / or the vCOPS VM was in an APD state, you wouldn't be able to pull any information from either system.

I did, however, find unique entries on the hosts themselves within the vmkernel log. (This could be another approach.)

Since this post I have reworked the script quite a bit to better suit the environment it is running in (14 node cluster / EMC VPLEX / CLARiiON / VNX / FC connectivity), which is fully production. We will be testing it in the coming month by actually pulling the storage fabric from the c7000 (the script was vetted in a virtual test environment I built). I have full confidence that it is going to do what it needs to do.

If you are interested I can bundle up my script / with the powerCLI functions / modules and email them to ya.

Or....

If you are attending vmworld this year (San Francisco)  I would be happy to show you the actual running script and functions in person.

Cheers Tom.

TomOtto
Contributor

Thanks Ryan

Let me know how the testing goes with your updated script. Unfortunately a couple of people from my team attended EMC World earlier this year so the bosses had long drawn out faces when I asked about VMworld (Though we usually attend the Barcelona event).

I've managed to patch together a simpler solution while doing research. vCenter doesn't create an alarm during an APD occurrence; however, there are events logged on the hosts to indicate that APD has occurred on a datastore. I found the below post on William Lam's blog, written during the 5.5U1 NFS disconnect fiasco, which helped out:

http://www.virtuallyghetto.com/2014/04/how-to-create-vcenter-alarm-to-alert-on-esxi-5-5u1-nfs-apd-is...

I created a new alarm on the host and added the below 2 entries to the trigger:

esx.problem.storage.apd.start

esx.problem.storage.apd.timeout


I selected "reboot host" as the action on the alarm. While testing we noticed that running services.sh restart on a host with APD would kick off HA for the VMs, so we tried incorporating this into a script. With the limited amount of time we have before going live we haven't been able to finish this, hence the reboot host action.

This should in theory reboot the host when the event occurs, since the alarm is triggered during the APD event. I'm going to test this out next week and then also see if I can apply it on the cluster, as that would be a more convenient place to run the alarm from (otherwise you'd have to add the alarm every time you add a new host to vCenter).
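
A host-side counterpart to the alarm trigger is to look for the same two event IDs in a log file on the host. This is a minimal sketch only: `count_apd_events` is a hypothetical helper, and the log file is passed in as an argument because which log records these events on a given ESXi build is an assumption to verify on the host.

```shell
#!/bin/sh
# Hypothetical helper: print the number of lines in a log file that
# mention either of the two APD event IDs used in the alarm trigger.
# Usage: count_apd_events LOG_FILE
count_apd_events() {
  grep -c -e 'esx\.problem\.storage\.apd\.start' \
          -e 'esx\.problem\.storage\.apd\.timeout' "$1"
}
```

Note that `grep -c` prints 0 and exits non-zero when nothing matches, so a caller should compare the printed count rather than rely on the exit status.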


With regards to running this on the vCenter versus a standalone script, our VMware environment is quite big already and is expanding rapidly (going to hit over 100 hosts, over 4000 VM's) so it's more feasible to run this as a vCenter alarm. I'm also thinking of possibly running our vCenter at our DR site on the host running our Witness server so hopefully that should mitigate the risk of running this on vCenter (Or at least run the vCenter on a Management cluster separate from the production environment).


Enjoy VMworld!

Tom Otomanski

RyanI
Contributor

Thanks Tom!

I really like the idea of monitoring and rebooting the host from vCenter, given a bit of time I may explore that approach as well. Thanks!

As far as my script, I have tuned it and simplified it for our environment. It has been in production for about 3 weeks now without any issues. We are going to cause an APD event on all of the hosts at our second data center this coming weekend (pulling fabric from the blade enclosure) to fully test and ensure that it will work with our production hosts. I am confident that it will. Our environment is much smaller: 14 hosts and about 500 VMs. That larger environment sounds like a blast, though!

As far as the number of hosts, the PowerShell functions of the script actually query for all hosts within a cluster and then start the script on those hosts (so starting the script on all of the hosts is a one-liner in PowerShell).

I have a standard Windows scheduled task that runs a PowerShell script nightly and ensures all hosts (even if we add more) have the script started. It also sends an email and writes system events for SCOM to pick up and ingest.

The script has been simplified such that all you need to do to use it is change one static variable (cluster name) and the PowerShell section handles the rest. (With more than 100 hosts you would have multiple clusters, so it would take a bit of modification.)

Can't wait for VMWorld... Should be a blast!

Cheers Tom!
