VMware Cloud Community
FJ1200
Enthusiast

VM shutdown on APD possible?

Is it possible to shut down VMs on a host in an APD (All Paths Down) situation as well as in a PDL (Permanent Device Loss) situation?  Our customer wants the system configured that way if possible, rather than having the VMs sitting with no I/O if the SAN fails.  I believe I could script it, but I don't know how to trigger the script when an APD condition occurs. Can the system state be monitored without too much overhead?  Or is there a third-party tool that would do the job?  Would I have to trigger it through vCenter and, if so, what happens if the SAN running vCenter fails?

PDL works fine, btw.

7 Replies
admin
Immortal

APD can happen when storage is suddenly or unexpectedly disconnected from the ESXi host. When an APD is encountered, we won't have access to the VMs, since the storage is down. If you write a script, you need to point it at the ESXi host itself, since APD is an ESXi host-level event.

FJ1200
Enthusiast

Hi AakashJ,

I understand APD events and how they work, but the customer wants the VMs shut down as they would be for a PDL event.  I'm just not sure whether it can be done easily, or how to trigger a script on ESXi.  I could monitor vmkernel.log for a confirmed APD event, but how?  I could poll it frequently, but that's a bit clunky.

admin
Immortal

Yes, the logic might be to grep the ESXi host's vmkernel log for APD entries around the current time, then kill the worlds associated with the VMs on the affected datastore. You can engage VMware PSO if it's a requirement from the customer.
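
To make the "grep for APD around the current time" idea concrete, here is a minimal Python sketch of the filtering step only. It rests on two assumptions that should be checked against a real vmkernel.log: that each line starts with an ISO-8601 UTC timestamp, and that an APD entry contains the phrase "All Paths Down". The function name and the 60-second window are hypothetical choices for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: lines start with a UTC timestamp like 2013-07-10T12:34:56.789Z.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z")
# Assumption: an APD entry contains this phrase (matched case-insensitively).
APD_MARKER = "all paths down"


def recent_apd_lines(lines, now, window_seconds=60):
    """Return the APD-looking lines stamped within window_seconds before now."""
    cutoff = now - timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match or APD_MARKER not in line.lower():
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        stamp = stamp.replace(tzinfo=timezone.utc)
        if cutoff <= stamp <= now:
            hits.append(line)
    return hits
```

Anything this returns would still need to be mapped to the affected datastore and its VMs before a kill (for example via `esxcli vm process kill`) is attempted; that mapping is not shown here.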

FJ1200
Enthusiast

Is there a good way to trigger this other than running it as a cron job, or would I simply have to poll a tail of the log on a regular basis?  If I have to poll it, what do people think is an acceptable frequency?  Could I poll the datastores and, if they become unavailable, tail the log?  Is there anything in ESXi like the FileSystemWatcher in .NET?
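
For the polling option, one way to avoid re-reading the whole log on every pass is to remember a byte offset and read only what has been appended since. The following is a rough Python sketch of that idea, not a tested ESXi tool; `read_new_lines` is a hypothetical helper name, and the rotation handling simply assumes that a file shorter than the saved offset has been rotated.

```python
import os


def read_new_lines(path, offset):
    """Return (complete lines appended since offset, new offset)."""
    size = os.path.getsize(path)
    if size < offset:
        # File is shorter than last time: assume it was rotated, start over.
        offset = 0
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    # Consume only complete lines; a partial last line waits for the next poll.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

A cron job or a sleep loop would call this with the offset returned by the previous call, so each poll costs one size check plus a read of the new bytes only. The polling interval itself remains a judgement call.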

Sorry - new to scripting ESXi and haven't done any *nix shell scripting in years.

FJ1200
Enthusiast

Still looking into this; however, it's become a bit more serious.

Due to the way our software replicates its databases, if the db VMs hang we could potentially end up in a situation where a large database gets corrupted or even deleted: the software can still see the db server and so tries to replicate.  We need a way to shut down affected VMs in an APD state.  I could use esxcfg-mpath to get the state, or grep vmkernel.log from cron, but would prefer something built in, or at least have the option to shut a VM down.  This is serious for us: we have a system due to be shipped to a customer in 2 weeks and I need to find a way to do this.
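
As a sketch of the esxcfg-mpath idea, the Python below takes the text of a compact path listing and reports devices for which every listed path is dead. It assumes, without having verified it on this build, that each line of the listing has the path name first, a `state:<value>` token second, and the device identifier third; the function name is hypothetical, and the field positions should be checked against real output before relying on it.

```python
from collections import defaultdict


def devices_with_all_paths_dead(listing):
    """Return a sorted list of device IDs whose listed paths are all dead."""
    states = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        # Assumed layout: <path> state:<value> <device> ...
        if len(fields) < 3 or not fields[1].startswith("state:"):
            continue
        states[fields[2]].append(fields[1].split(":", 1)[1])
    return sorted(
        device
        for device, seen in states.items()
        if all(state == "dead" for state in seen)
    )
```

A device with at least one non-dead path is deliberately not reported, since APD means all paths are down.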

For info: we're on ESXi 5.1.0 U1 with vCenter on 5.1.0 U1a, the DBs are proprietary to us, and it's the first time we've used HA on our system, so this is new territory we were not expecting.

Any ideas?  If I script it, what's the best way to trigger it in ESXi?  Cron? 

I saw an unanswered post from this time last year asking about triggering a script from an Alarm.  Would that work?  What if vCenter is one of the VMs that gets hit?

admin
Immortal

FJ1200 wrote: "I could use esxcfg-mpath to get the state or grep vmkernel.log run from cron but would prefer something built in, or at least have the option to shut a VM down.  This is serious for us, and we have a system due to be shipped to a customer in 2 weeks and I need to find a way to do this."

A host in APD runs on a best-effort basis; you cannot expect vmkernel.log and esxcfg-mpath to contain useful information, or even to log the event correctly. ESXi 5.1 resolves that issue to a certain degree, because hostd is not killed completely by an APD.

Just imagine the following scenario:

Your storage latency grows extremely high, and failing over to a different path does not resolve the issue, for whatever reason. By definition you might not be in a clean APD, but the behaviour you experience is the same. How do you intend to deal with that scenario? (And yes, it is valid; I see it several times per month.)
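
One way to treat a clean APD and an "extremely slow but not officially down" datastore the same is to probe the storage with a deadline instead of reading logs. The Python below is only a sketch of that watchdog pattern: it assumes that a stat against a path on the datastore stalls or fails when the storage is unresponsive, which would need testing, and `probe_responds` and its parameters are hypothetical names.

```python
import os
import threading


def probe_responds(path, timeout_seconds, probe=os.stat):
    """Return True only if probe(path) succeeds within timeout_seconds."""
    result = {}
    finished = threading.Event()

    def worker():
        try:
            probe(path)
            result["ok"] = True
        except OSError:
            result["ok"] = False
        finally:
            finished.set()

    # Daemon thread: a probe stuck on hung storage must not block exit.
    threading.Thread(target=worker, daemon=True).start()
    if not finished.wait(timeout_seconds):
        return False  # no answer before the deadline: treat as unresponsive
    return result["ok"]
```

Several consecutive failed probes, rather than a single one, would be a more cautious trigger for shutting anything down, since a single slow response may be transient.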

FJ1200
Enthusiast

OK.  So what would you suggest?  I'm cloning the 3 db servers involved to test them with this scenario, as I have the system to myself for a few days.  We need to verify just what impact this will have, so any suggestions as to what I could do or how we could resolve this are welcome.
