Solved: vMSC: PDL/APD?

Dryv · ‎02-03-2016

Hi All,

Would the scenario below be an APD/PDL condition? And if so, what should actually happen?

Environment:

- vMSC setup with 4 servers at each site.

- ESXi 5.5

- I have set the following values as per Duncan's post on vMSC:

VMkernel.Boot.terminateVMOnPDL = True
Das.maskCleanShutdownEnabled = True
Disk.AutoremoveOnPDL = 0

- Using 3Par as storage

- One Active LUN hosted out of Site 1, replicated to Site 2

- From Site 1 all VMs have their disks on the one Active LUN on Site 1

- Site 2 hosts just 1 VM and its disks are also on the one Active LUN in Site 1

- The Site 2 VM therefore talks to its disks over the stretched FC connection between the sites (I know it's not the best thing to do, but I am testing)
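
For anyone wanting to verify the three settings listed above across hosts, here is a minimal sketch in Python. It only compares a dictionary of collected values against the values quoted in this post; how the values are collected is left out, and the function name `settings_drift` and the sample `host` dictionary are hypothetical.

```python
# Expected values exactly as quoted in the post above.
RECOMMENDED = {
    "VMkernel.Boot.terminateVMOnPDL": True,
    "Das.maskCleanShutdownEnabled": True,
    "Disk.AutoremoveOnPDL": 0,
}


def settings_drift(current: dict) -> dict:
    """Return {name: (expected, actual)} for each setting that differs or is missing."""
    drift = {}
    for name, expected in RECOMMENDED.items():
        actual = current.get(name)  # None if the setting was not collected
        # Compare types as well, because in Python True == 1 and False == 0.
        if type(actual) is not type(expected) or actual != expected:
            drift[name] = (expected, actual)
    return drift


# Hypothetical values collected from one host (assumption, not real output).
host = {
    "VMkernel.Boot.terminateVMOnPDL": True,
    "Das.maskCleanShutdownEnabled": True,
    "Disk.AutoremoveOnPDL": 1,
}
print(settings_drift(host))  # {'Disk.AutoremoveOnPDL': (0, 1)}
```

An empty result means the host matches the values listed in the post.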

Now, when the stretched fabric between the 2 sites is physically disconnected the following is observed (Note the network between all hosts remains up):

- Site 2 ESXi hosts can no longer see the one Active LUN in Site 1. Accepted.

- Site 1 ESXi hosts continue to see the one Active LUN in Site 1. Accepted.

- Site 1 VMs continue to run. They respond to ping, accept RDP connections, and allow browsing through the VM's file system. Accepted.

- The Site 2 VM, confusingly, continues to respond to ping and accepts an RDP connection, but it just hangs when I try to browse the file system. So I can still get to the VM even though it's of no use anymore, as its disks have been pulled. Why is this happening? Shouldn't the VM be killed and restarted elsewhere?

I'm pretty sure I am seeing something that has been addressed in version 6, which restarts the VM after an APD timeout value. But can this be applied to 5.5 environments in the case where the preferred VM-to-storage alignment is overridden?

kermic · ‎02-03-2016

For VMCP (PDL/APD protection) to work, all hosts in the HA cluster must be ESXi 6.0 or higher (vSphere 6.0 Documentation Center). It should not work with ESXi 5.5.
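
As a rough illustration of that requirement, here is a small Python sketch that checks whether every host in a cluster meets the 6.0 minimum. The function names and the version strings are hypothetical; it only encodes the "all hosts 6.0 or higher" rule stated above, nothing more.

```python
def parse_version(version: str) -> tuple:
    """Turn a string such as '5.5.0' into (5, 5, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def vmcp_eligible(host_versions: list) -> bool:
    """True only if the list is non-empty and every host is ESXi 6.0 or newer."""
    return bool(host_versions) and all(
        parse_version(v) >= (6, 0) for v in host_versions
    )


# Hypothetical clusters: one mixed, one fully on 6.0.
print(vmcp_eligible(["5.5.0", "6.0.0"]))  # False
print(vmcp_eligible(["6.0.0", "6.0.0"]))  # True
```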

And the guest OS does not necessarily have to fail with a screen of your favorite color when the disk is detached. As long as the processes running in the guest access only data in memory, it should survive.

As for the APD timeout value, in case you are talking about the "Misc.APDTimeout" setting (140s by default): this is the time after which the host will fast-fail system I/O when a storage device disappears in a "surprise" fashion. HA has an additional setting that makes it wait for a specified amount of time before actually restarting VMs if APD is detected on any host in the cluster.
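
To make the two timers concrete, here is a minimal sketch in Python. The 140 s figure is the Misc.APDTimeout default mentioned above; the 180 s HA delay, and the assumption that it only starts counting once the APD timeout has expired, are illustrative assumptions, so check them against your own cluster settings. The function and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ApdTimeline:
    apd_timeout_at: float  # seconds after path loss when the host starts fast-failing I/O
    ha_restart_at: float   # seconds after path loss when HA may begin restarting VMs


def apd_timeline(apd_timeout_s: float = 140.0, ha_delay_s: float = 180.0) -> ApdTimeline:
    """Add the HA wait on top of the host-level APD timeout.

    ha_delay_s is an assumed value for illustration, not taken from the thread.
    """
    if apd_timeout_s < 0 or ha_delay_s < 0:
        raise ValueError("timeouts must be non-negative")
    return ApdTimeline(
        apd_timeout_at=apd_timeout_s,
        ha_restart_at=apd_timeout_s + ha_delay_s,
    )


timeline = apd_timeline()
print(timeline.apd_timeout_at)  # 140.0
print(timeline.ha_restart_at)   # 320.0
```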

Hope this helps!

Dryv · ‎02-03-2016

Hi Kermic

Thanks for the explanation. Are you basically saying:

1. I need to go to vSphere 6 in order for the VM to be shut down and restarted by HA if this scenario happens?

2. The only way to avoid this on 5.5 is to ensure the VM and its storage are always on the same site?

Thank you so much for your time.

kermic · ‎02-04-2016

1) If you would like HA to restart VMs on other hosts in an APD/PDL scenario, then yes, you need vCenter 6.x and ESXi 6.x. 5.x hosts won't do anything apart from reacting to the storage loss by fast-failing I/Os, either immediately or after a specified timeout (depending on the scenario).

2) Ensuring VMs run in the same site where the read/write copy of their storage resources resides would be one option. Another is implementing a storage metro-cluster-like solution, where 2 arrays in different (metro) sites synchronously replicate a storage resource (for example, LUN A1 in site 1 is kept in sync with LUN A2 in site 2) and present that resource as a virtual entity (LUN A, which has A1 and A2 behind it) to hosts in both sites, so that it is accessible to hosts in both sites in read/write fashion. If implemented correctly, this would avoid APD issues in case of inter-site connectivity loss. These tend to be fairly expensive solutions, so if you ask your storage vendor about them, they will most likely be more than happy to share lots of info.

Dryv · ‎02-04-2016

Perfect. Thank you Kermic. Very helpful indeed. Yes, unfortunately I don't believe my storage can do this. I'm running 3Par with Peer Persistence. I believe you are referring to what EMC VPLEX is capable of doing, i.e. being able to write to the LUN from both sites in case of inter-site connectivity loss.

Thanks again for your time. Really very much appreciated. This kind of direction and knowledge doesn't come about often.