VMware Cloud Community
Sharantyr3
Enthusiast

How to handle a temporary (iSCSI) storage access outage

Hello!

I was wondering if anyone could give me advice on how to handle storage access failures for VMs.

For example, we have an iSCSI storage array located in datacenter 1, and VMs accessing it from datacenter 1 AND another datacenter, datacenter 2.

We had a network outage between datacenter 1 and datacenter 2 that caused issues inside a Windows file server guest VM (all Previous Versions VSS snapshots were lost or corrupted).

How could I handle this in the future?

I know I can increase the disk timeout inside the Windows guest (see the vSphere Documentation Center), but would that be enough?
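
For reference, the guest-side setting mentioned here is usually the TimeoutValue registry entry (a REG_DWORD, in seconds) under the Disk service key. A minimal sketch, assuming that key and value name, that only renders a .reg file for review; the function name and the 180-second figure are illustrative placeholders, not recommendations:

```python
# Sketch: render a .reg file that sets the Windows guest disk I/O timeout.
# Assumption: the guest reads TimeoutValue (REG_DWORD, seconds) from the
# Disk service key below. The 180 s value is only an illustration.

DISK_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk"


def disk_timeout_reg(seconds: int = 60) -> str:
    """Return .reg file text that sets the disk TimeoutValue to `seconds`."""
    if not 0 < seconds <= 0xFFFFFFFF:
        raise ValueError("timeout must be a positive 32-bit value")
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{DISK_KEY}]\n"
        f'"TimeoutValue"=dword:{seconds:08x}\n'
    )


if __name__ == "__main__":
    # Print the file content so it can be reviewed before importing it.
    print(disk_timeout_reg(180))
```

A longer guest timeout only makes the guest wait longer before failing an I/O; it does not by itself guarantee consistency if the outage outlasts it.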

Do I also have to increase timeouts on the ESXi hosts?

What I am trying to achieve is data consistency over service availability.

By this I mean I would like a way to "auto-freeze" all VMs that need access to storage in an APD (All Paths Down) condition, either with a SUSPEND command sent directly to the VM or by freezing the VM world on the ESXi host.

The VM would be inaccessible, but its I/O would not fail while waiting for the storage to become accessible again.

Thanks for any hints.

5 Replies
daphnissov
Immortal

You're trying to engineer a solution around a problem which is caused by poor architecture decisions, namely:

"we have an iSCSI storage array located in datacenter 1, and VMs accessing it from datacenter 1 AND another datacenter, datacenter 2"

This is just a very bad idea to begin with, as you've now discovered. A failure of the link between the two sites now disrupts not only application traffic but IP storage traffic as well. Storage access should be kept local to a datacenter for scenarios just like this.

Sharantyr3
Enthusiast

I can't disagree with you, but sometimes you have to deal with management requirements and budget constraints.

And right now, the storage array is in a datacenter that has no ESXi hosts yet (even worse than what I described).

Anyway, I'm sure I could find a use case that would be valid in your eyes, but that's not the point.

I don't think what I'm asking is so unusual, and I would like some guidance on how to tweak timeouts and at which levels (ESXi, guest, ...?).

I would also like to know whether, by any chance, a mechanism to freeze or suspend VMs could be triggered in such scenarios.

Sometimes you just have a short network outage, and being able to do what I am asking would save a lot of time.

Sorry for my poor English, by the way.

daphnissov
Immortal

The fact is, if a host cannot see its storage for an extended period, there's really nothing you can do. For brief storage disconnects, there are a few tweaks you can make, but if you kill the storage traffic altogether, there's no magic that makes that event OK. So things like the "suspend" or "freeze" you have in mind won't work. Some of those tweaks are at the ESXi level: advanced iSCSI parameters such as LoginTimeout and RecoveryTimeout. Again, those have limited effectiveness and will not help when the link is hard down. Until you address the heart of the issue by fixing the separation of workload from storage, nothing else matters much.
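
As a rough illustration of where those two parameters would be set, here is a minimal sketch that only builds the esxcli command lines for review and does not run them. The adapter name vmhba64 and the numeric values are hypothetical placeholders, and the `esxcli iscsi adapter param set` option names are an assumption that should be checked against your ESXi version (for example by listing the current values with `esxcli iscsi adapter param get` first):

```python
import shlex
from typing import List, Mapping

# Parameter names come from the reply above; the values are placeholders only.
EXAMPLE_PARAMS = {"LoginTimeout": 30, "RecoveryTimeout": 60}


def iscsi_param_commands(adapter: str, params: Mapping[str, int]) -> List[str]:
    """Build one esxcli command line per advanced iSCSI parameter.

    Nothing is executed here; the strings are meant to be reviewed and then
    run manually in an ESXi shell.
    """
    commands = []
    for key, value in params.items():
        argv = [
            "esxcli", "iscsi", "adapter", "param", "set",
            f"--adapter={adapter}",  # hypothetical adapter name
            f"--key={key}",
            f"--value={value}",
        ]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    for line in iscsi_param_commands("vmhba64", EXAMPLE_PARAMS):
        print(line)
```

Each parameter has an allowed range on the host, so the placeholder values may be rejected; and as noted above, none of this helps once the link is hard down.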

gregsn
Enthusiast

You may want to give NFS a try, as it may handle I/O during an extended APD differently than iSCSI. In my experience, when I was experimenting with extended NFSv3 outages (several hours), Windows 10 VMs stored on NFSv3 had no issues (no crashes, no file system corruption, etc.). I've seen VMs running older versions of Windows BSOD and reboot during extended NFS outages, but with no file system corruption as a result. I've not tested this with NFSv4.

You can experiment using a small Linux NFS VM as a test datastore (sync mount); you can then power it off, reboot it, etc. while doing I/O to it and see what happens.
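
One way to generate that I/O is a small probe run inside a test guest whose disk lives on the test datastore. A minimal sketch follows; the function name, file path and thresholds are all hypothetical. It appends a timestamped record, forces it to disk with fsync, and records how long each write took, so stalls and I/O errors during the outage show up in the output:

```python
import os
import time


def probe_writes(path, count=5, interval=1.0, stall_threshold=2.0):
    """Append `count` records to `path`, syncing each one to disk.

    Returns a list of (index, latency_seconds, error) tuples, where error is
    None on success or the OSError message if the write or fsync failed.
    """
    results = []
    with open(path, "ab", buffering=0) as fh:
        for i in range(count):
            start = time.monotonic()
            error = None
            try:
                fh.write(f"{i} {time.time():.3f}\n".encode())
                os.fsync(fh.fileno())  # force the record down to storage
            except OSError as exc:
                error = str(exc)
            latency = time.monotonic() - start
            if error is not None or latency > stall_threshold:
                print(f"write {i}: latency={latency:.2f}s error={error}")
            results.append((i, latency, error))
            time.sleep(interval)
    return results


if __name__ == "__main__":
    # Hypothetical path on the guest disk backed by the test datastore.
    probe_writes("probe.log", count=600, interval=1.0)
```

After the outage, comparing the record indices in the file with the returned results shows whether any acknowledged write went missing.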

Sharantyr3
Enthusiast

Thanks for your answers.

NFS is not offered by the storage array, unfortunately, so this is not an option.

I will check on timeout values on both ESXi and guests.

According to "Configuring Advanced Parameters for iSCSI", the values I might look into would be DefaultTimeToWait and DefaultTimeToRetain.
