Hello everyone,
I'm seeking information about the consequences of an iSCSI storage failover (Active (I/O) path dead) at the VM level:
I found many documents that explain in detail how redundancy works, but nothing along the lines of "Hey, we lost an array IRL, the path changed in less than 5 seconds, but 150 VMs got weird after the failover and we know why! (BTW, we know how to simulate it)".
Array vendors have many specifications and configuration recommendations depending on the array hardware. The context of my question is restricted to VMware/VMs only.
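(For anyone wanting to reproduce this in a lab: one way to simulate an Active (I/O) path dying, without touching the array, is to administratively disable the path from the ESXi shell. This is only a sketch; the device NAA ID and the `vmhba33:C0:T0:L0` runtime name below are hypothetical placeholders you'd replace with your own.)

```shell
# List all paths and their states for a given device
# (the naa.* device ID is a placeholder)
esxcli storage core path list -d naa.60000000000000000000000000000001

# Administratively bring the Active (I/O) path down to force a failover
# (runtime path name is a hypothetical example)
esxcli storage core path set --state off -p vmhba33:C0:T0:L0

# Re-enable the path once you're done testing
esxcli storage core path set --state on -p vmhba33:C0:T0:L0
```

These commands only run on an ESXi host, so treat them as a host-side recipe rather than something to script blindly against production.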
Thanks in advance guys.
When a path fails, storage I/O might pause for 30-60 seconds until your host determines that the link is unavailable and performs the failover. If you attempt to display the host, its storage devices, or its adapters, the operation might appear to stall. Virtual machines with their disks installed on the SAN can appear unresponsive. After the failover, I/O resumes normally and the virtual machines continue to run.
Virtual machine I/O might be delayed for up to 60 seconds while path failover takes place. With these delays, the SAN can stabilize its configuration after topology changes. In general, the I/O delays might be longer on active-passive arrays and shorter on active-active arrays.
It's the ESXi host that reacts to the storage loss as soon as LUN failover happens on the array: the Path Selection Policy detects the error and reports it to the SATP plugin (which depends on your storage array configuration), and the I/Os are retried over the new path.
You may check the VMware Knowledge Base article "Path Failover and Virtual Machines".
To optimize the connectivity path, you may use the Multipathing Plug-ins (MPPs) provided by the storage array vendor.
For EMC-based arrays, there is PowerPath, which replaces the native NMP and SATP.
You can also tweak the PSP settings for optimization.
Check with your storage vendor for recommendations on tweaking the driver parameters and other HBA settings.
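(To illustrate the PSP tweaks mentioned above, here is a hedged sketch using the standard esxcli NMP commands. The device ID is a placeholder, and whether Round Robin or an IOPS limit of 1 is appropriate is exactly the kind of thing your array vendor's docs should decide.)

```shell
# Show which SATP and PSP each device is currently claimed by
esxcli storage nmp device list

# Switch one device to the Round Robin PSP (device ID is a placeholder)
esxcli storage nmp device set \
    --device naa.60000000000000000000000000000001 --psp VMW_PSP_RR

# Optionally lower the Round Robin path-switch threshold to 1 I/O
# (a common vendor recommendation, but verify it for your array)
esxcli storage nmp psp roundrobin deviceconfig set \
    --device naa.60000000000000000000000000000001 --type iops --iops 1
```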
Hi
Some comments based on heavy NFS usage
VMs tend to survive up to 60 seconds of no storage access; after that, Linux remounts its filesystems read-only and Windows goes crazy.
I tested storage HA several times (e.g., moving a LIF to the other controller/unit) and never had any issues.
You can check NetApp's recommended guest OS optimizations (disk timeouts, etc.); they might be helpful for the older OSes out there.
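(As a concrete example of the guest-side disk timeout tuning mentioned above: on Linux guests, the usual approach is a udev rule that raises the per-disk SCSI command timeout so the guest rides out the failover window. This is only a sketch; the 180-second value and the rule filename are examples, not a vendor recommendation — check your storage vendor's guest OS guidance for the actual number.)

```shell
# Compose a udev rule that raises the SCSI command timeout for all disks.
# To apply it for real, write it (as root) to e.g.
# /etc/udev/rules.d/99-scsi-timeout.rules and reload udev; here we just print it.
TIMEOUT=180   # example value only; use your vendor's recommended timeout
RULE="ACTION==\"add\", SUBSYSTEM==\"block\", ENV{DEVTYPE}==\"disk\", RUN+=\"/bin/sh -c 'echo ${TIMEOUT} > /sys/block/%k/device/timeout'\""
echo "$RULE"
```

The same idea applies per-disk at runtime: `echo 180 > /sys/block/sdX/device/timeout` takes effect immediately but does not survive a reboot, which is why the udev rule is the persistent form.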
I think khiregange's response is the most complete.
Some points of my own:
- I've seen Unix/Linux go read-only after more than 60 seconds with no I/O to the storage. Below that threshold, performance deteriorates and IOPS are queued.
- Make sure not to use multipathing inside your VMs; use it on the hosts.
- I've seen another weird issue in which the storage array had a LOT of dead paths, which caused the ESXi management services to go down and triggered a PSOD. We needed to rescan and restart the management agents. I don't know if this is your case, but keep it in mind.
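(For the dead-path scenario above, this is roughly what the rescan/restart looked like on the host — a sketch using standard ESXi shell commands. Note that restarting the management agents briefly disconnects the host from vCenter, so do it deliberately.)

```shell
# Spot dead paths: list all paths and summarize their states
# (path list output is one "State: ..." line per path)
esxcli storage core path list | grep -i "State:" | sort | uniq -c

# After the array side is fixed, rescan all adapters to clear stale paths
esxcli storage core adapter rescan --all

# Restart the management agents from the ESXi shell (brief mgmt outage)
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
```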
Regarding the whitepaper you are requesting, I don't remember one in particular, but every storage vendor (and every type of storage) behaves a little differently.
Can you provide more information regarding your vSphere version and Storage Array configuration?
Thanks a lot for your input, everyone!