Hello everyone,
I'm seeking information about the consequences of an iSCSI storage failover (Active (I/O) path dead) at the VM level:
I found many documents that explain in detail how redundancy works, but nothing along the lines of "Hey, we lost an array IRL, the path changed in less than 5 seconds, but 150 VMs got weird after the failover and we know why! (BTW, we know how to simulate it)".
Array vendors have many specifications and configuration recommendations depending on the array hardware. The context of my question is restricted to VMware/VMs only.
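(For anyone wanting to reproduce this in a lab: one way to simulate an Active (I/O) path dying, without touching the array, is to administratively disable the path from the ESXi shell. This is only a sketch; the device NAA ID and the `vmhba33:C0:T0:L0` runtime name below are hypothetical placeholders you'd replace with your own.)

```shell
# List all paths and their states for a given device
# (the naa.* device ID is a placeholder)
esxcli storage core path list -d naa.60000000000000000000000000000001

# Administratively bring the Active (I/O) path down to force a failover
# (runtime path name is a hypothetical example)
esxcli storage core path set --state off -p vmhba33:C0:T0:L0

# Re-enable the path once you're done testing
esxcli storage core path set --state on -p vmhba33:C0:T0:L0
```

These commands only run on an ESXi host, so treat them as a host-side recipe rather than something to script blindly against production.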
Thanks in advance guys.
When a path fails, storage I/O might pause for 30-60 seconds until your host determines that the link is unavailable and performs the failover. If you attempt to display the host, its storage devices, or its adapters, the operation might appear to stall. Virtual machines with their disks installed on the SAN can appear unresponsive. After the failover, I/O resumes normally and the virtual machines continue to run.
Virtual machine I/O might be delayed for up to 60 seconds while path failover takes place. With these delays, the SAN can stabilize its configuration after topology changes. In general, the I/O delays might be longer on active-passive arrays and shorter on active-active arrays.
It's the ESXi host that reacts to the storage loss as soon as LUN failover happens on the array: the Path Selection Policy detects the error and reports it to the SATP plugin (which depends on your storage array configuration), and the I/Os are retried over the new path.
You may check the VMware Knowledge Base article "Path Failover and Virtual Machines".
To optimize the connectivity path, you may use the Multipathing Plug-ins (MPPs) provided by the storage array vendor.
For EMC-based arrays, there is PowerPath, which replaces the native NMP and SATP.
You can also tweak the PSP settings for optimization.
Check with your storage vendor for recommendations on tweaking the driver parameters and other HBA settings.
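(To illustrate the PSP tweaks mentioned above, here is a hedged sketch using the standard esxcli NMP commands. The device ID is a placeholder, and whether Round Robin or an IOPS limit of 1 is appropriate is exactly the kind of thing your array vendor's docs should decide.)

```shell
# Show which SATP and PSP each device is currently claimed by
esxcli storage nmp device list

# Switch one device to the Round Robin PSP (device ID is a placeholder)
esxcli storage nmp device set \
    --device naa.60000000000000000000000000000001 --psp VMW_PSP_RR

# Optionally lower the Round Robin path-switch threshold to 1 I/O
# (a common vendor recommendation, but verify it for your array)
esxcli storage nmp psp roundrobin deviceconfig set \
    --device naa.60000000000000000000000000000001 --type iops --iops 1
```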
Hi
Some comments based on heavy NFS usage
VMs tend to survive up to 60 seconds of no storage access; after that, Linux remounts its filesystems read-only and Windows goes crazy.
I tested storage HA several times (e.g., moving a LIF to the other controller/unit) and never had any issues.
You can check NetApp's recommended guest OS optimizations (disk timeouts, etc.); they might be helpful for the older OSes out there.
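(As a concrete example of the guest-side disk timeout tuning mentioned above: on Linux guests, the usual approach is a udev rule that raises the per-disk SCSI command timeout so the guest rides out the failover window. This is only a sketch; the 180-second value and the rule filename are examples, not a vendor recommendation — check your storage vendor's guest OS guidance for the actual number.)

```shell
# Compose a udev rule that raises the SCSI command timeout for all disks.
# To apply it for real, write it (as root) to e.g.
# /etc/udev/rules.d/99-scsi-timeout.rules and reload udev; here we just print it.
TIMEOUT=180   # example value only; use your vendor's recommended timeout
RULE="ACTION==\"add\", SUBSYSTEM==\"block\", ENV{DEVTYPE}==\"disk\", RUN+=\"/bin/sh -c 'echo ${TIMEOUT} > /sys/block/%k/device/timeout'\""
echo "$RULE"
```

The same idea applies per-disk at runtime: `echo 180 > /sys/block/sdX/device/timeout` takes effect immediately but does not survive a reboot, which is why the udev rule is the persistent form.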
I think khiregange's response is the most complete.
Some points of my own:
- I've seen Unix/Linux go read-only after more than 60 seconds with no I/O to the storage. Below that threshold, performance deteriorates and IOPS are queued.
- Make sure not to use multipathing inside your VMs; use it on the hosts.
- I've seen another weird issue in which the storage array had a LOT of dead paths, which caused the ESXi management services to go down and triggered a PSOD. We needed to rescan and restart the management agents. I don't know if this is your case, but keep it in mind.
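(For the dead-path scenario above, this is roughly what the rescan/restart looked like on the host — a sketch using standard ESXi shell commands. Note that restarting the management agents briefly disconnects the host from vCenter, so do it deliberately.)

```shell
# Spot dead paths: list all paths and summarize their states
# (path list output is one "State: ..." line per path)
esxcli storage core path list | grep -i "State:" | sort | uniq -c

# After the array side is fixed, rescan all adapters to clear stale paths
esxcli storage core adapter rescan --all

# Restart the management agents from the ESXi shell (brief mgmt outage)
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
```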
Regarding the whitepaper you are requesting, I don't remember one in particular, but every storage vendor (and every type of storage) behaves a little differently.
Can you provide more information regarding your vSphere version and Storage Array configuration?
Thanks a lot for your input, everyone!