Some comments based on heavy NFS usage
VMs tend to survive up to 60 seconds of no storage access; after that, Linux remounts its filesystems read-only and Windows becomes unstable.
I tested storage HA several times (such as moving a LIF to the other controller/node) and never had any issues.
You can check with NetApp for their recommended OS optimizations (disk timeouts, etc.); it might be helpful for the older OSes out there.
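As an illustration of the kind of guest-side tuning those vendor docs cover, here is a sketch of raising the SCSI command timeout inside a Linux guest so it rides out a longer failover. The 180-second value and the `sda`/rule filename are assumptions for the example; use whatever timeout your storage vendor actually recommends.

```shell
# Check the current per-disk SCSI timeout (the default is usually 30s):
cat /sys/block/sda/device/timeout

# Raise it at runtime (180s is an illustrative value, not a vendor number):
echo 180 > /sys/block/sda/device/timeout

# Make it persistent across reboots with a udev rule:
cat > /etc/udev/rules.d/99-scsi-timeout.rules <<'EOF'
ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ATTR{device/timeout}="180"
EOF
udevadm control --reload-rules
```

With a 180s timeout the guest keeps queuing I/O through a 60s path failover instead of erroring out and going read-only.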
- How does a VM (Debian/CentOS/Windows, at the OS level only) react to a storage loss of a few seconds (1-5) during a storage failover? ->
When a path fails, storage I/O might pause for 30-60 seconds until your host determines that the link is unavailable and performs the failover. If you attempt to display the host, its storage devices, or its adapters, the operation might appear to stall. Virtual machines with their disks installed on the SAN can appear unresponsive. After the failover, I/O resumes normally and the virtual machines continue to run.
Virtual machine I/O might be delayed for up to 60 seconds while path failover takes place. With these delays, the SAN can stabilize its configuration after topology changes. In general, the I/O delays might be longer on active-passive arrays and shorter on active-active arrays.
It's ESXi that reacts to the storage loss: as soon as the LUN failover happens on the array, the Path Selection Policy detects the error and reports it to the SATP plugin (which depends on your storage array type), and the I/Os are retried over the new path.
You may check - > VMware Knowledge Base
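To see this in practice on a host, you can list how ESXi currently classifies each device and path. These are standard esxcli commands run on the ESXi shell; the output fields (SATP, PSP, path state) map directly to the failover behavior described above.

```shell
# Show each device with its claimed SATP and Path Selection Policy:
esxcli storage nmp device list

# Show every path and its state (active / standby / dead):
esxcli storage core path list
```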
- Do you know of a white paper that focuses on path status and configuration optimization on ESXi?
To optimize the connectivity path, you may use the Multipathing Plug-ins (MPPs) provided by the storage array vendor.
For EMC-based arrays there is PowerPath, which replaces the native NMP and SATP.
You can also tweak the PSP settings for optimization.
Check with your storage vendor for recommendations on tweaking driver parameters and other HBA settings.
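A sketch of the kind of PSP tweaking meant above. The SATP name and the `naa.xxx` device ID are placeholders, and IOPS=1 is only a commonly cited Round Robin tuning, not a universal recommendation; follow your array vendor's guidance.

```shell
# Make Round Robin the default PSP for a given SATP (example SATP name):
esxcli storage nmp satp set --satp VMW_SATP_ALUA --default-psp VMW_PSP_RR

# Switch one device to Round Robin (naa.xxx is a placeholder device ID):
esxcli storage nmp device set --device naa.xxx --psp VMW_PSP_RR

# Rotate to the next path after every single I/O (illustrative tuning):
esxcli storage nmp psp roundrobin deviceconfig set \
  --device naa.xxx --type iops --iops 1
```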
I think khiregange's response is the most complete.
Some points of my own:
- I've seen Unix/Linux go read-only after more than 60s without I/O to the storage. With less than that, performance deteriorates and IOPS are queued.
- Make sure not to use multipathing inside your VMs; use it on the hosts.
- I've seen another weird issue in which the storage array had a LOT of dead paths, which caused the ESXi management services to go down and a PSOD. --> We needed to rescan and restart the management agents. I don't know if this is your case, but keep it in mind.
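For reference, the recovery steps mentioned in that last point look roughly like this on the ESXi shell (standard commands; run with care, since restarting the agents briefly disconnects the host from vCenter):

```shell
# Rescan all adapters so dead paths get re-evaluated:
esxcli storage core adapter rescan --all

# Restart the management agents (hostd / vpxa):
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
# or restart all agents at once:
services.sh restart
```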
Regarding the white paper you are requesting, I don't remember one in particular, but every storage vendor (and every type of storage) behaves a little differently.
Can you provide more information regarding your vSphere version and Storage Array configuration?
Thanks a lot for your input everyone !