4 Replies Latest reply on Sep 20, 2020 6:50 AM by amnesyak

Storage connectivity losses, Consequences at VM Level

    amnesyak Lurker

      Hello everyone,

       

I'm looking for information about the consequences of an iSCSI storage failover (Active (I/O) path dead) at the VM level:

       

• How does a VM (Debian/CentOS/Windows, at the guest OS level only) react to a storage loss of a few seconds (1-5) during a storage failover?
• Do you know of a white paper that focuses on path status and configuration optimisation on ESXi?

       

I've found many documents that explain in detail how redundancy works, but nothing like "Hey, we lost an array IRL, the path changed in less than 5 seconds, but 150 VMs got weird after the failover and we know why! (BTW, we know how to simulate it)".

       

Array vendors have plenty of specifications and configuration recommendations depending on the array hardware. The context of my question is restricted to VMware and the VMs only.

       

      Thanks in advance guys.

  • 1. Re: Storage connectivity losses, Consequences at VM Level
          ZibiM Enthusiast

          Hi

           

          Some comments based on heavy NFS usage

VMs tend to survive up to 60 s without storage access; after that, Linux remounts its filesystems read-only and Windows starts misbehaving.

I tested storage HA several times (e.g. moving a LIF to the other controller/unit) and never had any issues.

           

You can check NetApp's recommended guest OS optimizations (disk timeouts, etc.); they can be helpful for the older OSes out there.
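For Linux guests, those optimizations usually boil down to raising the SCSI command timeout so a short failover gets retried instead of surfacing as a guest I/O error. A rough sketch of the common mechanism, a udev rule (the 180-second value and the rule file name are illustrative, not official vendor settings; confirm against your vendor's host utilities):

```
# /etc/udev/rules.d/99-scsi-timeout.rules -- illustrative sketch only
# Raise the SCSI command timeout for VMware virtual disks to 180 s so that
# brief path failovers are retried rather than reported as I/O errors.
ACTION=="add", SUBSYSTEMS=="scsi", ATTRS{vendor}=="VMware ", ATTRS{model}=="Virtual disk", RUN+="/bin/sh -c 'echo 180 > /sys$DEVPATH/device/timeout'"
```

You can verify the effective value for a given disk with `cat /sys/block/sdX/device/timeout` inside the guest.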

• 2. Re: Storage connectivity losses, Consequences at VM Level
            khiregange Enthusiast
• How does a VM (Debian/CentOS/Windows, at the guest OS level only) react to a storage loss of a few seconds (1-5) during a storage failover? ->

             

            When a path fails, storage I/O might pause for 30-60 seconds until your host determines that the link is unavailable and performs the failover. If you attempt to display the host, its storage devices, or its adapters, the operation might appear to stall. Virtual machines with their disks installed on the SAN can appear unresponsive. After the failover, I/O resumes normally and the virtual machines continue to run.

             

            Virtual machine I/O might be delayed for up to 60 seconds while path failover takes place. With these delays, the SAN can stabilize its configuration after topology changes. In general, the I/O delays might be longer on active-passive arrays and shorter on active-active arrays.

             

It's ESXi that reacts to the storage loss: as soon as the LUN failover happens on the array, the path error is detected and handled by the SATP plug-in (which depends on your storage array type), the Path Selection Policy picks a surviving path, and the I/Os are retried over it.
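To make that retry behavior concrete, here is a toy sketch in Python (not VMware code; all class and path names are invented for illustration) of how a multipath layer can transparently reissue a guest I/O over a surviving path when the active one dies:

```python
# Hypothetical model of NMP-style path failover; names are illustrative only.

class Path:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

class MultipathDevice:
    """Toy stand-in for ESXi's NMP: one active path, failover on error."""

    def __init__(self, paths):
        self.paths = paths
        self.active = paths[0]

    def issue_io(self, block):
        # Retry over remaining paths before surfacing the error to the VM.
        for _ in range(len(self.paths)):
            if self.active.alive:
                return f"wrote block {block} via {self.active.name}"
            # Active path is dead: fail over to the next live path.
            survivors = [p for p in self.paths if p.alive]
            if not survivors:
                raise IOError("all paths dead: guest will see I/O errors")
            self.active = survivors[0]
        raise IOError("all paths dead: guest will see I/O errors")

dev = MultipathDevice([Path("vmhba64:C0:T0:L0"), Path("vmhba64:C0:T1:L0")])
dev.paths[0].alive = False   # simulate the active path dying mid-workload
print(dev.issue_io(42))      # -> wrote block 42 via vmhba64:C0:T1:L0
```

The guest only notices the failover as added latency (the retry), which is why short failovers look like a brief I/O stall rather than an error, as long as the guest's own disk timeout is longer than the failover window.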

             

You may check the VMware Knowledge Base: "Path Failover and Virtual Machines"

             

             

• Do you know of a white paper that focuses on path status and configuration optimisation on ESXi?

             

To optimize path behavior, you can use the Multipathing Plug-ins (MPPs) provided by the storage array vendor.

             

For EMC-based arrays, there is PowerPath, which replaces the native NMP and SATP.

You can also tweak the PSP settings for optimization.

Check with your storage vendor for recommendations on tuning driver parameters and other HBA settings.
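For the PSP-tweaking part, the usual knobs on ESXi are exposed through esxcli. A typical inspect-and-change sequence looks like this (the naa.* device ID is a placeholder, and Round Robin with iops=1 is only an example policy some vendors recommend; always follow your array vendor's guidance):

```
# List devices with their current SATP and path selection policy
esxcli storage nmp device list

# Show which PSPs are available on this host
esxcli storage nmp psp list

# Example only: set one device to Round Robin (placeholder device ID)
esxcli storage nmp device set --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR

# Example only: switch paths after every I/O, a common vendor recommendation
esxcli storage nmp psp roundrobin deviceconfig set --device naa.xxxxxxxxxxxxxxxx --type iops --iops 1
```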

            1 person found this helpful
• 3. Re: Storage connectivity losses, Consequences at VM Level
              nachogonzalez Expert

I think khiregange's response is the most complete.

               

              Some points on my own:

               

- I've seen Unix/Linux go read-only after more than 60 s without I/O to the storage. Below that, performance deteriorates and IOPS get queued.
- Make sure not to use multipathing inside your VMs; configure it on the hosts instead.

- I've seen another weird issue where a storage array had a LOT of dead paths, which caused the ESXi management services to go down and ended in a PSOD. -> We needed to rescan and restart the management agents. IDK if this is your case, but keep it in mind.
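On the dead-paths point, the recovery steps mentioned (rescan plus restarting the management agents) look roughly like this from the ESXi shell; note that restarting the agents briefly disconnects the host from vCenter, so schedule accordingly:

```
# Rescan all HBAs so ESXi re-evaluates dead/active paths
esxcli storage core adapter rescan --all

# Restart the ESXi management agents (hostd, vpxa)
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
```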

               

               

Regarding the white paper you're requesting, I don't remember one in particular, but every storage vendor (and every type of storage) behaves a little differently.
Can you provide more information about your vSphere version and storage array configuration?

• 4. Re: Storage connectivity losses, Consequences at VM Level
                amnesyak Lurker

                Thanks a lot for your input everyone !