5 Replies Latest reply on Jul 12, 2020 5:19 AM by TheBobkin

    VSAN 6.0 - VMs freeze after host failure

    LR_Analyst Lurker

      Hi everyone!

       

      We are facing a problem: when a host in a VSAN cluster fails, the virtual machines simply freeze. Linux guests typically remount their filesystems read-only, and Windows guests stop responding.

      Below is some information about the environment:

       

      • 2 x Clusters (1 x 10 hosts / 2 x 20 hosts);
      • Host: HP DL380 Gen 9;
      • Storage Controller: HP Smart HBA H240ar;
      • 4 x SSD Disk 400GB eMLC + 20 x SAS Disk 1.2 TB, per host;
      • 2 x 10Gb (LACP + MTU 9000 + Multicast) to VSAN + 2 x 10Gb (LACP) to LAN, per host;
      • VMware ESXi, 6.0.0, 2715440

       

      Any help?

       

      Thanks,

      CH

        • 1. Re: VSAN 6.0 - VMs freeze after host failure
          elerium Hot Shot
          vExpert

          If you are talking about VMs that were running on the failed host freezing, that is to be expected on any host that fails.

           

          If VMs running on hosts other than the failed one are stalling or stopping, that would be an issue. I'm also assuming you're using FTT=1 or the default policy; if you are on FTT=0, the VMs using FTT=0 disks may stop responding.
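
          To illustrate why FTT matters here, below is a minimal Python sketch of the availability rule for a mirrored VSAN object. It is a simplified model, not VMware code: `object_accessible` and the host numbering are hypothetical, and it assumes the usual mirroring layout of FTT+1 replicas plus FTT witness components, one component per host, one vote each.

```python
def object_accessible(ftt, failed_hosts):
    """Simplified model: is a mirrored object still accessible?

    Assumption: replicas live on hosts 0..ftt, witnesses on the next
    ftt hosts, and every component carries exactly one vote.
    """
    replica_hosts = list(range(ftt + 1))
    witness_hosts = list(range(ftt + 1, 2 * ftt + 1))
    all_hosts = replica_hosts + witness_hosts

    # Votes that are still reachable after the failures.
    alive_votes = sum(1 for host in all_hosts if host not in failed_hosts)
    # At least one full copy of the data must survive.
    replica_alive = any(host not in failed_hosts for host in replica_hosts)

    # Accessible only with a surviving replica and a strict majority of votes.
    return replica_alive and alive_votes * 2 > len(all_hosts)


print(object_accessible(1, {0}))     # True: one host lost, FTT=1
print(object_accessible(1, {0, 1}))  # False: both replicas lost
print(object_accessible(0, {0}))     # False: FTT=0 has a single copy
```

          In this model an FTT=1 object rides out one host failure, while an FTT=0 object becomes inaccessible as soon as the host holding its only copy fails.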

          • 2. Re: VSAN 6.0 - VMs freeze after host failure
            LR_Analyst Lurker

            Hi, thanks for the help! I forgot some information:

             

            FTT=1

            VMs are freezing on other hosts too (VMs whose disks are served over VSAN from the affected host)

             

            This recurs often because of an issue with the H240ar: the controller stops working and the host loses access to its VSAN disks, but the VMs running on that host keep running. Because VSAN distributes the storage workload across hosts, when one host fails (controller failure), VMs on other hosts freeze too.

             

             

            Thanks!

            CH

            • 3. Re: VSAN 6.0 - VMs freeze after host failure
              elerium Hot Shot
              vExpert

              My best guess is that the disk resync load after the host failure may be causing the VSAN to stall due to high latency. If it occurs again, you would need to use VSAN Observer or esxtop on your hosts to check disk/network latency and confirm this. If you open a support case, VMware might also be able to figure out what happened from your support log bundle.
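
              As a rough illustration of that latency check, here is a small Python sketch that scans a CSV capture (for example one taken with esxtop in batch mode, `esxtop -b`) and reports counters whose peak exceeds a threshold. The function name, the `MilliSec/Command` substring used to pick columns, and the 30 ms threshold are all assumptions for illustration; check the column names in your own capture before relying on it.

```python
import csv
import io


def high_latency_columns(csv_text, match="MilliSec/Command", threshold_ms=30.0):
    """Return {column name: peak value} for latency columns above threshold.

    Assumption: the first row is a header and latency counters can be
    recognised by the `match` substring in their column name.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    peaks = {}
    for row in reader:
        for name, cell in zip(header, row):
            if match not in name:
                continue
            try:
                value = float(cell)
            except ValueError:
                continue  # skip empty or non-numeric cells
            if value > peaks.get(name, 0.0):
                peaks[name] = value
    return {name: peak for name, peak in peaks.items() if peak > threshold_ms}


# Hypothetical two-sample capture with two latency counters.
sample = (
    "time,diskA MilliSec/Command,diskB MilliSec/Command\n"
    "t1,2.0,45.5\n"
    "t2,3.1,12.0\n"
)
print(high_latency_columns(sample))
```

              Comparing the flagged counters from before and during a resync would help confirm or rule out the resync-load theory.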

              • 4. Re: VSAN 6.0 - VMs freeze after host failure
                velocity08 Novice

                Hi CH

                 

                we have the same setup HPE dl360’s, same hba and have been seeing this behaviour for years with still no fix from VMware.

                 

                Host that’s stuck needs a hard power cycle to release the VMs so they can vMotion to other hosts.

                 

                We are on the latest 6.5, and the issue has been the same across ESXi versions.

                 

                Did you ever get an answer on this issue or a fix?

                Cheers

                G

                • 5. Re: VSAN 6.0 - VMs freeze after host failure
                  TheBobkin Virtuoso
                  vExpert, VMware Employees

                  Hello velocity08,

                   

                  Firstly, I wouldn't advise necroing a 5-year-old thread that doesn't appear to have any real technical information or insightful analysis.

                   

                  You say "we have the same setup HPE dl360’s, same hba and have been seeing this behaviour for years with still no fix from VMware.", but if your controller is broken/dropping disks/driver and/or firmware losing the plot then how/why would you expect this to get fixed from the VMware side?

                   

                  Starting point should be to provide some useful information:

                  - Exact controller model in use (e.g. IDs not just name).

                  - Current driver and firmware configured for this controller.

                  - Model of Cache and Capacity-tier disks and the firmware installed on them.

                  - Whether you have other potentially problematic elements configured (e.g. logging to vsanDatastore, mixing VMFS used for VMs and vSAN disks on the same controller, or a mix of RAID0 and passthrough devices on the vSAN controller).

                  - Your findings and analysis so far from logs prior to/during the issue occurring - merely stating "Host that’s stuck" is unlikely to get you anywhere.

                   

                  Bob