5 Replies Latest reply on May 8, 2019 12:32 PM by gferreyra

    Issue with clustered RDM's and storage outages

    bradyk87 Lurker

      Hi all

       

      We have a number of clusters, each containing about 15 hosts. We also utilise RDMs for Microsoft failover clusters quite heavily in our environment, up to 70 RDMs. Our SAN array is a VNX 7500. All hosts within each cluster are defined in a host group on the array.

      ESXi hosts are Dell M620s, M630s and R730s, running ESXi 5.5 Update 3.

       

      All works well on a day-to-day basis; however, we have been having issues with random clusters experiencing a failure/failover whenever we add a new host to the host group on the SAN array. It appears that when the host is added to the storage group, a storage rescan kicks off automatically (I can see this because the datastores start appearing on the host by themselves). Some time after the host is added to the storage group, sometimes 15 minutes, sometimes up to 5 hours, some of the clusters start failing because the physical disks they use become unavailable. Errors we are seeing in the event log:

       

      Cluster resource 'INST01_Log' of type 'Physical Disk' in clustered role 'SQL Server (clustername\INST01)' failed.

       

      Ownership of cluster disk 'INST02_Data' has been unexpectedly lost by this node. Run the Validate a Configuration wizard to check your storage configuration.

       

      In most cases the cluster will successfully fail over to the passive node. In other instances I need to manually bring the disk resource back online if it hasn't automatically recovered.

       

      The reason it takes so long before an issue is seen is that, as the RDMs are being scanned for the first time, there is a SCSI reservation on them which does not allow them to be read; the scan waits for each device to time out before moving on to the next one. As good practice we flag all of our cluster RDMs as perennially reserved, however it's not possible to do this until the disk has been presented to the host for the first time. If we happen to reboot a host on which the disks haven't been flagged as perennially reserved yet, it can take up to 6 hours for it to start responding.
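      For anyone unfamiliar with the flag, this is roughly how we set it per device with esxcli (the naa ID below is just a placeholder for one of our RDM LUNs):

      # Flag the RDM LUN as perennially reserved so rescans and host boots skip waiting for its SCSI reservation to time out
      esxcli storage core device setconfig -d naa.<rdm_device_id> --perennially-reserved=true

      # Confirm the flag has been applied ("Is Perennially Reserved: true")
      esxcli storage core device list -d naa.<rdm_device_id>

      The flag is per host, so it has to be set on every host in the cluster and again on any newly added host, which is exactly the window where we are getting caught out.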

       

      We logged a job with VMware, however they came back saying that the issue is being caused by the array and that we should contact EMC. I don't necessarily agree with this, as things usually operate fine - it's only when a host is added for the first time and a scan takes place that some sort of lock is put on the RDM which prevents the MSCS cluster from being able to read/write to it. We have seen no issues with the VMFS datastores themselves.

       

      Has anyone else seen this, or does anyone know what could be causing the issue? Should a host performing a scan on an RDM that is in use by an MSCS cluster cause the cluster to fail?

       

      Cheers
      Brady

        • 1. Re: Issue with clustered RDM's and storage outages
          PaulLab3 Novice

          Are you using the EMC multipathing driver or the native one?

           

          Some time ago I had a problem with MS cluster validation with RDMs from an HDS G200.

          The solution for me was to set the Most Recently Used policy for multipathing (VMware native driver).
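          If it helps, I changed the path selection policy per device with esxcli, something like this (the device ID is a placeholder):

          # Check the current path selection policy on the RDM device
          esxcli storage nmp device list -d naa.<rdm_device_id>

          # Switch the device from Round Robin to Most Recently Used
          esxcli storage nmp device set -d naa.<rdm_device_id> --psp VMW_PSP_MRU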

          • 2. Re: Issue with clustered RDM's and storage outages
            bradyk87 Lurker

            We are using the native multipathing driver.

             

            The LUNs are currently set to use Round Robin. I can set the LUNs to use MRU on the existing hosts, however I'm not sure if this will help when a new host is added to the storage group. I'll give it a shot in our test environment regardless and see how things go.

             

            We are also currently playing with the idea of disabling VAAI, as we have seen issues with it in the past. Based on an article we have found, we think the ATS locking primitive could in fact be causing the issue.
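            If we do try it, the rough plan would be to check the VAAI status per device first and then turn off just the ATS locking primitive via the host advanced settings, along these lines (device ID is a placeholder):

            # Show which VAAI primitives (including ATS) are supported for a device
            esxcli storage core device vaai status get -d naa.<device_id>

            # Disable the ATS (hardware accelerated locking) primitive on the host
            esxcli system settings advanced set -o /VMFS3/HardwareAcceleratedLocking -i 0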

            • 3. Re: Issue with clustered RDM's and storage outages
              RAJ_RAJ Expert
              vExpert

              Hi ,

               

              Try spreading the disks across different SCSI controllers, two devices per controller; if a disk is larger than 200 GB, give it a separate SCSI controller. For example:

               

              OS DISK - SCSI 0:0

              MSDTC and QUORUM - SCSI 1:0, SCSI 1:1

              OTHER RDM  - SCSI 2:0

              NEXT  - SCSI 3:0

               

              Also check in EMC whether the owner of the RDM LUN is changing; try to pin it to SPA or SPB. In some cases, if the load on the LUNs increases, ownership can move between SPA and SPB, and at that point it may fail.

              RAJESH RADHAKRISHNAN
              VCA-DCV/WM/Cloud, VCP5-DCV/DT/Cloud, VCP6-DCV, EMCISA, EMCSA, MCTS, MCPS, BCFA
              https://ae.linkedin.com/in/rajesh-radhakrishnan-76269335
              Mark my post as "helpful" or "correct" if I've helped resolve or answered your query!
              • 4. Re: Issue with clustered RDM's and storage outages
                MJNY Lurker

                Hi,

                 

                Have you found a solution to this? We are running into the exact same issue. As soon as we add a new ESXi host to the VMware storage group on the SAN, the MS clustered VMs lose access to their disks and the cluster fails.

                 

                 

                Thank you,

                Mike

                • 5. Re: Issue with clustered RDM's and storage outages
                  gferreyra Novice

                  We have.

                   

                  We experienced the same situation.

                   

                  VMware told us that our ESXi hosts were not up to date. That's all.

                  A bug, fixed in a subsequent patch.

                   

                  Now we have a cluster with a paravirtual SCSI controller and 5 TB of clustered disks.

                  100% functional.

                   

                  Cheers!