10 Replies Latest reply on Dec 2, 2017 1:04 AM by srodenburg

    vSAN disk groups failure once a month

    pineapplehead Lurker

      We deployed 8 Dell R730xd hosts equipped with the H730 Mini controller at the end of last year. All hosts are running vSphere 6 build 4600944. Each host has 2 vSAN disk groups in a hybrid configuration. Each disk group has 1 Intel SSDSC2BX40 (400GB) and 6 Toshiba AL14SEB060N (600GB).

       

      The firmware on all devices has been on the vSAN HCL since the deployment. Since March of this year, the vSAN has been having issues. vSAN reports “Flash drives dead or error” on one of the hosts, and the 2 disk groups on that host drop out of the vSAN. iDRAC shows all disks as fine, but its log shows that all disks were reset when the issue occurred. When this happens, we have to reboot the server to get the disk groups on that host working again. However, within the next day or so, the same problem occurs on another host, one after another. After a host is rebooted, it works fine for about 1 month, then the cycle repeats.

       

      We have worked with VMware and Dell support for months on this issue. Early on, we replaced the controllers and backplanes on some of the hosts. That didn’t help. When newer firmware came out for the controller (25.5.2.0001) and backplane (3.32), we upgraded. That didn’t help either. We have purchased additional drives for each host and planned to expand the vSAN capacity. However, support recommended not making any changes while they are still trying to figure out the issue. We had also planned to purchase additional servers to build other vSAN clusters. Because of this issue, those projects are on hold. It feels like we’re stuck. All we can do is wait for support to come up with new ideas. Every time they find something (e.g. a new firmware release) and we make a change, we have to wait a month to see whether the problem comes back. This has been dragging on for months. Does anyone have any comments or suggestions?

        • 1. Re: vSAN disk groups failure once a month
          Sureshkumar M Expert

          The first step is to isolate the issue: is it due to ESXi, the driver, or the firmware? Mostly, this kind of problem is a driver/firmware issue that the hardware vendor needs to work on.

           

          From your update I understand they have recommended trying different firmware/driver versions by trial and error. Have they given any update on where the issue lies?

           

          What does the driver/firmware dump say? If this issue occurs every month, does it follow some pattern? If so, it could be caused by scheduled jobs generating more load than the driver/firmware or hardware can handle.

          • 2. Re: vSAN disk groups failure once a month
            pineapplehead Lurker

            Thank you for the reply, Suresh. When we deployed the system at the end of last year, the firmware/driver on all devices (controller, SSD, etc.) was up to date and on the vSAN HCL at that time. After the issue started occurring months ago, new firmware/drivers came out (e.g. Apr. 2017) and the vSAN HCL was updated. What we did was upgrade the firmware/drivers to the versions on the most current vSAN HCL.

             

            We have asked a few times to replace the controller/SSDs with a different model to help isolate the problem, but neither support team is interested in going that route. The latest update I got is from VMware support: the VMware engineering team is working with the Dell team to investigate further, but there is no ETA for this investigation.

             

            I have been keeping track since the end of May, but I can’t see a pattern.

                                                                                                                                                 

            Date        Server  vSAN status                          Task performed
            5/24/2017   h       Reported Flash drives dead or error  replaced controller and backplane
            6/7/2017    a       Reported Flash drives dead or error  replaced controller and backplane
            6/11/2017   g       Reported Flash drives dead or error  rebooted the server
            6/14/2017   d       Reported Flash drives dead or error  rebooted the server
            6/15/2017   e       Reported Flash drives dead or error  rebooted the server
            6/20/2017   f       Reported Flash drives dead or error  rebooted the server
            6/25/2017   c       Reported Flash drives dead or error  rebooted the server
            7/6/2017    ALL     -                                    Upgraded SSD firmware to DL2D
            7/14/2017   a       -                                    Upgraded backplane firmware to 3.32
            7/14/2017   g       -                                    Upgraded backplane firmware to 3.32
            8/6/2017    d       Reported Flash drives dead or error  rebooted the server, upgraded backplane firmware to 3.32
            8/7/2017    a       Reported Flash drives dead or error  rebooted the server
            8/10/2017   f       Reported Flash drives dead or error  rebooted the server, upgraded backplane firmware to 3.32
            8/11/2017   e       Reported Flash drives dead or error  rebooted the server, upgraded backplane firmware to 3.32
            8/12/2017   g       Reported Flash drives dead or error  rebooted the server
            8/12/2017   c       Reported Flash drives dead or error  rebooted the server, upgraded backplane firmware to 3.32
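
            One way to look for a pattern in the dates logged above is to compute, per host, the number of days between consecutive failure reports. The following is only a rough sketch: the dates and host letters are transcribed from the log above, while the names `FAILURES` and `days_between_failures` are made up for illustration.

```python
from datetime import date

# Failure reports transcribed from the log above: (date, host letter).
# Firmware-upgrade-only entries (7/6 and 7/14) are left out on purpose.
FAILURES = [
    (date(2017, 5, 24), "h"),
    (date(2017, 6, 7), "a"),
    (date(2017, 6, 11), "g"),
    (date(2017, 6, 14), "d"),
    (date(2017, 6, 15), "e"),
    (date(2017, 6, 20), "f"),
    (date(2017, 6, 25), "c"),
    (date(2017, 8, 6), "d"),
    (date(2017, 8, 7), "a"),
    (date(2017, 8, 10), "f"),
    (date(2017, 8, 11), "e"),
    (date(2017, 8, 12), "g"),
    (date(2017, 8, 12), "c"),
]


def days_between_failures(failures):
    """Return {host: [days between consecutive failures on that host]}."""
    by_host = {}
    for day, host in sorted(failures):
        by_host.setdefault(host, []).append(day)
    return {
        host: [(later - earlier).days for earlier, later in zip(days, days[1:])]
        for host, days in by_host.items()
    }


if __name__ == "__main__":
    for host, gaps in sorted(days_between_failures(FAILURES).items()):
        print(host, gaps)
```

            For these dates the failure-to-failure gaps on a given host come out at 48 to 62 days. Measured from the 7/6 SSD firmware upgrade instead (assuming, which the log does not confirm, that the upgrade restarted every host), the August failures fall roughly 31 to 37 days later.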

            • 3. Re: vSAN disk groups failure once a month
              Sureshkumar M Expert

              Very sad to see the issue is so frequent. Since it affects almost all of the servers, it is most likely a driver/firmware problem. It is now up to the vendors to determine the cause, as most of the options have already been tried, apart from completely removing and re-adding the cluster. You will have to wait until they come back. They should be able to find the cause with a debugging tool, but I am not sure whether they have asked you to run a debug build or tried some other means so far.

              • 4. Re: vSAN disk groups failure once a month
                TheBobkin Master
                vExpert, VMware Employees

                Hello pineapplehead,

                 

                 

                Sorry to hear you're having a bad run with vSAN.

                 

                As both disk-groups go at the same time, it is almost certainly a controller driver/firmware issue. Both disk-groups are on a single controller, yes? Are any non-vSAN disks (boot OR log/dump etc.) on the same controller?

                If VMware Engineering are engaged then any common issue or simple explanation has likely been ruled out - this may be a case of perfect-storm of components and scenario that happens to trigger the issue.

                 

                Can you share vmkernel.log and vobd.log from a host running currently? And if possible, the same logs from a period when the issue occurred. Also can you share or PM me the SR number (no promises that I can look or assist for various reasons but I do want to read any related PRs).

                 

                 

                Bob

                • 5. Re: vSAN disk groups failure once a month
                  pineapplehead Lurker

                  Hi Bob.

                  I was in the vSphere client twice when the issue just occurred. What I have seen is one disk group would fail first. Then, ~15 minutes later, the 2nd disk group would fail.

                  Yes, both disk groups are on the same controller. There is only one controller in each host. There is no non-vSAN disk on the controller. The boot disk is on its own dedicated SD media flash disk.

                  I have attached the logs. The SR# is 17488918406. Any input is appreciated.

                  Thank you.

                  • 6. Re: vSAN disk groups failure once a month
                    timalexanderINV Novice

                    We have been experiencing a very similar issue on vSAN 6.2 with Dell FX2 (the FD332 is essentially a PERC 730).  Dell eventually seemed to understand that it is a firmware bug, and their suggested workaround was to swap the controller out for the PERC330 (which you can't do on the FX2 platform).  One thing that has helped is increasing the timeout value (suggested by Dell support but not strictly recommended by VMware, from what I understand):

                     

                    esxcfg-advcfg -s 10000 /VMFS3/OptLockReadTimeout

                     

                    We could ascertain that the fault was the controller by looking at the log stream from the DCUI.  You may well see similar entries along the lines of "lsi_mr3: controller firmware in critical state" and multiple naa disk resets.  My understanding is that the latest driver does not have the same issue, even on the current HCL firmware, but it is not yet certified and therefore not on the HCL.  We are currently limping along, waiting for the HCL update/certification so we can hopefully get some stability back.
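
                    If you have copied vmkernel.log off a host, a small script can pull out the lines that mention the phrase quoted above. This is a minimal sketch meant to run on a workstation: `find_controller_errors`, `scan_file` and `DEFAULT_PATTERNS` are made-up names, the exact wording of the messages in your logs may differ, and the matching is a plain case-insensitive substring search.

```python
import re

# Phrase quoted in the post above. Add further strings (for example the
# exact wording of the disk reset messages in your own logs) as needed.
DEFAULT_PATTERNS = ("lsi_mr3: controller firmware in critical state",)


def find_controller_errors(lines, patterns=DEFAULT_PATTERNS):
    """Return (line_number, text) for each line containing any pattern."""
    regex = re.compile("|".join(re.escape(p) for p in patterns), re.IGNORECASE)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if regex.search(line)
    ]


def scan_file(path):
    """Scan a log file copied from a host; undecodable bytes are replaced."""
    with open(path, errors="replace") as log:
        return find_controller_errors(log)
```

                    For example, `scan_file("vmkernel.log")` returns the matching lines with their line numbers, which makes it easier to line up controller messages against the time a disk group dropped out.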

                    • 7. Re: vSAN disk groups failure once a month
                      pineapplehead Lurker

                      Hi timalexanderINV,

                       

                      Thank you for sharing the information. It’s good to know that we’re not alone here. Since my last post, 2 more hosts have experienced this issue. VMware support did mention changing a setting on the ESXi side at one point, but later decided not to do it pending further investigation.  I'm not sure if it’s the timeout value that you mentioned.

                       

                      I will keep an eye on the vSAN HCL. I will also keep you posted if I get any updates from the support.

                      • 8. Re: vSAN disk groups failure once a month
                        mhaberman Lurker

                        Any resolution? We are having the same issue.

                        • 9. Re: vSAN disk groups failure once a month
                          mckenp Lurker

                          We had exactly the same issue across our H730p-equipped R730s.

                           

                          Despite all firmware being fully up to date and compliant with the HCL, we could almost guarantee that a disk group would crap out at some point soon (in the end we built the architecture to withstand these failures, which is painful!).

                           

                          VMware tickets were helpful insofar as checking that the values in this article (VMware Knowledge Base) were correct, and they assisted in checking data parity across the member nodes.

                           

                          Beyond that, we ended up rebuilding the vSAN cluster several times in horror.

                           

                          I am disappointed that VMware has not discovered a resolution to this, which existed in 6.0 and persists today afaik. I have since moved on from that infrastructure.

                          • 10. Re: vSAN disk groups failure once a month
                            srodenburg Enthusiast

                            This has all the symptoms of a controller and controller-firmware problem. We run Supermicro hosts with vSAN 6.6.1 Hybrid and two LSI 9207-8i cards in each. If I put the HCL-recommended Firmware 20 on those cards, we are totally screwed, just like you.

                            We went back to Firmware 19 (downgraded all cards) and have had 0 issues since then. So much for HCL firmware recommendations, eh.