10 Replies Latest reply on Jul 4, 2017 12:39 AM by GaelV

    ESXi 5.5U3 Storage Failure - troubleshooting advice is needed

    GaelV Enthusiast

      Hello,

       

      I'm currently working with six ESXi 5.5 U3 hosts, build 4722766 (on HP DL360 Gen9 servers).

      All of them are connected over Fibre Channel, through a fibre switch, to an HP 3PAR disk array where all the volumes are stored.

       

      Last night one of the ESXi hosts locked up; its log files have been repeating these lines ever since:

       

      2017-06-22T20:39:05.656Z: <DATASTOREXXXX, 0> Write error to fd 9

      2017-06-22T20:40:11.217Z: <DATASTOREXXXX, 0> I/Os from datastore naa.60002ac0000000000000016f00009893 took 60.849006(>= 30.000000) seconds to complete stats computation. Reducing its polling frequency.

      2017-06-22T20:41:37.453Z: WARNING: Write 0xff98b66f[512] -> 18 failed. Resource temporarily unavailable

      2017-06-22T20:41:37.453Z: <DATASTOREXXXX, 0> Write error to fd 18

      2017-06-22T20:41:37.453Z: <DATASTOREXXXX, 0> I/Os from datastore naa.50002ac1a0009893 took 86.234590(>= 30.000000) seconds to complete stats computation. Reducing its polling frequency.

      2017-06-22T20:41:37.454Z: WARNING: Write 0xff98b66f[512] -> 9 failed. Resource temporarily unavailable

      2017-06-22T20:41:37.454Z: <DATASTOREXXXX, 0> Write error to fd 9

      2017-06-22T20:41:37.454Z: WARNING: Write 0xff98b66f[512] -> 5 failed. Resource temporarily unavailable

      2017-06-22T20:41:37.454Z: <DATASTOREXXXX5, 0> Write error to fd 5


      Because of this issue the VMs became unstable; we had to migrate them to another host, and they are running fine now.

       

      I've checked the volumes on my HP 3PAR array: no alerts, no alarms, everything looks good, and the same goes for the fibre switch.

       

      By the way, I also get a SIOC alarm even though SIOC is not enabled on the datastore.

      Does someone have an idea or troubleshooting advice?

       

      I mean, the configuration hasn't been changed, and it's really weird that this one ESXi misbehaves while the others (which share the same storage configuration and ESXi version) work fine.

       

       

      Thank you for reading

       

      Gael

        • 1. Re: ESXi 5.5U3 Storage Failure - troubleshooting advice is needed
          dekoshal Hot Shot
          vExpert

          First, identify all the datastores that are reporting "unmanaged workload detected".

          Check /var/log/vobd.log for "Detected a non vi workload on datastore" followed by the friendly name of the affected VMFS datastore,

          or run the command below to get the name of the problem datastore:

          esxcfg-scsidevs -m | egrep -i <LUN ID>
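
          For example, with one of the NAA IDs from your log (substitute your own):

          grep -i "non vi workload" /var/log/vobd.log
          esxcfg-scsidevs -m | egrep -i naa.60002ac0000000000000016f00009893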

           

          Use the flow chart below (from VMware KB 1020651) to narrow down the issue and follow the action plan.

           

          [Attached image: troubleshooting-sioc-flowchart-kb-1020651.png, the SIOC troubleshooting flow chart from VMware KB 1020651]

           

           

          Additionally, you may attach vobd.log and vmkernel.log here for further review.

           

          If you found this or any other answer helpful, please consider marking it Helpful to award points.

           

          Best Regards,

          Deepak Koshal

          CNE|CLA|CWMA|VCP4|VCP5|CCAH

          • 2. Re: ESXi 5.5U3 Storage Failure - troubleshooting advice is needed
            GaelV Enthusiast

            Hi dekoshal,

             

            Thank you for your quick answer.

             

            I've identified the VMFS datastore referred to in "Detected a non vi workload on datastore xxx".

             

            I saw the flow chart in a VMware KB, but the point is that we use Adaptive Queuing (the QFullSampleSize and QFullThreshold parameters, set for each datastore) instead of SIOC; all of our ESXi hosts work that way without any issue.

             

            These parameters are set in line with the HP 3PAR recommended values.
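
            For reference, this is roughly how we verify them on each device (the NAA ID below is just one from our logs):

            esxcli storage core device list -d naa.50002ac1a0009893 | grep -i qfull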

             

            For me, there's a difference between SIOC and Adaptive Queuing, don't you think?

             

            Regards,

             

             

            • 3. Re: ESXi 5.5U3 Storage Failure - troubleshooting advice is needed
              dekoshal Hot Shot
              vExpert

              Yes, there is a difference between SIOC and Adaptive Queuing in how they manage queue depth on an ESXi host. SIOC takes VM prioritization into consideration, in terms of the assigned share values, regardless of which host the VMs reside on, whereas Adaptive Queuing is not that smart and works at the storage device (LUN) level: when a queue-full condition arises, the queue depth of the device in question is halved, and once the congestion clears it is increased back incrementally.

               

              Controlling LUN queue depth throttling in VMware ESX/ESXi (1008113) | VMware KB
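
              As a sketch of what that KB describes, the per-device setting looks like this (the device ID is illustrative; 32/4 are the values 3PAR usually prescribes):

              esxcli storage core device set --device naa.50002ac1a0009893 --queue-full-sample-size 32 --queue-full-threshold 4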

               

              If you found this or any other answer helpful, please consider marking it Helpful to award points.

               

              Best Regards,

              Deepak Koshal

              CNE|CLA|CWMA|VCP4|VCP5|CCAH

               

              • 4. Re: ESXi 5.5U3 Storage Failure - troubleshooting advice is needed
                dekoshal Hot Shot
                vExpert

                It would be interesting to see whether you have received errors similar to the log snippet below in vmkernel.log:

                 

                2016-12-28T07:54:18.215Z cpu34:39468718)lpfc: lpfc_scsi_cmd_iocb_cmpl:2185: 0:(0):3271: FCP cmd x89 failed <1/125> sid xbf000c, did xbf0005, oxid x69 iotag x406 Abort Requested Host Abort Req

                2016-12-28T07:54:18.215Z cpu34:39468718)lpfc: lpfc_scsi_cmd_iocb_cmpl:2185: 0:(0):3271: FCP cmd x28 failed <1/125> sid xbf000c, did xbf0005, oxid x2cd iotag x66a SCSI Queue Full -

                2016-12-28T07:54:18.266Z cpu34:38708490)lpfc: lpfc_scsi_cmd_iocb_cmpl:2174: 0:(0):0711 detected lun queue full adjust qdepth to 28

                2016-12-28T07:54:18.266Z cpu34:38708490)lpfc: lpfc_scsi_cmd_iocb_cmpl:2185: 0:(0):3271: FCP cmd x28 failed <1/125> sid xbf000c, did xbf0005, oxid x2d4 iotag x671 SCSI Queue Full -

                2016-12-28T07:54:18.315Z cpu17:32209598)lpfc: lpfc_scsi_cmd_iocb_cmpl:2185: 0:(0):3271: FCP cmd x2a failed <1/131> sid xbf000c, did xbf0005, oxid x1d8 iotag x575 Abort Requested Host Abort Req

                2016-12-28T07:54:18.315Z cpu17:32209598)lpfc: lpfc_scsi_cmd_iocb_cmpl:2185: 0:(0):3271: FCP cmd x2a failed <1/131> sid xbf000c, did xbf0005, oxid x2a3 iotag x640 Abort Requested Host Abort Req

                2016-12-28T07:59:32.249Z cpu35:38708489)lpfc: lpfc_scsi_cmd_iocb_cmpl:2185: 1:(0):3271: FCP cmd x89 failed <0/225> sid x9c000c, did x9c0005, oxid x734 iotag x6d1 SCSI Chk Cond - 0xe: Data(x2:xe:x1d:x0)

                2016-12-28T08:26:59.860Z cpu1:32822)lpfc: lpfc_scsi_cmd_iocb_cmpl:2185: 0:(0):3271: FCP cmd x89 failed <1/206> sid xbf000c, did xbf0005, oxid x2b9 iotag x656 SCSI Chk Cond - 0xe: Data(x2:xe:x1d:x0)

                 

                Best Regards,

                Deepak Koshal

                • 5. Re: ESXi 5.5U3 Storage Failure - troubleshooting advice is needed
                  GaelV Enthusiast

                  Thanks for the explanation. We use 32 and 4 as the default values, as HP 3PAR prescribes.

                   

                  I'm going to check the logs, but I don't think I've ever seen lines like those.

                   

                  Edit: I didn't find the log lines in question; the nearest ones look like this:

                   

                  2017-06-23T01:52:30.236Z cpu9:34176)Fil3: 15438: Max retries (10) exceeded for caller Fil3_FileIO (status 'IO was aborted by VMFS via a virt-reset on the device')

                  2017-06-23T01:52:30.236Z cpu9:34176)BC: 2288: Failed to write (uncached) object '.iormstats.sf': Maximum kernel-level retries exceeded

                   

                   

                   

                  BR

                   

                  Gael

                  • 6. Re: ESXi 5.5U3 Storage Failure - troubleshooting advice is needed
                    dekoshal Hot Shot
                    vExpert

                    In the logs, look for SCSI or other storage error codes; they will help you narrow down the issue and indicate why the host needed to stop the I/O:

                    Failed write command to write-quiesced partition in VMware ESX 4.x and ESXi 4.x/5.0/5.1/5.5 (2009482) | VMware KB

                     

                    If you find a SCSI error and want to decode it, use the link below:

                    VMware ESXi SCSI Sense Code Decoder | Virten.net
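
                    For example, a vmkernel line of the following shape (illustrative, not taken from these logs) is decoded through its H:/D:/P: fields, H: being the host status, D: the device status and P: the plugin status:

                    ScsiDeviceIO: Cmd 0x2a ... to dev "naa.xxx" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0

                    Here H:0x8 (RESET) would point at the HBA driver aborting the I/O, while H:0x3 would be a TIME_OUT.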

                     

                    If you found this or any other answer helpful, please consider marking it Helpful to award points.

                     

                    Best Regards,

                    Deepak Koshal

                    CNE|CLA|CWMA|VCP4|VCP5|CCAH

                    • 7. Re: ESXi 5.5U3 Storage Failure - troubleshooting advice is needed
                      dekoshal Hot Shot
                      vExpert

                      Some more info on queue depth.

                       

                      Queue Depth basics

                       

                      1. The queue depth is per-LUN, not per initiator (HBA).

                      2. The queue depth:

                      a. Decides how many commands can be active at one time against a given LUN.

                      b. Allows multiple virtual machines to share a single resource.

                      c. Allows applications to have multiple active ("in-flight") I/O requests on a LUN at the same time, which provides concurrency and improves performance.
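
                      A quick way to see the current queue depth of a device (a sketch; substitute your device ID):

                      esxcli storage core device list -d naa.50002ac1a0009893 | grep -i "Queue Depth"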

                       

                       

                      Queue depth recommendations

                       

                      1. Reduce queuing on the ESX host:

                      2. The sum of active commands from all VMs should not consistently exceed the LUN queue depth.

                      3. Move some of the latency-sensitive VMs to a different VMFS volume (LUN).

                      4. The Disk.SchedNumReqOutstanding parameter should have the same value as the queue depth.

                      Note that the per-host Disk.SchedNumReqOutstanding parameter is deprecated in vSphere 5.5; the setting is now per device/LUN (see the command sketch after this list). For more information, see the Solution section in the article below:

                      https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1268

                      5. Reduce queuing on the storage array.
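
                      As a sketch of that per-device setting on 5.5 (device ID and value illustrative):

                      esxcli storage core device set -d naa.50002ac1a0009893 --sched-num-req-outstanding 32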

                       

                      The reference doc below is a bit old but a good read:

                      https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/scalable_storage_performance.pdf

                       

                       

                      Queue depth tips:

                      If there is a high level of consolidation (VMs-per-LUN ratio) or there are very intensive storage workloads in the environment, some of the vSphere queues may need to be adjusted.

                      Investigation of queuing is required at all levels of the storage stack, from the application and guest OS down to the storage array.

                      esxtop or the performance charts can be used to check whether a device/LUN queue is reporting 100% active or full.
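
                      For instance, in esxtop (a quick interactive sketch):

                      esxtop
                      # press 'u' for the disk-device view, then watch DQLEN (queue depth),
                      # ACTV (active I/Os), QUED (queued I/Os) and %USD (queue utilization)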

                       

                      If you found this or any other answer helpful, please consider marking it Helpful to award points.

                       

                      Best Regards,

                      Deepak Koshal

                      CNE|CLA|CWMA|VCP4|VCP5|CCAH

                      1 person found this helpful
                      • 8. Re: ESXi 5.5U3 Storage Failure - troubleshooting advice is needed
                        GaelV Enthusiast

                        Hi dekoshal,

                         

                        Thanks for your great, detailed answer, and sorry for the delay.

                         

                        I've checked the "No of outstanding IOs with competing worlds" parameter: they're all set to 32 (so, per device).
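
                        For the record, I listed them with a one-liner along these lines:

                        esxcli storage core device list | grep -i "competing worlds"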

                         

                        I'm currently reading the scalable storage performance doc; I'll let you know what I find about my issue.

                         

                        By the way, with the sense code decoder I found two interesting explanations:

                         

                         

                        Host Status [0x3] TIME_OUT: this status is returned when the command in-flight to the array times out.

                        Host Status [0x8] RESET: this status is returned when the HBA driver has aborted the I/O. It can also occur if the HBA does a reset of the target.

                         

                        So I think we can confirm there's a relation to the latency we saw in the logs.

                        • 9. Re: ESXi 5.5U3 Storage Failure - troubleshooting advice is needed
                          dekoshal Hot Shot
                          vExpert

                          Hi GaelV,

                           

                          Sounds good. This will require an end-to-end examination to narrow down and identify the issue. I have listed some points to consider while troubleshooting I/O issues.

                           

                          0x3 and 0x8 can happen for many reasons.

                           

                          1. You can start by checking the HBA driver version and updating it to the latest if required.

                          To check the HBA driver, follow the instructions below (a command sketch follows this list).

                          a. Run the command esxcfg-scsidevs -a

                          b. Identify the HBA used for the problem LUNs.

                          c. Note down the driver name of the corresponding HBA.

                          d. Run the command vmkload_mod -s <HBA driver name> | grep Version

                          This should display the driver version for your HBA.

                          e. Go to the VMware Compatibility Guide - System Search and check for an updated HBA driver applicable to the ESXi version running in your environment.

                          2. Check for any hardware issue on the storage array side, such as an SP port issue.

                          3. Check the placement of the VMs to ensure efficient I/O performance.

                          4. Check for any connection issue between the ESXi host and the storage.

                          5. Check whether the jumbo frame settings have been modified from the default.
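
                          For example, with the QLogic native driver (the module name comes from step c; use whatever yours is):

                          esxcfg-scsidevs -a                          # lists HBAs and the driver each one uses
                          vmkload_mod -s qlnativefc | grep Version    # shows the loaded driver's version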

                           

                          If you found this or any other answer helpful, please consider marking it Correct or Helpful to award points.

                           

                          Best Regards,

                          Deepak Koshal

                          CNE|CLA|CWMA|VCP4|VCP5|CCAH

                          • 10. Re: ESXi 5.5U3 Storage Failure - troubleshooting advice is needed
                            GaelV Enthusiast

                            Hi,

                             

                            So I've checked many parameters and stats on the storage array side; it is well configured, so my problem really lies in the ESXi configuration.

                             

                            I followed your advice to find the driver version and compared it with some of our other servers: the driver version was different.

                             

                            The current version was qlnativefc 8.01.02; on the other servers we have qlnativefc 8.02.00.

                             

                            That's an interesting lead; I'll update the driver to 8.02.00 and then see whether this was the main issue or not.
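
                            For the record, roughly how I plan to check and apply it (the VIB path below is a placeholder for the package from the HPE/QLogic download):

                            esxcli software vib list | grep qlnativefc
                            esxcli software vib update -v /tmp/qlnativefc-8.02.00.vib
                            # then reboot the host so the new driver loads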