VMware Cloud Community
GaelV
Enthusiast

ESXi 5.5 U3 storage failure - troubleshooting advice needed

Hello,

I'm currently working with six ESXi 5.5 U3 hosts, build 4722766 (HP DL360 Gen9 servers).

All of them are connected over Fibre Channel, through an FC switch, to an HP 3PAR array where all the volumes are stored.

Last night one of the ESXi hosts locked up; its log files have been repeating these lines ever since:

2017-06-22T20:39:05.656Z: <DATASTOREXXXX, 0> Write error to fd 9

2017-06-22T20:40:11.217Z: <DATASTOREXXXX, 0> I/Os from datastore naa.60002ac0000000000000016f00009893 took 60.849006(>= 30.000000) seconds to complete stats computation. Reducing its polling frequency.

2017-06-22T20:41:37.453Z: WARNING: Write 0xff98b66f[512] -> 18 failed. Resource temporarily unavailable

2017-06-22T20:41:37.453Z: <DATASTOREXXXX, 0> Write error to fd 18

2017-06-22T20:41:37.453Z: <DATASTOREXXXX, 0> I/Os from datastore naa.50002ac1a0009893 took 86.234590(>= 30.000000) seconds to complete stats computation. Reducing its polling frequency.

2017-06-22T20:41:37.454Z: WARNING: Write 0xff98b66f[512] -> 9 failed. Resource temporarily unavailable

2017-06-22T20:41:37.454Z: <DATASTOREXXXX, 0> Write error to fd 9

2017-06-22T20:41:37.454Z: WARNING: Write 0xff98b66f[512] -> 5 failed. Resource temporarily unavailable

2017-06-22T20:41:37.454Z: <DATASTOREXXXX5, 0> Write error to fd 5

Because of this issue the VMs became unstable; we migrated them to another host and they are working fine now.

I've checked the volumes on the HP 3PAR array: no alerts, no alarms, everything looks good, and the same goes for the FC switch.

By the way, I also get a SIOC alarm even though SIOC is not enabled on the datastore:

(attached screenshot of the SIOC alarm: pastedImage_4.png)

Does anyone have an idea or any troubleshooting advice?

The configuration hasn't been changed, and it's really odd that only this ESXi host misbehaves while the others, which share the same storage configuration and run the same ESXi version, work fine.

Thank you for reading!

Gael

10 Replies
dekoshal
Hot Shot

First, identify all the datastores that are reporting an unmanaged workload.

Check /var/log/vobd.log for "Detected a non vi workload on datastore", followed by the friendly name of the affected VMFS datastore.

Alternatively, run the command below to map a LUN ID back to the name of the problem datastore:

esxcfg-scsidevs -m | egrep -i <LUN ID>
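For instance, something along these lines (using the NAA ID from the first post; output formats can differ slightly between builds) should surface the affected datastore:

# SIOC "non-VI workload" events logged by the host
grep -i "non vi workload" /var/log/vobd.log
# Map the NAA ID reported in the logs back to the VMFS datastore name
esxcfg-scsidevs -m | egrep -i naa.60002ac0000000000000016f00009893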

Use the flow chart from VMware KB 1020651 (attached below) to narrow down the issue and follow the action plan.

(attached flow chart: troubleshooting-sioc-flowchart-kb-1020651.png)

Additionally, you may attach vobd.log and vmkernel.log here for further review.

If you found this or any other answer helpful, please consider using the Helpful button to award points.

Best Regards,

Deepak Koshal

CNE|CLA|CWMA|VCP4|VCP5|CCAH

GaelV
Enthusiast

Hi dekoshal,

Thank you for your quick answer.

I've checked the VMFS datastore that the "Detected a non vi workload on datastore xxx" message refers to.

I saw the flow chart in a VMware KB, but the thing is that we use adaptive queuing (the QFullSampleSize and QFullThreshold parameters, set for each datastore/device) instead of SIOC, and all of our ESXi hosts work that way without any issue.

These parameters are set in line with the HP 3PAR recommended values.
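For reference, this is roughly how we check and set them per device (a sketch using the NAA ID from my first post; 32/4 are the values HP 3PAR recommends):

# Show the current adaptive queuing values for the device
esxcli storage core device list -d naa.60002ac0000000000000016f00009893 | grep -i "queue full"
# Set them per device
esxcli storage core device set -d naa.60002ac0000000000000016f00009893 --queue-full-sample-size 32 --queue-full-threshold 4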

To my mind there is a difference between SIOC and adaptive queuing, don't you think?

Regards,

dekoshal
Hot Shot

Yes, there is a difference between SIOC and adaptive queuing in how they manage queue depth on the ESXi host. SIOC takes VM prioritization into account, in terms of the assigned share values, regardless of which host the VMs reside on, whereas adaptive queuing is not that smart and works at the storage device (LUN) level: when a QUEUE FULL condition arises, the queue depth of the device in question is halved, and once the congestion clears it is increased back incrementally.

Controlling LUN queue depth throttling in VMware ESX/ESXi (1008113) | VMware KB
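The host-wide adaptive queuing knobs described in that KB can be read like this (a sketch; as you say, in 5.5 they can also be set per device):

# Host-wide adaptive queuing advanced settings
esxcli system settings advanced list -o /Disk/QFullSampleSize
esxcli system settings advanced list -o /Disk/QFullThreshold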

If you found this or any other answer helpful, please consider using the Helpful button to award points.

Best Regards,

Deepak Koshal

CNE|CLA|CWMA|VCP4|VCP5|CCAH

dekoshal
Hot Shot

It would be interesting to see whether you have received errors similar to the log snippet below in vmkernel.log:

2016-12-28T07:54:18.215Z cpu34:39468718)lpfc: lpfc_scsi_cmd_iocb_cmpl:2185: 0:(0):3271: FCP cmd x89 failed <1/125> sid xbf000c, did xbf0005, oxid x69 iotag x406 Abort Requested Host Abort Req

2016-12-28T07:54:18.215Z cpu34:39468718)lpfc: lpfc_scsi_cmd_iocb_cmpl:2185: 0:(0):3271: FCP cmd x28 failed <1/125> sid xbf000c, did xbf0005, oxid x2cd iotag x66a SCSI Queue Full -

2016-12-28T07:54:18.266Z cpu34:38708490)lpfc: lpfc_scsi_cmd_iocb_cmpl:2174: 0:(0):0711 detected lun queue full adjust qdepth to 28

2016-12-28T07:54:18.266Z cpu34:38708490)lpfc: lpfc_scsi_cmd_iocb_cmpl:2185: 0:(0):3271: FCP cmd x28 failed <1/125> sid xbf000c, did xbf0005, oxid x2d4 iotag x671 SCSI Queue Full -

2016-12-28T07:54:18.315Z cpu17:32209598)lpfc: lpfc_scsi_cmd_iocb_cmpl:2185: 0:(0):3271: FCP cmd x2a failed <1/131> sid xbf000c, did xbf0005, oxid x1d8 iotag x575 Abort Requested Host Abort Req

2016-12-28T07:54:18.315Z cpu17:32209598)lpfc: lpfc_scsi_cmd_iocb_cmpl:2185: 0:(0):3271: FCP cmd x2a failed <1/131> sid xbf000c, did xbf0005, oxid x2a3 iotag x640 Abort Requested Host Abort Req

2016-12-28T07:59:32.249Z cpu35:38708489)lpfc: lpfc_scsi_cmd_iocb_cmpl:2185: 1:(0):3271: FCP cmd x89 failed <0/225> sid x9c000c, did x9c0005, oxid x734 iotag x6d1 SCSI Chk Cond - 0xe: Data(x2:xe:x1d:x0)

2016-12-28T08:26:59.860Z cpu1:32822)lpfc: lpfc_scsi_cmd_iocb_cmpl:2185: 0:(0):3271: FCP cmd x89 failed <1/206> sid xbf000c, did xbf0005, oxid x2b9 iotag x656 SCSI Chk Cond - 0xe: Data(x2:xe:x1d:x0)
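A quick way to check for similar entries (the lpfc prefix is Emulex-specific; the prefix will differ if your HBAs use another driver):

# Queue-full and abort events reported by the HBA driver
grep -iE "queue full|abort" /var/log/vmkernel.log | tail -n 50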

Best Regards,

Deepak Koshal

GaelV
Enthusiast

Thanks for the explanation. By default we use 32 and 4, the values HP 3PAR prescribes.

I'm going to check the logs, but I don't think I've ever seen lines like that.

Edit: I didn't find the log lines you mentioned; the closest I have look like the lines below:

2017-06-23T01:52:30.236Z cpu9:34176)Fil3: 15438: Max retries (10) exceeded for caller Fil3_FileIO (status 'IO was aborted by VMFS via a virt-reset on the device')

2017-06-23T01:52:30.236Z cpu9:34176)BC: 2288: Failed to write (uncached) object '.iormstats.sf': Maximum kernel-level retries exceeded

BR

Gael

dekoshal
Hot Shot

In the logs, look for SCSI or other storage error codes; they will help you narrow down the issue and indicate why the host had to stop the I/O.

Failed write command to write-quiesced partition in VMware ESX 4.x and ESXi 4.x/5.0/5.1/5.5 (2009482...

If you find a SCSI error and want to decode it, use the link below.

VMware ESXi SCSI Sense Code Decoder | Virten.net
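For example, a rough filter like this (using the NAA ID from the first post; the exact log formatting varies between driver versions) pulls out the lines whose H:/D:/P: status codes you can paste into the decoder:

# SCSI command failures for the affected device, with host/device/plugin status codes
grep naa.60002ac0000000000000016f00009893 /var/log/vmkernel.log | grep -i "H:0x"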

If you found this or any other answer helpful, please consider using the Helpful button to award points.

Best Regards,

Deepak Koshal

CNE|CLA|CWMA|VCP4|VCP5|CCAH

dekoshal
Hot Shot

Some more info on queue depth.

Queue depth basics

1. The queue depth is per LUN, not per initiator (HBA).

2. The queue depth:

a. Decides how many commands can be active at one time against a given LUN.

b. Allows multiple virtual machines to share a single resource.

c. Allows applications to have multiple active ("in-flight") I/O requests on a LUN at the same time, which provides concurrency and improves performance.

Queue depth recommendations

1. Reduce queuing on the ESXi host:

a. The sum of active commands from all VMs should not consistently exceed the LUN queue depth.

b. Move some of the latency-sensitive VMs to a different VMFS volume (LUN).

c. The Disk.SchedNumReqOutstanding parameter should have the same value as the queue depth. Note that the per-host Disk.SchedNumReqOutstanding parameter is deprecated in vSphere 5.5; the setting is now per device/LUN (a quick esxcli sketch follows this list). For more information, see the Solution section in the article below.

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1268

2. Reduce queuing on the storage array.
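A minimal sketch of the 5.5 per-device equivalent, assuming the NAA ID from the original post (check first, then align with the LUN queue depth if needed):

# Current per-device value ("No of outstanding IOs with competing worlds")
esxcli storage core device list -d naa.60002ac0000000000000016f00009893 | grep -i outstanding
# Align it with the LUN queue depth (32 in this environment)
esxcli storage core device set -d naa.60002ac0000000000000016f00009893 -O 32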

The reference doc below is a bit old, but it's a good read:

https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/scalable_storage_perform...

Queue depth tips:

If there is a high level of consolidation (a high VM-per-LUN ratio) or there are very intensive storage workloads in the environment, some of the vSphere queues may need to be adjusted.

Investigation of queuing is required at all levels of the storage stack, from the application and guest OS down to the storage array.

esxtop or the performance charts can be used to check whether the device/LUN queue is reporting 100% active or full (see the esxtop sketch below).
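Roughly, from esxtop (the field names are from memory and can vary slightly between builds):

esxtop    # then press 'u' for the disk device view
# DQLEN = configured device queue depth, ACTV = commands in flight,
# QUED = commands queued inside the kernel, %USD = queue utilisation.
# Sustained QUED > 0 or %USD close to 100 points at queue depth pressure.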

If you found this or any other answer helpful, please consider using the Helpful button to award points.

Best Regards,

Deepak Koshal

CNE|CLA|CWMA|VCP4|VCP5|CCAH

GaelV
Enthusiast

Hi dekoshal,

Thanks for your great, detailed answer, and sorry for the delay.

I've checked the "No of outstanding IOs with competing worlds" parameter; it is set to 32 on every device.

I'm currently reading the scalable storage performance doc, and I'll let you know what I find regarding my issue.

By the way, regarding the sense code decoder, I found two interesting explanations:

Host Status [0x3] TIME_OUT - This status is returned when the command in-flight to the array times out.

Host Status [0x8] RESET - This status is returned when the HBA driver has aborted the I/O. It can also occur if the HBA does a reset of the target.

So I think we can confirm there is a relation to the latency we saw in the logs.

dekoshal
Hot Shot

Hi GaelV,

Sounds good. This will require an end-to-end examination to narrow down and identify the issue. I have listed below some points to consider while troubleshooting an I/O issue.

0x3 and 0x8 can happen for many reasons.

1. You can start by checking the HBA driver version and updating it to the latest if required.

To check the HBA driver, follow the instructions below (a command sketch follows this list):

a. Run the command esxcfg-scsidevs -a

b. Identify the HBA used for the problem LUNs.

c. Note down the driver name corresponding to that HBA.

d. Run the command vmkload_mod -s <HBA driver name> | grep Version

This should display the driver version for your HBA.

e. Go to the VMware Compatibility Guide - System Search and check for an updated driver for the HBA that is applicable to the ESXi version running in your environment.

2. Check for any hardware issues on the storage array side, such as an SP port problem.

3. Check the placement of the VMs to ensure efficient I/O performance.

4. Check for any connectivity issues between the ESXi host and the storage.

5. Check whether the jumbo frame settings have been modified from the defaults.
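Putting steps a-d together, something like this (qlnativefc is just an example driver name for QLogic FC HBAs; substitute the driver reported for your adapters):

# Adapters (vmhbaN) and the driver each one uses
esxcfg-scsidevs -a
# Same information with a bit more detail
esxcli storage core adapter list
# Version of the driver module identified above (example: qlnativefc)
vmkload_mod -s qlnativefc | grep -i version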

If you found this or any other answer helpful, please consider using the Correct or Helpful buttons to award points.

Best Regards,

Deepak Koshal

CNE|CLA|CWMA|VCP4|VCP5|CCAH

GaelV
Enthusiast

Hi,

So I've checked many parameters and stats on the storage array side; everything is configured correctly, so my problem really lies in the ESXi configuration.

I followed your advice to find the driver version and compared it with some of our other servers: the versions were different.

The affected host was running qlnativefc 8.01.02, while the other servers have qlnativefc 8.02.00.

That is an interesting lead. I'll update the driver to 8.02.00 and then see whether this was the main issue or not.
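For anyone finding this thread later, this is roughly what I used to compare the versions and what I plan to run for the update (the bundle path and file name are placeholders, not the actual HPE/QLogic package; the host goes into maintenance mode first and is rebooted afterwards):

# Installed qlnativefc driver VIB on this host
esxcli software vib list | grep -i qlnativefc
# Apply the newer driver from an offline bundle (placeholder path), then reboot
esxcli software vib update -d /vmfs/volumes/DATASTOREXXXX/qlnativefc-8.02.00-offline-bundle.zip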
