VMware Cloud Community
gregsn
Enthusiast

ESXi 5 hangs with Adaptec controllers (aacraid: Host adapter abort request)

Hi There,

I have a couple of ESXi 5 boxes running Adaptec controllers (one 6805 and one 5805Z).  Over the last several months, ESXi has been hanging randomly from time to time (seemingly after heavy load).  When the hang occurs, I see continuously scrolling error messages: "SCSILinux AbortCommands: 1798:Failed, Driver aacraid." The only way to recover is to hard reset the host.

After working with Adaptec for the last couple of months and trying different drivers and firmware (now officially available for both 6805 and 5805Z controllers), they have pointed me to the article below (Adaptec Answer ID 15357).  They told me to contact VMware about making the adjustment, since they have no information on how to apply it within ESXi.  Does anyone know how to correctly implement this adjustment?

Answer ID
15357


Error: "aacraid: Host adapter abort request"
Question
The Linux server shows the following error messages:
aacraid: Host adapter abort request (4,0,1,0)
aacraid: Host adapter reset request. SCSI hang ?

What can be done to resolve this issue?

This information applies to the following Operating System(s):

- Linux Kernel 2.6.18 and later
Answer
AACRAID based controllers have an underlying timeout/recovery cycle that is 35 seconds long.

The default in some SCSI subsystems was 60 seconds in the past, but it is now standardized at 30 seconds. That is shorter than the controller's 35-second cycle, which results in an interference pattern between the controller and the Linux SCSI subsystem.

The alternate workaround is for the user to adjust the timeout in SYSFS if it is shorter than 35 seconds.

Changing the timeout values for a Linux block device can be done via SYSFS. For example, if /dev/sdc, /dev/sdd and /dev/sde are the device LUNs on a given Linux host, then the following commands need to be issued:
echo 45 > /sys/block/sdc/device/timeout
echo 45 > /sys/block/sdd/device/timeout
echo 45 > /sys/block/sde/device/timeout
In this example the timeout is 45 seconds which should be enough.
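The per-device commands above can be generalized into a loop. This is only a minimal sketch for a regular Linux host, not for ESXi: the function name raise_timeouts and its three parameters are made up for illustration, and it assumes the LUNs show up as sd* devices. It raises the timeout only where the current value is below the controller's 35-second cycle. Values written to SYSFS are not kept across a reboot, so this would need to be re-run at boot.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Adaptec article).
# Usage: raise_timeouts [BLOCK_DIR] [MIN_SECONDS] [TARGET_SECONDS]
raise_timeouts() {
    block_dir=${1:-/sys/block}   # sysfs block directory
    min=${2:-35}                 # AACRAID recovery cycle, per the article
    target=${3:-45}              # new timeout, as in the example above
    for f in "$block_dir"/sd*/device/timeout; do
        [ -f "$f" ] || continue  # glob matched nothing
        current=$(cat "$f")
        # Only touch devices whose timeout is shorter than the cycle.
        if [ "$current" -lt "$min" ]; then
            echo "$target" > "$f"
            echo "raised $f from $current to $target"
        fi
    done
}

# Example (needs root on a real host):
#   raise_timeouts /sys/block 35 45
```

Devices that already have a timeout of 35 seconds or more are left untouched.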

Note: If an AACRAID based controller is going through an error correction cycle on the SAS/SATA bus that delays the completion of I/O beyond the Linux default timeout set for the device, the cause may be a hardware issue or a problem with the default timeout value as outlined above. If changing the timeout value doesn't solve the problem, please follow the steps we recommend to troubleshoot "Host adapter reset request. SCSI hang ?" messages:
  • Check for any updated firmware for the motherboard, controller, targets and enclosure on the respective manufacturer's web sites.
  • Check per-device queue depth in SYSFS to make sure it is reasonable.
  • Engage disk drive manufacturer's technical support department to check through compatibility or drive class issues.
  • Engage enclosure manufacturer's technical support department to check through compatibility issues.
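For the queue depth check in the list above, the current values can be read from SYSFS on a regular Linux host. This is a rough sketch under the same assumptions as before (sd* device names, not ESXi); the function name report_queue_depths is hypothetical, and it only prints the values without judging what counts as reasonable.

```shell
#!/bin/sh
# Hypothetical helper: print "<device> <queue_depth>" for each sd* device.
# Usage: report_queue_depths [BLOCK_DIR]
report_queue_depths() {
    block_dir=${1:-/sys/block}
    for f in "$block_dir"/sd*/device/queue_depth; do
        [ -f "$f" ] || continue   # glob matched nothing
        dev=${f#"$block_dir"/}    # strip the leading directory
        dev=${dev%%/*}            # keep only the device name, e.g. sdc
        echo "$dev $(cat "$f")"
    done
}

# Example:
#   report_queue_depths /sys/block
```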
7 Replies
FragKing
Contributor

Hi

I have the same problem. Have you found a solution for this yet?

Thanks

http://www.quadrotech-it.com
gregsn
Enthusiast

Adaptec has released some new drivers (29100) which I'm testing right now.  So far so good. The new drivers can be downloaded from their website.

FragKing
Contributor

Thx, trying them out now!

http://www.quadrotech-it.com
Icereval
Contributor

I tried the Adaptec AACRAID v1.1.7-29100 driver and it did not resolve the issue.

Were you able to find a solution?

gregsn
Enthusiast

Try disabling VT-d in the BIOS.  Also, check to ensure all your drives are in good working condition.  I've found that faulty SSDs can cause a very similar hanging issue.  Some more info here: https://communities.vmware.com/message/2258169#2258169

Try the following if you haven't already:

1. Latest 6805 BIOS update (http://www.adaptec.com/en-us/speed/raid/asr/fw_bios/6805_fw_b19109_exe.htm)

2. Latest drivers (AACRAID Driver v1.2.1-29900 for VMware) from here (http://www.adaptec.com/en-us/speed/raid/aac/linux/aacraid_vmware_drivers_1_2_1-29900_tgz.htm)

3. Disable VT-d in the motherboard BIOS.

4. Check for drive failures using arcconf.  Use the command "arcconf getlogs 1 device tabular" and check for any problematic drives.

5. If it continues to fail, you may have a bad controller (I've had a brand new controller cause the same errors.  It turned out to be a bad controller).  Create a support archive and look through the logs there and/or create a support case with Adaptec.

With steps 1, 2 and 3, I haven't had any stability issues on 5.1 (799733).  If you are using SSDs, they can fail in a way that can also cause the same error message.  Physically unplugging the failed SSD will usually allow the controller to recover.

Punisher713
Contributor

I also have tried the Adaptec AACRAID v1.1.7-29100 driver and it did not resolve the issue.


Latest firmware is installed, and the disk drives are brand new Seagate 4 TB in RAID 5.


Disabling VT-d did not change anything.


Exact symptoms I'm experiencing:

On heavy write load, ESXi host hangs completely. No PSOD, nothing in vmkernel.log, it just hangs and needs a forced shutdown.

On heavy read operations, everything is fine. I tried moving a 15 GB file from my array's datastore to the previous HDD's datastore: the transfer was successful, but ESXi crashed trying to delete the file from the array.

cneulieb
Contributor

I have been looking for an answer to this for months and finally found this: VMware ESXi 5.5: unresponsive system and SCSILinuxAbortCommands

I haven't been able to get it implemented yet, but it sounds exactly like the problem I am seeing.
