RyanWI
Enthusiast

Troubleshooting SAN SCSI Busy errors

We've had an environment of ESX 3.0.1 running for a while now. The equipment is IBM HS40 blades (8839s) with two 2-port QLA2300 mezzanine cards. We often see SCSI errors like the ones below, and besides sending the problem to the SAN team, who basically ignore it, I could use a better troubleshooting path. Does anyone have any ideas on how to narrow down the problem?

Here is a snippet of the vmkernel log:

Aug 28 08:14:12 utomad1p0019 vmkernel: 47:18:17:28.080 cpu7:1037)SCSI: 8043: vmhba3:0:29:1 status = 8/0 0x0 0x0 0x0

Aug 28 08:14:12 utomad1p0019 vmkernel: 47:18:17:28.080 cpu7:1037)SCSI: 8062: vmhba3:0:29:1 Retry (busy)

Aug 28 08:14:32 utomad1p0019 vmkernel: 47:18:17:48.102 cpu4:1100)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Aug 28 08:14:32 utomad1p0019 vmkernel: 47:18:17:48.102 cpu4:1100)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Aug 28 08:14:32 utomad1p0019 vmkernel: 47:18:17:48.102 cpu4:1100)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Aug 28 08:14:32 utomad1p0019 vmkernel: 47:18:17:48.102 cpu4:1100)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Aug 28 08:14:32 utomad1p0019 vmkernel: 47:18:17:48.102 cpu4:1100)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Aug 28 08:14:32 utomad1p0019 vmkernel: 47:18:17:48.102 cpu4:1100)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Aug 28 08:14:32 utomad1p0019 vmkernel: 47:18:17:48.102 cpu7:1037)SCSI: 8043: vmhba3:0:29:1 status = 8/0 0x0 0x0 0x0

Aug 28 08:14:32 utomad1p0019 vmkernel: 47:18:17:48.102 cpu7:1037)SCSI: 8062: vmhba3:0:29:1 Retry (busy)

Aug 28 08:14:52 utomad1p0019 vmkernel: 47:18:18:08.124 cpu5:1197)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Aug 28 08:14:52 utomad1p0019 vmkernel: 47:18:18:08.124 cpu5:1197)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Aug 28 08:14:52 utomad1p0019 vmkernel: 47:18:18:08.124 cpu5:1197)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Aug 28 08:14:52 utomad1p0019 vmkernel: 47:18:18:08.124 cpu5:1197)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Aug 28 08:14:52 utomad1p0019 vmkernel: 47:18:18:08.124 cpu5:1197)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Aug 28 08:14:52 utomad1p0019 vmkernel: 47:18:18:08.124 cpu5:1197)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Aug 28 08:14:52 utomad1p0019 vmkernel: 47:18:18:08.124 cpu7:1037)SCSI: 8043: vmhba3:0:29:1 status = 8/0 0x0 0x0 0x0

Aug 28 08:14:52 utomad1p0019 vmkernel: 47:18:18:08.124 cpu7:1037)SCSI: 8062: vmhba3:0:29:1 Retry (busy)

Aug 28 08:14:54 utomad1p0019 vmkernel: 47:18:18:10.017 cpu5:1190)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Aug 28 08:14:54 utomad1p0019 vmkernel: 47:18:18:10.017 cpu5:1190)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Aug 28 08:14:54 utomad1p0019 vmkernel: 47:18:18:10.217 cpu5:1191)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Aug 28 08:14:54 utomad1p0019 vmkernel: 47:18:18:10.217 cpu5:1191)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Aug 28 08:14:54 utomad1p0019 vmkernel: 47:18:18:10.217 cpu6:1038)SCSI: 8043: vmhba3:0:53:1 status = 8/0 0x0 0x0 0x0

Aug 28 08:14:54 utomad1p0019 vmkernel: 47:18:18:10.217 cpu6:1038)SCSI: 8062: vmhba3:0:53:1 Retry (busy)

11 Replies
neilhdavies
Contributor

I'm getting the same issue with some of our hosts in a similar setup.

HP BL25p blades, ESX 3.0.1 with 2-port mezzanine HBAs (QLA2312).

If you get any ideas on what this is, I would love to know.

Big_T
Contributor

Me three!

Aug 28 11:31:41 vh-cl55 vmkernel: 45:19:38:51.404 cpu1:1028)LinSCSI: 2604: Forcing host status from 2 to SCSI_HOST_OK

Aug 28 11:31:41 vh-cl55 vmkernel: 45:19:38:51.404 cpu1:1028)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

HP BL20p-G3 blades, ESX 3.0.1 with 2-port mezzanine HBAs (QLA2312).

Brocade SAN, EMC CLARiiON. On the HBAs I'm running the firmware and BIOS configuration specified by EMC for this HBA, from the QLogic site.

http://support.qlogic.com/support/oem_product_detail.asp?p_id=900&oemid=65&oemname=Fibre

99.999% of the time, the errors only appear when running a virus scan.

I tried switching from Symantec antivirus to a trial of McAfee corporate, and the same errors were logged.

I tried to recreate the messages using iometer, but despite all the different load scenarios I tried, I was unable to duplicate the antivirus behavior.
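
Something like the following loop inside a Linux test guest might approximate a scanner's file-by-file read pattern better than a single large iometer worker did (a sketch only; the /data path and 4k block size are just placeholders):

# Read every file under /data in small blocks to mimic a file-by-file scan
find /data -type f -print0 | xargs -0 -I{} dd if={} of=/dev/null bs=4k 2>/dev/null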

RyanWI
Enthusiast

How many VMs do you run per LUN?

fakber
VMware Employee

First, what type of Storage Array and Fabric Switches are you using?

Are you using any special switches in the Blade Enclosure?

Check the Fabric Switches and Storage Array for any possible issues there.

Ensure that the fabric ports are set up correctly as port type F.

Do not worry about firmware versions for QLogic HBAs. The firmware is loaded by the driver, except in a boot-from-SAN setup, where the firmware on the card is used until the driver is loaded.

The Array is reporting the Busy status. Look at possible overloading issues.

Also check /proc/scsi/qla2300/* for link status and for any dropped frames.

Check /proc/vmware/scsi/vmhba[0-9]/stats for any bus resets (busRst) and for command aborts (cmdsAbrt). These values should be zero in most environments.
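
From the service console the checks look something like this (a sketch; the qla2300 proc node and the vmhba numbers depend on the driver and host):

# Link state and any loop/frame error counters from the QLogic driver
grep -i -E "link|state|error" /proc/scsi/qla2300/*

# Per-adapter stats, including bus resets (busRst) and aborted commands
# (cmdsAbrt), which should stay at zero -- vmhba2 here is just an example
cat /proc/vmware/scsi/vmhba2/stats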

This is where I would start to try to narrow down the issue.

RyanWI
Enthusiast

EMC DMX-3, IBM enterprise chassis, with McDATA switches. I've opened tickets with the SAN folks, but they say everything is fine...

We do boot from SAN. QLogic cards with firmware 1.47.

We've done some additional testing and it's still got me at a loss.

- Disabling that path on the ESX host: all other paths work great.

- VMotioning the VMs to another ESX host that has access to the same disk down the same FAs shows no issues.

-- This means that there isn't a device/frame/FA issue.

- Replaced the HBA, and the server still has the same issue down the same path.

- Moved the blade to a different slot in the same BladeCenter; the issue followed the server.

It's starting to look more and more like an issue with the server config, but we have 40 hosts set up identically and can't pinpoint it. :(
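
For reference, the paths can be enumerated from the service console with something like this (a sketch; your vmhba/target/LUN numbers will differ):

# List every path ESX sees to each LUN, along with its current state
esxcfg-mpath -l
# Individual paths can also be disabled from the VI Client's Manage Paths
# dialog, or via esxcfg-mpath (check esxcfg-mpath -h for the state option)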

fakber
VMware Employee

So based on what you have said so far, it seems that everything works fine except on this one blade down one path.

Therefore, I would then look at the following areas for any possible issues.

1 - As you said, compare this server's configuration with the other servers for possible differences.

2 - Check the chassis switch module port this HBA is connected to. Verify that the port is set up correctly and isn't logging any errors.

3 - Check to see if the HBA has any issues with the mezzanine slot in the blade.

Let's see what you find here.

spex
Expert

Are there any updates on your problem?

We also see many "Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY" messages on one of our blades.

Regards Spex

fakber
VMware Employee

Going from "good" to "busy" may be caused either by high I/O or by an issue on the array side of things. Try looking at the array for any possible problems.
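
One quick way to see how often this is happening, and against which paths, is to count the retries in the vmkernel log (a sketch; the awk field index assumes the log format shown earlier in this thread):

# Count "Retry (busy)" events per vmhba:target:LUN path
grep "Retry (busy)" /var/log/vmkernel | awk '{print $(NF-2)}' | sort | uniq -c | sort -rn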

vperris
Contributor

We are suddenly getting a constant flood of these messages also. I replaced the GBICs on the switch; no luck.

Anyone ever find a definitive answer on what is causing these?

We are on ESX 3.0.2, build 52542, with QLogic 2340s, a Brocade switch, and an IBM FAStT DS4800, a pretty vanilla setup.

Thanks for any help....Vicki

# tail vmkernel

Feb 26 08:43:51 seelplx6 vmkernel: 4:08:09:58.202 cpu0:1024)LinSCSI: 2604: Forcing host status from 7 to SCSI_HOST_OK

Feb 26 08:43:51 seelplx6 vmkernel: 4:08:09:58.202 cpu0:1024)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Feb 26 08:43:52 seelplx6 vmkernel: 4:08:09:58.402 cpu0:1024)LinSCSI: 2604: Forcing host status from 7 to SCSI_HOST_OK

Feb 26 08:43:52 seelplx6 vmkernel: 4:08:09:58.402 cpu0:1024)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Feb 26 08:43:52 seelplx6 vmkernel: 4:08:09:58.602 cpu0:1024)LinSCSI: 2604: Forcing host status from 7 to SCSI_HOST_OK

Feb 26 08:43:52 seelplx6 vmkernel: 4:08:09:58.602 cpu0:1024)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Feb 26 08:43:52 seelplx6 vmkernel: 4:08:09:58.802 cpu0:1024)LinSCSI: 2604: Forcing host status from 7 to SCSI_HOST_OK

Feb 26 08:43:52 seelplx6 vmkernel: 4:08:09:58.802 cpu0:1024)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

Feb 26 08:43:52 seelplx6 vmkernel: 4:08:09:59.002 cpu0:1024)LinSCSI: 2604: Forcing host status from 7 to SCSI_HOST_OK

Feb 26 08:43:52 seelplx6 vmkernel: 4:08:09:59.002 cpu0:1024)LinSCSI: 2606: Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

# date

Tue Feb 26 08:43:54 PST 2008

#

mcowger
Immortal

We get this when I max out a port on my array (a 2 Gbit port). Are you sure you aren't maxing out the port?
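
A rough way to sanity-check that from the host is to sample the cumulative adapter counters twice and compare the delta against the roughly 200 MB/s a 2 Gbit port can sustain (a sketch; exact field names in the stats node vary by build, and vmhba3 is just an example):

# Snapshot the adapter stats 10 seconds apart and eyeball the delta in the
# read/write counters to estimate throughput on that HBA
cat /proc/vmware/scsi/vmhba3/stats; sleep 10; cat /proc/vmware/scsi/vmhba3/stats

# Or watch it live: run esxtop, then press 'd' for the disk/adapter view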

--Matt

mike_laspina
Champion

Hello,

I have a couple of comments.

Keep in mind that BUSY is not an error but a state.

esxtop has real-time counters (AQLEN, LQLEN) that will help determine whether there is an issue with the SAN on the hosts.

You can also pull long-term report data from the performance tools in the VI Client, which is better for establishing a baseline of ESX host performance and spotting peak times.

Too many hosts sharing a very large VMFS volume is the likely issue.

It could also be that the number of drives behind the array group on the SAN is too low.

There can also be the unlikely faulty path or misconfiguration, but that usually shows up all the time, not just at I/O peaks.

Adapter queue depth and LUN queue depth are the main indicators of whether the issue is host related. If the sum of the LUN queue depths on one adapter reaches or exceeds the adapter's queue depth, it will result in a busy state event. The key to telling whether this is normal rather than a system problem is the SAN's sustained I/O maximum and the type of I/O being driven. Smaller I/Os mean lower overall data volume, but the higher I/O count will cause the queues to fill; larger I/Os can instead peak the bandwidth of the HBA paths, and the result is the same.
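
As a back-of-the-envelope check (a sketch; 32 is just a common qla2300 default, and your LUN count per adapter will differ): if the per-LUN queue depth is 32 and one adapter carries 20 busy LUNs, the worst case is 20 x 32 = 640 outstanding commands competing for that adapter's queue, so compare that figure against the AQLEN esxtop reports for the HBA. The numbers can be pulled like this from the service console:

# Per-LUN queue depth as reported by the QLogic driver (assumes the qla2300
# proc node prints a "queue depth" line; if not, esxtop's LQLEN shows it too)
grep -i "queue depth" /proc/scsi/qla2300/*

# Live adapter (AQLEN) and per-LUN (LQLEN) queue depths: run esxtop and
# press 'd' for the disk view
esxtop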

http://blog.laspina.ca/ vExpert 2009