VMware Cloud Community
todd6666
Contributor

Troubleshooting HBA cmdsabrt & busRst

Hi,

I have ESX 2.x and 3.x hosts using various QLogic HBAs (mainly in HP blades and DL385 / 580 servers), all of which are experiencing aborted commands and bus resets; the screenshot below shows an example. The cmdsAbrt and busRst numbers are much higher on most servers than seen below, usually between 100 and 200 for both values.

The HBAs in question are all connected to SAN storage used to store vmdk files for around 1000 VMs (500 VC 1.x and 500 VI 2.x) and 100 ESX hosts, again across VI 1.x and VC 2.x.

Basically, I would appreciate it if someone could give me some general troubleshooting techniques and common causes for these errors.

Thanks in advance.

Message was edited by: Badsah: Restored original post by copying from a document containing duplicate content: DOC-8320.

6 Replies
mitchellm3
Enthusiast

What kind of storage are you using?

As far as common causes for those errors go, I'm not quite sure what they all are. We have them in our environment and we are working with VMware and our storage provider (IBM) to fix them. I think the best place to start looking for an answer would be /var/log/vmkernel and /var/log/vmkwarning. See if you have any errors that stick out and, if so, whether there are common times that they occur and how many hosts they affect when they occur. In our case, we get these across all hosts...

Oct 21 03:08:36 vm09 vmkernel: 54:13:51:01.362 cpu5:1029)StorageMonitor: 222: vmhba1:1:2:1 status = 8/0 0x0 0x0 0x0

What follows the "status =" is the SCSI error code. In the above case the 8/0 error code means that the storage processor is too busy to handle the SCSI request...not good.

If you get a status = 24/0, that is a SCSI reservation conflict. You can turn up your SCSI logging in the advanced options of each ESX host if you need more info in the logs. We have that turned up right now.
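For anyone who wants to look for the "common times" and "how many hosts" patterns mentioned above, here is a rough Python sketch that pulls the status out of StorageMonitor lines like the one quoted and counts them per host and hour. The regular expression is only fitted to that one sample line, and the function names (`parse_line`, `summarize`) are made up for illustration. Only the 8 (BUSY) and 24 (reservation conflict) meanings come from this thread; the other table entries are standard SCSI status codes written in decimal, and treating the number after the slash as a host status is an assumption.

```python
import re
from collections import Counter

# SCSI device status values, in decimal, as printed after "status =".
# 8 and 24 are the ones discussed in this thread; the rest are standard
# SCSI status codes included for context.
SCSI_STATUS = {
    0: "GOOD",
    2: "CHECK CONDITION",
    8: "BUSY",
    24: "RESERVATION CONFLICT",
    40: "TASK SET FULL",
}

# Fitted to the single sample line in this thread; other ESX builds may
# log a different layout. The number after "/" is assumed to be a host status.
LINE_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d+)\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+"
    r"(?P<host>\S+)\s+vmkernel:.*StorageMonitor:\s*\d+:\s*"
    r"(?P<path>vmhba[\d:]+)\s+status\s*=\s*(?P<dev>\d+)/(?P<hoststat>\d+)"
)


def parse_line(line):
    """Return a dict describing one StorageMonitor line, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    dev = int(m.group("dev"))
    return {
        "host": m.group("host"),
        "hour": "%s %s %s:00" % (m.group("month"), m.group("day"), m.group("hour")),
        "path": m.group("path"),
        "device_status": dev,
        "meaning": SCSI_STATUS.get(dev, "UNKNOWN"),
    }


def summarize(lines):
    """Count matching lines per (host, hour, status meaning)."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is not None:
            counts[(rec["host"], rec["hour"], rec["meaning"])] += 1
    return counts


if __name__ == "__main__":
    sample = ("Oct 21 03:08:36 vm09 vmkernel: 54:13:51:01.362 cpu5:1029)"
              "StorageMonitor: 222: vmhba1:1:2:1 status = 8/0 0x0 0x0 0x0")
    for key, count in summarize([sample]).items():
        print(key, count)
```

Feeding it the combined vmkernel logs from several hosts should show whether the BUSY or reservation conflict events cluster at particular hours or hit every host at once.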

Of course, your best bet would be to call VMware.

dtux101
Enthusiast

Judging by this message in the 2nd poster's reply, it may be a case of trying to drive too much I/O, or else the array is overtaxed:

Oct 21 03:08:36 vm09 vmkernel: 54:13:51:01.362 cpu5:1029)StorageMonitor: 222: vmhba1:1:2:1 status = 8/0 0x0 0x0 0x0

The 8/0 0x0 0x0 0x0 decodes to BUSY in SCSI terminology. The first poster also mentioned 1000 VMDKs.

When you see aborted commands coming back from the QLogic (or any) driver, it means that a command was issued from the controller to the target, but the target did not accept the I/O and therefore the I/O was aborted. When this happens, a few different scenarios can occur:

1) The I/O is re-queued at the driver level (HBA cache). In the QLogic and Emulex drivers, the queue size defaults to 32, but this is changeable.

2) The I/O cannot be re-queued as the HBA queue is already full. In this case, a record of this is propagated back to the VM via the vmkernel, and the guest (VM) will acknowledge it. Then a bus reset is sent from the guest. You may see some VSCSI warnings/errors in /var/log/vmkernel at this time. You'd also expect to see event ID 11, 15, and 51 errors inside Windows VMs. Some Linux kernels will ignore this until a sustained period expires, at which stage the filesystem inside the guest will be placed read-only. Windows will bluescreen if it has no disk access after 30 seconds (see VMware KB 1014 for more details).

So, in your case, it may be that the array is over-taxed (for the second poster the 8/0 warning indicates this).

In both your cases, you might want to consider setting the HBA queue depth to 64 or even 128 (0-255 is the range). This is described in the SAN configuration guide (search for ql2xmaxqdepth, etc.).

The VM's queue is fixed at a max of 128, but you can still change the LSILogic queue depth settings inside Windows (via regedit; symmpi is the key). Linux has an adjustable queue.

Also, you might want to run esxtop again and use the j option to view disk average latency, which will tell you how long it takes for the array to respond. If this is above 100ms, you'd be looking at a sluggish array.
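If you note down a few latency readings per device, the 100ms rule of thumb above can be applied with something like the following Python sketch. The function name `slow_devices`, the device labels, and the sample numbers are all made up for illustration; how you collect the readings from esxtop is not shown here.

```python
def slow_devices(samples, threshold_ms=100.0):
    """Return {device: average latency in ms} for devices over the threshold.

    samples maps a device label to a list of latency readings in ms.
    The 100 ms default is the rule of thumb from this post.
    """
    slow = {}
    for device, values in samples.items():
        if not values:
            continue  # no readings for this device, nothing to average
        avg = sum(values) / len(values)
        if avg > threshold_ms:
            slow[device] = avg
    return slow


if __name__ == "__main__":
    # Hypothetical readings, not taken from any real host.
    readings = {
        "vmhba1:1:2": [120.0, 140.0],
        "vmhba1:1:3": [5.0, 7.0],
    }
    print(slow_devices(readings))
```

Averaging several readings rather than reacting to one spike keeps a single busy moment from flagging an otherwise healthy LUN.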

HTH - let me know how you get on

todd6666
Contributor

f

depping
Leadership

No clue what happened here, but there is no useful info anymore, so I moved the thread.



Duncan

Blogging: http://www.yellow-bricks.com

If you find this information useful, please award points for "correct" or "helpful".

Badsah
Expert

I'm not sure what happened to cause the original post to disappear either, but I restored the text and the image and moved this back. One of our Support Escalation Engineers is intending to respond to this thread. Thanks.




---

Badsah Mukherji

VMware Web Communities Team

Badsah
Expert

Todd, do you have any clue as to what is happening here? Your posts are turning into single letters. Very strange.




---

Badsah Mukherji

VMware Web Communities Team
