VMware Cloud Community
limp15000
Contributor

SCSI Sense errors 0xb 0x47 0x0

Hello,

We have recently upgraded an ESXi 4.0 machine running on HP ProLiant DL360 G5 to ESXi 5.0 Update 1.

We did not have any problem with any of the hardware on 4.0.

The HCL included the SCSI card we were using: Adaptec 29320LPE.

We just had to use an image with the aic79xx driver.

We actually created an image tailored to our needs, but later we tried a lot of alternatives (detailed below).

So, we did the upgrade with the customized image, using vCenter and VUM.

We saw that during the reboot there were a lot of SCSI errors, seemingly in an infinite loop, so we decided to do a clean install, no big deal.

Starting the image from a CD-ROM, we encountered the same errors.

We tried the original ESXi image (intending to install the driver later, on disk), but it turned out to contain some version of the aic79xx driver too, so we had the same errors and could not finish the CD boot.

Here is a sample:

2012-09-05T13:13:43.916Z cpu1:2049)ScsiDeviceIO: 2309: Cmd(0x41240078acc0) 0x2a, CmdSN 0x746 from world 2052 to dev "t10.YI2D16SAEU4______202489310000057_" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x47 0x0.

There is nothing else noticeable, no warnings or errors regarding the driver or the interface, just this.

For those wondering, H:0x0 means there is no error on the ESXi host side, but D:0x2 is a CHECK CONDITION from the device: the sense data 0xb 0x47 0x0 decodes to ABORTED COMMAND / SCSI parity error, i.e. the target answered something along the lines of "we found a corrupted transfer and dropped it".
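To make that vmkernel line easier to read, here is a minimal sketch in Python that parses the H:/D:/P: status bytes and the sense triplet. The lookup tables contain only the standard SCSI (SPC) values relevant here; this is just an illustration, not a VMware tool.

```python
# Minimal decoder for the status fields of a vmkernel ScsiDeviceIO line.
# The code-to-name mappings are the standard SPC values.
import re

SENSE_KEYS = {0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
              0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
              0x6: "UNIT ATTENTION", 0xb: "ABORTED COMMAND"}
ASC_ASCQ = {(0x47, 0x00): "SCSI PARITY ERROR"}

def decode(line):
    m = re.search(r"H:(0x[0-9a-f]+) D:(0x[0-9a-f]+) P:(0x[0-9a-f]+) "
                  r"Valid sense data: (0x[0-9a-f]+) (0x[0-9a-f]+) (0x[0-9a-f]+)",
                  line)
    if not m:
        return None
    host, device, plugin, key, asc, ascq = (int(x, 16) for x in m.groups())
    return {
        "host_ok": host == 0,              # H:0x0 -> no host-side error
        "check_condition": device == 0x2,  # D:0x2 -> device CHECK CONDITION
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "additional": ASC_ASCQ.get((asc, ascq), (hex(asc), hex(ascq))),
    }

line = ('2012-09-05T13:13:43.916Z cpu1:2049)ScsiDeviceIO: 2309: Cmd(...) 0x2a, '
        'failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x47 0x0.')
print(decode(line))
```

Running it on the sample line above reports the sense key as ABORTED COMMAND with additional code SCSI PARITY ERROR.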

The other side is a SAN, model YI2D16SAEU4, with SATA disks and dual Ultra320 SCSI interfaces. The actual SCSI chip is an LSI 1030T.

We thought there was a problem in the configuration, triple checked everything on the SAN, but nothing was wrong.

We also tried to switch off the only "advanced" option, QAS.

It did not help.

What is worse is that when the Linux kernel receives this kind of error, it seems to be brought to its knees for about half a minute.

There are actually a number of bug tickets related to this on the web, with all kinds of SCSI drivers and all kinds of Linux-based systems, just nothing related to ESXi, this card or this SAN in particular.

Anyway, we thought it was a driver problem, and we changed the card.

We opted for one with the LSI Logic 53C1020 chip, the ESXi default mptspi driver took care of it.

It is probably the most common PCI-X to Ultra320 chip in the wild, so we thought there could not be a bug with it, or someone would have spotted it.

The problem is that we got the exact same error, just at a lower frequency.

Of course we tried to change the SCSI port on the SAN, and the cable...

Not only did it not change anything, it also did not make sense, since everything worked nicely with ESXi 4.0.

Thanks to the lower frequency of the errors, we were able (very slowly) to start the system from CD, install it on the local disk and finally boot it up, with everything more or less working.

In fact everything was working, just so slowly that a lot of stuff timed out now and then.

Any access to a LUN on the SAN (FYI there are 4 of them) meant a system freeze and performance breakdown for a few minutes, but occasionally the data squeezed through.

We were also able to connect to vCenter and try to debug the problem a little via SSH.

After a lot of searching and trying, we found that by setting the SAN to Ultra2 (80 MB/s), the errors went away completely.

Setting it to Ultra3 (160 MB/s) did not help.
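For reference, since the transfer-rate naming gets confusing, here is a tiny sketch listing the nominal wide (16-bit) parallel SCSI bus bandwidths involved (these are the standard figures), to put the Ultra2 workaround in perspective against the Ultra320 hardware:

```python
# Nominal wide (16-bit) parallel SCSI bus bandwidths, in MB/s.
# Standard values, shown only to illustrate the cost of the workaround.
BUS_MBPS = {
    "Ultra2 (Fast-40)": 80,
    "Ultra3/Ultra160 (Fast-80)": 160,
    "Ultra320 (Fast-160)": 320,
}

for name, rate in BUS_MBPS.items():
    factor = BUS_MBPS["Ultra320 (Fast-160)"] // rate
    print(f"{name:28s} {rate:4d} MB/s  ({factor}x slower than Ultra320)")
```

So the stable Ultra2 setting runs the bus at a quarter of its rated Ultra320 bandwidth.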

We do not suffer from the performance hit, since we have the luxury of running only a testing environment on this machine, but I have been asked to explain the problem here and seek help. We would be delighted to know that we will not encounter this problem with one of our clients, or with the production servers we are going to upgrade soon.

If you need any other information to debug this, let us know.

If you have any ideas to try and test, we are open to anything that is free, just let us know.

And thank you for your help.
