FreeBSD 10 guest - CAM status: SCSI Status Error

vmejo · ‎12-31-2014

Hi,

I'm running into a serious problem with a FreeBSD guest under ESXi 5 (VMware vCenter Server 5.5.0, build 1623101, to be exact).

From time to time, and increasingly often, I see errors like these on the guest for no apparent reason:

Dec 15 01:33:25 igue kernel: (da0:mpt0:0:0:0): WRITE(10). CDB: 2a 00 00 c0 9e 22 00 00 08 00

Dec 15 01:33:25 igue kernel: (da0:mpt0:0:0:0): CAM status: SCSI Status Error

Dec 15 01:33:25 igue kernel: (da0:mpt0:0:0:0): SCSI status: Busy

Dec 15 01:33:25 igue kernel: (da0:mpt0:0:0:0): Retrying command

Dec 15 01:33:25 igue kernel: (da0:mpt0:0:0:0): WRITE(10). CDB: 2a 00 00 c0 00 a2 00 00 08 00

Dec 15 01:33:25 igue kernel: (da0:mpt0:0:0:0): CAM status: SCSI Status Error

Dec 15 01:33:25 igue kernel: (da0:mpt0:0:0:0): SCSI status: Busy

Dec 15 01:33:25 igue kernel: (da0:mpt0:0:0:0): Retrying command
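
For anyone trying to read the CDB bytes in these log lines, here is a minimal Python sketch that decodes them. It assumes the standard WRITE(10) layout (opcode 0x2a in byte 0, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8) and 512-byte sectors. `decode_write10` is a hypothetical helper name used only for illustration, not part of any FreeBSD or VMware tool.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def decode_write10(cdb_hex: str) -> dict:
    """Decode a WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),     # starting logical block
        "blocks": int.from_bytes(cdb[7:9], "big"),  # number of blocks to write
    }


# The two CDBs from the log excerpt above.
for cdb_hex in ("2a 00 00 c0 9e 22 00 00 08 00",
                "2a 00 00 c0 00 a2 00 00 08 00"):
    fields = decode_write10(cdb_hex)
    print(fields, fields["blocks"] * SECTOR_BYTES, "bytes")
```

Both failing commands decode to small 8-block (4 KiB) writes, which fits the observation that this is not a heavy-I/O workload.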

BTW, the "disk" is recognized by the OS as follows:

Dec 12 13:42:40 igue kernel: da0 at mpt0 bus 0 scbus2 target 0 lun 0

Dec 12 13:42:40 igue kernel: da0: <VMware Virtual disk 1.0> Fixed Direct Access SCSI-2 device

Dec 12 13:42:40 igue kernel: da0: 300.000MB/s transfers

Dec 12 13:42:40 igue kernel: da0: Command Queueing enabled

Dec 12 13:42:40 igue kernel: da0: 61440MB (125829120 512 byte sectors: 255H 63S/T 7832C)
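
As a sanity check on the probe line above, the reported size and cylinder count follow directly from the sector count. This is only a small arithmetic sketch in Python. The variable names are illustrative, and it assumes the "MB" in the kernel message means MiB (2^20 bytes).

```python
SECTOR_BYTES = 512
total_sectors = 125829120            # from the da0 probe line
heads, sectors_per_track = 255, 63   # reported geometry (255H 63S/T)

# Total capacity in MiB.
size_mib = total_sectors * SECTOR_BYTES // (1024 * 1024)

# Cylinder count is the sector count divided by sectors per cylinder, rounded down.
cylinders = total_sectors // (heads * sectors_per_track)

print(size_mib, "MiB,", cylinders, "cylinders")
```

Both values match what the kernel printed, so the guest sees the 60 GB virtual disk as expected.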

Here's what I checked so far:

A faulty disk or genuine SCSI errors can be ruled out for sure: the "disks" used by this VM come from a NetApp system providing storage for 200+ virtual machines, and none of the others is experiencing these problems.
Re-installing the guest OS from scratch: errors remain.
With or without VMware Tools installed: errors remain.
Setting everything up under FreeBSD 9.x instead of FreeBSD 10: same story.
The errors are definitely not related to heavy I/O: the machine in question acts as a DHCP and caching DNS server.
The errors appear repeatedly but are not tied to a specific time of day.

Even worse: if these errors appear several times in a row, the VM crashes completely and has to be "power-cycled" manually.

Thanks much in advance for any clue,

-vmejo

aakalan · ‎02-12-2015

Hello vmejo,

Did you solve this problem?

vmejo · ‎02-14-2015

Hi,

Sorry - no 😞

I have neither solved this problem nor received any reply to my question.

The problem is still there, with the VM running into the exact same errors every couple of days.

mattclemens · ‎03-19-2015

Hello, we are seeing the exact same error.

If anyone has any tips please help!

matkovskiy · ‎04-30-2015

I have the same problem.

10.1-RELEASE-p9

ESXi 5.5.0, build 2068190

Are there any solutions?

Alistar · ‎04-30-2015

Hello everyone,

Which virtual adapter (SAS/SCSI/SATA) are you using, and have you tried changing it? Switching between LSI Logic SAS and LSI Logic Parallel might help, or even trying a SATA controller for your disk storage.

Stop by my blog if you'd like 🙂 I dabble in vSphere troubleshooting, PowerCLI scripting and NetApp storage - and I share my journeys at http://vmxp.wordpress.com/

aakalan · ‎04-30-2015

This is solved by a FreeBSD update. See the link below:

CAM status: SCSI Status Error | FreeNAS Community

FIXED in latest 9.2.1.2.

matkovskiy · ‎04-30-2015

ESXi is installed on a Dell R730XD. The RAID controller is a PERC H730 Mini, which is an LSI 3108.

I have two logical drives: one on SSD drives and one on SATA drives.

Both have the problem.

matkovskiy · ‎04-30-2015

What about FreeBSD 10?

matkovskiy · ‎04-30-2015

I use the LSI Logic Parallel SCSI controller for the FreeBSD VM, because it is the default.

Has anybody tried the LSI Logic SAS controller for a FreeBSD VM?

vmejo · ‎04-30-2015

Hi,

Are you sure it's actually fixed? I'm running 10.1 (kernel and system dating from April 15) and the problem is still there, i.e. I'm still getting these errors in the log; the only difference is that the system doesn't crash any more.

matkovskiy · ‎05-04-2015

The problem is still relevant!

On FreeBSD 10.1 (r274401) it appears about once every 48 hours.

Is there any solution?

matkovskiy · ‎05-11-2015

I am running 3 FreeBSD VMs on VMware ESXi 5.5, and all three virtual machines fail at the same time with the error (CAM status: SCSI Status Error).

It always happens at night.

Are there any solutions for this problem?

vmejo · ‎05-29-2015

Also, from what I've seen, it happens mostly overnight. One possible explanation, at least in our context, is that night time is backup time, i.e. heavy load on the storage side.
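
One way to test the "mostly overnight" impression is to count the errors per hour of day from the guest's syslog. Below is a minimal Python sketch, assuming the log lines have the same shape as the excerpts posted earlier in this thread. `errors_by_hour` and the regular expression are hypothetical helpers written for illustration, not an existing tool.

```python
import re
from collections import Counter

# Matches syslog lines such as:
# "Dec 15 01:33:25 igue kernel: (da0:mpt0:0:0:0): CAM status: SCSI Status Error"
# and captures the hour of day.
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+(\d{2}):\d{2}:\d{2}\s+\S+\s+kernel:.*CAM status: SCSI Status Error"
)


def errors_by_hour(lines):
    """Return a Counter mapping hour of day (0-23) to number of CAM errors."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Sample input taken from the log excerpt at the top of the thread.
sample = [
    "Dec 15 01:33:25 igue kernel: (da0:mpt0:0:0:0): WRITE(10). CDB: 2a 00 00 c0 9e 22 00 00 08 00",
    "Dec 15 01:33:25 igue kernel: (da0:mpt0:0:0:0): CAM status: SCSI Status Error",
    "Dec 15 01:33:25 igue kernel: (da0:mpt0:0:0:0): SCSI status: Busy",
    "Dec 15 01:33:25 igue kernel: (da0:mpt0:0:0:0): CAM status: SCSI Status Error",
]

for hour, count in sorted(errors_by_hour(sample).items()):
    print(f"{hour:02d}:00  {count}")
```

To run it against a real log, pass an open file instead of `sample`, for example `errors_by_hour(open("/var/log/messages"))`; that path is an assumption about a default syslog setup. If the counts cluster around the backup window, that would support the storage-load explanation.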

gpwwww · ‎07-01-2015

I was getting more and more of these errors on my FreeBSD VM as I added more VMs to the ESX host. I've now added a memory reservation to the FreeBSD VM (Settings -> Resources -> Memory -> Reservation 3000MB) and it seems to have vastly improved things (I haven't seen any errors in the last 3 days). I guess it might have just been due to too much memory oversubscription on the ESX host impacting the performance of FreeBSD.

Cannoli · ‎08-11-2015

We're seeing the same issue, but I can say with certainty it is NOT a FreeBSD issue. Working with Supermicro, LSI and VMware, we determined that the LSI controller is "timing out": all I/O comes to a complete stop on the controller. While it was the FreeBSD VM console that alerted us to the issue during our initial build-out of the server cluster, the vmkernel log file confirmed that the LSI 3108 controller backing an 8-disk SSD RAID is timing out and then resetting. We've been able to cause it to "time out" at will by powering up or resetting 5 VMs at the same time. Not only does the vmkernel log show the loss of communication with the controller, the LED activity on the drives is non-existent for ~30-40 seconds.

We've tried new LSI controller firmware (even beta firmware from LSI), various VMware drivers for the controller, hardware BIOS settings for the system and the controller. You name it, we've tried it all without success.

I have an open ticket with Supermicro and VMware to solve this issue. I'll post more as I have information.

LaminarCS · ‎08-19-2015

I think I may have the solution.

We just ran into this issue with a brand-new Supermicro machine with an LSI 3108-based RAID controller and VMware ESXi 6.0.

The solution was to ditch the lsi_mr3 driver and use the Avago / LSI scsi-megaraid-sas driver. We were able to find the appropriate driver for our ESXi version by going here: http://mycusthelp.info/LSI/_cs/AnswerDetail.aspx?inc=8447

Be sure you download what they are labeling as the "legacy driver" and not the native driver, as that is the one with the problems. Oracle has an excellent article with instructions on switching to the scsi-megaraid-sas driver and for turning off the lsi_mr3 driver, you can follow those, but reference the newer driver version / files you downloaded. Here are the Oracle instructions: Enable the megaraid_sas Driver - Oracle Server X5-2 HTML Documentation Collection

With the new driver installed I was successfully able to run the StorCLI utility (the replacement for MegaCLI) to access the card. I was able to view the current firmware and installed a newer firmware that I was able to find here: ftp://ftp.supermicro.com/driver/SAS/LSI/3108/Firmware/

After installing the latest firmware, I did have to re-add the storage for some reason. I also had problems with the web client and simply re-added the storage with the old Windows client.

I believe the key to fixing the issue is switching to the scsi-megaraid-sas driver, although I did also upgrade the firmware before performing tests that would cause the errors previously ... so I can't confirm this 100%.

SomeRandomDude · ‎09-17-2015

I was experiencing the same issues; LaminarCS's instructions were almost enough, but in my case I had to do one more thing. My environment: Dell T630, MegaRAID 8380E attached to eight 1TB Samsung SSD 850 Pro drives; a FreeNAS 9.3 VM and a NAS4Free 10.2 VM providing iSCSI from a VMFS datastore backed by those drives.

My additional problem was HEAT. The MegaRAID card was idling at 98 degrees Celsius. I can only imagine what kind of temperatures it was reaching under load. Even a 20% increase would put it over the suggested thermal limit. The Dell-recommended slot placement of the RAID card puts it at the top of the case, where there is zero airflow. I added a fan blowing directly onto the card, and temperatures dropped by 46 degrees C. That is a reduction from about 208 F to about 125 F, which is enormous. Once the fan was in place, the errors ceased.
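
The Celsius and Fahrenheit figures above can be checked with the usual conversion formula. This is just a small Python sketch of the arithmetic; `c_to_f` is an illustrative helper name.

```python
def c_to_f(celsius: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


idle_c = 98             # idle temperature before adding the fan
cooled_c = idle_c - 46  # temperature after the 46 degree C drop

print(f"{idle_c} C = {c_to_f(idle_c):.1f} F")
print(f"{cooled_c} C = {c_to_f(cooled_c):.1f} F")
```

The results land within a degree of the 208 F and 125 F quoted in the post.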

So, if you've tried all the suggestions in this thread and still are experiencing errors, check your airflow and temperatures. In my case I had to:

1.) Upgrade the firmware of the MegaRAID 8380E

2.) Commit 16GB to my NAS virtual machine

3.) Update the MegaRAID driver to the newest version from VMware

4.) Add active cooling to keep the 8380E at a reasonable operating temperature

mfitz50 · ‎03-18-2017

I do not know if this will help anyone, but I just ran into a similar issue on hardware using the X79 chipset.

In this case I resolved the issue by using the sata-ahci driver in ESXi 6.5.

Hope this helps.

timboAUS · ‎06-02-2017

I had this error. It turned out that the cache controller battery had failed on an HP DL380 G7.

Lift the cover off your server and check the battery LEDs. If you have a solid amber light, then a new battery will fix the problem.
