VMware Cloud Community
securitux
Contributor

ESXi 4.1 PSOD when performing certain Linux disk operations

Hey all,

I am running ESXi 4.1 on an Intel server and I get PSODs whenever I perform certain disk operations in a Linux guest. It seems to happen when I partition a disk, or once in a while when writing to disk.

It happens on both thin and thick provisioned disks.

It ONLY happens with the LSI Parallel virtual disk controller, not LSI SAS.

I have attached the purple screen I get. It sucks that a Linux guest's disk operation can crash an entire ESXi host with 15 guests on it. The guest disk operations SHOULD be irrelevant to the host.

Any help would be appreciated.

Thanks

-J

jamesbowling
VMware Employee

What "certain disk operations" are you performing?

securitux
Contributor

When I am finished configuring my partitions on the disk and I write those partitions to disk, and the disk is LSI Parallel, I am pretty much guaranteed it will crash.

I had a PSOD once when I wasn't doing anything to the disks except performing a file transfer (a write, obviously) and I would like to say that caused it as well, but I was building 2-3 VMs at the time and one of them COULD have been writing partitions.

So far I have avoided building disks with LSI Parallel and that seems to be OK.

I am deeply concerned as I use a lot of Debian and Ubuntu, and this server is going into a prod environment where uptime is important.

Some more interesting facts:

This is an Intel SR2612UR server. I have an identical ESXi install on a Super Micro which doesn't have this issue. BOTH are supposedly ESXi certified.

Both servers have an Areca array controller. The Intel uses an Areca ARC1680i and the SM uses an Areca ARC1220.

Thanks

-J

securitux
Contributor

More info. The server just crashed out overnight. No partitioning or anything being done. So this appears to be random when running Linux guests and exacerbated when performing partitioning.

It happens with either LSI Parallel or LSI SAS as the virtual disk controller. I tested both.

It also happens whether or not I use LVM... the only combination that gets me past partitioning is LSI SAS and use of LVM. Any other combo and bam, PSOD.

Just an FYI, my Areca is running the latest FW: 1.48. Drivers from Areca are the latest on their site as well: v1.0 apparently.

Has anyone else experienced this or able to help?

-J

securitux
Contributor

Just thought I would follow up, to keep others informed if they run into the same issue.

I have this one pretty much solved and am doing some final testing. It has to do with either NCQ being enabled on the Areca or the Disk Queue Length setting. Once I set Disk Queue Length to 1 (disabled) and turned NCQ off, I tried partitioning the same way I would to get a crash and it worked flawlessly. I really tried to get a PSOD and couldn't. So one of these two settings is the culprit. I assume it's NCQ, as this is a SATA thing on a SAS controller, which to me spells trouble, but I will need to validate.

I will post my final results when I have them. I have also sent an e-mail to Areca to inform them of my issue.

-J

ClarksonAdmin
Contributor

I get the same purple screen using a SuperMicro mobo with an Areca 1220 RAID card and SATA drives. Could you verify that your problem was resolved and/or offer any suggestions on how to fix it on the Areca 1220? Thanks.

securitux
Contributor

Hey. Yeah I figured it out actually, it's a setting in the 1220 that you have to change. I will check what that is tonight (as I need to prep the server for a move) and let you know. Since I made the change I have had ZERO problems with it so I'll get the info to you ASAP.

-J

securitux
Contributor

Just a heads up, my fix worked for the 1680, naturally... It might not on the 1220... I will confirm what I did tonight but as an FYI I believe I disabled NCQ support as well as set HDD Queue Depth to 1. I can't recall which one fixed the problem (or if it was both) but I will check tonight. If you want to test in the meantime go for it.

Thanks

-J

mauricev
Enthusiast

I'm running into this exact problem. Do you know which change solved it?

mauricev
Enthusiast

When I turned off NCQ, the PSODs seem to have stopped, but the controller could still stop responding, although it's clearly more stable. So I disabled HDD queue depth. That seemed to have no additional effect. When I turned off all the different caching, it brought back the PSOD.

Clearly, there's a bug in the Areca driver and turning off NCQ masks it, but only to a point.

Areca has a beta driver on the ftp site, but I don't know how to install it. esxupdate doesn't seem to allow reinstallation of an existing driver. Areca tech support is on vacation. If VMware tech support comes up with a solution, I will post.

ClarksonAdmin
Contributor

Was this on a 1220 card?  I'm not able to find the options you guys are talking about.  Where did you change these settings?

securitux
Contributor

On the 1680i, I left caching enabled. NCQ off, Queue depth off. I didn't turn either back on because it apparently isn't recommended, especially the more drives you add.

I don't want to mess with the beta driver... who knows what damage that will cause. Beta driver + ESXi + SAS controller + SATA drives... no thanks :)

-J

securitux
Contributor

The 1220 does not have the option for queue depth but it DOES for NCQ... It's just different.

If you go into System Config and look at Max SATA mode supported, you will see the options... 2 with NCQ... Use SATA300 NOT SATA300NCQ and maybe that will help you. I have NCQ on for the SuperMicro with the 1220 and I have no issues. But let me show you my settings and you can see if there's something there... I won't include the useless ones like alarms and LEDs. Those won't affect anything.

Max SATA: 300+NCQ

HDD Read Ahead: Enabled

Volume Data read ahead: Normal

HDD SMART Status: Disabled <-- maybe enabled breaks stuff?

Disk write cache mode: Auto

Then for the array (volume set):

Cache mode: Write back

TCQ: Enabled

Let us know how it goes.

-J

securitux
Contributor

Let me also state, my 1220 is at firmware 1.48 and boot ROM 1.48.

My 1680i: 1.48 / 1.48. SAS FW: 4.7.3.0

Saturnous
Enthusiast

Disabling NCQ sounds like it would hurt performance a bit. If commands to the virtual LSI Parallel controller are what cause the trouble, setting lsilogic.reflectIntrMask = TRUE in the .vmx file has helped a lot.
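
For anyone who wants to try that option, here is a minimal Python sketch of adding it to a VM's .vmx file contents. It assumes the usual key = "value" line format of .vmx files. The helper name set_vmx_option and the sample contents are made up for illustration, and the VM should be powered off before its .vmx file is edited. Whether the option actually helps with this PSOD is not verified here.

```python
def set_vmx_option(vmx_text, key, value):
    """Return vmx_text with key = "value" set exactly once.

    An existing entry for the key (matched case-insensitively) is replaced;
    otherwise the entry is appended at the end.
    """
    new_line = f'{key} = "{value}"'
    out = []
    replaced = False
    for line in vmx_text.splitlines():
        # The option name is everything before the first "=".
        name = line.split("=", 1)[0].strip()
        if name.lower() == key.lower():
            if not replaced:
                out.append(new_line)
                replaced = True
            # Any further duplicate of the same key is dropped.
        else:
            out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"


# Hypothetical .vmx contents, for illustration only.
sample = 'scsi0.virtualDev = "lsilogic"\nmemsize = "1024"\n'
result = set_vmx_option(sample, "lsilogic.reflectIntrMask", "TRUE")
print(result)
```

In practice you would read the real .vmx file, pass its text through the function, and write the result back, keeping a backup copy of the original.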

securitux
Contributor

I'd rather have a performance hit than a PSOD every few hours in a production environment. That being said, I'd be interested in seeing what that setting does and if it solves the issue. When I get a chance I'll test it.

Thanks

-J

DSTAVERT
Immortal

If this card is on the HCL (http://vmware.com/go/hcl) and you have used the VMware downloadable driver and are still having issues, I think I would check with Areca support.

-- David -- VMware Communities Moderator
mauricev
Enthusiast

The "beta" driver fixes this bug. I really don't know why they are calling it a beta. I recommend installing it and then you can turn everything back on. What you have to do first is remove the broken, built-in driver. The working driver for ESXi 4.1 is the offline-bundle.zip file inside the iso at ftp://ftp.areca.com.tw/RaidCards/AP_Drivers/VMware/ESXi_4.x/Beta/arcmsr.iso. To install it, you must supply --nosigcheck to the esxupdate command.
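
As a rough sketch of the install step, the Python snippet below only builds and prints a command line; it does not run anything. The --nosigcheck flag comes from the post above. The --bundle= option and the update subcommand are assumed from the usual esxupdate syntax on ESX/ESXi 4.x and should be checked against esxupdate --help on the host. The /tmp path and the function name are placeholders, and removing the built-in driver is not shown.

```python
import shlex


def build_esxupdate_command(bundle_path):
    """Build the argument list for installing an unsigned offline bundle.

    Assumption: esxupdate accepts --bundle=<zip> and the "update" subcommand.
    Only --nosigcheck is confirmed by the thread.
    """
    return ["esxupdate", f"--bundle={bundle_path}", "--nosigcheck", "update"]


# Placeholder location for the offline-bundle.zip extracted from arcmsr.iso.
cmd = build_esxupdate_command("/tmp/offline-bundle.zip")

# Print a copy-pasteable form of the command for review before running it.
print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the command first makes it easy to double-check the flags against the host's own help output before anything is installed.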

securitux
Contributor

Excellent! I'll try it out.

GTMagician
Contributor

I have the same PSOD problem - despite using the same beta Areca driver for several months with no issues, the problem just started happening about 2 weeks ago on a machine that has been running like a champ for almost a year (though the Areca controller is a relatively new addition).

Same Supermicro chassis and mobo, but an Areca 1210 controller. Same PSOD. However, mine will appear under moderately heavy disk activity of any sort - even simply tarring a large VM from one part of the datastore to another under the ESXi management shell, with no VMs running at all. We're on a 4-disk 1.8 TB RAID 10 array. I'm going to try disabling NCQ and see if this improves things.
