Hey all,
I am running ESXi 4.1 on an Intel server and I get PSODs whenever I perform certain disk operations. It seems to happen when I partition a disk, or once in a while when writing to disk.
It happens on both thin- and thick-provisioned disks.
It ONLY happens with LSI Parallel config, not LSI SAS.
I have attached the purple screen I get. It sucks that a Linux guest's disk operations can crash an entire ESXi host with 15 guests on it. The guest's disk operations SHOULD be irrelevant to the host.
Any help would be appreciated.
Thanks
-J
What "certain disk operations" are you performing?
When I finish configuring my partitions on a disk and write those partitions out, and the virtual disk is LSI Parallel, a crash is pretty much guaranteed.
I had a PSOD once when I wasn't doing anything to the disks except a file transfer (a write, obviously), and I'd like to say that caused it as well, but I was building 2-3 VMs at the time and one of them COULD have been writing partitions.
So far I have avoided building disks with LSI Parallel and that seems to be ok.
I am deeply concerned as I use a lot of Debian and Ubuntu, and this server is going into a prod environment where uptime is important.
Some more interesting facts:
This is an Intel SR2612UR server. I have an identical ESXi install on a Super Micro which doesn't have this issue. BOTH are supposedly ESXi certified.
Both servers have an Areca array controller. The Intel uses an Areca ARC1680i and the SM uses an Areca ARC1220.
Thanks
-J
More info. The server just crashed out overnight. No partitioning or anything being done. So this appears to be random when running Linux guests and exacerbated when performing partitioning.
It happens when using either LSI Parallel or LSI SAS as the virtual disk controller. I tested both.
It also happens whether or not I use LVM... the only combination that gets me past partitioning is LSI SAS and use of LVM. Any other combo and bam, PSOD.
Just an FYI, my Areca is running the latest FW: 1.48. Drivers from Areca are the latest on their site as well: v1.0 apparently.
Has anyone else experienced this or able to help?
-J
Just thought I would follow up, to keep others informed if they run into the same issue.
I have this one pretty much solved and am doing some final testing. It has to do with either NCQ being enabled on the Areca or the Disk Queue Length. Once I set Disk Queue Length to 1 (effectively disabled) and turned NCQ off, I tried partitioning the same way that used to get a crash, and it worked flawlessly. I really tried to get a PSOD and couldn't, so one of these two settings is the culprit. I assume it's the NCQ, as that is a SATA feature on a SAS controller, which to me spells trouble, but I will need to validate.
I will post my final results when I have them. I have also sent an e-mail to Areca to inform them of my issue.
-J
I get the same purple screen using a SuperMicro mobo with an Areca 1220 Raid card and SATA drives. Could you verify that your problem was resolved and/or offer any suggestions on how to fix it on the Areca 1220? Thanks.
Hey. Yeah I figured it out actually, it's a setting in the 1220 that you have to change. I will check what that is tonight (as I need to prep the server for a move) and let you know. Since I made the change I have had ZERO problems with it so I'll get the info to you ASAP.
-J
Just a heads up, my fix worked for the 1680, naturally... It might not on the 1220... I will confirm what I did tonight but as an FYI I believe I disabled NCQ support as well as set HDD Queue Depth to 1. I can't recall which one fixed the problem (or if it was both) but I will check tonight. If you want to test in the meantime go for it.
Thanks
-J
I'm running into this exact problem. Do you know which change solved it?
When I turned off NCQ, the PSODs seemed to stop, but the controller could still stop responding, although it's clearly more stable. So I disabled the HDD queue depth as well; that seemed to have no additional effect. When I turned off all the different caching, it brought back the PSOD.
Clearly, there's a bug in the Areca driver and turning off NCQ masks it, but only to a point.
Areca has a beta driver on their FTP site, but I don't know how to install it. esxupdate doesn't seem to allow reinstallation of an existing driver. Areca tech support is on vacation. If VMware tech support comes up with a solution, I will post it.
Was this on a 1220 card? I'm not able to find the options you guys are talking about. Where did you change these settings?
On the 1680i, I left caching enabled. NCQ off, Queue depth off. I didn't turn either back on because it apparently isn't recommended, especially the more drives you add.
I don't want to mess with the beta driver... who knows what damage that will cause. Beta driver + ESXi + SAS controller + SATA drives... no thanks
-J
The 1220 does not have the option for queue depth but it DOES for NCQ... It's just different.
If you go into System Config and look at "Max SATA Mode Supported", you will see the options, two of which include NCQ. Use SATA300, NOT SATA300+NCQ, and maybe that will help you. I have NCQ on for the SuperMicro with the 1220 and I have no issues. But let me show you my settings and you can see if there's something there. I won't include the irrelevant ones like alarms and LEDs; those won't affect anything.
Max SATA: 300+NCQ
HDD Read Ahead: Enabled
Volume Data read ahead: Normal
HDD SMART Status: Disabled <-- maybe enabled breaks stuff?
Disk write cache mode: Auto
Then for the array (volume set):
Cache mode: Write back
TCQ: Enabled
Let us know how it goes.
-J
Let me also state, my 1220 is at firmware 1.48 and boot ROM 1.48.
My 1680i: 1.48 / 1.48. SAS FW: 4.7.3.0
Disabling NCQ sounds like a performance hit to me. If commands to the virtual LSI Parallel controller cause trouble, setting lsilogic.reflectIntrMask = TRUE in the .vmx file helped a lot.
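For anyone wanting to try this, the setting goes in the guest's .vmx configuration file; this is just a sketch based on the parameter name above (edit the file while the VM is powered off, then re-register or reload the VM):

```
# In the guest's .vmx file -- values are quoted strings:
lsilogic.reflectIntrMask = "TRUE"
```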
I'd rather have a performance hit than a PSOD every few hours in a production environment. That being said, I'd be interested in seeing what that setting does and if it solves the issue. When I get a chance I'll test it.
Thanks
-J
If this card is on the HCL http://vmware.com/go/hcl and you have used the VMware downloadable driver and are still having issues I think I would check with Areca support.
The "beta" driver fixes this bug; I really don't know why they are calling it a beta. I recommend installing it, and then you can turn everything back on. What you have to do first is remove the broken built-in driver. The working driver for ESXi 4.1 is the offline-bundle.zip file inside the ISO at ftp://ftp.areca.com.tw/RaidCards/AP_Drivers/VMware/ESXi_4.x/Beta/arcmsr.iso. To install it, you must supply --nosigcheck to the esxupdate command.
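Roughly, the procedure looks like this (a sketch, not an exact transcript: the datastore path is an example, and per the above the broken built-in driver has to be removed first, with the exact removal command depending on the bulletin ID that esxupdate query reports on your host):

```
# From the ESXi 4.1 Tech Support Mode shell:

# List installed bulletins (to identify the built-in arcmsr driver entry)
esxupdate query

# Install the Areca offline bundle, skipping the signature check
# (datastore path below is an example -- adjust for your host)
esxupdate --bundle=/vmfs/volumes/datastore1/offline-bundle.zip --nosigcheck update

# Reboot so the replacement arcmsr module loads
reboot
```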
Excellent! I'll try it out.
I have the same PSOD problem. Despite using the same beta Areca driver for several months with no issues, the problem just started about two weeks ago on a machine that has been running like a champ for almost a year (though the Areca controller is a relatively new addition).
Same SuperMicro chassis and mobo, but an Areca 1210 controller. Same PSOD. However, mine will appear under moderately heavy disk activity of any sort, even simply tarring a large VM from one part of the datastore to another in the ESXi management shell, with no VMs running at all. We're on a 4-disk 1.8 TB RAID 10 array. I'm going to try disabling NCQ and see if this improves things.