srwsol

Lost access to local datastore

Hi folks:

I have ESXi 6.0 running on an Intel S2600CWTS motherboard, using the onboard LSI SAS controller with six Samsung 850 Pro SSDs in a RAID 5 configuration.  The server is about a month and a half old and worked fine for the first month, but in the last couple of weeks I've started seeing messages in the event log saying that ESXi has lost access to the datastore; most of the time access is restored about 15 seconds later.  A few times the outage lasted longer, and one of those crashed the server.  During one of the longer outages the Intel motherboard event log showed that a drive had failed and a rebuild had started, and on another occasion it showed that two different drives had failed and took them both offline, which crashed ESXi.  In that case I was able to bring both drives back online through the RAID BIOS and everything worked fine again.

I doubt I actually have drive problems: the drives are new, different drives were reported as failing each time, and no data was lost even when two drives went offline at once.  I suppose I could have a bad LSI controller or a loose cable somewhere, but if that were really the case I'd expect to see data loss.
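For what it's worth, the next time it happens I plan to check what ESXi itself thinks of the device while the motherboard is flagging drives.  These two commands should work from the ESXi shell, if I understand them correctly:

    # List storage devices and their state as ESXi sees them
    esxcli storage core device list

    # Show per-device I/O stats (including error counts) for the RAID volume
    esxcli storage core device stats get

One caveat: since the SSDs sit behind the MegaRAID controller, ESXi only sees the single logical RAID 5 volume, so I don't expect to get per-drive health data this way.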

I also noticed that the lost-access messages tend to appear in the log the instant I start a VM.  At first I thought it might be a throughput issue, figuring that a starting VM does a lot of I/O, but this happens immediately, even before the VM's BIOS screen disappears, so I don't think the VM is actually reading the disk yet.  Also, as a precaution I started migrating some VMs off the server to the old server I still had available, and no lost-access messages appeared in the event log while that was going on, even though the transfers were pushing about 50 megabytes per second to the disk.

I put the latest patch on ESXi in mid-May, about two weeks before this started, and I'm beginning to suspect the patch has something to do with it.  It looks to me as though something between ESXi and the disk controller is hanging or losing interrupts for short periods, and I'm wondering whether, when that goes on long enough, the motherboard's hardware sensors interpret it as a hardware failure and simply mark whichever drives had I/Os hung at the time as bad.
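To pin down the timing I've been planning to grep the logs over SSH rather than rely on the event list in the client.  Something along these lines should show whether the lost-access events line up exactly with VM power-ons (the exact message wording may vary by build, so adjust the patterns as needed):

    # Datastore connectivity events are logged by vobd
    grep -i "lost access" /var/log/vobd.log

    # The vmkernel log should show the underlying SCSI errors or latency warnings
    grep -iE "performance has deteriorated|H:0x" /var/log/vmkernel.log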

Unfortunately I'm out of town right now, so I didn't want to do any more to the server remotely than I had to, out of fear that I could leave it unable to come back up.  I've moved a couple of the more critical VMs to the old server, which I was able to boot and transfer to remotely.  I also noticed that this issue occurred more frequently when I started the vCenter Server appliance than any other VM (it happens sometimes with the others, but it happened every time I started the vCenter appliance), so that was the first one I moved back to the old server.  Interestingly, no errors were logged while I transferred its files, so the issue isn't that ESXi had trouble reading the VM's VMDK files when I started it up.

I also wanted to ask whether there are any VIBs for the LSI MegaRAID controller on the S2600CWTS motherboard that would let me manage the RAID controller without taking the server down and going through the BIOS setup screens, similar to how you can access Dell's RAID controllers on a running ESXi host via an add-on VIB.  So far I haven't found one.
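The closest thing I've found so far is Avago/LSI's StorCLI, which they distribute as an ESXi VIB.  I haven't installed it yet, so treat this as an untested sketch; the VIB filename and install path below are what I'd expect from their download page, not something I've verified against this board:

    # Install the StorCLI VIB (actual filename will vary by version)
    esxcli software vib install -v /tmp/vmware-esx-storcli-1.17.08.vib

    # Then query the first controller from the ESXi shell
    /opt/lsi/storcli/storcli /c0 show all

If anyone has confirmed that StorCLI works with the onboard controller on this motherboard, I'd appreciate hearing it.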

My plan when I get home is to run a consistency check against the RAID 5 array through the controller, and then run the ESXi VOMA utility against the datastore to see if anything is wrong there.  If not, I guess I'll back out the May patch.
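Roughly what I have in mind, assuming StorCLI gets installed as above (the /c0/v0 and naa. identifiers below are placeholders for my controller, volume, and device):

    # Kick off a RAID consistency check on the first virtual drive
    /opt/lsi/storcli/storcli /c0/v0 start cc

    # Find the device backing the datastore
    esxcli storage vmfs extent list

    # Check the VMFS metadata (VMs have to be powered off first)
    voma -m vmfs -f check -d /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx:1

If VOMA comes back clean, my understanding is that I can revert to the pre-patch build by pressing Shift+R at boot and selecting the previous image, since ESXi keeps the prior boot bank around after an upgrade.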

Thoughts or suggestions welcome.