srwsol
Hot Shot

Lost access to local datastore

Hi folks:

I have ESXi 6.0 running on an Intel S2600CWTS motherboard, using the onboard LSI SAS controller and six Samsung 850 Pro SSDs in a RAID 5 configuration.  This server is about a month and a half old.  It worked fine for the first month, but in the last couple of weeks I've started seeing messages in the event log saying that ESXi has lost access to the datastore; most of the time access is restored about 15 seconds later.  A few times it took longer, and one of those occasions crashed the server.  On the longer outages the Intel motherboard event log showed that a drive had failed and a rebuild had started, and on another occasion it showed that two different drives had failed; it took them both offline, which caused ESXi to crash.  In that case I was able to bring both drives back online through the RAID BIOS, and everything worked fine again.  I doubt that I've got drive problems, because these drives are new, because different drives were reported as failing, and because no data was lost even when two drives went offline at the same time.  I suppose I could have a bad LSI controller, or a cable may have come loose somewhere, but I would expect data loss if that were really the case.

I also noticed that the lost access messages tend to appear in the log the instant I start a VM.  At first I thought it might be a throughput thing, figuring that a starting VM does a lot of I/O, but this happens immediately, even before the VM BIOS screen disappears, so I don't think the VM is actually reading the disk yet.  Also, as a precaution I started migrating some VMs off the server to the old server that I still had available, and there were no lost access messages in the event log while that was going on, even though about 50 megabytes per second of I/O was hitting the disk during the transfers.  I put the latest patch on ESXi in mid-May, about two weeks before this started, and I'm beginning to suspect that the patch may have something to do with it.  It looks to me as if something is happening between ESXi and the disk controller that causes it to hang or lose interrupts for a short period of time.  I'm wondering whether, if that goes on long enough, the hardware sensors on the motherboard interpret it as some sort of hardware failure and simply mark as bad whichever drives had I/Os hung at the time.

Unfortunately I'm out of town right now, so I didn't want to do any more to the server remotely than I had to, out of fear that I could cause it not to come back up.  I've moved a couple of the more critical VMs to the old server, which I was able to boot remotely.  I also noticed that this issue tended to occur more frequently when I started up the vCenter Server appliance than with any other VM (it happens sometimes with the others, but it happened every time I tried to start the vCenter appliance).  Therefore that was the first one I moved back to the old server.  Interestingly enough, no errors were logged when I transferred the files, so the issue isn't that ESXi had trouble reading the VM's vmdk file when I started it up.

I also wanted to ask if there are any VIBs for the LSI MegaRAID controller on the S2600CWTS motherboard that I could install to access the RAID controller without having to take the server down and do it from the BIOS setup screens, similar to how you can access Dell's RAID controller while ESXi is running via an add-on VIB.  So far I haven't found one.

My intention when I get home to the server is to run an integrity check against the RAID 5 array through the controller, and then use the VOMA ESXi utility on the datastore to see if something is wrong there.  If not, then I guess I'll back out the May patch.

Thoughts or suggestions welcome.

Accepted Solutions
srwsol
Hot Shot

Success!!  I finally figured out what the problem was and got it corrected.  Although the latest driver VIB from LSI was installed, for some reason it did not automatically replace the default ESXi driver, lsi_mr3.  I hadn't noticed that, even though the VIB installed properly, the controller was still using the original driver.  This could be seen by running "esxcfg-scsidevs -a", which still showed "vmhba1  lsi_mr3           link-n/a  sas.5001e67ca647e000                    (0000:0a:00.0) LSI MegaRAID SAS Fury Controller" .  The second field, after the adapter name (vmhba1), is the driver currently in use.  The driver that needed to be used was:  "vmhba1    megaraid_sas    link-n/a  unknown.vmhba   (0000:03:00.0) Avago (LSI / Symbios Logic) MegaRAID SAS Invader Controller"  .

To fix this problem I had to disable the lsi_mr3 driver so that ESXi was forced to use the other one.  The procedure to do that is as follows:

1).  Verify that the new driver you want ESXi to use is already installed.  I'm assuming that bad things would happen if you disabled the only driver that could operate your disk controller.  Use this command to see the installed drivers:  "esxcli software vib list"  and you should get something similar to the following:

[root@intelserver:~] esxcli software vib list
Name                           Version                               Vendor  Acceptance Level  Install Date
-----------------------------  ------------------------------------  ------  ----------------  ------------
scsi-megaraid-sas              6.608.11.00-1OEM.600.0.0.2494585      Avago   VMwareCertified   2015-07-19
lsi-mr3                        6.606.10.00-1OEM.550.0.0.1391871      LSI     VMwareCertified   2015-06-19
mtip32xx-native                3.8.5-1vmw.600.0.0.2494585            VMWARE  VMwareCertified   2015-04-28
ata-pata-amd                   0.3.10-3vmw.600.0.0.2494585           VMware  VMwareCertified   2015-04-28
ata-pata-atiixp                0.4.6-4vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
ata-pata-cmd64x                0.2.5-3vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
ata-pata-hpt3x2n               0.3.4-3vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
ata-pata-pdc2027x              1.0-3vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
ata-pata-serverworks           0.4.3-3vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
ata-pata-sil680                0.4.8-3vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
ata-pata-via                   0.3.3-2vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
block-cciss                    3.6.14-10vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
cpu-microcode                  6.0.0-0.0.2494585                     VMware  VMwareCertified   2015-04-28
ehci-ehci-hcd                  1.0-3vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
elxnet                         10.2.309.6v-1vmw.600.0.0.2494585      VMware  VMwareCertified   2015-04-28
emulex-esx-elxnetcli           10.2.309.6v-0.0.2494585               VMware  VMwareCertified   2015-04-28
esx-base                       6.0.0-0.11.2809209                    VMware  VMwareCertified   2015-07-21
esx-dvfilter-generic-fastpath  6.0.0-0.0.2494585                     VMware  VMwareCertified   2015-04-28
esx-tboot                      6.0.0-0.0.2494585                     VMware  VMwareCertified   2015-04-28
esx-xserver                    6.0.0-0.0.2494585                     VMware  VMwareCertified   2015-04-28
ima-qla4xxx                    2.02.18-1vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
ipmi-ipmi-devintf              39.1-4vmw.600.0.0.2494585             VMware  VMwareCertified   2015-04-28
ipmi-ipmi-msghandler           39.1-4vmw.600.0.0.2494585             VMware  VMwareCertified   2015-04-28
ipmi-ipmi-si-drv               39.1-4vmw.600.0.0.2494585             VMware  VMwareCertified   2015-04-28
lpfc                           10.2.309.8-2vmw.600.0.0.2494585       VMware  VMwareCertified   2015-04-28
lsi-msgpt3                     06.255.12.00-7vmw.600.0.0.2494585     VMware  VMwareCertified   2015-04-28
lsu-hp-hpsa-plugin             1.0.0-1vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
lsu-lsi-lsi-mr3-plugin         1.0.0-2vmw.600.0.11.2809209           VMware  VMwareCertified   2015-07-21
lsu-lsi-lsi-msgpt3-plugin      1.0.0-1vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
lsu-lsi-megaraid-sas-plugin    1.0.0-2vmw.600.0.11.2809209           VMware  VMwareCertified   2015-07-21
lsu-lsi-mpt2sas-plugin         1.0.0-1vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
lsu-lsi-mptsas-plugin          1.0.0-1vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
misc-cnic-register             1.78.75.v60.7-1vmw.600.0.0.2494585    VMware  VMwareCertified   2015-04-28
misc-drivers                   6.0.0-0.11.2809209                    VMware  VMwareCertified   2015-07-21
net-bnx2                       2.2.4f.v60.10-1vmw.600.0.0.2494585    VMware  VMwareCertified   2015-04-28
net-bnx2x                      1.78.80.v60.12-1vmw.600.0.0.2494585   VMware  VMwareCertified   2015-04-28
net-cnic                       1.78.76.v60.13-2vmw.600.0.0.2494585   VMware  VMwareCertified   2015-04-28
net-e1000                      8.0.3.1-5vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
net-e1000e                     2.5.4-6vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
net-enic                       2.1.2.38-2vmw.600.0.0.2494585         VMware  VMwareCertified   2015-04-28
net-forcedeth                  0.61-2vmw.600.0.0.2494585             VMware  VMwareCertified   2015-04-28
net-igb                        5.0.5.1.1-5vmw.600.0.0.2494585        VMware  VMwareCertified   2015-04-28
net-ixgbe                      3.7.13.7.14iov-20vmw.600.0.0.2494585  VMware  VMwareCertified   2015-04-28
net-mlx4-core                  1.9.7.0-1vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
net-mlx4-en                    1.9.7.0-1vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
net-nx-nic                     5.0.621-5vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
net-tg3                        3.131d.v60.4-1vmw.600.0.0.2494585     VMware  VMwareCertified   2015-04-28
net-vmxnet3                    1.1.3.0-3vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
nmlx4-core                     3.0.0.0-1vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
nmlx4-en                       3.0.0.0-1vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
nmlx4-rdma                     3.0.0.0-1vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
nvme                           1.0e.0.35-1vmw.600.0.0.2494585        VMware  VMwareCertified   2015-04-28
ohci-usb-ohci                  1.0-3vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
qlnativefc                     2.0.12.0-5vmw.600.0.0.2494585         VMware  VMwareCertified   2015-04-28
rste                           2.0.2.0088-4vmw.600.0.0.2494585       VMware  VMwareCertified   2015-04-28
sata-ahci                      3.0-21vmw.600.0.11.2809209            VMware  VMwareCertified   2015-07-21
sata-ata-piix                  2.12-10vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
sata-sata-nv                   3.5-4vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
sata-sata-promise              2.12-3vmw.600.0.0.2494585             VMware  VMwareCertified   2015-04-28
sata-sata-sil24                1.1-1vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
sata-sata-sil                  2.3-4vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
sata-sata-svw                  2.3-3vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
scsi-aacraid                   1.1.5.1-9vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
scsi-adp94xx                   1.0.8.12-6vmw.600.0.0.2494585         VMware  VMwareCertified   2015-04-28
scsi-aic79xx                   3.1-5vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
scsi-bnx2fc                    1.78.78.v60.8-1vmw.600.0.0.2494585    VMware  VMwareCertified   2015-04-28
scsi-bnx2i                     2.78.76.v60.8-1vmw.600.0.11.2809209   VMware  VMwareCertified   2015-07-21
scsi-fnic                      1.5.0.45-3vmw.600.0.0.2494585         VMware  VMwareCertified   2015-04-28
scsi-hpsa                      6.0.0.44-4vmw.600.0.0.2494585         VMware  VMwareCertified   2015-04-28
scsi-ips                       7.12.05-4vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
scsi-megaraid-mbox             2.20.5.1-6vmw.600.0.0.2494585         VMware  VMwareCertified   2015-04-28
scsi-megaraid2                 2.00.4-9vmw.600.0.0.2494585           VMware  VMwareCertified   2015-04-28
scsi-mpt2sas                   19.00.00.00-1vmw.600.0.0.2494585      VMware  VMwareCertified   2015-04-28
scsi-mptsas                    4.23.01.00-9vmw.600.0.0.2494585       VMware  VMwareCertified   2015-04-28
scsi-mptspi                    4.23.01.00-9vmw.600.0.0.2494585       VMware  VMwareCertified   2015-04-28
scsi-qla4xxx                   5.01.03.2-7vmw.600.0.0.2494585        VMware  VMwareCertified   2015-04-28
uhci-usb-uhci                  1.0-3vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
vmware-esx-dvfilter-maclearn   1.00-1.00                             VMware  VMwareCertified   2015-05-12
xhci-xhci                      1.0-2vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
tools-light                    6.0.0-0.11.2809209                    VMware  VMwareCertified   2015-07-21
[root@intelserver:~]

Verify that the scsi-megaraid-sas driver is installed prior to proceeding. 

2).  To disable the lsi_mr3 driver do the following:  "esxcli system module set --enabled=false --module=lsi_mr3" 

3).  Reboot

After rebooting, run "esxcfg-scsidevs -a" to verify that the megaraid_sas driver is assigned to your disk controller.
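For anyone who would rather script that last check than eyeball it, here is a minimal sketch in Python.  It assumes the output of "esxcfg-scsidevs -a" has already been captured as text, and that the driver is always the second whitespace-separated field on each vmhba row, as in the listings in this thread.  The function name adapter_drivers is made up for illustration.

```python
def adapter_drivers(listing):
    """Map each adapter name (e.g. 'vmhba1') to the driver in the second column."""
    drivers = {}
    for line in listing.splitlines():
        fields = line.split()
        # Adapter rows start with 'vmhba<N>'; shell prompts and blank lines are skipped.
        if len(fields) >= 2 and fields[0].startswith("vmhba"):
            drivers[fields[0]] = fields[1]
    return drivers


# Two rows copied from the listing posted in this thread.
SAMPLE = """\
[root@intelserver:~] esxcfg-scsidevs -a
vmhba0  ahci              link-n/a  sata.vmhba0                             (0000:00:1f.2) Intel Corporation Wellsburg AHCI Controller
vmhba1  lsi_mr3           link-n/a  sas.5001e67ca647e000                    (0000:0a:00.0) LSI MegaRAID SAS Fury Controller
"""

for adapter, driver in sorted(adapter_drivers(SAMPLE).items()):
    print(adapter, driver)
```

After the reboot, the entry for the RAID adapter should read megaraid_sas rather than lsi_mr3.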

I don't know how ESXi decides the order of precedence if two or more drivers are applicable to a single hardware device, or if the install process of the new LSI driver is supposed to disable or remove the default lsi_mr3 driver and simply failed to do so.  I also don't know what's going to happen when I put on the next ESXi patch, specifically whether the lsi_mr3 driver will once again take precedence.  If anyone knows how the driver install process is supposed to work in this situation, please let me know.  I am happy though that I finally got this fixed, as the server was seriously gimped by having the datastore become inaccessible for 5-10 seconds every time a VM started or stopped, and the old driver on at least one occasion did something bad enough to fool the controller into thinking that one of the disks had gone bad, which started a RAID resync.
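Since it's an open question whether a later patch will undo this, one option is to re-check the VIB list after each patch.  The sketch below is only an illustration in Python: it assumes the text of "esxcli software vib list" has been saved somewhere, and that every data row has five columns ending in a YYYY-MM-DD install date, as in the listing above.  The function name installed_vibs is hypothetical.  Note that the VIB is named lsi-mr3 while the kernel module is lsi_mr3.

```python
import re

# Data rows end in an install date such as 2015-07-19; header and prompt lines do not.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def installed_vibs(listing):
    """Return {vib_name: version} parsed from 'esxcli software vib list' text."""
    vibs = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) == 5 and DATE.fullmatch(fields[4]):
            vibs[fields[0]] = fields[1]
    return vibs


# A few rows copied from the listing in this thread.
SAMPLE = """\
[root@intelserver:~] esxcli software vib list
Name                           Version                               Vendor  Acceptance Level  Install Date
-----------------------------  ------------------------------------  ------  ----------------  ------------
scsi-megaraid-sas              6.608.11.00-1OEM.600.0.0.2494585      Avago   VMwareCertified   2015-07-19
lsi-mr3                        6.606.10.00-1OEM.550.0.0.1391871      LSI     VMwareCertified   2015-06-19
esx-base                       6.0.0-0.11.2809209                    VMware  VMwareCertified   2015-07-21
"""

vibs = installed_vibs(SAMPLE)
if "scsi-megaraid-sas" not in vibs:
    print("scsi-megaraid-sas is missing; do not disable lsi_mr3")
else:
    print("scsi-megaraid-sas", vibs["scsi-megaraid-sas"])
```

This only confirms that the VIB is present.  Which driver actually claimed the controller still has to be read from "esxcfg-scsidevs -a".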

16 Replies
cykVM
Expert

Hi,

I guess you already checked for BIOS/firmware updates for that Intel board? Anyway, I could not find exactly that model on VMware's HCL; only the S2600CWT is listed, as compatible up to VMware 5.5 U2: VMware Compatibility Guide: System Search

The onboard LSI SAS3008 should generally work with the driver provided by VMware (lsi_msgpt3 ...), but you may want to check whether you are running exactly the version listed in the HCL (06.255.12.00-7vmw): VMware Compatibility Guide: I/O Device Search

Also check if the Samsung SSDs are all on the same firmware level.

For managing the onboard LSI controller from inside VMware, there is probably little to no hope that LSI's CIM provider together with MegaRAID Storage Manager will work, as those are not officially available from LSI for VMware 6 yet. I would stick with BIOS control for now.

cykVM

srwsol
Hot Shot

Mine shows this:

[root@intelserver:~] esxcfg-scsidevs -a
vmhba38 ahci              link-n/a  sata.vmhba38                            (0000:00:1f.2) Intel Corporation Wellsburg AHCI Controller
vmhba39 ahci              link-n/a  sata.vmhba39                            (0000:00:1f.2) Intel Corporation Wellsburg AHCI Controller
vmhba0  ahci              link-n/a  sata.vmhba0                             (0000:00:1f.2) Intel Corporation Wellsburg AHCI Controller
vmhba1  lsi_mr3           link-n/a  sas.5001e67ca647e000                    (0000:0a:00.0) LSI MegaRAID SAS Fury Controller
vmhba2  ahci              link-n/a  sata.vmhba2                             (0000:00:11.4) Intel Corporation Wellsburg AHCI Controller
vmhba40 ahci              link-n/a  sata.vmhba40                            (0000:00:1f.2) Intel Corporation Wellsburg AHCI Controller
vmhba41 ahci              link-n/a  sata.vmhba41                            (0000:00:1f.2) Intel Corporation Wellsburg AHCI Controller
vmhba34 ahci              link-n/a  sata.vmhba34                            (0000:00:11.4) Intel Corporation Wellsburg AHCI Controller
vmhba35 ahci              link-n/a  sata.vmhba35                            (0000:00:11.4) Intel Corporation Wellsburg AHCI Controller
vmhba36 ahci              link-n/a  sata.vmhba36                            (0000:00:11.4) Intel Corporation Wellsburg AHCI Controller
vmhba37 ahci              link-n/a  sata.vmhba37                            (0000:00:1f.2) Intel Corporation Wellsburg AHCI Controller
[root@intelserver:~] vmkload_mod -s lsi_mr3
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/lsi_mr3
Version: 6.605.08.00-6vmw.600.0.0.2494585
Build Type: release
License: GPLv2
Required name-spaces:
  com.vmware.vmkapi#v2_3_0_0
Parameters:
  mfiDumpFailedCmd: int
    Hex dump of failed command in driver log
  max_sectors: int
    Maximum number of sectors per IO command
[root@intelserver:~]


I didn't do anything special when I installed ESXi.  That's just the driver that was chosen.  Is there someplace I can look up exactly which hardware the driver is meant for?  The controller is an LSI SAS3008 SAS 12G.

Edit:

More info on the controller:

[root@intelserver:~] esxcfg-scsidevs -l
naa.6001e67ca647e0001cd1cdb8275eb90e
   Device Type: Direct-Access
   Size: 4878040 MB
   Display Name: Intel Serial Attached SCSI Disk (naa.6001e67ca647e0001cd1cdb8275eb90e)
   Multipath Plugin: NMP
   Console Device: /vmfs/devices/disks/naa.6001e67ca647e0001cd1cdb8275eb90e
   Devfs Path: /vmfs/devices/disks/naa.6001e67ca647e0001cd1cdb8275eb90e
   Vendor: Intel     Model: RS3YC             Revis: 4.26
   SCSI Level: 5  Is Pseudo: false Status: degraded
   Is RDM Capable: true  Is Removable: false
   Is Local: false Is SSD: true
   Other Names:
      vml.02000000006001e67ca647e0001cd1cdb8275eb90e525333594320
   VAAI Status: unsupported
[root@intelserver:~]


EDIT #2


After more searching and keying in the device parameters, I did find this device in the hardware compatibility list, but it lists different drivers than the one that was picked:

VMware Compatibility Guide: I/O Device Search

However, I couldn't get ESXi to show the controller's firmware version, and looking it up any other way means rebooting the server and bringing up the RAID BIOS screen, so I'm not 100% sure of that part.  For ESXi 6 the list showed only one firmware version, though, and I did apply the latest BIOS and RAID firmware from Intel when I built this server about 60 days ago.  If I install one of these drivers, will it just be picked up, or do I need to do something to force the driver to be used with the controller?

cykVM
Expert

In the first place I meant the BIOS version on the motherboard: is that the latest available from Intel?

As the LSI controller is onboard, its firmware will get updated with the BIOS, I guess.

Besides, the output of "esxcfg-scsidevs -l" shows the RAID volume as "degraded" (SCSI Level: 5  Is Pseudo: false Status: degraded); this is something you should investigate first.

It might also not be a good idea to run RAID 5 with SSDs in general.



srwsol
Hot Shot

I know that it shows as degraded in ESXi, but in the RAID BIOS it shows as fine, although I'm running a consistency check now just to make sure.  I wasn't sure whether "degraded" meant that there are only 6Gb drives connected to the controller rather than 12Gb drives.  I've read quite a bit about the pros and cons of running RAID 5 with SSDs versus, say, RAID 10, but as this server isn't going to be under huge I/O loads I figured that the extra space from RAID 5, and the low likelihood that multiple drives would fail at once, made the difference.  I did notice that as I took the VMs down to reboot the server and go into the RAID BIOS, I got the lost access message again.  I don't know what it is about starting and stopping VMs versus transferring files.  (The files I transferred off the server were for one of the VMs that's causing this message, so if it were a bad block on a disk or an error in the indexes in the datastore, I should have run into it while copying the files off.)  After this finishes, assuming it finds no problem, I may try VOMA to check the logical consistency of the datastore.

Regarding the Intel server board BIOS, it is the latest within the last 60 days.  I also found that the FW in the RAID controller doesn't exactly match any of the FW versions listed on the driver page.  I'm tempted to just install the latest driver version, assuming that I don't find any other issue.

srwsol
Hot Shot

There were no errors in the RAID BIOS integrity check of the array, and the VOMA command showed nothing wrong with the datastore.  However, I still got the lost access message when I started up one of the VMs (it always happens on the same one when I start or stop it, and occasionally at other times), and the esxcfg-scsidevs -l command still shows the array as degraded even though the RAID BIOS says it's fine.
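To keep an eye on that "degraded" flag without reading the whole listing each time, something like the following Python sketch could pull it out of saved "esxcfg-scsidevs -l" output.  It assumes the layout seen in this thread: an unindented device identifier followed by indented detail lines, with the status on the "SCSI Level" line.  The function name device_status is made up for illustration.

```python
import re


def device_status(listing):
    """Map each device identifier to the Status value on its 'SCSI Level' line."""
    statuses = {}
    current = None
    for line in listing.splitlines():
        if not line.strip() or line.startswith("["):
            continue  # skip blank lines and shell prompts
        if not line[0].isspace():
            current = line.strip()  # an unindented line starts a new device block
            continue
        # Only the 'SCSI Level' line is read, so 'VAAI Status:' is never picked up.
        if current is not None and line.strip().startswith("SCSI Level"):
            match = re.search(r"Status:\s*(\S+)", line)
            if match:
                statuses[current] = match.group(1)
    return statuses


# Trimmed copy of the listing posted in this thread.
SAMPLE = """\
[root@intelserver:~] esxcfg-scsidevs -l
naa.6001e67ca647e0001cd1cdb8275eb90e
   Device Type: Direct-Access
   Size: 4878040 MB
   SCSI Level: 5  Is Pseudo: false Status: degraded
   Is Local: false Is SSD: true
   VAAI Status: unsupported
[root@intelserver:~]
"""

for device, status in device_status(SAMPLE).items():
    print(device, status)
```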

cykVM
Expert

I guess there is no newer firmware for the controller than the one you have.

Give the recommended driver listed in "Edit #2" above a try, see: VMware vSphere 5: Private Cloud Computing, Server and Data Center Virtualization

This should replace the generic VMware driver on installation.

Regarding the "degraded" message, I guess that's just wrong information the VMware driver gets from the controller's firmware, and it should probably disappear after the recommended driver is installed.

To find out the firmware version from the CLI inside VMware you may follow: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=102720...

srwsol
Hot Shot

Thanks for your assistance so far.  Are you saying that if I apply the above driver, it will automatically replace the existing driver?  Also, if I do that, what happens with the recovery option on the bootup screen?  If I understand things correctly, I can go back one patch or change by pressing Shift+R during the boot sequence.  I was considering doing that to go back before the May patch to see if it still happens.  If I upgrade the driver, will the recovery sequence on bootup just go back to before the driver, or is it tracking only the changes to the base ESXi code?  I'd like to keep the option of backing out the patch, although I suppose I could reapply the base ESXi 6 installation, as long as the update process allows me to "update" to a prior version.  I haven't tried any of these things before, so I'm not sure what the limitations of the software update process are.

cykVM
Expert

The storage driver change will also change the bootbank, I guess.

So I would first do the patch/update removal test with Shift+R, and afterwards swap the driver if the issues still occur.

There is no real downgrade option, as it's not possible to downgrade by "upgrading" to a previous version of ESXi. The only thing that works is doing a fresh install over the existing ESXi install and afterwards re-registering your VMs to the inventory.

srwsol
Hot Shot

I checked into some more things and found that I have the April patch applied, not the May patch.  The description of the May patch indicates that it fixed just one bug, which had nothing to do with my issue, so instead I went ahead and upgraded the LSI driver to the current version, but it had no effect.  I still got the lost access message when I started up a couple of the VMs, and the RAID array still appears degraded to the ESXi esxcfg-scsidevs -l command even though the RAID BIOS says it's fine.

I'm beginning to suspect that there is some incompatibility here.  I wonder whether they tested RAID 5 with this controller, and whether that's the issue.  With this motherboard you have to buy an extra key dongle and attach it to the motherboard to enable RAID 5 on the LSI controller, and I wonder if they skipped that in their testing.  That's also true of the other Intel motherboard in the S2600CW series that comes with the LSI controller (the difference between the two boards is that my model also comes with 10Gb NICs, where the other model with the LSI controller just has 1Gb NICs).

I'm trying to think of what happens at VM startup and shutdown that would cause this issue and doesn't happen at other times, as I don't get the message when running VMs are doing lots of I/O to the controller.  I've transferred gigabytes of data between VMs, both of which are hitting the same datastore, and I don't get the message, which leads me to believe it's not the rate of I/Os or even the rate of writes that's doing this.  The only thing I can think of is that VM startup and shutdown involve the lock files that ESXi uses to know that a VM is active, and I'm wondering if ESXi does something different I/O-wise when it's creating, manipulating, or deleting these lock files, such as trying to quiesce all I/O to the datastore while this is happening.  From what I can tell my controller doesn't do write caching because there is no battery option, but I didn't think that would be an issue with SSDs.  Perhaps it is an issue if ESXi is doing something strange at VM startup and shutdown that holds up I/Os to the controller for a short period of time.  This is just my guess, however.

There is another RAID option on this motherboard, as I could use the software RAID capability present on all the S2600CW motherboards.  That would require re-cabling and probably a reformat of the disks, but if I can't get this figured out I may have to do it, as the configuration really isn't stable as is.  It also means I will have wasted my money in buying this version of the motherboard with the LSI controller.  <sigh>

If you can think of anything else I'm certainly all ears, and I do thank you for your assistance.

Here's the current info:

[root@intelserver:~] esxcfg-scsidevs -l
naa.6001e67ca647e0001cd1cdb8275eb90e
   Device Type: Direct-Access
   Size: 4878040 MB
   Display Name: Intel Serial Attached SCSI Disk (naa.6001e67ca647e0001cd1cdb8275eb90e)
   Multipath Plugin: NMP
   Console Device: /vmfs/devices/disks/naa.6001e67ca647e0001cd1cdb8275eb90e
   Devfs Path: /vmfs/devices/disks/naa.6001e67ca647e0001cd1cdb8275eb90e
   Vendor: Intel     Model: RS3YC             Revis: 4.26
   SCSI Level: 5  Is Pseudo: false Status: degraded
   Is RDM Capable: true  Is Removable: false
   Is Local: false Is SSD: true
   Other Names:
      vml.02000000006001e67ca647e0001cd1cdb8275eb90e525333594320
   VAAI Status: unsupported
[root@intelserver:~] vmkload_mod -s lsi_mr3
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/lsi_mr3
Version: 6.606.10.00-1OEM.550.0.0.1391871
License: GPLv2
Required name-spaces:
  com.vmware.vmkapi#v2_2_0_0
Parameters:
  lb_pending_cmds: int
    Change raid-1 load balancing outstanding threshold.Valid Values are 1-128. Default: 4
  mfiDumpFailedCmd: int
    Hex dump of failed command in driver log
  max_sectors: int
    Maximum number of sectors per IO command
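The "Version:" line is the part of the vmkload_mod output that changed after the driver upgrade.  As a rough sketch, this Python snippet extracts it so that a before/after comparison is easy to script.  It assumes the output has been saved as text, and the function name module_version is hypothetical.

```python
def module_version(info):
    """Return the text after 'Version:' in 'vmkload_mod -s <module>' output, or None."""
    for line in info.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "Version":
            return value.strip()
    return None


# Excerpts of the two listings posted in this thread, before and after the upgrade.
BEFORE = """\
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/lsi_mr3
Version: 6.605.08.00-6vmw.600.0.0.2494585
Build Type: release
"""
AFTER = """\
vmkload_mod module information
input file: /usr/lib/vmware/vmkmod/lsi_mr3
Version: 6.606.10.00-1OEM.550.0.0.1391871
License: GPLv2
"""

print("before:", module_version(BEFORE))
print("after: ", module_version(AFTER))
print("changed:", module_version(BEFORE) != module_version(AFTER))
```

A changed version string only shows that the module file was replaced.  It says nothing about whether that module is the right driver for the controller.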

cykVM
Expert

Just a short note: the "other" onboard software RAID won't be supported by VMware. You will only have the single, separate disks presented to VMware, with no RAID volume.

As for the battery- or flash-backed write cache: at the least it greatly improves performance, and of course if the server hits a power failure the battery/flash steps in so that the cached data can be written back to the RAID.

srwsol
Hot Shot

Without the battery the controller won't do cached writes, and I'm wondering if that is somehow related to the behavior I'm seeing.  I wasn't aware that the other controller wouldn't work with ESXi in RAID mode.  I'll have to do some more research on that.  I suppose the last resort would be to buy a controller card, as I don't think there is an option to add a battery to the onboard LSI controller.  Unfortunately, this is all speculation, and I would hate to spend money on a new controller without knowing for sure that this whole line of reasoning is accurate.  I suppose I could also pay for an incident with VMware (I have the Essentials product, as I'm a consultant and don't have a whole datacenter full of equipment) and see if I can get them to work this out.

cykVM
Expert

Not quite sure whether opening a case with VMware support would lead to a solution; they might go the "not (fully) supported hardware in use" route. ;)

cykVM
Expert

By the way, here's another discussion with the lsi_mr3 driver in use and the same "lost access" errors, on VMware 5: Re: lost access to volume

srwsol
Hot Shot

Time for an update.  I bought a LSI 9361-8i controller and I bought the capacitor battery backup so that I could enable write caching.  Unfortunately this changed nothing.  I also updated ESXi to the July patch level and that changed nothing either.  I also tried the latest driver from LSI (which the ESXi patch process promptly replaced when I applied it), and that didn't do anything either. 

I'm beginning to think this is some sort of software problem related to locking of the files, as this only happens when certain VMs start and stop, and only at the moment when they first start (before the console display starts), and when they stop and transition from running to inactive.  The two VMs that it always happens to are a Windows SBS 2008 VM and the VCenter Server appliance VM.   The only things I can see in common with those two is that they both have multiple VCPUs assigned to them and (and maybe this is important) they have quite a number of virtual disks in the configuration. 

Somehow when these VMs start and stop it's causing all I/O to the datastore to stop such that the watchdog timer goes off and causes the lost access process to begin.  I actually think that whatever is being done is happening at the controller level (i.e. ESXi has issued some sort of command to the controller that is hanging things up) because when I had both SBS 2008 and the VCenter Server appliance on this machine at the same time and they both started at once, it caused a big enough problem that controller threw a disk error and started a rebuild, which happened a couple of times until I moved the VCenter Server VM to another machine.  I've sort of ruled out a hardware problem because the rebuild was occurring on different disks in the array and these are new SSD drives.  Also, there are never any errors or lost access messages if I download the files for both VMs from the server, like one would expect if there was a disk problem and it was having trouble reading data.  Also there are never any errors or lost access messages no matter how hard I stress the array with reads and writes.  It's only at VM startup and shutdown and only for certain VMs that this happens.

I'm stumped at this point, and my wallet has been drained from replacing parts.  Not a good situation :(

srwsol
Hot Shot

I found some interesting log entries that do show the issue is locking, and I also see a whole bunch of SCSI commands being retried and rejected or reset.  Unfortunately I'm not a SCSI expert, so I can't easily decode what's going on here or why.  I'm attaching a link that shows the relevant portions of the vmkernel log and the hostd log.  The relevant time period starts at about 06:17:25, when I started the SBS2008 VM.  About 9 seconds later the lost access message shows up in the hostd log, and in the meantime the vmkernel log shows a whole bunch of SCSI commands hanging and being reset.  Maybe somebody more fluent in the SCSI protocol can give me some direction as to what's going on.  I'm still very skeptical that it's a hardware problem such as a bad disk, because of everything I mentioned before, but maybe some SCSI command is being issued here that causes the controller a problem.

https://drive.google.com/folderview?id=0B_1WZam8s_8BfllFLXNiUW9oOHZpS3lkWk1xNzhXQUppNFZ0THE1MTY2UDhp...

srwsol
Hot Shot

Success!!  I finally figured out what the problem was and got it corrected.  Although the latest driver VIB from LSI was installed, for some reason it did not automatically replace the default ESXi driver, lsi_mr3.  I hadn't noticed that even though the LSI driver package installed properly, the controller was still using the original driver.  This can be seen by running an "esxcfg-scsidev -a" command, which still showed "vmhba1  lsi_mr3           link-n/a  sas.5001e67ca647e000                    (0000:0a:00.0) LSI MegaRAID SAS Fury Controller" .  The second field, after the adapter name (vmhba1), is the driver currently in use.  The driver that needed to be used was megaraid_sas:  "vmhba1    megaraid_sas    link-n/a  unknown.vmhba   (0000:03:00.0) Avago (LSI / Symbios Logic) MegaRAID SAS Invader Controller"  .
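
For anyone who wants to check this without eyeballing the whole line, here is a minimal sketch that pulls out the second field for one adapter.  The function name driver_for_adapter is made up for illustration; on the host you would pipe "esxcfg-scsidev -a" into it, while here the line quoted above is fed in directly so it can be tried anywhere.

```shell
# Hypothetical helper: print the driver bound to one adapter, reading
# "esxcfg-scsidev -a"-style lines on stdin (field 1 = adapter, field 2 = driver).
driver_for_adapter() {
    awk -v hba="$1" '$1 == hba { print $2 }'
}

# On the host this would be:  esxcfg-scsidev -a | driver_for_adapter vmhba1
# Off-host, feed in the line quoted in the post instead.
printf '%s\n' 'vmhba1  lsi_mr3           link-n/a  sas.5001e67ca647e000                    (0000:0a:00.0) LSI MegaRAID SAS Fury Controller' \
    | driver_for_adapter vmhba1
```

If this prints lsi_mr3 rather than megaraid_sas for your controller's adapter, you are in the situation described here.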

To fix this problem I had to disable the lsi_mr3 driver so that ESXi was forced to use the other one.  The procedure to do that is as follows:

1).  Verify that the new driver you want ESXi to use is already installed.  I'm assuming that bad things would happen if you disabled the only driver that could operate your disk controller.  Use this command to see the installed drivers:  "esxcli software vib list"  and you should get something similar to the following:

[root@intelserver:~] esxcli software vib list
Name                           Version                               Vendor  Acceptance Level  Install Date
-----------------------------  ------------------------------------  ------  ----------------  ------------
scsi-megaraid-sas              6.608.11.00-1OEM.600.0.0.2494585      Avago   VMwareCertified   2015-07-19
lsi-mr3                        6.606.10.00-1OEM.550.0.0.1391871      LSI     VMwareCertified   2015-06-19
mtip32xx-native                3.8.5-1vmw.600.0.0.2494585            VMWARE  VMwareCertified   2015-04-28
ata-pata-amd                   0.3.10-3vmw.600.0.0.2494585           VMware  VMwareCertified   2015-04-28
ata-pata-atiixp                0.4.6-4vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
ata-pata-cmd64x                0.2.5-3vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
ata-pata-hpt3x2n               0.3.4-3vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
ata-pata-pdc2027x              1.0-3vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
ata-pata-serverworks           0.4.3-3vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
ata-pata-sil680                0.4.8-3vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
ata-pata-via                   0.3.3-2vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
block-cciss                    3.6.14-10vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
cpu-microcode                  6.0.0-0.0.2494585                     VMware  VMwareCertified   2015-04-28
ehci-ehci-hcd                  1.0-3vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
elxnet                         10.2.309.6v-1vmw.600.0.0.2494585      VMware  VMwareCertified   2015-04-28
emulex-esx-elxnetcli           10.2.309.6v-0.0.2494585               VMware  VMwareCertified   2015-04-28
esx-base                       6.0.0-0.11.2809209                    VMware  VMwareCertified   2015-07-21
esx-dvfilter-generic-fastpath  6.0.0-0.0.2494585                     VMware  VMwareCertified   2015-04-28
esx-tboot                      6.0.0-0.0.2494585                     VMware  VMwareCertified   2015-04-28
esx-xserver                    6.0.0-0.0.2494585                     VMware  VMwareCertified   2015-04-28
ima-qla4xxx                    2.02.18-1vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
ipmi-ipmi-devintf              39.1-4vmw.600.0.0.2494585             VMware  VMwareCertified   2015-04-28
ipmi-ipmi-msghandler           39.1-4vmw.600.0.0.2494585             VMware  VMwareCertified   2015-04-28
ipmi-ipmi-si-drv               39.1-4vmw.600.0.0.2494585             VMware  VMwareCertified   2015-04-28
lpfc                           10.2.309.8-2vmw.600.0.0.2494585       VMware  VMwareCertified   2015-04-28
lsi-msgpt3                     06.255.12.00-7vmw.600.0.0.2494585     VMware  VMwareCertified   2015-04-28
lsu-hp-hpsa-plugin             1.0.0-1vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
lsu-lsi-lsi-mr3-plugin         1.0.0-2vmw.600.0.11.2809209           VMware  VMwareCertified   2015-07-21
lsu-lsi-lsi-msgpt3-plugin      1.0.0-1vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
lsu-lsi-megaraid-sas-plugin    1.0.0-2vmw.600.0.11.2809209           VMware  VMwareCertified   2015-07-21
lsu-lsi-mpt2sas-plugin         1.0.0-1vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
lsu-lsi-mptsas-plugin          1.0.0-1vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
misc-cnic-register             1.78.75.v60.7-1vmw.600.0.0.2494585    VMware  VMwareCertified   2015-04-28
misc-drivers                   6.0.0-0.11.2809209                    VMware  VMwareCertified   2015-07-21
net-bnx2                       2.2.4f.v60.10-1vmw.600.0.0.2494585    VMware  VMwareCertified   2015-04-28
net-bnx2x                      1.78.80.v60.12-1vmw.600.0.0.2494585   VMware  VMwareCertified   2015-04-28
net-cnic                       1.78.76.v60.13-2vmw.600.0.0.2494585   VMware  VMwareCertified   2015-04-28
net-e1000                      8.0.3.1-5vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
net-e1000e                     2.5.4-6vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
net-enic                       2.1.2.38-2vmw.600.0.0.2494585         VMware  VMwareCertified   2015-04-28
net-forcedeth                  0.61-2vmw.600.0.0.2494585             VMware  VMwareCertified   2015-04-28
net-igb                        5.0.5.1.1-5vmw.600.0.0.2494585        VMware  VMwareCertified   2015-04-28
net-ixgbe                      3.7.13.7.14iov-20vmw.600.0.0.2494585  VMware  VMwareCertified   2015-04-28
net-mlx4-core                  1.9.7.0-1vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
net-mlx4-en                    1.9.7.0-1vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
net-nx-nic                     5.0.621-5vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
net-tg3                        3.131d.v60.4-1vmw.600.0.0.2494585     VMware  VMwareCertified   2015-04-28
net-vmxnet3                    1.1.3.0-3vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
nmlx4-core                     3.0.0.0-1vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
nmlx4-en                       3.0.0.0-1vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
nmlx4-rdma                     3.0.0.0-1vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
nvme                           1.0e.0.35-1vmw.600.0.0.2494585        VMware  VMwareCertified   2015-04-28
ohci-usb-ohci                  1.0-3vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
qlnativefc                     2.0.12.0-5vmw.600.0.0.2494585         VMware  VMwareCertified   2015-04-28
rste                           2.0.2.0088-4vmw.600.0.0.2494585       VMware  VMwareCertified   2015-04-28
sata-ahci                      3.0-21vmw.600.0.11.2809209            VMware  VMwareCertified   2015-07-21
sata-ata-piix                  2.12-10vmw.600.0.0.2494585            VMware  VMwareCertified   2015-04-28
sata-sata-nv                   3.5-4vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
sata-sata-promise              2.12-3vmw.600.0.0.2494585             VMware  VMwareCertified   2015-04-28
sata-sata-sil24                1.1-1vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
sata-sata-sil                  2.3-4vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
sata-sata-svw                  2.3-3vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
scsi-aacraid                   1.1.5.1-9vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
scsi-adp94xx                   1.0.8.12-6vmw.600.0.0.2494585         VMware  VMwareCertified   2015-04-28
scsi-aic79xx                   3.1-5vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
scsi-bnx2fc                    1.78.78.v60.8-1vmw.600.0.0.2494585    VMware  VMwareCertified   2015-04-28
scsi-bnx2i                     2.78.76.v60.8-1vmw.600.0.11.2809209   VMware  VMwareCertified   2015-07-21
scsi-fnic                      1.5.0.45-3vmw.600.0.0.2494585         VMware  VMwareCertified   2015-04-28
scsi-hpsa                      6.0.0.44-4vmw.600.0.0.2494585         VMware  VMwareCertified   2015-04-28
scsi-ips                       7.12.05-4vmw.600.0.0.2494585          VMware  VMwareCertified   2015-04-28
scsi-megaraid-mbox             2.20.5.1-6vmw.600.0.0.2494585         VMware  VMwareCertified   2015-04-28
scsi-megaraid2                 2.00.4-9vmw.600.0.0.2494585           VMware  VMwareCertified   2015-04-28
scsi-mpt2sas                   19.00.00.00-1vmw.600.0.0.2494585      VMware  VMwareCertified   2015-04-28
scsi-mptsas                    4.23.01.00-9vmw.600.0.0.2494585       VMware  VMwareCertified   2015-04-28
scsi-mptspi                    4.23.01.00-9vmw.600.0.0.2494585       VMware  VMwareCertified   2015-04-28
scsi-qla4xxx                   5.01.03.2-7vmw.600.0.0.2494585        VMware  VMwareCertified   2015-04-28
uhci-usb-uhci                  1.0-3vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
vmware-esx-dvfilter-maclearn   1.00-1.00                             VMware  VMwareCertified   2015-05-12
xhci-xhci                      1.0-2vmw.600.0.0.2494585              VMware  VMwareCertified   2015-04-28
tools-light                    6.0.0-0.11.2809209                    VMware  VMwareCertified   2015-07-21
[root@intelserver:~]

Verify that the scsi-megaraid-sas driver is installed prior to proceeding. 
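
Since that listing is long, a small filter makes the check less error-prone.  This is only a sketch: vib_installed is a made-up helper name, and it simply looks for an exact match in the Name column.  On the host you would pipe "esxcli software vib list" into it; below, three rows copied from the listing above stand in for the real output.

```shell
# Hypothetical helper: succeed only if a VIB with exactly this name appears
# in the first column of "esxcli software vib list"-style text on stdin.
vib_installed() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Three rows copied from the listing in the post, so the sketch runs off-host.
sample='scsi-megaraid-sas              6.608.11.00-1OEM.600.0.0.2494585      Avago   VMwareCertified   2015-07-19
lsi-mr3                        6.606.10.00-1OEM.550.0.0.1391871      LSI     VMwareCertified   2015-06-19
esx-base                       6.0.0-0.11.2809209                    VMware  VMwareCertified   2015-07-21'

# On the host this would be:  esxcli software vib list | vib_installed scsi-megaraid-sas
if printf '%s\n' "$sample" | vib_installed scsi-megaraid-sas; then
    echo "scsi-megaraid-sas is installed; OK to go on to step 2"
else
    echo "scsi-megaraid-sas is missing; do not disable lsi_mr3"
fi
```

Matching on the whole name rather than a substring avoids being fooled by entries such as lsu-lsi-megaraid-sas-plugin, which is a plugin and not the driver itself.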

2).  To disable the lsi_mr3 driver do the following:  "esxcli system module set --enabled=false --module=lsi_mr3" 

3).  Reboot

After rebooting, run "esxcfg-scsidev -a" to verify that the megaraid_sas driver is assigned to your disk controller.
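
Put together, steps 2 and 3 plus the check afterwards look roughly like the sketch below.  The run wrapper and the APPLY variable are my own hypothetical safety switch, not anything ESXi provides: by default the script only prints the commands, and it executes them only when APPLY=1 is set.  The esxcli and esxcfg-scsidev commands themselves are the ones given in the steps above.

```shell
# Hypothetical wrapper: always print the command, execute it only when APPLY=1.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

# Steps 2 and 3: disable the default driver, then reboot so it takes effect.
disable_lsi_mr3() {
    run esxcli system module set --enabled=false --module=lsi_mr3
    run reboot
}

# After the reboot: list the adapters and look at the driver column.
verify_driver() {
    run esxcfg-scsidev -a
}

# Dry run by default; prefix with APPLY=1 on the host to really do it.
disable_lsi_mr3
```

Run verify_driver by hand once the host is back up, since nothing after the reboot line survives the restart.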

I don't know how ESXi decides the order of precedence when two or more drivers are applicable to a single hardware device, or whether the install process for the new LSI driver is supposed to disable or remove the default lsi_mr3 driver and simply failed to do so.  I also don't know what's going to happen when I put on the next ESXi patch, specifically whether the lsi_mr3 driver will once again take precedence.  If anyone knows how the driver install process is supposed to work in this situation, please let me know.  I am happy, though, that I finally got this fixed, as the server was seriously hampered by having the datastore become inaccessible for 5-10 seconds every time a VM started or stopped, and on at least one occasion the old driver did something bad enough to fool the controller into thinking that one of the disks had gone bad, which started a RAID resync.
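
Until someone answers that, one thing that can be checked after each patch is whether lsi_mr3 has been switched back on.  The sketch below assumes that "esxcli system module list" prints three columns (Name, Is Loaded, Is Enabled); the helper name module_enabled and the sample rows are made up for illustration and are not output captured from my server.

```shell
# Hypothetical helper: succeed only if the named module is marked enabled in
# "esxcli system module list"-style text on stdin
# (assumed columns: Name, Is Loaded, Is Enabled).
module_enabled() {
    awk -v mod="$1" '$1 == mod && $3 == "true" { found = 1 } END { exit !found }'
}

# Made-up sample rows in the assumed layout, so the sketch runs off-host.
sample='Name          Is Loaded  Is Enabled
------------  ---------  ----------
lsi_mr3           false       false
megaraid_sas       true        true'

# On the host this would be:  esxcli system module list | module_enabled lsi_mr3
if printf '%s\n' "$sample" | module_enabled lsi_mr3; then
    echo "lsi_mr3 is enabled again; re-check which driver the controller is using"
else
    echo "lsi_mr3 is still disabled"
fi
```

If the module does turn out to be enabled again after a patch, repeating steps 2 and 3 above should put things back, though I have not been through a patch cycle yet to confirm that.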
