Hi,
We are using vCenter 6.5.0.10000, build 5973321, with a physical host running VMware ESXi 6.0.0, build 7967664.
The host has a 12G SAS modular RAID controller attached to it.
From time to time, some datastores become inactive/inaccessible, and another datastore is not visible at all.
In the current case, out of a total of 8 datastores, 3 are inaccessible as shown below and 1 is missing.
What causes a datastore to reach the inaccessible/inactive state? What is the difference between these two states?
What makes a datastore go missing and then reappear after the host is rebooted?
A host reboot fixes the problems, but that is not an acceptable solution. Can you please help us resolve these two issues?
Thanks,
Rama
Hi Rama,
To isolate the issue, we need more information on the datastores and the logs.
Are those datastores local to the ESXi host, or shared?
Can you share the vmkernel.log of the ESXi host?
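In case it helps: vmkernel.log lives at /var/log/vmkernel.log on the host (or comes with a vm-support bundle). As a minimal sketch using standard ESXi CLI commands, you can check how each datastore is backed and whether its device is still healthy, and also rescan the adapters as a first non-invasive step instead of rebooting:

# Which device backs each VMFS datastore (also tells local vs. shared)
esxcli storage vmfs extent list

# Runtime state of the SCSI devices behind the datastores
esxcli storage core device list | grep -E "Display Name|Status"

# Rescan all storage adapters without a reboot
esxcli storage core adapter rescan --all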
Thanks,
MS
Hi MS,
Thanks for your prompt response.
These datastores are dedicated to that host (not shared with other hosts). The vmkernel.log is attached.
I see the following in the log and wonder if this is the cause. If so, how do we fix this issue?
2018-11-01T21:52:12.244Z cpu28:33577)ScsiDeviceIO: 2636: Cmd(0x43be5966d940) 0x4d, CmdSN 0x1b1 from world 34717 to dev "naa.6006bf1d58d2ad1020a478cc0bdb995c" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2018-11-01T21:52:12.246Z cpu29:33577)ScsiDeviceIO: 2636: Cmd(0x43be5966d940) 0x4d, CmdSN 0x1b5 from world 34717 to dev "naa.6006bf1d58d2ad1020a478b00a2e44be" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2018-11-01T21:52:12.248Z cpu29:33577)ScsiDeviceIO: 2636: Cmd(0x43be5966d940) 0x1a, CmdSN 0x1ba from world 34717 to dev "naa.6006bf1d58d2ad1020a4786905e96bdf" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2018-11-01T21:52:12.250Z cpu28:33577)ScsiDeviceIO: 2636: Cmd(0x43be5966d940) 0x4d, CmdSN 0x1bd from world 34717 to dev "naa.6006bf1d58d2ad1020a4789008440836" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2018-11-01T21:52:47.760Z cpu38:33033)FS3Misc: 1759: Long VMFS rsv time on 'ESX_18_DS_HDD_T30' (held for 307 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
Thanks,
Rama
To me it looks like the storage controller is not working properly. SCSI sense errors on different hard disks together with "Long VMFS rsv time" messages indicate an overload, malfunction, or misconfiguration of the storage controller.
Have you checked whether you are using the correct firmware and drivers for this controller (see VMware Compatibility Guide - System Search)? And have you already contacted the server manufacturer to investigate this problem?
If you decode the SCSI device errors (for example here: https://www.virten.net/vmware/esxi-scsi-sense-code-decoder/?host=0&device=2&plugin=0&sensekey=5&asc=... ) you will see that sense key 0x5 is ILLEGAL REQUEST and additional sense code 0x20 is INVALID COMMAND OPERATION CODE (the 0x24 in the third entry is INVALID FIELD IN CDB).
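If you want to see how widespread the errors are, a quick count per device straight from the host shell can help. This is just a sketch using the BusyBox tools that ship with ESXi and the default log location:

# Count the ScsiDeviceIO failures per naa. device in the current vmkernel.log
grep "ScsiDeviceIO" /var/log/vmkernel.log | grep "failed H:" | awk -F'"' '{print $2}' | sort | uniq -c

If several devices behind the same controller show up, that points away from a single failing disk.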
The controller is from our own Cisco hardware. From the ESXi host CLI, the following shows what we have:
[root@localhost:~] /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -l -i "lsi_mr3-5006bf1d58d2ad10/LSI Incorporation"
Listing keys:
Name: MR-DriverVersion
Type: string
value: 6.605.08.00
Name: MR-HBAModel
Type: string
value: LSI HBA 1000:5d:1137:db
Name: MR-FWVersion
Type: string
value: Fw Rev. 24.12.1-0049
Name: MR-ChipRevision
Type: string
value: Chip Rev. C0
Name: MR-CtrlStatus
Type: string
value: FwState c0000000
[root@localhost:~]
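In case it is useful for cross-checking, the same driver and VIB versions should also be visible through the standard tooling (a sketch; both commands ship with ESXi 6.x):

# Module-level info for the lsi_mr3 driver
vmkload_mod -s lsi_mr3 | grep -i version

# Installed driver VIB and its version
esxcli software vib list | grep -i lsi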
From the VMware Compatibility Guide, I see the following as required. Can the reason be that we are running Fw Rev. 24.12.1-0049 while 24.12.1-0411 is needed? But if so, I wonder why it happens only on some datastores and at random times.
Can you provide your analysis of these inputs?
Thanks,
Rama
Since the SCSI errors occur on different hard disks and different datastores are affected, I still assume a problem with the storage controller, because it is one of the central components in your storage setup. If only one disk or one datastore were affected, we could assume a single hard disk failure, but I don't believe that is the case here.
The versions listed on the VMware HCL are the minimum versions supported by VMware, and both your controller firmware and your driver are far below the recommended versions. I would therefore update both the driver and the firmware as a first step.
And since you write that these are Cisco servers, I would also check Cisco's compatibility matrix to see which versions they recommend for your server model; this sometimes differs between Cisco and VMware: Cisco UCS Hardware Compatibility List
Just select your server type, model, and ESXi version, scroll down to your server firmware version, and expand the Adapters -> RAID section to see which driver versions Cisco recommends.
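As a rough sketch of the driver update step, assuming the driver is delivered as an offline bundle (the bundle path below is just a placeholder; use the file from the Cisco/VMware download page):

# Confirm the currently installed driver VIB
esxcli software vib list | grep -i lsi

# Install/update the driver from an offline bundle (hypothetical path), then reboot the host
esxcli software vib update -d /vmfs/volumes/datastore1/lsi-mr3-offline-bundle.zip

The controller firmware itself is updated through Cisco's own tooling (e.g. the Host Upgrade Utility), not through esxcli.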
Hi Sebastian,
We have scheduled the firmware upgrade for this weekend. We will check the compatibility matrices and see how it goes.
Thanks,
Rama
Hi Sebastian,
We updated the firmware to 24.12.1-0411 and are now waiting to see whether the problem reoccurs.
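In the meantime we are watching the host log for the same errors (standard BusyBox tail/grep on the host):

tail -f /var/log/vmkernel.log | grep -E "ScsiDeviceIO|Long VMFS rsv"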
Thanks,
Rama