namelesss
Contributor

Local datastore becomes inactive after a long time


Hi,

I'm running ESXi 4.1 (build 260247) on a Dell R410, with a Dell SAS 6/iR SATA controller and 2x 500GB SATA drives.

Everything was running fine, but since about a week ago, the datastore becomes inactive after some time.

I can still connect with vSphere, but the datastore is shown as Inactive. The VMs appear to still be running (none is marked as inactive), but I can't ping any of them, nor reach them via console or SSH.

The only solution I have found is to restart the host (which takes 30-40 minutes, as it tries to save the state of the VMs but apparently can't, since the datastore is inactive). After the reboot, everything runs fine again.

The datastore always becomes inactive during the night, while a job backs up my mail server. Maybe the high disk I/O causes the datastore to go inactive, though that seems odd, as this job had been running for months without any problems.

There is nothing in the logs, so it's a little hard to debug.

I tried rescanning the VMFS modules, refreshing the datastore, and rescanning the controller, but nothing works except rebooting the host.

If anyone has an idea, it would be greatly appreciated; I haven't slept much these last weeks, because I have to check every night whether my server is still running...

Thanks! :)

14 Replies
namelesss
Contributor

A little more info:

I saw this yesterday, even though my server didn't crash this time:

May 18 22:16:19 vmkernel: 0:23:03:35.411 cpu9:4231)ScsiDeviceIO: 1672: Command 0x2a to device "naa.600508e000000000f7a03804e1b70305" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
May 18 22:16:22 vmkernel: 0:23:03:38.462 cpu7:4103)<6>mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
May 18 22:16:22 vmkernel: 0:23:03:38.464 cpu7:4103)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027f9dcd40) to NMP device "naa.600508e000000000f7a03804e1b70305" failed on physical path "vmhba2:C1:T0:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
May 18 22:16:22 vmkernel: 0:23:03:38.464 cpu7:4103)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.600508e000000000f7a03804e1b70305" state in doubt; requested fast path state update...
May 18 22:16:22 vmkernel: 0:23:03:38.464 cpu7:4103)ScsiDeviceIO: 1672: Command 0x2a to device "naa.600508e000000000f7a03804e1b70305" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
May 18 22:16:22 vmkernel: 0:23:03:38.466 cpu7:4103)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027eeea740) to NMP device "naa.600508e000000000f7a03804e1b70305" failed on physical path "vmhba2:C1:T0:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
May 18 22:16:22 vmkernel: 0:23:03:38.466 cpu7:4103)ScsiDeviceIO: 1672: Command 0x2a to device "naa.600508e000000000f7a03804e1b70305" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
May 18 22:16:22 vmkernel: 0:23:03:38.468 cpu7:4103)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027f0f0940) to NMP device "naa.600508e000000000f7a03804e1b70305" failed on physical path "vmhba2:C1:T0:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
May 18 22:16:22 vmkernel: 0:23:03:38.468 cpu7:4103)ScsiDeviceIO: 1672: Command 0x2a to device "naa.600508e000000000f7a03804e1b70305" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
May 18 22:16:22 vmkernel: 0:23:03:38.470 cpu7:4103)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027f9d4340) to NMP device "naa.600508e000000000f7a03804e1b70305" failed on physical path "vmhba2:C1:T0:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
May 18 22:16:22 vmkernel: 0:23:03:38.470 cpu7:4103)ScsiDeviceIO: 1672: Command 0x2a to device "naa.600508e000000000f7a03804e1b70305" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
May 18 22:16:22 vmkernel: 0:23:03:38.472 cpu7:4103)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027f26aa40) to NMP device "naa.600508e000000000f7a03804e1b70305" failed on physical path "vmhba2:C1:T0:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
May 18 22:16:22 vmkernel: 0:23:03:38.472 cpu7:4103)ScsiDeviceIO: 1672: Command 0x2a to device "naa.600508e000000000f7a03804e1b70305" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
May 18 22:16:22 vmkernel: 0:23:03:38.473 cpu7:4103)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027f24e040) to NMP device "naa.600508e000000000f7a03804e1b70305" failed on physical path "vmhba2:C1:T0:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

And the last line in the logs when my datastore became inactive:

1:23:18:58.554 cpu7:4103)<6>mptbase: ioc0: LogInfo(0x31110700): Originator={PL}, Code={Reset}, SubCode(0x0700)

I saw this in the 4.1 U1 changelog:

"ESXi hosts might fail when using LSI SAS HBAs connected to SATA disks
Data loss might occur on ESXi hosts using LSI SAS HBAs connected to SATA disks. This issue occurs when the maximum I/O size is set to more than 64KB in the mptsas driver and LSI SAS HBAs are connected to SATA disks. The issue is resolved in this release."

I don't know if it could be related to my problem... I'll try to update next week, but if anyone has any advice, I'd be glad to hear it :)

markfrey
Contributor

Hi,

I had what looks like the same problem on 2 different Dell R310s. A Linux guest was doing scheduled backups of a file share on another physical machine, and the ESXi lockups would happen within a few minutes of that backup process starting - not every backup, but always during the first minutes of one. It also happened once while cloning a Linux system to a new VM on the ESXi host over the network - lots of disk activity going on.

May 14 01:13:03 vmkernel: 0:08:04:30.867 cpu3:4099)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027f3ec340) to NMP device "naa.600508e00000000076d0ee8960a73102" failed on physical path "vmhba2:C1:T0:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
May 14 01:13:03 vmkernel: 0:08:04:30.867 cpu3:4099)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.600508e00000000076d0ee8960a73102" state in doubt; requested fast path state update...
May 14 01:13:03 vmkernel: 0:08:04:30.867 cpu3:4099)ScsiDeviceIO: 1672: Command 0x2a to device "naa.600508e00000000076d0ee8960a73102" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
May 14 01:13:03 vmkernel: 0:08:04:30.868 cpu3:4099)<6>mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
May 14 01:13:03 vmkernel: 0:08:04:30.871 cpu3:4099)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027f3e2640) to NMP device "naa.600508e00000000076d0ee8960a73102" failed on physical path "vmhba2:C1:T0:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

...

May 14 01:13:16 vmkernel: 0:08:04:43.496 cpu0:4114)<6>mptscsih: ioc0: attempting task abort! (sc=0x41000e01e340)

May 14 01:14:35 Hostd: "6"

May 14 01:17:04 vmkernel: 0:08:08:31.207 cpu1:4115)<6>mptscsih: ioc0: attempting task abort! (sc=0x41000e011940)

May 14 01:17:04 vmkernel: 0:08:08:31.207 cpu1:4115)MPT SAS Host:6:1:0:0 ::

May 14 01:19:36 Hostd: "6"

May 14 01:19:36 Hostd: }

May 14 01:23:48 vmkernel: 0:08:15:15.643 cpu0:4167)VSCSI: 2519: handle 8193(vscsi0:1):Reset [Retries: 18/0]

May 14 01:23:48 vmkernel: 0:08:15:15.643 cpu0:4167)MPT SAS Host:6:1:0:0 ::

May 14 01:24:36 Hostd: }

...

At that time the local SATA RAID 1 appeared inaccessible, and the guests started falling apart.

I could still log into the ESXi tech support console. Issuing a 'df' command would output information for a couple of the filesystems, but then hang instead of displaying the rest of them.

I found some postings online about a similar problem with some SAN adapters, caused by the SAN issuing an incorrect or unexpected response when its command queue became full, which made the adapter go offline.

My thought was that something similar might be happening with the LSI SAS/SATA RAID controller. By default the mptsas driver will queue up to 64 commands to the drives. I logged into the ESXi tech support console and issued this command:

~ # esxcfg-module -s mpt_sdev_queue_depth=32 mptsas

to set the driver to queue at most 32 commands to the drives, then rebooted ESXi.

Before changing the driver option to 32, both machines would regularly lock up every couple of days while doing the backup routine. Since changing the option, neither machine has locked up - 17 days and counting.
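
For anyone repeating this, here is a small sketch that sanity-checks the requested depth and composes the exact command line above without running anything (the 1-64 range check is my own assumption based on the driver's stated default of 64; nothing here touches an ESXi host):

```shell
# Build (but do not execute) the esxcfg-module command from the post
# above, after validating the requested queue depth. On a real host you
# would run the echoed command in the tech support console, then reboot.
depth=32
module=mptsas
param=mpt_sdev_queue_depth

# Reject non-numeric input.
case "$depth" in
  ''|*[!0-9]*) echo "queue depth must be a number" >&2; exit 1 ;;
esac
# Assumed sane range: 1 up to the driver default of 64.
if [ "$depth" -lt 1 ] || [ "$depth" -gt 64 ]; then
  echo "queue depth $depth outside expected 1-64 range" >&2; exit 1
fi

echo "esxcfg-module -s ${param}=${depth} ${module}"
```

On the host itself, the matching verification step after the reboot would be `esxcfg-module -g mptsas`, as mentioned later in this thread.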

HTH,

Mark.

Loc2262
Contributor

Hello!

I seem to be experiencing a similar problem on my ESXi 4.1 server. It is running on an MSI-7522 mainboard, which has an ICH10 SATA controller. I'm not using any hardware RAID. Two 1.5 TB HDDs are connected.

Two days ago, when the VMs and subsequently the host started acting up, very similarly to what you've been describing, I found entries in /var/log/messages also very similar to those posted here. After a cold reboot, the system continued operating normally.

So I'm wondering if my problem might be in any way related to what you've been discussing here. Since I'm not using an LSI controller, the respective SATA module is probably a different one for me. I'm not an expert, though, when it comes to ESXi modules and driver options...

I also have the suspicion that the recent update to ESXi 4.1 (was running 4.0 before) might have something to do with it, since the host in question has been running for about 1.5 years without any such trouble before. To rule out actual hardware failure, I'm having the hoster where the server is located perform a full hardware check in a few days.

Could you give me any hint what I could try here? What info would you need? What option in what module could help here?

I'm adding the output of "esxcfg-module -l" below. Thanks a lot in advance!

Name                     Used Size (kb)
vmkernel                 51   11007
procfs                   2    16
vmkapiv1_0_0_0_vmkernel_shim 0    32
vmklinux                 13   664
random                   1    32
usb                      4    112
ehci-hcd                 0    36
usb-uhci                 0    24
hid                      0    24
dm                       2    12
nmp                      26   120
vmw_satp_local           7    8
vmw_satp_default_aa      0    8
vmw_psp_lib              8    12
vmw_psp_fixed            7    12
vmw_psp_rr               0    16
vmw_psp_mru              0    12
vmw_psp_fixed_ap         0    12
vmw_satp_alua            0    20
vmw_satp_cx              0    28
libata                   1    92
usb-storage              0    60
vfat                     1    36
mask_path_plugin         0    8
vaai_filter              0    16
vmw_vaaip_emc            0    8
vmkapiv1_0_0_0_nmp_shim  0    8
vmci                     17   112
iscsi_trans              3    52
hub                      1    8
netsched                 1    16
e1000                    0    152
iscsi_linux              0    16
ahci                     6    20
lvmdriver                2    80
deltadisk                4    40
multiextent              0    12
dvsdev                   0    12
etherswitch              5    180
shaper                   0    28
cdp                      7    100
tcpip3                   4    816
dvfilter                 2    72
vmkapiv1_0_0_0_dvfilter_shim 0    8
vmkibft                  0    16
vmfs3                    2    260
nfsclient                0    96
vmkstatelogger           1    116
migrate                  2    252
cbt                      0    12
tcpip2v6                 Not Loaded
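
If it helps anyone trying to work out which driver handles their storage: given an `esxcfg-module -l` listing like the one above, you can filter it for the usual local-storage driver names. A sketch (the name list is my own guess and not exhaustive; the sample file just reproduces a few lines from the listing):

```shell
# Filter an `esxcfg-module -l`-style listing for common local-storage
# driver modules. The candidate names below are illustrative only.
cat > /tmp/modlist.txt <<'EOF'
libata                   1    92
e1000                    0    152
ahci                     6    20
vmfs3                    2    260
EOF

# Keep only lines whose module name is a known storage driver.
grep -E '^(ahci|libata|mptsas|mptspi|megaraid[a-z_]*) ' /tmp/modlist.txt
```

Against the full list above, this would surface `ahci` and `libata`, which matches Mark's guess below that ahci is the SATA driver here.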

Kind regards,

Frank

markfrey
Contributor

Hi Frank,

I believe your SATA driver is the ahci module. The problem you're seeing could also be defective drive or controller hardware, though. The hardware test is a good idea - it's more likely a hardware problem if it's been running so long on 4.0 without issues.

You can see what configurable parameters ahci has by running:

esxcfg-module -i ahci

If there's a maximum device queue depth parameter listed, the output should say what the default is, and you can try changing it with:

esxcfg-module -s <queue depth parameter name>=<desired max device queue depth> ahci

Reboot, and confirm the new setting with:

esxcfg-module -g ahci

Mark.

Loc2262
Contributor

Hey markfrey!

Thanks for your fast reply!

Unfortunately, there does not seem to be a max queue depth parameter for the ahci module:

~ # esxcfg-module -i ahci
esxcfg-module module information
input file: /usr/lib/vmware/vmkmod/ahci.o
License: GPL
Version:
Name-space:
Required name-spaces:
  com.vmware.vmkapi@v1_1_0_0
Parameters:
  heap_initial: int
   Initial heap size allocated for the driver.
  heap_max: int
   Maximum attainable heap size for the driver.

Do you maybe know of a way to determine which hardware component in an ESXi system is operated by which module? Maybe there is something else involved that performs the SATA commands...


Also, your conjecture is that if the system has been running stably for 1.5 years under 4.0 (initial release, U1, and U2), it should also do so under 4.1 U1? Would you rather rule out a compatibility issue introduced with 4.1?


I have scheduled the hardware check for this weekend, and am curious about its outcome.


Thanks again and kind regards,

Frank

Loc2262
Contributor

Oh, another thing: do you maybe have an idea what's causing these entries in /var/log/messages? 3-5 of them appear every 20 seconds.

Jun 16 09:15:00 Hostd: [2011-06-16 09:15:00.011 2EB81B90 verbose 'Statssvc'] HostCtl exception Unable to complete Sysinfo operation.  Please see the VMkernel log file for more details.

It sounds like some statistics collection mechanism is trying to record statistics but can't. Since the "VMkernel log file" is the same file (there's only /var/log/messages for everything in the free ESXi), there's no further info.

The error does not seem to have any apparent ill effect, but it still kind of worries me, especially in light of the recent SATA failure event. I can't recall having seen anything like this in the 4.0 version.

markfrey
Contributor

Hi Frank,

The ahci module is a somewhat generic driver for SATA controllers that support the AHCI standard. It would be the only module that deals with SATA drive commands. Google might turn up information on specific modules, or you'd have to look at the Linux source code and documentation for the modules to see what they do (VMware-specific ones excluded, of course).

It's possible that some incompatibility was introduced in 4.1u1, but if the hardware ends up testing OK and the lockups continue, the only way to be sure would be to go back to 4.0 and see if it clears up.

I don't have the stats errors in my logs that you do, but Google turns up quite a few others that do.

Mark.

Loc2262
Contributor

Thanks, Mark, for the further information!

The hoster is performing the hardware check as I write this, we shall see its outcome tomorrow.

About the error messages: by experimentation and some luck, I found out something, including a solution/workaround, which I posted in this thread: http://communities.vmware.com/message/1775093#1775093

Loc2262
Contributor

Here's some news: the hoster completed the test and found (and replaced) a defective memory module.

So hopefully this was the cause of the issue. :)

namelesss
Contributor

Hi,

I tried the command markfrey provided; I hope to only report back in a few weeks to say it's working :)

Thanks for your help !

namelesss
Contributor

The solution markfrey provided seems to work.

Thank you very much, you saved my life!!! :)

http://communities.vmware.com/message/1770596#1770596

namelesss
Contributor

Update: finally, the problem is not solved; now it's occurring more often :/

It had been running perfectly for the last 2 months... Nothing changed on the host/VMs.

Has nobody found a solution?? :(

markfrey
Contributor

If you've suddenly started having frequent loss of the datastore again, there's a good chance it's a hardware problem this time. Maybe you've developed a bad disk, cable, RAM module, or controller card?

Mark.

markfrey
Contributor

Just wanted to post an overdue follow-up. After changing the mpt driver's maximum queue depth, my systems became completely stable:

You have activated Tech Support Mode.
The time and date of this activation have been sent to the system logs.

VMware offers supported, powerful system administration tools. Please
see www.vmware.com/go/sysadmintools for details.

Tech Support Mode may be disabled by an administrative user.
Please consult the ESXi Configuration Guide for additional
important information.

~ # uptime
  19:15:11 up 692 days,  3:55, load average: 0.00, 0.00, 0.00
~ #

Mark.
