Hi,
I had what looks like the same problem on two different Dell R310s. A Linux guest was doing scheduled backups of a file share on another physical machine, and the ESXi lockups would happen within a few minutes of that backup process starting - not on every backup, but always during the first minutes of one. It also happened once while cloning a Linux system to a new VM on the ESXi host over the network - again, lots of disk activity going on.
May 14 01:13:03 vmkernel: 0:08:04:30.867 cpu3:4099)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027f3ec340) to NMP device "naa.600508e00000000076d0ee8960a73102" failed on physical path "vmhba2:C1:T0:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
May 14 01:13:03 vmkernel: 0:08:04:30.867 cpu3:4099)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.600508e00000000076d0ee8960a73102" state in doubt; requested fast path state update...
May 14 01:13:03 vmkernel: 0:08:04:30.867 cpu3:4099)ScsiDeviceIO: 1672: Command 0x2a to device "naa.600508e00000000076d0ee8960a73102" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
May 14 01:13:03 vmkernel: 0:08:04:30.868 cpu3:4099)<6>mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
May 14 01:13:03 vmkernel: 0:08:04:30.871 cpu3:4099)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41027f3e2640) to NMP device "naa.600508e00000000076d0ee8960a73102" failed on physical path "vmhba2:C1:T0:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
...
May 14 01:13:16 vmkernel: 0:08:04:43.496 cpu0:4114)<6>mptscsih: ioc0: attempting task abort! (sc=0x41000e01e340)
May 14 01:14:35 Hostd: "6"
May 14 01:17:04 vmkernel: 0:08:08:31.207 cpu1:4115)<6>mptscsih: ioc0: attempting task abort! (sc=0x41000e011940)
May 14 01:17:04 vmkernel: 0:08:08:31.207 cpu1:4115)MPT SAS Host:6:1:0:0 ::
May 14 01:19:36 Hostd: "6"
May 14 01:19:36 Hostd: }
May 14 01:23:48 vmkernel: 0:08:15:15.643 cpu0:4167)VSCSI: 2519: handle 8193(vscsi0:1):Reset [Retries: 18/0]
May 14 01:23:48 vmkernel: 0:08:15:15.643 cpu0:4167)MPT SAS Host:6:1:0:0 ::
May 14 01:24:36 Hostd: }
...
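A couple of notes on reading those lines: command 0x2a is a SCSI WRITE(10), and if I'm reading the status bytes right, H:0x8 is the host-side reset status - i.e. the driver gave up on the aborts and reset the device. If you want to pull the same entries out of your own host's logs, something like this works from the tech support console (the /var/log/messages path is what I have on ESXi 4.x; adjust if yours differs):

~ # grep -E 'nmp_CompleteCommandForPath|task abort' /var/log/messages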
At that time the local SATA RAID 1 appeared inaccessible, and the guests started falling apart.
I could still log into the ESXi tech support console. Issuing a 'df' command would print information for a couple of the filesystems, but then it would hang instead of displaying the rest of them.
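If df hangs like that and you want to know which datastore is the culprit, a quick-and-dirty sketch (assuming the usual busybox shell on the tech support console) is to query the volumes one at a time, so each name prints before the query can block on it:

~ # for v in /vmfs/volumes/*; do echo "$v"; df "$v"; done

The loop will still hang on the bad volume, but the last name echoed tells you which one it is.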
I found some postings online about a similar problem with some SAN adapters: the SAN would issue an incorrect or unexpected response when its command queue became full, causing the adapter to go offline.
My thought was that something similar might be happening with the LSI SAS/SATA RAID controller. By default the mptsas driver will queue up to 64 commands per device. I logged into the ESXi tech support console and issued this command:
~ # esxcfg-module -s mpt_sdev_queue_depth=32 mptsas
to limit the driver to 32 queued commands per device, then rebooted ESXi.
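After the reboot you can confirm the option stuck with esxcfg-module's query flag (the output below is roughly what I remember from 4.x, so treat the exact format as approximate):

~ # esxcfg-module -g mptsas
mptsas enabled = 1 options = 'mpt_sdev_queue_depth=32'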
Before changing the driver option, both machines would regularly lock up every couple of days while doing the backup routine. Since the change, neither machine has locked up - 17 days and counting.
HTH,
Mark.