nmtd
Contributor

ESXi 5.1 Host local datastore freezes

Hi

I have a very frustrating problem in that I have to keep hard rebooting a newly built ESXi 5.1.0 (Kernel build 799733) host machine.

The machine is a 2 x Dual Core AMD Opteron x64 based server with 28GB RAM and 2 x 1TB local HD. ESXi is booting from a USB drive, and each HD has one VM datastore so I have DS1 and DS2 datastores.

The symptom is that after a period of low activity (guests are all up but not being asked to do much work), DS2 freezes. Trying to browse the datastore just says "Searching datastore........" and all VMs on that datastore are unreachable. It only ever affects DS2. DS1 doesn't experience the issue.

Going into the host via SSH at this point, I cannot even list the contents of the /var/log or /vmfs/volumes directories - it just hangs.

I cannot restart the management agents or reboot from the ESXi console. The only way to bring things back to life is to hard restart the host, after which everything is fine. VMs start and are responsive.

I have tried this but it made no difference.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=103026...

I have also disabled all power saving options and IOMMU in the host BIOS.

After the reboot I checked vmkernel.log and can see these disk-related messages logged just before the reboot:

2013-05-02T14:59:18.681Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 2 times

2013-05-02T14:59:18.681Z cpu2:6453)ScsiDeviceIO: 2303: Cmd(0x4124403d5a40) 0x2a, CmdSN 0x80000007 from world 6453 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC182891" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0$
2013-05-02T15:00:01.530Z cpu1:19075)VSCSI: 2370: handle 8200(vscsi0:0):Reset request on FSS handle 198922 (0 outstanding commands)
2013-05-02T15:00:01.530Z cpu1:19075)VSCSI: 2370: handle 8201(vscsi0:1):Reset request on FSS handle 166153 (0 outstanding commands)
2013-05-02T15:00:01.530Z cpu2:4170)VSCSI: 2648: handle 8200(vscsi0:0):Reset [Retries: 0/0]
2013-05-02T15:00:01.530Z cpu2:4170)VSCSI: 2446: handle 8200(vscsi0:0):Completing reset (0 outstanding commands)
2013-05-02T15:00:01.530Z cpu2:4170)VSCSI: 2648: handle 8201(vscsi0:1):Reset [Retries: 0/0]
2013-05-02T15:00:01.530Z cpu2:4170)VSCSI: 2446: handle 8201(vscsi0:1):Completing reset (0 outstanding commands)
2013-05-02T15:29:18.941Z cpu0:5267)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba1:0:0:0 (driver name: sata_nv) - Message repeated 17 times
2013-05-02T15:29:18.941Z cpu0:5267)ScsiDeviceIO: 2303: Cmd(0x4124003edac0) 0x85, CmdSN 0x18 from world 5267 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC024514" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2013-05-02T15:59:20.236Z cpu3:6569)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba1:0:0:0 (driver name: sata_nv) - Message repeated 91 times
2013-05-02T15:59:20.236Z cpu3:6569)ScsiDeviceIO: 2303: Cmd(0x4124403f5400) 0x2a, CmdSN 0x80000016 from world 6569 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC024514" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x5 0x20 $
2013-05-02T16:29:20.717Z cpu3:6522)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 7 times
2013-05-02T16:29:20.717Z cpu3:6522)ScsiDeviceIO: 2303: Cmd(0x4124404017c0) 0x2a, CmdSN 0x8000001d from world 6522 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC182891" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0$
2013-05-02T16:59:21.052Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 9 times
2013-05-02T16:59:21.052Z cpu2:6453)ScsiDeviceIO: 2303: Cmd(0x4124403f2400) 0x2a, CmdSN 0x80000044 from world 6453 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC182891" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0$
2013-05-02T17:59:21.822Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 3 times
2013-05-02T17:59:21.822Z cpu2:6453)ScsiDeviceIO: 2303: Cmd(0x4124403da000) 0x2a, CmdSN 0x8000001e from world 6453 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC182891" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0$
2013-05-02T18:59:22.730Z cpu2:6569)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba1:0:0:0 (driver name: sata_nv) - Message repeated 3 times
2013-05-02T18:59:22.730Z cpu2:6569)ScsiDeviceIO: 2303: Cmd(0x4124403d3d40) 0x2a, CmdSN 0x8000003d from world 6569 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC024514" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0$
2013-05-02T19:29:23.425Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 1 time
2013-05-02T19:29:23.425Z cpu2:6453)ScsiDeviceIO: 2303: Cmd(0x4124403dbf00) 0x2a, CmdSN 0x80000045 from world 6453 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC182891" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0$
2013-05-02T19:29:23.425Z cpu2:6453)ScsiDeviceIO: 2303: Cmd(0x4124403d5340) 0x2a, CmdSN 0x8000005f from world 6453 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC182891" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0$
2013-05-02T19:59:25.512Z cpu3:6529)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba1:0:0:0 (driver name: sata_nv) - Message repeated 125 times
2013-05-02T19:59:25.512Z cpu3:6529)ScsiDeviceIO: 2303: Cmd(0x4124403d4040) 0x2a, CmdSN 0x800000b1 from world 6529 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC024514" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0$
2013-05-02T19:59:25.672Z cpu2:6529)ScsiDeviceIO: 2303: Cmd(0x4124403da200) 0x2a, CmdSN 0x7164 from world 4100 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC024514" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
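
For anyone reading the ScsiDeviceIO lines above, here is a minimal Python sketch that pulls out the command opcode and the H:/D:/P: status fields. The opcode and device-status tables come from the SCSI standard rather than from this thread, the function name `decode` is purely illustrative, and reading H/D/P as host/device/plugin status is an assumption based on VMware's usual convention.

```python
import re

# Standard SCSI status byte values (assumption: taken from the SCSI
# standard, not from anything stated in this thread).
DEVICE_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
}

# Standard SCSI operation codes for the commands seen in the excerpt.
OPCODES = {
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
    0x85: "ATA PASS-THROUGH(16)",
}

LINE_RE = re.compile(
    r'Cmd\(0x[0-9a-fA-F]+\) (?P<op>0x[0-9a-fA-F]+), .*? to dev "(?P<dev>[^"]+)" '
    r'failed H:(?P<h>0x[0-9a-fA-F]+) D:(?P<d>0x[0-9a-fA-F]+) P:(?P<p>0x[0-9a-fA-F]+)'
)


def decode(line):
    """Return the decoded fields of one ScsiDeviceIO failure line, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    op = int(m.group("op"), 16)
    dev_status = int(m.group("d"), 16)
    return {
        # Last underscore-separated part of the device name identifies the disk.
        "disk": m.group("dev").rsplit("_", 1)[-1],
        "command": OPCODES.get(op, hex(op)),
        "host_status": int(m.group("h"), 16),
        "device_status": DEVICE_STATUS.get(dev_status, hex(dev_status)),
        "plugin_status": int(m.group("p"), 16),
    }


# One line copied from the vmkernel.log excerpt above.
SAMPLE_LINE = (
    '2013-05-02T15:29:18.941Z cpu0:5267)ScsiDeviceIO: 2303: Cmd(0x4124003edac0) '
    '0x85, CmdSN 0x18 from world 5267 to dev '
    '"t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC024514" '
    'failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.'
)

if __name__ == "__main__":
    print(decode(SAMPLE_LINE))
```

Running it over the whole excerpt should show that every failure carries device status 0x8 (BUSY) and that both disks are affected, which is easier to see in decoded form than in the raw lines.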

VMK_SCSI_DEVICE_BUSY = 0x8

vmkernel:  1:02:02:02.206 cpu3:4099)NMP: nmp_CompleteCommandForPath: Command 0x28  (0x410005078e00) to NMP device "naa.6001e4f000105e6b00001f14499bfead"  failed on physical path "vmhba1:C0:T0:L100" H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

This status is returned when a LUN cannot accept SCSI commands at the moment. As this should be a temporary condition, the command is tried again.

In vmkwarning.log I am getting similar messages every 30 minutes:
2013-05-02T14:29:18.420Z cpu0:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 1 time
2013-05-02T14:59:18.681Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 2 times
2013-05-02T15:29:18.941Z cpu0:5267)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba1:0:0:0 (driver name: sata_nv) - Message repeated 17 times
2013-05-02T15:59:20.236Z cpu3:6569)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba1:0:0:0 (driver name: sata_nv) - Message repeated 91 times
2013-05-02T16:29:20.717Z cpu3:6522)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 7 times
2013-05-02T16:59:21.052Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 9 times
2013-05-02T17:59:21.822Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 3 times
2013-05-02T18:59:22.730Z cpu2:6569)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba1:0:0:0 (driver name: sata_nv) - Message repeated 3 times
2013-05-02T19:29:23.425Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 1 time
2013-05-02T19:59:25.512Z cpu3:6529)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba1:0:0:0 (driver name: sata_nv) - Message repeated 125 times
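
To check the "every 30 minutes" pattern and see which adapter accumulates the most repeats, the warnings can be tallied with a short script. This is only a sketch: the regular expression is written against the exact line format shown above, and the names `summarize` and `SAMPLE` are illustrative.

```python
import re
from collections import defaultdict
from datetime import datetime

WARN_RE = re.compile(
    r"^(?P<ts>\S+)Z cpu\d+:\d+\)WARNING: LinScsi: SCSILinuxQueueCommand:\d+:"
    r"queuecommand failed with status = (?P<status>0x[0-9a-fA-F]+) .*?"
    r"(?P<hba>vmhba\d+):\d+:\d+:\d+ \(driver name: (?P<driver>\w+)\)"
    r" - Message repeated (?P<n>\d+) times?"
)


def summarize(lines):
    """Return (repeat totals per adapter, gaps in minutes between warnings)."""
    totals = defaultdict(int)
    stamps = []
    for line in lines:
        m = WARN_RE.search(line)
        if m is None:
            continue  # not a queuecommand warning
        totals[m.group("hba")] += int(m.group("n"))
        stamps.append(datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f"))
    gaps = [
        round((later - earlier).total_seconds() / 60)
        for earlier, later in zip(stamps, stamps[1:])
    ]
    return dict(totals), gaps


# First three warnings copied from the vmkwarning.log excerpt above.
SAMPLE = [
    "2013-05-02T14:29:18.420Z cpu0:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:"
    "queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 "
    "(driver name: sata_nv) - Message repeated 1 time",
    "2013-05-02T14:59:18.681Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:"
    "queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 "
    "(driver name: sata_nv) - Message repeated 2 times",
    "2013-05-02T15:29:18.941Z cpu0:5267)WARNING: LinScsi: SCSILinuxQueueCommand:1193:"
    "queuecommand failed with status = 0x1056 Unknown status vmhba1:0:0:0 "
    "(driver name: sata_nv) - Message repeated 17 times",
]

if __name__ == "__main__":
    totals, gaps = summarize(SAMPLE)
    print(totals)  # repeat count per adapter
    print(gaps)    # minutes between consecutive warnings
```

On the host itself the same function could be fed the lines of the real log file instead of the pasted sample.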

A similar story appears in http://communities.vmware.com/thread/341512 but there doesn't seem to be anything extra here to try that I haven't already.

Any ideas appreciated.

Thanks

4 Replies
daftu
Contributor

Hi nmtd,

I'm having exactly the same problem.

Did you find any solution/workaround?

daftu
Contributor

I downgraded to ESXi 5.0.0 U2 and the problem seems to have disappeared: no such vmkernel.log entries in the last 12 hours.

That's a second reason not to upgrade to 5.1. The first is: Datastore speed issue - two same drives

nmtd
Contributor

Hi daftu

Really pleased if you managed to solve this one with a downgrade. I must admit I lost patience with it in the end, and installed on some different hardware which has been stable on 5.1.

I've already rebuilt the original server as a standalone, so I am unable to try your downgrade solution at the moment, but please post on here if it continues to stay stable.

daftu
Contributor

I can confirm it. There are no more messages like the previous ones in vmkernel.log (before downgrading, the first messages appeared after about 20-30 minutes of ESXi uptime).

And of course no datastore freezes either.

Tested under heavy load on all drives for a few hours.
