I have a very frustrating problem in that I have to keep hard rebooting a newly built ESXi 5.1.0 (Kernel build 799733) host machine.
The machine is a 2 x Dual Core AMD Opteron x64 based server with 28GB RAM and 2 x 1TB local HD. ESXi is booting from a USB drive, and each HD has one VM datastore so I have DS1 and DS2 datastores.
The symptom is that after a period of low activity (guests are all up but not being asked to do much work), DS2 becomes frozen. Trying to browse the datastore just says "Searching datastore........" and all VMs on that datastore are uncontactable. It only ever affects DS2. DS1 doesn't experience the issue.
Going into the host via SSH at this point, I cannot even list the contents of the /var/log or /vmfs/volumes directories. The command just hangs.
I cannot restart the management agents or reboot from the ESXi console. The only way to bring things back to life is to hard restart the host, at which point everything is fine. VMs start and are responsive.
I have tried this but it made no difference.
I have also disabled all power saving options and IOMMU in the host BIOS.
After the reboot I checked vmkernel.log and can see these disk-related messages logged just before the reboot:
2013-05-02T14:59:18.681Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 2 times
vmkernel: 1:02:02:02.206 cpu3:4099)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x410005078e00) to NMP device "naa.6001e4f000105e6b00001f14499bfead" failed on physical path "vmhba1:C0:T0:L100" H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
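For anyone trying to read lines like the NMP one above, here is a rough Python sketch that pulls out the command opcode and the H:/D:/P: status fields and maps the device status to its standard SCSI name. The function name `decode_nmp_line` and the lookup tables are my own illustration, not anything shipped with ESXi, and the tables only cover a handful of common values.

```python
import re

# Standard SCSI device status codes; only a few common ones are listed here.
SCSI_DEVICE_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
}

# A few common SCSI command opcodes (again, not a complete list).
SCSI_OPCODES = {
    0x00: "TEST UNIT READY",
    0x12: "INQUIRY",
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
}

# Matches the "Command 0x.. ... NMP device "..." ... H:0x.. D:0x.. P:0x.." layout
# seen in the nmp_CompleteCommandForPath message quoted in this thread.
NMP_RE = re.compile(
    r'Command (0x[0-9a-fA-F]+) .*?NMP device "([^"]+)".*?'
    r'H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)'
)


def decode_nmp_line(line):
    """Return a dict describing one NMP failure line, or None if it does not match."""
    match = NMP_RE.search(line)
    if match is None:
        return None
    opcode = int(match.group(1), 16)
    device_status = int(match.group(4), 16)
    return {
        "device": match.group(2),
        "opcode": opcode,
        "opcode_name": SCSI_OPCODES.get(opcode, "unknown"),
        "host_status": int(match.group(3), 16),
        "device_status": device_status,
        "device_status_name": SCSI_DEVICE_STATUS.get(device_status, "unknown"),
        "plugin_status": int(match.group(5), 16),
    }


if __name__ == "__main__":
    sample = (
        'vmkernel: 1:02:02:02.206 cpu3:4099)NMP: nmp_CompleteCommandForPath: '
        'Command 0x28 (0x410005078e00) to NMP device '
        '"naa.6001e4f000105e6b00001f14499bfead" failed on physical path '
        '"vmhba1:C0:T0:L100" H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.'
    )
    print(decode_nmp_line(sample))
```

Run against the line quoted above, this reports a READ(10) that came back with device status BUSY, which at least fits a disk or controller that has stopped answering.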
A similar story appears in http://communities.vmware.com/thread/341512 but there doesn't seem to be anything extra here to try that I haven't already.
Any ideas appreciated.
Downgraded to ESXi 5.0.0 U2 and the problem seems to have disappeared: no such vmkernel.log entries in the last 12 hours.
That's a second reason not to upgrade to 5.1. The first is: Datastore speed issue - two same drives
Really pleased if you managed to solve this one with a downgrade. I must admit I lost patience with it in the end, and installed on some different hardware which has been stable on 5.1.
I've already rebuilt the original server as a standalone, so I am unable to try your downgrade solution at the moment, but please post on here if it continues to stay stable.
I confirm. There are no messages like the previous ones in vmkernel.log (before downgrading, the first messages appeared after about 20-30 minutes of ESXi uptime).
And of course no datastore freezing either.
Tested with heavy load of all drives for a few hours.