We had an ESX 3.5 (Update 1) host hang/panic yesterday. I first noticed a problem when the host appeared disconnected in the VI Client. The VM guests were still running fine, but they also showed as disconnected. At this point I could not access the service console via SSH or the web interface, so I went over to the server room and hooked up a monitor and keyboard.
The console reported in red text: "cpu0:)1024 VMNIX : scsi : device set offline -command error recovery failed : host 1 channel 0 id 0 lun 0". When I switched to the command line with Alt-F1, the error message "I/O error : dev 08:02, sector 6180680" was looping, so the command prompt was unusable. Our only option was to power down the VMs running on this host and hard-reset the box. I have left the host in maintenance mode until I can figure out a resolution.
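If you need to pick these two messages out of a saved console capture or log, a small parser helps. The Python sketch below is a hypothetical helper, not a VMware tool: the function names are made up, and it assumes the message layout quoted above. It also assumes the service console uses classic Linux block-device numbering (major 8 = SCSI disks, 16 minors per disk) with the major:minor pair printed in hex, so "dev 08:02" would be the second partition of the first SCSI disk (sda2).

```python
import re
from typing import Optional

# Matches the SCSI "device set offline" message quoted above; spacing around
# the dash and colons is optional because console output is not consistent.
SCSI_OFFLINE = re.compile(
    r"device set offline\s*-\s*command error recovery failed\s*:\s*"
    r"host (\d+) channel (\d+) id (\d+) lun (\d+)",
    re.IGNORECASE,
)

# Matches the looping "I/O error : dev MM:mm, sector N" message.
IO_ERROR = re.compile(
    r"I/O error\s*:\s*dev ([0-9a-f]{2}):([0-9a-f]{2}),\s*sector (\d+)",
    re.IGNORECASE,
)


def parse_scsi_offline(line: str) -> Optional[dict]:
    """Return the SCSI address (host/channel/id/lun) of the offlined device."""
    m = SCSI_OFFLINE.search(line)
    if m is None:
        return None
    host, channel, target, lun = (int(g) for g in m.groups())
    return {"host": host, "channel": channel, "id": target, "lun": lun}


def sd_name(major: int, minor: int) -> Optional[str]:
    """Map a device number to an sd name, ASSUMING classic Linux numbering:
    major 8 covers sda..sdp, 16 minors per disk, minor % 16 = partition."""
    if major != 8:
        return None
    disk = chr(ord("a") + minor // 16)
    part = minor % 16
    return f"sd{disk}{part if part else ''}"


def parse_io_error(line: str) -> Optional[dict]:
    """Decode the dev major:minor (assumed hex) and the failing sector."""
    m = IO_ERROR.search(line)
    if m is None:
        return None
    major, minor = int(m.group(1), 16), int(m.group(2), 16)
    return {
        "major": major,
        "minor": minor,
        "device": sd_name(major, minor),
        "sector": int(m.group(3)),
    }


if __name__ == "__main__":
    console = [
        "cpu0:)1024 VMNIX : scsi : device set offline -command error recovery "
        "failed : host 1 channel 0 id 0 lun 0",
        "I/O error : dev 08:02, sector 6180680",
    ]
    for line in console:
        # Try the SCSI pattern first, then fall back to the I/O error pattern.
        print(parse_scsi_offline(line) or parse_io_error(line))
```

Under those assumptions the two lines decode to SCSI address host 1 / channel 0 / id 0 / lun 0 and device sda2, which is consistent with the local RAID 1 volume holding the service console going offline.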
This issue is covered in KB article 1003316.
The host is a Dell PowerEdge 2950; ESX is installed on a two-disk SAS RAID 1 array on a PERC 5/i controller. It was part of a six-node cluster with all VMs stored on a NetApp 3020 via NFS.
badblocks and fsck both check out OK, and so do all the Dell diagnostics.
Does anyone have any experience with this issue?
I have, and I had this bookmarked. The following KB article describes your issue exactly. I recall updating the affected node (hardware-related patches, firmware, etc.).
http://kb.vmware.com/selfservice/viewContent.do?externalId=1003316&sliceId=1
This knowledge base article is the correct resolution for this issue:
SCSI: device set offline - command error recovery failed (1003316)
Rick Blythe
Social Media Specialist
VMware Inc.