VMware Cloud Community
sfonten
Contributor

console error message

Has anyone seen this message on the console before?

2:11:56:14.323 cpu0:1024)VMNIX: <0>scsi: device set offline - command error recovery failed: host 0 channel 0 id 0 lun 0

thanks

Faustina
Enthusiast

Post snippets of your /var/log/vmkwarning and /var/log/vmkernel logs from around that same time.
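Something along these lines should pull out the relevant entries (these are the standard ESX 3.x Service Console log paths; <timestamp> is just a placeholder for the time you saw the console error):

# search both logs for entries around the time of the error, with 5 lines of context
grep -C 5 "<timestamp>" /var/log/vmkernel /var/log/vmkwarning
# or simply grab the last couple hundred lines of each log
tail -n 200 /var/log/vmkernel /var/log/vmkwarning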

VirtualKenneth
Virtuoso

Dude, first read this:

http://www.vmware.com/community/help.jspa#questionrp

None of the threads you've posted so far are set up the way they should be!

Mark them as questions.

sfonten
Contributor

The issue is a result of IBM Director crashing the Service Console OS in ESX 3.0.1.

jesse_gardner
Enthusiast

sfonten, can you comment on how you know that? I have the Director agent installed, and last night I encountered this error.

The VMs are still running, but I don't see a way to get them off the host gracefully. Did you have to hard-power-off the server?

ian_griffin
Contributor

I am also interested in how you are relating this to IBM Director; I currently have a critical SR raised with VMware Support.

I have removed Director from two of the servers which experienced this problem, and it has not happened again. But I need to be able to prove that the issue is Director before I can raise it with IBM.

jesse_gardner
Enthusiast

Ian, great, another person. Let's confirm our symptoms.

First off, the virtual machines were still running fine, but if you rebooted a VM (via RDP, etc.), it wouldn't come back.

Seemingly out of the blue, the host goes Disconnected in VirtualCenter. There's a status message for the host saying something like "There is an error with the HA agent".

You can't PuTTY/ssh into it, nor is the local management website accessible.

On the main console screen, the following messages are in red:

VMNIX: <0>scsi: device set offline - command error recovery failed: host 0 channel 0 id 0 lun 0
VMNIX: <0>journal commit I/O error

If you press Alt+F12 to view the non-interactive vmkernel log, it's getting spammed. The timestamps flying by are several hours behind the current time. The messages look like:

07:25:15.281 cpu2:1038)SCSI: 3175: vmhba0:0:0:0 Abort cmd on timeout failed,s

07:25:15:341 cpu2:1038)SCSI:3169: vmhba0:0:0:0 Abort cmd due to timeout, s/n=

07:25:15:351 cpu2:1038)LinSCSI:3596: Aborting cmds with world 1024, originHan

07:25:15:360 cpu2:1038)LinSCSI: 3612: Abort failed for cmd with serial=2, stat

I opened a case with VMware Support while the problem was still occurring. They said that on the surface it looks like the filesystem went read-only. ESX is installed to local SCSI disks; it does not boot from SAN.
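(For future reference, if you can still get a shell on the local console, a quick sanity check for a read-only root filesystem is something like the following; the test file name is just a throwaway:)

mount | grep " / "              # look for "ro" instead of "rw" in the mount options for /
touch /rw-test && rm /rw-test   # fails with "Read-only file system" if / has gone read-only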

I had to forcefully power cycle the host; there was no other way in. Afterwards, it ran an fsck but otherwise came up fine.

The vm-support script didn't provide any useful information in the logs for them. The only thing they noticed was that the time in the logs was jumping around a little. Apparently I had applied the VMware best practices for NTP (http://kb.vmware.com/KanisaPlatform/Publishing/408/1339_f.SAL_Public.html) before they amended the document to add the "restrict 127.0.0.1" line. So my host was constantly getting the following line in the messages log:

"ntpd returns a permission denied error!"

VMware support saw these messages, told me how to fix it, and left it at that. This NTP configuration problem may or may not be related.
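(For anyone else hitting the same ntpd permission errors, the relevant part of /etc/ntp.conf ends up looking roughly like this once the "restrict 127.0.0.1" line from the amended KB is in place; the server name is a placeholder and your default restrict line may be worded differently:)

restrict default kod nomodify notrap noquery nopeer   # example default policy; yours may differ
restrict 127.0.0.1                                    # the line the amended KB adds, so ntpd can query itself
server ntp.example.com                                # placeholder for your NTP server

service ntpd restart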

This is an IBM x3850 with the latest BIOS, 4-way 3.66 GHz single-core, 32 GB RAM, running ESX 3.0.1 with all patches installed through the end of January. As a matter of fact, I had just installed those January patches a few days before this crash. It was running IBM Director v5.10.3, which I've since uninstalled because of this thread.
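(If anyone wants to confirm whether the Director agent is actually gone after an uninstall, something like this from the Service Console works; the grep pattern is just a guess, since the exact RPM names vary between Director versions:)

rpm -qa | grep -i director   # lists any installed packages with "director" in the name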

ian_griffin
Contributor

Jesse

My symptoms are exactly the same: the same error messages on the console in red, the host going disconnected, and the virtual machines remaining running for a short period of time.

The COS disk subsystem gets set read-only.

No error messages are captured in the logs because the log directory is on the COS disk (I have now mounted /var/log on a SAN disk to try to capture any future crashes).
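(For reference, relocating /var/log generally looks something like the sketch below; /dev/sdX1 is only a placeholder for whatever SAN-backed ext3 partition you use:)

mkdir /mnt/newlog
mount /dev/sdX1 /mnt/newlog      # placeholder device name for the SAN-backed partition
cp -a /var/log/. /mnt/newlog/    # copy the existing logs across
umount /mnt/newlog
mount /dev/sdX1 /var/log         # mount the new partition over /var/log
service syslog restart           # so syslog reopens its files on the new partition
# add a matching line to /etc/fstab so it survives a reboot, e.g.:
# /dev/sdX1   /var/log   ext3   defaults   1 2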

I have removed the Director agents (also 5.10.3) and the two servers have not suffered again (yet!).

I have seen this problem on two IBM x445s and four IBM x440s.

The x440s have not had a recurrence since a ServeRAID firmware update, but this did not fix the x445s.

The next option is to reinstall the Director agents on the x445s and see if the error comes back.

The issue is still a critical SR with VMware.

jesse_gardner
Enthusiast

Thank you, Ian. Either I was not forceful enough or my support rep wasn't diligent enough, but we let the case get closed due to lack of information. I've only had the problem occur once on one server, and we just hope it doesn't happen again. If your support rep is curious, my SR # was 364425.

Please feel free to use my experience to aid your case, and keep this thread updated with anything you learn.

Thank you.

Jesse

sfonten
Contributor

Jesse,

Yes, I just had to reboot the host. I can verify because I opened a case and they looked at the logs.

jesse_gardner
Enthusiast

Did you open the case with IBM or VMware? VMware wasn't able to find anything in my logs.

mfleener
Contributor

I'm getting the same errors reported in this thread, with the same symptoms. The hosts are completely unresponsive, but the VMs are working just fine. RDP works great on every guest OS; there's just no way to bring them back if they get shut down, at least on the same host.

Here is the kicker: I'm a Dell-only shop. I have two PE 2850s affected by this issue. Setup: ESX 3.0.1, VC 2.0.1, fiber-based SAN, single HBAs, dual NICs.

I'm currently working with Dell on this issue and they are bringing in tier 3 from VMware. I do not run Dell's server management suite.

The only third-party product I have ever put on an ESX host was Virtugo, but that was back in the 2.5 days.

Did anyone get to the bottom of this issue?

jesse_gardner
Enthusiast

I did not get to the bottom of it, but I only experienced it once. I uninstalled IBM Director because of this thread and the problem hasn't reoccurred, but that may be a fluke and Director may not have been involved.

Please let us know if you find anything.

millerda
Contributor

I was having this exact problem on an x366 server with a ServeRAID 8i controller (Adaptec), and updating the firmware/BIOS seems to have fixed it. I have no IBM Director agents loaded.

Thanks,

jamome
Enthusiast

Hi, I am having the same problem. We have a few Dell 2950 servers, which are brand new, with ESX 3.0.1 and VC 2.0.1 installed. I was using the migration wizard to move two VMs from one ESX host to another. Halfway through the migrations (31% on one, 40% on the other) the destination ESX became unresponsive. I went to look at that ESX's console, and the same message was there in red. The host also became disconnected.

We do *not* have IBM Director installed.

Has anyone found a patch for this? Is VMware considering this issue worth fixing? We have been extremely happy with our ESXs and consider them money very well spent. We hope to continue to have good results. This case in point is not a good result.

ian_griffin
Contributor

This has now been resolved by updating the firmware of the IBM ServeRAID adapter to at least 7.12.12. VMware is now working with IBM to understand exactly what the issue was, what the root cause was, and what fix was included in this firmware version.

I had a number of hosts that would give this error as soon as they booted; applying the firmware update resolved them all straight away.
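(If you want to see what firmware the ServeRAID adapter is reporting before and after the update, something like this from the Service Console usually shows it; the exact strings vary by adapter and driver version:)

dmesg | grep -i serveraid   # see what the driver logged about the adapter at boot
cat /proc/scsi/scsi         # lists the adapters and disks the Service Console can see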

acmcnick
Enthusiast

We just experienced the same issue with a PE 2950 connected to a Network Appliance 3050c. It came at the same time as a local SCSI disk drive failure. Any idea what causes this?

acmcnick
Enthusiast

To answer my own question: it turns out that in our case it was a firmware bug in the PERC 5/i controller card on the Dell 2950s. We updated the firmware and everything is functional at this point.

Here is the link; this firmware update was released 4/10/07 and is rated Urgent by Dell.

http://support.dell.com/support/downloads/download.aspx?c=us&l=en&s=gen&releaseid=R149666&SystemID=P...
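(To check what the PERC 5/i is reporting before and after the flash, something like this works from the Service Console; megasas is the driver I'd expect the PERC 5/i to load under, and omreport is only available if Dell OpenManage is installed:)

dmesg | grep -i -E "megasas|perc"   # see what the driver logged when it loaded
omreport storage controller         # OpenManage view of the controller, including firmware level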

acmcnick
Enthusiast

The root cause was that the server failed a drive two weeks ago; we replaced the drive and the array rebuilt. The second drive then failed over the weekend (RAID 1), and this failure caused the logical disk to become unavailable to the OS momentarily, resulting in a read-only state of the Service Console OS. This was remedied in the new firmware update for the PERC 5/i.

HenriP
Contributor

I'm writing to share my experience. We have Dell 2950s connected to an HP MSA1500 SAN. The Dells boot ESX 3.0.1 from local drives; the VMs live on the SAN. We are using the VI Client; we don't have VirtualCenter. The server had been running for about 4 months with no problems, hosting about 8 VMs. This is our first experience with VMware.

Last week we lost all management contact with the Dell 2950 ESX server. No web access, no ssh, the VI Client couldn't connect, and all of the local console ttys were unresponsive. On the console Alt-F11 screen a message appeared in red: "124:01:45:44.712 cpu0:1024)VMNIX: <0>scsi : device set offline - command error recovery failed : host 0 channel 0 id 0 lun 0". On the Alt-F1 tty there was a rolling screen of errors like "I/O error : dev 08:02, sector 4568672"; sometimes the sector was different. On the Alt-F12 tty there were a couple of SCSI errors listed, leading up to the same error in red that I've quoted above from the Alt-F11 screen.

All of the virtual machines were functioning normally throughout this problem. During a maintenance window we shut down all of the virtual machines on the problem ESX box and then had to hard power-off the ESX server. We registered the VMs on a different Dell 2950 ESX server and brought them back online. Minimal loss of service.

We completely disconnected the problem ESX server from the network and the SAN before turning it back on. It came up with no problems and no errors other than the file system integrity check warning that it had been shut down ungracefully.

(Note: I have seen some error messages on boot:

May 30 14:19:42 vmesx1 kernel: I/O error: dev 08:50, sector 0

May 30 14:19:42 vmesx1 kernel: sdf: I/O error: dev 08:50, sector 0

But these seem to be explained as a known issue with Dell/VMware, "Virtual floppy I/O errors in /var/log/messages after system boot up", http://www.dell.com/downloads/global/solutions/Installing_Dell_OpenManage_50_on_ESX_3.pdf, p. 18 of 21.)
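(A quick way to confirm those boot-time I/O errors are only against the virtual floppy device is something like this; dev 08:50 is the device number that mapped to sdf on our box:)

grep "I/O error" /var/log/messages | grep "08:50"   # show only the I/O errors for that one device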

I have been working with Dell and VMware Support. We have upgraded the BIOS, PERC 5/i firmware, SAS MAX3147RC hard drive firmware, and Dell SAS backplane firmware. Hopefully this will solve the problem. I just saw this set of posts today, so I'm feeling better knowing that the fix was found and that the PERC 5/i firmware was the root cause.
