VMware Cloud Community
MAPABVBA
Contributor

ESX 4.1.0 and device failed error recovery - setting offline

Hello,

I have a strange problem on two servers (HP DL350 G5 and HP ML350 G6).

On both servers I see the following error on the monitor connected to the physical server:

cpu0:4096 VMNIX: <0>st 6:0:0:0: SCSI: Device failed error recovery - Setting offline

But when I look in the vSphere Client, I don't see any errors.

I ran some tests and I think I know what the problem is, but I don't know how to resolve it:

Both servers have HP tape drives attached to a Smart Array P212. When I remove the tape drive and leave the server running for a day, I don't get the error. When I put the tape drive back, I still see no errors at first, but after 6 to 10 hours the error comes back.

So I'm almost sure the problem is with the tape drive or the Smart Array. But the vSphere Client shows no errors, so I don't know how to track down the cause or find a fix. That's why I need your help. It's very urgent.

Thank you very much.

7 Replies
Troy_Clavell
Immortal

See the KB article below; it could be the Smart Array P212 controller.

http://kb.vmware.com/kb/1003316

piaroa
Expert

Start by checking your ESX host logs, in particular the vmkwarning and hostd logs. You can reach them in a browser at https://yourhostip/host
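Once you have a copy of a log, a small script can pull out the offline events so you can see which SCSI address they refer to. This is only a minimal sketch: it assumes the log lines look like the console message quoted in the first post, and the function name `find_offline_events` is made up for illustration. The `st` prefix is normally the Linux SCSI tape driver, which would fit the tape drive theory.

```python
import re

# Pattern modelled on the console message from the original post (assumption:
# the log files use the same wording as the console).
OFFLINE_RE = re.compile(
    r"(?P<driver>\w+) "
    r"(?P<host>\d+):(?P<channel>\d+):(?P<target>\d+):(?P<lun>\d+): "
    r"SCSI: Device failed error recovery - Setting offline"
)


def find_offline_events(log_text):
    """Return one dict per 'Setting offline' message found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = OFFLINE_RE.search(line)
        if match:
            address = tuple(
                int(match.group(key))
                for key in ("host", "channel", "target", "lun")
            )
            events.append({
                "driver": match.group("driver"),
                "address": address,  # (host, channel, target, lun)
                "line": line.strip(),
            })
    return events


if __name__ == "__main__":
    sample = (
        "cpu0:4096 VMNIX: <0>st 6:0:0:0: SCSI: "
        "Device failed error recovery - Setting offline"
    )
    for event in find_offline_events(sample):
        print(event["driver"], event["address"])
```

If every hit reports the same driver and address, that points at a single device rather than the controller as a whole.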

MAPABVBA
Contributor

Thank you for the fast reply.

I didn't know about that, and I'm not sure what I should use to view the logs.

Can you show me, please?

Thanks.

MAPABVBA
Contributor

Well, honestly, I don't know.

On other servers that I set up earlier for other clients, I didn't have this problem.

These two are the first.

Eight hours after a reboot, the error came back. If the problem is the Smart Array P212, do you know how to resolve it? Or should I contact the vendor, as the KB article says?

Troy_Clavell
Immortal

If the controller is bad, then HP will have to replace it. If you can, I would run the HP SmartStart diagnostics on it to see whether any errors are detected.

MAPABVBA
Contributor

I can do that, but I'll do it tomorrow, because I have already restarted the server and I'd like to see whether that helped.

I'll let you know. Thank you for the fast reply!

krishnaprasad
Hot Shot

This error message can appear when "dead" entries have been created on the system.

The esxcfg-scsidevs -l command will show you whether the server has any dead entries (caused by SCSI devices). Can you post the command output here?

Rebooting the system removes all dead entries.

Alternatively, you can remove the dead entries manually with esxcfg-rescan <vmhba>, where vmhba is the name of the adapter behind the dead SCSI device. That command rescans the adapter and, if the device is found dead, removes its entry.
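If you want to script the two steps, something like the following could work. It is only a rough sketch under some assumptions: that a Python 3 interpreter is available wherever the esxcfg tools are installed (which may not be the case on the ESX 4.1 service console, where you can simply type the commands instead), and that the adapter is called `vmhba1`, which is a placeholder and not taken from this thread. The helper names are made up for illustration.

```python
import re
import subprocess

# ESX adapter names follow the pattern vmhba<number>; anything else is rejected
# so that a typo is never passed on to the rescan command.
ADAPTER_RE = re.compile(r"^vmhba\d+$")


def list_devices_command():
    """Command that lists the SCSI devices known to the host."""
    return ["esxcfg-scsidevs", "-l"]


def rescan_command(adapter):
    """Command that rescans a single adapter, e.g. 'vmhba1' (placeholder)."""
    if not ADAPTER_RE.match(adapter):
        raise ValueError("unexpected adapter name: %r" % adapter)
    return ["esxcfg-rescan", adapter]


def run(command):
    """Run a command and return its output as text (needs the esxcfg tools)."""
    return subprocess.check_output(command).decode()


if __name__ == "__main__":
    # Show the commands instead of running them; call run(...) on the host.
    print(" ".join(list_devices_command()))
    print(" ".join(rescan_command("vmhba1")))
```

Run the listing before and after the rescan and compare the two outputs to confirm that the dead entry is gone.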
