VMware Cloud Community
ROM13
Contributor

Host not responding and disconnected / Service Console pings but can't log in - HP Smart Array P410i issue

Hi,

I have a problem on a 2-node ESX 4.0 cluster with HA enabled.

ESX patches are up to date.

My hosts periodically disconnect from vCenter (3 times last month). They are marked as "not responding", then "disconnected" if I try to reconnect them manually. I have to reboot the host.

  • The service console pings, but SSH and console logins fail: I can type my login at the prompt, but no password prompt follows. Connecting with the vSphere Client directly to a host times out, and so does the HP Management Homepage.

  • Some VMs are still pinging, but others are not. All VMs show as "disconnected" in vCenter, and they do not restart on Host1.

  • All nodes are HP BL460c G6 blades in a c7000 chassis, with a Smart Array P410i controller.

vCenter events:

24/10/2009 04:59:32: HA agent on Host2.test.com in cluster TEST has an error: HA agent on the host failed

24/10/2009 04:59:39: Host Host2.test.com in TEST is not responding

All VMs hosted by server2 are marked as disconnected.

Then, 14 hours later, the other host went down (edit: these two consecutive events don't seem to be linked, maybe I'm just unlucky!):

24/10/2009 19:06:51: Host Host1.test.com in TEST is not responding

24/10/2009 19:06:51: Unable to contact a primary HA agent in cluster TEST

The only VM hosted by Host1 was still pinging, but when I tried to remote-control it, the VM went down.

*Update : possible answer*

After a call with HP support and VMware support, it seems that a controller issue is causing the service console crashes.

There's a firmware update (v2.50) concerning Smart Array P212, P410, P410i, P411, and P712m ---> here

  • Fix for a potential controller hang condition (lockup error code 0XBC) seen during heavy I/O.

  • Fix for a server operating system hang condition encountered during IO stress tests, such as SQLIO.

  • Fix for a potential controller hang condition (lockup error code 0XAB) seen when the controller is configured in Zero Memory Mode (no cache module installed).

  • Fix for a potential controller hang condition that may be seen when a 2nd SATA drive fails in a RAID 6 configuration.
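If you want to check whether a host is still on older controller firmware before/after applying that update, something like this from the service console should show it (just a sketch, and it assumes HP's hpacucli tool from the management agents / PSP is installed):

    # Sketch only: needs HP's hpacucli on the service console.
    # Shows controller status plus the firmware version; anything below
    # v2.50 on a P410/P410i would be a candidate for the update above.
    hpacucli ctrl all show status
    hpacucli ctrl all show config detail | grep -i -E 'firmware|slot'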

Thanks for your help.

Message was edited by: ROM13

21 Replies
a2alpha
Expert

Did you upgrade these hosts or were they clean installs? If they were upgrades, despite the time it would take, I would rebuild each one from the latest ESX 4 DVD. It could be remnants from the previous install.

ROM13
Contributor

Clean installs only, unfortunately :(

a2alpha
Expert

Have you tried moving one of the blades to a different location in the chassis, to a 'known working' slot? We have had issues where test and dev blades (traditionally just stuck in the end slots of the chassis) weren't getting enough power, due to the way the power domains were set up, and that caused all sorts of odd errors.

ROM13
Contributor

I didn't; it does seem a bit weird!

My chassis is not full of blades, I have 4 empty slots. But... why not?

It could also be a hardware problem, since the environment is identical for both clusters. But I can't find any clues while parsing the logs... :(

a2alpha
Expert

You could also try creating a new cluster in VC and dropping these two hosts in one by one; it might force some checks for HA and other things that might not be apparent.

If you were really feeling adventurous, you could put one of these hosts into the working cluster and move one out of that cluster into this one, to see if it's a configuration issue. I appreciate you may not be able to, or want to, if it is production.

Dan

ROM13
Contributor

Found other users experiencing much the same issue:

http://communities.vmware.com/message/1399740#1399740

http://communities.vmware.com/message/1300379#1300379

Moderator: could this thread be moved to VMware Communities > VMTN > VMware vSphere™ > VMware ESX™ 4 > Discussions?

Thanks

antnic
Contributor

Hi ROM13, I have exactly the same issue you are describing, with nearly the same hardware; mine is a c7000 blade chassis with 3 BL460c blades. I have made changes to the SC memory, changed the MPIO driver on the VC (recommended by HP, it didn't really help at all) and added second SCs to the hosts, but I'm still having the same issue... Did you manage to get this resolved? I have all the latest patches installed, with all the latest firmware (as per what HP wanted), but the issue still crops up.

Ant

ROM13
Contributor

Antnic:

The problem is not resolved; the SR is still open.

I have 3 x BL460c G5 (X5550 processors - "Nehalem").

Hyperthreading is ON.

I've edited the first post, since hyperthreading is another difference between the 2-node cluster which is OK and the problematic one.

Do you have HP Management Agents 8.2.5 installed? Someone at VMware told me that earlier versions had memory leak issues in hostd, caused by hpmgmt.
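If you want to see whether hostd really is leaking, a crude way to track it from the service console is something like this (sketch only; the process may show up as hostd or vmware-hostd depending on the build):

    # Log hostd memory usage every 5 minutes so growth over several days is visible
    while true; do
        date >> /tmp/hostd-mem.log
        ps auxwww | grep -i hostd | grep -v grep >> /tmp/hostd-mem.log
        sleep 300
    done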

(and please excuse my English, mes amis :) )

Message was edited by: ROM13

antnic
Contributor

Hi ROM

Sorry, I don't have the agents installed; I thought it wise just to let ESX do its own thing. That said, I have heard that the CIM service was causing major issues on some hardware in ESX 3.5; to resolve it you had to disable the service, and that seemed to cure the issue. VMware have said that this should not be occurring in vSphere...
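(For reference, on ESX 3.5 classic the CIM broker generally ran as the 'pegasus' service, so that workaround looked roughly like the sketch below; the service name is worth double-checking with chkconfig --list on your own build.)

    # ESX 3.5 classic, sketch only -- confirm the CIM service name first
    chkconfig --list | grep -i pegasus
    service pegasus stop
    chkconfig pegasus off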

You have very similar hardware to what I have installed, same proc; the only real difference is that I have the G6, but I don't think there is much difference.

Let me know if you hear of anything. I did get into a heated discussion about my VCB backups being the issue... I now know better, since there are others like me with the same issue.

A

ROM13
Contributor

In my opinion it's a hostd problem, but I'm still looking for the cause of the failure.

I don't use VCB. I've deactivated the hpmgmt service for the moment.

Support asked me to set the hostd logging level to "trivia" in /etc/vmware/hostd/config.xml.
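For anyone else asked to do the same, the change is roughly this (sketch only: the existing default level and the exact element layout can differ between builds, and the sed below bumps every <level> element it finds, so check the file first):

    # Back up hostd's config, raise the log level to "trivia", then restart
    # the management agents so hostd picks the new level up
    cp /etc/vmware/hostd/config.xml /etc/vmware/hostd/config.xml.bak
    sed -i 's|<level>[a-zA-Z]*</level>|<level>trivia</level>|' /etc/vmware/hostd/config.xml
    service mgmt-vmware restart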

Now I just have to wait for my production server to disconnect again...

antnic
Contributor

I always found it really annoying that it disconnects after 4-5 days!

antnic
Contributor

Hi again. When your server does disconnect, reboot it, but watch to see if you get an array lockup error on the controller; this is what I have seen for the last two lockups, on two different servers.

ROM13
Contributor

Do you mean the internal disk array or the SAN? If internal, which events / log files did you check?

I've heard there's a known bug with the Smart Array P410i controllers in the HP 460c series. Can you confirm?

By the way, thanks for your answers :)

antnic
Contributor

It is the internal one. I will check with the HP rep, then, to see if there is a bug.

ROM13
Contributor

Oh! It seems we have something serious here!

My server disconnected again last night. Same symptoms.

This time, we can see a lot of disk-related errors (maybe controller?) on the server console. I don't remember seeing them last time.

I'm going to check the hostd.log files as well, but I can't reboot the host at the moment (last time, another node got disconnected; I have to wait until the end of the day...).

Antnic, have you seen this kind of thing?

Screenshot :

7542_7542.JPG
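In case it's useful, the places I'd expect controller trouble to surface on an ESX 4.0 classic service console are the vmkernel and messages logs (the P410i sits behind the cciss driver there), plus hostd.log for the "not responding" side. Roughly this (a sketch, assuming standard log paths):

    # Look for controller / cciss errors around the time of the hang
    grep -i -E 'cciss|lockup|scsi' /var/log/vmkernel /var/log/messages | tail -n 50
    # hostd's own log, for the management-agent side of the failure
    tail -n 200 /var/log/vmware/hostd.log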

And I've found something else concerning a controller issue, which may be related to your (our?) problem, although they talk about high latency rather than disconnections:

Esx 4.0 release notes (search for P410)

HP P410i Customer Notice about bug with raid controller

bad-performance-on-hp-proliant-dl380-g6

admin
Immortal

Moved at original poster's request. - Robert

Robert Dell'Immagine, Director of VMware Communities

SergTan
Contributor

Hi,

I have a problem on a 2-node ESX 4.0 cluster, HA enabled.

Firmware / ESX patches are up to date.

Any ideas?

I believe this behaviour is by design: HA doesn't work when the problem is the host's RAID controller.

In VC, the failed host and its VMs are marked as "disconnected", but ping to the service console is OK, and the VMs are only brought online on other hosts once the failed host is shut down. HA only reacts when the host's heartbeats stop, so as long as the service console is still alive enough to answer, the host is treated as up even though hostd and its storage are hung.

Some similar situations are described here: http://communities.vmware.com/message/1128708

and here

antnic
Contributor

Hi, I think that is the error that VMware picked up on once I got past HP support. Still no resolution for it unless you buy the cache unit; I think I may have to suggest to the customer that they move to 4i.

Ant

ROM13
Contributor

After a call with HP support and VMware support, it seems that a controller issue is causing the service console crashes.

There's a firmware update (v2.50) concerning Smart Array P212, P410, P410i, P411, and P712m ---> here

  • Fix for a potential controller hang condition (lockup error code 0XBC) seen during heavy I/O.

  • Fix for a server operating system hang condition encountered during IO stress tests, such as SQLIO.

  • Fix for a potential controller hang condition (lockup error code 0XAB) seen when the controller is configured in Zero Memory Mode (no cache module installed).

  • Fix for a potential controller hang condition that may be seen when a 2nd SATA drive fails in a RAID 6 configuration.

I've updated the first post.

Thanks for your help!
