Hi,
I have a problem on a two-node ESX 4.0 cluster with HA enabled.
ESX patches are up to date.
My hosts disconnect from vCenter periodically (3 times last month). They're marked as "not responding", then "disconnected" if I try to manually reconnect them. I have to reboot the host.
The service console answers pings, but SSH and console logins fail: I can type my login at the prompt, but no password prompt appears afterwards. Connecting with the vSphere Client directly to a host results in a timeout, and so does accessing the HP Management Homepage.
Some VMs are still pinging, but others aren't. All VMs show as "disconnected" in vCenter, and HA does not restart them on Host1.
I use HP BL460c G6 blades for all nodes, in a c7000 chassis, with a Smart Array P410i controller.
vCenter events:
24/10/2009 04:59:32: HA agent on Host2.test.com in cluster TEST has an error: HA agent on the host failed
24/10/2009 04:59:39: Host Host2.test.com in TEST is not responding
All VMs hosted by Host2 are marked as disconnected.
Then, 14 hours later, the other host went down (edit: these two consecutive events don't seem to be linked; maybe I'm just unlucky!):
24/10/2009 19:06:51: Host Host1.test.com in TEST is not responding
24/10/2009 19:06:51: Unable to contact a primary HA agent in cluster TEST
The only VM hosted by Host1 was still pinging, but when I tried to open a remote console to it, the VM went down.
*Update: possible answer*
After a call with HP support and VMware support, it seems that a controller issue is the cause of the crash of the service console.
There is a firmware update (v2.50) for the Smart Array P212, P410, P410i, P411, and P712m ---> here
Fix for a potential controller hang condition (lockup error code 0xBC) seen during heavy I/O.
Fix for a server operating system hang condition encountered during I/O stress tests, such as SQLIO.
Fix for a potential controller hang condition (lockup error code 0xAB) seen when the controller is configured in Zero Memory mode (no cache module installed).
Fix for a potential controller hang condition that may be seen when a second SATA drive fails in a RAID 6 configuration.
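If it helps anyone else checking whether their hosts are still on pre-2.50 firmware, something like the following can be run from the service console. This is only a sketch: it parses a hypothetical sample of `hpacucli` output via a here-string, and on a live host you would pipe the real `hpacucli ctrl all show detail` instead (the utility comes with the HP management tools, and the exact output wording may differ):

```shell
# Hypothetical sample of "hpacucli ctrl all show detail" output;
# on a live host, replace the sample with the real command's output.
sample="Smart Array P410i in Slot 0 (Embedded)
   Firmware Version: 2.00
   Cache Board Present: False"

# Extract the firmware version; anything below 2.50 needs the update above.
fw=$(printf '%s\n' "$sample" | awk -F': *' '/Firmware Version/ {print $2}')
echo "controller firmware: $fw"
```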
Thanks for your help.
Message was edited by: ROM13
Did you upgrade these hosts or were they clean installs? If they were upgrades, despite the time it would take, I would rebuild each one from the latest ESX 4 DVD. It could be remnants from a previous install.
Clean installs only
Have you tried moving one of the blades to a different slot in the chassis, to a 'known working' one? We have had issues where test and dev blades (traditionally just stuck in the end slots of the chassis) weren't getting enough power due to the setup of the power domains, and that caused all sorts of odd errors.
I didn't; it's kind of... weird!
My chassis is not full of blades; I have 4 empty slots. But... why not?
It could also be a hardware problem, since the environment is identical for both clusters. But I can't find any clues while parsing the logs...
You could also try creating a new cluster in VC and dropping these two hosts in one by one; it might force some checks for HA and other things that might not be apparent.
If you were really feeling crazy, you could stick one of these hosts into the working cluster and take one out of that cluster into this one, and see if it's a settings thing. I appreciate you may not be able to / want to if it is production.
Dan
I found other users experiencing much the same issue:
http://communities.vmware.com/message/1399740#1399740
http://communities.vmware.com/message/1300379#1300379
Moderator: could this thread be moved to VMware Communities > VMTN > VMware vSphere™ > VMware ESX™ 4 > Discussions?
Thanks
Hi ROM13, I have exactly the same issue you are describing, with nearly the same hardware. Mine is a c7000 blade chassis with 3 BL460c blades. I have made changes to the SC memory, changed the MPIO driver on the VC (recommended by HP; it didn't really help at all), and added second SCs to the hosts, but I'm still having the same issue... Did you manage to get this resolved? I have all the latest patches installed, with all the latest firmwares (as per what HP wanted), but the issue still crops up.
Ant
Antnic:
The problem is not resolved; the SR is still open.
I have 3 x BL460c G5 (X5550 "Nehalem" processors).
Hyperthreading is ON.
I have edited the first post, since hyperthreading is another difference between the two-node cluster which is OK and the problematic one.
Do you have HP Management Agents 8.2.5 installed? Someone at VMware told me that earlier versions had memory leak issues in hostd, caused by hpmgmt.
(And please excuse my English, mes amis.)
Message was edited by: ROM13
Hi ROM,
Sorry, I don't have the agents installed; I thought it wise just to let ESX do its own thing. But then, I have heard that the CIM service was causing major issues on some hardware in ESX 3.5; to resolve this you had to disable the service, and that seemed to cure the issue. VMware have said that this should not be occurring in vSphere...
You have very similar hardware to mine, same proc; the only real difference is that I have the G6 installed, but I don't think there is much difference.
Let me know if you hear of anything. I did get into a heated discussion about my VCB backups being the issue... I now know better, since there are others like me with the same issue.
A
In my opinion, it's a hostd problem, but I'm still looking for the cause of this failure.
I don't use VCB. I've deactivated the hpmgmt service for the moment.
Support asked me to set the hostd logging level to "trivia" in /etc/vmware/hostd/config.xml.
Now I just have to wait for my production server to disconnect again...
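For anyone following along, the change support asked for is made in the <log> section of /etc/vmware/hostd/config.xml. A sketch of what it looks like (element layout from memory, it may differ slightly on your build, so check your existing file before editing):

```xml
<config>
  <log>
    <!-- raise verbosity to "trivia" for support troubleshooting -->
    <level>trivia</level>
    <!-- other log settings unchanged -->
  </log>
  <!-- rest of config.xml unchanged -->
</config>
```

After editing, the management agents need restarting (`service mgmt-vmware restart`) for the new level to take effect; bear in mind "trivia" logging grows hostd.log very quickly.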
I always found it really annoying that it disconnects after 4-5 days!
Hi again. When your server does disconnect, reboot it, but watch to see if you get an array lockup error on the array... this is what I have seen for the last two lockups, on two different servers.
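One way to look for that after the fact is to grep the service-console logs for the lockup codes HP mentions. The sketch below only scans a hypothetical sample line; the message wording and the log paths (/var/log/vmkernel, /var/log/messages) are assumptions, so adapt to what your host actually logs:

```shell
# Hypothetical example of a controller lockup message; on a real host
# you would scan the actual logs instead, e.g.:
#   grep -icE 'lockup|0xab|0xbc' /var/log/vmkernel /var/log/messages
sample_line="cpqarrayd: Smart Array controller lockup detected, code 0xBC"

# Count lines mentioning a lockup or one of the known lockup codes.
matches=$(printf '%s\n' "$sample_line" | grep -icE 'lockup|0xab|0xbc')
echo "$matches suspicious line(s) found"
```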
Do you mean the internal disk array or the SAN? If internal, which events / log files did you check?
I've heard there's a known bug with the Smart Array P410i controller in the HP 460c series. Can you confirm?
By the way, thanks for your answers.
It is the internal array. I will check with the HP rep, then, to see if there is a bug.
Oh! It seems we have something serious here!
My server disconnected again last night. Same symptoms.
This time, we can see lots of disk-related errors (maybe the controller?) on the server console. I don't remember seeing them last time.
I'm going to check the hostd.log files as well, but I can't reboot the host at the moment (last time, another node got disconnected; I have to wait until the end of the day...).
Antnic, have you seen this kind of thing?
Screenshot:
And I've found something else concerning a controller issue, which may be related to your (our?) problem, although they talk about high latency, not disconnections.
ESX 4.0 release notes (search for P410)
HP P410i Customer Notice about a bug with the RAID controller
Moved at original poster's request. - Robert
Robert Dell'Immagine, Director of VMware Communities
Hi,
I have a problem on a two-node ESX 4.0 cluster with HA enabled.
Firmwares/ESX patches are up to date.
Any ideas?
I believe that this behaviour is by design. HA doesn't work when there is a problem with the host RAID controller.
In VC, the failed host and its VMs are marked as "disconnected", but ping to the service console is OK. And the VMs come back online on another host only when the failed host is shut down.
Some similar situations are described here: http://communities.vmware.com/message/1128708
Hi, I think that is the error that VMware picked up on once I got past HP support. There is still no resolution for it unless you buy the cache unit; I think I may have to suggest to the customer that we move to the 4i.
Ant
After a call with HP support and VMware support, it seems that a controller issue is the cause of the crash of the service console.
There is a firmware update (v2.50) for the Smart Array P212, P410, P410i, P411, and P712m ---> here
Fix for a potential controller hang condition (lockup error code 0xBC) seen during heavy I/O.
Fix for a server operating system hang condition encountered during I/O stress tests, such as SQLIO.
Fix for a potential controller hang condition (lockup error code 0xAB) seen when the controller is configured in Zero Memory mode (no cache module installed).
Fix for a potential controller hang condition that may be seen when a second SATA drive fails in a RAID 6 configuration.
I have updated the first post.
Thanks for your help!