VMware Cloud Community
Hampuslind
Contributor

HP DL585 G7 crashes mysteriously

Hi all,

We have a frustrating issue in one of our data centers. At the site we have six HP DL585 G7 servers running ESXi 5.0.0 build-623860. These servers crash or hang now and then without writing anything to the logs.

We have run hardware diagnostic tests with HP, and VMware support has gone through all the support information available, but no one can find any issue or problem. The servers just freeze or hang mysteriously. I have even gone so far as to ask the data center vendor for power logs to track down electrical disturbances, but of course without any luck.

Is anyone else having this kind of experience with HP servers and ESXi?

Thanks and regards,

Hampus

12 Replies
sparrowangelste
Virtuoso

Did you apply the latest firmware updates and run a memory test on the machine?

--------------------- Sparrowangelstechnology : Vmware lover http://sparrowangelstechnology.blogspot.com
Hampuslind
Contributor

I'm sure there are new patches to apply, but at the beginning of the summer we were on the latest ones and still had these issues. And since neither HP nor VMware support has pointed out anything obvious or told us to upgrade, we haven't really done anything on that front. Our application isn't really 100% vMotion-proof, so we usually need downtime to patch all servers, and downtime is hard to argue for without a good bug report that points to known bugs in our current software. Personally I would like to upgrade, but I don't really think it will solve our problems.

We have run intensive hardware tests with HP SmartStart on all devices (disk, memory, bus, CPU, etc.) for multiple loops (10+ hours of testing), trying to force the problem to occur again, but without success. Every test came back clean, so there hasn't really been anything for HP to work with.

Also, please note that this happens randomly across our six servers, not just on one server. And those servers are from two separate orders/shipments.

Phoenycks
Enthusiast

Can you give a little more detail about the crashes? Are they PSOD, or just a freeze? Do the hosts reboot? Does HA recognize the host failure and failover? Are you using View or any function/backup solution that might be cloning/snapshotting the guests for any reason?

Also, what SAN are you using?

Anything you can give helps...

Jes

Hampuslind
Contributor

Thanks for asking! 🙂

The servers just freeze or hang. The last time it happened, the host stopped writing to vmkernel.log around 07:00 in the morning, and we got the call about the system being down around 19:00 in the evening. So the system was up and running until 19:00 or so, but it stopped logging at the ESXi level in the morning. The server is usually unresponsive and we need to reset it through iLO.
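
To pin down when logging stopped on each host, the log timestamps can be scanned for the last entry and for long silent gaps. What follows is a minimal Python sketch, not a VMware tool. It assumes each vmkernel.log line starts with an ISO-8601 UTC timestamp such as 2012-08-15T07:00:01.123Z; the function names and the 30-minute threshold are hypothetical choices made for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumption: log lines begin with a timestamp like 2012-08-15T07:00:01.123Z
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z")


def parse_times(lines):
    """Yield a datetime for every line that starts with a timestamp."""
    for line in lines:
        match = TS.match(line)
        if match:
            yield datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")


def find_gaps(lines, threshold=timedelta(minutes=30)):
    """Return (before, after) pairs where consecutive entries are
    further apart than the threshold (hypothetical default: 30 min)."""
    gaps = []
    prev = None
    for ts in parse_times(lines):
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts))
        prev = ts
    return gaps


def last_entry(lines):
    """Return the timestamp of the last parseable line, or None."""
    last = None
    for ts in parse_times(lines):
        last = ts
    return last
```

To try it, read a copy of the log into a list of lines and pass that list to `find_gaps` and `last_entry`. Comparing the last entry before a hang across several incidents may show whether the hosts go silent at a similar time of day or after a similar uptime.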

We don't use HA.

We are not using any snapshots or backups at the ESXi level. I don't really know what View is, so I would say we don't use it.

It feels like there is a lot of talk about storage when it comes to strange issues like this. I found another thread here about Dell servers and QLogic issues. I have also seen that there are new QLogic drivers and firmware out there, and that we are running some kind of -debug driver, probably something we applied after an earlier failure.

I know there are new HBA drivers and firmware, but it would be nice to have some hard evidence before we upgrade (wishful thinking, I know)...

Server:

HP DL585 G7

HBA: HP AJ764A StorageWorks 82Q (QLogic)

FC Firmware version 5.03.15 (d5)

Driver version 901.k1.1-14vmw-debug

Storage:

EMC VNX5300

SAN:

EMC Connectrix-510 (Brocade 5100)

Thanks for helping out guys!

Phoenycks
Enthusiast

Of course, you know I'm going to suggest applying any and all firmware updates 🙂

www.hp.com/go/spp - 2012.08 is the latest as far as I know.

QLogic - definitely update.

But here's a question - how have you installed ESXi? On local disks? Or on an SD card internally? USB key? PXE boot?

I once saw a very similar problem with a host that had ESXi installed on an internal 4GB SD card. The management interface would hose, but the guests would continue to run for a while, just like you are seeing. In my case, I could still log into the ESXi console, and I could restart the management agents, but that usually didn't work - they would appear to restart without error, but the host was still unresponsive in vCenter. Only a hard boot would fix the problem, usually killing any remaining guests in the process and causing an HA failover (which doesn't do you any good).

So with this experience (and several other quirks I've encountered) I no longer recommend the SD card install - get a couple of drives in a mirror for the ESXi install. SD cards just aren't enterprise class yet - there's no monitoring to tell you when they fail, and there's no redundancy.

Anyway, this is only speculation without your installation information...

Jes

Oh - View is VDI - virtual desktops. You'd know if you used it 🙂

Hampuslind
Contributor

Many thanks Jes!

I'll see what I can do about the upgrade. It just feels bad going to management to ask for downtime without knowing whether it's going to solve the problem or not.

ESX is installed on local disks in the servers.

I have read and posted on a similar thread on HP's forum site, and after searching for "random reboot" here on the VMware forum I see similar problems for both HP and Dell servers. Our other option was to replace HP with Dell, but now that feels just as bad.

Br,

Hampus

Hampuslind
Contributor

OK guys, after some more research I'm feeling sadder than yesterday. Just a quick Google search for "esxi random reboot" gives you a lot of hits. So it seems like we're far from alone with these types of issues, and they hit all types of server vendors.

Do you guys think/know if there is a difference between AMD and Intel in these types of situations? E.g., is Intel more stable than AMD? We run AMD on the affected servers.

Thanks,

Hampus

Hampuslind
Contributor

Many crazy ideas right now... I read the comment below on an HP forum. Has anyone tried it?

"I've read some comments regarding similar problems with smaller DL's G7's (DL165 I think) rebooting in the same way and read that advice is to force them into high power mode.. coincidentally we had similar problems with a DL385 G6 in the past and forcing into high power mode resolved the problems, so going to try the same with these servers and see how they behave."

Br,

Hampus

sparrowangelste
Virtuoso

Good idea on using high power mode, since dynamic power mode can cause VM issues.

see this:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=101820...

"Poor virtual machine application performance may be caused by processor power management settings"

--------------------- Sparrowangelstechnology : Vmware lover http://sparrowangelstechnology.blogspot.com
Phoenycks
Enthusiast

I agree with the high power mode - I've read this recommendation in VMware documentation somewhere. The setting is generally in the BIOS of the server, in, I think, the ACPI section. You'll have to dig around a bit, but make sure everything is set to Maximum Performance.

I typically use the DL380s, which have Intel chips. I don't have much experience with the AMD servers, so unfortunately I can't say much there.

But try setting everything to max performance, disable any power-saving features, and see what happens...

Jes

Hampuslind
Contributor

Great guys, thanks for helping out.

This is what we have done so far:

ESXi patched to latest 5.0.0
BIOS upgrade
QLogic BIOS upgrade
Changed to high/max power setting in BIOS/iLO
Applied latest HP (QLogic) FC driver in ESXi
Applied latest HPSA (Smart Array for boot disks) driver in ESXi
Applied HP NMI driver in ESXi, which will log hardware events in the ESXi logs
Applied ESXi Dump Collector

If this doesn't solve the problem or at least capture additional information in the logs, I don't know what I will do...

Br,

Hampus

Tich
Contributor

Can you confirm the memory and CPUs you use?

My company uses over 50 DL585 G7s for ESXi 5 and worked with HP on a similar issue. As a result, HP developed this new firmware to address it:

http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=15...

Please make sure you upgrade your G7s and then test again.

Thanks.

Tich.
