Random reboots on ProLiant DL380 G6 servers

nr1c0re · ‎09-27-2011

Hello community!

I've got 4 VMware ESXi 4.1 servers with vCenter:

- 2 running on ProLiant DL380 G5 - these are OK

- 2 running on ProLiant DL380 G6 - these have a random reboot problem.

Roughly once every 2 weeks, one of the two ProLiant DL380 G6 servers reboots unexpectedly. The monitoring system, HP SIM, only reports "host x.x.x.x unreachable" and then, once the machine has booted, "host x.x.x.x reachable". I've updated all firmware on the servers, run several stress tests, and updated VMware - nothing solved the problem.

The logs are clean, except for the iLO 2 log:

Informational	iLO 2	09/27/2011 09:36	09/27/2011 09:36	2	Server power restored.
Informational	iLO 2	09/27/2011 09:36	09/27/2011 09:36	1	Server power removed.
Caution	iLO 2	09/27/2011 09:36	09/27/2011 09:36	2	Server reset.

That's all.

How can I find the cause of this problem?
Point me in the right direction; I have no idea what to look at.
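
One way to start narrowing this down is to pull the power-related entries out of an exported iLO 2 event log so their timestamps can be lined up against the HP SIM "unreachable" alerts. The following is a minimal Python sketch, not an HP tool. It assumes the log was saved as tab-separated rows in the order shown above; the field names (`source`, `last_update`, `initial_update`, `count`) and the `MM/DD/YYYY` date format are assumptions made for illustration.

```python
from datetime import datetime
from typing import List, NamedTuple


class IloEvent(NamedTuple):
    # Field names are illustrative guesses at the column meanings.
    severity: str
    source: str
    last_update: datetime
    initial_update: datetime
    count: int
    description: str


DATE_FORMAT = "%m/%d/%Y %H:%M"  # assumed: month/day/year, 24-hour clock


def parse_row(row: str) -> IloEvent:
    """Split one tab-separated log row into a typed record."""
    severity, source, last, first, count, description = row.split("\t")
    return IloEvent(
        severity,
        source,
        datetime.strptime(last, DATE_FORMAT),
        datetime.strptime(first, DATE_FORMAT),
        int(count),
        description,
    )


def power_events(events: List[IloEvent]) -> List[IloEvent]:
    """Keep only entries that mention a power change or a server reset."""
    keywords = ("power removed", "power restored", "server reset")
    return [e for e in events if any(k in e.description.lower() for k in keywords)]


# The three rows quoted in the post above.
ROWS = [
    "Informational\tiLO 2\t09/27/2011 09:36\t09/27/2011 09:36\t2\tServer power restored.",
    "Informational\tiLO 2\t09/27/2011 09:36\t09/27/2011 09:36\t1\tServer power removed.",
    "Caution\tiLO 2\t09/27/2011 09:36\t09/27/2011 09:36\t2\tServer reset.",
]

events = [parse_row(r) for r in ROWS]
hits = power_events(events)

for e in hits:
    print(e.last_update.isoformat(), e.severity, e.description)
```

The printed timestamps can then be compared by hand with the monitoring alerts. The sketch only filters and formats; it does not say why the power was removed.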

biolog5 · ‎09-27-2011

Hello,

Are you using a P410-family RAID controller by any chance? And if so, are you using BBWC or FBWC? I have read about quite a lot of reboot problems originating from these controllers (they can quite often be solved by a firmware upgrade, but not always).

Just an idea...

S.

krowczynski · ‎09-27-2011

Hi,

Disable ASR in the BIOS; it can also cause a reboot of the server.

And please look here:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=101084...

MCP, VCP3 , VCP4

nr1c0re · ‎09-27-2011

It's a P410i controller, but I've updated its firmware from the HP Firmware DVD 9.3 downloaded from hp.com...

And the page at http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=101084... says:

Symptoms

This article provides information about HP Automatic Server Recovery (ASR) and ASR events.

If an ASR event occurs, the server hardware is restarted and IML shows the message: An ASR has occurred

But there is no such message in the IML; there is nothing there at all.

EricCorrales · ‎09-27-2011

Power remove and power restore issues:

There are two options that can trigger the behavior

1. Firmware ( BIOS - ILO - Controller)

2. Hardware issues (Power subsystem : Power supply, System Board, Riser Cage)

Based on that, be sure you have the firmware up to date:

** CRITICAL ** Systems ROMPaq Firmware Upgrade for HP ProLiant DL380 G6 (P62) Servers (For USB Key-M...
Version 2011.05.05 (A), 6 Jul 2011

Firmware CD Supplemental Update / Online ROM Flash Component for Linux - HP Integrated Lights-Out 2
Version 2.07, 31 Aug 2011

* RECOMMENDED * Firmware CD Supplemental Update / Online ROM Flash Component for Linux - Smart Array... (multi-part download)
Version 5.06, 24 Jun 2011

If that doesn't resolve the behavior, run the offline diagnostics and save the report, save the vm-support file, and open a ticket with HP.

jmahes · ‎09-29-2011

We are having the same issue. Here is something someone from my team found out yesterday that might help you out: CHECK THE SERIAL NUMBER for a bad batch of DL380 G6 servers.

HP Support mentioned that there is a power backplane issue specifically for HP DL360 G6 with serial numbers that contain "941":

XXX9410059 / 491315-001

XXX941006K / 491315-001
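
For anyone with more than a couple of hosts to check, the serial-number test can be scripted. This is a minimal Python sketch, assuming only what HP Support is quoted as saying above (serials that contain "941"). The function names are made up for illustration, and the stricter positional check is an assumption drawn from the two example serials, not something HP stated.

```python
def contains_marker(serial: str, marker: str = "941") -> bool:
    """Loose check: the marker appears anywhere in the serial number."""
    return marker in serial.upper()


def marker_after_prefix(serial: str, marker: str = "941", offset: int = 3) -> bool:
    """Stricter check (assumption): the marker sits right after the
    three-character prefix, as in the two example serials."""
    s = serial.upper()
    return s[offset:offset + len(marker)] == marker


# Masked serials as quoted in this thread.
serials = ["XXX9410059", "XXX941006K", "XXX946XXXX"]

for s in serials:
    print(s, contains_marker(s), marker_after_prefix(s))
```

A match only means the serial fits the pattern quoted here; whether a given server is actually affected is something to confirm with HP Support.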

I followed the process to create a bootable USB key

http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=15...

This is the link for the update for these two servers; it came out on 9/19/2011:

Index of ftp://ftp.hp.com/pub/softlib2/software1/sc-linux-fw/p1687714714/v69108/

This is what we will try. Let us know if anyone else is trying this as well; we are experiencing the same behavior described in the first post.

nr1c0re · ‎09-29-2011

Updated the firmware on the 1st server; tomorrow I'll update the 2nd.

jmahes, my servers' serial numbers look like XXX946XXXX / 494329-B21. But maybe these can have the same problem.

jmahes · ‎10-03-2011

Yeah, HP said the notes they had were for xxx941 serial numbers. The next step is to replace the power supply backplane, which is attached to the mainboard and cannot be separated.

We have told HP that the software update did not correct this problem and that we need to move to the next step and replace the mainboard. That is happening on 10/5/11; I'll update you on how it goes.

nr1c0re · ‎10-06-2011

I sent all the logs from the maintenance tests to HP support, and they told me that there are no problems with the hardware.

They suggested updating the NIC drivers and, if that does not help, contacting the VMware support team.

Okay, I found new drivers.

Here they are:

http://downloads.vmware.com/d/details/dt_esx41_broadcom_netxtremeii_032311/ZHcqYnR0anBiZCpwcA

The ESX Server 4.1 driver CD includes support for version bnx2i-1.9.1t.v41.2, bnx2-2.0.22f.v41.2, bnx2x-1.62.15.v41.2, cnic-1.10.2q.v41.9 on ESX/ESXi.

I checked my ESXi - it is 4.1 build 433742.

From the console I checked the driver versions:

~ # ethtool -i vmnic0
driver: bnx2
version: 2.0.7d-4vmw
firmware-version: 5.2.3 NCSI 2.0.6
bus-info: 0000:02:00.0
~ # ethtool -i vmnic4
driver: e1000e
version: 1.1.2-NAPI
firmware-version: 5.12-2
bus-info: 0000:0f:00.0
~ #

I have the old drivers.

That means this driver update applies to my ESXi.

I also have VMware Update Manager. Is it possible to find this driver update in Update Manager? I have not found it there.
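
The version check above can also be scripted, which helps when several hosts or NICs are involved. This is a minimal Python sketch, assuming the `ethtool -i` output has been captured as text. The helper names are made up for illustration, and comparing only the leading dotted numbers is a simplification: letter suffixes such as `d` or `f` are ignored.

```python
import re
from typing import Dict, Tuple


def parse_ethtool_info(text: str) -> Dict[str, str]:
    """Turn 'key: value' lines from `ethtool -i` into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so bus-info keeps its own colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def numeric_prefix(version: str) -> Tuple[int, ...]:
    """Return the leading dotted numbers, e.g. '2.0.7d-4vmw' -> (2, 0, 7)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric version prefix in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


# Output of `ethtool -i vmnic0` as posted above.
INSTALLED = """driver: bnx2
version: 2.0.7d-4vmw
firmware-version: 5.2.3 NCSI 2.0.6
bus-info: 0000:02:00.0"""

# bnx2 version listed for the driver CD above.
AVAILABLE = "2.0.22f.v41.2"

info = parse_ethtool_info(INSTALLED)
needs_update = numeric_prefix(info["version"]) < numeric_prefix(AVAILABLE)
print(info["driver"], info["version"], "older than", AVAILABLE, ":", needs_update)
```

The numbers are compared as integers on purpose: a plain string comparison would rank "2.0.7" above "2.0.22" because the character "7" sorts after "2".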

rajvm256 · ‎10-06-2011

Update Manager usually contains the patches/updates that VMware has released. It will not have the drivers. You can download the ISO and extract it; inside it you should find the offline bundles. You can install your bundle using esxupdate.


nr1c0re · ‎10-07-2011

Thanks rajvm256.

I downloaded the ISO, extracted it, took the zip file with the bnx2 driver, imported it into Update Manager, made a new baseline with only that driver, attached it to the ESXi hosts and updated the drivers successfully.

~ # ethtool -i vmnic0
driver: bnx2
version: 2.0.22f.v41.2
firmware-version: bc 5.2.3 NCSI 2.0.6
bus-info: 0000:02:00.0
~ #

nr1c0re · ‎10-19-2011

I can confirm that everything is stable now after updating the network driver (bnx2).

As a side effect, the vMotion problems are fixed: before the update, if I moved many machines from one node to the other, the process could get stuck, with half of them migrating and the other half stopping with an error. So it was a network driver issue.

Thanks to everyone here!!!

nr1c0re · ‎10-19-2011

Updating the bnx2 network driver in VMware fixed the problems.

nr1c0re · ‎11-09-2011

Yesterday one of the 2 servers rebooted again, with the same messages in the iLO 2 log...

I have no ideas left except replacing both of these servers with new ones.

jmahes · ‎11-09-2011

Since I replaced the motherboards on both of the servers in my cluster, I have had no issues (“knock on wood”): 23 days of uptime.

Jonathan Mahes
