Hello community!
I've got 4 VMware ESXi 4.1 servers with vCenter:
- 2 running on ProLiant DL380 G5 - these are OK
- 2 running on ProLiant DL380 G6 - these have a random reboot problem.
Roughly once every 2 weeks, one of the two ProLiant DL380 G6 servers reboots unexpectedly. The monitoring system, HP SIM, says only "host x.x.x.x unreachable", and then when the machine boots it says "host x.x.x.x reachable". I've updated all firmware on the servers, run several stress tests, and updated VMware - nothing solved the problem.
Logs are clean, except the iLO 2 log:
Informational iLO 2 09/27/2011 09:36 09/27/2011 09:36 2 Server power restored.
Informational iLO 2 09/27/2011 09:36 09/27/2011 09:36 1 Server power removed.
Caution iLO 2 09/27/2011 09:36 09/27/2011 09:36 2 Server reset.
That's all.
How can I catch the cause of this problem?
Point me in the right direction, I've got no idea what to look at.
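For anyone who wants to track these events over time, here is a minimal Python sketch that splits an exported iLO 2 log into records and picks out the "power removed" entries. The column order (severity, class, last update, initial update, count, description) and the severity name "Critical" are assumptions based on the three lines quoted above, not something confirmed in this thread, and the helper names are made up for illustration.

```python
import re
from datetime import datetime

# Assumed column order (not confirmed in the thread): severity, class,
# last update, initial update, count, description.
# "Critical" is an assumed extra severity; only the other two appear in the post.
ENTRY = re.compile(
    r"(?P<severity>Informational|Caution|Critical)\s+"
    r"(?P<source>iLO 2)\s+"
    r"(?P<last>\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s+"
    r"(?P<first>\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s+"
    r"(?P<count>\d+)\s+"
    r"(?P<description>[^.]+\.)"
)

# The three entries quoted in the post, as one flat string.
SAMPLE = (
    "Informational iLO 2 09/27/2011 09:36 09/27/2011 09:36 2 Server power restored. "
    "Informational iLO 2 09/27/2011 09:36 09/27/2011 09:36 1 Server power removed. "
    "Caution iLO 2 09/27/2011 09:36 09/27/2011 09:36 2 Server reset."
)

TIME_FORMAT = "%m/%d/%Y %H:%M"  # month/day/year, as in the quoted log


def parse_ilo_log(text):
    """Return one dict per log entry found in the text."""
    entries = []
    for match in ENTRY.finditer(text):
        entries.append({
            "severity": match["severity"],
            "last": datetime.strptime(match["last"], TIME_FORMAT),
            "first": datetime.strptime(match["first"], TIME_FORMAT),
            "count": int(match["count"]),
            "description": match["description"],
        })
    return entries


def power_loss_entries(entries):
    """Keep only the entries that report power being removed."""
    return [e for e in entries if "power removed" in e["description"].lower()]
```

Calling `power_loss_entries(parse_ilo_log(SAMPLE))` returns the single "Server power removed." record. Collecting those timestamps across both G6 hosts would at least show whether the reboots follow a pattern.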
Hello,
Are you using a P410-family RAID controller by any chance? And if so, are you using BBWC or FBWC? I have read about quite a lot of reboot problems originating from these controllers (they can quite often be solved by a firmware upgrade, but not always).
Just an idea...
S.
Hi,
Disable ASR in the BIOS; it can also cause a reboot of the server.
And look here please
It's a P410i controller, but I've updated its firmware from the HP Firmware DVD 9.3 downloaded from hp.com...
And the link http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=101084... says
Power remove and power restore issues:
There are two options that can trigger the behavior
1. Firmware ( BIOS - ILO - Controller)
2. Hardware issues (Power subsystem : Power supply, System Board, Riser Cage)
Based on that be sure you have the firmware up to date:
** CRITICAL ** Systems ROMPaq Firmware Upgrade for HP ProLiant DL380 G6 (P62) Servers (For USB Key-M... | 2011.05.05 (A) | 6 Jul 2011
Firmware CD Supplemental Update / Online ROM Flash Component for Linux - HP Integrated Lights-Out 2 | 2.07 | 31 Aug 2011
* RECOMMENDED * Firmware CD Supplemental Update / Online ROM Flash Component for Linux - Smart Array... (multi-part download) | 5.06 | 24 Jun 2011
If that doesn't resolve the behavior, run the offline diagnostics and save the report, save the vm-support file, and open a ticket with HP.
We are having the same issue. Here is something someone from my team found out yesterday that might help you out: CHECK THE SERIAL NUMBER for a bad batch of DL380 G6 servers.
HP Support mentioned that there is a power backplane issue specifically for HP DL360 G6 with serial numbers that contain "941":
XXX9410059 / 491315-001
XXX941006K / 491315-001
I followed the process to create a bootable USB key.
This is the link for the update for these two affected servers. The update came out 9/19/2011:
Index of ftp://ftp.hp.com/pub/softlib2/software1/sc-linux-fw/p1687714714/v69108/
This is what we will try. Let us know if anyone else is trying this as well; we are experiencing the same behavior described in the first post.
Updated the firmware on the 1st server; tomorrow I'll update the 2nd.
jmahes, my servers' serial numbers are XXX946XXXX / 494329-B21. But it may be that these can have the same problem.
Yeah, HP said the notes they had were for xxx941 serial numbers. The next step is to replace the backplane for the power supply, which is attached to the mainboard and cannot be separated.
We have told HP that the software did not correct this problem and that we need to move to the next step and replace the mainboard. That is happening 10/5/11; I'll update you on how it goes.
I sent all logs from the maintenance tests to HP support, and they told me that there are no problems with the hardware.
They suggested updating the NIC drivers and, if that does not help, contacting the VMware support team.
Okay, I found new drivers.
Here they are:
http://downloads.vmware.com/d/details/dt_esx41_broadcom_netxtremeii_032311/ZHcqYnR0anBiZCpwcA
The ESX Server 4.1 driver CD includes support for version bnx2i-1.9.1t.v41.2, bnx2-2.0.22f.v41.2,bnx2x-1.62.15.v41.2, cnic-1.10.2q.v41.9 on ESX/ESXi.
I checked my esxi - it is 4.1 build 433742.
From console I checked versions of drivers:
~ # ethtool -i vmnic0
driver: bnx2
version: 2.0.7d-4vmw
firmware-version: 5.2.3 NCSI 2.0.6
bus-info: 0000:02:00.0
~ # ethtool -i vmnic4
driver: e1000e
version: 1.1.2-NAPI
firmware-version: 5.12-2
bus-info: 0000:0f:00.0
~ #
So I have the old drivers.
That means this driver update applies to my ESXi.
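The comparison above can also be scripted if you have several hosts to check. This is only a sketch: it parses the text that `ethtool -i` prints and compares the leading numeric part of the version with the one from the driver CD notes. The function names are hypothetical, and ignoring the letter suffix ("7d", "22f") is my own simplifying assumption about how these version strings should be ordered.

```python
import re

# Output copied from the console session above.
ETHTOOL_OUTPUT = """\
driver: bnx2
version: 2.0.7d-4vmw
firmware-version: 5.2.3 NCSI 2.0.6
bus-info: 0000:02:00.0
"""

BUNDLE_BNX2_VERSION = "2.0.22f.v41.2"  # from the driver CD description


def parse_ethtool_info(text):
    """Turn `ethtool -i` output into a dict, splitting on the first colon."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def numeric_prefix(version):
    """Leading dotted numbers only, e.g. '2.0.7d-4vmw' -> (2, 0, 7)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
        if match.end() != len(piece):
            break  # stop at the first piece that carries a suffix
    return tuple(parts)


def is_older(installed, available):
    """True if the installed version's numeric prefix is lower."""
    return numeric_prefix(installed) < numeric_prefix(available)


info = parse_ethtool_info(ETHTOOL_OUTPUT)
needs_update = is_older(info["version"], BUNDLE_BNX2_VERSION)
```

Comparing the numbers rather than the raw strings matters here: as plain text, "2.0.7d-4vmw" sorts after "2.0.22f.v41.2", which would wrongly suggest the installed driver is the newer one.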
Also, I've got VMware Update Manager. Is it possible to find this driver update in Update Manager? I have not found it there.
Update Manager usually contains the patches/updates that VMware has released. It will not have the drivers. You can download the ISO and extract it; inside it you should find the offline bundles. You can install your bundle using esxupdate.
Please do award points by clicking correct/helpful if it helped you.
Thanks rajvm256.
I downloaded the ISO, extracted it, took the zip file with the bnx2 driver, imported it into Update Manager, made a new baseline with only that driver, attached it to the ESXi hosts, and updated the drivers successfully.
~ # ethtool -i vmnic0
driver: bnx2
version: 2.0.22f.v41.2
firmware-version: bc 5.2.3 NCSI 2.0.6
bus-info: 0000:02:00.0
~ #
I can confirm that everything is stable now after updating the bnx2 network driver.
As a side effect, my vMotion problems are fixed. Before the update, if I moved many machines from one node to another, the process could get stuck: half of them would migrate, and the other half would stop migrating with an error. So it was a network driver issue.
Thanks to everyone here!!!
Updating the bnx2 network driver in VMware fixed the problems.
Yesterday one of the 2 servers rebooted with the same messages in the iLO 2 log...
I had no other ideas except replacing both of these servers with new servers.
Since I replaced the motherboards on both servers in my cluster, I have had no issues “knock on wood”: 23 days of uptime.
Jonathan Mahes