VMware Cloud Community
vmrulz
Hot Shot

3.5.3 host reboots just prior to OS load completion

Greetings,

I've got a toughie. We've got an HP DL585 G2 (our standard ESX platform, 32 GB RAM, QLogic HBAs) that I just re-installed ESX on this morning (no HP agents installed yet). I've had 3 SRs open with VMware in the last two months on this box. The last thing VMware said was that it was probably kernel corruption, so re-install the OS, which I did. I've run HP diagnostics over and over and never found a single issue (surprise, surprise). I changed out the hard drives on it. The next step is to do full memory tests with a RAMCHECK LX.

Anyway, every time I attempt to boot this server the OS almost fully loads. When it gets to "starting webaccess" the box reboots itself. This is consistent, up until the boot partition gets corrupted from the random reboots and won't load.

Anybody seen anything similar to this or have a thought on what might cause this?

Mother's don't let your children do production support for a living!

1 Solution

Accepted Solutions
Texiwill
Leadership

Hello,

Disk timeout tests take minutes, and you mostly want the diags to test everything else as many times as possible. Disk timeouts generally HALT all diags until they finish, so instead of 100s of runs you may only get 10. You want more than one run with HP Diags.


Best regards,

Edward L. Haletky

VMware Communities User Moderator

====

Author of the book 'VMWare ESX Server in the Enterprise: Planning and Securing Virtualization Servers', Copyright 2008 Pearson Education.

Blue Gears and SearchVMware Pro Blogs: http://www.astroarch.com/wiki/index.php/Blog_Roll

Top Virtualization Security Links: http://www.astroarch.com/wiki/index.php/Top_Virtualization_Security_Links

--
Edward L. Haletky
vExpert XIV: 2009-2023,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill

8 Replies
Texiwill
Leadership

Hello,

Are you installing over the iLO or using the local CD-ROM? There used to be a bug when installing over the iLO that caused a reboot. The solution was to upgrade the iLO firmware.


Best regards,

Edward L. Haletky

vmrulz
Hot Shot

Standard CD-ROM installation. Just did extended memory checks on all the RAM: clean. I hate these ones because it's so difficult to pinpoint the issue when all your diags come up clean but you know it's a hardware issue of some kind.

vmrulz
Hot Shot

Just loaded Windows 2003 on the host and it is exhibiting similar behaviour. I guess it's time to haggle with HP and swap out every part on the box!

SuryaVMware
Expert

The RAM shipped with HP servers some time ago caused issues with running ESX Server, though no memory diag showed any issue. I can't recollect the RAM maker, but HP replaced it for all the customers using ESX Server. You would want to check that with HP.

-Surya

Texiwill
Leadership

Hello,

I would do the following:

0) Open up the box and look for obvious issues, and reseat the cards. I had a loose heatsink one time.

1) Upgrade the BIOS/firmware.

2) Verify the BIOS is set properly.

3) Run the vendor hardware diags without the disk timeout test for at least 24-48 hours.

4) Run memtest86+ for at least 24-48 hours.

Or call the vendor and have them do it all for you.


Best regards,

Edward L. Haletky

vmrulz
Hot Shot

Thanks for the tips. Believe me, I've gone through a lot of the standard troubleshooting.

What is the reasoning behind running the diags without the disk timeout test?

Texiwill
Leadership

Hello,

Disk timeout tests take minutes, and you mostly want the diags to test everything else as many times as possible. Disk timeouts generally HALT all diags until they finish, so instead of 100s of runs you may only get 10. You want more than one run with HP Diags.
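
To put rough numbers on that, here is a minimal back-of-the-envelope sketch in Python. The per-pass durations are assumed purely for illustration; they are not measured on HP diagnostics, and the function and constant names are made up for this example. It only shows how a slow disk timeout test in every pass shrinks the number of complete passes you get in a 24-hour burn-in.

```python
def passes_in_window(window_hours: float, pass_minutes: float) -> int:
    """Return how many complete diagnostic passes fit in the test window."""
    return int(window_hours * 60 // pass_minutes)


# Assumed, illustrative figures (hypothetical, not taken from HP diagnostics):
WINDOW_HOURS = 24            # length of the burn-in run
BASE_PASS_MINUTES = 10       # one pass of CPU, memory, and board tests
DISK_TIMEOUT_MINUTES = 134   # extra time the disk timeout test adds per pass

# Passes completed when the disk timeout test is skipped.
without_disk = passes_in_window(WINDOW_HOURS, BASE_PASS_MINUTES)

# Passes completed when every pass also waits on the disk timeout test.
with_disk = passes_in_window(WINDOW_HOURS, BASE_PASS_MINUTES + DISK_TIMEOUT_MINUTES)

print(f"passes without disk timeout test: {without_disk}")
print(f"passes with disk timeout test:    {with_disk}")
```

With these assumed durations the run completes 144 passes without the disk timeout test and only 10 with it, so an intermittent fault gets far fewer chances to show up.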


Best regards,

Edward L. Haletky

vmrulz
Hot Shot

In the end it turned out to be the processor/memory board. This was found through trial-and-error part swapping; the diags never picked up this issue. I've found this to be true of just about all diags: they sometimes find failed individual parts but rarely diagnose bad system or processor boards.
