m4biz
Enthusiast

Unexpected, repeated shutdowns of ESXi server

For about two weeks, my ESXi server on an HP ProLiant DL380 G5 has been shutting down unexpectedly every two days.

After each shutdown I have to restart it manually.

The only alert message I see on the vSphere dashboard is this:

esxi.png

Any idea on how to fix it?

Thanks in advance

Ing. Cosimo Mercuro http://cosimomercuro.wordpress.com/

Accepted Solutions
m4biz
Enthusiast

Hi!

I've solved the issue.

I replaced the battery pack, but that did not solve the problem.

The real problem was related to my APC Smart-UPS.

After contacting APC support, I simply restarted the UPS using a very simple procedure that APC support emailed to me, and everything has been working fine for about three weeks now.

I hope my feedback will be useful to others with a similar issue.

Ing. Cosimo Mercuro http://cosimomercuro.wordpress.com/

14 Replies
daphnissov
Immortal

That usually means the cache battery on the storage controller is dead and you must replace it. I don't know whether that is responsible for the ESXi host shutdowns, but if you are using those internal drives in a RAID configuration, and especially if you are using write-back caching, you should plan to replace it ASAP.

Papalardo
Contributor

Hello m4biz,

I advise logging in via SSH and reviewing the ESXi host logs. The article below lists the log files, what each one records, and where it is located.

VMware Knowledge Base

Then look at the HP server logs through HP Systems Insight Manager.

https://www.hpe.com/us/en/product-catalog/detail/pip.489496.html
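
As a rough illustration of the log review suggested above, here is a minimal Python sketch that scans a copy of a log file for lines mentioning shutdown-related terms. The keyword list and the function names (`find_log_clues`, `scan_file`) are assumptions made for illustration and do not come from VMware or HP; adjust the terms to whatever your host actually logs.

```python
from typing import Iterable, List, Tuple

# Hypothetical search terms (an assumption); tune them to your own logs.
DEFAULT_KEYWORDS = ("shutdown", "power", "panic", "battery", "boot")


def find_log_clues(
    lines: Iterable[str],
    keywords: Tuple[str, ...] = DEFAULT_KEYWORDS,
) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line containing any keyword.

    Matching is case-insensitive and purely textual; it does not
    interpret timestamps or log levels.
    """
    lowered = tuple(k.lower() for k in keywords)
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if any(k in text.lower() for k in lowered):
            hits.append((number, text))
    return hits


def scan_file(
    path: str,
    keywords: Tuple[str, ...] = DEFAULT_KEYWORDS,
) -> List[Tuple[int, str]]:
    """Scan one log file, replacing undecodable bytes instead of failing."""
    with open(path, "r", errors="replace") as handle:
        return find_log_clues(handle, keywords)
```

For example, after copying the logs off the host, `scan_file("vmkernel.log")` or `scan_file("vmksummary.log")` returns the matching lines with their line numbers. On ESXi 5.x and later these files are normally found under /var/log/, but check the Knowledge Base article for your version.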

Support and Infrastructure Analyst
a_p_
Leadership

For the unexpected shutdown, log in to the server's iLO and check the System Management Logs.

André

lfichera
Enthusiast

Check whether there are any events in the iLO log, as the controller cache may be experiencing problems. The cause could be faulty hardware or firmware that has not been updated.

Drivers & software for the HP ProLiant DL380 G5 server:

https://support.hpe.com/hpsc/swd/public/detail?sp4ts.oid=1121413&swItemId=MTX_a1fd0e6f7fdc42549704ee...

https://support.hpe.com/hpsc/swd/public/detail?sp4ts.oid=1121413&swItemId=MTX_575981f040124616bcad8d...

Installing async drivers in ESXi 5.x and 6.x using esxcli and async driver VIB file (2137854):

https://support.hpe.com/hpesc/public/home/driverHome?sp4ts.oid=1121413&swLangOid=2&swEnvOid=4166

Hugs,

m4biz
Enthusiast

Hi daphnissov, thanks for your reply.

I'll try it ASAP.

Ing. Cosimo Mercuro http://cosimomercuro.wordpress.com/
Dave_the_Wave
Hot Shot

No, a failed Smart Array battery won't take down your box, but any one or a combination of the following will: a failing RAID controller (especially the onboard ones), a bad motherboard, bad power supplies, or bad AC/DC power regulators.

Yes, you are seeing a sensor alert for the battery, but that is just a coincidence given all the other things that can go bad over time in a decade-old box.

All the ProLiant sensors just tell you whether something is present or not present; they don't account for "works some of the time".

You can try swapping in a new battery, but the cost of the battery will be the same as replacing the entire box. Your call.

You can also try dropping in a better Smart Array card, but the downtime to replace the entire box is just as long.

m4biz
Enthusiast

Hi Dave.

Sorry for the delay in replying.

Is there any way to disable this check and stop the continual shutdowns without replacing the battery pack?

My disks are not in any RAID configuration.

Ing. Cosimo Mercuro http://cosimomercuro.wordpress.com/
Finikiez
Champion

I guess you can try physically detaching the battery pack from the RAID controller.

m4biz
Enthusiast

Hi Finikiez,

thanks for your reply.

What happens if I do this?

Will the server still work?

Ing. Cosimo Mercuro http://cosimomercuro.wordpress.com/
Finikiez
Champion

The cache battery is necessary to avoid data corruption in case of unexpected power loss when the write cache is enabled.

So first try disabling the write cache in the controller's BIOS or from the ACU CLI.

If this doesn't help, try detaching the battery physically.

When you disable the write cache, expect write performance to degrade.
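
To make the ACU CLI route a little more concrete, here is a minimal Python sketch that only builds the command line for turning off the array accelerator (the controller cache) on one logical drive. It runs nothing. The tool name `hpacucli`, the `arrayaccelerator=disable` syntax, and the slot and logical drive numbers are assumptions based on HP's ACU CLI documentation as I remember it, so verify them against your installed version before using the result. The helper name `build_disable_accelerator_command` is hypothetical.

```python
from typing import List


def build_disable_accelerator_command(
    slot: int,
    logical_drive: int,
    tool: str = "hpacucli",  # assumed tool name; it may differ on your host
) -> List[str]:
    """Build (but do not run) an ACU CLI command that disables the
    array accelerator for one logical drive.

    The argument syntax is an assumption to check against the ACU CLI
    guide for your controller and tool version.
    """
    if slot < 0 or logical_drive < 1:
        raise ValueError("slot must be >= 0 and logical_drive must be >= 1")
    return [
        tool,
        "ctrl",
        f"slot={slot}",
        "logicaldrive",
        str(logical_drive),
        "modify",
        "arrayaccelerator=disable",
    ]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(" ".join(build_disable_accelerator_command(slot=0, logical_drive=1)))
```

Printing the command rather than executing it is deliberate: changing cache settings on a production controller is better done by hand after checking the controller's current status.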

imacfj
Enthusiast

The check alone would not be causing the shutdowns, so there's no need to remove it. Your best bet is to check through the iLO to see what's actually happening; it sounds like it could be a hardware fault, given that you're using a G5 server. Apart from that, check through the ESXi host logs to see if there are any software errors.

m4biz
Enthusiast

Hi imacfj, thanks for your reply.

I've just configured iLO 2 and found this:

2018-02-15_165519.jpg

At this point I think I must replace the battery pack:

20180104_114525.jpg

What do you think?

Ing. Cosimo Mercuro http://cosimomercuro.wordpress.com/
Finikiez
Champion

If you can replace the battery, you obviously should.

If you can't, disable the write cache on the controller.
