VMware Cloud Community
eRJe
Contributor

ESXi host crashes randomly; no dump or log entries

Hi,

My home lab host was running ESXi 5.5 for almost 6 years without a single crash. No hardware was changed, only the location, in August 2018. A couple of weeks ago I noticed that the host was rebooting regularly. Initially I thought it had something to do with an IPv6 rollout at the location, as IPv6 support was known to be buggy in the ESXi version I was running. After first disabling IPv6, I eventually updated to ESXi 6.5. The host is still crashing, as you can see from the Xorg.log below. From the network switch log I can tell that the host sometimes crashes multiple times before completing a reboot.

2019-01-10T10:48:40Z mark: storage-path-claim-completed

2019-01-10T14:10:44Z mark: storage-path-claim-completed

2019-01-10T15:46:38Z mark: storage-path-claim-completed

2019-01-10T17:03:56Z mark: storage-path-claim-completed

2019-01-10T17:46:53Z mark: storage-path-claim-completed

2019-01-10T18:12:50Z mark: storage-path-claim-completed

2019-01-10T19:24:31Z mark: storage-path-claim-completed

2019-01-10T19:51:06Z mark: storage-path-claim-completed

2019-01-10T21:23:26Z mark: storage-path-claim-completed

2019-01-10T23:30:24Z mark: storage-path-claim-completed

2019-01-10T23:59:18Z mark: storage-path-claim-completed

2019-01-11T00:29:38Z mark: storage-path-claim-completed

2019-01-11T01:32:12Z mark: storage-path-claim-completed

2019-01-11T02:19:19Z mark: storage-path-claim-completed

2019-01-11T04:09:31Z mark: storage-path-claim-completed

2019-01-11T05:35:51Z mark: storage-path-claim-completed

2019-01-11T06:51:12Z mark: storage-path-claim-completed

2019-01-11T07:17:11Z mark: storage-path-claim-completed

2019-01-11T07:57:42Z mark: storage-path-claim-completed

2019-01-11T08:30:11Z mark: storage-path-claim-completed

2019-01-11T14:59:32Z mark: storage-path-claim-completed

2019-01-11T15:37:45Z mark: storage-path-claim-completed

2019-01-11T16:20:33Z mark: storage-path-claim-completed

2019-01-11T16:49:22Z mark: storage-path-claim-completed

2019-01-11T18:30:27Z mark: storage-path-claim-completed

2019-01-11T20:23:01Z mark: storage-path-claim-completed

2019-01-12T00:06:59Z mark: storage-path-claim-completed

2019-01-12T00:34:22Z mark: storage-path-claim-completed

2019-01-12T01:02:04Z mark: storage-path-claim-completed

2019-01-12T09:10:54Z mark: storage-path-claim-completed

2019-01-12T11:16:12Z mark: storage-path-claim-completed

The crashes look as if someone just pulled the power cable. There is no purple screen, there are no dump files, and in none of the logs can I find any hint of what happened prior to the reboot.

In vmkernel.log I noticed a few memory corrections, so to be sure I ran memtest86 for 72 hours. No errors were found.

The server has a redundant power supply. Currently I have disabled one module to see if it makes a difference; after 24 hours I will swap to the other module. It's a shot in the dark, especially since memtest could run for 72 hours without a hiccup. Other than that, I am running out of ideas. I could remove some non-essential hardware from the host, like the GFX card and the 2nd storage controller, but I would expect log entries if those were failing.

Any input will be very much appreciated. Perhaps there are other logs that I am missing which contain info. Or can I enable more advanced logging?

Thanks,

Robbert

Hardware:

Motherboard: ASUS KGPE-D16 with 2x AMD Opteron 6134 / 8 core

Memory: 12x 8GB DDR3 ECC Reg

Main controller: IBM ServeRAID M5016 SAS/SATA -> LSI 9266-8i MegaRaid with SSD RAID10 and HDD RAID10

2nd controller: Supermicro -USAS2-L8i 8-Port SAS/2 (pass through)

7 Replies
a_p_
Leadership

Since there's no PSOD when these crashes occur, I'd suggest you start with creating a persistent scratch location (https://kb.vmware.com/s/article/1033696).

Maybe the log files contain entries which help to determine what happened prior to the crash(es).
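If you go the command-line route, something along these lines should show and set the scratch location (the datastore name is a placeholder; a reboot is required for the change to take effect):

```shell
# Show where scratch currently lives (ramdisk = not persistent)
vim-cmd hostsvc/advopt/view ScratchConfig.CurrentScratchLocation

# Create a directory on persistent storage and point scratch at it
# ("DatastoreName" is a placeholder for one of your VMFS volumes)
mkdir /vmfs/volumes/DatastoreName/.locker
vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/DatastoreName/.locker
# ... then reboot the host
```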


André

eRJe
Contributor

Hi André,

Thanks for replying. I had already created a persistent scratch partition on the SSD RAID10 datastore. I do have updating logs in /var/log, but there are no dump files after the crashes.

Via IPMI I can see the console remotely from where I am. I have seen several crashes on my screen, but they just look as if the power was toggled. I'm guessing the system does not get a chance to generate the dump files.
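To rule out a missing dump target, I will also verify the coredump configuration, along these lines (syntax as I understand it from the ESXi 6.5 esxcli namespace):

```shell
# List all partitions usable as a coredump target,
# and show which one is active/configured
esxcli system coredump partition list
esxcli system coredump partition get

# If none is active, let ESXi pick an accessible one automatically
esxcli system coredump partition set --enable true --smart
```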

Thanks,

Robbert

drwxr-xr-x    1 root     root           512 Jan 12 13:35 .

drwxr-xr-x    1 root     root           512 Jan 12 13:35 ..

-rw-------    1 root     root            13 Jan 12 13:35 .ash_history

-r--r--r--    1 root     root            20 Jul  7  2017 .mtoolsrc

lrwxrwxrwx    1 root     root            49 Jan 12 12:18 altbootbank -> /vmfs/volumes/715de0e4-ad245e89-1b34-6b0c39efb6a5

drwxr-xr-x    1 root     root           512 Jan 12 12:18 bin

lrwxrwxrwx    1 root     root            49 Jan 12 12:18 bootbank -> /vmfs/volumes/e46647ea-2f245bae-a60d-d5dc0ad390f0

-r--r--r--    1 root     root        505736 Jul  7  2017 bootpart.gz

drwxr-xr-x   13 root     root           512 Jan 12 13:35 dev

drwxr-xr-x    1 root     root           512 Jan 12 13:18 etc

drwxr-xr-x    1 root     root           512 Jan 12 12:18 lib

drwxr-xr-x    1 root     root           512 Jan 12 12:18 lib64

-r-x------    1 root     root         21439 Jan 12 12:01 local.tgz

lrwxrwxrwx    1 root     root             6 Jan 12 12:18 locker -> /store

drwxr-xr-x    1 root     root           512 Jan 12 12:18 mbr

drwxr-xr-x    1 root     root           512 Jan 12 12:18 opt

drwxr-xr-x    1 root     root        131072 Jan 12 13:35 proc

lrwxrwxrwx    1 root     root            23 Jan 12 12:18 productLocker -> /locker/packages/6.5.0/

lrwxrwxrwx    1 root     root             4 Jul  7  2017 sbin -> /bin

lrwxrwxrwx    1 root     root            57 Jan 12 12:18 scratch -> /vmfs/volumes/548cb1c4-30d22a56-b3a7-bcaec527ae9b/.locker

lrwxrwxrwx    1 root     root            49 Jan 12 12:18 store -> /vmfs/volumes/527fd478-f2ea3747-127a-bcaec527ae9b

drwxr-xr-x    1 root     root           512 Jan 12 12:17 tardisks

drwxr-xr-x    1 root     root           512 Jan 12 12:17 tardisks.noauto

drwxrwxrwt    1 root     root           512 Jan 12 13:01 tmp

drwxr-xr-x    1 root     root           512 Jan 12 12:17 usr

drwxr-xr-x    1 root     root           512 Jan 12 12:18 var

drwxr-xr-x    1 root     root           512 Jan 12 12:17 vmfs

drwxr-xr-x    1 root     root           512 Jan 12 12:17 vmimages

lrwxrwxrwx    1 root     root            18 Jul  7  2017 vmupgrade -> /locker/vmupgrade/

Filesystem         Bytes          Used    Available Use% Mounted on

VMFS-5      997774589952  774669074432 223105515520  78% /vmfs/volumes/IBM_RAID10_4SSD

VMFS-5     1997965099008 1619378831360 378586267648  81% /vmfs/volumes/IBM_RAID10_4HDD

VMFS-5     3000571527168 2138127204352 862444322816  71% /vmfs/volumes/SM_1HDD

vfat           299712512        131072    299581440   0% /vmfs/volumes/527fd478-f2ea3747-127a-bcaec527ae9b

vfat           261853184     172625920     89227264  66% /vmfs/volumes/715de0e4-ad245e89-1b34-6b0c39efb6a5

vfat           261853184     162902016     98951168  62% /vmfs/volumes/e46647ea-2f245bae-a60d-d5dc0ad390f0

[root@RJ-ESXi:/vmfs/volumes] ls -al

total 3844

drwxr-xr-x    1 root     root           512 Jan 12 13:59 .

drwxr-xr-x    1 root     root           512 Jan 12 13:45 ..

drwxr-xr-x    1 root     root             8 Jan  1  1970 527fd478-f2ea3747-127a-bcaec527ae9b

drwxr-xr-t    1 root     root          2940 Aug 22 20:35 548cb1c4-30d22a56-b3a7-bcaec527ae9b

drwxr-xr-t    1 root     root          1400 Nov 13  2016 54921312-3735e336-0618-bcaec527ae9b

drwxr-xr-t    1 root     root          1680 Jan  5 20:12 54921344-dba16a0f-8aa6-bcaec527ae9b

drwxr-xr-x    1 root     root             8 Jan  1  1970 715de0e4-ad245e89-1b34-6b0c39efb6a5

lrwxr-xr-x    1 root     root            35 Jan 12 13:59 IBM_RAID10_4HDD -> 54921312-3735e336-0618-bcaec527ae9b

lrwxr-xr-x    1 root     root            35 Jan 12 13:59 IBM_RAID10_4SSD -> 548cb1c4-30d22a56-b3a7-bcaec527ae9b

lrwxr-xr-x    1 root     root            35 Jan 12 13:59 SM_1HDD -> 54921344-dba16a0f-8aa6-bcaec527ae9b

drwxr-xr-x    1 root     root             8 Jan  1  1970 e46647ea-2f245bae-a60d-d5dc0ad390f0

a_p_
Leadership

You mentioned IPMI. Are there any entries in the IPMI logs regarding the reboots?
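If the BMC's web interface only shows a summary, the full SEL can also be read remotely with ipmitool from another machine (BMC address and credentials below are placeholders):

```shell
# Read the BMC's system event log with timestamps
# (replace the IP, user and password with your BMC's values)
ipmitool -I lanplus -H 192.168.1.50 -U ADMIN -P ADMIN sel elist

# Sensor readings can also hint at power or voltage issues around a reset
ipmitool -I lanplus -H 192.168.1.50 -U ADMIN -P ADMIN sdr list
```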

André

Dave_the_Wave
Hot Shot

This is exactly why I stopped building whiteboxes for clients decades ago.

I would temporarily remove the drives, throw in another single drive, install a simple OS like Win7, and run whatever diag or burn-in tools to see where it happens.

It's always easier to diagnose machines on a fresh install of anything.

eRJe
Contributor

The IPMI log had only one memory correction message.

Prime201110141
Enthusiast

eRJe​ This looks like a RAM problem, similar to what I have experienced. Since you have 12x 8GB DDR3 ECC Reg, remove all the RAM and try starting with a single 8GB module per processor. If nothing happens, add a second module, and continue like that.

Hope it solves the problem; if not, please reply.

May the force be with you,
Prime919

eRJe
Contributor

Prime201110141​ I took my time to do some testing before reporting back. I did what you suggested and removed all memory except for 1 DIMM per CPU. The behavior was the same. I then replaced the memory with 2 other DIMMs. This time the server did not crash for 3+ days, but I noticed that vCenter had frozen. I restarted the vCenter VM, and within 1 hour the server was crashing again continuously.

I then replaced the memory again, but also disabled the 2nd CPU in the BIOS. The server still crashed, but less than once a day. A big improvement, though still not satisfying. I changed the memory again and the crash frequency went up to 12-15 crashes a day. I continued swapping memory without significant change.

I cannot imagine that all 12 DIMMs are faulty, but there is definitely a noticeable change with certain memory configurations. I am now considering physically swapping the 2 CPUs. If that doesn't make a difference either, I think I have to consider the motherboard faulty and replace it.

Any thoughts?

Regards,

Robbert
