Alexjk
Contributor

Unexplained ESXi CRASH

Hi Folks,

We have an ESXi host running a single VM as a pilot for a phone system. A few days ago the whole system crashed: both the host and the VM were unavailable, and the host had to be powered off to restore any connectivity to it. We've gone through all the VMware logs and looked in the HP iLO, but cannot find any obvious cause.

The only log with any indication that something is wrong is syslog.log, but it contains a great many entries like those below. Are these actual errors being reported?

Thanks in advance

Alex

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   In Failed Array'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Rebuild/Remap in progress'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Rebuild/Remap Aborted (was not completed normally)'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Correctable ECC/Other Correctable Memory Error'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Post Memory Resize'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:14:*:0::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:14:*:1::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# System Firmware Progress'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Uncorrectable ECC/Other Uncorrectable Memory Error'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Parity'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Memory Scrub Failed (stuck bit)'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Memory Device Disabled '

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:12:4:1::15'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   System Firmware Error (Post Error)'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   System Firmware Hang'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   System Firmware Progress'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:15:2:0::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:15:2:1::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Correctable ECC/Other Correctable Memory Error Logging Limit Reached'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Presence Detected'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Event Logging Disabled'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Correctable Memory Error Logging Disabled'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Event 'Type' Logging Disabled'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Log Area Reset/Cleared'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:12:6:0::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:16:2:0::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:12:6:1::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:16:2:1::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#  Controller access degraded or unavailable'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   All Event Logging Disabled'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   SEL Full'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   SEL Almost Full'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#  Management controller off-line'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Watchdog 1'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#  Management controller unavailable'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#  Sensor Failure'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#  FRU Failure'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Config error'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Battery'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Spare'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Battery low (predicitive failure)'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Battery failed'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:12:8:0::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Battery presence detected'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:41:2:0::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:41:2:1::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Redundancy degraded from non-redundant'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# session audit'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Session activated'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Discrete'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:42:0:0::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   D0 power state'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:42:0:1::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Session deactivated'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='12:*:0:0::2'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:42:1:0::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='12:*:0:1::2'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:42:1:1::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   D1 power state'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='12:*:1:0::18'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Version Change'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='12:*:1:1::18'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:43:*:0::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   D2 power state'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:43:*:1::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='12:*:2:0::18'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='12:*:2:1::18'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# FRU state'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   D3 power state'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Not installed'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='12:*:3:0::10'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:44:0:0::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='12:*:3:1::10'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:44:0:1::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Inactive'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:44:1:0::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Sensor specific events follow - event reading type: 0x6f == 111'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:44:1:1::15'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   activation requested'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:44:2:0::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Temperature'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:44:2:1::8'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:1:*:0::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   activation in progress'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:1:*:1::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:44:3:0::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:44:3:1::8'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Voltage'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   active'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:2:*:0::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:44:4:0::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:2:*:1::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:44:4:1::2'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   deactivation requested'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Current'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:44:5:0::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:3:*:0::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:44:5:1::9'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:3:*:1::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   deactivation in progress'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:44:6:0::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Fan'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:44:6:1::9'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:4:*:0::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:12:8:1::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Automatically Throttled'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:17:*:0::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:17:*:1::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# System Event'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:4:*:1::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Physical Security'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Platform Security'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   lost communication'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Processor'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#  IERR'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#  Thermal Trip'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#  FRB1/BIST Failure'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#  FRB2/Hang in POST Failure'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Undetermined System Hardware Failure'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#  FRB3/Processor Startup/Initialization Failure'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#  Configuration Error'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#  SM BIOS 'Uncorrectable CPU-complex Error''

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#  Processor Presence'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:7:7:0::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Critical Interrupt'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:7:7:1::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Front Panel NMI/Diagnostic Interrupt'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#  Processor Disabled'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:44:7:0::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Critical overtemp'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Bus Timeout'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:44:7:1::13'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Drive Slot (Bay)'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   I/O Channel Check NMI'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Software NMI'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   PCI PERR'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   PCI SERR'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   EISA Fail Safe Timeout'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Bus Correctable Error'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Drive Presence'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Bus Uncorrectable Error'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Fatal NMI (port 61h, bit 7)'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:13:0:0::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:13:0:1::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Drive Fault'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Predictive Failure'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Bus fatal error'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Hot Spare'

2018-10-01T08:58:11Z localcli: omc-ipmi: Read 754 lines from /etc/sfcb/omc/sensor_health, total entries 251

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='#   Bus degraded'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Button'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:20:*:0::'

2018-10-01T08:58:11Z localcli: Missing healthState value to report, line='111:20:*:1::'

2018-10-01T08:58:11Z localcli: Missing expected value to check for, line=''
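To judge how much of this output is repetition, the entries can be grouped by message text and by what is quoted in `line='...'`. The following is a minimal sketch, assuming syslog.log has been copied off the host and that every entry has the shape shown above; the function name `summarize` and the three payload categories (blank, comment, entry) are illustrative choices, not anything defined by ESXi. The count alone does not say whether the messages are harmful; it only shows how many of them quote blank lines or lines starting with `#`, which look like comment lines from the `/etc/sfcb/omc/sensor_health` file named in the `omc-ipmi` message.

```python
import re
from collections import Counter

# Matches entries shaped like the ones quoted in this thread, e.g.
# 2018-10-01T08:58:11Z localcli: Missing expected value to check for, line='# Battery'
LINE_RE = re.compile(
    r"^(?P<ts>\S+) localcli: (?P<msg>Missing [^,]+), line='(?P<payload>.*)'\s*$"
)


def summarize(lines):
    """Count localcli 'Missing ...' messages by (message text, payload kind).

    Payload kinds are an illustrative classification:
      blank   -> line=''
      comment -> quoted text starts with '#'
      entry   -> anything else (e.g. '111:14:*:0::')
    """
    counts = Counter()
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip other messages such as the omc-ipmi summary line
        payload = match.group("payload")
        if payload == "":
            kind = "blank"
        elif payload.startswith("#"):
            kind = "comment"
        else:
            kind = "entry"
        counts[(match.group("msg"), kind)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (path is whatever local copy of the log you have):
    #   python summarize_syslog.py syslog.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for (msg, kind), count in sorted(summarize(handle).items()):
            print(f"{count:6d}  {msg}  [{kind}]")
```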

7 Replies
SupreetK
Commander

Did the VM crash or the host itself? If the host crashed, do you see any files in the /var/core directory?

Cheers,

Supreet

Alexjk
Contributor

Hi Supreet,

Thanks for the reply. It was the host itself that crashed. There are no files in the /var/core directory.

Thanks

Alex

SupreetK
Commander

While the host was in the crashed state, did you check the DCUI screen from the iLO console? Were you able to see the ESXi login screen?

Cheers,

Supreet

sk84
Expert

Can you please specify exactly which server hardware model and ESXi version you are using?

And did you check the "vmkernel.log" and "hostd.log" in the /var/log/ directory?

Regards,

Sebastian
a_p_
Leadership

0 Kudos
Alexjk
Contributor

@Supreet, the on-site tech did a hard reboot when both the host and the VM were offline and unresponsive to pings. Logging into the iLO afterwards showed nothing other than the reboot itself.

@a.p, the server is an HP ProLiant DL20 Gen9. The ESXi version is ESXI-6.5.0-20170104001-STANDARD.

The hostd.log file only appears to have data from after the server came back up.

The vmkernel log is available here: Dropbox - ESXi - Simplify your life

Line 926 looks to be when it was rebooted and line 2376 seems to be when it comes back online.

Thanks
TotesHagopes
VMware Employee

Judging by the fact that the host stopped logging hours before the hard shutdown was performed, it looks like the host couldn't write to local storage:

2018-10-01T05:16:10.065Z cpu3:65560)NMP: nmp_ThrottleLogForDevice:3546: last error status from device t10.ATA_____ST1000DM0032D1SB10C__________________________________Z9A0NHFY repeated 10 times

2018-10-01T05:16:50.250Z cpu2:65954)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device t10.ATA_____ST1000DM0032D1SB10C__________________________________Z9A0NHFY repeated 14 times

VMB: 112: mbMagic: 2badb002, mbInfo 0x101628     <------start-up

VMB: 56: flags a6d

VMB: 59: cmdline: /jumpstrt.gz vmbTrustedBoot=false tboot=0x101b000 installerDiskDumpSlotSize=2560 no-auto-partition bootUUID=7770609e6f31eafd8cee8ffc2a23095f

VMB: 64: 133 boot modules @ 0x100db8

VMB: 71: mmap_addr 0x1016a0 (504b)

......

0:00:00:04.793 cpu0:65536)VMKernel loaded successfully.     <----VMKernel loaded

2018-10-01T08:45:46.066Z cpu1:65698)VSCSI: 2962: Starting reset watchdog world 65698   <--- latest timestamp

2018-10-01T08:45:46.066Z cpu3:65697)VSCSI: 2764: Starting reset handler world 65697/1

2018-10-01T08:45:46.067Z cpu0:65536)Boot: 563: 23023 symbols, 523544

At this point I'd say it would be best to confirm that the BIOS and the local disk/controller are on the appropriate driver and firmware versions. This KB is handy for understanding an unresponsive host and the correct action to take: VMware Knowledge Base (1017135)
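The logging gap described above can also be located mechanically rather than by scrolling through the file. This is a minimal sketch, assuming a local copy of vmkernel.log whose timestamped lines begin with an ISO-style UTC stamp such as `2018-10-01T05:16:50.250Z`; the function names `parse_timestamps` and `largest_gap` are hypothetical helpers, and lines without such a stamp (the `VMB:` boot lines, for example) are simply skipped.

```python
import re
from datetime import datetime, timezone

# Leading timestamp as it appears in the excerpts in this thread;
# fractional seconds are ignored, so gaps are accurate to one second.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z\b")


def parse_timestamps(lines):
    """Yield (line_number, datetime) for every line that starts with a timestamp."""
    for number, line in enumerate(lines, start=1):
        match = TS_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
            yield number, stamp.replace(tzinfo=timezone.utc)


def largest_gap(lines):
    """Return (gap_seconds, line_before, line_after) for the longest silence.

    Returns None when fewer than two timestamped lines are found.
    """
    best = None
    previous = None
    for number, stamp in parse_timestamps(lines):
        if previous is not None:
            gap = (stamp - previous[1]).total_seconds()
            if best is None or gap > best[0]:
                best = (gap, previous[0], number)
        previous = (number, stamp)
    return best


if __name__ == "__main__":
    import sys

    # Usage (path is whatever local copy of the log you have):
    #   python log_gap.py vmkernel.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        result = largest_gap(handle)
    if result is None:
        print("Fewer than two timestamped lines found")
    else:
        seconds, before, after = result
        print(f"Longest gap: {seconds / 3600:.2f} h between lines {before} and {after}")
```

Note that a gap measured this way runs from the last entry written before the crash to the first timestamped entry after boot, so it includes the time the host was powered off. It needs to be compared with the known time of the hard reboot before concluding how long the host was up but not logging.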
