VMware Cloud Community
maniee
Enthusiast
Enthusiast

esxi host reboot

Hi All,

         esxi host rebooted suddenly and we logged with OEM-vmware and we got the reply from vmware that  since the log not available they could not find the root cause.

please let me know what could make esxi to reboot since our customer asking for us RCA.

Please do the

6 Replies
a_p_
Leadership
Leadership

Well, it's hard to help without any details. Anyway, it's quite unusual that a host reboots by its own. If an severe error is detected by ESXi, it's supposed to halt the system and to display a PSOD (Purple Screen of Diagnostics). With enterprise hardware, a reboot my be caused by the server itself if a hardware error is detected. HP for example calls this option ASR (Automatic Server Reboot). Such ASR's - as well as e.g. a power loss - are then logged in the server management log file.

André

Reply
0 Kudos
FritzBrause
Enthusiast
Enthusiast

Create a persistent scratch location according to http://kb.vmware.com/kb/1033696.

Then log files are not deleted (if the are located on the ramdisk) once the server is rebooted.


"If persistent scratch space is not available, ESXi stores this temporary data on a ramdisk, which is constrained in space. This might be problematic in low-memory situations, but is not critical to the operation of ESXi. Information stored on a ramdisk does not persist across reboots, so troubleshooting information such as logs and core files could be lost. If a persistent scratch location on the host is not configured properly, you may experience intermittent issues due to lack of space for temporary files and the log files will not be updated."


VMware is right that they cannot find out the root cause if no logs are available.

If a persistent scratch location was created and the issue re-occurs, you can start with this KB: http://kb.vmware.com/kb/1019238.

Then check the logs like syslog.log, vobd.log, vmkernel.log, vmkwarning.log and hostd.log.

Check the HW logs if the server did a reboot because of a HW issue (ASR - Automatic Server Restart, NMI etc.).



Reply
0 Kudos
keshavkant
Enthusiast
Enthusiast

Hi

To determine the reason for abrupt shut down or reboot an ESX host:

 

  1. If the host is currently turned off, turn the host back on.

  2. Ensure that there are no hardware lights that may indicate a hardware issue. For more information, engage the hardware vendor.

  3. Log in to the host at the console as the root user.

  4. Run the command:

    # cat /var/log/vmksummary

  5. Determine if the ESX host was deliberately rebooted. When a user or script reboots a VMware ESX host, it generates a series of events under /var/log/vmksummary similar to:

    localhost logger: (1265803308) hb: vmk loaded, 1746.98, 1745.148, 0, 208167, 208167, 0, vmware-h-59580, sfcbd-7660, sfcbd-3524
    localhost vmkhalt: (1268148282) Rebooting system...
    localhost vmkhalt: (1268148374) Starting system...
    localhost logger: (1268148407) loaded VMkernel

    Hostd: [<YYYY-MM-DD> <TIME>.284 27D13B90 info 'TaskManager'] Task Created : haTask-ha-host-vim.HostSystem.reboot-50


    If your ESX host is deliberately restarted, review the vCenter Server logs to identify any recent tasks that may have made the ESX host to reboot. These are a list of other resources that help determine the reason for reboot of an ESX host:


  6. Determine if the VMware ESX host was deliberately shut down. When a user or script shuts down a VMware ESX host, it generates a series of events similar to:

    localhost logger: (1265803308) hb: vmk loaded, 1746.98, 1745.148, 0, 208167, 208167, 0, vmware-h-59580, sfcbd-7660, sfcbd-3524
    localhost vmkhalt: (1268149354) Halting system...
    localhost vmkhalt: (1268149486) Starting system...
    localhost logger: (1268149540) loaded VMkernel


    If your VMware ESX host is deliberately shut down, review the vCenter Server logs to identify any recent tasks that may have told the VMware ESX host to reboot. Use this list of other resources to help determine the reason for shutting down of VMware ESX host:
  7. Determine if the ESX host experienced a kernel error. When an ESX host experiences a kernel error, it generates a series of events similar to:

    vsphere5 logger: (1251788469) hb: vmk loaded, 3597562.98, 3597450.113, 13, 164009, 164009, 356, vmware-h-79976, vpxa-54148, sfcbd-12600
    vsphere5 vmkhalt: (1251797195) Starting system...
    vsphere5 logger: (1251797206) VMkernel error
    vsphere5 logger: (1251797261) loaded VMkernel


  8. Run this command to check if the ESXi host is configured to automatically reboot after a purple diagnostic screen:

    esxcfg-advcfg -g /Misc/BlueScreenTimeout

    If the value is different than 0, then the ESXi host reboots automatically after a purple diagnostic screen.


    When the host is rebooted after a failure and if the core dump is successful, the /var/log/vmksummary.log shows that a core dump is found.

    For example:

    <YYYY-MM-DD>T<TIME>Z bootstop: Host has booted
    <YYYY-MM-DD>T<TIME>Z bootstop: file core dump found


    Note: The preceding information indicates the ESXi host failure rather than indicating the ESXi host is restarted automatically.

  9. Determine if the VMware ESX host hardware abruptly rebooted. When the VMware ESX host hardware abruptly reboots, it generates a series of events similar to:

    localhost logger: (1265803308) hb: vmk loaded, 1746.98, 1745.148, 0, 208167, 208167, 0, vmware-h-59580, sfcbd-7660, sfcbd-3524
    localhost vmkhalt: (1268149486) Starting system...
    localhost logger: (1268149540) loaded VMkernel


    If your VMware ESX host has experienced an outage and it was not the result of a kernel error, deliberate reboot, or shut down, then the physical hardware may have abruptly restarted on its own. Hardware is known to reboot abruptly due to power outages, faulty components, and heating issues. To investigate further, engage the hardware vendor.

  10. Alternatively, the outage may have been deliberately triggered by an administrator by physically pressing the power button to turn off the hardware or using the hardware tools such as iLO, DRAC, RAS, etc. This occurrence may generate this event in the /var/log/vmkernel log of the ESX host:

    VMKAcpi: 1865: In PowerButton Helper

  11. If your VMware ESX host experiences an outage that is not the result of a kernel error, deliberate reboot, or shut down, then the physical hardware may have abruptly restarted on its own. Hardware may reboot abruptly due to power outages, faulty components, and heating issues. To investigate further, engage the hardware vendor.


Reply
0 Kudos
RahulVmware1985
Enthusiast
Enthusiast

please check IMM or ILO or DRAC logs events. will get at least if there is any hardware fault which missed to be triggered in Vcenter., also can check firmware level of the system to dig deeper.

Regards

Rahul

Reply
0 Kudos
manmohanbisht
Enthusiast
Enthusiast

Difficult to reply on such generic statements, provide more details, vmkernel logs, dumps ( if configured ) , hardware logs etc.

Reply
0 Kudos
sakthivelramali
Enthusiast
Enthusiast

Hi

Please Check if ESXi is configured to automatically reboot after a purple screen by executing this command:

esxcfg-advcfg -g /Misc/BlueScreenTimeout

If the value is different than 0, then ESXi reboots automatically after the purple screen

Thanks

Sakthivel R

Thanks Sakthivel R
Reply
0 Kudos