VMware Cloud Community
vmwarexl
Contributor

ESXi 6 host keeps rebooting randomly

I installed ESXi 6 on an IBM server, but the host keeps rebooting randomly for no apparent reason.

Before I installed ESXi 6, Windows Server 2008 was running on this machine and it had no such problem.

I searched all the logs but cannot find the root cause. Can anyone advise how I can investigate this issue?

Thanks

19 Replies
vmwarexl
Contributor

Does anyone have an idea about this?

vcallaway
Enthusiast

Is the host on VMware's HCL?

Have you checked the hostd.log on the host? /var/log/hostd.log

vXav
Expert

Have you checked that your server is on the HCL? (Unsupported hardware can cause that.)

Has the host generated any core dumps?

What is in the vmksummary log? You can check at http://HostIP/host

ArjunDooti
Enthusiast

What is the make and model of the hardware?

In /var/core, are you seeing any core dump files?

If it is HP, is ASR enabled in iLO?

What do the iDRAC or iLO logs say?

Thanks & Regards

Arjun Dooti

vmwarexl
Contributor

The server model is IBM x3550 M4 with a Xeon E5-2620, and it's on the HCL.

No core dump file was found.

There is no useful information in hostd.log or vmksummary.log.

There are some messages in vmkwarning.log, but they may not be related.

0:00:00:05.508 cpu0:32768)WARNING: VMKAcpi: 2448: Bus 13 (81) is already defined

2016-12-31T01:07:46.756Z cpu21:33266)WARNING: LinuxSignal: 541: ignored unexpected signal flags 0x2 (sig 17)

2016-12-31T01:07:50.032Z cpu1:33291)WARNING: LinNet: LinNet_CreateDMAEngine:4011: vusb0, failed to get device properties with error Not supported

2016-12-31T01:07:50.032Z cpu1:33291)WARNING: LinNet: LinNet_ConnectUplink:11920: vusb0: Failed to create DMA engine with error Not supported, it maybe a pseudo device

2016-12-31T01:07:50.481Z cpu6:33291)WARNING: LinNet: LinNet_CreateDMAEngine:4011: vusb0, failed to get device properties with error Not supported

2016-12-31T01:07:50.481Z cpu6:33291)WARNING: LinNet: LinNet_ConnectUplink:11920: vusb0: Failed to create DMA engine with error Not supported, it maybe a pseudo device

2016-12-31T01:07:50.728Z cpu19:33355)WARNING: ScsiScan: 1643: Failed to add path vmhba1:C0:T9:L0 : Not found

2016-12-31T01:07:50.730Z cpu19:33355)WARNING: ScsiScan: 1643: Failed to add path vmhba1:C0:T11:L0 : Not found

2016-12-31T01:07:50.732Z cpu19:33355)WARNING: ScsiScan: 1643: Failed to add path vmhba1:C0:T12:L0 : Not found

2016-12-31T01:07:50.735Z cpu19:33355)WARNING: ScsiScan: 1643: Failed to add path vmhba1:C0:T13:L0 : Not found

2016-12-31T01:07:50.737Z cpu19:33355)WARNING: ScsiScan: 1643: Failed to add path vmhba1:C0:T14:L0 : Not found

2016-12-31T01:07:50.739Z cpu19:33355)WARNING: ScsiScan: 1643: Failed to add path vmhba1:C0:T15:L0 : Not found

2016-12-31T01:07:50.741Z cpu19:33355)WARNING: ScsiScan: 1643: Failed to add path vmhba1:C0:T16:L0 : Not found

2016-12-31T01:07:50.743Z cpu19:33355)WARNING: ScsiScan: 1643: Failed to add path vmhba1:C0:T17:L0 : Not found

2016-12-31T01:07:53.005Z cpu2:33210)WARNING: NetDVS: 659: portAlias is NULL

2016-12-31T01:08:01.854Z cpu3:33404)WARNING: RDT: RDTModInit:1074: Kernel is not configured for IPv6

2016-12-31T01:08:02.874Z cpu8:33528)WARNING: Supported VMs 171, Max VSAN VMs 400, SystemMemoryInGB 32

2016-12-31T01:08:02.874Z cpu8:33528)WARNING: MaxFileHandles: 5130, Prealloc 1, Prealloc limit: 32 GB, Host scaling factor: 1

2016-12-31T01:08:02.874Z cpu8:33528)WARNING: DOM memory will be preallocated.

2016-12-31T01:08:05.404Z cpu5:33583)WARNING: FTCpt: 476: Using IPv4 address to start server listener

2016-12-31T01:08:09.449Z cpu11:33775)WARNING: ScsiScan: 1643: Failed to add path vmhba1:C0:T9:L0 : Not found

2016-12-31T01:08:09.451Z cpu11:33775)WARNING: ScsiScan: 1643: Failed to add path vmhba1:C0:T11:L0 : Not found

2016-12-31T01:08:09.453Z cpu11:33775)WARNING: ScsiScan: 1643: Failed to add path vmhba1:C0:T12:L0 : Not found

2016-12-31T01:08:09.455Z cpu11:33775)WARNING: ScsiScan: 1643: Failed to add path vmhba1:C0:T13:L0 : Not found

2016-12-31T01:08:09.457Z cpu11:33775)WARNING: ScsiScan: 1643: Failed to add path vmhba1:C0:T14:L0 : Not found

2016-12-31T01:08:09.459Z cpu11:33775)WARNING: ScsiScan: 1643: Failed to add path vmhba1:C0:T15:L0 : Not found

2016-12-31T01:08:09.461Z cpu11:33775)WARNING: ScsiScan: 1643: Failed to add path vmhba1:C0:T16:L0 : Not found

2016-12-31T01:08:09.463Z cpu11:33775)WARNING: ScsiScan: 1643: Failed to add path vmhba1:C0:T17:L0 : Not found

2016-12-31T01:08:12.590Z cpu14:34232)WARNING: PCI: 157: 0000:06:00.0: Bypassing non-ACS capable device in hierarchy

2016-12-31T01:08:12.590Z cpu14:34232)WARNING: PCI: 157: 0000:06:00.1: Bypassing non-ACS capable device in hierarchy

2016-12-31T01:08:12.591Z cpu14:34232)WARNING: PCI: 157: 0000:06:00.2: Bypassing non-ACS capable device in hierarchy

2016-12-31T01:08:12.591Z cpu14:34232)WARNING: PCI: 157: 0000:06:00.3: Bypassing non-ACS capable device in hierarchy

2016-12-31T01:08:30.274Z cpu14:35400)WARNING: NetDVS: 659: portAlias is NULL

Some other messages in vmkernel.log:

2017-01-12T23:37:38.784Z cpu0:32807)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x1a (0x43a18508d5c0, 0) to dev "mpx.vmhba0:C0:T0:L0" on path "vmhba0:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2017-01-12T23:38:34.260Z cpu10:34416)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x1a (0x439d8695ec40, 0) to dev "naa.600605b0054459d01fbde6d065e4443f" on path "vmhba1:C2:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE

2017-01-12T23:38:34.283Z cpu10:34416)ScsiDeviceIO: 2651: Cmd(0x439d8695ec40) 0x1a, CmdSN 0x3d39 from world 0 to dev "naa.600605b0054459d01fbde6d065e4443f" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

2017-01-12T23:38:34.283Z cpu10:34416)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x85 (0x439d8695ec40, 34416) to dev "naa.600605b0054459d01fbde6d065e4443f" on path "vmhba1:C2:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2017-01-12T23:38:34.283Z cpu10:34416)ScsiDeviceIO: 2651: Cmd(0x439d8695ec40) 0x4d, CmdSN 0x9b5 from world 34416 to dev "naa.600605b0054459d01fbde6d065e4443f" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2017-01-12T23:38:34.283Z cpu10:34416)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x1a (0x439d8695ec40, 34416) to dev "naa.600605b0054459d01fbde6d065e4443f" on path "vmhba1:C2:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE

2017-01-12T23:38:34.283Z cpu10:34416)ScsiDeviceIO: 2651: Cmd(0x439d8695ec40) 0x1a, CmdSN 0x9b6 from world 34416 to dev "naa.600605b0054459d01fbde6d065e4443f" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

2017-01-12T23:38:34.284Z cpu10:34416)ScsiDeviceIO: 2651: Cmd(0x439d8695ec40) 0x1a, CmdSN 0x3d3e from world 0 to dev "naa.600605b0054459d01fbde6d065e4443f" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

2017-01-12T23:38:34.284Z cpu10:34416)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x85 (0x439d8695ec40, 34416) to dev "naa.600605b0054459d01fbde6d065e4443f" on path "vmhba1:C2:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2017-01-12T23:42:38.782Z cpu15:32822)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x1a (0x43a180310700, 0) to dev "mpx.vmhba0:C0:T0:L0" on path "vmhba0:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2017-01-12T23:47:38.781Z cpu21:32828)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x1a (0x43a18036bf40, 0) to dev "mpx.vmhba0:C0:T0:L0" on path "vmhba0:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2017-01-12T23:52:38.778Z cpu21:32828)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x1a (0x43a1870e1f00, 0) to dev "mpx.vmhba0:C0:T0:L0" on path "vmhba0:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2017-01-12T23:57:38.783Z cpu15:32783)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x1a (0x43a180308240, 0) to dev "mpx.vmhba0:C0:T0:L0" on path "vmhba0:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2017-01-13T00:01:02.009Z cpu19:35388)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x1a (0x43a18036c9c0, 0) to dev "naa.600605b0054459d01fbde6d065e4443f" on path "vmhba1:C2:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE

2017-01-13T00:01:02.037Z cpu19:35388)ScsiDeviceIO: 2651: Cmd(0x43a18036c9c0) 0x1a, CmdSN 0x3d48 from world 0 to dev "naa.600605b0054459d01fbde6d065e4443f" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

2017-01-13T00:01:02.522Z cpu10:32860)ScsiDeviceIO: 2651: Cmd(0x439d85532500) 0x1a, CmdSN 0x3d4d from world 0 to dev "naa.600605b0054459d01fbde6d065e4443f" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

2017-01-13T00:01:38.775Z cpu15:33298)NMP: nmp_ResetDeviceLogThrottling:3349: last error status from device naa.600605b0054459d01fbde6d065e4443f repeated 2 times

2017-01-13T00:02:38.782Z cpu18:32825)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x1a (0x43a1870d5dc0, 0) to dev "mpx.vmhba0:C0:T0:L0" on path "vmhba0:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2017-01-13T00:02:38.782Z cpu18:32825)ScsiDeviceIO: 2635: Cmd(0x43a1870d5dc0) 0x1a, CmdSN 0x3d4e from world 0 to dev "mpx.vmhba0:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2017-01-13T00:07:38.784Z cpu10:32817)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x1a (0x43a184f737c0, 0) to dev "mpx.vmhba0:C0:T0:L0" on path "vmhba0:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2017-01-13T00:08:34.292Z cpu10:34416)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x1a (0x439d802f4900, 0) to dev "naa.600605b0054459d01fbde6d065e4443f" on path "vmhba1:C2:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE
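
For readers trying to interpret the NMP and ScsiDeviceIO entries above, here is a minimal Python sketch that decodes the command opcode and the status fields. The lookup tables cover only the codes that appear in this excerpt and follow the standard SCSI code assignments; the function name `decode_scsi_status` and the regular expressions are illustrative and are not part of any VMware tool.

```python
import re

# Only the codes seen in this thread's excerpt (standard SCSI assignments).
OPCODES = {0x1A: "MODE SENSE(6)", 0x4D: "LOG SENSE", 0x85: "ATA PASS-THROUGH(16)"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {
    (0x20, 0x00): "INVALID COMMAND OPERATION CODE",
    (0x24, 0x00): "INVALID FIELD IN CDB",
}

# NMP lines read "Cmd 0x1a (0x..., 0)"; ScsiDeviceIO lines read "Cmd(0x...) 0x1a,".
OPCODE_RE = re.compile(
    r"Cmd (0x[0-9a-fA-F]+) \(|Cmd\(0x[0-9a-fA-F]+\) (0x[0-9a-fA-F]+),"
)
STATUS_RE = re.compile(
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+) "
    r"Valid sense data: (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)"
)


def decode_scsi_status(line):
    """Return a dict describing the failed command, or None if the line does not match."""
    op = OPCODE_RE.search(line)
    st = STATUS_RE.search(line)
    if not op or not st:
        return None
    opcode = int(op.group(1) or op.group(2), 16)
    host, device, plugin, key, asc, ascq = (int(v, 16) for v in st.groups())
    return {
        "command": OPCODES.get(opcode, hex(opcode)),
        "host_status": host,
        "device_status": DEVICE_STATUS.get(device, hex(device)),
        "plugin_status": plugin,
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "additional_sense": ASC_ASCQ.get((asc, ascq), f"{asc:#x}/{ascq:#x}"),
    }
```

Read this way, the entries show the devices rejecting mode-page, log-page, and ATA pass-through queries with ILLEGAL REQUEST rather than failing a read or write. Whether that has anything to do with the reboots cannot be concluded from these lines alone.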


Last reboot time is around 12/31/2016 9:08:21 AM

The disk is configured as RAID 00, with no redundancy.

vXav
Expert

Looks to me like there is a problem with your USB (controller?)

2016-12-31T01:07:50.032Z cpu1:33291)WARNING: LinNet: LinNet_CreateDMAEngine:4011: vusb0, failed to get device properties with error Not supported
2016-12-31T01:07:50.032Z cpu1:33291)WARNING: LinNet: LinNet_ConnectUplink:11920: vusb0: Failed to create DMA engine with error Not supported, it maybe a pseudo device
2016-12-31T01:07:50.481Z cpu6:33291)WARNING: LinNet: LinNet_CreateDMAEngine:4011: vusb0, failed to get device properties with error Not supported
2016-12-31T01:07:50.481Z cpu6:33291)WARNING: LinNet: LinNet_ConnectUplink:11920: vusb0: Failed to create DMA engine with error Not supported, it maybe a pseudo device

Again, are your server and its components on the HCL for ESXi 6.0? And did you install ESXi with an IBM customized ISO?

vmwarexl
Contributor

Thanks for your reply. Yes, it's on the HCL.

It's not the IBM customized ISO, only the generic version. I didn't know there was a customized version.

I don't use USB on this server. Can a USB failure cause the server to reboot?

Can I just disable the USB device?

ArjunDooti
Enthusiast

Hi,

Is it possible to upload the last vmksummary.log and vmkernel.log files from before the reboot?

cd /var/run/log

Please post the output of the command below:

less vmksummary.log | grep boot

Verify whether a core dump partition is configured. If yes, follow the KB below and generate the core dump file.

Extracting a core dump file from the diagnostic partition following a purple diagnostic screen error...

Configuring ESXi coredump to file instead of partition (2077516) | VMware KB

If a core dump file is not configured, follow the KB below and configure it.

Generating a VMkernel zdump manually from a dump file in ESXi host (2081902) | VMware KB

Thanks & Regards

Arjun Dooti

vXav
Expert

Sorry, I didn't see your reply.

You can download a customized ISO from Lenovo's website here: https://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=migr-5098036

Maybe try reinstalling with it; this might be a driver issue.

vmwarexl
Contributor

Thanks Arjun.

I have attached vmkernel.log and vmksummary.log, but I don't know how to find the vmkernel.log from before the reboot; it only has the latest information.

I followed your links and found that no core dump was generated on the partition. I have now created a file for the core dump. There is also no core dump message in vmksummary.log.

Here is the output of "less vmksummary.log | grep boot":

2016-11-15T19:03:47Z bootstop: Host has booted

2016-11-15T19:08:09Z bootstop: Host is rebooting

2016-11-15T19:11:42Z bootstop: Host has booted

2016-11-18T15:41:03Z bootstop: Host is rebooting

2016-11-18T07:51:42Z bootstop: Host has booted

2016-11-20T12:38:12Z bootstop: Host has booted

2016-11-20T12:54:33Z bootstop: Host has booted

2016-11-21T09:46:26Z bootstop: Host has booted

2016-11-21T13:18:27Z bootstop: Host has booted

2016-11-28T04:16:23Z bootstop: Host has booted

2016-11-28T04:31:43Z bootstop: Host has booted

2016-11-28T06:53:09Z bootstop: Host has booted

2016-11-28T18:31:56Z bootstop: Host has booted

2016-11-29T19:23:56Z bootstop: Host has booted

2016-11-29T22:56:02Z bootstop: Host has booted

2016-12-03T00:07:02Z bootstop: Host has booted

2016-12-03T13:27:01Z bootstop: Host has booted

2016-12-05T20:16:45Z bootstop: Host has booted

2016-12-06T16:17:42Z bootstop: Host has booted

2016-12-09T23:18:14Z bootstop: Host has booted

2016-12-19T02:39:22Z bootstop: Host is powering off

2016-12-19T03:27:27Z bootstop: Host has booted

2016-12-19T03:31:44Z bootstop: Host is powering off

2016-12-19T03:42:24Z bootstop: Host has booted

2016-12-26T13:59:29Z bootstop: Host has booted

2016-12-27T20:07:23Z bootstop: Host has booted

2016-12-30T21:01:43Z bootstop: Host has booted

2016-12-30T21:23:24Z bootstop: Host has booted

2016-12-31T01:08:27Z bootstop: Host has booted
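
One way to read the listing above: an orderly restart normally leaves a "Host is rebooting" or "Host is powering off" entry before the next "Host has booted". The Python sketch below flags boots that have no such preceding entry. It assumes the `<timestamp> bootstop: <message>` line shape shown above, and the function name `find_unexpected_boots` is only illustrative.

```python
CLEAN_STOPS = {"Host is rebooting", "Host is powering off"}


def find_unexpected_boots(lines):
    """Return timestamps of boots that were not preceded by a clean stop entry."""
    unexpected = []
    previous = None  # message of the previous bootstop entry, if any
    for line in lines:
        timestamp, sep, message = line.strip().partition(" bootstop: ")
        if not sep:
            continue  # not a bootstop entry (blank line, heartbeat, etc.)
        # The first entry is skipped because what preceded it is unknown.
        if (
            message == "Host has booted"
            and previous is not None
            and previous not in CLEAN_STOPS
        ):
            unexpected.append(timestamp)
        previous = message
    return unexpected


if __name__ == "__main__":
    # Path as mentioned in the thread; adjust if the log lives elsewhere.
    with open("/var/run/log/vmksummary.log") as handle:
        for stamp in find_unexpected_boots(handle):
            print("boot without a clean stop:", stamp)
```

In the listing above, most "Host has booted" entries have no stop entry in front of them, which is consistent with abrupt resets rather than orderly restarts.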

vmwarexl
Contributor

Thanks vxav.

I have downloaded the customized ISO.

The file is only several MB and I don't know how to use it; the readme file is not clear either.

Can you help explain it? Thanks.

SureshKumarMuth
Commander

You have uploaded the kernel logs from 18th to 20th Jan, and I don't see any reboot event in them. When did the last reboot occur? Based on one of your replies, it was in December. Is the server running fine now, for more than 20 days?

Regards,
Suresh
https://vconnectit.wordpress.com/
vmwarexl
Contributor

Hi, Kumar

I have found the previous vmkernel log. The last reboot was on Dec 31st at around 9am.

I'm not sure what caused the reboot. It looks like there may be a disk issue?

ArjunDooti
Enthusiast

Hi,

Have you noticed any recent reboots? If yes, upload the vmkernel files from the /var/run/log folder.

vmwarexl
Contributor

Hi, Arjun

According to the ESXi host events log, the latest reboot was around 1/29/2017 2:15:43 AM.

I have attached the vmkernel log. There are some error messages, but I'm not sure what they mean.

Thanks.

ArjunDooti
Enthusiast

Hi ,

ESXi logs use the UTC time zone. The attached vmkernel log files seem to be from after the reboot.
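
Because the log timestamps are in UTC, they need converting before they can be compared with a locally reported reboot time. Here is a minimal Python sketch, assuming a local offset of UTC+8. That offset is only inferred from this thread (2016-12-31T01:08:27Z in vmksummary.log against a reported reboot at about 9:08 AM local time) and was not stated by the poster; `LOCAL_OFFSET_HOURS` and the function name are illustrative.

```python
from datetime import datetime, timedelta, timezone

LOCAL_OFFSET_HOURS = 8  # assumption inferred from the thread; change for your site


def utc_log_time_to_local(stamp, offset_hours=LOCAL_OFFSET_HOURS):
    """Convert an ESXi log timestamp such as '2016-12-31T01:08:27Z' to local time."""
    # Drop the trailing 'Z' and any fractional seconds before parsing.
    base = stamp.rstrip("Z").split(".")[0]
    utc_time = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    return utc_time.astimezone(timezone(timedelta(hours=offset_hours)))


if __name__ == "__main__":
    print(utc_log_time_to_local("2017-01-28T18:15:50.613Z"))
```

With the same assumed offset, a reboot reported locally at about 2:15 AM on 1/29/2017 corresponds to entries stamped around 18:15Z on 2017-01-28, so the log entries of interest are the ones just before that UTC time.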

Please upload the vobd.log file and the vmkernel files from /var/run/log.

You will find similar files with names like vmkernel.01.gz; upload all of them.

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=10336...

Thanks & Regards

Arjun Dooti

vmwarexl
Contributor

Hi, Arjun

I have attached the log files, thanks.

ArjunDooti
Enthusiast

Last time, I believe you configured a core dump partition. Check the /var/core/log folder. If you see any core files, let me know; otherwise follow the KB below, extract the core dump, and let me know.

Is ESXi configured with VSAN?

There seems to be a local disk error in the vmkernel file. Run an intensive hardware diagnostic test on the server and open a ticket with the hardware vendor.

If core dump files exist, let me know and I will send you the next steps.

Extracting a core dump file from the diagnostic partition following a purple diagnostic screen error (1002769)

Reboot time:

2017-01-28T18:15:50.613Z: [UserLevelCorrelator] 49090900us: [esx.audit.host.boot] Host has booted.

2017-01-28T18:15:16.430Z: [netCorrelator] 14908101us: [esx.audit.net.firewall.port.hooked] Port vmk0 is now protected by Firewall.

2017-01-28T18:15:16.430Z: An event (esx.audit.net.firewall.port.hooked) could not be sent immediately to hostd; queueing for retry.

2017-01-28T18:15:20.625Z: [scsiCorrelator] 19102463us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba1:C2:T0:L0 changed state from dead

2017-01-28T18:15:20.638Z: [scsiCorrelator] 19115806us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba0:C0:T0:L0 changed state from dead

2017-01-28T18:15:25.114Z: [GenericCorrelator] 23591060us: [vob.user.coredump.configured2] At least one coredump target is enabled.

2017-01-28T18:15:25.114Z: [UserLevelCorrelator] 23591060us: [vob.user.coredump.configured2] At least one coredump target is enabled.

2017-01-28T18:15:25.114Z: [UserLevelCorrelator] 23591568us: [esx.clear.coredump.configured2] At least one coredump target has been configured. Host core dumps will be saved.

2017-01-28T18:15:25.114Z: An event (esx.clear.coredump.configured2) could not be sent immediately to hostd; queueing for retry.

2017-01-28T18:15:31.507Z: [GenericCorrelator] 29984024us: [vob.user.dcui.enabled] The DCUI has been enabled

2017-01-28T18:15:31.507Z: [UserLevelCorrelator] 29984024us: [vob.user.dcui.enabled] The DCUI has been enabled

2017-01-28T18:15:31.507Z: [UserLevelCorrelator] 29984554us: [esx.audit.dcui.enabled] The DCUI has been enabled.

2017-01-28T18:15:31.507Z: An event (esx.audit.dcui.enabled) could not be sent immediately to hostd; queueing for retry.

2017-01-28T18:15:50.613Z: [UserLevelCorrelator] 49090465us: [vob.user.host.boot] Host has booted.

2017-01-28T18:15:50.613Z: [GenericCorrelator] 49090465us: [vob.user.host.boot] Host has booted.

2017-01-28T18:15:50.613Z: [UserLevelCorrelator] 49090900us: [esx.audit.host.boot] Host has booted.

Thanks & Regards

Arjun Dooti

vmwarexl
Contributor

Hi, Arjun

There is no VSAN. I cannot find a core dump on the partition either; none was generated.

Do you mean the hardware diagnostic tools from server firmware?
