VMware Cloud Community
BobLeHareng
Contributor

bootstop: Host has booted / unexpected reboot: how to debug?

Hello guys,

For some time now I've been having problems with unexpected reboots on my ESXi host.

I have looked at this KB article:

https://kb.vmware.com/s/article/1019238#:~:text=If%20your%20VMware%20ESXi%20host,faulty%20components....

but didn't find anything conclusive.

I guess the last reboot happened between 2020-09-19T08:42:24.481Z and 2020-09-19T08:51:14.365Z.

I don't see anything in "hostd.log":

2020-09-19T08:41:51.482Z verbose hostd[2099022] [Originator@6876 sub=Default opID=esxui-8c0c-17bc user=root] AdapterServer: target='vmodl.query.PropertyCollector:ha-property-collector', method='waitForUpdatesEx'

2020-09-19T08:41:51.560Z verbose hostd[2099058] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: guest.disk, 7. Sent notification immediately.

2020-09-19T08:41:51.561Z verbose hostd[2099131] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: guest.disk, 7. Sent notification immediately.

2020-09-19T08:41:53.849Z verbose hostd[2099059] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: guest, 7. Sent notification immediately.

2020-09-19T08:41:53.849Z verbose hostd[2099059] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: summary.guest, 7. Sent notification immediately.

2020-09-19T08:41:54.484Z verbose hostd[2099023] [Originator@6876 sub=Default opID=esxui-1cf5-17bd user=root] AdapterServer: target='vmodl.query.PropertyCollector:ha-property-collector', method='waitForUpdatesEx'

2020-09-19T08:42:20.553Z verbose hostd[2099518] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: guest.disk, 2. Sent notification immediately.

2020-09-19T08:42:20.615Z verbose hostd[2099024] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: guest, 2. Sent notification immediately.

2020-09-19T08:42:20.615Z verbose hostd[2099024] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: summary.guest, 2. Sent notification immediately.

2020-09-19T08:42:21.485Z verbose hostd[2099059] [Originator@6876 sub=Default opID=esxui-ae15-17be user=root] AdapterServer: target='vmodl.query.PropertyCollector:ha-property-collector', method='waitForUpdatesEx'

2020-09-19T08:42:21.561Z verbose hostd[2099062] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: guest.disk, 7. Sent notification immediately.

2020-09-19T08:42:21.561Z verbose hostd[2099515] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: guest.disk, 7. Sent notification immediately.

2020-09-19T08:42:23.833Z verbose hostd[2099517] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: guest, 7. Sent notification immediately.

2020-09-19T08:42:23.833Z verbose hostd[2099517] [Originator@6876 sub=PropertyProvider] RecordOp ASSIGN: summary.guest, 7. Sent notification immediately.

2020-09-19T08:42:24.481Z verbose hostd[2099522] [Originator@6876 sub=Default opID=esxui-d074-17bf user=root] AdapterServer: target='vmodl.query.PropertyCollector:ha-property-collector', method='waitForUpdatesEx'

2020-09-19T08:51:14.365Z - time the service was last started, Section for VMware ESX, pid=2098958, version=6.7.0, build=16316930, option=Release

2020-09-19T08:51:14.365Z warning -[2098958] [Originator@6876 sub=Default] Failed to load vsansvc configuration file: N7Vmacore22AlreadyExistsExceptionE(Already Exists)

--> [context]zKq7AVICAgAAAAL6+AAJLQAALE42bGlidm1hY29yZS5zbwAAsL4bAL6dFwCeShcBbX7JaG9zdGQAAYc9yQGbs2ICfRkCbGliYy5zby42AAGt1mI=[/context]

2020-09-19T08:51:14.365Z info -[2098958] [Originator@6876 sub=Default] Supported VMs 334

2020-09-19T08:51:14.365Z info -[2098958] [Originator@6876 sub=Handle checker] Setting system limit of 3740

2020-09-19T08:51:14.365Z info -[2098958] [Originator@6876 sub=Handle checker] Set system limit to 3740

2020-09-19T08:51:14.365Z info -[2098958] [Originator@6876 sub=Default] Setting malloc mmap threshold to 32 k

2020-09-19T08:51:14.365Z info -[2098958] [Originator@6876 sub=Default] getrlimit(RLIMIT_NPROC): curr=4096 max=8192

2020-09-19T08:51:14.365Z info -[2098958] [Originator@6876 sub=Default] Glibc malloc guards disabled.

2020-09-19T08:51:14.365Z info -[2098958] [Originator@6876 sub=Default] Initialized SystemFactory

Same for "vmksummary.log":

2020-09-19T07:00:00Z heartbeat: up 3d20h57m50s, 2 VMs; [] []

2020-09-19T08:00:00Z heartbeat: up 3d21h57m50s, 2 VMs; [] []

2020-09-19T08:51:23Z bootstop: Host has booted

2020-09-19T09:00:01Z heartbeat: up 0d0h12m8s, 1 VM; [] []

What else can I look at to understand and solve this?

The host is a Dell R710 with the latest BIOS/firmware applied.

Current ESXi version: 6.7.0 Update 3 (Build 16316930). The same reboot problem occurred with previous versions.

I have only a few VMs on this ESXi host; it happens even when only 2 VMs are running.

Any ideas?

Thanks !

Best regards,

Bob

NB: Logs attached.

9 Replies
a_p_
Leadership
Leadership

ESXi usually stops with a PSOD on errors, i.e. it doesn't reboot automatically.

Unless already done, check the logs within iDRAC to find out whether they contain any hints about the reason for the reboot.
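On a Dell server, one way to pull those logs is racadm, either from the iDRAC SSH console or with remote racadm. A minimal sketch; the remote connection values are placeholders for your environment:

# Remote example: racadm -r <idrac-ip> -u root -p <password> getsel
racadm getsel       # System Event Log -- look for entries around the reboot timestamp
racadm getraclog    # iDRAC's own log, which sometimes records power/reset events the SEL misses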


André

BobLeHareng
Contributor

Absolutely nothing on the iDRAC side.

I just changed the PSU, just in case. :)

Wait and see...

khiregange
VMware Employee

This looks like a hardware issue.

From the vmkernel.log:

2020-09-19T08:42:07.413Z cpu12:2097897)DVFilter: 6054: Checking disconnected filters for timeouts

2020-09-19T08:42:14.415Z cpu12:2098339)SunRPC: 1099: Destroying world 0x20a4eb

VMB: 66: Reserved 4 MPNs starting @ 0x4a0

VMB: 113: mbMagic: 1badb005, mbInfo 0x600000

VMB: 106: Changed PAT MSR from 0x7040600070406 to 0x7010600070106

VMB_SERIAL: 264: Serial port set to default configuration.

From the VMware Knowledge Base article linked above:

Determine if the VMware ESX host hardware abruptly rebooted. When the VMware ESX host hardware abruptly reboots, it generates a series of events similar to:

localhost logger: (1265803308) hb: vmk loaded, 1746.98, 1745.148, 0, 208167, 208167, 0, vmware-h-59580, sfcbd-7660, sfcbd-3524
localhost vmkhalt: (1268149486) Starting system...
localhost logger: (1268149540) loaded VMkernel

If your VMware ESX host has experienced an outage and it was not the result of a kernel error, deliberate reboot, or shut down, then the physical hardware may have abruptly restarted on its own. Hardware is known to reboot abruptly due to power outages, faulty components, and heating issues. To investigate further, engage the hardware vendor.
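To look for that pattern on the host itself, something like this should be enough (just a sketch; the log file names are the ESXi 6.7 defaults, and the rotated .gz copies may also be worth checking):

# The vmksummary uptime reset is the clearest sign of an abrupt reboot
grep -E "bootstop|heartbeat" /var/log/vmksummary.log | tail -n 20

# The "loaded VMkernel" / "Starting system" style messages from the KB
grep -iE "vmk loaded|Starting system" /var/log/syslog.log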

nachogonzalez
Commander

Hey, hope you are doing fine

Usually, when an ESXi host reboots for "no reason", it's because of a PSOD (purple screen of death); those are mostly caused by a hardware or driver failure (similar to a Windows BSOD).

Do you have some sort of management console (iLO/iDRAC/BMC)?
Those consoles usually show whether the host was rebooted at the hardware level, or whether a piece of hardware is failing.

Another thing you should check is whether you have ESXi core dumps configured and available.

This VMware Knowledge Base article might help.

If you don't, configure them and check the core dumps the next time the server fails.
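As a rough sketch, this is how the dump target can be checked and enabled from the ESXi shell (standard esxcli namespaces; adjust to whichever target type you prefer):

# Is any dump target configured at all?
esxcli system coredump partition get
esxcli system coredump file get
esxcli system coredump network get

# If none is active, the simplest option is usually to let ESXi pick a local dump partition
esxcli system coredump partition set --enable true --smart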

BobLeHareng
Contributor

Hi,

Many thanks for your answer.

According to the iDRAC there is no specific problem.

But this is an iDRAC6, and its diagnostics are really limited!

I also suspect a hardware issue.

I've enabled core dumps on ESXi, and I've also replaced the server power supply.

I'm monitoring...

I'll keep my eyes open. :)

NathanosBlightc
Commander

Regardless of the disk dump or net dump configuration, it's not a bad idea to investigate the following log files too: vmkernel.log and vmksummary.log.

The dump information may not be easy to interpret. ;)
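For example, something along these lines (the "VMB:" boot banner in vmkernel.log marks where the new boot starts, so whatever precedes it is the last thing the old kernel logged; the rotated vmkernel.*.gz copies are worth a look too):

# Uptime reset in vmksummary.log
grep bootstop /var/log/vmksummary.log

# Find the boot banner in vmkernel.log, then read the lines just above it
grep -n "VMB:" /var/log/vmkernel.log | head -n 5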

Please mark my comment as the Correct Answer if this solution resolved your problem
edwardlin
Contributor

Hello, I have the same problem.

I have 2 hosts that rebooted ESXi at the same time.

The hosts are Dell PowerEdge R740 servers with the latest BIOS/firmware applied.

Current ESXi version: 7.0.0 (Build 16324942)

I have only a few VMs on these ESXi hosts; it happens when only 2 VMs are running (both hosts in the same condition).

We use the latest iDRAC 7.

Our vmksummary.log looks like this:

2021-06-14T08:00:00Z heartbeat: up 31d7h58m53s, 2 VMs; [[2099656 hostd 84096kB] [2100529 vmx 12580864kB] [2100500 vmx 33288324kB]] []
2021-06-14T09:00:01Z heartbeat: up 31d8h58m53s, 2 VMs; [[2099656 hostd 84948kB] [2100529 vmx 12580864kB] [2100500 vmx 33288324kB]] []
2021-06-14T09:50:04Z bootstop: Host has booted
2021-06-14T10:00:00Z heartbeat: up 0d0h10m54s, 0 VMs; [[2097691 vmsyslogd 15032kB] [2099757 vpxa 23376kB] [2099491 hostd 64764kB]] []
2021-06-14T11:00:00Z heartbeat: up 0d1h10m54s, 0 VMs; [[2097691 vmsyslogd 15032kB] [2099757 vpxa 23416kB] [2099491 hostd 65932kB]] []

Can you let me know the current status of your host? Is the problem still recurring?

Arthos
Enthusiast

From vmkernel.log:

2020-09-19T08:51:19.474Z cpu18:2098586)Starting service xorg
2020-09-19T08:51:19.477Z cpu18:2098586)Activating Jumpstart plugin xorg.
2020-09-19T08:51:19.882Z cpu18:2098586)Jumpstart plugin xorg activated.
2020-09-19T08:51:19.886Z cpu18:2098586)Starting service vmtoolsd
2020-09-19T08:51:19.893Z cpu18:2098586)Activating Jumpstart plugin vmtoolsd.
2020-09-19T08:51:19.968Z cpu1:2098339)SunRPC: 3303: Synchronous RPC abort for client 0x430589e741c0 IP 192.168.1.200.0.111 proc 3 xid 0x48170796 attempt 2 of 3
2020-09-19T08:51:20.095Z cpu18:2098586)Jumpstart plugin vmtoolsd activated.
2020-09-19T08:51:20.254Z cpu19:2099518)Hardware: 572: Reading cpu0 MSR 0x17
2020-09-19T08:51:21.139Z cpu20:2098958)Config: 703: "VMOverheadGrowthLimit" = 4294967295, Old Value: -1, (Status: 0x0)
2020-09-19T08:51:23.487Z cpu19:2099588)Boot Successful -> host boot up finished here..
2020-09-19T08:51:23.862Z cpu23:2099600)Starting up rc scripts
2020-09-19T08:51:25.190Z cpu20:2099055 opID=2e3b53aa)World: 11943: VC opID vim-cmd-f9-a4ab-a4ac maps to vmkernel opID 2e3b53aa
2020-09-19T08:51:25.190Z cpu20:2099055 opID=2e3b53aa)Config: 703: "SIOControlFlag2" = 1, Old Value: 0, (Status: 0x0)
2020-09-19T08:51:25.223Z cpu0:2099666)MemSched: vm 2099666: 5745: extended swap to 20480 pgs
2020-09-19T08:51:25.307Z cpu0:2099666)MemSched: vm 2099666: 5745: extended swap to 22016 pgs
2020-09-19T08:51:25.308Z cpu0:2099666)World: vm 2099667: 7076: Starting world vmm0:freenas of type 8
2020-09-19T08:51:25.308Z cpu0:2099666)Sched: vm 2099667: 6206: Adding world 'vmm0:freenas', group 'host/user', cpu: shares=-3 min=0 minLimit=-1 max=-1, mem: shares=-3 min=6291456 minLimit=-1 max=-1
2020-09-19T08:51:25.308Z cpu0:2099666)Sched: vm 2099667: 6221: renamed group 11237 to vm.2099666
2020-09-19T08:51:25.308Z cpu0:2099666)Sched: vm 2099667:

 

 

From vmksummary.log:

2020-09-19T00:00:00Z heartbeat: up 3d13h57m52s, 2 VMs; [] []
2020-09-19T01:00:01Z heartbeat: up 3d14h57m52s, 2 VMs; [] []
2020-09-19T02:00:00Z heartbeat: up 3d15h57m51s, 2 VMs; [] []
2020-09-19T03:00:00Z heartbeat: up 3d16h57m51s, 2 VMs; [] []
2020-09-19T04:00:01Z heartbeat: up 3d17h57m51s, 2 VMs; [] []
2020-09-19T05:00:00Z heartbeat: up 3d18h57m50s, 2 VMs; [] []
2020-09-19T06:00:00Z heartbeat: up 3d19h57m51s, 2 VMs; [] []
2020-09-19T07:00:00Z heartbeat: up 3d20h57m50s, 2 VMs; [] []
2020-09-19T08:00:00Z heartbeat: up 3d21h57m50s, 2 VMs; [] []
2020-09-19T08:51:23Z bootstop: Host has booted
2020-09-19T09:00:01Z heartbeat: up 0d0h12m8s, 1 VM; [] []

 

 

Now, before the system goes for a reboot there are many RPC calls destroying worlds. We can check which network worlds are running and capture the packets corresponding to 192.168.1.185 port 111; see the commands after the log excerpt below.

 

2020-09-19T08:09:24.303Z cpu13:2098339)SunRPC: 1099: Destroying world 0x20a464
2020-09-19T08:10:25.306Z cpu12:2098339)SunRPC: 1099: Destroying world 0x20a466
2020-09-19T08:11:27.308Z cpu12:2098339)SunRPC: 1099: Destroying world 0x20a472
2020-09-19T08:11:57.309Z cpu12:2098339)CpuSched: 699: user latency of 2139252 RPC-tx-192.168.1.185.0.111 0 changed by 2098339 NFSv3-RemountHandler -6
2020-09-19T08:12:07.310Z cpu14:2097897)DVFilter: 6054: Checking disconnected filters for timeouts
2020-09-19T08:12:29.311Z cpu19:2098339)SunRPC: 1099: Destroying world 0x20a474
2020-09-19T08:13:30.316Z cpu19:2098339)SunRPC: 1099: Destroying world 0x20a477
2020-09-19T08:14:31.320Z cpu12:2098339)SunRPC: 1099: Destroying world 0x20a479
2020-09-19T08:15:33.323Z cpu14:2098339)SunRPC: 1099: Destroying world 0x20a485
2020-09-19T08:16:34.327Z cpu13:2098339)SunRPC: 1099: Destroying world 0x20a487
2020-09-19T08:17:35.331Z cpu23:2098339)SunRPC: 1099: Destroying world 0x20a489
2020-09-19T08:18:36.334Z cpu12:2098339)SunRPC: 1099: Destroying world 0x20a48b
2020-09-19T08:19:38.337Z cpu13:2098339)SunRPC: 1099: Destroying world 0x20a48d
2020-09-19T08:20:38.341Z cpu20:2098339)SunRPC: 1099: Destroying world 0x20a499
2020-09-19T08:21:40.344Z cpu15:2098339)SunRPC: 1099: Destroying world 0x20a49b
2020-09-19T08:22:07.344Z cpu12:2097897)DVFilter: 6054: Checking disconnected filters for timeouts
2020-09-19T08:22:42.346Z cpu15:2098339)SunRPC: 1099: Destroying world 0x20a49d
2020-09-19T08:23:43.351Z cpu12:2098339)SunRPC: 1099: Destroying world 0x20a49f
2020-09-19T08:24:45.355Z cpu23:2098339)SunRPC: 1099: Destroying world 0x20a4a1
2020-09-19T08:25:46.359Z cpu12:2098339)SunRPC: 1099: Destroying world 0x20a4ad
2020-09-19T08:26:47.362Z cpu16:2098339)SunRPC: 1099: Destroying world 0x20a4af
2020-09-19T08:27:49.363Z cpu12:2098339)SunRPC: 1099: Destroying world 0x20a4b1
2020-09-19T08:28:50.367Z cpu12:2098339)SunRPC: 1099: Destroying world 0x20a4b3
2020-09-19T08:29:52.373Z cpu12:2098339)SunRPC: 1099: Destroying world 0x20a4b5
2020-09-19T08:30:54.376Z cpu12:2098339)SunRPC: 1099: Destroying world 0x20a4c1
2020-09-19T08:31:57.378Z cpu12:2098339)SunRPC: 1099: Destroying world 0x20a4c3
2020-09-19T08:32:07.378Z cpu15:2097897)DVFilter: 6054: Checking disconnected filters for timeouts
2020-09-19T08:32:59.383Z cpu13:2098339)SunRPC: 1099: Destroying world 0x20a4c5
2020-09-19T08:34:01.387Z cpu23:2098339)SunRPC: 1099: Destroying world 0x20a4c7
2020-09-19T08:35:02.390Z cpu23:2098339)SunRPC: 1099: Destroying world 0x20a4c9
2020-09-19T08:36:05.392Z cpu19:2098339)SunRPC: 1099: Destroying world 0x20a4d5
2020-09-19T08:37:06.397Z cpu22:2098339)SunRPC: 1099: Destroying world 0x20a4d7
2020-09-19T08:38:08.401Z cpu22:2098339)SunRPC: 1099: Destroying world 0x20a4d9
2020-09-19T08:39:10.405Z cpu22:2098339)SunRPC: 1099: Destroying world 0x20a4db
2020-09-19T08:40:11.408Z cpu22:2098339)SunRPC: 1099: Destroying world 0x20a4dd
2020-09-19T08:41:12.412Z cpu12:2098339)SunRPC: 1099: Destroying world 0x20a4e9
2020-09-19T08:42:07.413Z cpu12:2097897)DVFilter: 6054: Checking disconnected filters for timeouts
2020-09-19T08:42:14.415Z cpu12:2098339)SunRPC: 1099: Destroying world 0x20a4eb
VMB: 66: Reserved 4 MPNs starting @ 0x4a0   --> reboot initiated here.
VMB: 113: mbMagic: 1badb005, mbInfo 0x600000
VMB: 106: Changed PAT MSR from 0x7040600070406 to 0x7010600070106
VMB_SERIAL: 264: Serial port set to default configuration.
VMB_MEMMAP: 2317: memmap[0]: addr 0, len 94000, type 1
VMB_MEMMAP: 2317: memmap[1]: addr 100000, len bf279000, type 1
VMB_MEMMAP: 2317: memmap[2]: addr bf379000, len 16000, type 2
VMB_MEMMAP: 2317: memmap[3]: addr bf38f000, len 3f000, type 3
VMB_MEMMAP: 2317: memmap[4]: addr bf3ce000, len c32000, type 2
VMB_MEMMAP: 2317: memmap[5]: addr e0000000, len 10000000, type 2
VMB_MEMMAP: 2317: memmap[6]: addr fe000000, len 2000000, type 2
VMB_MEMMAP: 2317: memmap[7]: addr 100000000, len e40000000, type 1
VMB_MEMMAP: 830: Low
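A hedged sketch of that check from the ESXi shell (vmk0 is an assumption; use whichever VMkernel interface carries the NFS traffic):

# Which NFS datastores are mounted, and do any of them point at 192.168.1.185 / 192.168.1.200?
esxcli storage nfs list

# Capture the traffic to that NFS server (port 111 is the portmapper seen in the log)
pktcap-uw --vmk vmk0 --ip 192.168.1.185 -o /tmp/nfs-185.pcap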

 

 

edwardlin
Contributor

Thank you for your description.

I checked our vmkernel.log.

Maybe our problem is a vmw_psp_fixed problem.
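For reference, the policy and path state on that device can be checked like this (device name taken from the log below):

# Which SATP/PSP is claiming the local device, and what is its configuration?
esxcli storage nmp device list -d mpx.vmhba1:C0:T7:L0

# State of the individual path(s) behind it (active / standby)
esxcli storage nmp path list -d mpx.vmhba1:C0:T7:L0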

I tried the approach from this other thread:
https://communities.vmware.com/t5/VMware-vSphere-Discussions/Local-RAID5-latency-spikes-every-5-minu...

but it did not work.

Are there any other suggestions?

2021-06-25T02:02:56.611Z cpu5:2097577)WARNING: VMW_SATP_LOCAL: satp_local_updatePath:807: Marking path vmhba1:C0:T7:L0 as STANDBY
2021-06-25T02:02:56.611Z cpu5:2097577)vmw_psp_fixed: psp_fixedSelectPathToActivateInt:439: Switching to preferred path vmhba1:C0:T7:L0 in STANDBY state for device mpx.vmhba1:C0:T7:L0.
2021-06-25T02:02:56.611Z cpu5:2097577)WARNING: vmw_psp_fixed: psp_fixedSelectPathToActivateInt:464: Selected current STANDBY path vmhba1:C0:T7:L0 for device mpx.vmhba1:C0:T7:L0 to activate. This may lead to path thrashing.
2021-06-25T02:02:56.611Z cpu5:2097577)vmw_psp_fixed: psp_fixedSelectPathToActivateInt:439: Switching to preferred path vmhba1:C0:T7:L0 in STANDBY state for device mpx.vmhba1:C0:T7:L0.
2021-06-25T02:02:56.611Z cpu5:2097577)WARNING: vmw_psp_fixed: psp_fixedSelectPathToActivateInt:464: Selected current STANDBY path vmhba1:C0:T7:L0 for device mpx.vmhba1:C0:T7:L0 to activate. This may lead to path thrashing.
2021-06-25T02:02:56.611Z cpu5:2097577)WARNING: VMW_SATP_LOCAL: satp_local_tryToActivatePath:1415: Checking TUR on STANDBY path vmhba1:C0:T7:L0
2021-06-25T02:07:56.613Z cpu5:2097577)WARNING: VMW_SATP_LOCAL: satp_local_updatePath:807: Marking path vmhba1:C0:T7:L0 as STANDBY
2021-06-25T02:07:56.613Z cpu5:2097577)vmw_psp_fixed: psp_fixedSelectPathToActivateInt:439: Switching to preferred path vmhba1:C0:T7:L0 in STANDBY state for device mpx.vmhba1:C0:T7:L0.
2021-06-25T02:07:56.613Z cpu5:2097577)WARNING: vmw_psp_fixed: psp_fixedSelectPathToActivateInt:464: Selected current STANDBY path vmhba1:C0:T7:L0 for device mpx.vmhba1:C0:T7:L0 to activate. This may lead to path thrashing.
2021-06-25T02:07:56.613Z cpu5:2097577)vmw_psp_fixed: psp_fixedSelectPathToActivateInt:439: Switching to preferred path vmhba1:C0:T7:L0 in STANDBY state for device mpx.vmhba1:C0:T7:L0.
2021-06-25T02:07:56.613Z cpu5:2097577)WARNING: vmw_psp_fixed: psp_fixedSelectPathToActivateInt:464: Selected current STANDBY path vmhba1:C0:T7:L0 for device mpx.vmhba1:C0:T7:L0 to activate. This may lead to path thrashing.
2021-06-25T02:07:56.613Z cpu5:2097577)WARNING: VMW_SATP_LOCAL: satp_local_tryToActivatePath:1415: Checking TUR on STANDBY path vmhba1:C0:T7:L0
