Hello,
I've recently installed a new vSphere host. Several VMs are running fine, but sometimes the host reboots all of a sudden. There is no PSOD on the console. From the logs I can see that it happens anywhere from once every 2 days up to 3 times a day. The number of running VMs and the load do not seem to play any role.
The host is running build 11675023. What can I do to capture the reboot event? The logs are not telling me anything.
Sounds like a hardware issue more than an ESXi problem (unless everything gracefully shuts down and reboots).
In the case of hardware I would check power supplies and operating temps. The CPU could be overheating, causing the host to shut down/reboot. A few questions would be...
- What server hardware are you running on?
- Is the hardware on the HCL?
- Does your server have lights out/out of band management (Dell has iDRAC, HP has iLO, etc.)? Check those logs if so
Welcome to the Community,
most server vendors have integrated logs (independent of the operating system) through e.g. iLO, iDRAC, ... which may be helpful in such a case.
André
Hi,
Can you give more information regarding the underlying hardware?
Is the hardware certified by the vendor to run ESXi 6.7?
Has the hardware vendor provided any best practices or BIOS settings for VMware ESXi 6.7? If yes, have they been configured?
Are the firmware and driver combinations for the hardware components (HBA/storage controller/NIC) in line with the VMware HCL?
You can check the ESXi logs (hostd, vmkernel, vmkwarning, vobd) under /var/run/log. If the logs are getting wiped after an ESXi reboot, you can configure a syslog server to push the ESXi logs to. There is a VMware KB article on this.
Thanks for the welcome and all the suggestions.
The server is a self-made system with two Xeon Silver 4114 CPUs and an Asus Z11PA-D8 mainboard. It has an integrated ASMB9-iKVM module, which I've configured to send syslog data to another hardware server. The same syslog server also receives data from the ESXi host. I thought that should make it easy to find out what was happening on the host prior to the crash/reboot, but it's not. At least I couldn't figure it out. Maybe you guys can read the log better?
2019-04-01 13:36:54 Local4.Info 192.168.0.44 2019-04-01T11:36:09.614Z vm-server-iv.dc01.local Hostd: info hostd[2099821] [Originator@6876 sub=Solo.HTTP server /host user=root] Sent OK response for GET /host/vmkwarning.log
2019-04-01 13:36:54 Local4.Debug 192.168.0.44 2019-04-01T11:36:10.021Z vm-server-iv.dc01.local Rhttpproxy: verbose rhttpproxy[2100709] [Originator@6876 sub=Proxy Req 00104] Resolved endpoint : [N7Vmacore4Http16LocalServiceSpecE:0x000000a58710a1c0] _serverNamespace = /vpxa action = Allow _port = 8089
2019-04-01 13:36:55 Local4.Debug 192.168.0.44 2019-04-01T11:36:10.461Z vm-server-iv.dc01.local Rhttpproxy: verbose rhttpproxy[2099149] [Originator@6876 sub=Proxy Req 00098] Resolved endpoint : [N7Vmacore4Http16LocalServiceSpecE:0x000000a587103830] _serverNamespace = /sdk action = Allow _port = 8307
2019-04-01 13:36:55 Local4.Debug 192.168.0.44 2019-04-01T11:36:10.981Z vm-server-iv.dc01.local Rhttpproxy: verbose rhttpproxy[2100708] [Originator@6876 sub=Proxy Req 00091] Resolved endpoint : [N7Vmacore4Http16LocalServiceSpecE:0x000000a5871035b0] _serverNamespace = /host action = Allow _port = 8309
2019-04-01 13:36:55 Local4.Info 192.168.0.44 2019-04-01T11:36:10.992Z vm-server-iv.dc01.local Hostd: info hostd[2099823] [Originator@6876 sub=Solo.HTTP server /host user=root] Sent OK response for GET /host/vmkeventd.log
2019-04-01 13:36:57 Local4.Error 192.168.0.44 2019-04-01T11:36:13.267Z vm-server-iv.dc01.local Hostd: error hostd[2099834] [Originator@6876 sub=Hostsvc.NsxSpecTracker] Object not found/hostspec disabled
2019-04-01 13:36:59 User.Notice 192.168.0.44 Injector: Sleeping!
vm-server-iv.dc01.local sdrsInjector:
2019-04-01 13:37:01 Cron.Info 192.168.0.66 1 2019-04-01T13:37:01.728046+02:00 192 CROND 30805 - - (root) CMD (. /etc/profile.d/VMware-visl-integration.sh; /usr/lib/applmgmt/backup_restore/scripts/SchedulerCron.py >>/var/log/vmware/applmgmt/backupSchedulerCron.log 2>&1)
2019-04-01 13:37:01 Cron.Info 192.168.0.66 1 2019-04-01T13:37:01.735139+02:00 192 CROND 30806 - - (root) CMD ( test -x /usr/sbin/vpxd_periodic && /usr/sbin/vpxd_periodic >/dev/null 2>&1)
2019-04-01 13:37:02 Local4.Debug 192.168.0.44 2019-04-01T11:36:17.443Z vm-server-iv.dc01.local Rhttpproxy: verbose rhttpproxy[2100708] [Originator@6876 sub=Proxy Req 00084] Resolved endpoint : [N7Vmacore4Http16LocalServiceSpecE:0x000000a58710a1c0] _serverNamespace = /vpxa action = Allow _port = 8089
2019-04-01 13:37:02 Local4.Info 192.168.0.44 2019-04-01T11:36:17.444Z vm-server-iv.dc01.local Vpxa: info vpxa[2100170] [Originator@6876 sub=vpxLro opID=PollQuickStatsLoop-74856499-f1] [VpxLRO] -- BEGIN lro-210 -- vpxa -- vpxapi.VpxaService.fetchQuickStats -- 52f46758-89a9-823b-c2af-9541947c6b40
2019-04-01 13:37:02 Local4.Info 192.168.0.44 2019-04-01T11:36:17.445Z vm-server-iv.dc01.local Vpxa: info vpxa[2100170] [Originator@6876 sub=vpxLro opID=PollQuickStatsLoop-74856499-f1] [VpxLRO] -- FINISH lro-210
2019-04-01 13:37:02 User.Info 192.168.0.66 1 2019-04-01T13:37:03.051189+02:00 192 updatemgr - - - 2019-04-01T13:37:03:051Z 'Activation' 140668836292352 INFO [activationValidator, 368] Leave Validate. Succeeded for integrity.VcIntegrity.retrieveHostIPAddresses on target: Integrity.VcIntegrity
2019-04-01 13:37:02 User.Info 192.168.0.66 1 2019-04-01T13:37:03.058737+02:00 192 updatemgr - - - 2019-04-01T13:37:03:051Z 'VcIntegrity' 140668836292352 INFO [vcIntegrity, 1519] Getting IP Address from host name: 192
2019-04-01 13:37:02 User.Info 192.168.0.66 1 2019-04-01T13:37:03.064899+02:00 192 updatemgr - - - 2019-04-01T13:37:03:064Z 'VcIntegrity' 140668836292352 INFO [vcIntegrity, 1536] Cannot get IP address for host name: 192
2019-04-01 13:37:03 User.Debug 192.168.0.66 1 2019-04-01T13:37:03.499180+02:00 192 updatemgr - - - 2019-04-01T13:37:03:499Z 'JobDispatcher' 140668863104768 DEBUG [JobDispatcher, 415] The number of tasks: 0
***********************************
*** I think reset happened here ***
***********************************
2019-04-01 13:37:07 Local0.Warning 192.168.0.45 Apr 1 12:37:07 vm-server-iv-kvm IPMIMain: [640 : 704 WARNING][IPMBIfc.c:727]IPMBIfc.c : Error sending IPMB packet to Slave 0x16
2019-04-01 13:37:12 Local0.Critical 192.168.0.45 Apr 1 12:37:12 vm-server-iv-kvm IPMIMain: [640 : 734 CRITICAL][PnmTask.c:520]NMAPI.c : Error fetching messages from NM_IPMB_MSG_Q
2019-04-01 13:37:12 Local0.Critical 192.168.0.45 Apr 1 12:37:12 vm-server-iv-kvm IPMIMain: [640 : 735 CRITICAL][NMAPI.c:152]PnmTask.c : Error fetching messages from NM_RESPONSE_MSG_Q
2019-04-01 13:37:12 Kernel.Warning 192.168.0.45 Apr 1 12:37:12 vm-server-iv-kvm kernel: [1134290.790000] NCSI(eth1): Link is Down
2019-04-01 13:37:12 Kernel.Warning 192.168.0.45 Apr 1 12:37:12 vm-server-iv-kvm kernel: [1134290.790000] NCSI(eth1): Unknown Speed and Duplex
2019-04-01 13:37:12 Kernel.Warning 192.168.0.45 Apr 1 12:37:12 vm-server-iv-kvm kernel: [1134290.800000] NCSI(eth1): Link is Down
2019-04-01 13:37:12 Kernel.Warning 192.168.0.45 Apr 1 12:37:12 vm-server-iv-kvm kernel: [1134290.800000] NCSI(eth1): Unknown Speed and Duplex
2019-04-01 13:37:12 Kernel.Debug 192.168.0.45 Apr 1 12:37:12 vm-server-iv-kvm kernel: [1134290.810000] NCSI(eth1): Channel 0.0 Disabled
2019-04-01 13:37:12 Kernel.Debug 192.168.0.45 Apr 1 12:37:12 vm-server-iv-kvm kernel: [1134290.820000] NCSI(eth1): Channel 1.0 Disabled
2019-04-01 13:37:13 Kernel.Info 192.168.0.45 Apr 1 12:37:13 vm-server-iv-kvm kernel: [1134291.120000] LPC RESET
2019-04-01 13:37:13 Kernel.Warning 192.168.0.45 Apr 1 12:37:13 vm-server-iv-kvm kernel: [1134291.120000] Reset ioctl unlocked
2019-04-01 13:37:13 Local0.Critical 192.168.0.45 Apr 1 12:37:13 vm-server-iv-kvm IPMIMain: [640 : 686 CRITICAL][BTIfc.c:68] LPC Reset Occurred
2019-04-01 13:37:13 Local0.Critical 192.168.0.45 Apr 1 12:37:13 vm-server-iv-kvm IPMIMain: [640 : 726 CRITICAL][SensorEvent/SensorDevice/Sensor.c:1956]Error in getting TLS data 640
*** 20 repetition of previous line
2019-04-01 13:37:13 Local0.Critical 192.168.0.45 Apr 1 12:37:13 vm-server-iv-kvm IPMIMain: [640 : 726 CRITICAL][SensorEvent/SensorDevice/Sensor.c:1480]Error in getting TLS data 640
2019-04-01 13:37:13 Local0.Warning 192.168.0.45 Apr 1 12:37:13 vm-server-iv-kvm IPMIMain: [640 : 701 WARNING][IPMBIfc.c:727]IPMBIfc.c : Error sending IPMB packet to Slave 0x16
2019-04-01 13:37:13 Local0.Warning 192.168.0.45 Apr 1 12:37:13 vm-server-iv-kvm IPMIMain: [640 : 701 WARNING][IPMBIfc.c:727]IPMBIfc.c : Error sending IPMB packet to Slave 0x16
2019-04-01 13:37:13 Local0.Critical 192.168.0.45 Apr 1 12:37:13 vm-server-iv-kvm IPMIMain: [640 : 726 CRITICAL][SensorEvent/SensorDevice/Sensor.c:1956]Error in getting TLS data 640
*** 110 repetition of previous line
2019-04-01 13:37:14 Local0.Critical 192.168.0.45 Apr 1 12:37:14 vm-server-iv-kvm IPMIMain: [640 : 726 CRITICAL][SensorEvent/SensorDevice/Sensor.c:1956]Error in getting TLS data 640
2019-04-01 13:37:14 Local0.Warning 192.168.0.45 Apr 1 12:37:14 vm-server-iv-kvm IPMIMain: [640 : 701 WARNING][IPMBIfc.c:727]IPMBIfc.c : Error sending IPMB packet to Slave 0x16
2019-04-01 13:37:14 Kernel.Debug 192.168.0.45 Apr 1 11:36:29 vm-server-iv-kvm kernel: [1134292.840000] NCSI(eth1): Channel 0.0 Enabled
2019-04-01 13:37:14 Kernel.Debug 192.168.0.45 Apr 1 11:36:29 vm-server-iv-kvm kernel: [1134292.850000] NCSI(eth1): Channel 1.0 Enabled
2019-04-01 13:37:15 Cron.Info 192.168.0.45 Apr 1 11:36:29 vm-server-iv-kvm /usr/sbin/cron[5655]: (CRON) INFO (pidfile fd = 4)
2019-04-01 13:37:15 Cron.Info 192.168.0.45 Apr 1 11:36:29 vm-server-iv-kvm /usr/sbin/cron[5657]: (CRON) STARTUP (fork ok)
2019-04-01 13:37:15 Cron.Info 192.168.0.45 Apr 1 11:36:30 vm-server-iv-kvm /usr/sbin/cron[5657]: (CRON) INFO (Running @reboot jobs)
The message sources are:
192.168.0.44: ESXi Host
192.168.0.45: iKVM
192.168.0.66: VCSA
Hardware issues might be possible. Memtest86 is currently running on the machine (though it's ECC memory).
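For anyone who wants to line up the three sources in a merged log like the one above, here is a minimal Python sketch. It assumes the collector prefixes each line with receive time, `Facility.Severity` and source IP, exactly as in the excerpt. The names `parse_line` and `last_seen_before` are made up for illustration and are not part of any VMware or ASUS tool.

```python
import re
from datetime import datetime

# Receiver-side prefix as seen in the excerpt (an assumption about the collector):
# "<date> <time> <Facility.Severity> <source IP> <message>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+\.\w+) (\d+\.\d+\.\d+\.\d+) (.*)$"
)


def parse_line(line):
    """Return (received_at, severity, source_ip, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    received_at = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return received_at, m.group(2), m.group(3), m.group(4)


def last_seen_before(lines, marker):
    """Map each source IP to the receive time of its last message before `marker` first appears."""
    last_seen = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip annotations and wrapped lines that lack the prefix
        received_at, _severity, source_ip, message = parsed
        if marker in message:
            break  # stop at the first line that contains the marker text
        last_seen[source_ip] = received_at
    return last_seen
```

Called with the lines of the excerpt and the marker `"LPC Reset Occurred"`, this should show the last message from 192.168.0.44 (the ESXi host) received at 13:37:02, while the iKVM logs the LPC reset at 13:37:13. In other words the host goes silent without logging any shutdown or error first, which fits an abrupt hardware reset better than an orderly reboot. It only uses the receive times stamped by the syslog server, because the clocks embedded in the messages disagree between the sources.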
Hi minivlab,
your guess was right. It was the power supply. After replacing this part the problem disappeared. Thanks a lot!
Good.
Whenever you face this type of issue, you should fetch the IML and DSET logs. These logs help to diagnose whether the issue is a hardware issue or a software one.
Always remember that.