VMware Cloud Community
baber
Expert

Why has my ESXi host been restarted?

Dear all,

One of my ESXi hosts restarted last night at 04:40 my local time (5/03/2017), and I could not find any reason in the hostd.log file. I have attached the log file. Can you help me, please?

BR

Please mark helpful or correct if my answer resolved your issue.
15 Replies
dja234
Enthusiast

Please refer to the link below and provide the relevant information.

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=10192...

Darshana Jayathilake
baber
Expert

Thanks, but I have two questions:

1 - I don't have a syslog server at the moment. How can I find out why my ESXi host restarted? Which log file do I have to read?

2 - What is the benefit of a core dump?

BR

RajeevVCP4
Expert

Host restart events are captured in vmkernel.log. Please attach the vmkernel.log files.

Rajeev Chauhan
VCIX-DCV6.5/VSAN/VXRAIL
Please mark helpful or correct if my answer is useful for you
baber
Expert

I have attached the vmkernel.log file.

BR

a_p_
Leadership

Does the server have integrated management, like iLO for HP or iDRAC for Dell?

Maybe you can find information in the server's management logs.


André

baber
Expert

My server is an HP, and I have checked iLO, but there is no log entry there other than a note that the server was restarted. How can I find out why my server restarted?

Is there really no log in ESXi?

a_p_
Leadership

The reason why I asked for the server's management log is that it looks like a hard server reboot/reset rather than a graceful reboot, and in case of an ESXi kernel panic, the server wouldn't usually reboot, but stop with a PSOD (Purple Screen of Diagnostics).

I could be wrong, but to me this rather looks like a hardware issue, where either the server restarted itself (ASR), or there was a power loss!?
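One way to check this yourself is to look at what vmkernel.log recorded immediately before and after the boot. The following is only a minimal sketch: it assumes that a new boot begins with non-timestamped loader lines starting with `VMB:` (as in the vmkernel.log excerpt posted later in this thread), and `find_boot_boundaries` is a hypothetical helper name, not a VMware tool. If the last line before the boot is an ordinary runtime message, that is consistent with an abrupt reset, although it does not prove one.

```python
import re

# Lines written while the kernel is running start with an ISO timestamp.
TIMESTAMPED = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z ")


def find_boot_boundaries(lines, marker="VMB:"):
    """Return (last_line_before_boot, first_timestamped_line_after_boot) pairs.

    `marker` is an assumption: the prefix of the first boot-loader line.
    """
    results = []
    last_stamped = None   # most recent timestamped line seen so far
    pending = None        # last timestamped line before the current boot
    in_boot = False
    for raw in lines:
        line = raw.rstrip("\n")
        if TIMESTAMPED.match(line):
            if in_boot:
                results.append((pending, line))
                in_boot = False
            last_stamped = line
        elif line.startswith(marker) and not in_boot:
            pending = last_stamped
            in_boot = True
    return results


# Sample lines taken from the vmkernel.log excerpt in this thread.
sample = [
    '2017-05-02T21:43:04.235Z cpu26:33205)ScsiDeviceIO: 2636: Cmd(0x43b9803087c0) 0x85, '
    'CmdSN 0x508 from world 34795 to dev "naa.600508b1001cab20db564b8c75321fa5" failed '
    'H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.',
    'VMB: 49: mbMagic: 2badb002, mbInfo 0x101404',
    '2017-05-03T00:15:33.033Z cpu1:32857)MCE: 1020: cpu1: MCA error detected via CMCI '
    '(Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.',
]

boundaries = find_boot_boundaries(sample)
for before, after in boundaries:
    print("last line before boot :", before[:60])
    print("first line after boot :", after[:60])
```

To run it against a real log, pass the open file object instead of `sample`. Timestamps in vmkernel.log are in UTC, so convert them before comparing with your local time.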

André

dariusd
VMware Employee

Your kernel log shows a large number of correctable Machine Check Exceptions (MCE), which usually indicate a hardware problem.  Please perform hardware diagnostics.  Your host's firmware or service processor may be able to provide additional information regarding the failures.

2017-05-03T06:35:51.485Z cpu3:33269)MCE: 1020: cpu3: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.

2017-05-03T06:35:51.485Z cpu3:33269)MCE: 190: cpu3: bank0: status=0xd800014000020e0f: (VAL=1, OVFLW=1, UC=0, EN=1, PCC=0, S=0, AR=0), Addr:0x0 (invalid), Misc:0x1 (valid)

2017-05-03T06:35:51.485Z cpu3:33269)MCE: 199: cpu3: bank0: MCA recoverable error (CE): "Bus and Interconnect: OtherTrans Bus Generic error."
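For readers who want to see where the flags in the second log line come from, here is a minimal sketch that decodes the `status=` value by hand. It assumes the architectural IA32_MCi_STATUS bit layout described in the Intel Software Developer's Manual (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57, S=56, AR=55, corrected error count in bits 52:38, MCA error code in bits 15:0) and that this layout applies to the CPU in this host. The class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class MciStatus:
    valid: bool             # VAL
    overflow: bool          # OVER (logged as OVFLW)
    uncorrected: bool       # UC
    enabled: bool           # EN
    misc_valid: bool        # MISCV
    addr_valid: bool        # ADDRV
    context_corrupt: bool   # PCC
    signaling: bool         # S
    action_required: bool   # AR
    corrected_count: int    # bits 52:38
    mca_error_code: int     # bits 15:0


def decode_mci_status(status: int) -> MciStatus:
    """Split a 64-bit machine-check bank status value into its fields."""
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    return MciStatus(
        valid=bit(63),
        overflow=bit(62),
        uncorrected=bit(61),
        enabled=bit(60),
        misc_valid=bit(59),
        addr_valid=bit(58),
        context_corrupt=bit(57),
        signaling=bit(56),
        action_required=bit(55),
        corrected_count=(status >> 38) & 0x7FFF,
        mca_error_code=status & 0xFFFF,
    )


# Status value copied from the log line above.
decoded = decode_mci_status(0xD800014000020E0F)
print(decoded)
print(f"MCA error code: {decoded.mca_error_code:#06x}")
```

The decoded flags should agree with what the kernel already printed (UC=0 and PCC=0, i.e. a corrected error). The OVER bit being set suggests that at least one further error arrived before the bank was read, which fits the "large number" observation.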

Thanks,

--

Darius

TheBobkin
Champion

Hello Baber

Do you have anything that can restart hosts after they PSOD or get the plug pulled? (such as HP ASR)

It is possibly PSODing and getting restarted. Do you see a core file having been created at /var/core?

If there is, was a coredump file created successfully?
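A quick way to answer this is to list whatever is in that directory together with sizes and modification times, then compare the times with the time of the restart. This is only a sketch: `/var/core` is the path mentioned above, `list_core_files` is a hypothetical helper, and no assumption is made about how the dump files are named.

```python
import os
import time


def list_core_files(core_dir="/var/core"):
    """Return (name, size_in_bytes, mtime) for each regular file in core_dir."""
    try:
        names = sorted(os.listdir(core_dir))
    except OSError:
        # Directory missing or unreadable: report nothing rather than fail.
        return []
    entries = []
    for name in names:
        path = os.path.join(core_dir, name)
        if os.path.isfile(path):
            st = os.stat(path)
            entries.append((name, st.st_size, st.st_mtime))
    return entries


for name, size, mtime in list_core_files():
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(mtime))
    print(f"{stamp}Z  {size:>12}  {name}")
```

An empty listing does not by itself mean that no PSOD occurred. It may simply mean that no dump was written to this location.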

From the boot log it looks like you are using 6.0 U3 (build: 5050593), is this correct?

I ask as the error messages seen in the vmkernel.log appear similar to the error messages associated with PSODs that were resolved in 5.5 U3b P10 (build: 3568722) and 6.0 U2 (build: 3620759):

https://kb.vmware.com/kb/2140848

vmkernel.log before and after reboot:

2017-05-02T14:57:48.293Z cpu11:33695)MCE: 1020: cpu11: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.

2017-05-02T14:57:48.293Z cpu11:33695)MCE: 190: cpu11: bank0: status=0xd80000c000020e0f: (VAL=1, OVFLW=1, UC=0, EN=1, PCC=0, S=0, AR=0), Addr:0x0 (invalid), Misc:0x1 (valid)

2017-05-02T14:57:48.293Z cpu11:33695)MCE: 199: cpu11: bank0: MCA recoverable error (CE): "Bus and Interconnect: OtherTrans Bus Generic error."

2017-05-02T21:43:04.235Z cpu26:33205)ScsiDeviceIO: 2636: Cmd(0x43b9803087c0) 0x85, CmdSN 0x508 from world 34795 to dev "naa.600508b1001cab20db564b8c75321fa5" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

VMB: 49: mbMagic: 2badb002, mbInfo 0x101404

2017-05-03T00:15:33.033Z cpu1:32857)MCE: 1020: cpu1: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.

2017-05-03T00:15:33.033Z cpu1:32857)MCE: 190: cpu1: bank0: status=0xd800014000020e0f: (VAL=1, OVFLW=1, UC=0, EN=1, PCC=0, S=0, AR=0), Addr:0x0 (invalid), Misc:0x1 (valid)

2017-05-03T00:15:33.033Z cpu1:32857)MCE: 199: cpu1: bank0: MCA recoverable error (CE): "Bus and Interconnect: OtherTrans Bus Generic error."

2017-05-03T00:16:13.616Z cpu1:33539)MCE: 1020: cpu1: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.

2017-05-03T00:16:13.616Z cpu1:33539)MCE: 190: cpu1: bank0: status=0xd800008000020e0f: (VAL=1, OVFLW=1, UC=0, EN=1, PCC=0, S=0, AR=0), Addr:0x0 (invalid), Misc:0x1 (valid)

2017-05-03T00:16:13.616Z cpu1:33539)MCE: 199: cpu1: bank0: MCA recoverable error (CE): "Bus and Interconnect: OtherTrans Bus Generic error."

2017-05-03T00:17:31.744Z cpu18:33364)MCE: 1020: cpu18: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.

2017-05-03T00:17:31.744Z cpu18:33364)MCE: 190: cpu18: bank0: status=0xd80000c000020e0f: (VAL=1, OVFLW=1, UC=0, EN=1, PCC=0, S=0, AR=0), Addr:0x0 (invalid), Misc:0x1 (valid)

2017-05-03T00:17:31.744Z cpu18:33364)MCE: 199: cpu18: bank0: MCA recoverable error (CE): "Bus and Interconnect: OtherTrans Bus Generic error."

2017-05-03T00:17:51.796Z cpu7:34777)MCE: 1020: cpu7: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.

2017-05-03T00:17:51.796Z cpu7:34777)MCE: 190: cpu7: bank0: status=0xd800010000020e0f: (VAL=1, OVFLW=1, UC=0, EN=1, PCC=0, S=0, AR=0), Addr:0x0 (invalid), Misc:0x1 (valid)

2017-05-03T00:17:51.796Z cpu7:34777)MCE: 199: cpu7: bank0: MCA recoverable error (CE): "Bus and Interconnect: OtherTrans Bus Generic error."

While the error messages are present, I don't think they are spamming enough to lock up the host. Then again, the host may not be logging to vmkernel.log when it is in a panic state, so it could be related to this older issue.
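To put a number on "spamming", the events can be counted per hour and per CPU. The sketch below assumes the line format seen in the excerpt above and counts only the "MCA error detected" line of each three-line group, so every event is counted once. `summarize_mce` is a hypothetical helper, and the sample is a small subset of the lines quoted in this post.

```python
import re
from collections import Counter

# Captures the date plus hour, and the CPU number, of each MCE event header line.
MCE_EVENT = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}):\d{2}:\d{2}\.\d{3}Z "
    r"cpu(\d+):\d+\)MCE: \d+: cpu\d+: MCA error detected"
)


def summarize_mce(lines):
    """Return (events per hour, events per CPU) as two Counters."""
    per_hour = Counter()
    per_cpu = Counter()
    for line in lines:
        match = MCE_EVENT.match(line.strip())
        if match:
            per_hour[match.group(1)] += 1
            per_cpu[int(match.group(2))] += 1
    return per_hour, per_cpu


sample = [
    "2017-05-02T14:57:48.293Z cpu11:33695)MCE: 1020: cpu11: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.",
    '2017-05-02T14:57:48.293Z cpu11:33695)MCE: 199: cpu11: bank0: MCA recoverable error (CE): "Bus and Interconnect: OtherTrans Bus Generic error."',
    "2017-05-03T00:16:13.616Z cpu1:33539)MCE: 1020: cpu1: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.",
    "2017-05-03T00:17:31.744Z cpu18:33364)MCE: 1020: cpu18: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.",
    "2017-05-03T00:17:51.796Z cpu7:34777)MCE: 1020: cpu7: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.",
]

per_hour, per_cpu = summarize_mce(sample)
for hour, count in sorted(per_hour.items()):
    print(f"{hour}:00Z  {count} event(s)")
print("CPUs reporting:", sorted(per_cpu))
```

Run against the full log, a steady rise in the hourly counts, or events spread across many CPUs as in the sample, would be worth mentioning to the hardware vendor.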

If you have something auto-rebooting your hosts, turn that utility off so you can get a screenshot of the PSOD (assuming it is PSODing) and have some backtrace data to work with. Alternatively, if a core dump is being generated successfully, open a case with us at VMware Support for further analysis.

Bob

-o- If you found this comment useful please click the 'Helpful' button and/or select as 'Answer' if you consider it so, please ask follow-up questions if you have any -o-

baber
Expert

How could you tell that it was a hard server reboot/reset rather than a graceful reboot?

Do you see that in vmkernel.log?

Can you tell me exactly where I have to look for the reason, so that if it occurs another time I can find it?

BR

baber
Expert

Dear TheBobkin,

Q: Do you have anything that can restart hosts after they PSOD or get the plug pulled (such as HP ASR)?

A: I did not do any specific configuration on my HP server; it is at its defaults.

Why does a PSOD appear?

How can I tell that a PSOD has appeared?

Is there any way to diagnose these problems?

If my ESXi host restarts another time, how can I find out the reason?

RajeevVCP4
Expert

What is the build/version number of ESXi 5.5?

Rajeev Chauhan
VCIX-DCV6.5/VSAN/VXRAIL
baber
Expert

Is it important which version of ESXi I use?

Is there no general solution?

BR

RajeevVCP4
Expert

Yes, this is important, because we often see this type of issue as a result of a wrong patch/update.

If you log a case with VMware, this is the first thing they will ask.

Rajeev Chauhan
VCIX-DCV6.5/VSAN/VXRAIL
baber
Expert

My ESXi version is: VMware ESXi 6.0.0 build-5050593
