Re: ESXi blue screen

faziz · ‎10-10-2021

Hello

I have HP blade servers, and all the ESXi hosts went to a blue screen with the following error: 'cpu 7 / world 7905081 tried to re-acquire lock'

-What could be the issue?

-How can I tell whether this was caused by a power loss, and how can I check the logs?

Thanks

Vikramaditya_J · ‎10-11-2021

A blue screen error is often caused by a fault in server memory, possibly a faulty DIMM. However, since all your ESXi servers hit the blue screen, I presume it may be an issue with the firmware.

I suggest you first pick any one server and check its iLO event logs. You should find some clue there if there was a problem with a hardware component. Also check the firmware versions in iLO and contact HP to see whether you need to update firmware or replace any faulty hardware.

Thank you!
Vikramaditya J

faziz · ‎10-11-2021

I checked iLO and it shows no errors at all. It is strange that this happened on a couple of servers and not on the others, although all of them have the same firmware. Where can I find logs that would help identify the root cause? How can I tell whether this issue is related to VMware or HPE? Are there specific logs that can confirm it?

Vikramaditya_J · ‎10-11-2021

Something is surely common to the rebooted servers; it could be a shared power supply, or common hardware or firmware.

It still doesn't look to me like an issue with the OS; it seems more like a hardware fault.

However, do look into the ESXi host logs and follow the troubleshooting steps given in this KB: https://kb.vmware.com/s/article/1019238

Pay special attention to the vmksummary log: # cat /var/run/log/vmksummary
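If the log is long, a small filter can help you find the interesting entries once you have copied the file off the host. This is only a rough sketch: the keyword list and the sample lines are my own placeholders, not the real vmksummary format, so adjust them to what you actually see in the file.

```python
from pathlib import Path

# Hypothetical keywords (an assumption, not an official list); tune as needed.
KEYWORDS = ("panic", "lock", "boot")


def matching_lines(lines, keywords=KEYWORDS):
    """Return (line_number, text) for every line containing any keyword,
    compared case-insensitively."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path, keywords=KEYWORDS):
    """Read a copied log file and return its matching lines."""
    text = Path(path).read_text(errors="replace")
    return matching_lines(text.splitlines(), keywords)


# Invented sample lines, only to show the behaviour of the filter.
sample = [
    "heartbeat ok",
    "host went down: PANIC recorded",
    "tried to re-acquire lock",
]
for number, line in matching_lines(sample):
    print(f"{number}: {line}")
```

For a real file you would call `scan_file("vmksummary")` with the path to your local copy.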

Thank you!
Vikramaditya J

Vikramaditya_J · ‎10-11-2021

Also, the fact that all the servers have the same firmware does not rule it out as the cause. The firmware may be outdated, or there may be firmware and hardware compatibility issues. It would be best to get in touch with the vendor and ask for advice.

Thank you!
Vikramaditya J

Texiwill · ‎10-11-2021

Hello,

Can you explain more about your configuration? Specifically, the model of the Blade server, the amount of resources in each, the type of storage you are using (VSAN, iSCSI, NFS, SAN)? What hardware is in each box?

Can you get to the logs, or are they exported somewhere? What are the log entries before the one you mentioned?

"tried to re-acquire lock" is a pretty generic message; it could refer to pretty much anything, so we need more information. If there are no iLO errors, then this could be related to a storage error, depending on what storage you are using. Since it is affecting more than one host, it is worth looking into other subsystems besides memory.

Best regards,

--
Edward L. Haletky
vExpert XIV: 2009-2023,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill

e_espinel · ‎10-11-2021

Hello.
I recommend checking the log of the blade chassis management module; it should contain some details of the problem. Also check the logs of the chassis's internal switches.

Finally, update the firmware of the entire chassis (management module, switches, nodes, etc.).
This is recommended at least once a year as part of equipment maintenance.

What version and build of ESXi do you have? Please also give the details of the blades (brand, type and model).

In the chassis management module you can get the list of your firmware levels; you can attach it to this thread.

Enrique Espinel
Senior Technical Support on IBM, Lenovo, Veeam Backup and VMware vSphere.
VSP-SV, VTSP-SV, VTSP-HCI, VTSP
Please mark my comment as Correct Answer or assign Kudos if my answer was helpful to you, Thank you.

Gabrie1 · ‎10-11-2021

I had this happen long ago to a complete cluster, when the storage array had a "hiccup" on a few replicated LUNs and then sent an error code to the hosts that they couldn't understand, which triggered a purple screen.

http://www.GabesVirtualWorld.com

jrehman · ‎10-11-2021

This blue screen issue can be resolved either by upgrading the ESXi hosts to the latest version or by removing all the E1000 NIC adapters in the environment and making the VMs use VMXNET3 adapters.

Texiwill · ‎10-11-2021

Hello,

Without knowing the actual version of the hosts involved, suggesting an upgrade is somewhat dangerous. Some upgrades do not work on older hardware, etc. As for removing the E1000 adapters from the VMs, that also sounds odd to me. I have blades (BL460c Gen10 and everything from Gen7) and E1000 has never been an issue. Storage has been an issue from time to time, memory, even a missing heat sink, but never the E1000.

Since this impacts multiple hosts, it could be a VM that moved around the hosts (logging will tell us that), or it could be something with specific cluster communication. Since most 'lock' messages in the logs are about storage, that is generally where I look first. But this sounds more like a crash dump message than a log message. Log messages about locks would be extremely helpful here.

At this time the landscape is too broad to pinpoint a potential solution.

Best regards,


faziz · ‎10-12-2021

The environment has multiple clusters, each with a couple of hosts. All hosts are the same model, ProLiant BL460c Gen8, running ESXi 6.5 build-4564106, and all are connected to a SAN.
All hosts in one cluster went to a PSOD screen while the others are working fine, so could it be a cluster issue?

Gabrie1 · ‎10-12-2021

Search for the common component. Probably storage.

The cluster itself seems highly unlikely, since it wouldn't create a PSOD; at least I can't imagine any cluster action that would cause a PSOD on all hosts at the same time.

If you have a core dump from the host, I suggest creating a support case and uploading the logs.
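To make the "search for the common component" idea concrete, here is a minimal sketch using plain set logic. The host names and component labels are made up for illustration; you would fill the inventory from your own chassis, SAN, and firmware records.

```python
def common_components(inventory, affected):
    """Return (shared, suspects).

    shared   = components present on every affected host
    suspects = shared components that no unaffected host uses
    """
    affected_sets = [set(inventory[host]) for host in affected]
    if not affected_sets:
        return set(), set()
    shared = set.intersection(*affected_sets)

    healthy = set()
    for host, parts in inventory.items():
        if host not in affected:
            healthy.update(parts)
    return shared, shared - healthy


# Hypothetical inventory: every name below is an invented placeholder.
inventory = {
    "esx01": {"chassis-1", "array-A", "fw-1.0"},
    "esx02": {"chassis-1", "array-A", "fw-1.0"},
    "esx03": {"chassis-2", "array-B", "fw-1.0"},
}
shared, suspects = common_components(inventory, affected={"esx01", "esx02"})
print(sorted(suspects))
```

In this made-up example the firmware drops out of the suspect list because a healthy host runs it too, which leaves the chassis and the storage array as the things to look at first.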

http://www.GabesVirtualWorld.com

Texiwill · ‎10-12-2021

Hello,

I have seen this before when an FC-HBA mezzanine was going bad but was not dead yet. Please review your /var/log/messages from the impacted nodes and look for HBA-related errors. You may see 3 distinct sets of error messages, but they all say the same thing: the HBA is going bad, but not bad enough for iLO to pick it up.
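One way to spot those distinct sets of messages is to filter the log for HBA-related lines and group them by a normalized signature, so that lines differing only in timestamps or numbers are counted together. This is a sketch under my own assumptions: the hint words and the sample lines are invented and do not reflect the real ESXi log format.

```python
import re
from collections import Counter

# Hypothetical hint words (an assumption); extend with your HBA driver name.
HBA_HINTS = ("hba", "fibre", "scsi")


def signature(line):
    """Collapse hex values and digit runs so similar messages share one key."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<hex>", line)
    return re.sub(r"\d+", "<n>", line).strip()


def group_hba_messages(lines, hints=HBA_HINTS):
    """Count HBA-related lines per normalized signature."""
    counts = Counter()
    for line in lines:
        if any(hint in line.lower() for hint in hints):
            counts[signature(line)] += 1
    return counts


# Invented sample lines, only to show the grouping.
sample = [
    "10:01 hba0 link down code 0x1f",
    "10:02 hba1 link down code 0x2a",
    "10:03 nic up",
]
for message, count in group_hba_messages(sample).most_common():
    print(count, message)
```

A handful of signatures with high counts on the affected nodes, and none on the healthy ones, would support the failing-HBA theory.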

Best regards,


BerniceDonald · ‎11-21-2022

OK, now it all makes sense. I appreciate your response.
