I have HP blade servers, and all of the ESXi hosts went to a blue screen with the following error: 'cpu 7 / world 7905081 tried to re-acquire lock'
- What could be the issue?
- How can I tell whether this was caused by a power loss, and how can I check the logs?
A blue screen error is often caused by a fault in server memory, possibly a faulty DIMM. However, since all of your ESXi servers hit the blue screen, I presume it could be a firmware issue.
I suggest you first pick any one server and check its iLO event logs. If there was a problem with a hardware component, you should find a clue there. Then check the firmware versions in iLO and contact HP to find out whether you need to update firmware or replace faulty hardware.
I checked iLO and it shows no errors at all. It is strange, though, that this happened on a couple of servers and not on the others, although all of them have the same firmware. Where can I find logs that would help identify the root cause? How do I know whether the issue is related to VMware or HPE? Are there specific logs that can confirm this?
Something is definitely common to the servers that rebooted: it could be a shared power supply, common hardware, or common firmware.
It still doesn't look like an OS issue to me; it seems to be a hardware fault.
However, do have a look at the ESXi host logs and follow the troubleshooting steps given in this KB: https://kb.vmware.com/s/article/1019238
In particular, check the summary log: # cat /var/run/log/vmksummary
Also, the fact that all the servers have the same firmware does not rule firmware out as the cause. The firmware may be outdated, or there may be firmware and hardware compatibility issues. It is best to get in touch with the vendor and ask for advice.
Can you explain more about your configuration? Specifically, the model of the Blade server, the amount of resources in each, the type of storage you are using (VSAN, iSCSI, NFS, SAN)? What hardware is in each box?
Can you get to the logs, or are they exported somewhere? What log entries come before the one you mentioned?
"tried to re-acquire lock" is a pretty generic message. It could refer to pretty much anything, so we need more information. If there are no iLO errors, then this could be related to a storage error, depending on what storage you are using. Since it is affecting more than one host, it is worth looking into other subsystems besides memory.
I recommend checking the log of the blade chassis management module; it should contain some details of the problem. Also check the logs of the chassis's internal switches.
Finally, update the firmware of the entire chassis (management module, switches, nodes, etc.).
This is recommended at least once a year as part of the maintenance of the equipment.
What version and build of ESXi do you have? Please also give the details of the blades (brand, type, and model).
In the chassis management module you can get the list of your hardware firmware levels; you can attach it to this post.
I had this happen long ago on a complete cluster, when the storage array had a "hiccup" on a few replicated LUNs and then sent an error code to the hosts that they couldn't understand, which triggered a purple screen.
This blue screen issue can be resolved either by upgrading the ESXi hosts to the latest version or by removing all the E1000 NIC adapters in the environment and making the VMs use VMXNET3 adapters.
Without knowing the actual version of the hosts involved, suggesting an upgrade is somewhat dangerous. Some upgrades do not work on older hardware, etc. As for removing the E1000 adapters from the VMs, that also sounds odd to me. I have blades (BL460c Gen10 and everything from Gen7) and the E1000 has never been an issue. Storage has been an issue from time to time, as has memory, and even a missing heat sink, but never the E1000.
Since this impacts multiple hosts, it could be a VM that moved around the hosts (logging will tell us that), or it could be something with specific cluster communication. Since most 'lock' messages in the logs are about storage, that is generally where I look first. But this sounds more like a crash dump message than a log message. Log messages about locks would be extremely helpful here.
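If the logs have been exported off the hosts, a small script can collect the lock-related entries together with what was logged just before each one. This is only a rough sketch in Python: the function name `matches_with_context` and the default of five context lines are my own choices, and it assumes nothing about the log format beyond plain text lines.

```python
from collections import deque
from typing import Iterable, List, Tuple


def matches_with_context(
    lines: Iterable[str], needle: str, before: int = 5
) -> List[Tuple[List[str], str]]:
    """Return (preceding_lines, matching_line) for every line containing `needle`.

    Matching is case-insensitive; `preceding_lines` holds at most `before` lines.
    """
    recent = deque(maxlen=before)  # rolling window of the most recent lines
    hits = []
    wanted = needle.lower()
    for raw in lines:
        line = raw.rstrip("\n")
        if wanted in line.lower():
            hits.append((list(recent), line))
        recent.append(line)
    return hits
```

For example, opening an exported log file and calling `matches_with_context(log_file, "lock")` would return every line mentioning a lock along with the few entries before it, which is usually the part worth posting here.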
At this time the landscape is too broad to pinpoint a potential solution.
The environment has multiple clusters, each with a couple of hosts. All hosts are the same model, ProLiant BL460c Gen8, running ESXi 6.5 build-4564106, and all hosts are connected to a SAN.
All hosts in one cluster went to the PSOD screen while the other cluster is working fine, so could it be a cluster issue?!
Search for the common component. Probably storage.
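To make "search for the common component" a bit more concrete, here is a minimal Python sketch. It assumes you have written down, for each host, the things it depends on (chassis, firmware level, storage array, cluster, and so on). All host and component names below are hypothetical placeholders, not taken from this environment, and `common_only_to_affected` is just an illustrative name.

```python
from typing import Dict, Set


def common_only_to_affected(
    inventory: Dict[str, Set[str]], affected: Set[str]
) -> Set[str]:
    """Components shared by every affected host and by no healthy host."""
    affected_sets = [inventory[host] for host in affected]
    if not affected_sets:
        return set()
    shared = set.intersection(*affected_sets)
    for host, components in inventory.items():
        if host not in affected:
            shared -= components  # drop anything a healthy host also uses
    return shared


# Hypothetical inventory: every name here is a placeholder.
inventory = {
    "esx01": {"chassis-A", "firmware-X", "array-1", "cluster-1"},
    "esx02": {"chassis-A", "firmware-X", "array-1", "cluster-1"},
    "esx03": {"chassis-A", "firmware-X", "array-2", "cluster-2"},
}

print(sorted(common_only_to_affected(inventory, {"esx01", "esx02"})))
# prints ['array-1', 'cluster-1']
```

In this made-up example the chassis and firmware drop out because the healthy host shares them, which leaves the storage and the cluster membership as the candidates to investigate.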
A cluster issue seems highly unlikely, since it wouldn't create a PSOD. At least, I can't imagine any cluster action that would cause a PSOD on all hosts at the same time.
If you have a core dump from the host, I suggest creating a support case and uploading the logs.
I have seen this before when an FC HBA mezzanine was going bad but was not dead yet. Please review /var/log/messages from the impacted nodes and look for HBA-related errors. You may see 3 distinct sets of error messages, but they all say the same thing: the HBA is going bad, but not badly enough for iLO to pick it up.
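If you have copies of those log files, something like the following Python sketch can sort the HBA-related lines into groups so the distinct sets of messages are easier to compare across nodes. The search terms are my assumptions (the `vmhba` adapter naming plus the `lpfc` and `qlnativefc` FC driver names); adjust them to whatever HBA and driver your blades actually use. The function name is illustrative only.

```python
from typing import Dict, Iterable, List, Sequence

# Assumed search terms; replace with the adapter/driver names in your hosts.
DEFAULT_TERMS = ("vmhba", "lpfc", "qlnativefc", "fibre")


def group_matches(
    lines: Iterable[str], terms: Sequence[str] = DEFAULT_TERMS
) -> Dict[str, List[str]]:
    """Group log lines by the first search term they contain (case-insensitive)."""
    lowered = [term.lower() for term in terms]
    groups: Dict[str, List[str]] = {}
    for raw in lines:
        text = raw.lower()
        for term in lowered:
            if term in text:
                groups.setdefault(term, []).append(raw.rstrip("\n"))
                break  # count each line once, under the first matching term
    return groups
```

Running `group_matches` over the log from each impacted node and comparing the groups should show whether the same adapter keeps appearing in the errors.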