As the title says, I am encountering the VMware pink screen (PSOD) error. I run vSphere 6.7 U3 and use Samsung NVMe SSD disks in my servers. I see this error on both my Dell and IBM servers. I have been searching for a solution for a few months without result. I don't know what else to do, but I think it is a software issue. Sometimes I hit a pink screen every 3 months, sometimes every 2 weeks. I am attaching a screenshot I captured at the time of the error. Could you please help me solve this problem?
What I have tried so far:
1-) I applied all available firmware updates for my Dell and IBM servers. All software is up to date.
2-) I applied the latest updates released for VMware ESXi 6.7.0 Update 3 (Build 17700523). (The version in the screenshot may be different; I captured it before updating. The same problem continues.)
3-) I replaced the NVMe SSD disks and converter cards, and tried different brands for the converters.
4-) I checked the processor and RAM; both are in good condition.
None of this was enough, so I need your help. Thanks.
Your case is very particular, since two different server brands (IBM and Dell) show the same issue with the same ESXi version and build installed.
I understand that the NVMe SSD disks and converter cards are the devices from which ESXi is installed.
Keep in mind that NVMe SSDs can be sold under different vendor brands (Dell, Lenovo), while the actual manufacturer may be the same, e.g. Samsung, Kingston, and others.
In my experience, I would test installing the same version on a USB key and configuring the server to boot from it.
You could also use an internal hard disk (if there is a free bay) to install ESXi and boot the server from that disk.
Firstly, thank you for your reply. The system does not boot directly from the NVMe SSDs. VMware ESXi is installed on a 500 GB SATA SSD that holds no virtual machines; it is only there so the system can boot. The NVMe SSDs are defined as a second datastore, and my virtual servers run there. I have not tried installing to a USB disk; since the server cannot boot from the NVMe disks alone, ESXi lives on the SATA SSD and the NVMe disks are added as datastore2. The NVMe SSD model I use is the Samsung 970 EVO Plus.
I'm attaching an image to the answer for more details.
I attach the following link that may help you to understand the problem.
It's really strange that the same thing is happening on two different manufacturers' servers (IBM and DELL).
LINT1/NMI messages are always related to hardware problems.
Has any hardware been installed on these servers lately, such as additional memory, cards, or other devices?
Have there been events such as a power failure or a significant temperature increase at the server site?
It would also be worth checking the electrical outlets the servers are connected to.
When an NMI error occurs it is recorded in the server hardware log; did you check those logs?
Is the server you indicated an IBM or a Lenovo? If it is IBM, it must be quite old.
Did you ever run a DSA (Dynamic System Analysis) on the IBM server? If so, you can re-run it and attach the results.
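If the servers' BMCs (iDRAC on the Dell side, IMM on the IBM side) are reachable over the network, you can pull that hardware event log with ipmitool without rebooting into diagnostics. This is only a sketch; the BMC address and credentials below are placeholders you need to fill in:

```shell
# Query the System Event Log (SEL) on the BMC.
# <bmc-ip>, <user>, <password> are placeholders for your iDRAC/IMM details.
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sel elist

# Narrow it down to NMI / PCI-related entries around the PSOD timestamp:
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sel elist \
  | grep -iE 'nmi|pci|bus'
```

If the SEL shows a PCI/NMI event at the same timestamp as the PSOD, that points at a specific slot or device rather than at ESXi.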
I had previously reviewed https://kb.vmware.com/s/article/1804. I checked the CPU, RAM, and the other hardware mentioned in the article, as you suggested, and ran a hardware diagnostic test, but none of my servers showed any problems.

I was using SATA SSD disks on these servers before and never saw a PSOD in that time; the problems started after upgrading to NVMe SSDs. If I had seen this on only 1 or 2 servers, I could believe the problem was hardware as you suggest, but I recently switched to NVMe SSDs on 6 of my servers, 3 Dell and 3 IBM, and I started seeing PSODs on all 6. So I can only conclude that the problem is software.

To rule out electrical problems, the servers run with redundant power from 2 separate UPS sources; even if one UPS fails, the servers draw power from the other source. The data center hosting the servers says there is no problem with the ambient cooling and the room is at the ideal temperature.

I downloaded the coredump files from the 2 servers that last showed a PSOD. The error lines are almost identical. I will share these logs below my message, as I think they may help us understand the problem better. Friends of mine who use the same brand of NVMe SSDs say they have no problems, and the PCIe converters we use are the same. In that case, is there a known bug like this in VMware 6.7?
This has become a problem I can't get out of, but I hope we can find a solution.
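For the coredump side, you can check from the ESXi shell what the host captured; the commands below are a sketch, and the vmkernel log path is the default location on 6.7:

```shell
# Where are dumps being written on this host?
esxcli system coredump partition get
esxcli system coredump file list

# The last PSOD backtrace is also summarized in the vmkernel log;
# look for the NMI/LINT1 lines around the crash time:
grep -iE 'nmi|lint1' /var/run/log/vmkernel.log
```

Posting the matching lines from two different servers side by side would make it easier to compare the backtraces.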
You don't mention the server models. But anyway, as e_espinel mentioned, an IBM-branded server would be an old server (Lenovo acquired IBM's x86 server business in 2014). With old servers, there is also a chance the motherboard chipset still uses PCIe 2.0 (not PCIe 3.0), and the design likely predates NVMe storage becoming common. With PCIe 2.0, there may simply not be enough bandwidth to handle the NVMe throughput.
I guess you have too much invested in terms of time, money, reputation, promises/expectation with the NVMe storage that you refuse to see it as a hardware problem.
Another way to look at it is that the hardware problem is with the converter cards (again, you don't mention any brand/model for the cards). There is a reason ESXi has an HCL: so that off-the-shelf converter cards are not used and then cause PSODs. If you don't find the converter cards you used in the HCL, there is no point in trying to pin a LINT1/NMI PSOD on software.
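To put numbers on the PCIe 2.0 bandwidth point, here is a back-of-the-envelope sketch. The per-lane rates are the standard PCIe figures after encoding overhead, and 3500 MB/s is Samsung's rated sequential read for the 970 EVO Plus:

```shell
#!/bin/sh
# Rough usable bandwidth per PCIe generation for an x4 NVMe device.
lanes=4

# PCIe 2.0: 5 GT/s per lane, 8b/10b encoding -> ~500 MB/s usable per lane
pcie2_lane=500
echo "PCIe 2.0 x${lanes}: $((lanes * pcie2_lane)) MB/s"

# PCIe 3.0: 8 GT/s per lane, 128b/130b encoding -> ~985 MB/s usable per lane
pcie3_lane=985
echo "PCIe 3.0 x${lanes}: $((lanes * pcie3_lane)) MB/s"

# Samsung 970 EVO Plus rated sequential read: ~3500 MB/s
echo "970 EVO Plus rated seq read: 3500 MB/s"
```

So a PCIe 2.0 x4 link tops out around 2000 MB/s, well below what the drive is built to push; on a PCIe 3.0 x4 link (~3940 MB/s) the drive fits comfortably.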
As you mentioned, I have made a big investment in NVMe SSDs; I have approximately 60 more NVMe disks still unopened in the box. For now I will not upgrade more servers until I solve the current issue. I have been using NVMe disks for more than a year, and I am just trying to understand the problem. Some NVMe disks failed before, but in those cases the system did not PSOD; the disks simply became completely inoperable. We swapped the disks and the system kept running. When a PSOD occurs, a restart fixes it. The brand/model information for the servers and the components I use is as follows.
3x IBM/Lenovo x3650 M4 servers
3x Dell R720/R720xd servers
2x Samsung 970 EVO Plus 2TB NVMe SSDs per server
Thermally supported NVMe converters with Bigboy and Akasa heatsinks
The converters are connected to a PCIe 3.0 slot on the riser card. I haven't had any problems with performance; I see really good results when I run read and write tests. Only an untimely PSOD ruins things. At first I thought it was hardware-related, as you mentioned. I changed disks, riser boards, and converters, but found no solution. The HCL includes the Samsung PM series, but I don't know whether it covers the EVO models. I don't see PSODs all the time; sometimes the system runs uninterrupted for 2 weeks, sometimes for 6 months, but eventually a PSOD occurs, and it happens on every server where I have installed NVMe SSDs.
So I am looking for a way to fix this problem for good.
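One way to settle the HCL question is to look up the exact PCI vendor/device ID of the NVMe controller instead of the retail model name, since the compatibility guide is keyed on those IDs. A sketch from the ESXi shell (the grep context values are approximate, adjust as needed):

```shell
# List PCI devices and pick out the NVMe controller block(s)
esxcli hardware pci list | grep -i -B4 -A12 nvme
```

Take the Vendor ID / Device ID fields from the output (0x144d is Samsung's vendor ID) and search the VMware Compatibility Guide with those values; that tells you definitively whether this controller silicon, EVO branding or not, was ever certified for 6.7.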
The concept of the Non-Maskable Interrupt (NMI) has been around as long as microprocessors have. Going back to the relatively simple 40-pin 8086 CPU,
you can see that pin 17 is the NMI pin; that is where the signal is sent to the CPU. (Life was much simpler then compared to the thousands of pins on current CPUs.)
An NMI signal tells the CPU to stop whatever it is doing and attend to an external event. It is a trigger from outside of CPU execution, so it won't be software (i.e. instructions currently being executed by the CPU).
If it was software at fault (whether ESXi kernel or a device driver or VM currently running), the PSOD would have shown PF Exception instead of NMI.
I don't know whether it is the NVMe SSD itself at fault or the converter cards that you use. I assume when you say converter cards it is the card where the NVMe devices are mounted and the card itself is placed into a PCIe slot.
Anyway, the VMware HCL has a pretty long list of NVMe devices; some appear to be add-in cards with a built-in SSD. With servers this old, I doubt the motherboards can do PCIe bifurcation, so you would need a card that can bifurcate the PCIe lanes itself (i.e. split a x16 PCIe slot into four separate x4 links, one per NVMe device). That is why it is possible the card is the problem. It could also be both the card and the NVMe device. You probably need to find an NVMe PCIe add-in card that is supported under ESXi 6.7 U3.
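If you can boot one of these servers into a Linux live environment, you can also check what link each NVMe device actually negotiated, which exposes both a PCIe 2.0 slot and a bad riser. A sketch (the PCI address is a placeholder; use whatever the first command reports):

```shell
# Find the NVMe controllers and their PCI addresses
lspci | grep -i nvme

# Check the link capability vs. the negotiated state for one device.
# LnkCap = what the device supports, LnkSta = what it is actually running at.
sudo lspci -vvv -s 01:00.0 | grep -E 'LnkCap|LnkSta'
```

If LnkSta reports 5 GT/s where LnkCap says 8 GT/s, or a narrower width than x4, the slot or riser is the bottleneck, and a marginal link like that is exactly the kind of thing that can surface as an intermittent NMI.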