mcym
Contributor

LINT1 motherboard interrupt PSOD

Hello everyone,
As mentioned in the title, I am encountering the VMware pink screen (PSOD) error. I am running vSphere 6.7 U3 and using Samsung NVMe SSDs in my servers. I see this error on both my Dell and IBM servers. I have been searching for a solution for a few months, but without result. I don't know what else to do, but I think it is a software issue. Sometimes the pink screen appears every 3 months, sometimes every 2 weeks. I am attaching a screenshot captured at the time of the error. Could you please help me solve this problem?
What I have tried so far:

1) I applied all available firmware updates for my Dell and IBM servers. All software is up to date.

2) I applied the latest updates released for VMware ESXi 6.7.0 Update 3 (Build 17700523). (The version in the screenshot may differ; I captured it before updating. The same problem continues.)

3) I replaced the NVMe SSDs and the converter cards. I tried different brands of converters.

4) I checked the processor and RAM. Both are in good condition.

None of this was enough, so I need your help. Thanks.

psod.jpg

mcym
Contributor

Hello,
Is there anyone who can help?

e_espinel
Virtuoso

Hello.
Your case is very particular, since two different server brands (IBM and Dell) with the same ESXi version and build show the same problem.
I understand that the NVMe SSDs and converter cards are the devices from which ESXi is installed.
Keep in mind that NVMe SSDs can come from different vendors (Dell, Lenovo) while the actual manufacturer is the same, e.g. Samsung, Kingston and others.
In my experience, I would try installing the same version on a USB key and configuring the server to boot from the USB key.

You could also use an internal HDD (if there is a free bay) to install ESXi and boot the server from that disk.

 

Enrique Espinel
Senior Technical Support IBM, Lenovo, Veeam Backup and VMware vSphere.
VSP-SV, VTSP-SV, VTSP-HCI, VTSP
Please mark my comment as Correct Answer or assign Kudos if my answer was helpful to you, Thank you.
mcym
Contributor

Firstly, thank you for your reply. The system does not boot directly from the NVMe SSDs. VMware ESXi is installed on a 500 GB SATA SSD, and there are no virtual machines on that SATA SSD; it is only used to boot the system. The NVMe SSDs are defined as a second datastore, and my virtual servers run there. I did not try installing to a USB disk, since the server cannot boot from the NVMe SSDs anyway; ESXi lives on the SATA SSD and the NVMe disks are added as datastore2. The NVMe SSDs are Samsung 970 EVO Plus.

I'm attaching an image to the answer for more details.
Thanks.

data.png

e_espinel
Virtuoso

Hello.
I am attaching a link that may help you understand the problem.

https://kb.vmware.com/s/article/1804


It's really strange that the same thing is happening on two different manufacturers' servers (IBM and DELL).

LINT1/NMI messages are always related to hardware problems.

Has any hardware been installed on these servers recently, such as additional memory, cards or other devices?

Have there been any events such as a power failure or a significant temperature increase at the server site?

It would also be worth checking the electrical outlets where the servers are connected.

When an NMI error occurs, it is recorded in the server hardware log. Did you check these logs?

You said one of the servers is IBM or Lenovo; if it is IBM, it must be old.
Have you ever run a DSA on the IBM server? If so, please re-run it and attach the result.

 

Enrique Espinel
Senior Technical Support IBM, Lenovo, Veeam Backup and VMware vSphere.
VSP-SV, VTSP-SV, VTSP-HCI, VTSP
Please mark my comment as Correct Answer or assign Kudos if my answer was helpful to you, Thank you.
mcym
Contributor

Hello,
I had previously reviewed the https://kb.vmware.com/s/article/1804 article. I checked the CPU, RAM and the other hardware mentioned in the article and in your reply. I ran a hardware diagnostic test, but none of my servers showed any problems.

I was using SATA SSDs on these servers before and never saw a PSOD in that time. The problems started after upgrading to NVMe SSDs. If I had seen this on only 1 or 2 servers, I could believe the problem is hardware, as you suggest. But I recently switched to NVMe SSDs on 6 of my servers, 3 Dell and 3 IBM, and I started seeing the PSOD on all 6. So I cannot help thinking the problem is software.

To rule out electrical problems, the servers run with redundant dual power supplies fed from 2 separate UPS sources; even if one UPS fails, the servers receive power from the other source. The data center hosting the servers says there is no problem with the ambient cooling and the temperature is ideal.

I downloaded the coredump files from the 2 servers that most recently showed a PSOD. The error lines are almost identical, so I will share these logs below; I think they may help us understand the problem better. Friends of mine who use the same brand of NVMe SSDs say they have no problems, and the PCIe converters we use are the same. In this case, is there a known bug like this in VMware 6.7?
This has become a problem I can't get out of, but I hope we can find a solution.
Kind regards.

2022-01-25T06:20:54.638Z cpu36:2098011)DVFilter: 6068: Checking disconnected filters for timeouts
2022-01-25T06:21:55.882Z cpu24:2098057)ScsiDeviceIO: 3469: Cmd(0x45bbd9b0efc0) 0x1a, CmdSN 0x18fc79 from world 0 to dev "naa.6c81f660d4ab8c00282c473705d26bb1" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-01-25T06:22:54.639Z cpu39:2098057)ScsiDeviceIO: 3469: Cmd(0x45bacac65d00) 0x1a, CmdSN 0x18fce3 from world 0 to dev "naa.6c81f660d4ab8c00282c473705d26bb1" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-01-25T06:24:07.777Z cpu16:2098056)ScsiDeviceIO: 3469: Cmd(0x459c6ad8fac0) 0x1a, CmdSN 0x18fd47 from world 0 to dev "naa.6c81f660d4ab8c00282c473705d26bb1" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-01-25T06:25:00.674Z cpu27:2098057)ScsiDeviceIO: 3469: Cmd(0x45bacad245c0) 0x1a, CmdSN 0x18fdab from world 0 to dev "naa.6c81f660d4ab8c00282c473705d26bb1" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-01-25T06:26:24.631Z cpu24:2098057)ScsiDeviceIO: 3469: Cmd(0x45bbd9aaaac0) 0x1a, CmdSN 0x18fe1e from world 0 to dev "naa.6c81f660d4ab8c00282c473705d26bb1" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-01-25T06:27:33.964Z cpu2:2098056)ScsiDeviceIO: 3469: Cmd(0x459c6ace4580) 0x1a, CmdSN 0x18fe84 from world 0 to dev "naa.6c81f660d4ab8c00282c473705d26bb1" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-01-25T06:28:42.330Z cpu30:2098057)ScsiDeviceIO: 3469: Cmd(0x45bad83a4400) 0x1a, CmdSN 0x18feea from world 0 to dev "naa.6c81f660d4ab8c00282c473705d26bb1" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-01-25T06:29:39.461Z cpu27:2098057)ScsiDeviceIO: 3469: Cmd(0x45bbd9a393c0) 0x1a, CmdSN 0x18ff50 from world 0 to dev "naa.6c81f660d4ab8c00282c473705d26bb1" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-01-25T06:30:51.080Z cpu44:2098057)ScsiDeviceIO: 3469: Cmd(0x45bb51555240) 0x1a, CmdSN 0x18ffb8 from world 0 to dev "naa.6c81f660d4ab8c00282c473705d26bb1" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-01-25T06:30:54.651Z cpu37:2098011)DVFilter: 6068: Checking disconnected filters for timeouts
2022-01-25T06:31:16.703Z cpu0:2104918)WARNING: NMI: 819: NMI received; attempting to diagnose...
2022-01-25T06:31:16.703Z cpu0:2104918)ApeiHEST: 278: Invoked HestNMIHandler
2022-01-25T06:31:16.703Z cpu0:2104918)World: 3015: PRDA 0x418040000000 ss 0xfd0 ds 0xfd0 es 0xfd0 fs 0x0 gs 0x0
2022-01-25T06:31:16.703Z cpu0:2104918)World: 3017: TR 0xfb8 GDT 0xfffffffffc409000 (0xffff) IDT 0xfffffffffc408000 (0xffff)
2022-01-25T06:31:16.703Z cpu0:2104918)World: 3018: CR0 0x80050033 CR3 0x21597bd000 CR4 0x142660
2022-01-25T06:31:16.741Z cpu0:2104918)Backtrace for current CPU #0, worldID=2104918, fp=0x417fd1db9bc0
2022-01-25T06:31:16.741Z cpu0:2104918)0x4509c0002c60:[0x418011b0c0e5]PanicvPanicInt@vmkernel#nover+0x439 stack: 0x180000005, 0x418011e99898, 0x4509c0002d08, 0x0, 0x36313a00000001
2022-01-25T06:31:16.741Z cpu0:2104918)0x4509c0002d00:[0x418011b0c318]Panic_NoSave@vmkernel#nover+0x4d stack: 0x4509c0002d60, 0x4509c0002d20, 0xe, 0x1, 0x5d210
2022-01-25T06:31:16.741Z cpu0:2104918)0x4509c0002d60:[0x418011b08d9d]NMICheckLint1@vmkernel#nover+0x196 stack: 0x0, 0x0, 0x0, 0x0, 0x4509c0002f40
2022-01-25T06:31:16.741Z cpu0:2104918)0x4509c0002e20:[0x418011b08e52]NMI_Interrupt@vmkernel#nover+0xb3 stack: 0x0, 0x0, 0x0, 0x0, 0x0
2022-01-25T06:31:16.741Z cpu0:2104918)0x4509c0002ea0:[0x418011b455dc]IDTNMIWork@vmkernel#nover+0x99 stack: 0x0, 0x0, 0xffffffffffffffef, 0xfffffffffc07b9fe, 0xfc8
2022-01-25T06:31:16.741Z cpu0:2104918)0x4509c0002f20:[0x418011b46ad0]Int2_NMI@vmkernel#nover+0x19 stack: 0x0, 0x418011b63067, 0x0, 0x0, 0x0
2022-01-25T06:31:16.741Z cpu0:2104918)0x4509c0002f40:[0x418011b63066]gate_entry@vmkernel#nover+0x67 stack: 0x0, 0xffffffff, 0x0, 0xffffffff, 0x430739582a10
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11bb20:[0x418011ce76e4]MSIXMaskVector@vmkernel#nover+0x4c stack: 0x0, 0x418011aef796, 0x1500, 0x418011af094c, 0x10000007c
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11bb40:[0x418011aef795]IntrCookieMaskInt@vmkernel#nover+0x36 stack: 0x10000007c, 0x1, 0x0, 0x7c00000000, 0x0
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11bb50:[0x418011af094b]IntrCookie_SetDestination@vmkernel#nover+0xec stack: 0x0, 0x7c00000000, 0x0, 0x43010d21adc0, 0x4
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11bbb0:[0x418011af438d]ITIntrMove@vmkernel#nover+0x1e stack: 0x451a4ea23340, 0x4, 0x4, 0x418011af4f63, 0x6300000000
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11bbe0:[0x418011af4f62]IT_IntrUserControlMove@vmkernel#nover+0x87 stack: 0x451a67c23900, 0x0, 0x451a4ea23340, 0x451a4ea23340, 0x0
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11bc20:[0x418011d21198]CpuSched_SchedContextIntrUserControlMove@vmkernel#nover+0x39 stack: 0x77e63ba38f4fb, 0x0, 0x451a4ea23100, 0x418011d09f0b, 0x0
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11bc50:[0x418011d09f0a]CpuSched_VcpuMigrate@vmkernel#nover+0xf3 stack: 0xba38f4fb, 0x451a4ea23780, 0x77e63ba38f4fb, 0x451a8d11bd70, 0x451a8d11bdb0
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11bcc0:[0x418011d08210]CpuSchedVcpuMakeReady@vmkernel#nover+0x1a9 stack: 0x451a4ea23780, 0x77e63ba38f4fb, 0x451a8d11bd70, 0x418011d082f3, 0x0
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11bcf0:[0x418011d082f2]CpuSchedWorldWakeup@vmkernel#nover+0x8b stack: 0x451a4ea23100, 0x451a4ea23000, 0x0, 0x451a4ea232c0, 0x451a8d11bd70
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11bd30:[0x418011d0857e]CpuSchedWakeupWorldList@vmkernel#nover+0x8b stack: 0x77e63ba38f4fb, 0x451a8d11bdb0, 0x1, 0x418011b1605a, 0x0
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11bdb0:[0x418011d0866c]CpuSchedWakeupCount@vmkernel#nover+0x25 stack: 0x430c3596e558, 0x418011d0a9d1, 0x4501d8cca800, 0x418011b29d8d, 0x280
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11bdd0:[0x418011d0a9d0]CpuSched_Wakeup@vmkernel#nover+0x19 stack: 0x280, 0x41801244a676, 0x0, 0x418011acd776, 0x0
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11bde0:[0x418011b29d8c]vmk_WorldWakeup@vmkernel#nover+0x9 stack: 0x0, 0x418011acd776, 0x0, 0x418011d4ac98, 0x418040007510
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11bdf0:[0x41801244a675]nvmeCoreProcessCq@(nvme)#<None>+0x1e6 stack: 0x0, 0x418011d4ac98, 0x418040007510, 0x418000000000, 0x0
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11be60:[0x41801244be1b]NvmeQueue_IntrHandler@(nvme)#<None>+0x18 stack: 0x451a00000016, 0x430c3598fd80, 0x0, 0x0, 0x451a8d11bea0
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11be70:[0x418011aefaab]IntrCookieBH@vmkernel#nover+0x1e0 stack: 0x0, 0x0, 0x451a8d11bea0, 0x451a00000001, 0x430102202c90
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11bf10:[0x418011acd9f3]BH_DrainAndDisableInterrupts@vmkernel#nover+0x124 stack: 0x0, 0x0, 0x0, 0x4180400004d0, 0xffffffffffffffff
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11bfa0:[0x418011b36cb3]VMMVMKCall_Call@vmkernel#nover+0x13c stack: 0x0, 0x400, 0x0, 0x82, 0x1
2022-01-25T06:31:16.741Z cpu0:2104918)0x451a8d11bfe0:[0x418011b5cecd]VMKVMM_ArchEnterVMKernel@vmkernel#nover+0xe stack: 0x418011b5cec0, 0xfffffffffc008e12, 0x0, 0x0, 0x0
2022-01-25T06:31:16.758Z cpu0:2104918)VMware ESXi 6.7.0 [Releasebuild-17700523 x86_64]
LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed. This may be a hardware problem; please contact your hardware vendor.
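A side note on the repeated ScsiDeviceIO lines above: the sense data `0x5 0x24 0x0` decodes, per the SCSI (SPC) spec, to sense key ILLEGAL REQUEST with additional sense "INVALID FIELD IN CDB", and opcode `0x1a` is MODE SENSE(6). So these look like the device rejecting an unsupported mode page query, not a media error, and are probably noise relative to the NMI itself. A minimal decoding sketch (the lookup tables cover only the codes that appear in this log, not the full SPC tables):

```python
# Decode the "Valid sense data: 0x5 0x24 0x0" triplet from the vmkernel log.
# Tables are intentionally tiny: only the codes seen in this thread's logs.

SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x24, 0x00): "INVALID FIELD IN CDB"}
OPCODES = {0x1A: "MODE SENSE(6)"}

def decode_sense(key, asc, ascq, opcode):
    """Map raw sense bytes and the failed opcode to human-readable names."""
    return {
        "sense_key": SENSE_KEYS.get(key, f"unknown (0x{key:x})"),
        "additional": ASC_ASCQ.get((asc, ascq), f"unknown (0x{asc:x}/0x{ascq:x})"),
        "command": OPCODES.get(opcode, f"opcode 0x{opcode:x}"),
    }

info = decode_sense(0x5, 0x24, 0x0, 0x1A)
print(info["sense_key"], "-", info["additional"], "on", info["command"])
# ILLEGAL REQUEST - INVALID FIELD IN CDB on MODE SENSE(6)
```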
2022-03-19T21:03:16.533Z cpu3:2098005)DVFilter: 6068: Checking disconnected filters for timeouts
2022-03-19T21:03:47.707Z cpu33:2098051)ScsiDeviceIO: 3469: Cmd(0x45bb2b3c4980) 0x1a, CmdSN 0x318632 from world 0 to dev "naa.644a84201a2e7f00298e94600c3a8af6" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-03-19T21:05:01.466Z cpu36:2098051)ScsiDeviceIO: 3469: Cmd(0x45bb2b2985c0) 0x1a, CmdSN 0x3186a0 from world 0 to dev "naa.644a84201a2e7f00298e94600c3a8af6" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-03-19T21:06:14.706Z cpu0:2098050)ScsiDeviceIO: 3469: Cmd(0x459c449fdd40) 0x1a, CmdSN 0x31871e from world 0 to dev "naa.644a84201a2e7f00298e94600c3a8af6" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-03-19T21:07:08.185Z cpu0:2098050)ScsiDeviceIO: 3469: Cmd(0x459df9f2a340) 0x1a, CmdSN 0x318771 from world 0 to dev "naa.644a84201a2e7f00298e94600c3a8af6" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-03-19T21:08:26.259Z cpu24:2098051)ScsiDeviceIO: 3469: Cmd(0x45bb5d4902c0) 0x1a, CmdSN 0x3187e1 from world 0 to dev "naa.644a84201a2e7f00298e94600c3a8af6" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-03-19T21:09:21.997Z cpu35:2098051)ScsiDeviceIO: 3469: Cmd(0x45bbaa50fe80) 0x1a, CmdSN 0x318847 from world 0 to dev "naa.644a84201a2e7f00298e94600c3a8af6" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-03-19T21:10:29.631Z cpu28:2098051)ScsiDeviceIO: 3469: Cmd(0x45bc776dfb00) 0x1a, CmdSN 0x3188af from world 0 to dev "naa.644a84201a2e7f00298e94600c3a8af6" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-03-19T21:11:58.040Z cpu3:2098050)ScsiDeviceIO: 3469: Cmd(0x459b9c6fb500) 0x1a, CmdSN 0x318913 from world 0 to dev "naa.644a84201a2e7f00298e94600c3a8af6" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-03-19T21:13:04.886Z cpu24:2098051)ScsiDeviceIO: 3469: Cmd(0x45bc776f39c0) 0x1a, CmdSN 0x31897b from world 0 to dev "naa.644a84201a2e7f00298e94600c3a8af6" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-03-19T21:13:16.551Z cpu42:2098005)DVFilter: 6068: Checking disconnected filters for timeouts
2022-03-19T21:13:57.096Z cpu18:2098050)ScsiDeviceIO: 3469: Cmd(0x459cf2cb7140) 0x1a, CmdSN 0x3189e9 from world 0 to dev "naa.644a84201a2e7f00298e94600c3a8af6" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-03-19T21:15:14.637Z cpu0:2097420)WARNING: NMI: 819: NMI received; attempting to diagnose...
2022-03-19T21:15:14.637Z cpu0:2097420)ApeiHEST: 278: Invoked HestNMIHandler
2022-03-19T21:15:14.637Z cpu0:2097420)World: 3015: PRDA 0x418040000000 ss 0x0 ds 0xfd0 es 0xfd0 fs 0xfd0 gs 0xfd0
2022-03-19T21:15:14.637Z cpu0:2097420)World: 3017: TR 0xfd8 GDT 0x4509c4200000 (0xfe7) IDT 0x41800c165000 (0xfff)
2022-03-19T21:15:14.637Z cpu0:2097420)World: 3018: CR0 0x8001003d CR3 0x3f87000 CR4 0x10216c
2022-03-19T21:15:14.675Z cpu0:2097420)Backtrace for current CPU #0, worldID=2097420, fp=0x417fcc3b9bc0
2022-03-19T21:15:14.675Z cpu0:2097420)0x4509c0002c60:[0x41800c10c0e5]PanicvPanicInt@vmkernel#nover+0x439 stack: 0x180000001, 0x41800c499898, 0x4509c0002d08, 0x0, 0x34313a00000001
2022-03-19T21:15:14.675Z cpu0:2097420)0x4509c0002d00:[0x41800c10c318]Panic_NoSave@vmkernel#nover+0x4d stack: 0x4509c0002d60, 0x4509c0002d20, 0xe, 0x1, 0xceb91
2022-03-19T21:15:14.675Z cpu0:2097420)0x4509c0002d60:[0x41800c108d9d]NMICheckLint1@vmkernel#nover+0x196 stack: 0x0, 0x0, 0x0, 0x0, 0x4509c0002f40
2022-03-19T21:15:14.675Z cpu0:2097420)0x4509c0002e20:[0x41800c108e52]NMI_Interrupt@vmkernel#nover+0xb3 stack: 0x0, 0x0, 0x0, 0x0, 0x0
2022-03-19T21:15:14.675Z cpu0:2097420)0x4509c0002ea0:[0x41800c1455dc]IDTNMIWork@vmkernel#nover+0x99 stack: 0x0, 0x0, 0x0, 0x0, 0x0
2022-03-19T21:15:14.675Z cpu0:2097420)0x4509c0002f20:[0x41800c146ad0]Int2_NMI@vmkernel#nover+0x19 stack: 0x0, 0x41800c163067, 0xfd0, 0xfd0, 0x0
2022-03-19T21:15:14.675Z cpu0:2097420)0x4509c0002f40:[0x41800c163066]gate_entry@vmkernel#nover+0x67 stack: 0x0, 0x2e123117, 0x34c, 0x10a303, 0x10a25b856c5935
2022-03-19T21:15:14.675Z cpu0:2097420)0x451a4861bee8:[0x41800c11cac6]Timer_GetCycles@vmkernel#nover+0x2 stack: 0x10a25b70b5411e, 0x1, 0x0, 0x73, 0x4301021cb280
2022-03-19T21:15:14.675Z cpu0:2097420)0x451a4861bef0:[0x41800c115ef5]SP_WaitLockIRQ@vmkernel#nover+0xce stack: 0x1, 0x0, 0x73, 0x4301021cb280, 0x4301021cb290
2022-03-19T21:15:14.675Z cpu0:2097420)0x451a4861bf40:[0x41800c116045]SPLockIRQWork@vmkernel#nover+0x3e stack: 0x1600000, 0x41800c0ef111, 0x418040006700, 0x418040000000, 0x10a25b70b53524
2022-03-19T21:15:14.675Z cpu0:2097420)0x451a4861bf60:[0x41800c0ef110]IntrCookieRetireLoop@vmkernel#nover+0x121 stack: 0x10a25b70b53524, 0x0, 0x418040006708, 0x60, 0x3
2022-03-19T21:15:14.675Z cpu0:2097420)0x451a4861bfe0:[0x41800c31106a]CpuSched_StartWorld@vmkernel#nover+0x77 stack: 0x0, 0x0, 0x0, 0x0, 0x0
2022-03-19T21:15:14.692Z cpu0:2097420)VMware ESXi 6.7.0 [Releasebuild-17700523 x86_64]
LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed. This may be a hardware problem; please contact your hardware vendor.
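To compare the two dumps systematically, a quick triage pass over an exported vmkernel log can count the ScsiDeviceIO failures per device and pull out the timestamp of the first NMI. This is a rough sketch, with the regexes tuned only to the line format shown above:

```python
import re
from collections import Counter

# Matches the log format above: "<timestamp> cpuN:world)ScsiDeviceIO: ..."
SCSI_RE = re.compile(r'^(\S+) .*ScsiDeviceIO: .* to dev "([^"]+)" failed')
NMI_RE = re.compile(r'^(\S+) .*NMI: \d+: NMI received')

def triage(log_text):
    """Count ScsiDeviceIO failures per device; return first NMI timestamp."""
    failures = Counter()
    nmi_at = None
    for line in log_text.splitlines():
        m = SCSI_RE.match(line)
        if m:
            failures[m.group(2)] += 1
        m = NMI_RE.match(line)
        if m and nmi_at is None:
            nmi_at = m.group(1)
    return failures, nmi_at

# Two lines lifted from the second dump above, as a self-contained sample:
sample = """\
2022-03-19T21:13:57.096Z cpu18:2098050)ScsiDeviceIO: 3469: Cmd(0x459cf2cb7140) 0x1a, CmdSN 0x3189e9 from world 0 to dev "naa.644a84201a2e7f00298e94600c3a8af6" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2022-03-19T21:15:14.637Z cpu0:2097420)WARNING: NMI: 819: NMI received; attempting to diagnose...
"""
counts, nmi_ts = triage(sample)
print(counts, nmi_ts)
```

Running this over both full coredump logs would show whether the same device (and the same lead-up pattern) precedes the NMI on each server.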
bluefirestorm
Champion

You don't mention the server models. But anyway, as e_espinel mentioned, an IBM-branded server must be an old server (Lenovo's acquisition of IBM's x86 server business was in 2014). With old servers there is also a chance the motherboard chipset still uses PCIe 2.0 (not PCIe 3.0), and those platforms likely predate NVMe storage becoming common. With PCIe 2.0, there may not be enough bandwidth to handle the NVMe throughput anyway.
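The bandwidth point can be made concrete with rough numbers: a PCIe 2.0 lane carries about 500 MB/s of usable payload (5 GT/s with 8b/10b encoding) versus roughly 985 MB/s for PCIe 3.0 (8 GT/s with 128b/130b), so a x4 link tops out around 2 GB/s on gen2. The ~3500 MB/s sequential read figure below is Samsung's spec-sheet number for the 970 EVO Plus, taken as an assumption here. A link being saturated does not by itself cause an NMI, but it illustrates how far this drive class is from what an old platform was designed for:

```python
# Approximate per-lane payload bandwidth in MB/s, after encoding overhead.
PCIE_LANE_MBPS = {2: 500, 3: 985}  # gen2: 5 GT/s @ 8b/10b; gen3: 8 GT/s @ 128b/130b

def link_bandwidth_mbps(gen, lanes):
    """Rough usable bandwidth of a PCIe link (ignores protocol overhead)."""
    return PCIE_LANE_MBPS[gen] * lanes

SSD_SEQ_READ_MBPS = 3500  # Samsung 970 EVO Plus vendor figure (assumption)

for gen in (2, 3):
    bw = link_bandwidth_mbps(gen, 4)
    verdict = "link-limited" if bw < SSD_SEQ_READ_MBPS else "not link-limited"
    print(f"PCIe {gen}.0 x4: ~{bw} MB/s -> {verdict}")
```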

I guess you have too much invested in time, money, reputation, and promises/expectations with the NVMe storage to see it as a hardware problem.

Another way to look at it: the hardware problem may be with the converter cards (again, you don't mention any brand/model for the cards). There is a reason an HCL exists for ESXi: precisely so that off-the-shelf converter cards like these are not used, only to end in PSODs. If you can't find your converter cards in the HCL, there is no point trying to pin a LINT1/NMI PSOD on software.

mcym
Contributor

Hello,
As you mentioned, I made a big investment in NVMe SSDs; I have approximately 60 more NVMe disks still unopened in the box. Right now I am not going to upgrade more servers without solving the current issue. I have been using NVMe disks for more than a year, and I'm just trying to understand the problem. Some NVMe disks failed before, but in those cases the system did not PSOD; the disks simply became completely inoperable, we swapped them, and the system kept running. A PSOD, by contrast, is fixed by restarting the system. The brand/model information of the servers and the components I use is as follows.

3x IBM/Lenovo x3650 M4 servers
3x Dell R720/R720xd servers
2x Samsung 970 EVO Plus 2TB NVMe SSDs per server
NVMe converters with thermal support (Bigboy and Akasa heatsinks)

The converters are connected to a PCIe 3.0 slot on the riser card. I haven't had any performance problems; I see really good results in read and write tests. Just an untimely PSOD is ruining things. At first I thought it was hardware related, as you mentioned. I changed the disks, the riser boards and the converters, but no solution. The HCL includes the Samsung PM series, but I don't know for certain whether it covers the EVO models. I don't see the PSOD all the time; sometimes the system runs uninterrupted for 2 weeks, sometimes for 6 months, but an untimely PSOD can occur, and it happens on every one of my servers with NVMe SSDs installed.
So I'm looking for a way to fix this problem once and for all.
Thank you.

bluefirestorm
Champion

The concept of the Non-Maskable Interrupt (NMI) has been around for as long as microprocessors have. Going back to the relatively simple 40-pin 8086 CPU,

https://en.wikipedia.org/wiki/Intel_8086#/media/File:Intel_8086_pinout.svg

you can see that pin 17 is the NMI pin; that is where the signal is sent to the CPU. (Life was much simpler then, compared to the thousands of pins on current CPUs.)

An NMI signal tells the CPU to stop whatever it is doing and attend to an external event. It is a trigger from outside of CPU execution, so it won't be software (i.e. the instructions currently being executed by the CPU).

If software were at fault (whether the ESXi kernel, a device driver, or a running VM), the PSOD would have shown a PF Exception instead of an NMI.

I don't know whether the NVMe SSD itself is at fault or the converter cards you use. I assume that by "converter card" you mean the card on which the NVMe device is mounted, with the card itself placed into a PCIe slot.

Anyway, the VMware HCL has a pretty long list of NVMe devices; some appear to be add-in cards with built-in SSDs. With servers this old, I doubt the motherboards can do PCIe bifurcation, so you would need a card that can bifurcate the PCIe lanes itself (i.e. so that a x16 PCIe slot can be split into four separate x4 links, one per NVMe device). That is why it is possible the card is the problem; it could also be both the card and the NVMe device. You probably need to find an NVMe PCIe add-in card that is supported under ESXi 6.7 U3.
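The bifurcation constraint can be illustrated with a toy model: a x16 slot on a bifurcation-capable board can be configured as four independent x4 links (one per NVMe drive), while a board without bifurcation enumerates only one endpoint per slot, unless the add-in card carries its own PCIe switch (e.g. a PLX chip). A deliberately simplified sketch of that rule:

```python
def split_slot(slot_lanes, per_device_lanes, board_can_bifurcate):
    """How many x4 NVMe devices a slot can expose, in a simplified model.

    Without motherboard bifurcation (and without a switch chip on the
    card), the slot enumerates only one endpoint regardless of width.
    """
    if slot_lanes % per_device_lanes != 0:
        raise ValueError("device lanes must divide the slot width evenly")
    if not board_can_bifurcate:
        return 1
    return slot_lanes // per_device_lanes

# Older board (no bifurcation) vs. newer board, x16 slot, x4 NVMe drives:
print(split_slot(16, 4, board_can_bifurcate=False))  # 1
print(split_slot(16, 4, board_can_bifurcate=True))   # 4
```

This is why single-drive converters in separate slots often work where a multi-drive carrier card would not, and why switch-based carrier cards cost noticeably more.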

 
