paulbugala
Contributor

ESXi 7.0.0 rebooting at random

Hi, I noticed that every few weeks my ESXi host reboots at random, and every time before it happens vmkwarning.log records the entries below.

Seems to me like a memory leak.

I've deployed 2 servers with ESXi 7.0 on the exact same hardware, but only this one is giving me grief.

Both are running the same hardware, the same BIOS version and release.

It's an HPE ProLiant DL20 Gen10 with an HPE Smart Array E208i-a SR Gen10 array controller.

Server #1 has been running fine since deployment (45 days), while this one reboots every week or so.

The last reboot was today, 12 Oct; the previous one was on 5 Oct.

Could it be bad RAM, or is ESXi leaking memory?

Build is 16324942

Any help is appreciated.

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 259: Failed to add Non-PF mem (0x4000006000 - 0x4000006fff): Limit exceeded

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 320: 0000:00:12.0: Failed to add BAR[0] (MEM64 f=0x4 0x4000006000-0x4000007000) status: Limit exceeded, parent: \_SB_.PC00

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 184: Failed to remove Non-PF mem (0x4000006000 - 0x4000006fff): Limit exceeded

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 453: 0000:00:12.0: Unable to free BAR[0] (MEM64 f=0x4 0x4000006000-0x4000007000): Limit exceeded

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 259: Failed to add Non-PF mem (0x4000000000 - 0x4000001fff): Limit exceeded

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 320: 0000:00:14.2: Failed to add BAR[0] (MEM64 f=0x4 0x4000000000-0x4000002000) status: Limit exceeded, parent: \_SB_.PC00

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 184: Failed to remove Non-PF mem (0x4000000000 - 0x4000001fff): Limit exceeded

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 453: 0000:00:14.2: Unable to free BAR[0] (MEM64 f=0x4 0x4000000000-0x4000002000): Limit exceeded

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 184: Failed to remove Non-PF mem (0x4000005000 - 0x4000005fff): Limit exceeded

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 453: 0000:00:14.2: Unable to free BAR[2] (MEM64 f=0x4 0x4000005000-0x4000006000): Limit exceeded

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 259: Failed to add Non-PF mem (0x4000004000 - 0x4000004fff): Limit exceeded

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 320: 0000:00:16.0: Failed to add BAR[0] (MEM64 f=0x4 0x4000004000-0x4000005000) status: Limit exceeded, parent: \_SB_.PC00

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 184: Failed to remove Non-PF mem (0x4000004000 - 0x4000004fff): Limit exceeded

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 453: 0000:00:16.0: Unable to free BAR[0] (MEM64 f=0x4 0x4000004000-0x4000005000): Limit exceeded

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 259: Failed to add Non-PF mem (0x4000003000 - 0x4000003fff): Limit exceeded

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 320: 0000:00:16.4: Failed to add BAR[0] (MEM64 f=0x4 0x4000003000-0x4000004000) status: Limit exceeded, parent: \_SB_.PC00

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 184: Failed to remove Non-PF mem (0x4000003000 - 0x4000003fff): Limit exceeded

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 453: 0000:00:16.4: Unable to free BAR[0] (MEM64 f=0x4 0x4000003000-0x4000004000): Limit exceeded

2020-10-12T05:38:48.020Z cpu0:524288)WARNING: PCI: 184: Failed to remove Non-PF mem (0x4000006000 - 0x4000006fff): Limit exceeded

2020-10-12T05:38:48.020Z cpu0:524288)WARNING: PCI: 453: 0000:00:12.0: Unable to free BAR[0] (MEM64 f=0x4 0x4000006000-0x4000007000): Limit exceeded

2020-10-12T05:38:48.020Z cpu0:524288)WARNING: PCI: 184: Failed to remove Non-PF mem (0x4000000000 - 0x4000001fff): Limit exceeded

2020-10-12T05:38:48.020Z cpu0:524288)WARNING: PCI: 453: 0000:00:14.2: Unable to free BAR[0] (MEM64 f=0x4 0x4000000000-0x4000002000): Limit exceeded

2020-10-12T05:38:48.020Z cpu0:524288)WARNING: PCI: 184: Failed to remove Non-PF mem (0x4000005000 - 0x4000005fff): Limit exceeded

2020-10-12T05:38:48.020Z cpu0:524288)WARNING: PCI: 453: 0000:00:14.2: Unable to free BAR[2] (MEM64 f=0x4 0x4000005000-0x4000006000): Limit exceeded

2020-10-12T05:38:48.020Z cpu0:524288)WARNING: PCI: 184: Failed to remove Non-PF mem (0x4000004000 - 0x4000004fff): Limit exceeded

2020-10-12T05:38:48.020Z cpu0:524288)WARNING: PCI: 453: 0000:00:16.0: Unable to free BAR[0] (MEM64 f=0x4 0x4000004000-0x4000005000): Limit exceeded

2020-10-12T05:38:48.020Z cpu0:524288)WARNING: PCI: 184: Failed to remove Non-PF mem (0x4000003000 - 0x4000003fff): Limit exceeded

2020-10-12T05:38:48.020Z cpu0:524288)WARNING: PCI: 453: 0000:00:16.4: Unable to free BAR[0] (MEM64 f=0x4 0x4000003000-0x4000004000): Limit exceeded

2020-10-12T05:38:48.021Z cpu0:524288)WARNING: PCI: 239: 0000:00:1f.5: BAR[0] (MEM f=0x0 0xfe010000-0xfe011000) registration failed (Bad address range)

2020-10-12T05:38:54.896Z cpu6:524775)WARNING: APEI: 306: Could not initialize EINJ

2020-10-12T05:38:56.352Z cpu6:524835)WARNING: WARN: smartpqi: pqisrc_display_device_info:248: added scsi BTL 1:0:1:  HPE      LOGICAL VOLUME   RAID 1(1+0)  SSDSmartPathCap- En- Exp+ qd=0

2020-10-12T05:38:56.352Z cpu6:524835)WARNING: WARN: smartpqi: pqisrc_display_device_info:248: added scsi BTL 2:1088:1:  HPE      E208i-a SR Gen10 RAID 0       SSDSmartPathCap- En- Exp+ qd=1014

2020-10-12T05:38:56.446Z cpu10:524692)WARNING: etherswitch: PortCfg_ModInit:1078: Skipped initializing etherswitch portcfg for VSS to use cswitch and portcfg module

2020-10-12T05:38:57.808Z cpu0:524692)WARNING: FBFT not enabled

2020-10-12T05:39:01.957Z cpu9:524692)WARNING: NMP: nmpPathClaimEnd:1393: All Helper Completed registering device 2

bluefirestorm
Virtuoso

The log entries look to be related to PCIe Base Address Registers (BARs). If the problem is with PCI BARs, it is unlikely to be bad RAM.

I don't think it is a memory leak. My guess is that the "Unable to free" messages give that impression: the host is unable to free memory ranges that it failed to add in the first place.

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 320: 0000:00:12.0: Failed to add BAR[0] (MEM64 f=0x4 0x4000006000-0x4000007000) status: Limit exceeded, parent: \_SB_.PC00

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 184: Failed to remove Non-PF mem (0x4000006000 - 0x4000006fff): Limit exceeded

2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 453: 0000:00:12.0: Unable to free BAR[0] (MEM64 f=0x4 0x4000006000-0x4000007000): Limit exceeded
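
To check this reading against a saved copy of the log, a rough Python sketch like the one below groups the per-device BAR warnings by device and address range. The regular expression is inferred only from the lines quoted in this thread, and the names `summarize` and `frees_without_add` are made up for illustration, so treat it as an assumption rather than a description of the vmkwarning.log format.

```python
import re
from collections import defaultdict

# Pattern inferred from the warnings quoted in this thread (an assumption,
# not a documented log format). It matches only the per-device BAR lines.
BAR_RE = re.compile(
    r"WARNING: PCI: \d+: (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"(?P<action>Failed to add|Unable to free) BAR\[(?P<bar>\d+)\] "
    r"\(\w+ f=0x[0-9a-f]+ (?P<start>0x[0-9a-f]+)-(?P<end>0x[0-9a-f]+)\)"
)

def summarize(lines):
    """Map (device, BAR index, start, end) to the list of actions logged for it."""
    events = defaultdict(list)
    for line in lines:
        m = BAR_RE.search(line)
        if m:
            key = (m["dev"], int(m["bar"]), m["start"], m["end"])
            events[key].append(m["action"])
    return dict(events)

def frees_without_add(events):
    """Return BARs that logged 'Unable to free' but never 'Failed to add'."""
    return sorted(k for k, actions in events.items() if "Failed to add" not in actions)

# A few lines copied from the original post.
SAMPLE = [
    r"2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 320: 0000:00:12.0: Failed to add BAR[0] (MEM64 f=0x4 0x4000006000-0x4000007000) status: Limit exceeded, parent: \_SB_.PC00",
    r"2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 184: Failed to remove Non-PF mem (0x4000006000 - 0x4000006fff): Limit exceeded",
    r"2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 453: 0000:00:12.0: Unable to free BAR[0] (MEM64 f=0x4 0x4000006000-0x4000007000): Limit exceeded",
    r"2020-10-12T05:38:48.017Z cpu0:524288)WARNING: PCI: 453: 0000:00:14.2: Unable to free BAR[2] (MEM64 f=0x4 0x4000005000-0x4000006000): Limit exceeded",
]

if __name__ == "__main__":
    for key, actions in summarize(SAMPLE).items():
        print(key, actions)
```

Running it over the full log should show whether every "Unable to free" is paired with an earlier "Failed to add" for the same BAR. In the lines posted above, 0000:00:14.2 BAR[2] appears only in "Unable to free" messages, which may be worth mentioning to HPE.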

Does the host that is causing grief have any VMs with PCIe passthrough configured? Perhaps someone else is trying PCIe device passthrough in a VM, knowingly or unknowingly crashed the host while booting that VM, and is just keeping quiet about it.

If you have VMs with PCIe passthrough, make sure the host boots with UEFI and has MMIO above 4GB enabled. The VMs also need to use virtual EFI firmware and have 64-bit MMIO enabled.

See https://kb.vmware.com/s/article/2142307 for reference.

paulbugala
Contributor

None of the VMs have PCIe passthrough. They are a DC and an MDT\WSUS VM, and no one uses those VMs except me.

Sounds like an HPE case to me then; let's see what they say.

bluefirestorm
Virtuoso

All the address ranges (e.g. 0x4000006000 - 0x4000006fff, 0x4000000000 - 0x4000001fff, etc.) except one are at or above the 256GB mark of the address space. Only one (0xfe010000-0xfe011000) is below 4GB, and that range is usually part of the PCI hole (which starts anywhere from 3.5GB to 3.75GB and ends just under 4GB, depending on the machine).
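
For anyone who wants to verify the arithmetic, here is a small Python sketch that converts a few of the BAR ranges from the log into GiB offsets and sizes. The `describe` helper is just an illustrative name, and it assumes the end address in the BAR[...] lines is exclusive, which is how it appears next to the matching "Non-PF mem" lines.

```python
GIB = 1 << 30

# BAR ranges copied from the log in the original post.
ranges = {
    "0000:00:12.0 BAR[0]": (0x4000006000, 0x4000007000),
    "0000:00:14.2 BAR[0]": (0x4000000000, 0x4000002000),
    "0000:00:1f.5 BAR[0]": (0xfe010000, 0xfe011000),
}

def describe(start, end):
    """Return (start offset in GiB, size in bytes), treating end as exclusive."""
    return start / GIB, end - start

if __name__ == "__main__":
    for name, (start, end) in ranges.items():
        gib, size = describe(start, end)
        print(f"{name}: starts at {gib:.3f} GiB, size {size} bytes")
```

0x4000000000 is 2^38, which is exactly 256 GiB, while 0xfe010000 sits a little under 4 GiB.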

It does look strange that the host attempts to register BARs for multiple PCI devices (0000:00:12.0, 0000:00:14.2, etc.).

paulbugala
Contributor

That's interesting, I only have 32GB in that server.

bluefirestorm
Virtuoso

Having 32GB of RAM in the server while the PCI BAR address ranges are far beyond that is fine. The main reason for placing BARs high in the address space is to avoid making address space unusable for RAM, as happens with the PCI holes that lie between 3.5GB and 4GB and between 640KB and 1MB. The strange thing is that the log shows failures.
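
As a rough illustration of those regions, the Python sketch below sorts an address into the legacy hole, the 32-bit PCI hole, or the space above 4 GiB. The `classify` function is hypothetical, and the 3.5 GiB lower bound of the PCI hole is an assumption, since the real boundary varies by machine.

```python
KIB = 1 << 10
MIB = 1 << 20
GIB = 1 << 30

def classify(addr, pci_hole_start=int(3.5 * GIB)):
    """Classify a physical address; pci_hole_start is an assumed, machine-dependent value."""
    if 640 * KIB <= addr < 1 * MIB:
        return "legacy hole (640 KiB - 1 MiB)"
    if pci_hole_start <= addr < 4 * GIB:
        return "32-bit PCI hole (below 4 GiB)"
    if addr >= 4 * GIB:
        return "above 4 GiB (64-bit MMIO)"
    return "ordinary low address space"

if __name__ == "__main__":
    # Start addresses taken from the BAR lines in the original post.
    for addr in (0xfe010000, 0x4000000000, 0x4000006000):
        print(hex(addr), "->", classify(addr))
```

With these assumptions, only 0xfe010000 falls in the 32-bit PCI hole; the 0x40000xxxxx ranges are far above both 4 GiB and the 32GB of installed RAM.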
