VMware Cloud Community
RUG201110141
Enthusiast

NMI: 1193 and PCPU didn't have a heartbeat for 181 seconds

Has anybody seen these particular errors? VMware states it's a hardware issue, and IBM says there is nothing wrong with the hardware. I've reinstalled the OS and get the same errors. The host is running ESX 3.0.2.

warning 9/7/2007 12:08:59 PM Issue detected on server02.thecreek.com in Farm: NMI: 1193: Faulting eip:esp [0x7c8bd4:0x3477f14]

(0:00:08:01.774 cpu3:1053)

warning 9/7/2007 12:08:59 PM Issue detected on server02.thecreek.com in Farm: Heartbeat: 469: PCPU 3 didn't have a heartbeat for 421 seconds. *may* be locked up

(0:00:08:01.774 cpu5:1029)

warning 9/7/2007 12:04:59 PM Issue detected on server02.thecreek.com in Farm: NMI: 1193: Faulting eip:esp [0x7c8bd4:0x3477f14]

(0:00:04:01.774 cpu3:1053)

warning 9/7/2007 12:04:59 PM Issue detected on server02.thecreek.com in Farm: Heartbeat: 469: PCPU 3 didn't have a heartbeat for 181 seconds. *may* be locked up

(0:00:04:01.774 cpu5:1029)
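
For anyone sifting through many of these alerts, here is a minimal Python sketch that pulls the PCPU number and the stall duration out of the heartbeat warnings. The regular expression is based only on the message text shown above; the function names are made up for illustration, and other ESX builds may word the message differently.

```python
import re

# Matches the heartbeat warning text as it appears in the alerts above.
# Assumption: other builds may format this message differently.
HEARTBEAT_RE = re.compile(
    r"Heartbeat: \d+: PCPU (?P<pcpu>\d+) didn't have a heartbeat for "
    r"(?P<seconds>\d+) seconds"
)


def parse_heartbeat(line):
    """Return (pcpu, seconds) for a heartbeat warning, or None if no match."""
    match = HEARTBEAT_RE.search(line)
    if match is None:
        return None
    return int(match.group("pcpu")), int(match.group("seconds"))


def stalled_pcpus(lines):
    """Map each PCPU to the longest stall (in seconds) reported for it."""
    worst = {}
    for line in lines:
        parsed = parse_heartbeat(line)
        if parsed is None:
            continue  # not a heartbeat warning (e.g. an NMI line)
        pcpu, seconds = parsed
        worst[pcpu] = max(worst.get(pcpu, 0), seconds)
    return worst


if __name__ == "__main__":
    sample = [
        "Heartbeat: 469: PCPU 3 didn't have a heartbeat for 421 seconds. *may* be locked up",
        "NMI: 1193: Faulting eip:esp [0x7c8bd4:0x3477f14]",
        "Heartbeat: 469: PCPU 3 didn't have a heartbeat for 181 seconds. *may* be locked up",
    ]
    print(stalled_pcpus(sample))
```

If the same PCPU shows up in every warning, as PCPU 3 does here, that at least narrows down which processor to look at first.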


Accepted Solutions
Texiwill
Leadership

Hello,

All NMIs are produced by the hardware, and they are generally either CPU or memory related. This is definitely a hardware issue. It could be heat related or something similar, which means diagnostics must run for more than 10 hours, and sometimes even 100 hours, to reproduce the problem. Short runs will not vet the hardware if it is that type of problem.

Best regards,

Edward

--
Edward L. Haletky
vExpert XIV: 2009-2023,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill

RUG201110141
Enthusiast

Yeah, definitely hardware related. I don't know what exactly has failed, but I happened to have a spare server of the exact same model. So I took the CPU tray, hard drives, HBAs, and memory out of the faulty machine and placed them in the spare and, voilà, no problems whatsoever. Now I get to argue with IBM some more.

alexonline2
Contributor

After the upgrade from 3.0.1 to 3.0.2 I have exactly the same problem. Our server only ran stably with 3.0.1; with that version it had worked for several months without any problems. I don't think it is a hardware problem.

a_wolf
Contributor

I have the same problem when I try to install ESX 3.0.2. With 3.0.1 everything works fine.

Do you have an LSI Logic SCSI controller in your server? I have an onboard LSI 53c1030 (Intel motherboard), and when I installed patch ESX-7408807 the server crashed the same way as with 3.0.2.

Rob_Bohmann1
Expert

Just got this today on a DL585 G1 (dual-core 2.4 GHz AMD 880s) running ESX 3.0.1 build 39823, though the core dump file posted a message about build 40087...

Just trolling to see if anyone else has seen this error message besides the post above. I have an SR open, pursuing all avenues.

16:22:33:11.714 cpu1:1152)Heartbeat: 469: PCPU 0 didn't have a heartbeat for 61 seconds. may be locked up

16:22:35:11.714 cpu1:1152)Heartbeat: 469: PCPU 0 didn't have a heartbeat for 181 seconds. may be locked up

16:22:38:41.812 cpu0:1024)Host: 3293: BEGIN

Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1... using slot 1 of 1... log

If the service console is bound to CPU 0 and PCPU 0 and the service console cannot communicate (I guess no heartbeat implies that), then how is the service console keeping track of time to know how long it has gone without a heartbeat?

Inquiring minds would like to know...
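
One observation on the question above: in the log lines quoted, the heartbeat message about PCPU 0 is stamped cpu1, so the check appears to run on a different processor from the one that stalled. Below is a toy Python sketch of that general watchdog pattern. It is not VMware's implementation; the class, method names, and timeout value are all hypothetical and only illustrate how a healthy CPU can time a silent peer with its own clock.

```python
class HeartbeatMonitor:
    """Toy model: each CPU records when it last checked in, and any
    *other* CPU can compare those records against its own clock."""

    def __init__(self, num_cpus, timeout):
        self.timeout = timeout  # hypothetical threshold, in seconds
        self.last_beat = {cpu: 0.0 for cpu in range(num_cpus)}

    def beat(self, cpu, now):
        """Called by a CPU for itself while it is still making progress."""
        self.last_beat[cpu] = now

    def check(self, observer, now):
        """Run on `observer`; list peers silent for longer than the timeout."""
        stalled = []
        for cpu, last in self.last_beat.items():
            if cpu != observer and now - last > self.timeout:
                stalled.append((cpu, now - last))
        return stalled


if __name__ == "__main__":
    monitor = HeartbeatMonitor(num_cpus=2, timeout=60)
    # CPU 1 keeps checking in every 10 seconds; CPU 0 never does.
    for t in range(0, 181, 10):
        monitor.beat(1, t)
    # The check runs on CPU 1, so CPU 0's silence is measured by a live CPU.
    print(monitor.check(observer=1, now=181))  # [(0, 181.0)]
```

In this model the stalled processor never has to measure its own silence, which would be one way to explain how a "181 seconds" figure can be reported at all.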

astronyth
Contributor

I encountered the same problem this morning on an HP BL460c that had been running 3.0.1 with no problems. A few weeks ago I upgraded it to 3.0.2, and based on the posts of others here I can't help but wonder whether the two are related. I'm planning on upgrading to 3.5 this weekend.

mixolydian
Contributor

Have you or anyone located a solution for this? I have the same problem with 3.5 on an IBM x3500. I can recreate the problem by doing a rescan for storage in the Management Interface. I reinstalled ESX, applied the patches one at a time, and tested after each patch, with the same results.

Thank you,

Brian

astronyth
Contributor

To clarify my experience: while I was running 3.0.1 I never saw this problem. After upgrading to 3.0.2, and during the couple of weeks before I upgraded to 3.5, it happened on 3 different hosts a handful of times. Since I upgraded to 3.5 the problem has not happened again.

brandt_triple
Contributor

Same problem - also on IBM X3500 - anyone found a solution?

Friendlyware
Contributor

Hi, we have the same problem on two IBM x3400 machines. The machines were working fine with v3.0.2, but since we updated to v3.5 we get a similar error: CPU1:1075 Heartbeat 470 PCPU0 didn't have a heartbeat for 18s - may be locked up.

Has anybody found a solution for that?

brandt_triple
Contributor

Updated both the BIOS and the BMC on the server to the newest versions; now everything works.

Friendlyware
Contributor

Same here: we updated the IBM x3400 to the latest BIOS, and both our systems now run stably on ESX 3.5.

credfern
Contributor

I have this same problem, but on an IBM x346. I used the UpdateXpress CD to update the BIOS and BMC, but it did NOT fix the problem.

Any suggestions?

Henry_Dorset_Ca
Contributor

Hello,

I got the same errors on two brand-new IBM x3650 servers with BIOS 1.10, and in addition I got BIOS error 00180103, "Device resource allocation error". My QLogic FC adapter did not show up at boot time, and consequently the qla2300 module could not be loaded. I found an IBM support document at https://www-304.ibm.com/systems/support/supportsite.wss/docdisplay?brandind=5000008&lndocid=MIGR-610... (though written for the HS20 with some other additional hardware) which states that there may not be enough Option ROM space left at boot time. After I disabled PXE boot on the second onboard NIC, the machine came up clean and the PCPU error is no longer logged in vmkwarning.

Regards,

HDC

Henry_Dorset_Ca
Contributor

Hello,

I got the same errors on two brand-new IBM x3650 servers with BIOS 1.10, and in addition I got BIOS error 00180103, "Device resource allocation error". My QLE 2460 FC adapter did not show up at boot time, and consequently the qla2300 module could not be loaded. I found an IBM support document at https://www-304.ibm.com/systems/support/supportsite.wss/docdisplay?brandind=5000008&lndocid=MIGR-610... (though written for the HS20 with some other additional hardware) which states that there may not be enough Option ROM space left at boot time. After I disabled PXE boot (e.g. on the second onboard NIC), the machine came up clean and the PCPU error is no longer logged in vmkwarning.

This worked for me; maybe somebody can check this workaround and give some feedback.

Regards

HDC

lstcuser
Contributor

We had the same problem:

: cpuX: 1067) NMI: 1625: Faulting eip:esp (0x8a66c5:0x3aafd08)

: cpuX: Heartbeat: 470: PCPU 0 didn't have a heartbeat for 1861 seconds. may be locked up

....

The error occurred when we copied some files on the internal storage volume from one folder to another, and when we tried to convert an existing virtual machine onto the ESX host using VMware Converter.

We are using a SATA RAID with an "Areca SATA RAID II" controller. Before the installation of ESX we replaced the original RAM module with a 1 GB module, which was declared as supported by this controller; with this one we got the heartbeat errors.

Now, using the original 256 MB module instead of the 1 GB module, everything works fine...

Will_DeHaan
Contributor

I had this same error on a Phenom 9850 / AMD 780 system until I disabled Cool'n'Quiet and the "AMD C1E" option in the BIOS.
