Solved: Re: NMI: 1193 and PCPU didn't have a heartbeat for...

RUG201110141 · ‎09-07-2007

Has anybody seen these particular errors? VMware states it's a hardware issue and IBM says there is nothing wrong with the hardware. I've reinstalled the OS and get the same errors. It's running ESX 3.0.2.

warning 9/7/2007 12:08:59 PM Issue detected on server02.thecreek.com in Farm: NMI: 1193: Faulting eip:esp [0x7c8bd4:0x3477f14]

(0:00:08:01.774 cpu3:1053)

warning 9/7/2007 12:08:59 PM Issue detected on server02.thecreek.com in Farm: Heartbeat: 469: PCPU 3 didn't have a heartbeat for 421 seconds. *may* be locked up

(0:00:08:01.774 cpu5:1029)

warning 9/7/2007 12:04:59 PM Issue detected on server02.thecreek.com in Farm: NMI: 1193: Faulting eip:esp [0x7c8bd4:0x3477f14]

(0:00:04:01.774 cpu3:1053)

warning 9/7/2007 12:04:59 PM Issue detected on server02.thecreek.com in Farm: Heartbeat: 469: PCPU 3 didn't have a heartbeat for 181 seconds. *may* be locked up

(0:00:04:01.774 cpu5:1029)
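
For anyone comparing their own logs against the warnings above, here is a minimal Python sketch that pulls out each heartbeat warning and estimates when the affected PCPU last checked in. It is not a VMware tool. The function and field names are made up for illustration, and reading the four-field stamp as days:hours:minutes:seconds of uptime is an assumption based only on how the numbers in this thread line up.

```python
import re

# Matches the heartbeat warning text as it appears in this thread.
HEARTBEAT_RE = re.compile(r"PCPU (\d+) didn't have a heartbeat for (\d+) seconds")
# Matches the stamp and reporting CPU, e.g. "0:00:08:01.774 cpu5:1029)".
# Reading the four fields as days:hours:minutes:seconds is an assumption.
STAMP_RE = re.compile(r"(\d+):(\d+):(\d+):(\d+(?:\.\d+)?) cpu(\d+):\d+\)")


def stamp_to_seconds(match):
    """Convert a matched d:hh:mm:ss.mmm stamp to seconds of uptime."""
    days, hours, minutes = (int(match.group(i)) for i in (1, 2, 3))
    return days * 86400 + hours * 3600 + minutes * 60 + float(match.group(4))


def parse_heartbeat_warnings(text):
    """Return one dict per heartbeat warning that has a stamp attached.

    The stamp may sit on the same line as the warning or on the line
    that follows it; both layouts occur in this thread.
    """
    rows = []
    pending = None  # heartbeat warning still waiting for its stamp
    for line in text.splitlines():
        hb = HEARTBEAT_RE.search(line)
        stamp = STAMP_RE.search(line)
        if hb:
            pending = {"pcpu": int(hb.group(1)), "silent_for": int(hb.group(2))}
        if stamp and pending is not None:
            uptime = stamp_to_seconds(stamp)
            pending["reported_by"] = int(stamp.group(5))
            pending["uptime"] = uptime
            # Uptime minus the silent period estimates the last heartbeat.
            pending["last_beat_at"] = round(uptime - pending["silent_for"], 3)
            rows.append(pending)
            pending = None
    return rows


# Sample taken from the first post, with the alert prefix trimmed.
SAMPLE = """\
NMI: 1193: Faulting eip:esp [0x7c8bd4:0x3477f14]
(0:00:08:01.774 cpu3:1053)
Heartbeat: 469: PCPU 3 didn't have a heartbeat for 421 seconds. *may* be locked up
(0:00:08:01.774 cpu5:1029)
Heartbeat: 469: PCPU 3 didn't have a heartbeat for 181 seconds. *may* be locked up
(0:00:04:01.774 cpu5:1029)
"""

for row in parse_heartbeat_warnings(SAMPLE):
    print(row)
```

Under that assumption, both warnings in the first post place the last heartbeat of PCPU 3 at roughly 61 seconds of uptime, which suggests a single stall early after boot rather than two separate ones.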

Texiwill · ‎09-07-2007

Hello,

All NMIs are produced by the hardware, generally by a CPU or memory fault. This is definitely a hardware issue. It could be heat related or something similar, which means diagnostics may need to run for more than 10 hours, and sometimes even 100 hours, to reproduce the problem. Short runs will not vet the hardware if it is that type of problem.

Best regards,

Edward

--
Edward L. Haletky
vExpert XIV: 2009-2023,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill

RUG201110141 · ‎09-10-2007

Yeah, definitely hardware related. I don't know what exactly has failed, but I happened to have a spare server of the exact same model. So I took the CPU tray, hard drives, HBAs, and memory out of the faulty machine and placed them in the spare and voilà, no problems whatsoever. Now I get to argue with IBM some more.

alexonline2 · ‎09-19-2007

After the upgrade from 3.0.1 to 3.0.2 I have exactly the same problem. Our server only runs properly with 3.0.1; on that version it worked for several months without any problems. I think it is not a hardware problem.

a_wolf · ‎11-22-2007

I have the same problem when I try to install ESX 3.0.2.

With 3.0.1 everything works fine!

Do you have an LSI Logic SCSI controller in your server?

I have an LSI 53c1030 on board (Intel motherboard), and when I installed patch ESX-7408807 the server crashed the same way as with 3.0.2.

Rob_Bohmann1 · ‎12-18-2007

Just got this today on a DL585 G1 (dual-core 2.4 GHz AMD 880s) running ESX 3.0.1 build 39823, though the core dump file posted a message about build 40087...

Just trolling to see if anyone else has seen this error message besides the post above. I have an SR open, pursuing all avenues.

16:22:33:11.714 cpu1:1152)Heartbeat: 469: PCPU 0 didn't have a heartbeat for 61 seconds. may be locked up

16:22:35:11.714 cpu1:1152)Heartbeat: 469: PCPU 0 didn't have a heartbeat for 181 seconds. may be locked up

16:22:38:41.812 cpu0:1024)Host: 3293: BEGIN

Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1... using slot 1 of 1... log

If the service console is bound to CPU 0 (PCPU 0) and the service console cannot communicate (I guess no heartbeat implies that), then how is the service console keeping track of time to know how long it has gone without a heartbeat?

Inquiring minds would like to know...
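
One possible reading of the log lines above, offered as a guess rather than a description of how the VMkernel is implemented: the warnings about PCPU 0 carry a cpu1 stamp, so the check appears to run on a different CPU, which can measure the silence with its own clock. The toy Python model below illustrates that idea only. The class and method names are hypothetical and the 60-second threshold is arbitrary.

```python
class HeartbeatTable:
    """Toy model: every CPU stamps a shared table and any CPU can audit it."""

    def __init__(self, cpu_count):
        # Last time each CPU checked in, in seconds since start.
        self.last_beat = {cpu: 0.0 for cpu in range(cpu_count)}

    def beat(self, cpu, now):
        """Record that `cpu` was alive at time `now`."""
        self.last_beat[cpu] = now

    def stale(self, observer, now, threshold):
        """Return (cpu, seconds_silent) for every other CPU past the threshold."""
        return [
            (cpu, now - seen)
            for cpu, seen in sorted(self.last_beat.items())
            if cpu != observer and now - seen >= threshold
        ]


table = HeartbeatTable(cpu_count=2)
for second in range(1, 62):
    # CPU 1 keeps checking in once per second; CPU 0 never does.
    table.beat(cpu=1, now=float(second))

# CPU 1 audits the table using its own notion of time.
late = table.stale(observer=1, now=61.0, threshold=60.0)
print(late)
```

In this model the stalled CPU never has to report anything itself: the observer only needs the last timestamp the stalled CPU wrote before it went quiet.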

astronyth · ‎02-08-2008

I encountered the same problem this morning on an HP BL460c that has been running 3.0.1 with no problems. A few weeks ago I upgraded it to 3.0.2 and I can't help but wonder based on the posts of others here if it's not related. I'm planning on upgrading to 3.5 this weekend.

mixolydian · ‎02-18-2008

Have you or anyone located a solution for this? I have the same problem with 3.5 on an IBM x3500. I can recreate the problem by doing a rescan for storage in the Management Interface. Reinstalled ESX and applied patches one at a time and tested after each patch with same results.

Thank you,

Brian

astronyth · ‎02-19-2008

To clarify my experience, while I was running 3.0.1 I never saw this problem. After upgrading to 3.0.2 and during the couple weeks before I upgraded to 3.5, it happened on 3 different hosts a handful of times. After I upgraded to 3.5 the problem has not happened again.

brandt_triple · ‎02-29-2008

Same problem - also on IBM X3500 - anyone found a solution?

Friendlyware · ‎03-09-2008

Hi, we have the same problem on two IBM x3400 machines. The machines were working fine with v3.0.2, but since we updated to v3.5 we get a similar error. CPU1:1075 Heartbeat 470 PCPU0 didn't have a heartbeat for 18s - may be locked up.

Has anybody found a solution for that?

brandt_triple · ‎03-12-2008

Updated both BIOS and BMC on the server to newest versions - now everything works

Friendlyware · ‎03-12-2008

Same here, updated the IBM X3400 to latest BIOS and both our systems now run stable on ESX 3.5

credfern · ‎03-24-2008

I have this same problem, but for the IBM x346. I used the UpdateXpress CD to update the BIOS and BMC but it did NOT fix the problem.

Any suggestions?

Henry_Dorset_Ca · ‎06-16-2008

Hello,

I got the same errors on two brand-new IBM x3650 servers with BIOS 1.10, and in addition I got a BIOS Error 00180103 stating "Device resource allocation error". My QLogic QLE 2460 FC adapter did not show up at boot time and consequently the qla2300 module could not be loaded. I found an IBM support document at https://www-304.ibm.com/systems/support/supportsite.wss/docdisplay?brandind=5000008&lndocid=MIGR-610... (though for the HS20 with some other additional hardware) that states there may not be enough Option ROM space left at boot time. After disabling PXE boot (e.g. on the second onboard NIC), the machine comes up clean and the PCPU error is no longer logged in vmkwarning.

This worked for me; maybe somebody can check this workaround and give some feedback.

Regards

HDC

lstcuser · ‎07-15-2008

We had the same problem:

: cpuX: 1067) NMI: 1625: Faulting eip:esp (0x8a66c5:0x3aafd08)

: cpuX: Heartbeat: 470: PCPU 0 didn't have a heartbeat for 1861 seconds. may be locked up

....

The error occurred when we copied some files on the internal storage volume from one folder to another, and when we tried to convert an existing virtual machine onto the ESX host using VMware Converter.

We are using a SATA RAID with an "Areca SATA RAID II" controller. Before the installation of ESX we replaced the original RAM module with a 1GB module that was declared as supported by this controller; with this one we got the heartbeat errors.

Now, using the original 256MB module instead of the 1GB module, everything works fine...

Will_DeHaan · ‎10-22-2009

I had this same error on a Phenom 9850 / AMD 780 system until I disabled Cool&Quiet and the "AMD C1E" option in the BIOS.
