VMware Cloud Community
max70
Contributor

PSOD: PCPU Locked up failed to ack tlb invalidate

Hi everybody,

I have a production ESXi server (3i, 3.5.0, 110271) on an IBM eServer x3400 with 10 virtual machines on it. The server stopped with a PSOD and the following message:

"PCPU 3 locked up failed to ack tlb invalidate" Panic from another cpu world 2955, machine check exception

This has happened twice in the last 2 weeks (!!). After rebooting the server, the virtual machines started with no problems, and I have found no evidence of hardware problems in the server hardware logs.

Thanks in advance
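
For anyone comparing several crashes, here is a minimal Python sketch that pulls the PCPU number and world ID out of a panic line so they can be checked for a repeating core. The patterns are based only on the message quoted above; other ESX builds may word the panic differently, and the function name `parse_panic` is just an illustration.

```python
import re

# Patterns derived only from the message quoted in this thread (assumption:
# other builds may phrase the panic differently).
PCPU_RE = re.compile(r"PCPU\s*(\d+)\s+locked up", re.IGNORECASE)
WORLD_RE = re.compile(r"world\s+(\d+)", re.IGNORECASE)


def parse_panic(text):
    """Return the PCPU, world ID and machine-check flag, or None if no match."""
    pcpu = PCPU_RE.search(text)
    if pcpu is None:
        return None
    world = WORLD_RE.search(text)
    return {
        "pcpu": int(pcpu.group(1)),
        # The world ID is optional: not every report in the thread includes it.
        "world": int(world.group(1)) if world else None,
        "machine_check": "machine check" in text.lower(),
    }


if __name__ == "__main__":
    msg = ('"PCPU 3 locked up failed to ack tlb invalidate" '
           "Panic from another cpu world 2955, machine check exception")
    print(parse_panic(msg))
```

If the same PCPU shows up on every crash of one host, that points more towards a single core or socket; different PCPUs across hosts fits a platform-wide cause better.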

25 Replies
matthewk
Enthusiast

I had the same issue, although I could reproduce it by attempting to load Fedora 9. It would consistently PSOD right after formatting the volume. I deleted my VM and haven't worked up the courage to try it again on that server. I was previously using that VM with CentOS 5.2 with no problem. Fedora 9 installed fine on my test ESXi box, though.

My problematic server is a white box/custom build with an Asus DSVB-D motherboard and two 2.0 GHz Xeon E5335 CPUs. The BIOS is current. Everything else is stable and, fortunately, I don't need to use Fedora 9.

alvinswim
Hot Shot

I had the same issue the other day with a Windows 2003 32-bit guest running on a Dell M600 blade.

This time it's PCPU7.

Do you know why this is happening? I've seen it one other time, during setup, but the occurrence seems to be erratic, and not on the same blade.

SuryaVMware
Expert

A lot of this depends on the version of ESX you are using and the kind of hardware you have, so not every "PCPU locked up" PSOD is the same.

Post the screenshot if you can and I will try to figure out what caused the issue. Most PSODs are hardware-related.

-Surya

thingy
Enthusiast

Hi

We too have come across this issue. In our case, we did a new installation of ESX 3i on a Dell PowerEdge M600 blade and everything was working fine until we came in the next morning and found that the ESX host had pink-screened.

Attached is the screenshot of the error messages displayed.

This has happened on TWO blades in the same chassis so far. These are brand-new blades, so I'm leaning towards this being an ESX 3i issue, but I can't rule out hardware yet.

regards,

Jinesh

SuryaVMware
Expert

Jinesh,

Did you collect the vm-support dumps from the server? Can you post /var/log/vmkernel for the time of the crash?

-Surya
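
When the log is large, it can help to post only the relevant part. Below is a rough Python sketch that scans a saved copy of the vmkernel log for lines mentioning the symptoms discussed in this thread and keeps a few lines of context around each hit. The keyword list and the function name `find_suspect_lines` are assumptions made for illustration; the matching is plain substring search and does not rely on any particular log line format.

```python
import sys

# Keywords taken from the panic text reported in this thread (assumption:
# your log may use different wording, so adjust the list as needed).
KEYWORDS = ("tlb invalidate", "machine check", "locked up")


def find_suspect_lines(lines, keywords=KEYWORDS, context=2):
    """Return (line_number, text) pairs for matching lines plus surrounding context."""
    lines = [line.rstrip("\n") for line in lines]
    keep = set()
    for i, line in enumerate(lines):
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            # Keep `context` lines before and after each hit.
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    return [(i + 1, lines[i]) for i in sorted(keep)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_log.py <path to a copy of the vmkernel log>
    with open(sys.argv[1], errors="replace") as handle:
        for number, text in find_suspect_lines(handle):
            print(f"{number}: {text}")
```

Run it against a copy of the log pulled from the host rather than on the host itself, and widen `context` if the lines leading up to the crash matter.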

max70
Contributor

Hi all,

After a lot of work and headaches, we found that the problem was that the server's BIOS had not been updated. Since the BIOS update we have not experienced any PSOD (until now...).

This is the IBM reference for the problem, which is caused by an issue in Intel quad-core processors

Hope this helps

thingy
Enthusiast

Thank you for your reply.

We resolved our issue by removing the additional memory that had been installed in the server.

The Dell blade server had a strict requirement that if all 8 slots were filled with memory, then ALL the installed memory must be the same size + speed + type.

We had tried to mix different-sized memory in order to maximise the amount of memory installed in the box and make use of the spare DIMMs we had available.

It's odd that the memory testing we had done did not show up the issue, which is why we had looked at other causes.

regards,

Jinesh
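
The rule described above can be written down as a small check. This is only a sketch of the one rule as stated in this post (all 8 slots filled means every DIMM must match in size, speed and type); the real platform may have further population rules, and the `Dimm` fields and sample values below are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Dimm:
    size_gb: int
    speed_mhz: int
    kind: str  # module type, e.g. a label such as "FB-DIMM" (illustrative)


def population_ok(slots: List[Optional[Dimm]], total_slots: int = 8) -> bool:
    """Check the rule from this post: a fully populated board needs identical DIMMs.

    `slots` has one entry per slot; None marks an empty slot.
    """
    if len(slots) != total_slots:
        raise ValueError(f"expected {total_slots} slot entries, got {len(slots)}")
    installed = [dimm for dimm in slots if dimm is not None]
    if len(installed) < total_slots:
        # The rule as stated only applies when every slot is filled.
        return True
    # Frozen dataclasses are hashable, so a set exposes any mismatch.
    return len(set(installed)) == 1


if __name__ == "__main__":
    # Hypothetical mixed configuration: four larger and four smaller modules.
    mixed = [Dimm(4, 667, "FB-DIMM")] * 4 + [Dimm(2, 667, "FB-DIMM")] * 4
    print(population_ok(mixed))
```

A check like this only encodes what the vendor documentation says, so it is worth confirming the exact population rules for the specific blade model before relying on it.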

alvinswim
Hot Shot

I haven't seen any more issues with our setup since that one time a few weeks back. I suspect it was because our cluster and resources weren't completely set up: we were adding network resources, NICs, VMkernel networks, VLANs, firewall configs, etc. Once we got it all squared away, we've not seen any issues.

cordysmallik
Contributor

Hi All,

I also had the same issue yesterday on a Dell M600 blade, with the following error. This is the second time it has happened in two months. I logged a call with Dell, who asked me to generate a report using a tool. The report shows no hardware errors. After rebooting the server everything is fine.

Any update on why this is happening?

OS: ESX 3.5 Update 3

RAM: 32GB

Thanks in Advance

Mallikarjun

NormanA
Contributor

Same issue here.

Latest version of ESXi, all BIOS updates applied. ESXi has locked up with a PSOD on both of the M600 blades we have running in this blade chassis.

Both blades run through Dell diagnostics just fine, memtest86 et al.

Any idea what may be causing this?

vmesxipro
Contributor

Same problem as above, happening every week or so.

Anyone got a solution? Could it be a faulty CPU?

NormanA
Contributor

Not likely a faulty CPU or hardware. More likely a software bug or something that requires a microcode update for the CPU.

It has happened across 4 different M600 blade servers, all running either E5410 or L5420 CPUs.

A similar problem used to occur on Supermicro machines running 5400-series CPUs, but it was fixed by the vendor through BIOS updates.

vmesxipro
Contributor

Norman, thanks for your answer!

I'm actually running a Supermicro X7DWU mobo with E5410 CPUs. I'm also on the latest BIOS for that mobo (11/4/2008).

I did read about the microcode issue and sent an email to Supermicro to get their feedback. Maybe the mobo I'm using was not updated with the new microcode yet. I'll let you guys know.

When you say software bug, you mean that one of the VMs on the server might be causing this?

NormanA
Contributor

Software bug = the combination of VMware being too picky to work around the problem and the hardware "feature" causing the issue. The microcode update disables that hardware feature.

Jaroslavka
Contributor

Hello.

We have a very similar problem in our production environment. The only difference is that FAT corruption is mentioned (please see the attached photo of the screen, JPG). Does this error mean corruption of the file table on the ESXi installation drive? ESXi is installed on a 4GB CompactFlash (CF) card, using a CF-to-IDE converter. I know this is not a very usual installation method.

We also noticed a very strange partition table on the device where ESXi is installed (please see the attached PNG file, the output of the "fdisk -l" command).

Where can we find the crash dump file? Maybe we will find more details about this error there.

Thank you for explanation and help.

NormanA
Contributor

This doesn't look like a similar issue at all. More likely your disks are corrupt or you have faulty hardware.

Jaroslavka
Contributor

Norman, thank you very much for the reply.

We will check our hardware.

maokaman
Contributor

Hello.

We have the same issue: a PSOD once or twice a month.

ESXi 3.5 U3, then upgraded to U4.

The installation ".iso" and the post-installation system partition were bundled with the 3ware oem.tgz (http://www.3ware.com/kb/Article.aspx?id=15416).

The 3ware RAID controller is the only storage device used in the system, so it is also the boot device. We use a RAID5 configuration.

Server config:

1 x Supermicro Platforms SuperServer 6015B-UB

1 x AOC-SIMSO+

2 x CPU Intel Xeon E5405 (2GHz/1333MHz, cache 12MB, Quad-Core, 80W)

4 x Memory 1GB DDR2-667 FB-DIMM ECC

1 x RAID controller 3WARE 3W-9650SE-4LPML, PCI-E x4, 4xSATA II Drivers, LP Plane

4 x HDD S-ATA 500GB 7200RPM 3.5"

BIOS File Name: 7DBUC208.zip

BIOS Revision: R 2.1a

Supermicro says that the AX52 erratum has been addressed since the 2.0b BIOS (3208).

SMAR78700
Contributor

Hello,

I have the same problem with six Dell M600 blades. I am posting a screenshot of a PSOD.

The version of VI 3 is 3.5.0 Update 3 and Virtual Center is 2.5.0 Update 3.

The guest OSes of the VMs are Windows 2003 (32- and 64-bit), Windows 2008 (32- and 64-bit) and Windows XP.

Do you know why this is happening? I've seen it one other time, during setup, but the occurrence seems to be erratic, and not on the same blade.

After we contacted VMware support, they said a new patch is being released on 29/04/2009. Has anybody resolved this problem?

Thanks for your help.
