VMware Cloud Community
porschenm
Contributor

PSOD on Dell R515 BIOS 2.0.2 - ESXi 5.0 Updated

Hi,

Today I had a purple screen (PSOD) on a new Dell R515 with the latest BIOS (2.0.2).

We bought 2 of these machines in 2012 with BIOS 1.10.0 and had no problems.

Yesterday I installed 2 additional R515s with BIOS 2.0.2 (installed at the factory), and one of the two crashed last night.

Are there BIOS settings which I should disable?

Like DMA Virtualization?

C1E?

Power settings set to High Performance in the BIOS instead of OS Control?

On one of the new machines I will try to downgrade the BIOS and firmware to the versions on the first 2 machines and test it.

Regards Michael

[Windows 7 Help|http://windows-7-board.de]
44 Replies
wojtowvm
Contributor

Note that it doesn't seem to be possible to downgrade to the 1.2.4 version of the R515 BIOS that is listed on the download page as an alternative to the 2.0.2 BIOS.  It says it successfully loads and that you should reboot to complete the process, but upon rebooting, it says that the loading process failed and returns to the current BIOS.

If you click on the "previous versions" link below the 2.0.2 BIOS entry on the download page, you have a variety of options to choose from. I chose 1.9.3, which did load successfully on reboot and so far has not crashed with a PSOD (though it's only been a few days).

arnoldveenema
Contributor

Hi GBTurpin,

Could you provide me with the Dell case number (by private message)? I'm having the same issue on BIOS 3.0.4 and would like to refer to your situation as one where Dell provided the BIOS downgrade option.

Kind regards,

Arnold Veenema

jrmunday
Commander

We had exactly the same issue with BIOS version 3.0.4 on the R815s. The PSOD issue is specific to the AMD Opteron 6200 series processors. From speaking to VMware, it seems that both Dell and HP servers are affected, but the latest HP BIOS contains a critical update and so can't be downgraded; hopefully this will prompt a quicker fix. The solution/workaround is to downgrade to the earlier BIOS version until a permanent fix is released.

vExpert 2014 - 2022 | VCP6-DCV | http://www.jonmunday.net | @JonMunday77
GBTurpin
Enthusiast

Folks,

I can't give you case numbers, as that violates our corporate policy.

This is what I know:

Dell's R815 3.0.4 BIOS causes PSOD in VMware.  Downgrading to 2.9.0 with ALL of the other firmwares in the server left up-to-date causes no issues, and ALL of the R815s (a large number of them) are running without issue.

[I build Linux bootable ISOs with ALL possible firmware for the R815 platform and update everything when updating servers.]

Dell contacted me and asked if I'd "enter the BIOS Settings for the Processor and set C1E to disabled."

I did not do this, as I'm not rebooting production hosts to reflash the BIOS with 3.0.4 and then play with flags in hopes that the systems will not PSOD.

Some of the R815s would [with the 3.0.4 BIOS] explode quickly (PSOD) and others took a week or so. This means that helping Dell would be a severe resource drain.

GB

jrmunday
Commander

Hi GB,

  "enter the BIOS Settings for the Processor and set C1E to disabled."

I would highly recommend disabling C1E in the BIOS. I have found this to increase performance significantly in both virtual and physical systems. As an example of the difference it made, some calculations which took ~23 seconds to compute with C1E enabled took only 4 seconds to compute with C1E disabled.

I also found that in some cases I would see a PSOD on R815s (with BIOS version 3.0.4) even when the host had no VMs. I got Dell in to replace motherboards on 3x hosts and this made no difference. I'm back on BIOS version 2.9.0 and have not had the issue again. My support request with both Dell and VMware remains open until this is fixed.

Cheers,

Jon

GBTurpin
Enthusiast

jrmunday?

Dell really wanted to close our tickets... I assume that they were fine with not actually having solved the issue.

I'm pretty sure the flag is disabled on all of our systems, unless the BIOS updates caused it to be reset.

[A prior ticket had Dell disabling this as well, and THAT series of tickets related to the bad motherboard issues with the Dell R815s; the Dell R815s have been the worst series we've purchased in the history of Dell servers at our company, and we buy a LOT of Dell equipment.  I find that having to replace motherboards at a rate of 2 times per R815 and as many as 3 times to be pretty much unacceptable.]

GB

jrmunday
Commander

I am pretty disappointed with the R815's myself. I replaced 10x HP DL380 G5 hosts with 6x Dell R815 hosts ... so far, I've replaced 3x motherboards, CPU and RAM. On the plus side, the R815's downfalls have been responsible for exposing some flaws in the ESXi builds (which have been resolved by VMware).

The C1E flags should remain unchanged with the BIOS upgrades, but from version 2.8.2 HPC mode is added - I have enabled this on my hosts;

http://downloads.dell.com/FOLDER00955427M/1/PER815-030004BIOS.txt

* Added High Performance Computing (HPC) mode support for AMD Opteron 6200 series processor

Processor HPC Mode—Information Update

http://support.dell.com/support/edocs/systems/per715/en/TS/HPC_EN.pdf

AMD Opteron™ 6200 Series Processor Quick Reference Guide

http://www.amd.com/us/Documents/Opteron_6000_QRG.pdf

Cheers,

Jon


omen
Contributor

Hi,

we have two Dell R715 (each with 2x16 core AMD 6276) running with BIOS 3.0.4 and we see similar PSODs. I already contacted Dell support.

Btw., following the suggestion in this thread, I disabled the C1E flag in BIOS. But this didn't help.

Are there any new insights? Or is the BIOS downgrade still the recommended workaround?

Regards,

Olaf.

(corrected the Servertype)

jrmunday
Commander

BIOS downgrade is the only solution at the moment. My support call remains open with VMware, so I can update this thread as soon as there is any progress.

omen
Contributor

OK, thanks for the update. This morning our Dell support technician confirmed that there is a known bug that occurs with 62xx processors during vMotion when the latest BIOS is running. So I downgraded our two servers to 2.9.0 as well.

He told me that the Dell development team is working on a new, fixed BIOS and promised to inform me as soon as it becomes available.
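
To keep track of which hosts still need the downgrade, the BIOS versions reported so far in this thread can be put in a small lookup. This is only a sketch: the version pairs come from posts in this thread, not from a Dell advisory, and the host names in the inventory are hypothetical.

```python
# BIOS versions as reported by posters in this thread; NOT an official Dell list.
REPORTED_PSOD = {("R515", "2.0.2"), ("R715", "3.0.4"), ("R815", "3.0.4")}

# Versions that posters downgraded to and reported as stable so far.
REPORTED_STABLE_DOWNGRADE = {"R515": "1.9.3", "R715": "2.9.0", "R815": "2.9.0"}


def check_host(model, bios):
    """Return a short note on whether this model/BIOS pair was reported to PSOD."""
    if (model, bios) in REPORTED_PSOD:
        target = REPORTED_STABLE_DOWNGRADE[model]
        return f"{model} BIOS {bios}: PSODs reported; posters downgraded to {target}"
    return f"{model} BIOS {bios}: no PSOD report in this thread"


# Hypothetical inventory: (host name, server model, installed BIOS version).
inventory = [("esx01", "R815", "3.0.4"), ("esx02", "R715", "2.9.0")]

for host, model, bios in inventory:
    print(host, "-", check_host(model, bios))
```

A pair that is missing from the lookup only means nobody in this thread reported it, not that the BIOS is known to be safe.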

Olaf.

jrmunday
Commander

Hi Olaf,

As an FYI, this affects HP hardware as well;

https://h20566.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/topIssuesDisplay/?javax.portlet...

Cheers,

Jon

wojtowvm
Contributor

Interesting that the HP advisory says it is fixed in a more recent BIOS for their products and recommends that over downgrading. I wonder if they actually fixed the problem, or just re-released a prior BIOS with the changes that caused the problem backed out (but perhaps keeping other fixes and enhancements)? Dell needs to respond more quickly to this issue.

wojtowvm
Contributor

vMotion is not necessary to trigger the bug as it happened multiple times for me on a single R515 that doesn't even have vMotion enabled.

  "OK, thanks for the update. This morning our Dell support technician has confirmed, that there is a known bug happening with 62xx processors during vmotion, when the latest BIOS is running."

jrmunday
Commander

As far as I am aware, HP had no choice but to fix the issue as the previous BIOS release contained a critical update.

brennanmichaelj
Contributor

I was the customer that worked with HP on the BIOS fix for 12/17/12.

For those Dell customers who are still affected, you want to push Dell to update the AMD microcode in their BIOS to microcode level 0x06000629.

VMware had indicated that they think the bug we were encountering was related to AMD Errata #734.

We have had HP BIOS 12/17/12 in our environment for a couple weeks now and the spontaneous reboots, Machine Check Exceptions (MCEs), VMM64 page fault 14 errors, PF #14 PSODs for single VMs or vmotionStream and other issues involving virtual machine memory corruption have gone away.

We're running ESX4.1, but this will affect both ESX and ESXi 4.x and 5.x.  We specifically had problems with HP DL585 G7s and BL685c G7s, all running AMD 6200 Series processors.

HP BIOS 03/19/12 - ESX servers crashing, MCEs, PSODs (PF #14 on individual VMs)

HP BIOS 08/15/12 - ESX servers crashing, MCEs, PSODs (PF #14 on individual VMs)

HP BIOS 12/09/12 - MCEs, PSODs (not as many, but PF #14 on individual VMs or vmotionStream), introduction to VMM64 page fault 14s (which causes VMs to crash, both Linux and Windows) and memory corruption errors on VMs (Windows DLL crashes in event viewer)

If you need to figure out the microcode level, boot something like a CentOS LiveCD on the ESX host and run 'dmesg | grep "micro"'. It should output something like:

microcode_amd_fam15h.bin

patch_level=0x600062e

If the patch level is not at least 0x06000629, then you will experience problems.  Push them to fix the problem.
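
As a rough sketch of the check described above (not something HP, Dell, or VMware supplied), the following Python parses saved dmesg text for patch_level= values and compares them with the 0x06000629 minimum. The function names and the sample text are made up for illustration, and real kernel log lines may be formatted differently.

```python
import re

# Minimum AMD microcode patch level quoted in this post.
MIN_PATCH_LEVEL = 0x06000629


def patch_levels(dmesg_text):
    """Return every patch_level value found in the text, as integers."""
    matches = re.findall(r"patch_level=(0x[0-9a-fA-F]+)", dmesg_text)
    return [int(value, 16) for value in matches]


def microcode_ok(dmesg_text, minimum=MIN_PATCH_LEVEL):
    """True only if at least one level was found and all of them meet the minimum."""
    levels = patch_levels(dmesg_text)
    return bool(levels) and min(levels) >= minimum


# Sample text modelled on the output shown in this post (an assumption about
# the format; check it against what your own dmesg actually prints).
sample = "microcode_amd_fam15h.bin\npatch_level=0x600062e\n"

print(microcode_ok(sample))
```

Comparing the values as integers avoids the trap of comparing the strings directly, since 0x600062e and 0x0600062e are the same level written with and without the leading zero.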

HP was able to reproduce the bug by sending traffic back and forth to VMs, so they used some kind of stress utility specific for network traffic to reproduce.

omen
Contributor

Hi brennanmichaelj,

thanks a lot for the details. I just forwarded your mail to the Dell support.

Btw., we still haven't got any PSODs since we downgraded to BIOS 2.9.0.

omen.

brennanmichaelj
Contributor

You probably will not see any problems with the older BIOS, but at least in our case, we were seeing MCE's in the IML logs on that older BIOS, so you still may encounter problems running the older Dell BIOS, maybe just not as frequently.  I looked back at old PSODs since we try to take screenshots of them and I was seeing PF #14 PSODs going way back.  We also had some outdated drivers and firmware that may have contributed to those problems, which is why we kept on upgrading BIOS revisions.

The older Dell BIOS may be the best short-term way to make the pain go away, but there are still some microcode bugs that would be present in that older release. Hopefully Dell will try to duplicate the problem given the information you forwarded on to them and provide the microcode update in a new BIOS release.

lesmithjr
Contributor

I am also getting the PSOD running ESXi 5.1 on HP BL490 G6 and BL490 G7 blades. The advisory at this link does not address the server models I am using:

https://h20566.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/topIssuesDisplay/?javax.portlet...

Attached is a screenshot of the PSOD (16-03-49 (small).jpg).

***Update ***

It does appear that the BL490 G6 and BL490 G7 are also prone to this issue.

http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/topIssuesDisplay/?sp4ts.oid=3884...

brennanmichaelj
Contributor

Have you contacted VMware Support about this one yet and have you sent them your vm-support files?

Are you on the latest HP BIOS?

http://h20565.www2.hp.com/portal/site/hpsc/template.PAGE/public/psi/swdDetails/?sp4ts.oid=4268597&sp...

You should honestly open a new thread rather than post on this one, since your PSOD has nothing to do with the AMD 6200 Series issues.

PSODs can be generated by a number of problems. Once you open a new thread, please type out the first line of the PSOD error. I wasn't able to read it in the graphic you provided; it was too small.

I don't have any Intel servers (at least not yet), so I haven't encountered this PSOD yet.  VMware is usually pretty good at figuring these things out.
