VMware Cloud Community
dhertanu
Contributor

Host pink screen

Hi,

We're running ESXi 6.0 on a Cisco UCS C220 with a Cisco custom image. The host crashed once a while ago, but it has now started crashing very often: it went down with a pink screen about four times over the weekend. The pink screen says "#PF Exception 14 in world 33504:Cmpl-vmhba2- IP 0x41801c8bf04d addr 0x4310edfe96b6...".
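For anyone triaging a similar screen, the header line quoted above can be split into its fields. This is a minimal Python sketch, not a VMware tool: the regular expression is modeled only on the single line quoted in this post, and the name `parse_psod_header` and the `hba` field are hypothetical, introduced for illustration. Exception 14 is the x86 page-fault vector, which is what `#PF` abbreviates.

```python
import re

# Pattern modeled only on the one PSOD line quoted in this post (assumption):
# other ESXi builds or exception types may format the header differently.
PSOD_HEADER = re.compile(
    r"#(?P<kind>[A-Z]+) Exception (?P<vector>\d+) in world "
    r"(?P<world_id>\d+):(?P<world_name>\S+)\s+"
    r"IP (?P<ip>0x[0-9a-fA-F]+) addr (?P<addr>0x[0-9a-fA-F]+)"
)


def parse_psod_header(line):
    """Return the header fields as a dict, or None if the line does not match."""
    m = PSOD_HEADER.search(line)
    if m is None:
        return None
    # Hypothetical convenience field: pull an adapter name out of the world name.
    hba = re.search(r"vmhba\d+", m.group("world_name"))
    return {
        "kind": m.group("kind"),
        "vector": int(m.group("vector")),
        "world_id": int(m.group("world_id")),
        "world_name": m.group("world_name"),
        "hba": hba.group(0) if hba else None,
        "ip": int(m.group("ip"), 16),
        "addr": int(m.group("addr"), 16),
    }


line = ('#PF Exception 14 in world 33504:Cmpl-vmhba2- '
        'IP 0x41801c8bf04d addr 0x4310edfe96b6...')
print(parse_psod_header(line))
```

On the line from this post, the world name contains vmhba2, which is why the storage adapter behind vmhba2 looks like a reasonable first thing to examine.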

I used esxcfg-dumppart to extract the zdump file from the dump partition and I got a 190MB file (about 160MB gzipped). Then I used vmkdump_extract to get the log file. I can see the same exception error in the log. As far as I can tell vmhba2 should be the RAID controller, but other than that I can't tell much.

Is there anyone who can help me figure this out? I'm not sure how to make such a large file available, though...

Thanks,

Daniel

15 Replies
Finikiez
Champion

Hello!

Attach a PSOD screenshot, please.

dhertanu
Contributor

Vijay2027
Expert

Can you re-attach the PSOD screenshot, please?

For some reason I am unable to access the file. Thank you.

dhertanu
Contributor

I'm not sure what's going on. I wasn't able to download it either, nor a new screenshot I tried to upload.

Anyway, I updated the post with a URL to the image.

Thanks,

Daniel

ryanrpatel
Enthusiast

That's a page fault exception. Have you already ruled out any memory errors via Hardware diagnostics?

dhertanu
Contributor

No, I haven't. The server is in a remote location, and it will take a few days until I can reach it. I was hoping to get an idea of what's going on before then.

Daniel

raviverma17
Contributor

Hey Daniel,

I saw the PSOD screenshot that you attached to the community page.

Here is an article that discusses the PSOD you saw on your server: https://kb.vmware.com/s/article/102018

I would recommend running a hardware diagnostics test (a stress test, to be precise). Also check with the hardware vendor for the latest firmware and driver versions supported for the hardware in use, and upgrade to the most suitable versions to rule out known issues.

Regards,

Ravi Verma.

dhertanu
Contributor

Hello Ravi,

Thank you for your input. I'll perform a proper hardware test as soon as I can. By the way, I think the article number in your link is missing a "1" at the end.

Regards,

Daniel

raviverma17
Contributor

Sorry about the missing character. I have started using a new laptop and have been having some issues with its touchpad.

Regards Ravi Verma
dhertanu
Contributor

Hi,

I was finally able to perform a hardware test using Cisco SCU and memtest86. The tools didn't reveal anything wrong with the server's hardware.

The server keeps crashing within 24 hours, even when no VMs are running. On one occasion I noticed a different error when the server crashed. The error read:

"Could not start pcpu 1; TSC sync timed out

cr0=....

*PCPU0:32768/bootstrap

PCPU 0:SXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Code start: 0x41801e800000 VMK uptime: 0:00:02:03.273

......"

On every other occasion, the crash error was the one from my initial post.

If there is no hardware issue, could this be related to the OS itself, such as some kind of corruption? Does anyone have any idea?

Thank you,

Daniel

dhertanu
Contributor

The system has two 10-core CPUs. Out of ideas, I went into the BIOS and halved the number of enabled cores (10 instead of the maximum 20).

The server crashed after 28 minutes with a new crash error, just after I shut down all the VMs on it (I intended to copy them to external storage in case I decide to reinstall ESXi):

[Screenshot: pastedImage_0.png]

Daniel

vFouad
Leadership

Hi Daniel,

At this point I think we would need a full log bundle and a support request to get this solved. I have taken a quick look at the known VMware issues for your build of 6.0 (2494585) and didn't find an exact match. The close matches all pointed towards a hardware-level issue with the physical CPU, but logs would help confirm this.

Kind regards,

Fouad

dhertanu
Contributor

Hello Fouad,

We have been struggling with this server for a while. We opened a TAC case with Cisco, but they couldn't find anything. We then opened a case with VMware (18904700308), and still nothing.

I posted some logs to the case that directed me to this article: VMware Knowledge Base.

Recently the system started crashing without even offering the debug option. It just freezes with a pink screen and two or three lines of logs at the top.

If you could have a look at the case it would be great.

Thanks,

Daniel

vFouad
Leadership

Hi Daniel,

I have given the Support team some direction, and I will keep an eye on this issue.

The issues you are seeing appear to be different in each bundle, so it may take a little time to untangle and resolve each one.

However, one issue did initially leap out:

You are running:

vmhba2  lsi_mr3   7.703.19.00-1OEM.600.0.0.2768847

With Firmware Cisco    UCSC-MRAID12G  4.62

This driver/firmware combination could be causing some of the issues you are seeing, because the compatibility guide entry at

http://partnerweb.vmware.com/comp_guide2/search.php?deviceCategory=io&VID=1000&DID=005d&SVID=1137&SS...

shows that Cisco has only certified a minimum firmware of 24.12.1-0411 for the 7.703.19 driver.

Uncertified driver/firmware combinations have been observed to cause interop issues, which can produce some of the diagnostic screens you are seeing.

Can we clean this up and see if it reduces some of the noise of the other crashes seen?

Thanks,

Fouad

dhertanu
Contributor

Fouad,

Thank you for your quick response.

I just checked the CIMC information. In the Firmware Management section, the Cisco 12G SAS Modular Raid Controller reports Running version: 24.12.1-0411.

That makes sense, as I updated the server firmware just before I opened the TAC case with Cisco.

[Screenshot: pastedImage_1.png]

On the ESXi side I have:

[root@blinc:~] vmkload_mod -s lsi_mr3 |grep Version

Version: 7.703.19.00-1OEM.600.0.0.2768847
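As a rough cross-check of the values quoted in this thread, the driver and firmware strings can be compared numerically. This is only a Python sketch under stated assumptions: the helper names `parse_module_version` and `version_key` are hypothetical, the comparison simply treats each run of digits as a number, and it compares only like-for-like strings (the CIMC "Running version" against the certified minimum). A value such as "4.62" may come from a different numbering scheme, so it is deliberately left out.

```python
import re


def parse_module_version(vmkload_output):
    """Return the value of the first 'Version:' line, or None if there is none."""
    for line in vmkload_output.splitlines():
        m = re.match(r"\s*Version:\s*(\S+)", line)
        if m:
            return m.group(1)
    return None


def version_key(version):
    """Turn '24.12.1-0411' into (24, 12, 1, 411) so versions compare numerically."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


# Values quoted in this thread.
driver = parse_module_version("Version: 7.703.19.00-1OEM.600.0.0.2768847")
driver_base = driver.split("-", 1)[0]      # '7.703.19.00'
certified_driver = "7.703.19"              # driver named in the compatibility entry
minimum_firmware = "24.12.1-0411"          # minimum firmware named for that driver
running_firmware = "24.12.1-0411"          # CIMC "Running version"

# Compare only the leading components that the certified driver string specifies.
wanted = version_key(certified_driver)
driver_matches = version_key(driver_base)[:len(wanted)] == wanted
firmware_ok = version_key(running_firmware) >= version_key(minimum_firmware)

print("driver matches certified family:", driver_matches)
print("firmware meets certified minimum:", firmware_ok)
```

With the values above, both checks pass, which is consistent with the firmware having been updated before the TAC case was opened.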

Daniel
