VMware Cloud Community
wsaxon
Contributor

Page Fault exceptions since upgrading to vSphere 4

I've filed an SR about this but I'm still working through the 'maybe it's your hardware' stuff with the support reps.

We upgraded one of our ESX 3.5 hosts to ESX 4 and ran for a couple of days with no issues. Then we upgraded the rest of our cluster (6 machines) this past weekend. I have since experienced 7 PSOD lockups on 3 of the machines, all identical. I have attached a sample PSOD.

My understanding is that I should be able to retrieve a core dump image from my VMKCORE partition using esxcfg-dumppart. However, when I try to do this I get the following:

Single slot coredump

Error running command. Unable to copy the dump partition: Couldn't find a valid VMKernel dump file. Dump partition might be uninitialized.

I am not sure how to initialize the dump partition. These were set up automatically by the ESX installation software. I have gone to each VM host and issued an 'esxcfg-dumppart -a'. I figure either the partition is still not initialized, or ESX is not actually writing to the partition as it says it is.

We used to have similar issues (random machine check exceptions) with 3.5, but these were fixed by a BIOS update.

Has anyone else experienced this issue with ESX 4?

All system components are on the HCL except for our NICs - these are integrated Intel Pro/1000 EB controllers which were on the HCL for 3.5U4. We don't have any other cards to use, so if this is the culprit we'll not be able to upgrade to 4.0.

11 Replies
rsukumar
Contributor

Hi,

Can I have the SR number so that we can try to reproduce the issue?

Thanks,

Sukumar.

wsaxon
Contributor

1425473501

Currently it looks like this may be related to my onboard LSI 1064 controller. I've provided a dump file but have not heard back from support since doing so.

CrazyTao
Contributor

I am encountering the same issue. I installed VMware ESX 4.0 (build 164009) on an Intel MP server, S7000FC4UR.

After running for 4 days with no exceptions, it crashed with the PSOD shown in the picture below.

I rebooted the crashed server and ran the following command on each affected ESX host.

# esxcfg-dumppart -l

VM Kernel Name                           Console Name   Is Active   Is Configured
naa.6001517974d660001237fda72a80158b:2   dev/sda2       yes         yes

Single slot coredump

Error running command. Unable to copy the dump partition: Couldn't find a valid VMKernel dump file. Dump partition might be uninitialized.

Has anybody resolved this problem?

Thanks~~

-


Hardware configuration:

CPU: Intel Xeon 7460 ×4

Memory: Samsung M395T5160QZ4-CE66 DDR2 667 FBD 4G ×16

HD: Fujitsu MBB2147RC 147G 10k 2.5" ×6

RAID controller: Intel® Integrated RAID activation key, HW RAID support for 0, 1, 1A, 5, 6, 10, 50, 60 (AXXRAKSAS2)

HBA card: Qlogic QLE2460 -E 4Gb HBA Card

NIC: Dual onboard Gigabit (Gb) Ethernet ports; Intel® Remote Management Module 2 (AXXRMM2), which provides 2×1Gb Ethernet ports.

-


SR Number: 1446872971

jmonros
Contributor

I have just installed ESX 4 on two identical SR7000fc4ur systems with the exact same result: a PSOD. Did you ever get a resolution? I have verified that all BIOS and firmware levels are up to date. After 3-4 days, both systems PSOD. Any help or direction you can provide is appreciated.

CrazyTao
Contributor

My SR number is 1446872971. Unfortunately, it is still unresolved.

The VMware engineer had me check the dump partition many times, but I don't think there is a problem with the dump partition.

When a PSOD occurs, the dump partition is used to save the system information, and at the end of the PSOD screen there should be a message like "Starting coredump to disk,using slot 1 of 1 ...998876543210 DiskDump Successful". On my PSOD screen there was no such message.

So I think the dump was unsuccessful.

I contacted the Intel support center, but they had no effective measures to offer.

I have now removed the Intel Remote Management Module and the server works fine, but I still don't know the real cause.

wsaxon
Contributor

VMware and LSI both think this is caused by an issue with the onboard LSI Logic RAID controller. I was provided a debug build of the megaraid-sas driver which was supposed to give us more information in the event of a future crash. I have not experienced a crash since installing this debug build.

I am not sure if I am permitted to distribute this file or not, so I am not attaching it to this reply.

If you are experiencing this same problem on Intel server hardware, log on to the system and issue an 'lspci' command as root. If the output contains this:

04:0e.0 RAID bus controller: LSI Logic / Symbios Logic LSI Logic MegaRAID SAS1064R

I would suggest asking VMware about the debug build of the driver.
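For anyone checking several hosts, the lspci check above could be scripted roughly like this. It is only a sketch: the function name has_sas1064r is made up for illustration, and it simply looks for the controller string quoted above in whatever lspci-style text it is given on stdin.

```shell
# Hypothetical helper (name is illustrative, not a VMware or LSI tool).
# Reads lspci-style output on stdin and succeeds (exit status 0) only if
# a line mentions the MegaRAID SAS1064R controller quoted in this post.
has_sas1064r() {
    grep -q 'MegaRAID SAS1064R'
}
```

It could be used as `lspci | has_sas1064r && echo "SAS1064R controller present"`, run as root on the host.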

admin
Immortal

Can you check the HSC firmware version?

In the vm-support bundle there is a file lsi. ###.txt that contains the information about the HSC firmware:

find ./ -type f -name 'lsi*' -exec grep ESG-SHV {} \;

T19: 0 00400005 00020 0d 00000000 0 0 ESG-SHV. SCA HSBP M12.... 2.08 0 0 09 5000007000580000 09 08 09

I have seen this issue with anything below 2.10, such as 2.05, 2.08, and 2.09.

There are reports that updating the HSC firmware to 2.10 or 2.11 resolved the PSOD.

Intel did not release 2.10 and 2.11 for all servers, so 2.09 might be the latest for some, and it is reported to have the issue.
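To avoid reading the version off each line by eye, something like the following might help. It is a rough sketch, not a supported tool: the function name hsc_versions is made up, and it assumes the firmware version is always the fourth field after the ESG-SHV. token, which is inferred only from the sample line above. The 2.10 threshold is the one mentioned in this post.

```shell
# Hypothetical helper (name and field position are assumptions).
# Reads `grep ESG-SHV` output on stdin and prints each distinct HSC
# firmware version once, with a note on whether it is below 2.10.
hsc_versions() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                # Assumed layout: ESG-SHV. SCA HSBP <model> <version>
                if ($i ~ /^ESG-SHV/ && (i + 4) <= NF) {
                    v = $(i + 4)
                    if (!(v in seen)) {
                        seen[v] = 1
                        split(v, p, ".")
                        # Force numeric comparison of major and minor parts
                        old = ((p[1] + 0) < 2 || ((p[1] + 0) == 2 && (p[2] + 0) < 10))
                        print v, (old ? "below 2.10" : "2.10 or later")
                    }
                }
            }
        }'
}
```

From the top of an unpacked vm-support bundle it could be run as `find ./ -type f -name 'lsi*' -exec grep ESG-SHV {} \; | hsc_versions`. The pattern is quoted so the shell does not expand it before find sees it.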

wsaxon
Contributor

Looking at my old support archives, I see that in the June/July timeframe we were on 2.07 when having problems. We have since updated firmware one time (maybe in October?), but we're still on 2.07:

# find ./ -type f -name lsi* -exec grep ESG-SHV {} \;

T64: d 00400005 00020 0d 00000000 0 0 ESG-SHV. SCA HSBP M9..... 2.07 0 0 00 50015074e869c000 09 08 09

T65: d 00400005 00020 0d 00000000 0 0 ESG-SHV. SCA HSBP M9..... 2.07 0 0 00 50015074e869c000 09 08 09

T69: d 00400005 00020 0d 00000000 0 0 ESG-SHV. SCA HSBP M9..... 2.07 0 0 00 50015074e869c000 09 08 09

T59: d 00400005 00020 0d 00000000 0 0 ESG-SHV. SCA HSBP M9..... 2.07 0 0 00 50015074e869c000 09 08 09

It looks like 2.11 was released in June, so evidently the firmware pack downloaded by the Deployment Assistant does not include the latest firmware for each supported component.

Thank you for pointing this out. We'll update a server and see what happens.

admin
Immortal

No problem. Please let me know whether the 2.11 firmware fixes the PSOD.

It is strange that Intel did not list anything as fixed for 2.11.

Also, not all servers with this controller and HSC backplane have firmware 2.11 available.

SR2500 Hot Swap Controller Firmware


http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=17698&ProdId=2451&lang=eng


SR2500_2.11.zip

SR2500 v2.11:

- None

SR2500 v2.09:

- Improved HSC FW update by implementing 128 bytes data transfer.

SR2500 v2.05:

- None

SR2500 v2.02:

- None

wsaxon
Contributor

For the file I downloaded, they do list a fix for 2.11:

SR1550 v2.11:

- HSC misses some SES2/I2C requests causing system crashes

SR1550 v2.09:

- Unexpected completion code returned for issuing "Get Enclosure Slot Map" command with invalid beginning slot position.

- The reserved bit didn't return 0b

- Unexpected completion code returned for issuing "Set Sensor Hysteresis" command with invalid reserved byte.

- The assertion/de-assertion event message enabled bits of "Get Sensor Event Enable" for HSBP Sensor number 0x01 are different than FRUSDR package.

admin
Immortal

I found the same description; I missed it before. Thanks for sharing.
