VMware Cloud Community
jmartin819
Contributor

IBM x3850 M2 - Upgrade to ESXi 5 Up 1 host stops responding

I upgraded a host to ESXi 5.0 Update 1 yesterday and tested my Veeam backup, and the host stopped responding. I upgraded my Veeam installation and tested again, and the host stopped responding again, so I disabled all jobs because I didn't want the host to go down. But even with Veeam out of the picture, the host has stopped responding again. By "stopped responding" I mean the host is completely locked up, even at the console, and I have to hard boot it. I am on build 721882. I am opening a support request, but thought I would see if anyone else is having the same issue.

11 Replies
vmroyale
Immortal

Have you verified that the system is supported on the HCL for U1? Some of those systems have footnotes about supported features.

Also, have you updated the firmware, BIOS, etc on the host to ensure that everything is the latest version?

Brian Atkinson | vExpert | VMTN Moderator | Author of "VCP5-DCV VMware Certified Professional-Data Center Virtualization on vSphere 5.5 Study Guide: VCP-550" | @vmroyale | http://vmroyale.com
jmartin819
Contributor

Thanks for the reply, Brian. I did verify the HCL before the upgrade, and the system is supported without any notes. I am checking with IBM to verify that it is supported on their end, because they haven't released a BIOS update for the model we are upgrading since December 2010, which is very concerning to me.

vmroyale
Immortal

That does seem like a long time for no updates. Please post your solution, once you get there. Thanks!

jmartin819
Contributor

I called IBM support, and vSphere 5 is not supported on our model of x3850 M2, which is the 7141. Other x3850 M2 models are supported, but not the particular one we are running. So, lesson learned: check with VMware and the other vendors.

Cheers!

jmartin819
Contributor

Funny thing: IBM called me back, and they have conflicting documentation saying that they do support vSphere 5. So we'll see.

zzmike76
Contributor

Hello

We're experiencing exactly the same issue on the same hardware model. Did you manage to solve it?

Thanks and Regards

Michele

jmartin819
Contributor

We have contacted our customer rep at IBM and it's not looking too promising.

stefan_reichle
Contributor

We faced the exact same issues with the same hardware (IBM xSeries 3850 M2 7141-4SG) and found that we had 3 defective memory modules across two servers. In both of them the RSA II board showed messages like this:

IPMI:(06/16/2012 18:17:33)   (Memory - 😞 Assertion: Correctable ECC / other correctable memory error. DIMM/SIMM/RIMM 12.Transition to Non-Critical from OK.
IPMI:(06/16/2012 18:07:32)   (Memory - 😞 Assertion: Correctable ECC / other correctable memory error. DIMM/SIMM/RIMM 12.Transition to Non-Critical from OK.
IPMI:(07/23/2012 04:17:27)   (Memory - 😞 Assertion: Correctable ECC / other correctable memory error. DIMM/SIMM/RIMM 10.Transition to Critical from less severe.

However, this error would not trigger the alert LED on the server, nor would it change the system status from OK, which is what our xSeries guys monitor because we lack IBM Director (they don't like to spend the huge effort on maintaining it). Also, VMware Hardware Status would detect these issues. BIOS POST would not find any issues during ECC checks, and the memory size would be detected correctly when the server was restarted.
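Since these events do not change the overall system status, one option is to scan an exported RSA II event log for them directly. The following is only a minimal sketch: the regular expression is modeled on the three lines quoted above, the sample text is an abbreviated copy of them, and the function name `summarize_ecc_events` is made up for illustration. Other firmware levels may word their events differently.

```python
import re
from collections import Counter

# Pattern modeled on the event lines quoted in this post (assumption:
# other RSA II firmware levels may format these events differently).
EVENT_RE = re.compile(
    r"IPMI:\((?P<when>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\)"
    r".*?Correctable ECC.*?DIMM/SIMM/RIMM (?P<dimm>\d+)\."
    r"\s*Transition to (?P<state>Critical|Non-Critical)"
)

# Abbreviated copy of the lines quoted above, used only as sample input.
SAMPLE_LOG = """\
IPMI:(06/16/2012 18:17:33) (Memory) Assertion: Correctable ECC / other correctable memory error. DIMM/SIMM/RIMM 12.Transition to Non-Critical from OK.
IPMI:(06/16/2012 18:07:32) (Memory) Assertion: Correctable ECC / other correctable memory error. DIMM/SIMM/RIMM 12.Transition to Non-Critical from OK.
IPMI:(07/23/2012 04:17:27) (Memory) Assertion: Correctable ECC / other correctable memory error. DIMM/SIMM/RIMM 10.Transition to Critical from less severe.
"""


def summarize_ecc_events(log_text):
    """Count correctable-ECC events per DIMM and note any critical ones."""
    counts = Counter()
    critical = set()
    for line in log_text.splitlines():
        match = EVENT_RE.search(line)
        if match is None:
            continue  # not a correctable-ECC memory event
        dimm = int(match.group("dimm"))
        counts[dimm] += 1
        if match.group("state") == "Critical":
            critical.add(dimm)
    return {
        dimm: {"events": counts[dimm], "critical": dimm in critical}
        for dimm in sorted(counts)
    }


if __name__ == "__main__":
    for dimm, info in summarize_ecc_events(SAMPLE_LOG).items():
        print(f"DIMM {dimm}: {info['events']} event(s), critical={info['critical']}")
```

Any DIMM that shows up here is a candidate for the extended memory test, even if the system status still reads OK.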

Only booting into IBM DSA (preboot version 4 installed) would show the error: sometimes on the quick memory test and always on the extended full memory test. Be prepared to spend serious time waiting for the test to complete, though. Systems with 256 GB of memory take ~7 hours to complete if all memory is fine.

As mentioned, we found 3 failed modules in the two servers. After replacing them we reran the memory test to confirm that all memory is working fine again.

Since then the x3850 M2 hosts have worked without the sudden freezes we experienced before. That is only 36 hours so far, but before they would crash after 3 hours at most.

Of course we checked for the latest firmware on all components and found no issues there. However, it seems suspicious that the BIOS is from late 2010 and there have not been any updates since then. Also, on the RSA II we are missing CPU temperatures and voltage details, which should be displayed (unless incorrect firmware is installed), but so far there are no hints from IBM on this point.

It seems our x3850 M2 models are too old now, and we will replace them as soon as possible, since we only have them populated with 2 out of 4 CPUs and the memory is DDR2.

Hope this helps somebody. Contact me in case you need IBM case numbers to help your local IBM support get started.

jmartin819
Contributor

So we have been going back and forth with IBM and VMware. IBM wanted us to test their image of 5.0 Update 1, which has their drivers embedded and can be downloaded from their site. So I booted one of the hosts off a USB stick and let it run for 5 days with a test VM running a small load. The host did not go down. After we told IBM this, they asked us to install their image on a production host to verify that it will work. Even if that works, they still won't certify the server; we can call a special group for support if there are any issues. And it still won't be certified by VMware for support. That is not very comforting to me or management, especially for our production environment.

IBM told us that it's up to VMware to certify the servers for the HCL. But after talking to VMware, it turns out that the vendors do the testing and certification of their own hardware and report back to VMware for the HCL. Which makes complete sense.

So as of now we are at a standstill. I believe we are budgeting for new hosts for next year and will start migrating off our 3850 M2s.

skywalkr
Contributor

Hi, yep. I've been fighting the same. x3850 M2 hosts which ran for 300-400+ days on ESXi 4.0/4.1+ without so much as a blip, using the standard VMware install CDs, have been acting up since going to 5.0 plus, it seems, any vSphere updates made after March 2012. The initial 5.0 seems much more stable.

Since updating hosts after March's drops:

I've had memory errors surface which have NEVER been reported before on multiple hosts.  These may be a function of loading.

I've had solid host LOCK UPS multiple times on multiple machines now, where the console just freezes: no PSOD, nada, just frozen. No errors in the logs that VMware can decipher, even when captured immediately after a reboot. No hardware errors in the RSA. Frozen and dead. Power off/on.

I've opened numerous problem reports with VMware on this, and also worked with IBM from the hardware side on some of the bad DIMMs. IBM says that, due to the memory RAIDing tech in play, ESXi should never have seen the ProteXion error because it was effectively masked, but it's too much of a coincidence that we have ProteXion events and lockups coinciding.

I suspect these are driver issues: Broadcom onboards, Intel NICs, QLogic HBAs and LSI RAID cards are the top culprits.

What IS news is that VMware is NOT loading the latest drivers into these updates. I was told "you have to do that yourself..." and trust me, finding the correct drivers down in the bowels of ESXi and loading them separately is a major hassle, not to mention making sure you can find the right driver on VMware's web site and whether it's in or out of play.

What should happen is vSphere's Update Manager should identify the drivers running on the HW then allow the admins to download the drivers just like patches or updates and then install them if they are selected for the UM update.

Instead, I have to spend hours chasing what I hope are the correct drivers, then downloading them, then installing them manually into Update Manager, then using UM to patch all these hosts..
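One way to cut down some of that chasing is to capture the output of `esxcli software vib list` from each host and compare the installed driver versions. The following is a rough sketch only: it assumes the output is a header row, a dashed rule, then columns separated by two or more spaces, and that driver VIBs carry `net-` or `scsi-` name prefixes. The sample text and version strings are made up for illustration, and the function names are hypothetical.

```python
import re

# Made-up sample output for two hosts; versions are NOT real releases.
HOST_A = """\
Name          Version                    Vendor  Acceptance Level  Install Date
------------  -------------------------  ------  ----------------  ------------
net-bnx2      1.0.0-1vmw.500.0.0.000000  VMware  VMwareCertified   2012-01-01
scsi-qla2xxx  2.0.0-1vmw.500.0.0.000000  VMware  VMwareCertified   2012-01-01
esx-base      5.0.0-1.11.000000          VMware  VMwareCertified   2012-01-01
"""

HOST_B = """\
Name          Version                    Vendor  Acceptance Level  Install Date
------------  -------------------------  ------  ----------------  ------------
net-bnx2      2.0.0-1OEM.500.0.0.000000  OEM     VMwareCertified   2012-06-01
scsi-qla2xxx  2.0.0-1vmw.500.0.0.000000  VMware  VMwareCertified   2012-01-01
esx-base      5.0.0-1.12.000000          VMware  VMwareCertified   2012-06-01
"""


def parse_vib_list(text):
    """Return {vib_name: version} from vib-list style text."""
    vibs = {}
    past_rule = False
    for line in text.splitlines():
        if line.startswith("---"):
            past_rule = True  # data rows start after the dashed rule
            continue
        if not past_rule or not line.strip():
            continue
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 2:
            vibs[fields[0]] = fields[1]
    return vibs


def driver_mismatches(text_a, text_b, prefixes=("net-", "scsi-")):
    """Return {name: (version_a, version_b)} for drivers that differ."""
    a, b = parse_vib_list(text_a), parse_vib_list(text_b)
    diffs = {}
    for name in sorted(set(a) | set(b)):
        if not name.startswith(prefixes):
            continue  # skip non-driver VIBs such as esx-base
        if a.get(name) != b.get(name):
            diffs[name] = (a.get(name), b.get(name))
    return diffs


if __name__ == "__main__":
    for name, (va, vb) in driver_mismatches(HOST_A, HOST_B).items():
        print(f"{name}: host A has {va}, host B has {vb}")
```

This only shows where hosts differ from each other. It does not say which driver version is the right one for the hardware; that still has to be checked against the HCL.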

VMware needs to be loading the current drivers into updates more frequently and automating the support for automatic driver loads.

I believe, based on my gut and what you have said, that this IS A DRIVER ISSUE and the only way we are going to restore stability will be to update the drivers I mentioned to the current levels and hope that 5.0 Ux or 5.1 plays nice.

And BTW, we flash our firmware and BIOS to the latest levels (most recently in January 2012), usually every time we do a major upgrade, so that's 1x or 2x a year. With the older hardware, firmware updates will be less frequent because the platform is supposedly mature and stabilized.

G. Mobley

Later, GC Mobley
AWoroch
Contributor

http://communities.vmware.com/message/2030115

See if that thread helps you guys out. I know it's slightly different hardware, but it turns out the IBM HS22 is on the HCL as supported while the LSI 1064E that it has IS NOT. There is a link to an IBM KB article, which links to a VMware KB on how to disable the 64-bit interrupts, if I recall correctly.

It may not be the same issue - but it's worth sharing, in case it helps.
