VMware Cloud Community
a2pr
Contributor

IBM 3850 M2 and ESX3.5

Is anybody running ESX 3.5 on the new IBM 3850 M2 server?

We recently installed ESX 3.5 on one of our new 3850 M2 servers and are having trouble with it: when we put load on the server, it reboots.

We have talked to IBM support, and they said that the server is not "proven" with ESX 3.5?

VMware's server compatibility list includes the 3850 M2 with ESX 3.5, but IBM's ServerProven list does not!

It would be nice to hear from anyone who is running such a config.

26 Replies
bolsen
Enthusiast

Are the most recent patches installed?

private00
Enthusiast

Hi,

I installed 6x IBM 3850 M2 with ESX 3.5 last week.

We only had to fix a problem with the Broadcom network adapters (see ESX350-200712401-BG and ESX350-200712407-BG).

What's the problem right now? Have you configured a RAID?

Chris

Full_Halsey
Contributor

I've seen a few posts like this in the past few weeks. If you do a search you will see them; just sort by date. One post stated they had engaged IBM and there was an issue with firmware code levels. My past personal experience is the same as yours: if it's not on the proven website, IBM will NOT help you. Personally, I wish VMware would not post something on their HCL until the vendor has signed off on it.

a2pr
Contributor

The problem we have is that the server reboots by itself when we put some load on it (it works when we run just a couple of VMs on it). In the RSA log you can see SPINT errors.

IBM support told us that we can try installing the Broadcom fix and hope it works, but with no guarantees?!

Still, they say the server is not "proven" for ESX 3.5, only for 3.0.2.

We are booting ESX from 3 HDs configured as a mirror (I don't remember which RAID level was in the config).

The server we are running has the extra add-on RAID card, 2 FC cards (IBM), and 3 dual-port Ethernet cards (IBM Intel).

Are you running your servers in full production with high load (utilization)?

What options (cards etc.) do you have in your servers?

/Per

DigitalVoodoo
Enthusiast

What is the SPINT error you're getting, and what brand and how much RAM do you have in the server? I have 2 3850 M2s with 64GB of Crucial memory, and have had SPINT machine check errors relating to memory on both servers. Each time this happened, a SPINT error would be thrown, the server would reboot, and a pair of DIMMs (card 1, slots 1 and 5) would be reported as having a double-bit error and subsequently would be disabled in the BIOS. However, when we swapped the memory cards and DIMMs around, the next time the error occurred it reported the same card and DIMMs (card 1, DIMMs 1 and 5) as having a problem. In other words, the problem didn't follow the previously reported memory, as those DIMMs were now in different slots on the other card (now card 2 and slots 2 and 6).

IBM came out and replaced the system board in one of the servers to see if that would resolve the problem since the errors didn't follow the memory around, but they're "concerned" that we're using Crucial memory and not IBM-blessed memory (even though the problem doesn't follow the memory around).
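
The swap test described above can be written down as a small decision rule. This is only a sketch in Python: the function name and the DIMM labels are made up for illustration, and it looks at a single reported position rather than the DIMM pair.

```python
def classify_fault(pos_before, pos_after, placement_before, placement_after):
    """Decide whether a reported memory fault follows the module or stays with the position.

    pos_*: (card, slot) reported as faulty before and after the swap.
    placement_*: dict mapping (card, slot) -> module label at that time.
    """
    if placement_before[pos_before] == placement_after[pos_after]:
        return "module"      # same DIMM blamed in its new location: suspect the DIMM
    if pos_before == pos_after:
        return "position"    # same card/slot blamed with a different DIMM: suspect card or board
    return "inconclusive"


# Hypothetical labels: dimm_a started in card 1 slot 1, then moved to card 2 slot 2.
before = {(1, 1): "dimm_a", (2, 2): "dimm_b"}
after = {(1, 1): "dimm_b", (2, 2): "dimm_a"}

print(classify_fault((1, 1), (1, 1), before, after))
```

With the same card and slot blamed after the swap, the rule returns "position", which is why the memory brand looks like an unlikely cause here.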

a2pr
Contributor

The SPINT errors we got are:

Machine check asserted for Card or Link - SPINT, CPU Card, CPU Card

Machine check asserted - SPINT, North Bridge

(We are running 32 GB of original IBM memory.)

IBM came out a first time and replaced our CPU board, with no luck.

Now, after 2 weeks of waiting, they say ESX 3.5 will not be supported on this model until April?!

/Per

private00
Enthusiast

-> a2pr

Our config:

- 2x dual-port Intel network adapters
- 2x QLogic FC adapters
- 40 GB memory
- RAID 10 (4 disks)

Our servers are not under any high load at this time. Which kind of load causes the reboot (CPU, memory, disk I/O, ...)?

/chris

DigitalVoodoo
Enthusiast

Do you have an IBM case number for your issues that you'd be willing to share? We are having this same issue, and even after IBM replaced the system board the problem recurred last night. I'm leaning on them to get to the root cause, and having another case number to reference would be very helpful. Feel free to PM me if you'd feel more comfortable with that. Thanks in advance!

mike_laspina
Champion

Hello,

I would suggest you install 3.0.2 and if you still have an issue then you can call them up on that.

It would be interesting to hear what they would say if it has the same issues. (How can you tell I have worked with IBM support before?)

http://blog.laspina.ca/ vExpert 2009
gpeck29
Enthusiast

a2pr,

I currently have 3 IBM x3850 M2s in my test environment with ESX 3.5 on them. Each is configured with 64GB RAM, the 2 system NICs "teamed" for service console and VMotion, 2 dual-port network adapters, and 2 Fibre Channel cards. My original intent was to run the VMmark benchmark test on one of these servers. I was unable to run a full load test, as I was not able to get the SPEC tests. However, I ran 4 of the workloads (exchange, file, database, and standby) and it seemed to work fine. I only ran one sub-"tile", but I had no problems with load when we did this.

I might have some time this week and could possibly run a couple more sub-"tiles" of these tests to see if I encounter the issue. So far these servers have been pretty good in testing. I think I have 10 VMs on one of them right now.

My manager would like me to run some tests with 3.0.2 on this hardware too. Can you tell us where the issues are, so I can try to assist in replicating them on 3.5 and 3.0.2?

mats82
Contributor

I too am having issues with the 3850 M2. I've added it to our existing cluster of IBM xSeries servers with Intel processors, but nothing can VMotion to it.

It comes up with host CPU incompatibility errors. I've had a read of KB article 1993, and it seems the only way is to turn off each individual VM and change the setting to hide the flag? I have over 50 VMs, most of them running fairly critical 24-7 apps.

Has anyone else had this issue?

Wim_Backx
Contributor

I have the same issue on one IBM x3950 in combination with 3.0.2.

We installed several x3950 and x3850 M2 systems with the same kickstart install without problems, so it must be hardware. The machine where I have the problem now had already needed several hardware interventions before it was installable (motherboard replaced). So I will open a new case with IBM.

kind regards

W

Schorschi
Expert

We have had more issues with 3950s and 3850s in the last 6 months than with HP and Dell put together over the last 3 years, in reference to validation and certification for our production environment. We have barely started our scaled testing, let alone functional testing. IBM just does not support ESX OS well compared to HP and Dell.

We also see real throughput bottlenecks on the 3950s and 3850s between the PCI backplane and the processors; the embedded NICs seem to take the biggest performance hit. IBM is working on the issue, and swears we are smoking something, but we have real, concrete comparative data that IBM is really having trouble explaining away. The servers we have in the lab right now just do not work as well as comparable HP and Dell systems. We are also flat out tired of IBM saying... update the firmware, update the drivers, etc., etc.

Also, IBM has yet to certify on ESX 3.5 Update 1, of all things! Oh, the ESX OS as a straight OEM load is certified, but none of the IBM components are: not Director Agent, not ServeRAID Agent, not the OSA IPMI driver, not the RSA Daemon, not the IBM Mapping Layer, not the LM78 driver (since IBM Director has blind spots with IPMI), etc. We just found out today that the VMware implementation of Pegasus and the Pegasus instance that Director installs may be stepping on each other, but the jury is still out on this point. How is the OEM OS load for ESX of any real value when you cannot monitor the hardware correctly or effectively?

We have been at this with IBM for months, opened a ton of PMRs, and are still finding new issues. In fact, I suggested to my management that we walk away from IBM for 6 months to 1 year until they get their act together. Maybe with IBM Director 6.x and ESX 4.x IBM will get this straight.

jeffnotcarl123
Contributor

Schorschi,

I would be GREATLY interested in hearing more about your problems with the 3850s. We've been having many of the same performance issues and have benchmarked the 3650 series as considerably faster than the 3850s. Feel free to contact me with additional details. We're in the process of presenting our findings to various vendors. Anything you contribute may prove very useful.

Thanks,

Jeff

email: dementedlinuxgeek@yahoo.com

Schorschi
Expert

IBM finally got everything resolved, at least in the lab, but it took many man-hours and months of calendar time to do so. It boiled down to 3 basic issues... These issues are common to all vendors....

1) Horrible documentation: dated, missing, incomplete. IBM never prepared for ESX 3.5 properly. From my perspective, it appeared that IBM did not understand that Update 1 and Update 2 (and Update 3?) are not just simple updates but new OS versions, with new functionality core to the kernel, so IBM makes assumptions that get them into hot water over and over. They often update code only after a given ESX OS has been out for months, 3 to 6 minimum in our experience.

2) The IBM website is full of inaccurate or very hard to find information, and the documentation lags the code base. The search engine is horrible. Why? Because IBM almost never states on the web pages that VMware ESX is supported, only that Linux is supported, so even if the actual read-me files say ESX is supported, the search engine on the web site does not flag the right documentation or files. And, as noted in point #1, IBM often just says ESX 3 or, worse, 3.0, so that only when you demand an explanation will an IBM tech say.... oh, we know it supports 3.5, but we don't know about Update 1 or 2 or 3 yet. This makes things confusing.

3) Getting the right resource in front of the problem. We often have to go through 3 or 4 resources to get to one who has actually done real work on the specific ESX OS version to be certified. That means we are finally talking to a) the original developer of the code in question or b) a tech with actual hands-on experience on the correct version of ESX OS.... And I mean the correct version right down to the build # at times. For example, when we found that ESX OS 3.0.2 Update 1 broke IBM code, it took 5 IBM resources to understand and resolve the issue: 3 tech support personnel who knew Linux or ESX but not the version we were testing, or who only knew the older hardware and not the latest hardware, and 2 developers. The first developer stated that they had only tested the given code on Linux, but since ESX was a type of Linux, it should work on ESX anyway.... you have got to be kidding me. Of course, by the time this was resolved, a new version of ESX OS was out and IBM was not ready for it, so we had to start over on that certification cycle.

Like I said, once we got the right people working on the issues, and the issues were understood, we got the results we needed, but it took months in total to get to a stability point in the lab and past all the misinformation or misleading information. VMware is not guiltless either; they only do simple testing with vendor agents and never really do any integration testing. For example, with IBM Director Agent 5.20.1 and 5.20.2 they never loaded the IBM RSA driver, so they never realized that the Director Agent and the RSA Agent had interaction issues. And since VMware never loaded the ServeRAID agent either, neither VMware nor IBM owned up to the fact that the ServeRAID agent 'manager' subcomponent does not work right on ESX OS. This was never mentioned in the documentation, and was only disclosed after more than 30 days of silence on the issue, after we demanded a statement of support for IBM Director, IBM RSA, and ServeRAID agents level 1 and 2 for ESX OS, for ESX 3.0.2 Update 1, 3.5, and 3.5 Update 1.

Of course, since IBM CIM subscriptions are incomplete by default, we had to deal with that as well. Did you know that IBM Director out of the box, with the default agent installation on ESX OS, never generates NIC online or offline events as SEV 1 to IBM Director Server? It is reported as an informational error, so if you use IBM Tivoli TEC/TBSM, you never get a NIC failure alert escalated. Per IBM, by default the local logging never reports NIC status via CIM subscriptions, so we had to have IBM customize the CIM subscriptions. Even if you scrape the logs on the ESX server via IBM Tivoli ITM, that has a blind spot as well! This work took months to get resolved; again, the 3 issues above all played a key role in this extended time to resolution.

So, what does all this mean? We are only now, this month, ordering IBM hardware for production, for ESX OS provisioning, in total more than 6 months after we set up the first 3950s and HS21 XM blades in the lab to certify what IBM sales and marketing stated with a straight face was a 'done' deal. I am looking forward to VMware VI4 certification on IBM. I expect things will be the same at worst, and if better, good. Either way, I think our experience has shown IBM they need to improve, and I am keeping my predictions on this point to myself for now.

The interesting thing is... even IBM stated that we dig deeper than anyone else, or more than most customers, when we certify stuff.... which begs the question: just how far did IBM go in their own QA process? Not far enough, it would appear. Every single issue we found should have been found before IBM released anything. Am I not right in saying this? Expecting this?

I have seen and documented a continual trend over the last 5 years. Every vendor is doing the same thing: IBM, Dell and HP, even EMC, NetApp, you name it. Every vendor I know of is, to some degree, cutting corners, and we are not seeing the same quality of product and support that we once saw. Alpha code is labeled Beta, Beta code is labeled RC or worse GA, etc. So our total time to certify anything is getting longer, just when management, as a rule, is demanding that things be certified faster... nuts. I cannot tell you how many times management has said.... "Trust the vendors. We should not have to re-validate. Why are we re-testing everything?" Well, everyone knows why; they just don't want to believe it, I guess.

Do you not find it odd that this trend has grown in significance just as the trend toward increased outsourcing has? Now that would be a great research project.... to see if there is a rational, logical, and factual relationship between outsourcing and obvious quality loss in the IT industry. I wonder why Gartner has never commented on this? Oh, wait, maybe they have been outsourced as well?

meistermn
Expert

Thanks for this information.

What do they say about ESXi with its integrated agents? Is this also not tested?

You are not alone. We tested HP ESXi on USB on the HP 585 G2/G5.

After 2 months, every ESXi host was disconnected from VC.

Then, after we spoke to VMware, they told us that the USB sticks were not certified. These were the green ones.

Now we have the black ones. I was so glad that this was in our prelive environment and not in production.

So far we have run for a month and will run for another two months.

With ESX 3.5 Update 3 we have no problems on the HP 585 G2/G5. We also do not install the HP agents. Until now HP has updated their agents again and again, but they are still not really fixed.

When choosing an IBM 3850/3950, there is more testing for VMware to do, because IBM is the only vendor with its own chipset (EX4).

meistermn
Expert

What does IBM say about the VMware results for the x3850 M2 and x3950 M2?

If they ran these heavy workloads without problems, ask who ran these tests at IBM.

http://www.vmware.com/products/vmmark/results.html

http://www.vmware.com/files/pdf/vmmark/vmmark_ibm4.pdf

Do you use the same network card and SAN card as listed on page 3?

VMware result for the x3950 M2:

http://www.vmware.com/files/pdf/vmmark/VMmark-IBM-2008-10-02-x3950.pdf

jmartin819
Contributor

We updated to 3.5 Update 3 and one host would not boot up. We removed hardware, reinstalled, and everything. We even replaced the motherboard and updated the firmware. We ended up rolling the BIOS back from 1.07 to 1.02, and everything is working well now. There is a switch on the motherboard to boot off of a backup BIOS. The weird thing was that we were able to boot into debug mode with no problems, but when we tried to boot normally it would fail at Initializing Scheduler and throw a CPU error. We even loaded Windows on the server, and that would boot fine.

ziffle85
Contributor

We just had the exact same issue!

5  SERVPROC  02/20/09, 11:54:41  Resetting system due to an unrecoverable error
6  SERVPROC  02/20/09, 11:54:41  Machine check asserted for Card or Link - SPINT, CPU Card, CPU Card
7  SERVPROC  02/20/09, 11:54:02  Resetting system due to an unrecoverable error
8  SERVPROC  02/20/09, 11:54:02  Machine check asserted - SPINT, North Bridge

Anybody have some insight on this? Thanks!!
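
For anyone comparing logs across hosts, here is a rough Python sketch that pairs each reset in the entries above with the machine check logged at the same timestamp. The tuple layout and the function name are my own assumptions, the timestamps are assumed to be MM/DD/YY, and the entries are typed in by hand rather than read from an actual RSA export.

```python
from datetime import datetime

# Entries copied by hand from the log above: (index, source, timestamp, message).
ENTRIES = [
    (5, "SERVPROC", "02/20/09, 11:54:41", "Resetting system due to an unrecoverable error"),
    (6, "SERVPROC", "02/20/09, 11:54:41",
     "Machine check asserted for Card or Link - SPINT, CPU Card, CPU Card"),
    (7, "SERVPROC", "02/20/09, 11:54:02", "Resetting system due to an unrecoverable error"),
    (8, "SERVPROC", "02/20/09, 11:54:02", "Machine check asserted - SPINT, North Bridge"),
]


def group_resets(entries):
    """Return (timestamp, machine-check messages) for every reset, oldest first."""
    buckets = {}
    for _index, _source, stamp, message in entries:
        when = datetime.strptime(stamp, "%m/%d/%y, %H:%M:%S")
        bucket = buckets.setdefault(when, {"reset": False, "checks": []})
        if message.startswith("Resetting system"):
            bucket["reset"] = True
        elif message.startswith("Machine check asserted"):
            bucket["checks"].append(message)
    # Keep only timestamps at which a reset was actually logged.
    return [(when, b["checks"]) for when, b in sorted(buckets.items()) if b["reset"]]


if __name__ == "__main__":
    for when, checks in group_resets(ENTRIES):
        print(when.isoformat(), "->", "; ".join(checks) or "no machine check logged")
```

Here the first reset lines up with the North Bridge machine check and the second, 39 seconds later, with the CPU Card one.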
