VMware Cloud Community
bernp
Contributor

ESXi 3.5u3 on an IBM eServer 326m

Hello,

I'm trying to run ESXi 3.5u3 (130755) on an IBM eServer 326m (dual Opteron 275 dual-core, 8 GB RAM, 2x160 GB SATA).

The installation goes fine, all hardware is recognized, and I can access and manage the host with the remote CLI, but it always crashes after a few tens of minutes (with no VMs on it yet; the VMkernel is running alone).

Has anybody tried this configuration?

Thanks

nick_couchman
Immortal

When you say that it crashes, does it get the Purple Screen of Death? If so, can you either post a shot of that screen or just the messages that are on the screen at the time it crashes?

bernp
Contributor

Yes, I got the purple screen...

But I didn't save it; I just rebooted the server.

How can I capture this screen? (I will do so next time.)

PS: in fact, sometimes I don't get this screen at all. The server freezes completely and I have to hard-reset it.

nick_couchman
Immortal

Got a camera phone? Or a digital camera?

bernp
Contributor

So, here is a purple screen after a crash. Most of the time there is no purple screen; the server is just frozen.

bernp
Contributor

Hello,

Can nobody help me with this "purple screen"?

Thanks

Dave_Mishchenko
Immortal

Have you upgraded the server's BIOS to the latest release, and are the CPUs exactly the same?

bernp
Contributor

Yes, I flashed the latest BIOS, but it did not change anything.

And yes, the CPUs are the same...

A native OS (e.g. Red Hat Enterprise Linux 5) works fine on these servers.

tfindlay
Enthusiast

Damn! I've got a 326m also and have almost exactly the same problem.

Sometimes it runs for an hour or two, but ultimately it just locks up. I don't get the purple screen though; mine just locks up with the yellow screen still showing and not responding.

I should add that I tweaked mine up a bit with 2 x Opteron 280s and 8GB of RAM, with an iSCSI HBA and no local disk. I'm running ESXi off a USB key.

I don't suppose anyone is really interested in doing much with this problem though, as it's only a cheapish machine. If by any chance you or anyone has any ideas, let me know.

Dave_Mishchenko
Immortal

If you reboot, press ALT+F11 once it is up, and leave it on that screen, you might get a helpful error message.

bernp
Contributor

tfindlay wrote: "Damn! I've got a 326m also and have almost exactly the same problem."

OK... so I'm not making this up...

Since I have this problem on 4 different eServer 326m machines, I don't think it is a hardware failure.

tfindlay wrote: "mine just locks up with the yellow screen still showing and not responding."

Sometimes a purple screen, sometimes just frozen... same problem...

tfindlay wrote: "If by any chance you or anyone has any ideas let me know."

Not yet... I don't know what to do or try...

tfindlay
Enthusiast

Hi,

I tried the Alt+F11 thing which went to a great screen ..... it ran for about 30mins before locking up with this error:

0:00:09:18.106 cpu1:1563)Heartbeat: 470: PCPU 0 didn't have a heartbeat for 65 seconds. may be locked up

I know the BMC has a heartbeat thing in it, so I'll hook up a console cable and see what it can tell me, or I might even try removing the BMC and firing it up to see what it says.
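
For anyone collecting these messages from several 326m boxes, here is a minimal Python sketch that pulls the fields out of a line like the one quoted above. The pattern is written against that single line only, so other builds may word the message differently. The names `HeartbeatLoss` and `parse_heartbeat_loss` are made up for this example, and reading the `cpu1` tag as the CPU that logged the message is an assumption.

```python
import re
from typing import NamedTuple, Optional


class HeartbeatLoss(NamedTuple):
    uptime: str    # timestamp at the start of the line, e.g. "0:00:09:18.106"
    reporter: str  # CPU tag in front of the message (assumed: the CPU that logged it)
    pcpu: int      # physical CPU reported as having no heartbeat
    seconds: int   # how long it had been silent


# Written against the one line quoted in this thread; treat it as a guess
# for any other ESXi build.
_PATTERN = re.compile(
    r"^(?P<uptime>[\d:.]+)\s+(?P<reporter>cpu\d+):\d+\)Heartbeat: \d+: "
    r"PCPU (?P<pcpu>\d+) didn't have a heartbeat for (?P<seconds>\d+) seconds"
)


def parse_heartbeat_loss(line: str) -> Optional[HeartbeatLoss]:
    """Return the parsed fields, or None if the line is not a heartbeat warning."""
    match = _PATTERN.match(line.strip())
    if match is None:
        return None
    return HeartbeatLoss(
        uptime=match["uptime"],
        reporter=match["reporter"],
        pcpu=int(match["pcpu"]),
        seconds=int(match["seconds"]),
    )
```

Feeding it the line above gives PCPU 0 silent for 65 seconds; any line that does not match returns None, so it can be run over a whole captured console log.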

tfindlay
Enthusiast

Just a further update .... I took the BMC module out and the system has run cleanly for over 4hrs now....

Without the BMC you lose a lot of the nice system information things, but at least this may isolate the problem....

If it's still running tomorrow I'll do a few clean restarts and see how it goes. If it's all good towards tomorrow arvo, I'll pop the BMC back in to confirm it creates the problem again.

bernp
Contributor

Thanks for this tip!

So, I just removed the BMC from three eServer 326m machines and rebooted ESXi to see whether it works like that.

For the moment (10 min...), it runs. The immediate little problem without the BMC is that the "power ON" LED on the front panel never lights up, so it is very hard to see quickly whether the server is on or off...

tfindlay
Enthusiast

Correct, there are a LOT of things "missing" when you remove the BMC. Another notable one is that the server fans run flat-out; I guess the temperature control is managed by the BMC also.

I should note - I'm not suggesting we actually run the server this way, it's more just a way of identifying the cause of the incompatibility. The real solution may simply be logging into the BMC and adjusting the settings. I know it has a watchdog type of service and other functions; perhaps it is this that's causing the lockup?

Removing the BMC is really a pretty crude method of diagnosing the cause of the problem, and I don't think IBM would endorse this sort of change. :)

bernp
Contributor

So, it's confirmed: without the BMC, ESXi has been stable for four days.

A friend told me that there are also issues with IPMI (so the BMC) under Linux (freeze after two hours of power on), and to "never use IPMI on eServer 326m".

The question/problem now is: how to disable IPMI on the BMC under ESXi?

bernp
Contributor

Hello,

I just tried two things (after reinstalling the BMC):

- on one server, I tried to disable CIM by unchecking the "Misc.CimEnabled" parameter

- on another server, I tried to disable the IPMI driver (with the command "esxcfg-module -d ipmi_si_drv" on the unsupported console)

Now to wait and see...

Edit:

These two "solutions" don't work... both servers were frozen this morning...

I will look for other solutions...

bernp
Contributor

Maybe I've found a solution!

I tried a lot of tuning: disabling the ipmi modules, no success (the modules are marked as disabled, but still get loaded).

I tried stopping the sfcbd daemons, no success.

A lot of other parameters, no success...

And finally, I deleted the ipmi modules (ipmi_devintf.o, ipmi_msghandler.o and ipmi_si_drv.o): this seems to be a working solution. My server has been up for 2 days now. I will wait a bit more, but it seems OK (before, it froze after a few hours at most).

These modules cannot be deleted directly in the /mod directory, because that directory is rebuilt at startup.

Here is how to do it:

- boot a live Linux (from a CD)

- mount /dev/sda5 and /dev/sda6 (these are the two "banks" of the firmware)

- extract the binmod.tgz archive, remove the 3 ipmi modules, rebuild this archive, and replace the original.

That's all. The ipmi modules will no longer be loaded, and the eServer will not crash or freeze.

It will be necessary to redo that after each firmware update.
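
Since this has to be repeated after every firmware update, the repacking step could be scripted from the live Linux. Below is a minimal Python sketch, not something tested against a real ESXi bank. It assumes binmod.tgz is an ordinary gzip-compressed tar archive and that the three modules can be recognised by file name wherever they sit inside it. The function name `strip_ipmi_modules` and the choice of GNU tar format are assumptions of this example, and whether ESXi accepts an archive rewritten this way is unverified, so the original is kept as a ".orig" copy.

```python
import os
import shutil
import tarfile

# The three module file names given in the post above.
IPMI_MODULES = {"ipmi_devintf.o", "ipmi_msghandler.o", "ipmi_si_drv.o"}


def strip_ipmi_modules(archive_path: str) -> list:
    """Rewrite a .tgz archive without the IPMI modules.

    Returns the member names that were dropped. The untouched archive is
    kept next to it with a ".orig" suffix (only on the first run).
    """
    tmp_path = archive_path + ".new"
    backup_path = archive_path + ".orig"
    removed = []
    # GNU format is an assumption; match whatever tool built the original.
    with tarfile.open(archive_path, "r:gz") as src, \
            tarfile.open(tmp_path, "w:gz", format=tarfile.GNU_FORMAT) as dst:
        for member in src:
            if os.path.basename(member.name) in IPMI_MODULES:
                removed.append(member.name)
                continue
            # Only regular files carry data; directories and links are
            # copied as header-only entries.
            data = src.extractfile(member) if member.isreg() else None
            dst.addfile(member, data)
    if not os.path.exists(backup_path):
        shutil.copy2(archive_path, backup_path)
    os.replace(tmp_path, archive_path)
    return removed
```

It would be called once per bank, with the path to binmod.tgz on each mounted partition (/dev/sda5 and /dev/sda6 in the steps above), and the returned list checked to confirm that all three modules were actually found and removed.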

tfindlay
Enthusiast

Fantastic stuff!

I've got to stack some more ram in ours, so I'll try this later in the week when we have an outage.

bernp
Contributor

Hello,

My eServer has now been up for 5 days, so I think this is the working solution!

Conclusion: IPMI and the IPMI drivers are the problem with this kind of IBM eServer.

And the solution described in my previous message works well.

Best regards
