Re: HP BL460C G1 Blade Slowdown is Driving US Crazy

barhorst · ‎05-22-2008

We have three HP BL460C blades in an HP C7000 chassis, connected via FC to an EVA6000 with one LUN. Each blade runs dual quad-core Intel E5430 CPUs with 16 GB of RAM.

We are booting VMware ESX3i 3.5 Build 85332 from a USB flash drive on all three. We turned on clustering, DRS and HA.

After loading about a dozen VMs we noticed some of the machines slowing down or freezing. We turned off clustering, DRS and HA and isolated the issue to the third blade. HP would not believe us, so we rebuilt the VMware install on flash. We've checked the fabric switches (redundant) and the Ethernet networking (redundant Cisco 3020s), and moved the blade to another slot. We've confirmed the validity of all these configurations with HP and VMware. All three blades are configured identically.

HP finally arrived and changed out EVERY part in the blade, with no difference. Any VM we load on that blade runs slowly and intermittently, and IO suffers.

The first two blades continue to run our VMs (in production) normally with no observable errors.

HP MIGHT send us a new blade but we are not optimistic.

Does anyone else have similar issues or a similar configuration with HP blade hardware?

At this point we are at a standstill migrating the rest of our data center because we cannot get past this issue. HP seems to be baffled.

Thanks for any help!

Troy_Clavell · ‎05-22-2008

What kind of HBA are you using? We use QLogic, and when the system boots there is an option to hit <Ctrl-Q> to configure the speed of the card. Have you confirmed that the speed on the bad blade is the same as on all the rest?

barhorst · ‎05-22-2008

The blades ALL have embedded mezzanine QLogic QMH2462 cards and they are all set to "auto".

That was the first part we changed out!

Thanks

Troy_Clavell · ‎05-22-2008

And your ROM firmware version, is it current? BL460s are temperamental. I am working with HP on one as we speak for a different issue.

barhorst · ‎05-22-2008

Firmware on the funky blade is:

I15 02/29/2008

Firmware on the two working blades is:

I15 10/06/2007

I know they were all the same version (10/06/2007) when we started

Looks like there is a newer version on the website that came out 2008.04.01

Not sure if I should update all three blades or not.

Troy_Clavell · ‎05-22-2008

All of ours are running I15 04/01/2008. I don't know if it will fix it or not, but it's worth a shot, at least on the blade that is not functional for you guys at the moment.

barhorst · ‎05-23-2008

Updated the firmware on the blade in question to the latest. Still has the same problem.

jhanekom · ‎05-23-2008

- How frequently does the problem occur?

- How are you determining that IO, or performance in general, suffers?

- What is the nature of the "freezes" - are they temporary or do you have to power off the VM to get past them? (Also, how frequently do they occur?)

- Are there any entries in the guest OS logs during the time that the problems occur that might indicate what the problem is?

- Are there any entries in any of the ESX hosts' /var/log/vmkwarning log files during the time that the problem occurs?

- You mention that all the components were replaced. Does this include memory?

barhorst · ‎05-23-2008

- How frequently does the problem occur?

All the time. It cycles roughly every 30 seconds: somewhat usable, then a freeze. It has done this with EVERY VM we have run on it, several flavors of Windows and Linux.

- How are you determining that IO, or performance in general, suffers?

GUIs (windows) lock up, and the command line freezes then releases. Writes to disk take a VERY long time. A VM can take 10 minutes to boot.

- What is the nature of the "freezes" - are they temporary or do you have to power off the VM to get past them? (Also, how frequently do they occur?)

Some OSes (Red Hat Linux) went into a weird state and the blade had to be reset.

- Are there any entries in the guest OS logs during the time that the problems occur that might indicate what the problem is?

Nope, cannot see anything (VMware actually looked at them as well).

- Are there any entries in any of the ESX hosts' /var/log/vmkwarning log files during the time that the problem occurs?

This is ESX embedded - no such log

- You mention that all the components were replaced. Does this include memory?

Yes, processors and memory. The ONLY component that wasn't replaced is the SAS disk controller, which also has the USB connector on it.

T3Steve · ‎05-23-2008

Troubleshooting 101 says to use process of elimination. I've worked for many years with the EVA and would be suspicious of it.

Do you have room on the local disks to add a VMFS partition and load a VM there, to test functionality without the SAN? This will point you in the right direction.

VCP3|VCP4|VSP|VTSP

barhorst · ‎05-27-2008

We've installed several VMs on that particular blade's local SAS drives. All the VMs run fast!

It's only when they run over the SAN connection that they slow down, and only on the third blade!

So what is it about the QLogic QMH2462 and/or the EVA6000 that's causing this grief?

We are running all VMs on ONE LUN.
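For anyone wanting to put numbers on the local-versus-SAN comparison, here is a minimal sketch and not a proper benchmark. It assumes you can run Python inside a guest VM, once with the VM stored on the blade's local disks and once with it on the SAN LUN. The function names and default sizes are made up for illustration.

```python
import os
import tempfile
import time


def time_synced_writes(directory, count=20, block_size=1024 * 1024):
    """Write `count` blocks to a temp file in `directory`, forcing each one
    to disk with fsync, and return the per-block latencies in seconds."""
    payload = os.urandom(block_size)
    latencies = []
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        for _ in range(count):
            start = time.perf_counter()
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # make sure the write reaches the disk
            latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return median (upper median for even counts), worst and total time."""
    ordered = sorted(latencies)
    return {
        "median": ordered[len(ordered) // 2],
        "worst": ordered[-1],
        "total": sum(ordered),
    }
```

Calling `summarize(time_synced_writes("/tmp"))` in both placements and comparing the "worst" figures should show whether the stalls follow the SAN path. A long "worst" next to a normal "median" would match the freeze-then-release pattern described earlier in the thread.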

T3Steve · ‎05-27-2008

Now comes the fun part. You need to look at the fibre portion of your topology. Start with the fibre switch port configurations. Do you see errors on the ports for that ESX host? This could indicate a bad GBIC, cable or HBA.

Are the switches in your enclosure or are you using pass-through to an external switch?

VCP3|VCP4|VSP|VTSP

barhorst · ‎05-27-2008

We've been there.. done that. The Qlogic HBA adaptor has been replaced as well as the GBIC and the Fibre cable.

We're using FC pass thru and we've tried another port there as well. We have examined the errors on the fabric and though we see some errors on that blade, HP didn't seem to think that there were a lot of them. At least not enough to make them think hard failure.

Since we have redundant paths we've tried it going through the other switch - which would seem to eliminate the Fibre topology and fabric as a cause. I'm thinking that it's some inherent problem in the VMware Qlogic driver and/or some issue with the EVA6000 although I don't know why the other two blades work normally.

I've checked that the QLogic cards have the latest BIOS firmware, 1.26.

I'm open for other ideas.

T3Steve · ‎05-27-2008

Out of curiosity, what is the queue length of the suspect HBA, compared to the others?

VCP3|VCP4|VSP|VTSP

barhorst · ‎05-27-2008

I believe this is a BIOS setting?

Frame size is 2048 but I'm not sure about the queue...

In any case, all 3 controllers were used with default settings.

Thanks!

RaulJBA · ‎05-29-2008

Hi, I have a similar problem. In my case there are four servers and only one is running very slowly. Did you find a solution?

Bye..

barhorst · ‎05-30-2008

We think we have the problem solved. We replaced the whole blade and also had MULTIPLE SFPs that were defective. We started seeing more and more errors on our fabric switches, which was the thing that finally clued us in.

IMO, the multiple defective SFPs were the most likely culprit.

I'd recommend replacing SFPs and cables if you see errors.
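Since the giveaway was error counts that kept growing, here is a rough sketch of how to compare two readings of per-port error counters and list the ports whose counts went up. It assumes you copy the counters by hand from your switch's port statistics. The port names and numbers are hypothetical, and it doesn't depend on any particular switch CLI.

```python
def error_deltas(before, after):
    """Return how much each port's error counter grew between two samples."""
    return {port: after[port] - before.get(port, 0) for port in after}


def suspect_ports(before, after, min_growth=1):
    """List (port, growth) pairs for ports whose errors grew, worst first."""
    deltas = error_deltas(before, after)
    growing = [(port, d) for port, d in deltas.items() if d >= min_growth]
    return sorted(growing, key=lambda item: item[1], reverse=True)


# Hypothetical readings taken some time apart.
first_sample = {"port1": 0, "port2": 3, "port3": 12}
second_sample = {"port1": 0, "port2": 3, "port3": 57}

print(suspect_ports(first_sample, second_sample))  # [('port3', 45)]
```

A steady count that never changes is probably old news; a port whose count climbs between samples is the one to chase first (SFP, cable, then HBA).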
