VMware Cloud Community
six4rm
Enthusiast

IBM HS22 - High latency to shared datastore over FibreChannel

Hi,

I've just installed and configured two new hosts for our environment, ESX13 & ESX14. They're identical IBM HS22 blade servers both with ESX 4.1 U1 installed.

In order to make sure everything was working correctly, I powered off an existing VM and migrated it across to one of the new hosts, in the first instance ESX13. I powered on the VM and got to the Windows login, where I entered credentials. It then sat with "applying user profile" on the screen and the blue circle just spinning away for a good 10 minutes. A little confused, I reset the VM and the same thing happened. After much digging around I discovered that the datastore latency from ESX13 to the datastore where this VM resides is huge.

With the VM still powered on and trying to log in, I VMotioned it across to ESX14. On successful migration the VM came to life and logged in to Windows. The latency is clear from the screenshot, where I have included VM disk latency and datastore latency for both hosts; you can see that the migration was just before 11:30.

The VM resides on a NetApp SAN which is connected via FC and accessible from both hosts.

Is there anything I can check as to what could be causing this 'host specific' latency?

EDIT:

Just to add that both ESX hosts have two HBAs and are connected directly to the same fibre switches. Each switch then has two paths to the SAN, one to each controller, giving four paths to each host. The hosts are using the Round Robin path selection policy and have been optimised for the NetApp SAN using the NetApp Virtual Storage Console plugin for vCenter.
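For anyone wanting to compare the path state between two hosts like this, a quick sketch of the checks on the ESX 4.x service console (the device names and sample output below are hypothetical, for illustration only):

```shell
# On each host, list the paths the storage stack sees and compare:
#
#   esxcfg-mpath -b          # every path, grouped by device
#   esxcli nmp device list   # per-device path selection policy (Round Robin?)
#
# Hypothetical excerpt of path output for one LUN on one host:
sample='vmhba1:C0:T0:L5 state:active
vmhba1:C0:T1:L5 state:active
vmhba2:C0:T0:L5 state:active
vmhba2:C0:T1:L5 state:active'

# With two HBAs and two controllers, each host should report four
# active paths per LUN; a mismatch between hosts is worth chasing.
printf '%s\n' "$sample" | grep -c 'state:active'
```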

EDIT2:

I've just migrated another VM from our other SAN (an IBM DS3400 connected through the same fibre switches) to this host and get exactly the same behaviour. I've checked the fibre switch config and there are no differences between the ports for ESX13 & ESX14, so I'm really at a loss here. It's got to be an issue with the host or HBAs?

0 Kudos
25 Replies
HannaL
Enthusiast

If your problem went away when you moved the blade to a new slot, that indicates a problem with the slot itself. If you call back and ask for the case to be escalated so that they can determine whether the slot is bad and send you a replacement, you might be happier. I see it happen a lot.

Hope that helps, Hanna --- BSCS, VCP2, VCP VI3, VCP vSphere, VCP 5 https://www.ibm.com/developerworks/mydeveloperworks/blogs/vmware-support-ibm
0 Kudos
HannaL
Enthusiast

This is a general message that just means the latency of the I/O has reached a threshold that triggered it. Basically, it means I/O is slow. It can be due to multiple factors, likely not just one thing. Here is more info to get started analyzing:

http://kb.vmware.com/kb/2001676

http://kb.vmware.com/kb/1008205
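To put numbers behind that analysis, one approach is esxtop in batch mode on the affected host and then pulling out the per-device latency counters. A minimal sketch, assuming ESX 4.x (the CSV excerpt, hostname, and device name below are hypothetical):

```shell
# Capture latency counters in batch (CSV) mode on the affected host:
#
#   esxtop -b -d 5 -n 10 > /tmp/esxtop-latency.csv
#
# Hypothetical one-field excerpt of the resulting CSV (header,value):
sample='"\\ESX13\Physical Disk(vmhba1:C0:T0:L5)\Average Driver MilliSec/Command","250.12"'

# Extract the value. As a rough rule of thumb, sustained device latency
# (DAVG) above ~25 ms points at the array or fabric, while high kernel
# latency (KAVG) points at the host side (queueing, path thrashing).
latency=$(printf '%s\n' "$sample" | awk -F'","' '{gsub(/"/, "", $2); print $2}')
printf '%s\n' "$latency"
```

Comparing the same counters side by side on ESX13 and ESX14 should show whether the delay is host-specific or array-wide.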

Hope that helps, Hanna --- BSCS, VCP2, VCP VI3, VCP vSphere, VCP 5 https://www.ibm.com/developerworks/mydeveloperworks/blogs/vmware-support-ibm
0 Kudos
Laromodo
Contributor

Hello,
For us, we fixed our issue by upgrading the Cisco 4G FC switch modules in our chassis to the latest firmware provided by Cisco, not the one offered on the IBM website. Since this was done, everything has worked fine.
Thanks
0 Kudos
six4rm
Enthusiast

That's interesting. I'm going to be doing this in a few weeks time anyway as the current version of Cisco SAN-OS that is running on my fibre switches doesn't support NPIV, which is required for our latest project. I'll definitely test this issue out after the upgrade to see whether I can get my HS22s running in Slot13 without the previous latency issues.

0 Kudos
grosaba
Contributor

Had the same problem described in the first post, also on HS22 blades. It doesn't seem slot specific, except that it occurs for the blades in the last two occupied slots in the BladeCenter.

The issue started with a bus error on one of the blades. After updating firmware on the blade UEFI (Build P9E156C, Version 1.17, released 02/03/2012) and the AMM (Build BPET62J, File CNETCMUS.PKT, released 01/19/2012), that cleared, but the latency remained.

Turns out that the DS3512 SAN controller had a false problem with the battery:

Event type: 210C
Description: Controller cache battery failed
Event specific codes: 0/0/0
Event category: Internal
Component type: Battery Pack
Component location: Enclosure 0, Controller 2, Slot 2
Logged by: Controller in slot B

This happened because of a learn cycle instigated 3 days early for some unknown reason.  I didn't see this error until after the Blade firmware was updated (!). 

Moved all of my LUNs over to controller A as preferred path temporarily.

Reset the controller via Tools > Execute Script, entering: reset controller [b]; then Tools > Execute Script Only (where b was the affected controller).
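For reference, the same reset can be driven from a management station with SMcli (the CLI for the DS3500-series arrays) rather than the GUI script editor. A sketch, with hypothetical controller IPs; build the command as a string first so it can be reviewed before touching the array:

```shell
# Hypothetical controller management IPs; substitute your own, and
# double-check the controller letter before running anything.
cmd='SMcli 192.168.10.1 192.168.10.2 -c "reset controller [b];"'
printf '%s\n' "$cmd"
# Only run the command itself once the target controller and IPs
# have been verified against the array's event log.
```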

That fixed the issue, but we then updated the controller firmware and NVSRAM (Controller_Code_07734000) anyway. We still need to update the drive firmware after we shut them down.

0 Kudos
budlomaxi
Contributor

Hi six4rm,

Did you do the upgrade and did it fix the issue? I too am seeing the issue.

Thanks,

0 Kudos