Solved: No signs of life on the new HBA after changing preferred paths to LUNs

frankdenneman · ‎08-16-2007

Hi,

In the current situation all the ESX host servers have two QLogic HBAs connected to an HP EVA8000.

ESX picks up the EVA as an active/active SAN and the load balance type defaults to fixed.

Which is as expected. All active and preferred paths are assigned to the first HBA and the first SP (i.e. vmhba1:0:1, vmhba1:0:2).

In other words no one has touched the default settings.

I would like to distribute the I/O load between both HBAs.

I do not want to change the managing controller on the SAN end.

vmhba1 is going to manage the active and preferred paths for the datastores with odd LUN IDs.

vmhba2 is going to manage the active and preferred paths for the datastores with even LUN IDs.

Quite straightforward configuration.
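
The odd/even split described above can be written down as a small Python sketch. The function name `preferred_path` is hypothetical, and using target 0 on both adapters is an assumption taken from the path names in this post; this only computes the intended path name and does not change anything on an ESX host.

```python
def preferred_path(lun_id: int, target: int = 0) -> str:
    """Return the intended preferred path name for a LUN.

    Odd LUN IDs go to vmhba1, even LUN IDs go to vmhba2 (the plan in this
    post). The target number is an assumption; adjust it to your fabric.
    """
    hba = 1 if lun_id % 2 == 1 else 2
    return f"vmhba{hba}:{target}:{lun_id}"


# Print the planned preferred path for a few LUN IDs.
for lun in (7, 8, 9):
    print(lun, preferred_path(lun))
```

A list generated this way can be compared against what esxcfg-mpath -l reports as "active preferred" for each LUN.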

On the first try I used the Manage Paths option from the Virtual Infrastructure client.

The preferred paths were changed from vmhba1:0:2 to vmhba2:0:2, etc.

To check whether vmhba2 received any I/O commands I checked with esxtop (disk adapter view, d).

But all the commands were being issued by vmhba1; vmhba2 is sitting there quietly with 0.00 commands.

No reads or writes for vmhba2 either.

I checked the preferred path in the service console (COS) via esxcfg-mpath -l, which showed me this:

Disk vmhba1:0:7 /dev/sdl (512000MB) has 4 paths and policy of Fixed

FC 7:10.0 50060b000085d16c<->50001fe1500937dc vmhba1:0:7 On active preferred

FC 7:10.0 50060b000085d16c<->50001fe1500937d8 vmhba1:1:7 On

FC 8:12.0 50060b0000a7f5f6<->50001fe1500937dd vmhba2:0:7 On

FC 8:12.0 50060b0000a7f5f6<->50001fe1500937d9 vmhba2:1:7 On

Disk vmhba1:0:8 /dev/sdm (512000MB) has 4 paths and policy of Fixed

FC 7:10.0 50060b000085d16c<->50001fe1500937dc vmhba1:0:8 On

FC 7:10.0 50060b000085d16c<->50001fe1500937d8 vmhba1:1:8 On

FC 8:12.0 50060b0000a7f5f6<->50001fe1500937dd vmhba2:0:8 On active preferred

FC 8:12.0 50060b0000a7f5f6<->50001fe1500937d9 vmhba2:1:8 On

Disk vmhba1:0:9 /dev/sdn (512000MB) has 4 paths and policy of Fixed

FC 7:10.0 50060b000085d16c<->50001fe1500937dc vmhba1:0:9 On active preferred

FC 7:10.0 50060b000085d16c<->50001fe1500937d8 vmhba1:1:9 On

FC 8:12.0 50060b0000a7f5f6<->50001fe1500937dd vmhba2:0:9 On

FC 8:12.0 50060b0000a7f5f6<->50001fe1500937d9 vmhba2:1:9 On

After seeing this I'm under the impression that the active route for the vmkernel to send I/O for the 8th lun is vmhba2:0:8.
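
Reading the "active preferred" path per LUN out of a listing like the one above can be sketched in Python. This is a minimal parser that assumes the output format shown in this post; the names `SAMPLE` and `active_preferred` are made up for the example, and the sample is a shortened copy of the listing.

```python
import re

# Shortened copy of the esxcfg-mpath -l output from this post.
SAMPLE = """\
Disk vmhba1:0:7 /dev/sdl (512000MB) has 4 paths and policy of Fixed
FC 7:10.0 50060b000085d16c<->50001fe1500937dc vmhba1:0:7 On active preferred
FC 8:12.0 50060b0000a7f5f6<->50001fe1500937dd vmhba2:0:7 On
Disk vmhba1:0:8 /dev/sdm (512000MB) has 4 paths and policy of Fixed
FC 7:10.0 50060b000085d16c<->50001fe1500937dc vmhba1:0:8 On
FC 8:12.0 50060b0000a7f5f6<->50001fe1500937dd vmhba2:0:8 On active preferred
"""

NAME_RE = re.compile(r"vmhba\d+:\d+:\d+")
ACTIVE_RE = re.compile(r"(vmhba\d+:\d+:\d+)\s+On\s+active\s+preferred\s*$")


def active_preferred(text):
    """Map each device's canonical name to its active preferred path."""
    result = {}
    device = None
    for line in text.splitlines():
        line = line.strip()
        if "paths and policy" in line:
            # Header line: remember the canonical name of this device.
            match = NAME_RE.search(line)
            device = match.group(0) if match else None
        elif device is not None:
            match = ACTIVE_RE.search(line)
            if match:
                result[device] = match.group(1)
    return result


for device, path in active_preferred(SAMPLE).items():
    print(device, "->", path)
```

For the sample this reports vmhba1:0:7 for LUN 7 and vmhba2:0:8 for LUN 8, even though both LUNs are still named after vmhba1.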

Why isn't esxtop showing any signs of life when viewing vmhba2?

ADAPTR CID TID LID WID NCHNS NTGTS NLUNS NVMS AQLEN LQLEN WQLEN ACTV QUED %USD LOAD CMDS/s READS/s WRITES/s MBREAD/s MBWRTN/s

vmhba0 - - - - 1 1 1 1 128 0 0 0 0 0 0.00 13.69 0.00 13.69 0.00 0.58

vmhba1 0 - - - 1 4 18 27 4096 0 0 0 0 0 0.00 27.97 0.00 27.97 0.00 0.41

vmhba2 - - - - 1 4 0 0 4096 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00

What also seems odd is that there are zero LUNs listed for vmhba2 (NLUNS). Or is this because ESX assigns the first discovered path as the name for the LUN (i.e. vmhba1:0:1)?

Am I forgetting a step in changing the active paths?

I also tried rescanning the storage adapters, but vmhba2 won't issue any commands.

There are Virtual Machines using the datastores on the LUNs whose paths are being altered.

I've also tried to VMotion a Virtual Machine off and back again, but the commands issued keep on showing that nasty 0.

Blogging: frankdenneman.nl Twitter: @frankdenneman Co-author: vSphere 4.1 HA and DRS technical Deepdive, vSphere 5x Clustering Deepdive series

BUGCHK · ‎08-16-2007

> All active and preferred paths are assigned to the first HBA and the first SP

Normally, WWPNs that end in 8, 9, A or B belong to the A controller and those that end in C, D, E or F belong to the B controller. Your fabric cabling defines which port is 'seen' first.
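
The rule of thumb above is easy to express in Python. This is only a sketch of that rule, not an HP-documented API; the function name `eva_controller` is hypothetical and the WWPNs are the target ports from the esxcfg-mpath output in the first post.

```python
def eva_controller(wwpn: str) -> str:
    """Guess the EVA controller from the last hex digit of a port WWPN.

    Rule of thumb from this post: 8, 9, A, B -> controller A;
    C, D, E, F -> controller B. Anything else returns "unknown".
    """
    digit = wwpn.strip().lower()[-1:]
    if digit and digit in "89ab":
        return "A"
    if digit and digit in "cdef":
        return "B"
    return "unknown"


# Target port WWPNs taken from the first post's esxcfg-mpath listing.
for wwpn in ("50001fe1500937dc", "50001fe1500937d8",
             "50001fe1500937dd", "50001fe1500937d9"):
    print(wwpn, "-> controller", eva_controller(wwpn))
```

With those four ports, the ones ending in c and d map to controller B and the ones ending in 8 and 9 map to controller A.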

When I initialize an EVA, I start the top controller first so it is recognized as the A controller. However, Command View EVA cannot determine the physical order, so it orders by something else. This can result in the top controller being displayed in the bottom position. Use the LOCATE function.

> I do not want to change the managing controller on the SAN end.

The EVA can transfer virtual disk ownership between controllers automatically, depending on load.

> What also seems odd is that there are zero LUNs listed for vmhba2 (NLUNS). Or is this because ESX assigns the first discovered path as the name for the LUN (i.e. vmhba1:0:1)?

As far as I know, there is no documentation of _how_ ESX selects the canonical path.

> I also tried rescanning the storage adapters, but vmhba2 won't issue any commands.

I bet it does, but the output from some utilities is inconsistent. Some of them, like esxcfg-mpath, can tap below the multipath layer; others don't do that.

> Why isn't esxtop showing any signs of life when viewing vmhba2

Maybe esxtop has the same problem. Sorry, I don't have time right now to try it myself.

Can you check the Fibre Channel switch port's counters?

> I've also tried to vmotion a Virtual Machine off and back again

But that just moves the VM's memory across the LAN, not any disk files!

frankdenneman · ‎08-17-2007

Hi,

Thanks for the reply

> Normally, WWPNs that end in 8,9,A,B belong to the A controller and C,D,E,F belong to the B controller. Your fabric cabling defines which port is 'seen' first.

The SAN admin made Controller B the managing controller for the LUNs on the EVA.

D8 & D9 are port 1 and port 2 of Controller A; DC & DD are port 1 and port 2 of Controller B, respectively.

> The EVA can transfer virtual disk ownership between controllers automatically, depending on load.

Yes, I'm aware of that. But this behaviour occurs when I/O is constantly sent to the non-owning controller.

In this config, the data is sent to the owning controller.

> As far as I know, there is no documentation of _how_ ESX selects the canonical path.

I read it somewhere and it took me some time to find it again, but it's listed in the ESX 2.1 documentation:

The report identifies disks by their canonical name. The canonical name for a disk is the first path ESX Server finds to the disk. Since ESX Server begins its scans at the first controller and the lowest device number, the first path (and thus the canonical name of the disk) is the path with the lowest number controller and device number. For example, if the paths to a disk are vmhba0:0:2, vmhba1:0:2, vmhba0:1:2 and vmhba1:1:2, then the canonical name of the disk is vmhba0:0:2.

http://www.vmware.com/support/esx21/doc/esx21admin_multipath_disks.html
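
The selection rule quoted from the ESX 2.1 documentation can be sketched in Python as "take the path with the lowest adapter, target and LUN numbers". The helper name `canonical_name` is hypothetical, and treating "first path found" as "numerically lowest" is an assumption based on that quote; later ESX versions may scan differently.

```python
def canonical_name(paths):
    """Pick the path with the lowest (adapter, target, lun) numbers."""
    def sort_key(path):
        hba, target, lun = path.split(":")
        # Strip the "vmhba" prefix so adapters compare numerically.
        return (int(hba[len("vmhba"):]), int(target), int(lun))

    return min(paths, key=sort_key)


# The example from the quoted documentation.
paths = ["vmhba0:0:2", "vmhba1:0:2", "vmhba0:1:2", "vmhba1:1:2"]
print(canonical_name(paths))  # vmhba0:0:2
```

Applied to the hosts in this thread, a LUN reachable over vmhba1 and vmhba2 keeps a vmhba1 name no matter which path is preferred.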

But what I meant to say was that it is strange that only vmhba1 has LUNs assigned to it when viewing ESXTOP.

> I bet it does, but the output from some utilities is inconsistent. Some of them, like esxcfg-mpath, can tap below the multipath layer; others don't do that.

What exactly do you mean by the multipath layer?

As far as I know, queueing is done beneath the VMFS layer, before the I/O is handled by the disk scheduler and the storage device driver.

Do you know which layer the esxcfg commands query?

> > Why isn't esxtop showing any signs of life when viewing vmhba2
>
> Maybe esxtop has the same problem. Sorry, I don't have time right now to try it myself. Can you check the Fibre Channel switch port's counters?

I've checked the switch port counter of the second HBA and it's slowly incrementing. So it seems that the ESX host isn't using the adapter yet.

> > I've also tried to vmotion a Virtual Machine off and back again
>
> But that just moves the VM's memory across the LAN, not any disk files!

I know, it was just a desperate try (desperate times demand desperate measures).

The VM was the only machine using that LUN on the host at that time; I was hoping that moving it would release some sort of lock.

So when moving the VM back, it would see the new path. But that wasn't the case.

I'm going to reboot the ESX server and see if that solves the problem.

But that's something I didn't expect to have to do for such a (minor) change.

BUGCHK · ‎08-17-2007

By multipath layer, I mean the component that decides which path to use for I/O. It is below the VMFS, but on top of the SCSI device drivers.

+-----------------------------+
|            VMFS             |
+-----------------------------+
| vmhba1:0:1 (canonical path) |
| ===== multipath layer ===== |
|              !              |
+--------------+--------------+
|    vmhba1    |    vmhba2    |
+--------------+--------------+

ESXTOP appears to grab its data from the top of the multipath layer!

I do a simple copy between two VMFSes at LUN address 1 and 2.

# esxcfg-mpath -l

Disk vmhba0:0:0 /dev/cciss/c0d0 (69997MB) has 1 paths and policy of Fixed

Local 6:0.0 vmhba0:0:0 On active preferred

Disk vmhba0:1:0 /dev/cciss/c0d1 (69981MB) has 1 paths and policy of Fixed

Local 6:0.0 vmhba0:1:0 On active preferred

RAID Controller (SCSI-3) vmhba1:0:0 (0MB) has 4 paths and policy of Fixed

FC 11:0.0 50060b00XXXXXXX0<->50001fe1XXXXXXX8 vmhba1:0:0 On active preferred

FC 11:0.0 50060b00XXXXXXX0<->50001fe1XXXXXXXc vmhba1:1:0 On

FC 11:0.1 50060b00XXXXXXX2<->50001fe1XXXXXXX9 vmhba2:0:0 On

FC 11:0.1 50060b00XXXXXXX2<->50001fe1XXXXXXXd vmhba2:1:0 On

Disk vmhba1:0:1 /dev/sda (102400MB) has 4 paths and policy of Fixed

FC 11:0.0 50060b00XXXXXXX0<->50001fe1XXXXXXX8 vmhba1:0:1 On

FC 11:0.0 50060b00XXXXXXX0<->50001fe1XXXXXXXc vmhba1:1:1 On active preferred

FC 11:0.1 50060b00XXXXXXX2<->50001fe1XXXXXXX9 vmhba2:0:1 On

FC 11:0.1 50060b00XXXXXXX2<->50001fe1XXXXXXXd vmhba2:1:1 On

Disk vmhba1:0:2 /dev/sdc (51200MB) has 4 paths and policy of Fixed

FC 11:0.0 50060b00XXXXXXX0<->50001fe1XXXXXXX8 vmhba1:0:2 On

FC 11:0.0 50060b00XXXXXXX0<->50001fe1XXXXXXXc vmhba1:1:2 On

FC 11:0.1 50060b00XXXXXXX2<->50001fe1XXXXXXX9 vmhba2:0:2 On active preferred

FC 11:0.1 50060b00XXXXXXX2<->50001fe1XXXXXXXd vmhba2:1:2 On

Disk vmhba1:0:11 /dev/sdb (102400MB) has 4 paths and policy of Fixed

FC 11:0.0 50060b00XXXXXXX0<->50001fe1XXXXXXX8 vmhba1:0:11 On active preferred

FC 11:0.0 50060b00XXXXXXX0<->50001fe1XXXXXXXc vmhba1:1:11 On

FC 11:0.1 50060b00XXXXXXX2<->50001fe1XXXXXXX9 vmhba2:0:11 On

FC 11:0.1 50060b00XXXXXXX2<->50001fe1XXXXXXXd vmhba2:1:11 On

#

2:58:28pm up 27 days, 13:17, 45 worlds; CPU load average: 0.09, 0.05, 0.04

ADAPTR CID TID LID WID NCHNS NTGTS NLUNS NVMS AQLEN LQLEN WQLEN ACTV QUED %USD LOAD CMDS/

vmhba0 - - - - 1 2 2 2 128 0 0 0 0 0 0.00 6.2

vmhba1 0 0 - - 1 1 4 5 4096 0 0 4 0 0 0.00 920.1

vmhba1 0 1 - - 1 1 0 0 4096 0 0 0 0 0 0.00 0.0

vmhba2 0 0 - - 1 1 0 0 4096 0 0 0 0 0 0.00 0.0

vmhba2 0 1 - - 1 1 0 0 4096 0 0 0 0 0 0.00 0.0

I DO see that traffic by checking the adapter counters:

# tail -11 /proc/scsi/qla2300/[01]

==> /proc/scsi/qla2300/0 <==

SCSI LUN Information:

(Id:Lun) * - indicates lun is not registered with the OS.

( 0: 0): Total reqs 11, Pending reqs 0, flags 0x0, 0:0:81,

( 0: 1): Total reqs 8079, Pending reqs 0, flags 0x0, 0:0:81,

( 0: 2): Total reqs 70731, Pending reqs 0, flags 0x0, 0:0:81,

( 0:11): Total reqs 163677, Pending reqs 0, flags 0x0, 0:0:81,

( 1: 0): Total reqs 9, Pending reqs 0, flags 0x0, 0:0:82,

( 1: 1): Total reqs 1010022, Pending reqs 0, flags 0x0, 0:0:82, =====>

( 1: 2): Total reqs 7951, Pending reqs 0, flags 0x0, 0:0:82,

( 1:11): Total reqs 7951, Pending reqs 0, flags 0x0, 0:0:82,

Bus:Function = 0xb:0x0

==> /proc/scsi/qla2300/1 <==

SCSI LUN Information:

(Id:Lun) * - indicates lun is not registered with the OS.

( 0: 0): Total reqs 9, Pending reqs 0, flags 0x0, 1:0:81,

( 0: 1): Total reqs 7950, Pending reqs 0, flags 0x0, 1:0:81,

( 0: 2): Total reqs 169218, Pending reqs 0, flags 0x0, 1:0:81, <=====

( 0:11): Total reqs 7950, Pending reqs 0, flags 0x0, 1:0:81,

( 1: 0): Total reqs 9, Pending reqs 0, flags 0x0, 1:0:82,

( 1: 1): Total reqs 7950, Pending reqs 0, flags 0x0, 1:0:82,

( 1: 2): Total reqs 7950, Pending reqs 0, flags 0x0, 1:0:82,

( 1:11): Total reqs 7950, Pending reqs 0, flags 0x0, 1:0:82,

Bus:Function = 0xb:0x1

#

# tail -11 /proc/scsi/qla2300/[01]

==> /proc/scsi/qla2300/0 <==

SCSI LUN Information:

(Id:Lun) * - indicates lun is not registered with the OS.

( 0: 0): Total reqs 11, Pending reqs 0, flags 0x0, 0:0:81,

( 0: 1): Total reqs 8079, Pending reqs 0, flags 0x0, 0:0:81,

( 0: 2): Total reqs 70731, Pending reqs 0, flags 0x0, 0:0:81,

( 0:11): Total reqs 163677, Pending reqs 0, flags 0x0, 0:0:81,

( 1: 0): Total reqs 9, Pending reqs 0, flags 0x0, 0:0:82,

( 1: 1): Total reqs 1010717, Pending reqs 1, flags 0x0, 0:0:82, =====>

( 1: 2): Total reqs 7951, Pending reqs 0, flags 0x0, 0:0:82,

( 1:11): Total reqs 7951, Pending reqs 0, flags 0x0, 0:0:82,

Bus:Function = 0xb:0x0

==> /proc/scsi/qla2300/1 <==

SCSI LUN Information:

(Id:Lun) * - indicates lun is not registered with the OS.

( 0: 0): Total reqs 9, Pending reqs 0, flags 0x0, 1:0:81,

( 0: 1): Total reqs 7950, Pending reqs 0, flags 0x0, 1:0:81,

( 0: 2): Total reqs 172779, Pending reqs 0, flags 0x0, 1:0:81, <=====

( 0:11): Total reqs 7950, Pending reqs 0, flags 0x0, 1:0:81,

( 1: 0): Total reqs 9, Pending reqs 0, flags 0x0, 1:0:82,

( 1: 1): Total reqs 7950, Pending reqs 0, flags 0x0, 1:0:82,

( 1: 2): Total reqs 7950, Pending reqs 0, flags 0x0, 1:0:82,

( 1:11): Total reqs 7950, Pending reqs 0, flags 0x0, 1:0:82,

Bus:Function = 0xb:0x1

#
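
Spotting which counters moved between two such listings is easier with a small script. The Python sketch below assumes the /proc/scsi/qla2300 output format shown above; the names `parse_counters` and `busy_paths` are made up for this example, and the two samples are shortened copies of the listings in this post.

```python
import re

HEADER_RE = re.compile(r"==> /proc/scsi/qla2300/(\d+) <==")
LUN_RE = re.compile(r"\(\s*(\d+):\s*(\d+)\):\s*Total reqs (\d+)")

# Shortened copies of the two listings above.
BEFORE = """\
==> /proc/scsi/qla2300/0 <==
( 1: 1): Total reqs 1010022, Pending reqs 0, flags 0x0, 0:0:82,
( 1: 2): Total reqs 7951, Pending reqs 0, flags 0x0, 0:0:82,
==> /proc/scsi/qla2300/1 <==
( 0: 2): Total reqs 169218, Pending reqs 0, flags 0x0, 1:0:81,
( 1: 2): Total reqs 7950, Pending reqs 0, flags 0x0, 1:0:82,
"""
AFTER = """\
==> /proc/scsi/qla2300/0 <==
( 1: 1): Total reqs 1010717, Pending reqs 1, flags 0x0, 0:0:82,
( 1: 2): Total reqs 7951, Pending reqs 0, flags 0x0, 0:0:82,
==> /proc/scsi/qla2300/1 <==
( 0: 2): Total reqs 172779, Pending reqs 0, flags 0x0, 1:0:81,
( 1: 2): Total reqs 7950, Pending reqs 0, flags 0x0, 1:0:82,
"""


def parse_counters(text):
    """Return {(adapter, target, lun): total_reqs} from the tail output."""
    counters = {}
    adapter = None
    for line in text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            adapter = int(header.group(1))
            continue
        match = LUN_RE.search(line)
        if match and adapter is not None:
            target, lun, total = (int(g) for g in match.groups())
            counters[(adapter, target, lun)] = total
    return counters


def busy_paths(before, after):
    """Return only the paths whose request counter increased."""
    deltas = {}
    for key, total in after.items():
        delta = total - before.get(key, 0)
        if delta > 0:
            deltas[key] = delta
    return deltas


for key, delta in sorted(busy_paths(parse_counters(BEFORE),
                                    parse_counters(AFTER)).items()):
    print("adapter %d target %d lun %d: +%d reqs" % (key + (delta,)))
```

For these samples, only target 1 LUN 1 on the first adapter and target 0 LUN 2 on the second adapter show new requests, which appears to line up with the two "active preferred" paths in the esxcfg-mpath listing.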

frankdenneman · ‎08-17-2007

Like you said before, it seems that esxtop uses the canonical path name for its reporting. That would explain why esxtop doesn't show any activity on vmhba2, and why vmhba1 has data in the NLUNS column and vmhba2 does not. VMware should take a look at this "bug".

When monitoring the /proc/scsi/qla2300/0 and 1 files, a steady stream of requests and interrupts is generated on vmhba2.

Thanks for the help, Bugchk. Please respond so I can award you the correct points.

BUGCHK · ‎08-17-2007

> requests and interrupts is generated on the vmhba2

OK, I was just puzzled, because you said you checked the switch port and it only showed a bit of traffic. Maybe it was the wrong port?

In case you missed it, you can get the adapter WWNs from /proc, too:

# grep 'scsi-qla.*-adapter' /proc/scsi/qla2300/[01]

/proc/scsi/qla2300/0:scsi-qla0-adapter-node=200000e08XXXXXXa;

/proc/scsi/qla2300/0:scsi-qla0-adapter-port=210000e08XXXXXXa;

/proc/scsi/qla2300/1:scsi-qla1-adapter-node=200100e08XXXXXXa;

/proc/scsi/qla2300/1:scsi-qla1-adapter-port=210100e08XXXXXXa;

#

frankdenneman · ‎08-17-2007

It was just plain stupid: I was monitoring the ports on the fibre switch and another admin had placed the server into maintenance mode.

ESXTOP is not the tool to use when monitoring per-HBA I/O traffic.

Thanks for the help!

abaum · ‎08-27-2007

Question on how you got this data...I am trying to duplicate your steps, but can't seem to see the adapter counters. I use Emulex adapters so I took your tail command and substituted my adapter info. All I get is:

Emulex LightPulse FC SCSI 7.3.2_vmw2_1

HP FC2243 4Gb PCI-X 2.0 DC HBA on PCI bus 13 device 08 irq 201

SerialNum: MY10637E0W

Firmware Version: 2.10A7 (B2F2.10A7)

Hdw: 1036406d

VendorId: 0xfd0010df

Portname: 10:00:00:00:c9:59:f8:06 Nodename: 20:00:00:00:c9:59:f8:06

Link Up - Ready:

PortID 0xdd0028

Fabric

Current speed 2G

lpfc0t00 DID dd0006 WWPN 50:00:1f:e1:50:09:b4:1c WWNN 50:00:1f:e1:50:09:b4:10

lpfc0t01 DID dd0008 WWPN 50:00:1f:e1:50:09:b4:18 WWNN 50:00:1f:e1:50:09:b4:10

lpfc0t02 DID dd0018 WWPN 50:00:1f:e1:50:0a:e9:b8 WWNN 50:00:1f:e1:50:0a:e9:b0

I don't see any counter info. Is there another way to get the counters?

adam

frankdenneman · ‎08-28-2007

Adam,

Unfortunately the proc node /proc/scsi/lpfc/0 won't show the same info as its QLogic counterpart.

An alternative is to watch /proc/vmware/scsi/vmhba1/stats.

5truja · ‎12-12-2007

Hi,

Need help,

I'm trying to use esxtop for storage monitoring and want to expand to LUN and even world level. I go to d-mode (storage) and enter "l", and it asks me for an adapter name. (In the VI client the active path to my LUN is vmhba1:0:2.) I enter "vmhba1". The next question is the channel name (what is that in this case? Isn't it supposed to be hba-target-lun-partition format?) and I enter "0". Then for target name I enter 0, and for LUN name I enter 2. It tells me non-existing lun...etc. The channel name is confusing me, and I would like to get to world level. What am I doing wrong here?

Thanks

MCTIP,VCP3,VCP4,VCP5,VCA-CLOUD,VCP-CLOUD,VTSP
