VMware Cloud Community
Artful_Dodger
Contributor

HP EVA SAN Issues

Hi,

We appear to be having issues with ESX 3.0.2 and our EVA8000 SAN. Throughout the day we get quite a few of these error messages:

Dec 11 13:47:37 esx01 vmkernel: 2:23:44:28.236 cpu5:1759)Init: 740:Received INIT from world 1759
Dec 11 13:57:27 esx01 vmkernel: 2:23:54:17.490 cpu2:1034)FS3: 1717:Checking if lock holders are live for lock [type 10c00002 offset 14313472 v 338,hb offset 3647488
Dec 11 13:57:27 esx01 vmkernel: gen 8559, mode 1, owner 475aac1c-d524069b-2dd8-001a4be5f6be mtime 256881]
Dec 11 13:58:44 esx01 vmkernel: 2:23:55:35.306 cpu0:1044)SCSI: 8062:vmhba1:0:6:0 Retry (unit attn)
Dec 11 13:58:44 esx01 last message repeated 3 times
Dec 11 13:58:44 esx01 vmkernel: 2:23:55:35.307 cpu0:1044)SCSI: 8062:vmhba1:0:6:0 Retry (unit attn)
Dec 11 13:59:00 esx02 vmkernel: 3:01:15:35.568 cpu5:1044)SCSI: 8062:vmhba1:0:6:0 Retry (unit attn)
Dec 11 13:59:00 esx02 last message repeated 3 times
Dec 11 13:59:00 esx02 vmkernel: 3:01:15:35.569 cpu5:1044)SCSI: 8062:vmhba1:0:6:0 Retry (unit attn)

During the night, when our backups run, we get a lot of these errors, and about 75% of the time at least one of our VMs will fail to back up fully:

Dec 11 01:07:58 esx01 vmkernel: 2:11:04:50.325 cpu1:1049)LinSCSI: 2608:Forcing host status from 7 to SCSI_HOST_OK

Dec 11 01:07:58 esx01 vmkernel: 2:11:04:50.325 cpu1:1049)LinSCSI: 2610:Forcing device status from SDSTAT_GOOD to SDSTAT_BUSY

The ESX hosts are set to host type "VMware" in the host configuration on the EVA, and the path policy is set to Fixed in ESX with only one path active. However, one thing I have noticed is that only 3 paths show up in ESX on each HBA, so only 6 paths in total. I'm not sure whether this is part of the problem or not. On our Cisco switches all the EVA controllers are zoned to each HBA, so I am confused as to why only 6 paths are showing up.
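
For reference, this is roughly how I have been counting the paths from the service console. It's only a sketch - the exact output layout varies between ESX 3.x builds, and vmhba1:0:6 is simply the LUN taken from the vmkernel messages above:

# List every path ESX currently sees, grouped per LUN, along with the
# multipathing policy (Fixed/MRU) and which path is active/preferred.
esxcfg-mpath -l

# Narrow the listing to the LUN logging the "unit attn" retries
# (vmhba1:0:6 from the log above); adjust the number of context lines
# to however many paths your build prints per LUN.
esxcfg-mpath -l | grep -A 8 "vmhba1:0:6"

The totals here should match what the VI Client shows; if they do, the missing paths are a zoning/presentation issue rather than something ESX is hiding.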

If anybody has any ideas about what might be happening please let me know.

admin
Immortal

As a rule I normally have 4 paths, with 1 active per ESX host and the policy set to Fixed.

That's the name I wanted, Artful!

jhanekom
Virtuoso

You should definitely be seeing four paths on each HBA - two to each controller on each fabric. The fact that you're not seeing this would usually indicate a zoning problem - get your SAN team to go over the zoning config with a fine-tooth comb again.

Are you perhaps using different fixed paths on your various ESX hosts (i.e. one host talks over vmhba0:1, another over vmhba0:2, etc.)? This might be the cause of the other messages you're seeing - it's almost as if the LUN is failing over à la active/passive, even though the EVA 8000 is active/active.

Chances are at least some of the problems will disappear if you sort out the zoning issue, so pay attention to that first.

The following HP best-practices document might also be useful: http://h71019.www7.hp.com/ActiveAnswers/downloads/VMware3_StorageWorks_BestPractice.pdf

Artful_Dodger
Contributor

The zoning looks fine to me. I'm not a SAN/Fibre expert but each HBA for each server is in its own zone with the 4 controllers of the EVA.

Our SAN guy is away at the moment but back tomorrow, so I will pick his brains. To be honest, though, everything looks the same as on all our other Windows servers with SAN-attached drives.

As far as I can tell our SAN is active/active and all the hosts are using the first path. The only thing that has just occurred to me to check is whether the first HBA in ESX is actually the first physical HBA.

i.e. where I have zoned Card 1 to Zone_A and Card 2 to Zone_B, that I didn't accidentally get the pair of cables the wrong way round so that Card 1 is actually zoned to Zone_B on one of the servers. I will double-check all of this.
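
For what it's worth, this is how I plan to confirm which WWPN belongs to which physical card before comparing it against the zoning. A sketch only - the /proc path below assumes QLogic HBAs using the qla2300 driver; Emulex cards expose their details under a different /proc/scsi directory:

# Each QLogic HBA instance gets its own file under /proc/scsi/qla2300/.
ls /proc/scsi/qla2300/

# The adapter-port entry in each file is that HBA's WWPN, which is what
# should appear in the matching switch zone (Zone_A or Zone_B).
grep -i "adapter-port" /proc/scsi/qla2300/*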

rminick
Contributor

Any luck on this one? We are seeing the same errors on all our ESX hosts connected to an EVA 5000 and an EVA 8000.

Richard J Minick, VCP
dconvery
Champion

Check the VCS / XCS code levels too. They will need to be listed on the HCL.

Dave Convery, VCDX-DCV #20 ** http://www.tech-tap.com ** http://twitter.com/dconvery ** "Careful. We don't want to learn from this." -Bill Watterson, "Calvin and Hobbes"
rminick
Contributor

I've verified everything we have is on the VMware HCL. I opened a case with VMware and they have not really helped thus far. They recommended patch 1002428, but we see this on systems with Emulex HBAs, not just the QLogic ones. It sounds to me like a multipathing/MPIO problem. Our SAN guy talked to the storage folks at HP and they say it's a VMware issue. I have verified the host type is set to VMware in Command View. I hear chit-chat about others getting the same errors, but nobody has posted a fix. :(

Any ideas? I'm going to put that patch on anyway and then see if making a path change makes any difference.
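
Before and after applying it, a quick way to see whether the patch actually made it onto a host is to list the installed bundles from the service console. A sketch only - the exact bundle naming varies, though on 3.0.x the bundle name usually contains the KB number:

# List the patch bundles installed on this ESX 3.0.x host, then check
# whether the recommended patch is among them.
esxupdate query
esxupdate query | grep 1002428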

Richard J Minick, VCP
bobross
Hot Shot

Also make sure you have the correct 'custom type' value for the EVA in question. On the HP forum, see

Did you set the multipathing policy to Fixed or MRU? HP recommends MRU; Fixed can lead to path thrashing, which is what the unit attn messages you are seeing suggest. The paths are thrashing, and the array has to send a unit attention when the path changes, per the SCSI protocol.
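
If you want to try MRU on a single LUN from the service console rather than through the VI Client, something along these lines should do it. Treat it as a sketch - I'm quoting the option names from memory, so verify them against esxcfg-mpath --help on your build first, and vmhba1:0:6 is just the LUN from the original post:

# Show the current policy and active/preferred path for each LUN.
esxcfg-mpath -l

# Switch the multipathing policy for one LUN to MRU (most recently used).
# Option spelling from memory - confirm with esxcfg-mpath --help first.
esxcfg-mpath --lun vmhba1:0:6 --policy mru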

This other thread on HP forums may also be of use.

jhanekom
Virtuoso

An active/active SAN, by definition, cannot really have LUN thrashing, since all LUNs are accessible through both controller paths. (How the controllers deal with it internally may be another matter.)

All HP EVA SANs are active/active, provided you're running current versions of VCS. In the past, the 3000/5000 series were active/passive, but this changed with VCS 5.x and up, IIRC. (All of the higher-end EVAs running XCS are active/active.)

bobross
Hot Shot

jhanekom said:

An active/active SAN, by definition, cannot really have LUN thrashing, since all LUNs are accessible through both controller paths. (How the controllers deal with it internally may be another matter.)

True, all LUNs are accessible through both controllers (though never through both at the same time, of course; it has to be one controller or the other). But, sorry, that does not preclude LUN thrashing whatsoever. A unit attention must be returned if the I/O path changes for any reason. The ESX host is moving the path around, and if the host is set to the Fixed policy it can do that at will, since all paths are equivalent. This is why HP recommends MRU; it reduces thrashing. Alternatively, a VM that has live paths to that LUN could be invoking multipathing at that level, e.g. if the ESX host has multiple physical ports/HBAs in play and the VM has multiple virtual HBAs.

IIRC, EVA controllers are 'dumb' in the sense that they cannot act on their own to change paths; host software must be invoked. Back in the day that was Secure Path; now they use whatever host-based multipathing is present in the OS (e.g. MPIO for Windows). EVA ports are not virtual, unfortunately. I do not know whether the EVA would return a unit attention if the back-end loop pathing changed while the front end did not; some vendors do, others hide it.

RobBuxton
Enthusiast

What version of XCS are you running on the EVA?

We had some issues with v5 under certain load conditions. Mostly these were related to Continuous Access. A number of issues did get cleared up when we went to XCS 6.110.

rminick
Contributor

I know we are running version 6.000 on the 8000. I'll check with our SAN guy on what's running on the 5000s.

Richard J Minick, VCP
rminick
Contributor

I came across this snippet from a VMware slide on common storage issues.

Richard J Minick, VCP
bobross
Hot Shot

Exactly. This is an example of the array returning unit attn (per SCSI protocol) because of a host-invoked path change.

jhanekom
Virtuoso

I'm curious to know which HP reference material you have that says to use MRU. The "VMware Infrastructure 3, HP StorageWorks best practices" guide (http://h71019.www7.hp.com/ActiveAnswers/downloads/VMware3_StorageWorks_BestPractice.pdf) says to use Fixed for active/active arrays on pages 10-11.

You can quite definitely access the same LUN through different controllers - without thrashing. Also, the controllers are intelligent - they will automatically move LUNs to the correct controller based on the number of reads (writes have to be committed to both controllers' caches anyway, so that apparently does not play a role in deciding where the LUN "lives"). Inter-controller data is passed over a back-end connection.

From this whitepaper: "The EVA4000 is an active/active disk array and both controllers simultaneously handle requests from the host server to a specific LUN. However, only the owning controller will issue the request to the spindles, thus balancing LUN ownership will maintain an equal workload on both controllers. (If a request comes into the proxy controller, the request is passed to the owning controller over the mirror port.)"

I suspect you may be confusing the (recent) EVAs with systems that "fake" active/active, such as the IBM DS4800 (or perhaps the EVA 3000/5000 with pre-4.x VCS). On those I have seen the exact symptoms you describe: the controllers are not truly active/active, but both - if they're not set up correctly - return "unit ready" for all paths when queried about a specific LUN, even though the back end is active/passive.

bobross
Hot Shot

What I was referencing was for the 5000 - the original questioner said he was checking on his 5000 firmware version. Forgive any ambiguity. For the 8000, fixed is correct.

However, something else just triggered a thought. There is a bug with the 6.000 firmware (or earlier) in combination with Cisco switches where the EVA host ports do not log back in after fabric events (such as an ESX path change, which forces an FSPF update). That could be the cause of the unit attns.

I would update the 8000 to 6.110.
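
One way to see whether that is biting you is to check from the fabric side whether the EVA host ports are actually logged in. A sketch, assuming Cisco MDS switches running SAN-OS/NX-OS; the VSAN number is a placeholder:

! On each MDS switch, list every port currently logged in to the fabric
! and confirm that all of the EVA controller host ports appear.
show flogi database

! Also check the EVA ports are registered in the name server for the
! VSAN the ESX hosts use (replace 10 with your VSAN number).
show fcns database vsan 10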

abaum
Hot Shot

After reading this thread, I would still follow up on the firmware. Our EVAs have a specific VMware host setting, which came in with the 6.x release of the firmware. You mentioned that you were using the Custom setting. We had a nasty issue with the 6.00 code - it took out an entire EVA. The advisory for upgrading to the latest code (which came out during the summer) is listed as "Critical".

adam

Oops - just read another entry where you mention that you are set to "VMware", not Custom. I'd still go with a firmware update just to be on the safe side.
