Sefirosu
Contributor

Slow SAN performance with STK FLX380

Hello,

We currently have an environment of four DL585s running ESX 3.0.1, all connected to the same LUNs on a StorageTek FLX380.

We were, at some point, experiencing slower than normal performance on this SAN (10 minutes to transfer 1.4 GB vs. roughly 50 seconds normally) until we set our SAN policy to "fixed path" instead of MRU (the policy suggested in the VMware docs).

Basically, with MRU, the LUNs were "ping-ponging" between paths, which obviously slowed down performance, and the SAN did not like that at all.

That brought us another problem: with fixed path, when one of the ESX boxes switches paths for some reason (in this case we simulated an HBA or fibre connection failure on one box), we have the "ping pong" problem all over again.
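
For reference, we flipped the policy per LUN from the service console with something like the following (syntax from memory on 3.0.1, and vmhba1:0:1 is just a placeholder LUN ID; check esxcfg-mpath -h before trusting the flags):

    # set the multipathing policy for one LUN (repeat per LUN)
    esxcfg-mpath --lun=vmhba1:0:1 --policy=fixed
    # ...and back to MRU if needed
    esxcfg-mpath --lun=vmhba1:0:1 --policy=mru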

That brings me to _the_ question ;)

What is the correct SAN policy for this SAN box? StorageTek claims it should be fixed, the VMware documents say the reverse... and we have problems with both.

The paths are currently cabled so that one HBA sees controller A and the other sees controller B. We are now going to try a layout where each HBA sees both controllers, even though StorageTek says they have not tested it and that it "may work".
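
Once re-cabled, we will sanity-check what each HBA actually sees from the service console; a sketch (the exact output format varies by build, but each LUN should end up with paths through both HBAs to both controllers):

    # list every LUN with its policy and its paths; the headers look
    # roughly like "Disk vmhba1:0:1 ... has 4 paths and policy of ..."
    esxcfg-mpath -l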

Any ideas?

8 Replies
bertdb
Virtuoso

Fixed path (aka preferred path) on an active/passive disk array is a _bad_ idea.

Any host choosing a path to a different storage processor basically forces every other host to come along. If they all have the same setting, one of them (or all of them) will force the LUN back. And that's what you are seeing: path thrashing. Lots of LUN handovers between the storage processors, not a lot of I/O getting done.

With MRU, the same situation _can_ happen, but only if your path matrix is not complete. If host A sees storage processor A but not SP B, and host B sees only SP B but not SP A, then even with MRU you'll have path thrashing.

Check your path availability!
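
A quick way to verify is to dump the path list on every host and compare (a sketch; the file name is just a convention):

    # on each ESX host's service console: save the path listing, then
    # diff the files; every host should reach every LUN through both SPs
    esxcfg-mpath -l > /tmp/paths-$(hostname).txt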

bertdb
Virtuoso

Forgot to mention this: one of the reasons for an incomplete path matrix could be having too many LUNs. The total number of paths an ESX host will manage is 256, so with 4 paths per LUN you can't keep a full path matrix for more than 64 LUNs.
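
As a rough check of how close a host is to that 256-path limit (the grep pattern is a guess at the ESX 3.x output format, where each path line starts with the transport type, e.g. "FC"):

    # count fibre channel path lines in the multipath listing
    esxcfg-mpath -l | grep -cE "^ *FC"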

Sefirosu
Contributor

There seems to be some confusion about active and passive modes.

It seems our LUNs are "active/passive" (each is seen from one controller at a time); however, the SAN box's controllers are "active/active": you can have LUN1 on controller A and LUN2 on controller B at the same time.

Reading VMware's documentation, it seems we should make both controllers visible to each HBA on all hosts and use "fixed path".

Am I right?

What are you referring to when you are talking about "active/passive"?

bertdb
Virtuoso

hehe, typical terminology discussion.

VMware is interested in the behaviour per LUN: is it active on two storage processors at once or not?

Some SAN vendors are proud that two storage processors are working in parallel, even if it means they are working on completely separate tasks (i.e. LUNs).

If your StorageTek contacts insist on the "fixed path" setting on the basis of their own "active/active" definition, that demonstrates a sad lack of understanding of 1) VMkernel-SAN interaction, and 2) possible problematic SAN situations.

bertdb
Virtuoso

For clarity: VMware calls an array active/active (and will verify fixed-path behaviour on it) when more than one storage processor accepts I/O for the same LUN at the same time.

If only one storage processor at a time can accept I/O for a given LUN (even though it can be a different SP for separate LUNs), the disk array is active/passive in VMware terminology.

bertdb
Virtuoso

Sefirosu, you haven't responded yet, but I hope I made it clear that what you (StorageTek?) are describing as active/active is _not_ active/active for VMware, and should therefore not be used with "fixed path".

regards,

Bert.

Sefirosu
Contributor

bertdb wrote: "Sefirosu, you haven't responded yet, but I hope I made it clear that what you (StorageTek?) are describing as active/active is _not_ active/active for VMware, and should therefore not be used with 'fixed path'. regards, Bert."

Yeah, thanks for clearing up the terminology mix-up. StorageTek still claims we should stay "fixed", but at this point we're not sure they really know how VMware works, so we are going to run tests on our own. I think we are on the right track to fixing this now, but more tests need to be done before we claim victory.

I'll keep you posted on our results...

multirotor
Contributor

StorageTek has modified the Linux host profile on our FLX380 to disable AVT (Auto Volume Transfer) after I convinced them with the IBM Redbook about the FAStT and VMware.

Now we are using MRU.

Some FAStT models are identical to the FLX range, but IBM provides a special "no AVT" host profile while the FLX does not.

We had a lot of problems with path thrashing in the past on ESX 2.5, but since this change we have had no more path failovers without a good reason.

I have to admit that I still consider the FLX380 to be the weak point in our setup.

We have been suffering from controller resets, which so far remains unsolved: http://www.vmware.com/community/message.jspa?messageID=570717

I hurried the upgrade to 3.0.1 just to be on a supported platform; I installed the last host this morning.

I have already had to open a support call because some LUNs appear on controller A on one host while they show up on controller B on another. That's simply impossible on the FLX!

There is another thread about this problem: http://www.vmware.com/community/thread.jspa?messageID=633861&#633861
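
For the support case we are collecting the active path for the same LUN from each host along these lines (vmhba1:0:12 is a placeholder LUN ID):

    # grab the path block for one LUN and note which path is marked
    # active; on the FLX, all hosts should agree on the owning controller
    esxcfg-mpath -l | grep -A 4 "vmhba1:0:12"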
