VMware Cloud Community
dsp2267
Contributor

How does multipathing work?

Hi all, I'm getting my notes together for VCP-550 (yes, 550) and have concluded that I really don't understand multipathing. So I will be asking a whole flurry of questions on the subject.

First question;

How does multipathing really work?

I am trying to think through how it would be implemented in software, and here's a mental model I have;

- there is an ESXi host with 2 or more FC HBAs

- there are 2 or more FC switches

- there is a storage array with 2 or more FC ports

the ESXi hypervisor must have a CPU process that manages a "routing table" analogous to a Cisco router maintaining its routing table;

in the case of FC multipathing, the host has a destination FC address and figures out that there are N possible paths, and let's say all N paths are up. so the "routing table" has a list of target LUNs, and for each LUN it has a list of egress FC ports and maybe what the policy is (Fixed, MRU, or RR).

when a VM issues a SCSI write command to a virtual disk, ESXi "intercepts" and encapsulates that command into FCP frames, looks at its "routing table", realizes there are multiple paths and that multipathing has been configured. the hypervisor figures out what to do with each outgoing FCP frame based on the multipathing config, let's say Round Robin. it transmits the frames to the array according to the design of the NMP.

the frames eventually reach the storage array, possibly hitting multiple FC ports, so the array network stack needs to reassemble them in order and then pass the raw SCSI command to the SCSI controller. after the writes, the SCSI controller returns a completion status message. the storage array's network stack then looks at its own "routing table", chooses an egress FC port, and transmits the frame back to the host.

so in other words, multipathing is handled entirely by the OS of the host and of the array; the VM has no clue, and FC switches in between have no idea what those two rascals are doing.

is this anywhere close to how multipathing works?

15 Replies
Gortee
Hot Shot

Evening,

Let me see if I can help.

- there is an ESXi host with 2 or more FC HBAs  - We will call them node1 and node2

- there are 2 or more FC switches - We will call them switch-a and switch-b

- there is a storage array with 2 or more FC ports - We will call them controller-a and controller-b

the ESXi hypervisor must have a CPU process that manages a "routing table" analogous to a Cisco router maintaining its routing table;

in the case of FC multipathing, the host has a destination FC address and figures out that there are N possible paths, and let's say all N paths are up. so the "routing table" has a list of target LUNs, and for each LUN it has a list of egress FC ports and maybe what the policy is (Fixed, MRU, or RR).

-Yes, but it's a little more complicated, because each path also has a state: active (non-optimized), active/optimized, or standby. That state is determined by a special SCSI command, and the storage array sets this status. The policy is set by vSphere but does not override the array's settings.
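To picture that, here is a rough sketch in Python (invented names and WWPNs, just an illustration of the kind of per-LUN path table the host ends up holding, not actual PSA code) of paths whose states have been dictated by the array:

from dataclasses import dataclass
from enum import Enum

class AluaState(Enum):
    ACTIVE_OPTIMIZED = "active/optimized"          # the owning controller for this LUN
    ACTIVE_NON_OPTIMIZED = "active/non-optimized"
    STANDBY = "standby"                            # only usable after a failover/trespass

@dataclass
class Path:
    hba: str            # e.g. "vmhba1"
    target_port: str    # array front-end port (illustrative WWPN)
    state: AluaState    # reported by the array, not chosen by the host

# One table per LUN: the array sets the states, vSphere only picks among them.
lun43_paths = [
    Path("vmhba1", "50:06:01:60:xx:xx:xx:xx", AluaState.ACTIVE_OPTIMIZED),
    Path("vmhba2", "50:06:01:68:xx:xx:xx:xx", AluaState.ACTIVE_OPTIMIZED),
    Path("vmhba1", "50:06:01:61:xx:xx:xx:xx", AluaState.STANDBY),
    Path("vmhba2", "50:06:01:69:xx:xx:xx:xx", AluaState.STANDBY),
]

# The path selection policy (Fixed/MRU/RR) only ever chooses among the non-standby paths.
usable = [p for p in lun43_paths if p.state != AluaState.STANDBY]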

when a VM issues a SCSI write command to a virtual disk, ESXi "intercepts" and encapsulates that command into FCP frames, looks at its "routing table", realizes there are multiple paths and that multipathing has been configured. the hypervisor figures out what to do with each outgoing FCP frame based on the multipathing config, let's say Round Robin. it transmits the frames to the array according to the design of the NMP.

-Well, here is the fun part. Let's take RR since you mentioned it. VMware Round Robin requires at least two active paths (an active/active or ALUA array; most ALUA arrays don't use RR). VMware sends 1,000 commands down one path, then switches to the next path in order. So it's 1,000 writes, switch, 1,000 writes, switch, and so on. It's really simple: VMware knows which of the active paths it can use, writes 1,000 times, then switches. With MRU it just writes down the most recently used path until that path is gone. With Fixed it writes down the preferred (fixed) path until it's gone.
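A toy model of that Round Robin behavior, assuming the default of 1,000 commands per path (that limit is a configurable setting; this is an illustrative sketch, not the real NMP code):

class RoundRobinSelector:
    """Send a fixed number of commands down one active path, then rotate to the next."""

    def __init__(self, active_paths, commands_per_path=1000):
        self.paths = list(active_paths)     # only active paths participate
        self.limit = commands_per_path      # vSphere default is 1,000 I/Os per path
        self.index = 0
        self.sent_on_current = 0

    def next_path(self):
        if self.sent_on_current >= self.limit:
            self.index = (self.index + 1) % len(self.paths)
            self.sent_on_current = 0
        self.sent_on_current += 1
        return self.paths[self.index]

selector = RoundRobinSelector(["vmhba1:C0:T0:L43", "vmhba2:C0:T1:L43"])
first_batch = {selector.next_path() for _ in range(1000)}   # all 1,000 go down one path
switched_to = selector.next_path()                          # command 1,001 rotates to the other

MRU and Fixed are then just degenerate cases of the same idea: a selector that keeps returning one path (the most recently used one, or the administrator's preferred one) until that path dies.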


the frames eventually reach the storage array, possibly hitting multiple FC ports, so the array network stack needs to reassemble them in order and then pass the raw SCSI command to the SCSI controller. after the writes, the SCSI controller returns a completion status message. the storage array's network stack then looks at its own "routing table", chooses an egress FC port, and transmits the frame back to the host.

-Roughly correct; each array does it a little differently (for example, in how it uses cache).


so in other words, multipathing is handled entirely by the OS of the host and of the array; the VM has no clue, and FC switches in between have no idea what those two rascals are doing.

->100% correct. Multipathing is a software construct that guest VMs are completely unaware of. FC switches also know nothing; their job is to forward frames (much like an L2 switch, which they essentially are).


is this anywhere close to how multipathing works?

->Pretty close. 


Your knowledge is good.  There is a great VMware storage book out there if you want to dive in, called

Storage Implementation in vSphere 5.0


It will provide more detail than you'll ever need... I am not sure the VCP requires this level of detail, but in any case it never hurts to know more.  Multipathing on other OSes is more fun; VMware implements a basic but effective solution to multipathing.  All of the above applies to Fibre Channel only.


Let me know if you have additional questions or if I did not answer your question.


Thanks,

J

Joseph Griffiths http://blog.jgriffiths.org @Gortees VCDX-DCV #143
dsp2267
Contributor

Thanks much. I haven't seen mention of the 1,000-SCSI-commands-per-path thing before; I had assumed RR worked frame by frame. I'll have more follow-up questions in a few days after I digest the above, but here's one that comes to mind immediately:

Let's say the active SP for LUN number 43 craps out on an Active-Passive array. At some point the ESXi host should detect the failure (if nothing else, during its every-five-minutes SAN scan), and if it's doing multipathing then the PSA will revisit the claim rules and make up a new path table or whatever it's called. But does the array notify the host? I'm thinking that maybe if it's a Fibre Channel SAN, the RSCNs flying around will hit the HBA and hopefully the HBA driver will pass the notification up to the PSA, but I haven't read anything to that effect.

Gortee
Hot Shot

Morning,

Let's see if I can answer this one:

Let's say the active SP for LUN number 43 craps out on an Active-Passive array. At some point the ESXi host should detect the failure (if nothing else, during its every-five-minutes SAN scan), and if it's doing multipathing then the PSA will revisit the claim rules and make up a new path table or whatever it's called. But does the array notify the host? I'm thinking that maybe if it's a Fibre Channel SAN, the RSCNs flying around will hit the HBA and hopefully the HBA driver will pass the notification up to the PSA, but I haven't read anything to that effect.

->First of all, it depends on whether it's an active/passive or ALUA array. It also depends on the location of the failure. Let's look at the possible failure locations (array, cable, switch/switches, HBA); each will have a different behavior. Let me play out some examples (there's a rough host-side sketch after the list):

1. Active/passive array (ALUA)

Failure location: array (this would be a storage controller failure). The SCSI sense command from the ESXi host fails, or the array notifies the host via a SCSI command that the active path has moved. The PSA switches paths based upon the policy.

Failure location: cable. SCSI sense commands fail and the host does a LUN trespass, letting the array know it wants to access the LUN on the other path. The array switches ownership and IO continues.

Failure location: switch/switches. Same as cable.

Failure location: HBA. Depends on the issue; if it's a driver problem, maybe nothing happens other than IO breaking. If the HBA has failed outright, then the OS knows and initiates the trespass.
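A very rough sketch of the host-side decision in those scenarios (toy Python with invented names; the real PSA/SATP plumbing is more involved):

def handle_path_failure(paths, failed_path, lun_id):
    """Drop the dead path; if no active path remains, request a trespass so a standby path can serve IO."""
    survivors = [p for p in paths if p is not failed_path]
    active = [p for p in survivors if p["state"].startswith("active")]
    if active:
        # Another active path survives (e.g. via the other fabric): keep using it.
        return active[0]
    standby = [p for p in survivors if p["state"] == "standby"]
    if not standby:
        raise RuntimeError(f"LUN {lun_id}: all paths down")
    # No active path left (owning controller or its last link is gone): ask the array
    # to move LUN ownership to the other controller.
    print(f"LUN {lun_id}: trespass requested via {standby[0]['target']}")
    standby[0]["state"] = "active"
    return standby[0]

paths = [
    {"hba": "vmhba1", "target": "SP-A", "state": "active"},
    {"hba": "vmhba2", "target": "SP-B", "state": "standby"},
]
new_path = handle_path_failure(paths, paths[0], lun_id=43)   # SP-A dies, IO resumes on SP-B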

I really need to write a blog article on this matter; I should get one posted in the next day or two.  If you have more questions keep asking; it will help me figure out what to include.

Thanks,

J

Joseph Griffiths http://blog.jgriffiths.org @Gortees VCDX-DCV #143
Gortee
Hot Shot

Sorry about the delay.  I just posted an article on multipathing with Fibre Channel.  I hope it helps.  Let me know if you have additional questions:

http://blog.jgriffiths.org/?p=689

Joseph Griffiths http://blog.jgriffiths.org @Gortees VCDX-DCV #143
dsp2267
Contributor

Gortee, thanks for publishing that blog post. Only one question so far...

AIUI, a host can send an RTPG command to an ALUA array concerning a particular LUN, and thus learn the best port on which to access that LUN. That seems to be the ultimate high-speed, low-drag way to handle pathing. The array keeps track of LUNs and ports, and the ESXi host simply does what it's told ("Access LUN 82 on my port with WWN xxx") and focuses on running VMs and crunching numbers.
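My mental picture of what comes back, in toy form (my own notation, not the actual RTPG response layout):

# Toy model of what REPORT TARGET PORT GROUPS tells the host for one LUN.
target_port_groups = [
    {"group": 1, "alua_state": "active/optimized",     "ports": ["SP-A:0", "SP-A:1"]},
    {"group": 2, "alua_state": "active/non-optimized", "ports": ["SP-B:0", "SP-B:1"]},
]

def preferred_ports(tpgs):
    """The host just uses whatever group the array marks as active/optimized."""
    for tpg in tpgs:
        if tpg["alua_state"] == "active/optimized":
            return tpg["ports"]
    # Fall back to any active group if nothing is marked optimized.
    return next(tpg["ports"] for tpg in tpgs if tpg["alua_state"].startswith("active"))

print(preferred_ports(target_port_groups))   # -> ['SP-A:0', 'SP-A:1']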

At least at the enterprise, cost-is-no-object level, why isn't this architecture pretty much universal? Cost? Inertia?

Gortee
Hot Shot

At least at the enterprise, cost-is-no-object level, why isn't this architecture pretty much universal? Cost? Inertia?

->Great question.  I think there are two reasons.  First, each vendor has different technology and they have invested a lot of money in it.  To throw away years of research to fundamentally change their technology does not make sense.  ALUA arrays sell well to customers as a lower-cost choice compared to active/active.  The vendors are stuck supporting legacy technologies.  In a lot of cases their whole technology is based around an active/passive design, which means none of their speed algorithms will work with A/A.


As for why enterprises are not all using A/A, that depends on requirements.  Sometimes A/P performance is good enough.  Mostly it's cost: 1 PB of ALUA is simply cheaper than 1 PB of A/A, due to the hardware and cache cost.


Personally I love to use A/A arrays, and I am seeing them take over everywhere except solid state.  Almost every solid-state array I have seen is ALUA if it's Fibre Channel.  SSD brings a whole new technology game.


Let me know if you have additional questions.


Thanks,

J

Joseph Griffiths http://blog.jgriffiths.org @Gortees VCDX-DCV #143
chriswahl
Virtuoso

Another reason to create an array that is active / passive is to ensure that no performance is lost when a controller goes through maintenance or fails. Many customers like to run both controllers in an active / active array to the ceiling without leaving any headroom aside for degraded states. Some newer active / active arrays provide the headroom in software and artificially rate limit the controllers.

Food for thought.

VCDX #104 (DCV, NV) ஃ WahlNetwork.com ஃ @ChrisWahl ஃ Author, Networking for VMware Administrators
Gortee
Hot Shot

What a great point, Chris.  Really important to remember in all aspects of virtualization: there are limits and you want to be below them.  I had never thought of using an A/P array to enforce them.  Really good point; thanks for getting me to think about them in a different way.

Joseph Griffiths http://blog.jgriffiths.org @Gortees VCDX-DCV #143
dsp2267
Contributor

Chris, thanks for joining in, but you lost me there. What I am inferring from your response is that an individual SP in an A/P array has way more power than an individual SP in an A/A array. I.e. if one SP in the A/P array goes down, scheduled or otherwise, the remaining SP not only can take ownership of all orphaned LUNs, it has enough horsepower to handle all the extra load for continuous ops. Versus an SP in an A/A array, which would eventually melt down under double the load.

Is that really true? If so, what's wrong with the storage guys? My understanding is that A/A arrays tend to be designed for enterprise scale, A/P for SMBs, so I'd expect the A/A arrays to have much more fault tolerance than the A/P arrays.

chriswahl
Virtuoso

I wouldn't quite phrase it that way. A/P is definitely not an SMB thing. It's just an architectural choice. An active / passive (A/P) array is designed specifically for fault tolerance. For each volume or LUN being served, the active controller provides IO while the passive controller is simply waiting for a failure or maintenance activity to take over. You could also serve multiple volumes / LUNs and make one controller active for a subset and the other controller active for the remaining subset. If you choose to do this, you should ensure that the combined load on the two controllers does not exceed 100% of a single controller's available IO capacity, or there will be a performance degradation in the event of a failure.

The same holds true for active / active (A/A) arrays, except you can use both controllers to serve IO to a single volume or LUN simultaneously without thrashing issues. If the combined load exceeds what a single controller can serve and one controller fails, you will suffer performance degradation.

For example, if you run both controllers at 80% of their maximum capabilities, you're running at a total of 160% of what a single controller can serve. It doesn't matter how the controllers are configured (A/A vs A/P). You can think of this like a car analogy: imagine driving your car at top speed down a two-lane highway with another car beside you. If his lane goes away, he'll have to drive in your lane, which means one of you will need to slow down to let the other into the surviving lane.
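The arithmetic behind that, as a quick sketch (utilization expressed as a fraction of a single controller's capacity; the same logic applies whether the array is A/A or A/P):

def survives_controller_failure(util_a, util_b):
    """True if one controller can absorb the combined load when the other goes away."""
    return (util_a + util_b) <= 1.0

print(survives_controller_failure(0.8, 0.8))   # 160% of one controller -> False, someone has to slow down
print(survives_controller_failure(0.5, 0.5))   # 100% of one controller -> True, no degradation on failover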

VCDX #104 (DCV, NV) ஃ WahlNetwork.com ஃ @ChrisWahl ஃ Author, Networking for VMware Administrators
Gortee
Hot Shot

As I mentioned above, I had never thought of using an A/P array to enforce fault tolerance, but it's true: it helps avoid oversubscription by strictly enforcing a limit.  Performance on the A/P array would be the same after a controller failure, while an oversubscribed A/A will have a major performance issue.  It really makes sense why so many solid-state arrays are A/P (beyond some architectural issues); it is very easy to oversubscribe SSD.

Thanks again for opening my eyes, Chris.

Let us both know if you have any additional multipathing questions.

Thanks,

J

Joseph Griffiths http://blog.jgriffiths.org @Gortees VCDX-DCV #143
dsp2267
Contributor

I'm afraid I still don't get how A/P is any better than A/A, post SP failure. Let's look at two hypothetical arrays using round numbers, one A/A and one A/P. Each houses 1 TB of net storage divided into 10 LUNs. Each array has two SPs that can handle 1,000 IOPS each, and let's say the average and peak workload is 1,500 IOPS.

The A/A array delivers about 750 IOPS on each SP, but if one SP fails, the remaining SP can only provide 67% of the needed I/O. The A/P array also delivers about 750 IOPS on each SP, but if one SP fails, the remaining SP can only provide 67% of the needed I/O. How is A/P any more fault tolerant?

chriswahl
Virtuoso

Let's not think of it as better or worse. They're just two different ways to design a storage array.

VCDX #104 (DCV, NV) ஃ WahlNetwork.com ஃ @ChrisWahl ஃ Author, Networking for VMware Administrators
Gortee
Hot Shot

I'm afraid I still don't get how A/P is any better than A/A, post SP failure. Let's look at two hypothetical arrays using round numbers, one A/A and one A/P. Each houses 1 TB of net storage divided into 10 LUNs. Each array has two SPs that can handle 1,000 IOPS each, and let's say the average and peak workload is 1,500 IOPS.

The A/A array delivers about 750 IOPS on each SP, but if one SP fails, the remaining SP can only provide 67% of the needed I/O. The A/P array also delivers about 750 IOPS on each SP, but if one SP fails, the remaining SP can only provide 67% of the needed I/O. How is A/P any more fault tolerant?

->I agree with Chris.  In your example it really comes down to a design flaw that should never be allowed.  Simply put, in every way A/A is better, but only if you don't oversubscribe it.  In your example the situation is the following:

1. A/A array: each SP can do 1,000 IOPS, so the total number of available IOPS is 2,000 taking into account the two SPs.

2. A/P array: each SP can do 1,000 IOPS, but only one SP actively serves the workload, so the available IOPS are 1,000.

So as long as you don't expect more than 1,000 IOPS from number 1, you are good after a failure.  Number 2 forces an upper limit of 1,000 before you see performance issues in the normal state.  I have seen, just like I am sure Chris has in his work, so many A/A arrays oversubscribed.  They just keep packing on applications and IO, assuming they are OK until a failure hits.  Assuming that if you do 1,500 IOPS and you lose one SP, the remaining 1,000 will be better than nothing is not always true... One of the largest problems with large arrays is that the whole world lives on the array; it's really a poor design choice.  Storage admins should take into account that an A/A array can only safely use 50% of its total IOPS/performance, while A/P does not allow you to go beyond that in the first place.
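To put numbers on it using your example (a quick back-of-the-envelope in Python):

# Two SPs at 1,000 IOPS each, 1,500 IOPS of steady workload (from the example above).
sp_capacity = 1000
workload = 1500

# After an SP failure, either design is left with a single 1,000 IOPS controller.
degraded_fraction = min(sp_capacity / workload, 1.0)
print(f"Post-failure, the surviving SP covers {degraded_fraction:.0%} of the workload")   # 67%

# The planning rule: keep the steady-state workload at or below one SP's capacity.
safe_workload = sp_capacity   # 1,000 IOPS, i.e. 50% of the two SPs' combined capability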

Thanks,

j

Joseph Griffiths http://blog.jgriffiths.org @Gortees VCDX-DCV #143
dsp2267
Contributor

I had a chance to flip thru my EMC text (Information Storage and Management, 2009 edition). In its chapter on Intelligent Storage Systems, it talks about A-A and A-P arrays, using the CLARiiON and Symmetrix arrays as examples of each. The book asserts that "High-end storage systems, referred to as active-active arrays, are generally aimed at" blah blah blah, and then "Midrange storage systems are also referred to as active-passive arrays and they are best suited for" blah blah blah. The book implies that A/A arrays have modular SPs that can be stacked as needed on the front end, whereas A/P arrays have exactly two SPs. So the marketing-speak has drifted substantially from the technical, as related in this thread I found last night:

Active vs Passive vs ALUA Storage

I now have the impression that plain A/P arrays can assign some LUNs to SP-a and some to SP-b, with failover/failback capability, but assignment has to be done manually and load-balanced manually, thus the appeal of A/A.
