VMware Cloud Community
GeneNZ
Enthusiast

MD3000i dynamically learns MRU paths?

Hi there,

I've got an interesting question about something I only recently noticed and wanted to clarify. We have two MD3000i's, each connected to 3 ESX4 hosts (for a total of 6 hosts). The hosts are running the latest builds, and the MD3000i firmware is also the latest version. Each MD3000i is connected to the same two switches. Our current path selection policy is "Most Recently Used". We originally used "Round Robin", which gave similar performance, until we converted to jumbo frames: with jumbo frames Round Robin performed poorly, while MRU with jumbo frames performed about the same as before. I suspect it's a problem with the Broadcom driver, but the official word from VMware is that they only support MRU on the MD3000i.
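
For reference, this is roughly how we check and change the path selection policy from the ESX service console (just a sketch assuming the ESX 4.0 esxcli nmp namespace; the naa device ID below is a placeholder, not one of our real LUN IDs):

    # List devices with their current path selection policy (VMW_PSP_MRU, VMW_PSP_RR, ...)
    esxcli nmp device list

    # Set a single LUN back to Most Recently Used (replace the placeholder naa ID with your LUN)
    esxcli nmp device setpolicy --device naa.60026b90000abcde0000000000000000 --psp VMW_PSP_MRU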

Recently I had to reboot the two iSCSI switches, and I decided to reboot them out of the order I normally use. After heading back to VirtualCenter, I ran some disk benchmarks on the VMs (using HD Tune Pro) and noticed that disk performance was absolutely abysmal (around 2 MB/s read transfer rate, versus the 40-50 MB/s we usually get). If I changed the MRU path back to the original path (by disabling the path, letting the host fail over to the alternate path, and re-enabling the path), performance was restored to what it was before.
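
(For anyone who wants to see which path MRU is actually sitting on, this is roughly what I look at from the service console; again only a sketch based on the ESX 4.0 tools, with a placeholder device ID:)

    # Brief listing of each device and its paths, including which one is currently active
    esxcfg-mpath -b

    # Path details for a specific LUN, showing the current working (most recently used) path
    esxcli nmp path list --device naa.60026b90000abcde0000000000000000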

What I'm trying to understand is why this is occurring. It is occurring across two MD3000i's, so it's unlikely to be a fault with the MD3000i itself, which makes me think it's the switch. But even with the load on the switches reduced by turning off some hosts, we still see the performance problem (we are using D-Link DGS-1224T's for our switches). Which makes me wonder: does the MD3000i or ESX somehow learn the MRU path, resulting in performance degradation when another path is used? If so, would performance improve over time if I left it on the 'slower' path? Because at the moment it means that if one path were to fail, we would just be left with slow VMs. That, or we have to balance the load manually.

Thanks in advance for your help!

Gene

5 Replies
AndreTheGiant
Immortal

When you have slow performance, what do you see on your MD3000i management interface?

Remember that the controllers are active/passive for the same VirtualDisk.

So two of the paths really are slower (at least for the first period) because using them requires a change of the active controller (with the corresponding change of the active path on the other ESX hosts).

So check whether your VirtualDisk is still on its preferred controller while the performance issue is occurring.
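
If you have the SMcli command line tools installed, something like this dumps the array profile, which includes the current and preferred owner of each virtual disk (only a sketch: the management IP below is a placeholder and the exact syntax can vary with the MDSM/firmware version):

    # Dump the full array profile; compare "Current owner" vs "Preferred owner" per virtual disk
    SMcli 192.168.130.101 -c "show storageArray profile;"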

And (IMHO) do not use a Round Robin policy. ;)

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
GeneNZ
Enthusiast

Thanks for the reply,

From the MDSM (MD3000i management interface), everything reports back as optimal, and the virtual disks (LUNs) are evenly distributed across both controllers. What we found is that even on the active paths (i.e. the paths to the controller that owns a specific LUN), if we disable the 'primary path' (i.e. the path that gives the better performance), the host correctly fails over to the second path (the other interface on the same controller that owns the LUN), and performance on this second path is abysmal. Both of these are active paths, not standby paths (the standby paths go to the second controller, which doesn't own that LUN).

Hence I am rather confused. We even drew a fairly complex diagram on our whiteboard, and we thought it could be something to do with one of the switches we are using. However, we are seeing both fast and slow performance on both switches, depending on which path a host is taking to a LUN.

Thanks,

Gene

AndreTheGiant
Immortal

Are the switches dedicated only to iSCSI IP traffic?

On the MD3000i, does each DiskGroup have only one VirtualDisk?

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
GeneNZ
Enthusiast

The switches are indeed dedicated for iSCSI traffic.

One MD3000i is set up with each disk group having one virtual disk, and the other is set up with multiple virtual disks in one disk group. I know the latter is not best practice, but that doesn't change the fact that both SANs are displaying the same issue.

Gene

GeneNZ
Enthusiast

I've done some further testing and now have some numbers to show performance differences.

But as a recap: we have two MD3000i's set up in independent ESX4 clusters, both showing the following issue across both controllers on each MD3000i, so I suspect it is not a hardware issue. I'm getting strange performance differences depending on which of the two ports of a controller an ESX host connects to when accessing a LUN. Each ESX host uses the "Most Recently Used" path policy instead of the alternative Round Robin path policy. My ESX setup is identical to the one shown here: http://www.delltechcenter.com/page/VMwareESX4.0andPowerVault+MD3000i . That is, each port of a controller on the MD3000i is connected to an independent subnet across its own independent switch.

The performance differences were measured using the following method. I have three hosts connected to the MD3000i and one virtual machine. One of the hosts is using "Port 1, Controller 0" for its I/O traffic, and the other two are using "Port 0, Controller 0"; in each case the port is on the controller that owns the virtual disk. I use HD Tune Pro as my benchmarking tool (other tools show similar results). I test by VMotioning the virtual machine onto one host, running HD Tune Pro, then VMotioning the VM onto another host and running HD Tune Pro again. The results are below.
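
Before each run I also double-check which controller port (and therefore which switch) the host's active path is actually going through. Roughly, from the service console (a sketch with a placeholder device ID):

    # Detailed path listing for the LUN; the iSCSI target portal shown on the active path
    # tells you which controller port the I/O is going through
    esxcfg-mpath -l -d naa.60026b90000abcde0000000000000000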


On the top are the results with the VM on the host connected to "Port 0, Controller 0"; on the bottom, the VM on the host connected to "Port 1, Controller 0". Although it's not obvious (the scales are different - sorry, I couldn't get them to match), the transfer rates are 10-20 MB/s slower on Port 1, and the burst rate is hugely slower (although access times are about even). If I force the host connected to "Port 1, Controller 0" onto "Port 0, Controller 0", performance improves; however, this isn't ideal since it leaves a disused port on one of the controllers.

I know it can't be the switches at fault either, because hosts accessing other LUNs, say via "Port 1, Controller 1", get good performance, while "Port 0, Controller 1" gives poor performance. Note that Port 0 on both controllers is connected to Switch 1, and Port 1 on both controllers is connected to Switch 2. Also, both MD3000i's are displaying the same behaviour, and they are independent of each other.

This leads me to believe it's something to do with the MD3000i itself and how it is accessing the LUNs. Does anyone have any idea why it is displaying this behaviour? Note that the MD3000i's are all showing optimal status, and each LUN is appropriately owned by one of the two controllers (i.e. the virtual disks are all on their preferred paths).

Thanks in advance!

Gene
