Damin
Enthusiast

ESX 3.5 Round Robin Load Balancing

Hello,

I just wanted to report back on my success using Round Robin multipath load balancing with VMware ESX 3.5. My production SAN is running on the Wasabi Systems iSCSI Target, and I'm very happy w/ the results of the Wasabi Systems software.

My backend SAN is based on the following hardware:

Intel S5000PSLSATA Dual 771 Intel 5000P SSI EEB 3.6 (Extended ATX) Server Motherboard (NewEgg link: http://www.newegg.com/product/product.aspx?Item=N82E16813121038)

(This is the same motherboard in the Intel SSRC storage servers)

3ware 9550SXU-16ML Array Controller (NewEgg link: http://www.newegg.com/Product/Product.aspx?Item=N82E16816116059&Tpk=N82E16816116059)

16 x Seagate Barracuda ES.2 ST31000340NS 1TB 7200 RPM 32MB Cache SATA 3.0Gb/s Hard Drives (NewEgg link: http://www.newegg.com/Product/Product.aspx?Item=N82E16822148278&Tpk=ST31000340NS)

For now, I am using the two onboard Gig-E NICs, and have plans to add a pair of Intel PT1000 dual-port cards to get 6 Gig-E NICs in total. Each NIC is plugged into a dedicated Gig-E switch that only carries iSCSI traffic for the SAN.

After pretty exhaustive testing of different RAID stripe sizes, I settled on RAID-10 w/ 64k stripes (the default for the 3ware). This gives me roughly 7.21 TB of usable storage across the 16 drives, and native read/write speeds of close to 1.2 gigabytes/second from the drive array to local memory on the box. I've broken it up into 6 LUNs of roughly 1.2 TB each, assuming that I'll be able to put 12 production VMs w/ an average of 80 GB of hard drive space on each LUN, while still maintaining enough overhead for logs, snapshots and swap.

On the Wasabi target, I export all of the LUNs through different target IP addresses, i.e. all LUNs are accessible through BOTH targets. These are called "NODES" in Wasabi terminology:

10.1.2.254 - Target 1 - NIC 1 - Physical Switch 1
10.1.3.254 - Target 2 - NIC 2 - Physical Switch 2

My ESX servers have been configured as follows:

vSwitch 1 - NIC 1 - Physical Switch 1
- Service Console (iSCSI auth) 10.1.2.1
- VMkernel interface (initiator) 10.1.2.2

vSwitch 2 - NIC 2 - Physical Switch 2
- Service Console (iSCSI auth) 10.1.3.1
- VMkernel interface (initiator) 10.1.3.2

I have configured the software iSCSI initiator to do Dynamic Discovery against both of the targets: 10.1.2.254 / 10.1.3.254.
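For reference, building one of those vSwitches from the service console looks roughly like the following (the uplink name vmnic1, the vswif number, the port group names and the netmask are placeholders rather than my exact values; the second vSwitch is the same with the 10.1.3.x addresses):

esxcfg-vswitch -a vSwitch1                       # create the vSwitch
esxcfg-vswitch -L vmnic1 vSwitch1                # attach NIC 1 (physical switch 1) as the uplink
esxcfg-vswitch -A "SC-iSCSI-1" vSwitch1          # port group for the service console (iSCSI auth)
esxcfg-vswif -a vswif1 -p "SC-iSCSI-1" -i 10.1.2.1 -n 255.255.255.0
esxcfg-vswitch -A "VMkernel-iSCSI-1" vSwitch1    # port group for the VMkernel initiator
esxcfg-vmknic -a -i 10.1.2.2 -n 255.255.255.0 "VMkernel-iSCSI-1"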

When I scan the targets, I end up w/ 12 paths to the 6 LUNs. Doing an esxcfg-mpath -l shows that the LUNs are set up in a Fixed path failover scenario, where the secondary path is only used in the case of a failure on the first path:

Disk vmhba32:0:0 /dev/sdb (1228800MB) has 2 paths and policy of Fixed
 iScsi sw iqn.1998-01.com.vmware:esxhost3-6fd23d7e<->iqn.2000-05.com.wasabisystems.storagebuilder:iscsi-0 vmhba32:0:0 On preferred
 iScsi sw iqn.1998-01.com.vmware:esxhost3-6fd23d7e<->iqn.2000-05.com.wasabisystems.storagebuilder:iscsi-0 vmhba32:1:0 On active

After lots of testing, I decided to use the following settings on my LUNs:

esxcfg-mpath --lun=vmhba32:0:0 -p rr

esxcfg-mpath --lun=vmhba32:0:0 -H any -B 64 -C 64 -T any

Basically, this sets the path policy for that LUN to Round Robin, and switches paths after every 64 blocks and every 64 commands. That has proven to provide the best overall speed for my workloads.
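A quick way to sanity-check the change is to re-run the listing command from above; the policy for the LUN should now show Round Robin instead of Fixed (the grep is only there to trim the output):

esxcfg-mpath -l | grep -A 2 "Disk vmhba32:0:0"
# The "policy of" line should now read Round Robin rather than Fixed.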

Doing so, I am able to get sustained read/write speeds of nearly 150 MB/second utilizing both paths. The tests were performed using a CentOS 5 virtual machine doing "dd if=/dev/zero of=/dev/sdb bs=1024k count=1024". I ran them for over 48 hours on a single VM configured w/ 128 megs of memory, and since I am doing raw I/O to a block device backed by the VMFS (/dev/sdb), this should show raw network transport speeds w/out cache or filesystem overhead within the VM.
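For anyone who wants to repeat the test, the guest-side command is just that dd left running in a loop; the wrapper below is only a convenience and assumes /dev/sdb is the second (scratch) virtual disk in the guest:

# Run inside the CentOS 5 guest. Writing straight to the block device
# avoids guest filesystem and page cache effects.
while true; do
    dd if=/dev/zero of=/dev/sdb bs=1024k count=1024
done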

When I run 8 concurrent VMs hammering the storage array at the same time, I am able to see peak read/write speeds of nearly 180 MB / second.

I spent the better part of the afternoon working on a conference call w/ Wasabi and VMware engineers and did identify two gotchas in my setup that were causing some issues at first. The first one was pretty simple: I had to disable auto-negotiation on the Gig-E NICs and force 1000 Full Duplex on both them and the switch. This cleaned up the storage path really well.
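If you prefer to do the host side of that from the command line rather than the VI Client, something along these lines should work (vmnic1/vmnic2 are placeholders for whichever uplinks carry your iSCSI traffic):

esxcfg-nics -l                      # list the physical NICs with their current speed/duplex
esxcfg-nics -s 1000 -d full vmnic1  # force 1000/Full on the first iSCSI uplink
esxcfg-nics -s 1000 -d full vmnic2  # force 1000/Full on the second iSCSI uplink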

The second issue related to the spreading of interrupts across the 8 cores in my machine. I have a SuperMicro KVM-over-IP card that relies on USB support for an HID keyboard/mouse setup. As a result, interrupts for the different NICs in my VMware box were not being properly distributed across all CPUs. By disabling USB (and losing the keyboard for the KVM... bummer) I was able to get the IRQs distributed across the CPUs in the box, which took my speed from 95 MB/second to 150 MB/second.
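The easiest way to see whether you are hitting the same problem is to watch /proc/interrupts from the service console; before the USB change the NIC interrupt counters were climbing on a single CPU, and afterwards they increment across all of the cores:

# One column per CPU; run it a few seconds apart (or use `watch -n 1 cat /proc/interrupts`
# if watch is installed) and compare which CPU columns are increasing for the NIC lines.
cat /proc/interrupts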

I hope this is helpful if anyone else is looking to multipath their storage, and I'd be very interested in seeing how this works under other iSCSI targets.

One specific issue that I'm having is that I'd like the Round Robin / multipath settings to survive a reboot. Currently, they do not. And occasionally, it appears as if the multipathing reverts back to a single fixed path, despite the fact that esxcfg-mpath shows the LUN as multipathed. Is anyone else having issues with this?

spinner
Contributor

Hello,

I am using a Wasabi 2000SX, with two NICs for iSCSI and one for LAN management.

I have the same issue as you do with respect to Round Robin not surviving reboots.

The only way to fix this for now is to write your own Linux startup script which sets it to RR mode.

If you need help with the script let me know and I will post it.
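In case it saves anyone a post, a minimal sketch of that kind of script is below; it just re-runs the two esxcfg-mpath commands from the first post for each LUN at boot, and something like /etc/rc.d/rc.local on the service console is a reasonable place to call it from (the LUN names are examples, so match them to your own esxcfg-mpath -l output):

#!/bin/sh
# Re-apply the Round Robin multipath settings after a reboot.
LUNS="vmhba32:0:0 vmhba32:1:0 vmhba32:2:0"

for LUN in $LUNS; do
    # put the LUN into Round Robin mode
    esxcfg-mpath --lun=$LUN -p rr
    # switch paths every 64 blocks / 64 commands, any HBA, any target
    esxcfg-mpath --lun=$LUN -H any -B 64 -C 64 -T any
done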

Damin
Enthusiast

Figured I would post a followup to this thread.

After about a month and a half of pretty intense work and testing w/ the Wasabi developers, I've got multipath I/O w/ Round Robin load balancing working at a production-stable level. Along the way, I managed to identify a few issues w/ Wasabi that resulted in driver updates and some patches to their iSCSI target. I am currently running a specially compiled beta (v4.0.2 *Pre-Release* BETA5-N2NET-4) that changes the way they handle SCSI reservation conflicts. With those changes and the updated drivers, I can get sustained reads and writes of 180 MB/second using two NICs.

Short story: if you are running Wasabi, make sure you are running the 4.0.2 release.

Gabrie1
Commander

What happens to your config with Round Robin if one path fails? Will Round Robin stop, or accidentally (like with DNS) point you to a dead IP?

Gabrie

http://www.GabesVirtualWorld.com
Damin
Enthusiast

If one path fails (say we reboot a switch, or migrate an Ethernet cable on the target server), nothing BAD(tm) happens. VMware just marks the path as "dead", and any further requests are then issued out the redundant path. There is a heavy amount of syslog spamming that occurs as the path switches are attempted and fail, but that is to be expected.

As the entry-level iSCSI targets mature a bit and begin to implement truly bi-directional block-level and session mirroring between targets, this is going to open up a whole new world of low-cost fault-tolerant SAN options for VMware implementations.

Paul_Lalonde
Commander

Hey Damin, I sure am glad to hear your Wasabi iSCSI setup is working well!

Thanks for posting your success!

Paul

Berniebgf
Enthusiast

Great write-up Damin,

Good info with good detail.

I think as more people jump on board the iSCSI bandwagon, the more clearly defined redundant iSCSI storage solutions we will see, and this will only grow as 10Gb Ethernet is taken up...

good one.

Bernie.

http://sanmelody.blogspot.com

Damin
Enthusiast

Paul,

It was touch and go for a while, and there are still a few minor issues that I'm not entirely comfortable with related to reservation conflicts. I do not believe that I should be seeing the number of conflicts that I do see under load, and Wasabi does not see the same behavior in their LAB setup (which is an exact duplicate of what I have). However, these conflicts are now properly handled by the target and do not result in messages going back to the initiator, so it seems to be a benign issue. Under VMFS, reservation conflicts are going to occur... there is nothing that you can do about it.

In my setup, I can still cause the VMFS filesystem to become unavailable under EXTREME write conditions (i.e. 16 VMs pounding away doing full-speed writes), but this appears to be a VMFS issue that can be corrected by clearing the outstanding reservations and re-scanning the target.
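For what it's worth, the recovery looks roughly like the following from the service console (the device path is an example and the exact vmkfstools lock options may differ between builds, so treat this as a sketch rather than the official procedure):

# clear the outstanding SCSI reservation on the LUN backing the VMFS volume
vmkfstools -L lunreset /vmfs/devices/disks/vmhba32:0:0:0
# then rescan the software iSCSI adapter so the paths and the VMFS volume come back
esxcfg-rescan vmhba32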

The good thing is that Wasabi is working w/ VMware to certify their VMX appliances, so the software should soon be fully supported by VMware.

actixsupport
Contributor

Hi,

Great article, which I've been trying to replicate myself. Could you help with some advice for the setup I'm currently putting together?

4 x ESX 3.5 on DL380s with 2 x onboard and 2 x NC380T (with the bloody iSCSI HBA that doesn't work with ESX!)

2 x DL380 with 2 x onboard and 2 x NC380T running SANmelody in failover.

So I've followed what you've done above and dedicated a vSwitch to a NIC with different IPs, but I can still only see 1 path.

In SANmelody I see the machine on the 2 targets, each with a different IP, but it lists the disk only once.

I know I can get multiple paths going because I've had it running with a second server with SANmelody replication/HA.

Is it just the software being too smart for its own good, or am I missing something here?

Thanks

Ray

actixsupport
Contributor

Update,

So thanks to DataCore support, they let me know how to set up the twin targets properly.

Ran IOmeter and was able to get a solid 220Mb/sec sequential read/write, nice!

Unfortunately my final setup is to have 2 x SANmelody set up in failover, so ESX also writes to the backup SAN as well and performance goes down the sink.

Still great article!

Ray

Berniebgf
Enthusiast

Depending on your configuration, actixsupport, you may find little to no performance impact.

If they (the DataCore SANmelody servers) are at the same site, with redundant mirror links (preferably 4Gb fibre between them), it works as follows:

1. Reads to a volume on either DataCore box in the HA cluster = no data is sent between the DCSs; instant response back to the app server once the I/O is committed to cache (RAM).

2. Writes to a volume are committed to cache (RAM) on the first cluster member and then sent across the mirror link to be committed to cache on the second cluster member.

Then the I/O confirmation is sent back to the application server, so your data is protected in two locations, similar to how a standard array works with redundant controllers and mirrored cache.

So if your I/O bus is fast (PCIe) and you are using 4Gb fibre mirror links (even just a crossover; no FC switch needed if you're using iSCSI at the front end), you will find the HA mirroring for failover has next to no impact on performance.

Oh, and using Ethernet for the mirror links sucks a bit but does work. The downside is that you need to set up the Microsoft iSCSI initiator on each DCS to get the "initiator" mirror channel talking to the "target" mirror channel.

So it's a bit more stuffing around on setup, performance can be a little bit of a concern, plus there's CPU overhead (I am talking about HA sync mirror situations here).

Hope that helps.

best regards

Bernie

http://sanmelody.blogspot.com

jeangaud
Contributor

Have you found a solution to make these 2 commands persistent after a reboot?

Currently in my lab test, I have the following hardware:

2 servers: DL380 G4, with 2 QLogic QLA4052C (4 ports)

1 port for boot from SAN, 3 ports for the datastore.

SAN: EqualLogic PS100E

I've tested the commands on one server and the throughput seems to be much faster; I'm going to test with IOmeter to get more results.

DXS_Matt
Contributor

My apologies for digging up a dead thread, but Damin, can you give a little more info on your setup?

I've got a VMX2000SX setup with 12 1TB drives, and I'm not seeing nearly the performance you are.

I'm beginning to wonder if it's my Dell PowerConnect 2427 Switches which are causing my bottleneck.

I've emailed Wasabi requesting the latest version of their software, as I'm currently running 4.01.

However I just wanted to verify that my setup is correct.

Disk vmhba32:3:0 /dev/sdc (2000000MB) has 2 paths and policy of Round Robin/Balanced
 iScsi sw iqn.1998-01.com.vmware:nova-6acd5008<->iqn.2008-05.com.wasabisystems.vmx:wasabi-0 vmhba32:3:0 On preferred
 iScsi sw iqn.1998-01.com.vmware:nova-6acd5008<->iqn.2008-05.com.wasabisystems.vmx:wasabi-0 vmhba32:8:0 On

Disk vmhba32:2:0 /dev/sdb (2000000MB) has 2 paths and policy of Round Robin/Balanced
 iScsi sw iqn.1998-01.com.vmware:nova-6acd5008<->iqn.2000-05.com.wasabisystems.vmx:wasabi-1 vmhba32:2:0 On active preferred
 iScsi sw iqn.1998-01.com.vmware:nova-6acd5008<->iqn.2000-05.com.wasabisystems.vmx:wasabi-1 vmhba32:7:0 On

Disk vmhba32:4:0 /dev/sdd (2000000MB) has 2 paths and policy of Round Robin/Balanced
 iScsi sw iqn.1998-01.com.vmware:nova-6acd5008<->iqn.2000-05.com.wasabisystems.vmx:wasabi-2 vmhba32:4:0 On active preferred
 iScsi sw iqn.1998-01.com.vmware:nova-6acd5008<->iqn.2000-05.com.wasabisystems.vmx:wasabi-2 vmhba32:9:0 On

Disk vmhba32:5:0 /dev/sde (2000000MB) has 2 paths and policy of Round Robin/Balanced
 iScsi sw iqn.1998-01.com.vmware:nova-6acd5008<->iqn.2000-05.com.wasabisystems.vmx:wasabi-3 vmhba32:5:0 On active preferred
 iScsi sw iqn.1998-01.com.vmware:nova-6acd5008<->iqn.2000-05.com.wasabisystems.vmx:wasabi-3 vmhba32:10:0 On

Disk vmhba32:6:0 /dev/sdf (1855361MB) has 2 paths and policy of Round Robin/Balanced
 iScsi sw iqn.1998-01.com.vmware:nova-6acd5008<->iqn.2000-05.com.wasabisystems.vmx:wasabi-4 vmhba32:6:0 On preferred
 iScsi sw iqn.1998-01.com.vmware:nova-6acd5008<->iqn.2000-05.com.wasabisystems.vmx:wasabi-4 vmhba32:11:0 On active

Any insight on the issues I'm experiencing would be greatly appreciated.

Thanks,

Matt

Damin
Enthusiast

DXS_Matt,

Sorry that I haven't been paying much attention to this thread, but I wanted to check in and see if you have had any success.

I've been running this in production since this post was made, w/ only minor issues. I have run into some snags, but I'm slowly working through them. Here is a quick list of the issues that I have seen.

1. Wasabi confirmed for me today a suspicion I've had w/ the Intel motherboard since day one. There appear to be some issues between ACPI and the PCI-X bus that cause occasional controller resets and other strange results. These are few and far between, but have been more noticeable since I updated the Intel board to the latest (rev 96) BIOS. The issue only seems to affect the NetBSD and FreeBSD architectures, so it is very possible that it will eventually get corrected. However, I decided not to wait. On Wasabi's recommendation, I am replacing the motherboard in my production SAN with the following:

I will report back after the MB has been swapped and let you know how things are working.

2. I have also kept my 3ware firmware up to date and am flashed to the latest firmware on the 9550SXU-16ML.

3. I use the Seagate ST31000340NS 1TB drives. They shipped w/ drive firmware SN03, but after doing some searches on the net and talking w/ Seagate technical support, I was able to obtain firmware SN06. This actually resulted in a performance boost on writes for me, although I haven't done a full set of performance analytics yet.
