Thunder85
Contributor

odd multipathing issue md3200i dell r610

OK, I think I'm losing my mind, so bear with me. Multipathing is not working on my host, even though I believe I've followed all the instructions and forum recommendations to the letter (I followed this article: http://virtualgeek.typepad.com/virtual_geek/2009/09/a-multivendor-post-on-using-iscsi-with-vmware-vs...). I've enabled jumbo frames on the switches, SAN, vSwitches, and vmk1 & vmk2. I've bound vmk1 & vmk2 to my iSCSI initiator (vmhba39). I can ping all targets and vmk ports from my two switches, and vmkping them from host1. I've tried to unbind the vmk ports so I can re-bind them, but I just get errors. Performance is abysmal due to the multipathing issue, but I've hit a wall here and am fairly tempted to just reinstall and start over. Any help would be GREATLY appreciated.

Here is my configuration:

2 active ports with round robin enabled, but only one target, even though it says I have 4 connected targets with 7 devices and 28 paths.

iscsi example1.png

switch/host/san config:

Switch a:
All 10.1.52.x and 10.1.54.x iSCSI traffic

Switch b:
All 10.1.53.x and 10.1.55.x iSCSI traffic

SAN
Controller 0:
Port 0: 10.1.52.10 (Switch A)
Port 1: 10.1.54.10 (Switch A)
Port 2: 10.1.53.10 (Switch B)
Port 3: 10.1.55.10 (Switch B)

Controller 1:
Port 0: 10.1.53.11 (Switch B)
Port 1: 10.1.55.11 (Switch B)
Port 2: 10.1.52.11 (Switch A)
Port 3: 10.1.54.11 (Switch A)

SERVERS
Server1:
ISCSIA: 10.1.52.56 (Switch 1)
ISCSIB: 10.1.53.56 (Switch 2)

Server2:
ISCSIA: 10.1.54.57 (Switch 1)
ISCSIB: 10.1.55.57 (Switch 2)

binding confirmation:

[root@VMhost1 ~]#  esxcli swiscsi nic list -d vmhba39

vmk1
    pNic name:
    ipv4 address: 10.1.52.56
    ipv4 net mask: 255.255.255.0
    ipv6 addresses:
    mac address: 00:00:00:00:00:00
    mtu: 9000
    toe: false
    tso: false
    tcp checksum: false
    vlan: false
    vlanId: 52
    ports reserved: 63488~65536
    link connected: false
    ethernet speed: 0
    packets received: 0
    packets sent: 0
    NIC driver:
    driver version:
    firmware version:

vmk2
    pNic name:
    ipv4 address: 10.1.53.56
    ipv4 net mask: 255.255.255.0
    ipv6 addresses:
    mac address: 00:00:00:00:00:00
    mtu: 9000
    toe: false
    tso: false
    tcp checksum: false
    vlan: false
    vlanId: 53
    ports reserved: 63488~65536
    link connected: false
    ethernet speed: 0
    packets received: 0
    packets sent: 0
    NIC driver:
    driver version:
    firmware version:
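For anyone retracing this, the binding itself comes down to something like the following (ESX 4.x `esxcli swiscsi` syntax; adapter and vmk names are the ones from my setup):

```shell
# bind each iSCSI vmkernel port to the software iSCSI adapter
esxcli swiscsi nic add -n vmk1 -d vmhba39
esxcli swiscsi nic add -n vmk2 -d vmhba39

# confirm the binding (produces the listing above)
esxcli swiscsi nic list -d vmhba39
```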

and vmkping results to all 4 targets on san:


[root@VMhost1 ~]# vmkping 10.1.52.10
PING 10.1.52.10 (10.1.52.10): 56 data bytes
64 bytes from 10.1.52.10: icmp_seq=0 ttl=64 time=0.476 ms
64 bytes from 10.1.52.10: icmp_seq=1 ttl=64 time=0.232 ms
64 bytes from 10.1.52.10: icmp_seq=2 ttl=64 time=0.244 ms

--- 10.1.52.10 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.232/0.317/0.476 ms
[root@VMhost1 ~]# vmkping 10.1.52.11
PING 10.1.52.11 (10.1.52.11): 56 data bytes
64 bytes from 10.1.52.11: icmp_seq=0 ttl=64 time=0.456 ms
64 bytes from 10.1.52.11: icmp_seq=1 ttl=64 time=0.252 ms
64 bytes from 10.1.52.11: icmp_seq=2 ttl=64 time=0.438 ms

--- 10.1.52.11 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.252/0.382/0.456 ms
[root@VMhost1 ~]# vmkping 10.1.53.11
PING 10.1.53.11 (10.1.53.11): 56 data bytes
64 bytes from 10.1.53.11: icmp_seq=0 ttl=64 time=0.451 ms
64 bytes from 10.1.53.11: icmp_seq=1 ttl=64 time=0.227 ms
64 bytes from 10.1.53.11: icmp_seq=2 ttl=64 time=0.226 ms

--- 10.1.53.11 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.226/0.301/0.451 ms
[root@VMhost1 ~]# vmkping 10.1.53.10
PING 10.1.53.10 (10.1.53.10): 56 data bytes
64 bytes from 10.1.53.10: icmp_seq=0 ttl=64 time=0.468 ms
64 bytes from 10.1.53.10: icmp_seq=1 ttl=64 time=0.229 ms
64 bytes from 10.1.53.10: icmp_seq=2 ttl=64 time=0.233 ms

--- 10.1.53.10 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.229/0.310/0.468 ms
[root@VMhost1 ~]#

and a picture of my vswitch config:

iscsi example 2.png

39 Replies
4nd7
Enthusiast

Is it normal for vmk1 and vmk2 to be in separate VLANs? Why would you do that?

JRink
Enthusiast

I should be able to give you a hand with this on Monday. I have an MD3200i setup and working well with 4.1. I can email you a diagram of our setup for you to review too. Drop me a msg at (removed) and maybe we can exchange #'s and figure it out. Monday afternoon works well for me since I'll be on site with the MD3200i and ESX servers (I'm only at that location 2x a week).

JR

AndreTheGiant
Immortal

I suggest following the Dell guide, and especially the older one on MD3000i configuration (the only difference is that the MD3200i has 4 NICs per controller).

Basically, you must have a different logical network for each group of interfaces.

For the same reason, use more vSwitches (each with a single NIC) for those different networks.

In your case it seems that you only use 2 interfaces (I see only two different networks), so split your vSwitch in two.
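A sketch of that split from the service console (ESX 4.x `esxcfg` commands; the vSwitch names, NICs, and IPs below are examples taken from the first post, so adjust to your environment):

```shell
# vSwitch2: first physical NIC, first iSCSI subnet, jumbo frames
esxcfg-vswitch -a vSwitch2
esxcfg-vswitch -m 9000 vSwitch2
esxcfg-vswitch -L vmnic2 vSwitch2
esxcfg-vswitch -A iSCSIa vSwitch2
esxcfg-vmknic -a -i 10.1.52.56 -n 255.255.255.0 -m 9000 iSCSIa

# vSwitch3: second physical NIC, second iSCSI subnet, jumbo frames
esxcfg-vswitch -a vSwitch3
esxcfg-vswitch -m 9000 vSwitch3
esxcfg-vswitch -L vmnic5 vSwitch3
esxcfg-vswitch -A iSCSIb vSwitch3
esxcfg-vmknic -a -i 10.1.53.56 -n 255.255.255.0 -m 9000 iSCSIb
```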

Andre

Andre | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
Thunder85
Contributor

That is what I did on my other server (2 hosts). I have it split as you recommended and still only see one path (2 active interfaces, round robin), which is part of what is stumping me. I've configured these servers and iSCSI interfaces all on one VLAN, on 2 VLANs, and as you see now with 4 separate distributed VLANs (2 per host, 2 per controller) across the two Cisco 2960 switches.

Does anyone know how to release a vmk from a vmhba? I just want to disassociate the vmk ports again so I can go through the multipathing procedure one more time, making sure I don't miss a single step, but I keep getting the error below when I try. I've tried disabling the vmhba, I've put the system in maintenance mode, and I still get the same error.

[root@VMhost1 ~]# vmkiscsi-tool -V -r vmk1 vmhba39
Removing NIC vmk1 ...
Error: SCSI Busy

Additionally, is there something I need to select in the configuration of the MD3200i to allow multipathing? I understand it uses a primary/secondary controller setup, and as you can see in the IP config list, I have all 4 VLANs represented on both controllers to make sure that both hosts have 2 paths to the primary controller and 2 to the secondary at all times.

4nd7
Enthusiast

Try removing it with:

esxcli swiscsi session remove --adapter vmhba39

vmkiscsi-tool -V -r vmk1 vmhba39
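If the adapter still reports busy, a fuller sequence along these lines may help (same ESX 4.x syntax; both vmk ports assumed bound, as in the post above). Dropping the sessions first releases the NICs so the unbind can succeed:

```shell
# tear down the active iSCSI sessions so the bound NICs are no longer in use
esxcli swiscsi session remove --adapter vmhba39

# unbind both vmkernel ports from the software iSCSI adapter
esxcli swiscsi nic remove -n vmk1 -d vmhba39
esxcli swiscsi nic remove -n vmk2 -d vmhba39

# rescan the adapter to pick up the change
esxcfg-rescan vmhba39
```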

Thunder85
Contributor

Here are some screenshots of my 2nd host showing that I have the split vSwitch as you mentioned, with two active I/O paths yet only one target:

iscsi example 3.png

iscsi example 4.png

And here are the vmkping results showing I can ping all 4 available host ports from host 2:

[root@VMhost2 ~]# vmkping 10.1.54.10
PING 10.1.54.10 (10.1.54.10): 56 data bytes
64 bytes from 10.1.54.10: icmp_seq=0 ttl=64 time=0.261 ms
64 bytes from 10.1.54.10: icmp_seq=1 ttl=64 time=0.237 ms
64 bytes from 10.1.54.10: icmp_seq=2 ttl=64 time=0.236 ms

--- 10.1.54.10 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.236/0.245/0.261 ms
[root@VMhost2 ~]# vmkping 10.1.54.11
PING 10.1.54.11 (10.1.54.11): 56 data bytes
64 bytes from 10.1.54.11: icmp_seq=0 ttl=64 time=0.264 ms
64 bytes from 10.1.54.11: icmp_seq=1 ttl=64 time=0.240 ms
64 bytes from 10.1.54.11: icmp_seq=2 ttl=64 time=0.243 ms

--- 10.1.54.11 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.240/0.249/0.264 ms
[root@VMhost2 ~]# vmkping 10.1.55.10
PING 10.1.55.10 (10.1.55.10): 56 data bytes
64 bytes from 10.1.55.10: icmp_seq=0 ttl=64 time=0.248 ms
64 bytes from 10.1.55.10: icmp_seq=1 ttl=64 time=0.235 ms
64 bytes from 10.1.55.10: icmp_seq=2 ttl=64 time=0.234 ms

--- 10.1.55.10 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.234/0.239/0.248 ms
[root@VMhost2 ~]# vmkping 10.1.55.11
PING 10.1.55.11 (10.1.55.11): 56 data bytes
64 bytes from 10.1.55.11: icmp_seq=0 ttl=64 time=0.444 ms
64 bytes from 10.1.55.11: icmp_seq=1 ttl=64 time=0.239 ms
64 bytes from 10.1.55.11: icmp_seq=2 ttl=64 time=0.261 ms

--- 10.1.55.11 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.239/0.315/0.444 ms

Thunder85
Contributor

Well, I decided to just start fresh, so I disassociated the vmk ports and completely deleted vSwitch2. I recreated it line for line as described in the Dell VMware MD3200i configuration document (pg 13-26), resulting in exactly the same scenario: 4 live connections, 2 active, only one target working. I've tried selecting the Dell-recommended MRU path policy as well as round robin as recommended in the "a multivendor post on using iscsi with vmware vsphere" article. Still the host sees 4 working paths but only one target. Is there a bug or driver issue I'm missing?

JRink
Enthusiast

Out of curiosity, have you spoken with Dell tech support? Sometimes they are pretty good, depending on which tech you get on the phone.

JR

Thunder85
Contributor

I'm going to try that shortly; I'm waiting to hear back from my Dell rep as to which number I should call for VMware/SAN support. I'm thinking it must be a bug, since both hosts are configured differently yet are seeing the same glitch.

Thunder85
Contributor

OK, I may be on to something. I'm seeing higher latency than I would expect when I do a jumbo frame vmkping. Please review my settings and see if I'm missing something:

[root@VMhost1 ~]# vmkping 10.1.52.10 -s 9000
PING 10.1.52.10 (10.1.52.10): 9000 data bytes
9008 bytes from 10.1.52.10: icmp_seq=0 ttl=64 time=1.655 ms
9008 bytes from 10.1.52.10: icmp_seq=1 ttl=64 time=1.375 ms
9008 bytes from 10.1.52.10: icmp_seq=2 ttl=64 time=1.372 ms

vswitch info:

Switch Name  Num Ports  Used Ports  Configured Ports  MTU   Uplinks
vSwitch2     128        5           128               9000  vmnic2,vmnic5

  PortGroup Name  VLAN ID  Used Ports  Uplinks
  iSCSIb          53       1           vmnic5
  iSCSIa          52       1           vmnic2

Interface  Port Group/DVPort  IP Family  IP Address  Netmask        Broadcast    MAC Address        MTU   TSO MSS  Enabled  Type
vmk1       iSCSIa             IPv4       10.1.52.56  255.255.255.0  10.1.52.255  00:50:56:7a:aa:9d  9000  65535    true     STATIC
vmk2       iSCSIb             IPv4       10.1.53.56  255.255.255.0  10.1.53.255  00:50:56:78:7e:34  9000  65535    true     STATIC

Name    PCI            Driver  Link  Speed     Duplex  MAC Address        MTU   Description
vmnic2  0000:02:00.00  bnx2    Up    1000Mbps  Full    84:2b:2b:fa:93:1d  9000  Broadcom Corporation PowerEdge R610 BCM5709 Gigabit Ethernet
vmnic5  0000:04:00.01  bnx2    Up    1000Mbps  Full    00:10:18:8d:b4:26  9000  Broadcom Corporation Broadcom NetXtreme II BCM5709 1000Base-T

And my switch configs (2x Cisco 2960G):

VM2960A#show sys mtu

System MTU size is 1500 bytes
System Jumbo MTU size is 9000 bytes
Routing MTU size is 1500 bytes

VM2960B#sho sys mtu

System MTU size is 1500 bytes
System Jumbo MTU size is 9000 bytes
Routing MTU size is 1500 bytes

I've configured everything to handle jumbo frames. Is there something I'm missing? Why would the latency be so high? I have a 4-port EtherChannel between the two switches as well as one trunk each back to the core switch. Any help would be greatly appreciated.
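One thing to double-check with those jumbo-frame pings: `-s 9000` does not actually test a 9000-byte MTU cleanly, because the ICMP payload rides inside IP and ICMP headers. The arithmetic (assuming IPv4 with no options):

```shell
# A vmkping payload of 9000 bytes does NOT fit in a 9000-byte MTU:
# the IPv4 header (20 bytes, no options) and ICMP header (8 bytes)
# travel in the same packet.
MTU=9000; IP_HDR=20; ICMP_HDR=8

echo "on the wire:  $(( MTU + IP_HDR + ICMP_HDR ))"   # 9028 -> gets fragmented
echo "max payload:  $(( MTU - IP_HDR - ICMP_HDR ))"   # 8972 fits unfragmented
```

So some of the extra latency on `-s 9000` may just be fragmentation and reassembly; `vmkping -s 8972` (with the don't-fragment option, if your build supports one) is the cleaner end-to-end jumbo-frame test.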

JRink
Enthusiast

Any reason you have VLAN IDs set up on your port groups for vSwitch2? You shouldn't need that if you're tagging properly on your physical switch. You really only need to set up VLAN IDs on your port groups if you have multiple VLANs you want to pass through the same vSwitch or pNICs, and in your case, I don't think you need that.

JR

Thunder85
Contributor

They are on because I have all server connections configured as trunks. I did so because I was changing things so much (trying different scenarios/VLAN combos) that it was easier to assign a VLAN tag than to reconfigure the VLAN on the switch manually every time.

Thunder85
Contributor

Well, jrink had me run several tests which verified that my problem was something other than multipathing: my brand-new, 4-week-old MD3200i was borked.

Here's where it gets fun. For the last week I've been working directly with Dell (two techs together), and we verified that even with the server directly connected to the SAN (using a single path) I saw ridiculously slow access speeds on random real-life IOMeter tests (4 MB/s), but would see anywhere from 60 MB/s to 114 MB/s on sequential reads or writes. When I added a second direct connection to the SAN to enable multipathing, it actually SLOWED DOWN. So the Dell techs continued testing, and they first found that the SAN had been sent to me with the switch set to split bus instead of single. They reflashed the firmware and nothing was gained. There were lots of database and sync issues between the two controllers, and after much command-line fun and additional experimentation they sent me a new controller 1, as it seemed to be the one with the most issues. This didn't help, and the two controllers went to crap again, so they did a master reset on controller 0 with controller 1 out, and while attempting to recover my array a tech accidentally copy-pasted a script to initialize instead of rebuild, so the primary VM store containing my vCenter and test VMs was destroyed. After I calmed down, we moved forward, created a RAID 0 partition, and attempted to connect it directly to an existing Windows server to take VMware switching and other potential issues out of the equation. This also failed: the server could see the disk but couldn't access or initialize the drive, even after several resets and re-applications of the host permissions on the SAN. That was finally the last straw, and they took all the IOMeter tests and notes and forwarded them on to the analysts/engineers.

So here I am today, waiting to hear from an engineer to see if they are planning on sending me a new SAN or trying to throw more parts at it. At this point I'm not sure how much I trust this thing to run my infrastructure. I'm concerned it's a backplane issue, as that is the only thing I can think of that would cause NVRAM/database corruption on the controllers, even after factory resets and replacements.

Let this be a lesson to everyone: even if your SAN hardware is brand spanking new, do a direct-connection test first to get a baseline performance number. Don't trust the software when it says "optimal" unless the performance is verified. I wasted so much time troubleshooting the servers, multipathing, my new Cisco switches, etc., that I neglected to test the SAN itself. Luckily jrink sent me some performance tests he had done on a similarly configured MD3200i to use for comparison.
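As a rough way to get such a baseline without a full IOMeter run, even a `dd` pass from the service console gives a ballpark sequential number. The path below is a placeholder for a file on the LUN under test, OS caching will inflate the read figure, and random I/O (the case that collapsed to 4 MB/s here) still needs a real tool like IOMeter:

```shell
# Placeholder path -- point this at a file on the datastore/LUN under test.
TESTFILE="${TMPDIR:-/tmp}/baseline.bin"

# Write a 64 MB scratch file, then read it back sequentially.
# NOTE: the read is served partly from OS cache, so treat the reported
# MB/s as an upper bound, not a true disk number.
dd if=/dev/urandom of="$TESTFILE" bs=1M count=64
dd if="$TESTFILE" of=/dev/null bs=1M    # dd's final line reports throughput

rm -f "$TESTFILE"
```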

I'll keep this thread up to date as to what happens, so someone else who sees this doesn't go through the massive pile of BS I have.

lege
Contributor

I have the exact same issue with a Windows 2008 R2 + Hyper-V setup using the MD3200i - with read speeds of around 2-3MB/s.

I didn't realize this bottleneck until very recently, and we had some production VMs running on there. However, I think the performance must've degraded over time somehow; I don't think the MD3200i exhibited this problem for us when we first set it up.


Fortunately, we also have an older model MD3000i in our network which works properly with read speeds of 100MB/s, so I'm migrating all the VMs over to it so I can troubleshoot this other SAN.

I'm going to try various things (currently installing SP1 for the OS) and see if I can get it back in working condition.

One thing I noticed that seemed weird is that in the system event log on the SAN there are entries for unexpectedly closed connections every 30 minutes like clockwork. I'm not sure if this is somehow a configuration issue in our network or something the SAN is doing on a schedule, but the older MD3000i doesn't show the same behavior.

All in all, we had a ton of issues with these MD3xxxi SANs from Dell, the other one being their Hardware VSS (snapshot provider) not working properly and Dell (or LSI) not wanting to do anything about it.

I wonder if this issue is somehow related to snapshots, we created a few while troubleshooting a few months back. Since then they've been deleted, but I wonder if the firmware got stuck on them somehow.

Have you taken any snapshots on yours?

Thunder85
Contributor

That sounds almost exactly like what I am experiencing. My system also exhibited resets. I don't think it has anything to do with snapshots, as I have never used that function. If I were you, I would establish a direct connection (no switches) and do some tests. If you see the same results, call Dell and get a case going. Perhaps there is a bad batch of these, as I know others have excellent performance with theirs.

lege
Contributor

If you do a simple Google search, you'll see a bunch of other people experiencing the same problem. I just ran an ATTO disk benchmark on a new LUN and noticed the following:

http://i.imgur.com/1O1zK.png

The problem seems to start with block sizes above 32KB, with 128KB and 256KB being particularly bad.

Update 1: I just recreated the LUN with a block size of 32KB and formatted the NTFS partition on it with a 32KB block size, and that didn't help; same 1.8MB/s read speeds. I also installed 2008 R2 SP1, but there was no improvement. I will try disabling multipathing as a next step.

Update 2: The following is with MPIO disabled. As expected, the transfer speed caps at 110MB/s (1Gbps), and surprisingly, the read speed is better than with MPIO enabled, but it still drops to 5MB/s (although sooner, at just a 32KB block size).

http://i.imgur.com/SIeCt.png

Note: Dell support is notoriously bad in our experience; opening a new ticket in our name would waste a week for us trying to explain and re-explain the problem to the various technicians we'd get reassigned to. If possible, and if you still have the support ticket open, could you try showing them these findings? Thanks.
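Incidentally, the block-size pattern ATTO shows can be approximated with a quick `dd` sweep (placeholder path; cached reads will look unrealistically fast, so look at the relative drop-off across block sizes rather than the absolute numbers):

```shell
# Placeholder path -- point this at a file on the LUN under test.
TESTFILE="${TMPDIR:-/tmp}/sweep.bin"
dd if=/dev/urandom of="$TESTFILE" bs=1M count=16 2>/dev/null

# Re-read the same file with increasing block sizes; dd's summary line
# shows the throughput for each pass (cached, so relative numbers matter).
for bs in 4k 32k 128k 256k 1M; do
    printf '%6s: ' "$bs"
    dd if="$TESTFILE" of=/dev/null bs="$bs" 2>&1 | tail -n 1
done

rm -f "$TESTFILE"
```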

lege
Contributor

Ok, I seem to have resolved this somehow.

I started by reconfiguring my iSCSI host ports from scratch in the controller:

Controller 0:
Port 0: 192.168.133.110 (Switch 1)
Port 1: 192.168.131.110 (Switch 1)
Port 2: 192.168.132.110 (Switch 2)
Port 3: 192.168.130.110 (Switch 2)

Controller 1:
Port 0: 192.168.133.111 (Switch 1)
Port 1: 192.168.131.111 (Switch 1)
Port 2: 192.168.132.111 (Switch 2)
Port 3: 192.168.130.111 (Switch 2)

After reconfiguring the IPs, however, even though I could ping them, the iSCSI initiator would throw errors and was unable to connect.


I don't have physical access to the SAN (it's colocated) to reboot it, so I had to use SMcli to reset both controllers on the SAN:

smcli.exe -n san2 -c "reset controller [0];"

smcli.exe -n san2 -c "reset controller [1];"

After both controllers were reset, the iSCSI Initiators were able to connect again fine.

I also reinstalled the "Host software" on my test machine, and after this I was able to get 150-200MB/s read speeds over MPIO with 2 NICs.

I'm not sure exactly which step helped here, but before this I was using only 2 subnets for the iSCSI host ports, not 4 (though you seem to be using 4).

It might be just the reset of the controllers that helped, you could try and see if that helps.

Thunder85
Contributor

I'm glad to hear you've gotten yours straightened out. I'm willing to bet it was the SMcli resets you performed that cleared your errors, however, not the VLAN configuration. In my dealings with the last 2 Dell technicians, they stated that typically an SMcli reset of the controllers will clear up any communication errors (which is why mine was stumping them). Something may have gotten corrupted and affected the sync between your controllers, causing the resets and whatnot.

As for my issue, I've used the 4-VLAN config you found (I think we both read the same thread it was shown in) as well as 2 VLANs, 1 VLAN with an EtherChannel trunk between my switches, etc., with no change in throughput. My last tests with Dell consisted of one ESX host directly connected via two Cat5e cables to SAN controller 0, using a different subnet for each pairing. Even then I saw the 4MB/s speeds on random read/write tests. The Dell technicians reset my controllers multiple times, sent me a new controller, and even did a full factory reset, and still the sync, database, and NVRAM issues were present.

Mine has never worked right, so I'm thinking I might have the rare case of a bad backplane or something to that effect. Something has to be wrong that is not contained within the controllers, as even the new one they sent took a dump as soon as it was introduced into my SAN. The part that's killing me is that I NEED this system up and running within the next 10 days, and still nobody knows what is going on. My management (and I, honestly) are questioning whether we can even trust this thing if Dell does pull through and find a way to fix it. I plan on running my core infrastructure on this SAN, and if we can't trust it we may have to simply return it and go with an HP P2xxx or something comparable.

Thunder85
Contributor

Additionally, what did you search for to turn up all the others with our issue? I'm gathering info for Dell and the engineers assigned to my service call and am not finding a lot of cases.
