VMware Cloud Community
neilynz
Contributor

Multipath Redundancy Issue with Equallogic PS4000

We have an issue with our iSCSI environment as below

3 * vSphere ESXi hosts on Dell R710 servers (2 embedded NICs used for iSCSI)

2 * Dell 6248 Powerconnect switches

2 * Dell Equallogic PS4000x SANs (in a group, each with an active controller)

ESXi has been set up as per the Dell \ VMware best practice and all connections are redundant between ESXi, the switches & the SAN. The PowerConnects are used for iSCSI traffic only. No other servers connect to the SAN.

We have 2 volumes presented to the ESXi servers; everything connects fine and multipathing sets up 4 paths (2 per LUN) to the SAN. All paths are active (Round Robin) with connections to either SAN.

The issue we have is when testing redundancy: we turn off one of the PowerConnects, and the ESXi servers notice the dropped NIC connected to that switch but lose all connectivity to the SAN, with all paths shown as dead.

1\ In an active \ active environment this should not happen.

2\ Connectivity does not return unless I manually rescan the software iSCSI HBA.

Another concern I have is how the Equallogic load-balances connections. For example, I will see both connections for LUN 1 going to the same SAN storage processor (sometimes the same NIC on that SP). This means the connection from one of the NICs in the ESXi server must have taken a longer route to the SAN (via the ISL) rather than going to the SP connected to the same switch ???

We have checked the switch, VMware and SAN config and no issues found... yet...

Has anyone else experienced similar issues with Dell kit and VMware?

Cheers

Neil

15 Replies
J1mbo
Virtuoso

Hi, the switches need a beefy connection between them - a 4-port LAG at minimum, and preferably a 10G+ stack.

For the EQL only one controller is active, so each server and each PS series controller needs to be connected to both switches (one connection to each).

The EQL virtual interface IP and the two physical NIC IPs, as well as both vmkernel port IPs on each server all need to be on the same subnet.

To implement the load balancing more fully, the IOPS value needs to be reduced from the default of 1000 to 3 for peak throughput, and this needs to be applied via a host start-up script since the value is not persisted by ESX at present - see this thread for details and sample code.
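For reference, the sort of start-up script meant here looked roughly like this on ESX 4.x. This is only a sketch: the naa.6090a0 prefix is the usual EqualLogic device prefix, but verify it against your own LUN IDs before running anything.

```shell
# Set the Round Robin path-switching threshold to 3 I/Os for every
# EqualLogic device (EQL volumes typically show up as naa.6090a0...).
# The ':' filter skips partition entries like naa.xxxx:1.
for dev in $(ls /vmfs/devices/disks/ | grep '^naa.6090a0' | grep -v ':'); do
    esxcli nmp roundrobin setconfig --device "$dev" --type iops --iops 3
done
```

Because ESX 4.x does not persist this setting across reboots, a loop like this was typically dropped into a host start-up hook such as /etc/rc.local.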

Hope that helps.

Please award points to any useful answer.

s1xth
VMware Employee

You should not have any issues. From what you are describing, it sounds like you have a wiring problem. The system should be entirely redundant from the storage through the switches to the servers. One connection from ETH0 to switch 1, one connection from ETH1 to switch 2. The management network can go to another switch or subnet if you want out-of-band management. On the server side, you should have one connection from one NIC to switch 1 and one connection to switch 2. In this configuration there is ALWAYS a path to the storage in a redundant config.

I have your setup, except with one PS4000 - the configuration would be identical. Also, a 4Gb LAG would be fine for two PS4000s since they can only handle 2Gb of throughput at a time. If you want to be safe a 6Gb LAG would be an option, but again, that is committing ports that aren't going to be used; it just gives more headroom, and in some cases LAGs can actually hinder performance (if the switches are stacked instead, it's fine).

Hope that helps!

http://www.virtualizationimpact.com http://www.handsonvirtualization.com Twitter: @jfranconi
neilynz
Contributor

Thanks J1mbo,

Yup, confirmed we've done all that - apart from the LAG - including dropping the IOPS to 3. Our issue sounds similar to

http://communities.vmware.com/thread/215039?start=60&tstart=0

But we are currently waiting on Dell testing and will update with what they find.

Cheers

Neil

wolfwolf
Contributor

Hi Neil,

Did you find a fix for your issue?

Thanks.

JonathanBarley
Contributor

I have seen exactly this and also share your concerns about load balancing across EqualLogic NICs.

I have raised it with EqualLogic but have had no response so far; have you made any progress?

Thanks, Jonathan.

neilynz
Contributor

Hi - No, the issue is still not fixed - apologies for not getting back sooner, I didn't notice the post had been updated. I have been off on leave this week, but apparently there have been some developments. I will update the post when I am back at work tomorrow.

Thanks Again

Neil

ChrisSlack
Contributor

Hey, thanks for the reply. So have you guys just been running without redundancy? I've been working on this issue with Dell for 3 months. They seem to think it's an issue with ESX.

If you notice, whichever switch you fail that loses all of your paths, vmk2 is probably attached to a vmknic on that switch. vmk2 seems to be the default route for iSCSI traffic - it could be different on yours, but run esxcfg-route -l and check which vmk your iSCSI is routed through; I can almost guarantee it will be connected to the physical switch that, when powered down, breaks redundancy.

What does seem to work is having a single PS4000. If Dell doesn't resolve our issue by the end of this week we are demanding they send us out a PS6500 series to replace our two PS4000s so we have some redundancy. Being located in Austin we have actually had several Dell engineers onsite - they even brought out their whole lab configuration - and they still were not able to resolve it. At this point I cannot go any higher for technical help at Dell; engineering has submitted a defect to VMware, but my company can't wait for that turnaround to happen. If Dell sold you that configuration I'd be asking for a hardware replacement, because they sold you a certified solution that was never tested.
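For anyone following along, the checks described here can be run from the ESX console roughly like this (the vmk numbers and names will differ per environment):

```shell
# Show the VMkernel routing table - note which vmk interface
# owns the default route used by iSCSI traffic
esxcfg-route -l

# List the vmknics with their IP addresses and port groups
esxcfg-vmknic -l

# Map each vSwitch/port group to its physical vmnic uplinks,
# so the suspect vmk can be traced back to a physical switch
esxcfg-vswitch -l
```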

neilynz
Contributor

Hi Chris,

I work for an IT services company and have only put this exact config in at one customer. After spending weeks with very disappointing Dell techs (like you, they have been over to our offices here and failed to fix it), they suggested we just select the other vmnic on the iSCSI vSwitch as 'standby' rather than 'unused'. This then survives a switch failure (as I would expect it to). That was back in February, and to date they have made no inroads as to why all of the paths fail when one vmnic is 'Active' and one is 'unused' on each iSCSI port group, as documented in the Dell and VMware configuration guides and using the config that was sold to the customer.

The SAN is now fully in prod, as the customer couldn't wait for Dell to figure out exactly what was wrong and made the decision to push on. Unfortunately that means I can't do any further testing, but I would suspect that your observation on the esxcfg-route output is correct.

Apparently our local Dell office in Sydney now has it working, but I have suspicions they have set it up differently (e.g. 1 PS4000 SAN). We are meeting with them later this week, so I will advise on the outcome.

The whole experience with Dell has been very disheartening - I am in Auckland and support here is next to none !! But as I say, I will let you know what they have tested.

vmn00by
Contributor

Was there ever any resolution to this? I'm having a similar issue where I can't reproduce the results the documentation states should be seen for connections between my 2 Dell blades and my 1 PS4000.

My post: http://communities.vmware.com/message/1594684

neilynz
Contributor

Hi there,

Unfortunately not, no, and it has gone very quiet at Dell HQ. I passed on findings that other people out there have experienced similar issues, so I guess they are still battling with it. I will be chasing them up in the next few days though, and will advise on the outcome.

Thanks

Neil

ChrisSlack
Contributor

I worked with several Dell PSEs who opened a bug ticket with VMware. After a few months of having Dell PSEs onsite working through the issue - even bringing their own lab onsite - VMware has stated that the resolution, which is not a workaround but the proper way to set up failover between two iSCSI arrays, is to set an additional NIC on the vSwitch to standby, as a user above mentioned. For example, I have a 1:1 ratio for my iSCSI vSwitch network configuration but moved a second NIC from unused to standby. That simple, and it resolved the failover issues I was having.

There is not going to be a fix; this is supposedly the best practice from VMware.

Let me know if you have any more questions.
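On ESXi 5.x that unused-to-standby change can also be scripted instead of made in the vSphere Client. This is only a sketch with made-up port group and uplink names (iSCSI1, vmnic2, vmnic3) - substitute your own:

```shell
# Move the second uplink from 'unused' to 'standby' on the iSCSI
# port group, leaving the first uplink active
esxcli network vswitch standard portgroup policy failover set \
    -p iSCSI1 --active-uplinks vmnic2 --standby-uplinks vmnic3

# Confirm the resulting teaming policy
esxcli network vswitch standard portgroup policy failover get -p iSCSI1
```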

Thanks

neilynz
Contributor

Thanks Chris,

Yup, been running with that config for 8 months with no issues. I asked VMware support and they didn't have any major issues with it. If that is the case, both VMware & Dell need to update their docs. However, my last update from Dell was that they have set up a lab in Sydney and fixed the issue. I've asked for the build docs to confirm they have replicated the environment.... still waiting.......

I don't have any issues with the active \ standby configuration as it stands, but I await word on the lab config. If I need to set up a similar hardware environment again, I'll be using that config.

Thanks for your post

Neil

s1xth
VMware Employee

This is a very interesting post. I would be interested in hearing what Dell has to say about this issue now that the Equallogic MEM has been released. I have my NICs configured in a 1:1 configuration with the second adapter "unused". I tested failover over the weekend by pulling power to a switch, and everything continued to work - the paths stayed up (one path failed, of course). I was previously using the VMware Round Robin and also never had problems. I will be meeting with EQL next week at VMworld; I will add this to my list of questions.

If anyone has any questions and would like some official answers please let me know and I will try to get some clarity on this subject.

Jonathan

Blog: www.virtualizationbuster.com

Twitter: s1xth

neilynz
Contributor

Hi Jonathan,

Thought you would like to know the issue is finally resolved after exactly 1 year !!

We upgraded the firmware on the Equallogic to the latest version, released in December, and this fixed all our issues. A switch failover now works just fine. The case is now back in Dell's court to pay for all my time spent trying to figure it out.....

Cheers

Neil

J1mbo
Virtuoso

Just a quick note: VMware and Dell recognised this issue and it's documented in VMware KB 2007829, "Software iSCSI based EqualLogic array cannot failover". The officially proposed solution is a separate VMkernel port on the iSCSI LAN, but not bound to the software iSCSI adapter, which can then be bound to multiple NICs (hence 'high availability vm-kernel port').

The EqualLogic best practices guide for vSphere 5 (TR1075) also reflects this; however, their vSphere 4.x guide (TR1049) has not been updated and should therefore be considered incorrect (I have just lodged a request for it to be updated, though).
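For reference, the 'high availability vm-kernel port' amounts to something like the following on ESX/ESXi 4.x. The port group name, vSwitch and IP below are illustrative examples, not taken from the KB:

```shell
# Add a port group for the extra high-availability VMkernel port
esxcfg-vswitch -A iSCSI-HA vSwitch1

# Create a VMkernel interface on the iSCSI subnet; do NOT bind
# this vmk to the software iSCSI adapter, so it remains free to
# fail over across multiple NICs
esxcfg-vmknic -a -i 192.168.10.50 -n 255.255.255.0 iSCSI-HA
```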
