VMware Cloud Community
freejak04
Contributor

Multihost with DroboElite disconnecting during high i/o

We purchased a DroboElite for our QA environment, which currently has three ESXi 4 hosts. Under heavy load, iSCSI performance begins to degrade rapidly, to the point where the LUN usually gets disconnected from the host. Sometimes the host reconnects automatically; other times a reboot of the DroboElite is required. I've been back and forth with Data Robotics for weeks troubleshooting the issue without any success. I've made the changes to the HB timeout settings in the 'hidden' console as suggested by DR, and I've also tried connecting to two different Gigabit switches (Dell PowerConnect). Nothing has helped so far.
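
For anyone who wants to suggest something specific: when the LUN drops, this is roughly what I run from the ESXi Tech Support Mode console to see what the host thinks happened (vmhba33 is just the software iSCSI adapter on my hosts; yours may be numbered differently):

    # list the paths the host currently sees to each LUN
    esxcfg-mpath -l
    # rescan the software iSCSI adapter after a drop
    esxcfg-rescan vmhba33
    # confirm the VMFS datastore is visible again
    esxcfg-scsidevs -m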

Does anyone have experience with these units? Any suggested configuration changes I can make?

Thanks!

73 Replies
aseniuk
Contributor

Well, we have given up on the DroboElite and are now in the process of returning the units to Drobo for analysis.

My experiences are just like everyone else's, but what I would like to know is why it locks up. I started two installs on a DroboElite, it froze, and I left it for two days, but it was still locked up. Can anyone explain why it does that?

In any case, we have moved to much larger solutions: EqualLogic / NetApp. Just so you all know, these boxes might cost a ton, but they work very well.

jwcMyEdu
Contributor

Let me know how the VessRAID works for you. The price is good, and I think I will soon be in the market.

Has anyone seen an issue using a single switch and breaking it into two VLANs instead of using two physical switches? I am looking at a Dell PowerConnect 5424, but it's both a lot more money than I have spent so far and may be more ports than I need.

golddiggie
Champion

"Has anyone seen an issue using a single switch and breaking it into two VLANs instead of using two physical switches? I am looking at a Dell PowerConnect 5424, but it's both a lot more money than I have spent so far and may be more ports than I need."

One of the production environments I built had just a single HP ProCurve 2900 series switch for the iSCSI/vMotion/HA/DRS traffic. It was segmented into two VLANs (one for iSCSI, the other for everything else). The VM traffic was routed to the core switch (a different subnet than the 2900 was on) to keep things clear. Besides, with just 24 ports to use, we didn't have much room for anything else. We had six NICs on each host (three hosts total), plus the ports on the EqualLogic PS5000 and PS6000 SANs (dual controllers on each). We added an additional quad-port NIC to each host when we purchased the second ProCurve switch for redundancy.
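
For reference, the single-switch split is only a handful of lines from the ProCurve CLI; this is a rough sketch (the VLAN IDs and port ranges are placeholders, not what we actually used):

    configure
    vlan 10
       name "iSCSI"
       untagged 1-8
       exit
    vlan 20
       name "vMotion-Mgmt"
       untagged 9-16
       exit
    write memory

The iSCSI vmkernel ports and the SAN controllers all sit untagged in the iSCSI VLAN; everything else goes in the other.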

In a production environment, you're better off building redundancy into every aspect of the VMware environment. Don't allow a single point of failure to bring everything down. I would even go so far as to use different power sources for alternate items (such as feeding the redundant PSUs in the hosts from both sources). It might seem like overkill until you experience what happens when that single point fails and everything stops.

I would also recommend looking at HP ProCurve switches before Dell switches. The warranty period alone is reason enough (lifetime on the HP switches). Plus, the HP switch CLI is well known and well documented. Finding companies that use either HP or Cisco switches (which have always cost more) is easy; locating decent-sized companies using Dell switches is not. I know of many companies that are exclusively HP or Cisco, but I have yet to come across any that use Dell switches, either in part or entirely.

VMware VCP4

Consider awarding points for "helpful" and/or "correct" answers.

jwcMyEdu
Contributor

I think my problem is solved....

It was the switches! Last night I took the switches out of the equation and used two crossover cables to connect one ESX host directly to the Drobo - it FLEW. Previously, migrating one VM to local storage while booting another would kill the system; last night I copied 8 VMs simultaneously while booting and then working on a 9th. The PowerConnect 2716s are NOT capable of handling iSCSI traffic.

I am going to order a PowerConnect 5424 - "the world's first iSCSI optimized switch" - and see if that makes things happy.
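
Once it arrives I'll try the jumbo frame and iSCSI optimization settings from the CLI. I'm writing this from memory of the 5424 CLI guide, so treat the exact commands as assumptions to double-check against the manual (I believe jumbo frames need a switch reload to take effect); something like:

    configure
    port jumbo-frame
    iscsi enable
    exit
    copy running-config startup-config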

Cheers,

-John

golddiggie
Champion

"The PowerConnect 5400-series is the world’s first switch portfolio to automatically optimize itself for iSCSI storage"

At least you're finally getting a switch WITH a CLI, not just web management. That said, IF you had a full CLI on the old switch, you probably could have applied the optimization settings to make things better... Don't assume that the new Dell switch will make things much better... There's a reason why Dell switches are cheaper than HP and Cisco network products...

VMware VCP4

Consider awarding points for "helpful" and/or "correct" answers.

jwcMyEdu
Contributor

I love Cisco, but it may be out of my range for this project. I heard from a few people to stay away from Dell switches, so I looked into HP. I found an older forum post discussing the ProCurve 1800 vs. the 2824, and they mentioned that the 2824 can't support both jumbo frames and flow control at the same time. I was deciding between the 2824 ($400 on eBay) and the 2900 (fewer available, and $1400-$1600 on eBay), as the 2900 does support both at once. Any thoughts?

golddiggie
Champion

I'm using a ProCurve 2510G-24 on my LAN (home lab setup)... I went into it and made the QoS change today, and speeds have never been better... I had already enabled jumbo frames (on all ports, since pretty much everything on the switch is capable). I wouldn't get the ProCurve 1800 series since it doesn't offer a CLI. I didn't want to spend the funds on the 2900 series, which is why I went with the 2510G...
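
For what it's worth, the jumbo frame and flow control changes are only a couple of commands from the 2510G's CLI. Roughly what I did (the VLAN ID and port range are examples from my setup; jumbo is enabled per VLAN on ProCurve switches, flow control per port):

    configure
    vlan 1 jumbo
    interface 1-24 flow-control
    write memory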

eBay has some 2510G-24s listed (new, sealed) for under $650... The 48-port version runs almost twice that (makes sense), starting around $1150... When I need to add more ports to my LAN, I know I'll do my best to locate another 2510G so that I can stack them together.

VMware VCP4

Consider awarding points for "helpful" and/or "correct" answers.

jwcMyEdu
Contributor

And with the 2510 you get jumbo frames and flow control? Are they both needed?

golddiggie
Champion

It has the ability to use both... Whether you need them depends on what you're doing...

HP ProCurve 2510G-24 Switch (J9279A) specifications

You gain a LOT of options with this level of switch. With all the documentation, such as the 244-page 'Advanced Traffic Management Guide' and the 422-page 'Management and Configuration Guide' found here, you should be able to locate the settings to make things function properly within your environment... Having the CLI option is the only way to set a good number of those settings, too. The CLI for the HP switches is a good thing to know, or at least be familiar with - right up there with knowing the CLI for Cisco switches...
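
If you want to see what's already configured before changing anything, a few read-only CLI commands go a long way: show config for the saved configuration, show vlans to see which VLANs have jumbo enabled, and show interfaces brief for per-port status.

    show config
    show vlans
    show interfaces brief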

VMware VCP4

Consider awarding points for "helpful" and/or "correct" answers.

jwcMyEdu
Contributor

Wow - thanks. We'll check it out. This definitely hits the right price point. Is it hours of configuration to get it just right, or just a matter of changing a few settings?

golddiggie
Champion

The time it takes depends on which method you use... You should be able to get it all set (as long as you plan things a little ahead of time) in under an hour. Send me your email address and I'll send you the full list of CLI commands (for HP switches) that I was given... You'll just need to become a little familiar with how to make changes there. Some of the items you can do via the web interface, but you'll want to learn the CLI (and telnet menu-driven) commands too.

The HP ProCurve switches are enterprise-grade solutions. The lifetime warranty is also a great feature (especially since it's pretty easy to find 10+ year old ProCurve switches still in production environments).

The web interface also has a helpful 'help' function. It explains what you're seeing on each page and what the different configuration options mean... Very helpful for those who have not done this before... You also have help functions inside the telnet menu and CLI modes...

VMware VCP4

Consider awarding points for "helpful" and/or "correct" answers.

jwcMyEdu
Contributor

I PM'd you with my email - thanks again.

StuartCUK
Contributor

Hi everyone

I've been monitoring this thread for a few weeks. I have exactly the same issue as the original poster. My environment is significantly simpler, however.

I am attempting to build a small development environment.

I have a single ESXi 4 server attached with a crossover cable to iSCSI port 2 on the Drobo (169.168.0.X); iSCSI port 1 (10.10.0.100/24) on the Drobo is connected to my LAN for management purposes only. The 2TB volume I created on my DroboElite is not set to multi-host, and it is not required to be.

My ESXi server is a whitebox based on an AMD Phenom II X4 CPU and 8GB of RAM. I am using two Intel CT Gigabit network cards (on the HCL): again, one card on the LAN and one card for Drobo/iSCSI only.

I have an existing Linux VM that I brought into the environment via vCenter Converter. This machine is working fine; no matter how hard it hits the disks, I have never seen any problems.

However, whenever I try to build a new Windows VM from scratch in the environment (Win 2008 R2 x64), within a few minutes of the installation starting I see disconnects in the host event log. Generally the LUN will reconnect within a few seconds and the installation may continue; sometimes I'll get as far as having a complete guest. As soon as any "heavy" I/O starts in the guest, I see further disconnects. Windows gets quite upset when you remove access to its system volume, and within a few minutes of constant connects/disconnects the server is trashed. On a few occasions the Drobo has locked up altogether: it remains pingable, but no iSCSI or Drobo Dashboard management works, and only the power switch will bring it back to life.

So, in this scenario (which is "configuration option 2" from the DroboElite best practice guide), there is no switch involved, yet I still seem to be seeing the same problems.

Note that the other elements of this guide have also been followed, including the MTU size, HBATokenTimeout, and partition alignment.
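
For completeness, this is roughly how I set and verified the MTU on the ESXi side; the vSwitch name, port group name, and the 169.168.0.x addresses below are placeholders for my crossover network, so substitute your own:

    # set a 9000-byte MTU on the iSCSI vSwitch
    esxcfg-vswitch -m 9000 vSwitch1
    # on ESX(i) 4 the vmkernel port has to be recreated to pick up the new MTU
    esxcfg-vmknic -d "iSCSI"
    esxcfg-vmknic -a -i 169.168.0.10 -n 255.255.255.0 -m 9000 "iSCSI"
    # verify end to end with don't-fragment, jumbo-sized pings to the Drobo's iSCSI port
    vmkping -d -s 8972 169.168.0.100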

I have suggested to my Drobo reseller that the Drobo be removed and swapped for a Promise VessRAID solution (3U, 16-bay SATA and SAS, iSCSI, and it seems to be cheaper too?), but Drobo and my reseller are suggesting I am likely to see the same problems on that hardware.

So right now I have a quad-core, 8GB ESXi box plus a DroboElite running a single 512MB Linux virtual machine. Probably the most expensive Linux test server ever!!

Any thoughts or suggestions? What have I missed?

jwcMyEdu
Contributor

Try connecting the second iSCSI port from the Drobo to the machine (use a dedicated Gigabit card for each iSCSI connection). With only one connection, I'm not surprised that you get the problem with two machines running like that.
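
If you do connect the second port, the vSphere 4 iSCSI SAN Configuration Guide covers binding a vmkernel port per NIC to the software iSCSI adapter. A rough sketch (the vmk and vmhba names are examples; check yours with esxcfg-vmknic -l):

    # bind one vmkernel port per dedicated NIC to the software iSCSI adapter
    esxcli swiscsi nic add -n vmk1 -d vmhba33
    esxcli swiscsi nic add -n vmk2 -d vmhba33
    # confirm the bindings
    esxcli swiscsi nic list -d vmhba33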

Rumple
Virtuoso

A single iSCSI connection has plenty of bandwidth under normal circumstances.

That is especially true if the DroboElite does not have a BBWC (battery-backed write cache) on its internal controller.

Every one of those low-end devices I've looked at has no controller battery and is just not designed to handle large amounts of writes... since the controller runs in write-through mode.

You might as well build a system with an onboard RAID controller, install Windows, and create a software RAID... it's the same performance.

jwcMyEdu
Contributor

What you're saying makes sense, but I can tell you that my current setup is almost exactly what I was suggesting (I'm running a 2950, but currently crossover-cabled to the DroboElite). The only reason I'm doing this now is that my old switches were pieces of junk and I'm waiting for the new one to come in - so for now I'm running two crossover cables - and it's fast, and I can move 8 machines from local storage to iSCSI and back.

StuartCUK
Contributor

I agree that two bonded iSCSI network cards would improve performance.

But Drobo doesn't let you bond its iSCSI network ports, so you can't achieve 2Gb/s of iSCSI throughput anyway; you can have multipath redundancy only.
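
Even with redundancy only, it's worth checking how ESX(i) 4 is treating the two paths. A rough sketch (the naa identifier is a placeholder for whatever your Drobo LUN shows up as, and whether round robin behaves well on the Drobo is something I can't vouch for):

    # list devices and their current path selection policy
    esxcli nmp device list
    # example: switch a device to round robin (VMW_PSP_MRU or VMW_PSP_FIXED are the alternatives)
    esxcli nmp device setpolicy --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR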

And in any case, this is a documented configuration from Drobo's own whitepaper. I could live with lower performance from the device in this configuration (if indeed that's the case), but disconnects? Timeouts? Surely this shouldn't be occurring on this combination of hardware?

Do I have a faulty Drobo? Or is this just how they behave, and should I ditch it?

StuartCUK
Contributor

Just one other point

Do you think it's acceptable, either from a VMware or a Drobo point of view, that you should see disconnects because of your choice of Gigabit switch? Surely the Drobo/VMware combination should be able to "negotiate" in these situations and not just die?

jwcMyEdu
Contributor

It does make you curious as to what it means to be "VMware Certified". That said...

I don't think my PowerConnect 2716s were "certified", and that's as important as anything. There is a difference between a $140 Gigabit switch and a $1500 Gigabit switch - namely, how much traffic can go across the backplane. It's one thing to achieve a Gigabit connection, another entirely to sustain it across multiple sources. I'm not offended that I have to change switches.

To a certain extent you do get what you pay for - and the DroboElite is the cheapest way to get a certified SAN. It'll work, but don't expect it to work as well as EqualLogic.

-John

Formatter
Enthusiast

I have a slightly different setup but it works great with no disconnects at all.

3 ESXi servers (Dell T610, one 5520 CPU and 12GB of RAM each, leaving room for expansion)

1 Netgear GS724T Gigabit switch with VLANs and jumbo frames (about $400.00)

1 Iomega StorCenter ix4-200d 4TB with jumbo frames and NTFS, plus iSCSI (about $800.00)

I have had no problems, but I would like the drives in the Iomega to be faster; even so, I have had great success running production and test VMs on this setup.
