Solved: Multihost with DroboElite disconnecting during high i/o

freejak04 · ‎03-17-2010

We purchased a DroboElite for our QA environment, which currently has three ESXi4 hosts. Under heavy loads, the iSCSI performance begins to degrade rapidly, up to the point where the LUN usually gets disconnected from the host. Sometimes the host will automatically reconnect, and other times a reboot of the DroboElite is required. I've been back and forth with Data Robotics for weeks troubleshooting the issue without any success. I've made the changes to the HB timeout settings in the 'hidden' console as suggested by DR and also tried connecting to two different gigabit switches (Dell PowerConnect). Nothing has helped thus far.

Does anyone have experience with these units? Any suggested configuration changes I can make?

Thanks!

BradMDRI · ‎06-25-2010

Hi All,

I wanted to let everyone know that we have been testing a fix to the disconnect problem on the DroboElite and have posted the new DroboElite firmware, version 1.0.3, on our website at www.Drobo.com. If you have Drobo Dashboard running and monitoring the DroboElite you should get prompted to update the firmware automatically. If you are not running Drobo Dashboard on a regular basis then you should either run Drobo Dashboard to get the automatic update for your DroboElite or manually download the new 1.0.3 firmware and follow the manual update procedure that you will find on our website.

I want to thank everyone on this thread for helping us work through this issue and beta testing the new firmware. We think we have solved this issue and we're looking forward to your feedback on this new release. Please feel free to post your results here or send me a direct email at bmeyer@datarobotics.com with your comments.

Brad Meyer

DroboElite Product Marketing Manager


aseniuk · ‎03-22-2010

I am also having the same issue. I started off thinking it was the DroboElite, so I had Drobo send me a new DroboElite, but it was having the same issues. I then looked at the servers I had attached; both servers are of different makes, so that could not be a coincidence. I also have PowerConnect 5448 switches; I removed those from the equation and started getting better performance. I thought it might be ESXi or vSphere, but I had 2 different versions installed on 2 different servers, so I was starting to rule that out until today. The Drobo now doesn't respond unless I reboot it, so I am back to thinking it is a vSphere issue.

freejak04 · ‎03-22-2010

Next time the drobo locks up, see if you can still access it with the drobo dashboard. When mine does, I cannot get it to show up on the dashboard either via usb or network connection. Doesn't look like a vSphere issue to me.

BradMDRI · ‎03-25-2010

Hi All, We have been seeing this problem pop up with a limited number of customers, and one of the major issues is OS alignment within VMware. This affects WinXP, W2k3, and most Linux implementations and is common throughout many VMware environments. It manifests itself on ESX servers and storage systems as an overloaded process, slow performance, and, in our case, some iSCSI disconnects. There are several VMware documents on the topic; here are some links. Please be sure that your OS installation adheres to VMware best practices to limit your exposure to these issues.

http://www.vmware.com/pdf/esx3_partition_align.pdf

http://www.vmware.com/pdf/Perf_Best_Practices_vSphere4.0.pdf

http://www.vmware.com/files/pdf/vmfs_rdm_perf.pdf
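
To illustrate why a misaligned guest partition can turn into extra load on the storage side, here is a minimal shell sketch that counts how many storage chunks a single guest I/O touches. The 64 KiB chunk size, the 512-byte sector size, and the function name chunks_touched are assumptions made for illustration; they are not published DroboElite internals.

```shell
#!/bin/sh
# Assumptions (not from Data Robotics): 512-byte sectors, 64 KiB chunks.
SECTOR_BYTES=512
CHUNK_BYTES=65536

# chunks_touched START_SECTOR IO_BYTES
# Prints how many chunks an I/O of IO_BYTES covers when it begins at the
# first byte of a partition that starts at START_SECTOR.
chunks_touched() {
    first_byte=$(( $1 * SECTOR_BYTES ))
    last_byte=$(( first_byte + $2 - 1 ))
    echo $(( last_byte / CHUNK_BYTES - first_byte / CHUNK_BYTES + 1 ))
}

chunks_touched 63 65536    # partition starting at sector 63: prints 2
chunks_touched 128 65536   # partition starting at sector 128: prints 1
```

Under these assumptions, the same 64 KiB write costs two chunk accesses instead of one when the partition starts at sector 63, which is roughly the effect the alignment documents describe.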

Rumple · ‎03-25-2010

Do the DroboElites have a BBWC on their RAID controller? From what I've read, they are mainly a software RAID system running on standard disks. If so, you are probably overwhelming the controller, since everything would be in write-through mode, causing pretty bad performance issues under load.

DataAnywhere · ‎04-01-2010

We have the lockup issue too. The DroboElite is reachable via ping, but that's it. When this happens, our only option is to hit the power button on the DroboElite, wait for it to power down, then power it back up again. This happens most often during reboots of our main virtual machine that connects via iSCSI to the DroboElite volumes. The VMFS filesystem of this VM is actually stored on a local disk of the ESXi server.

The document refers to re-partitioning and formatting volumes on the device.

The DroboElite actually performs the volume creation and NTFS formatting from within the Dashboard. These volumes are then connected via iSCSI from the Windows VM. We remember either from a support call or from reading (unfortunately I can't find references) that you should never perform any volume management or formatting of these NTFS volumes using Microsoft's disk management tools. Basically, all volume management and formatting is done in the Drobo Dashboard. The only exception is the VMFS partitions. I believe we came to this conclusion when we attempted (during testing) to create a software mirror within the Windows Disk management between 2 Drobo units.

- Geoff

freejak04 · ‎04-01-2010

Hi Geoff. Are you still experiencing lockups after aligning the VM partitions? We're seeing lockups just from transferring files to the drobo using VMWare's datastore browser.

DataAnywhere · ‎04-01-2010

We haven't performed the alignment because we are under the impression that the DroboElite Dashboard is the only way to manage volumes and partitions. Also, that alignment document states to not align boot partitions. All "data" partitions are actually either a) stored as VMDKs locally on the ESXi server (ie. not on the DroboElite) or b) on the DroboElite as iSCSI connected volumes via the Windows Server VM.

Most of our lockups happen when we reboot the Windows VM that contains the NTFS volumes connected via iSCSI on the DroboElite. This VM is on local storage of the ESXi server. The workaround is to pause all VMs (except the main Windows VM) first, and only then can we reboot the main Windows VM. It's quite a pain when installing updates, but it's a workaround that we've found consistently works. The DroboElite is a production unit, offsite from us, so when it locks up, it's an out-of-office trip for us.

The lockup occurred this morning when I attempted to take a snapshot of a virtual machine that is stored on the DroboElite. This is only the second time we've run into this (lockup when taking a snapshot). Usually, it only happens in the above scenario.

We have not made any attempts to align the VMFS datastores, and perhaps we should. I'll need to read that document again. What we've been planning is to run all the virtual machines on the local ESXi datastore, rather than the DroboElite, and use the Elite only for the windows iSCSI volumes and backups of the VMs.

Our current setup is:

2 ESXi 4.0 servers with identical hardware.

ESX01 contains:

- the main Windows Server VM (local datastore)

- a small linux VM (on the DroboElite) - 2 vmdks: 1 x 10GB and 1x200GB

ESX02 contains :

- a 32-bit Windows App server (on the DroboElite) - 1 x 60GB VMDK (system drive)

- a 32-bit XP client (on the local datastore)

- a couple of test environments that we only run when needed - usually off.

We quit using vCenter with HA, since we started moving the VMs to the local datastore to improve performance, and found no benefit to HA that doesn't include storage vmotion. (Just our opinion anyway.)

In any case, we'll look into aligning the VMFS partitions (if it's accepted practice from Drobo).

EDIT - added :

According to fdisk -lu output:

The DroboElite starts at position 128, and the alignment documentation states that this is aligned. Oddly, the local storage is not aligned. I don't see how aligning the local storage would affect the DroboElite.
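
For anyone who wants to check their own fdisk -lu output the same way, here is a minimal sketch of the arithmetic. It assumes 512-byte sectors and a 64 KiB alignment boundary; the function name alignment_status and the sector 63 example are illustrative and were not taken from the output above.

```shell
#!/bin/sh
# Assumptions: 512-byte sectors and a 64 KiB (65536-byte) alignment boundary.

# alignment_status START_SECTOR
# Prints "aligned" when the partition's byte offset is a multiple of 64 KiB,
# otherwise "misaligned".
alignment_status() {
    if [ $(( $1 * 512 % 65536 )) -eq 0 ]; then
        echo "aligned"
    else
        echo "misaligned"
    fi
}

alignment_status 128   # start sector reported for the DroboElite volume
alignment_status 63    # hypothetical example: a common older default start
```

Sector 128 works out to exactly 65536 bytes, which is why it reports as aligned; sector 63 works out to 32256 bytes and does not.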

- Geoff

Message was edited by: DataAnywhere (added the output information for fdisk -lu)

venom78 · ‎04-19-2010

I see that I am not the only one having problems with the Drobo Elite. I too am experiencing "timeouts". The Drobo started to disconnect whenever I perform cloning or Storage VMotion, or even when nothing is running. The VMFS doesn't need to be aligned because it is aligned during creation. I have performed alignment on one OS, and it is a pain to align a VM that already has an OS installed in it. At this point, I think the Drobo is simply not fit for a VMware environment.

Here is my setup:

2x ESX4.0

8x CentOS vms

2x Windows 2003 Server vms

1x Windows XP vm

2x HP Procurve 3500 switch with VLAN dedicated for Drobo Elite.

CCata · ‎04-19-2010

Does everyone here have jumbo frames enabled? I've run into similar issues with high I/O on network storage devices in general due to this. Not saying it's definitely the problem, but it's worth a shot: disable jumbo frames if they are enabled (typically 9000 bytes). That, or make sure the MTU is set the same across the board and is supported by your switches.
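
One way to check whether a 9000-byte MTU really holds end to end is to send a ping that is not allowed to fragment. The sketch below only computes the payload size and prints a candidate command rather than running it. The address 192.0.2.10 is a placeholder, and the availability of vmkping's -d (don't fragment) and -s (size) options on a given ESX/ESXi build is an assumption to verify locally.

```shell
#!/bin/sh
# icmp_payload_for_mtu MTU
# Largest ICMP echo payload that fits in one IPv4 packet of size MTU:
# MTU minus 20 bytes of IPv4 header (no options) and 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Placeholder address; replace with the storage unit's iSCSI IP.
TARGET_IP="192.0.2.10"

# Print (do not run) a command to try from the ESX/ESXi console.
echo "vmkping -d -s $(icmp_payload_for_mtu 9000) $TARGET_IP"
```

If a ping of that size fails while a default-size ping succeeds, some hop (NIC, vSwitch, physical switch, or the storage unit) is probably not passing jumbo frames.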

freejak04 · ‎04-20-2010

No jumbo frames here. Tried everything in the book. The Elite will not stay connected to the VMWare host. Truly disappointed at Data Robotics for marketing this device as VMFS compatible when it is clearly not based on all the reports around the web.

I grew tired of dealing with them. For now, the Elite is connected to a linux box (iscsi) which is being shared via NFS. In this configuration, the Drobo does not drop connections although it is still slow and just barely useable. We are using it as a backup for the vm hosts.

mmoran · ‎04-20-2010

Hi Everyone,

Brad from DRI above mentioned VM alignment and how important it is for the best experience with the DroboElite.

Following Brad's post there was some mention that since VMware states to not align boot disks that it is not important.

VMware recommends VM alignment for any high performance activities. DRI has been able to reproduce these disconnections only with misaligned VMs. When the entire VM is aligned (Boot and Data), we are not able to reproduce these disconnects.

Some of the issues above have been seen during OS updates which are massive writes to the boot disk. This could be caused by boot disk misalignment.

For those of you willing to test this theory, please deploy a Windows 2008 VM and see if you have the same issues with disconnects during Windows Update. Windows 2008 is aligned by default, unlike Linux and Windows 2003.

If the above experiment fails please contact DRI support and we will be happy to help.

Please follow the best practice guide found at : http://www.drobo.com/pdf/DroboElite_VMware_Best_Practices.pdf

A very important setting is the following, it should be set on all ESX/ESXi hosts.

esxcfg-advcfg -s 14000 /VMFS3/HBTokenTimeout

If you want to check the setting run:

esxcfg-advcfg -g /VMFS3/HBTokenTimeout
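
As a rough sketch of how the two commands above could be combined, the script below reads the current value and changes it only when it differs from 14000. It assumes the -g output ends with the numeric value (for example "Value of HBTokenTimeout is 5000"); that output format and the helper names hb_value_from_output and hb_action are assumptions, so check them against your own host before relying on this.

```shell
#!/bin/sh
# Sketch: set HBTokenTimeout to the recommended value only when it differs.
TARGET=14000
OPTION=/VMFS3/HBTokenTimeout

# hb_value_from_output LINE
# Prints the last space-separated word of LINE. Assumes the -g output looks
# like "Value of HBTokenTimeout is 5000" (format is an assumption).
hb_value_from_output() {
    printf '%s\n' "${1##* }"
}

# hb_action CURRENT_VALUE
# Prints "ok" when the value already matches TARGET, otherwise "change".
hb_action() {
    if [ "$1" = "$TARGET" ]; then
        echo "ok"
    else
        echo "change"
    fi
}

# Only touch the setting where esxcfg-advcfg exists (on the ESX/ESXi host).
if command -v esxcfg-advcfg >/dev/null 2>&1; then
    current=$(hb_value_from_output "$(esxcfg-advcfg -g "$OPTION")")
    if [ "$(hb_action "$current")" = "change" ]; then
        esxcfg-advcfg -s "$TARGET" "$OPTION"
    fi
fi
```

Run it on each ESX/ESXi host in turn, since the setting is per host.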

Thanks for being patient.

--Mike

venom78 · ‎04-20-2010

Well...it freezes during the OS installation on a VM that has been aligned.

golddiggie · ‎04-20-2010

Simply amazing how many issues there are with products like the Drobo line when trying to use them with ESX/ESXi. Meanwhile, the really good iSCSI products have none of these issues. I see it as a true case of getting what you pay for. Spend just a little (such as on the Drobo line) and you're not getting something that's going to do more than a mediocre job, at best. Spend the money to get a quality SAN (not a NAS) and you'll not be hounded by performance issues, or by the manufacturer pushing blame off onto other products, technologies, or settings you made (such as using programmed defaults).

My original impression of the drobo product lineup has not changed from when it originally came onto the market... Sold cheap because it's made cheap. OK for a low value NAS, but don't put anything you care about on it. Now, it also includes a below bargain basement iSCSI implementation/option.

VMware VCP4

DataAnywhere · ‎04-20-2010

I've looked for that HBTokenTimeout stuff before and couldn't find the setting on ESXi 4; who knows what I was doing then, since the command you provided shows that it's currently set to 5000.

Can this setting be executed on the fly on a production system, or should I be waiting until after regular staff hours?

Thanks,

- Geoff

venom78 · ‎04-20-2010

You have to ssh into the ESX host to execute the command. You don't need to reboot the host, so there is no downtime.

mmoran · ‎04-20-2010

Hi,

Setting that value on a live host shouldn't matter at all. No reboot required.

However, if you have procedures to do such tasks during off-peak hours please do so.

Take care,

--Mike

freejak04 · ‎04-20-2010

FYI, changing these settings didn't help at all for me.

jwcMyEdu · ‎05-03-2010

I'm having similar issues - has anyone solved this yet? We are running two 2950's with a DroboElite and have gone through a few different configs. Here are some of the things we tried (Switches are Dell PowerConnect - I forgot the model number but they are last year's model and 16 port 1-Gigabit) :

Throughout -

- Jumboframes is on (in the switch). Drobo set to MTU 9000 and the NICs are set to use Jumboframes

- Switches are in Managed mode and set to Auto discovery

1) Initial setup was 2 2950's with 1 switch for the iSCSI and iSCSI ports picked at random from the onboard or 4-port Intel (one each). We quickly moved this to have two independent networks for the iSCSI (separate subnets).

2) Added 2 more 1-port Intel NICs into the mix and we are running all 4 NICs to the iSCSI network - 2 to each subnet

3) Today we will remove the non-dedicated NICs from the iSCSI and set them as backups

We are in the process of aligning the VMs' partitions (massive pain), and once we get this done we will be as "best practice" as you get.

I saw a couple of folks had trouble with the Dell Switches - is there anything to look out for?

When there's a large I/O event - formatting a volume, moving a volume, etc - everything else has to be still. I booted a VM last night while moving another and crashed the system. To do maintenance (such as realigning the VMs) I basically have to shut down everything in the cluster.

I hate to think it's a matter of getting what you pay for - the DroboElite is VMWare Certified - they must have seen it work.

venom78 · ‎05-03-2010

jwcMyEdu,

I think you are out of luck. My problem is very similar to yours. Whenever I perform any task that involves large I/O, the Drobo always disconnects itself and eventually stops responding to VMware. I have two HP ProCurve 3500yl switches with a VLAN configured for the Drobo only. I was dealing with Drobo's level 3 support, and we tried everything we could. I even have the VM aligned, but it still has the same problem. I am in the process of returning the Drobo and purchasing the Promise VessRAID 1840i. I think you get what you pay for. If you manage to get it to work, please let me know.
