I am also having the same issue. I started off thinking it was the DroboElite, so I had Drobo send me a new DroboElite, and it had the same issues. I then looked at the servers I had attached; both servers are of different makes, so that could not be a coincidence. I also have PowerConnect 5448 switches; I removed those from the equation and started getting better performance. I thought it might be ESXi or vSphere, but I had 2 different versions installed on 2 different servers, so I was starting to rule that out until today. The Drobo now doesn't respond unless I reboot it, so I am back to thinking it is a vSphere issue.
Next time the Drobo locks up, see if you can still access it with the Drobo Dashboard. When mine locks up, I cannot get it to show up on the Dashboard via either USB or network connection. Doesn't look like a vSphere issue to me.
Hi All, We have been seeing this problem pop up with a limited number of customers, and one of the major issues is OS alignment within VMware. This affects WinXP, W2k3, and most Linux implementations and is common throughout many VMware environments. It manifests itself on ESX servers and storage systems as an overloaded process, slow performance, and, in our case, some iSCSI disconnects. There are several VMware documents on the topic. Here are some links. Please be sure that your OS installation adheres to VMware best practices to limit your exposure to these issues.
Do the DroboElites have a BBWC on their RAID controller? From what I've read, they are mainly a software RAID system running on standard disks. If so, you are probably overwhelming the controller, since everything would be in write-through mode, causing pretty bad performance issues under load.
We have the lockup issue too. The DroboElite is reachable via ping, but that's it. When this happens, our only option is to hit the power button on the DroboElite, wait for it to power down, then power it back up again. This happens most often during reboots of our main virtual machine that connects via iSCSI to the DroboElite volumes. The VMFS filesystem of this VM is actually stored on a local disk of the ESXi server.
The document http://www.vmware.com/pdf/esx3_partition_align.pdf refers to re-partitioning and formatting volumes on the device.
The DroboElite actually performs the volume creation and NTFS formatting from within the Dashboard. These volumes are then connected via iSCSI from the Windows VM. We remember either from a support call or from reading (unfortunately I can't find references) that you should never perform any volume management or formatting of these NTFS volumes using Microsoft's disk management tools. Basically, all volume management and formatting is done in the Drobo Dashboard. The only exception is the VMFS partitions. I believe we came to this conclusion when we attempted (during testing) to create a software mirror within the Windows Disk management between 2 Drobo units.
Hi Geoff. Are you still experiencing lockups after aligning the VM partitions? We're seeing lockups just from transferring files to the Drobo using VMware's datastore browser.
We haven't performed the alignment because we are under the impression that the DroboElite Dashboard is the only way to manage volumes and partitions. Also, that alignment document says not to align boot partitions. All "data" partitions are actually either a) stored as VMDKs locally on the ESXi server (i.e. not on the DroboElite) or b) on the DroboElite as iSCSI connected volumes via the Windows Server VM.
Most of our lockups happen when we reboot the Windows VM that contains the NTFS volumes connected via iSCSI on the DroboElite. This VM is on local storage of the ESXi server. The workaround is to pause all VMs (except the main Windows VM) first, and only then can we reboot the main Windows VM - quite a pain when installing updates, but it's a workaround that we've found consistently works. The DroboElite is a production unit, offsite from us, so when it locks up, it's an out-of-office trip for us.
The lockup occurred this morning when I attempted to take a snapshot of a virtual machine that is stored on the DroboElite. This is only the second time we've run into this (lockup when taking a snapshot). Usually, it only happens in the above scenario.
We have not made any attempts to align the VMFS datastores, and perhaps we should. I'll need to read that document again. What we've been planning is to run all the virtual machines on the local ESXi datastore, rather than the DroboElite, and use the Elite only for the windows iSCSI volumes and backups of the VMs.
Our current setup is :
2 ESXi 4.0 servers with identical hardware.
- the main Windows Server VM (local datastore)
- a small linux VM (on the DroboElite) - 2 vmdks: 1 x 10GB and 1x200GB
ESX02 contains :
- a 32-bit Windows App server (on the DroboElite) - 1 x 60GB VMDK (system drive)
- a 32-bit XP client (on the local datastore)
- a couple of test environments that we only run when needed - usually off.
We quit using vCenter with HA, since we started moving the VMs to the local datastore to improve performance, and found no benefit to HA that doesn't include Storage VMotion (just our opinion anyway).
In any case, we'll look into aligning the VMFS partitions (if it's accepted practice from Drobo).
EDIT - added :
According to fdisk -lu output:
The DroboElite starts at position 128, and the alignment documentation states that this is aligned. Oddly, the local storage is not aligned. I don't see how aligning the local storage would affect the DroboElite.
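For anyone wanting to check their own fdisk -lu start values, here is a minimal sketch of the arithmetic. It assumes 512-byte sectors and a 64 KB alignment boundary; both are assumptions on my part (the function and variable names are made up for illustration), so substitute whatever boundary your storage documentation calls for.

```shell
#!/bin/sh
# Sketch: report whether a partition's starting sector sits on an
# alignment boundary. SECTOR_BYTES and BOUNDARY_BYTES are assumptions
# (512-byte sectors, 64 KB boundary), not values taken from Drobo.
SECTOR_BYTES=512
BOUNDARY_BYTES=65536

alignment_status() {
    # $1 is the "Start" column from fdisk -lu for the partition.
    start_sector=$1
    offset=$(( start_sector * SECTOR_BYTES ))
    if [ $(( offset % BOUNDARY_BYTES )) -eq 0 ]; then
        echo "aligned"
    else
        echo "misaligned"
    fi
}
```

With these assumptions, alignment_status 128 reports aligned (128 x 512 = 65536 bytes), while a start sector of 63, the traditional MS-DOS partitioning default, reports misaligned (63 x 512 = 32256 bytes).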
Message was edited by: DataAnywhere (added the output information for fdisk -lu)
I see that I am not the only one having problems with the DroboElite. I too am experiencing timeouts. The Drobo started to disconnect whenever I perform cloning or Storage VMotion, or even when nothing is running. The VMFS doesn't need to be aligned because it is aligned during creation. I have performed alignment on one OS, and it is a pain to align a VM that already has an OS installed in it. At this point, I think Drobo is simply not fit for a VMware environment.
Here is my setup:
8x CentOS vms
2x Windows 2003 Server vms
1x Windows XP vm
2x HP Procurve 3500 switch with VLAN dedicated for Drobo Elite.
Does everyone here have jumbo frames enabled? I've run into similar issues with high I/O on network storage devices in general due to this. Not saying it's definitely the problem, but it's worth a shot: disable jumbo frames if they're enabled (typically 9000 bytes). Either that, or make sure the MTU is set the same across the board and is supported by your switches.
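One way to test whether the MTU really is consistent end to end is a don't-fragment ping sized to fill a whole frame. This is only a sketch: the helper name is made up, it assumes IPv4 with no IP options (20-byte IP header plus 8-byte ICMP header), and the vmkping flags in the comment should be checked against your ESX/ESXi version before relying on them.

```shell
#!/bin/sh
# Sketch: compute the largest ICMP payload that fits in one frame for a
# given MTU. Assumes IPv4 without options: 20-byte IP header plus
# 8-byte ICMP header, so 28 bytes of overhead.
icmp_payload_for_mtu() {
    mtu=$1
    echo $(( mtu - 28 ))
}

# Example on an ESX/ESXi host (TARGET_IP is a placeholder for the
# iSCSI address of the storage device; -d sets don't-fragment):
#   vmkping -d -s "$(icmp_payload_for_mtu 9000)" "$TARGET_IP"
```

If the full-size ping fails while a standard-size one (MTU 1500, payload 1472) succeeds, something along the path is not passing jumbo frames.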
No jumbo frames here. Tried everything in the book. The Elite will not stay connected to the VMware host. Truly disappointed at Data Robotics for marketing this device as VMFS compatible when it is clearly not, based on all the reports around the web.
I grew tired of dealing with them. For now, the Elite is connected to a linux box (iscsi) which is being shared via NFS. In this configuration, the Drobo does not drop connections although it is still slow and just barely useable. We are using it as a backup for the vm hosts.
Brad from DRI above mentioned VM alignment and how important it is for the best experience with the DroboElite.
Following Brad's post there was some mention that since VMware states to not align boot disks that it is not important.
VMware recommends VM alignment for any high performance activities. DRI has been able to reproduce these disconnections only with misaligned VMs. When the entire VM is aligned (Boot and Data), we are not able to reproduce these disconnects.
Some of the issues above have been seen during OS updates which are massive writes to the boot disk. This could be caused by boot disk misalignment.
For those of you willing to test this theory, please deploy a Windows 2008 VM and see if you have the same issues with disconnects during Windows Update. Windows 2008 is aligned by default, unlike Linux and Windows 2003.
If the above experiment fails please contact DRI support and we will be happy to help.
Please follow the best practice guide found at : http://www.drobo.com/pdf/DroboElite_VMware_Best_Practices.pdf
A very important setting is the following; it should be set on all ESX/ESXi hosts.
esxcfg-advcfg -s 14000 /VMFS3/HBTokenTimeout
If you want to check the setting run:
esxcfg-advcfg -g /VMFS3/HBTokenTimeout
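If you have several hosts to check, the two commands above can be wrapped so the value is only changed when it differs. This is a rough sketch, not an official script: the function name and the ADVCFG override variable are made up for illustration, and it assumes the -g output ends with the numeric value (for example "Value of HBTokenTimeout is 5000").

```shell
#!/bin/sh
# Sketch: set HBTokenTimeout to the recommended value only when the
# current value differs. ADVCFG is a hypothetical override so the
# function can be tried out away from a real host.
ADVCFG="${ADVCFG:-esxcfg-advcfg}"
OPTION="/VMFS3/HBTokenTimeout"
WANTED=14000

ensure_hb_token_timeout() {
    # Assumption: the last word of the -g output is the current value.
    out=$("$ADVCFG" -g "$OPTION")
    current=${out##* }
    if [ "$current" = "$WANTED" ]; then
        echo "unchanged: $current"
    else
        "$ADVCFG" -s "$WANTED" "$OPTION" >/dev/null
        echo "changed: $current -> $WANTED"
    fi
}
```

Run ensure_hb_token_timeout on each ESX/ESXi host; it prints whether the setting was already in place or was changed.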
Thanks for being patient.
Well... it freezes during the OS installation on a VM that has been aligned.
Simply amazing how many issues there are with products like the Drobo line when trying to use them with ESX/ESXi. Meanwhile, the really good iSCSI products have none of these issues. I see it as a true case of getting what you pay for. Spend just a little (such as on the Drobo line) and you're not getting something that's going to do more than a mediocre job, at best. Spend the money to get a quality SAN (not a NAS) and you'll not be hounded by performance issues, or by the manufacturer pushing blame off onto other products, technologies, or settings you made (such as using programmed defaults).
My original impression of the drobo product lineup has not changed from when it originally came onto the market... Sold cheap because it's made cheap. OK for a low value NAS, but don't put anything you care about on it. Now, it also includes a below bargain basement iSCSI implementation/option.
I've looked for that HBTokenTimeout stuff before and couldn't find the setting on ESXi 4; who knows what I was doing then, since the command you provided shows that it's currently set to 5000.
Can this setting be executed on the fly on a production system, or should I be waiting until after regular staff hours?