I work as an IT tech at a university research lab and I have a bit of a problem. We have a number of servers up and running today in a mixed environment. I could easily virtualize the aging hardware instead of buying a lot of new machines. We have access to several sites in the area where we can co-locate one or more servers.
The concept I’m trying to achieve is a decent failover/replicated file system solution where we have one ESXi host in each server room with a replicated file system. One server is to act as the primary, with all normal operations running from it, and the other is a secondary to use if the first fails. Some of my friends have asked why I don’t just get a SAN instead, but I think the cost of TWO 48TB SANs would be staggering compared to my suggestion. Having just one negates the benefit of separate locations, so I’d have to get two.
I have two fairly powerful servers (see specs below) running ESXi 4.1 Update 1 on internal SSD disks. They have a total of 24 2TB disks in four RAID arrays. One is a 2TB RAID1 holding the vCenter server and an Openfiler server on each of the hosts. The other three are 10TB RAID5 arrays split into 2TB volumes. One of these volumes is set up in Openfiler as an iSCSI target used by the ESXi host, with a DRBD block sync to the other server. This traffic goes over dual-channel 10Gb fiber and should present no problem (and indeed does not; DRBD is really fast and stable). The other two volumes are going to be used by a file server and should present no problem either. However…
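For reference, the DRBD side of the setup is roughly the following (a sketch only; the resource name, hostnames, device paths and IPs here are illustrative, not our actual config):

```
# /etc/drbd.d/iscsi-vol.res -- illustrative DRBD 8.3-style resource
resource iscsi-vol {
  protocol C;                  # synchronous replication between the hosts
  device    /dev/drbd0;
  disk      /dev/sdb1;         # the 2TB volume exported via iSCSI
  syncer { rate 1000M; }       # the 10Gb link can sustain a high resync rate
  on mmvmof01 {
    address 10.0.0.1:7789;
  }
  on mmvmof02 {
    address 10.0.0.2:7789;
  }
}
```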
I am having trouble with iSCSI in Openfiler. It just can’t keep up with the load I put on it, and it does not handle a dual-primary (primary:primary) DRBD setup very well. The VMFS file system should manage, but I’m not so sure about the iSCSI target in Openfiler. I get the following error messages on the Openfiler server whenever I put some heavy IO on it:
kern.info<6>: Aug 17 12:59:01 mmvmof01 kernel:last message repeated 54 times
kern.info<6>: Aug 17 12:59:01 mmvmof01 kernel: iscsi_trgt: Abort Task (01) issued on tid:2 lun:0
Once this has occurred I have to restart the iSCSI target service to be able to access the LUN again, which is needless to say both annoying and totally useless. In DRBD dual-primary mode I had some really unhealthy read/write issues that caused the delicate balance to fail, so it is now back in primary:secondary mode until I have sorted out the whole iSCSI problem.
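For anyone hitting the same thing: dual-primary has to be explicitly enabled in the resource’s net section, and demoting back to primary:secondary is a single drbdadm command on the node being demoted (resource name `r0` is made up for the example):

```
# net section of the DRBD resource -- required for dual-primary,
# and the split-brain policies matter a lot in that mode
net {
  allow-two-primaries;
  after-sb-0pri discard-zero-changes;
  after-sb-1pri discard-secondary;
}
```

```
# demote one node to go back to primary:secondary
drbdadm secondary r0
```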
From what I can tell, version 5 of ESXi seems to solve a lot of my problems. First off, I don’t have to butcher my large disks into little 2TB chunks, which removes some management overhead. Second, I have an idea for a replacement for the Openfiler/iSCSI mess. I do not know if it is doable though…
I was considering setting up the data volumes as NFS shares and trying the IO mirror function. One of the shares on my passive server would be mounted from the active one, and that mount would then be used in the IO mirror to replicate data over to the secondary server. Is this doable at all? Will it be mightily slow and cumbersome? I know it might be a less-than-optimal solution, but I have gotten a bit wary of iSCSI after my trials and tribulations with Openfiler…
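If I go the NFS route, the export on the storage side and the datastore mount on the ESXi host would look something like this (the paths, subnet and datastore label are invented for the example):

```
# /etc/exports on the NFS server VM
/export/datavol  10.0.0.0/24(rw,no_root_squash,sync)
```

```
# mounting the export as a datastore on the ESXi host
esxcfg-nas -a -o 10.0.0.1 -s /export/datavol nfs-datavol
```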
My other option is setting up a Debian DRBD/iSCSI VM on each host, but that adds complexity to the setup I’d rather do without.
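The Debian variant would just be iSCSI Enterprise Target (or SCST) exporting the DRBD device directly; the ietd config for that is short (the target IQN and device path are illustrative):

```
# /etc/iet/ietd.conf on the Debian VM
Target iqn.2011-08.lab.example:drbd.datavol
    Lun 0 Path=/dev/drbd0,Type=blockio   # export the replicated block device
    MaxConnections 1
```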
CPU cores: 8 × 2.4GHz
CPU type: Intel Xeon E5620
NICs: 8 total (2 dual-channel fiber)
RAID: Areca ARC-1280
Disks: 24 × 2TB SATA plus 1 internal 40GB SSD
vSphere 5 does not include an NFS server, so I'm not sure how you are going to export from one ESXi host to another.
To be honest, though, vSphere 5 has a better solution than what you are hoping to do. There is a new vSphere Storage Appliance (VSA) available for vSphere 5 that could do what you are hoping for.
I have a friend who ran with Openfiler (I'm a bit more of a NexentaStor Community/OpenSolaris type of guy if you want to get something for free). He ran into the same iSCSI issues and switched to NexentaStor over it; as far as we understood, there was simply no fix for it short of patching it ourselves or keeping the iSCSI load from spiking (it only happened under heavy load).
On a custom ESXi box I set up rsyncd, so you may be able to do some iffy things with ESXi, but be prepared for a complete lack of support, funny looks from the community, and a widened attack surface on the box. (Given that you're using 48GB of memory, I'm assuming you're not running the free edition, so I wouldn't do this.)
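For what it's worth, the rsyncd side of that hack was nothing more than a minimal config along these lines (module name and datastore path are invented for the example; again, entirely unsupported):

```
# /etc/rsyncd.conf on the ESXi box -- use at your own risk
uid = root
use chroot = no
[vmbackup]
    path = /vmfs/volumes/datastore1
    read only = yes
```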
The VSA option is nice, but the whole VSA setup at VMware is extremely young and it shows in terms of what you can do with it (the fixed cluster size at creation being the biggest limitation). Maybe with a third-party VSA you could get further, but since you're running Openfiler, I'm pretty sure you're not looking to drop a ton of money on it.
While this KB article warns of issues with Openfiler and IET (http://kb.vmware.com/kb/1026596), an SCST target is available as an add-on for Openfiler.
Thanks for all the good replies.
I think we will go with the VSA actually, if it seems safe enough.
Money is not really an issue when reliability is concerned, we just try to keep the costs down.
Openfiler was a bit of a long shot and would have been nice if it had worked, but I'm currently writing its obituary and closing that chapter.
I just wish they would release vSphere 5 so I can get on with sorting this!