Xtravirt Virtual SAN Appliance - Page 3

admin · ‎05-21-2008

Hi Guys,

Over at Xtravirt we've just released a new free virtual appliance that I thought people might find interesting. It lets you use the local storage on two ESX hosts to create a virtual SAN. I've attached an overview diagram to explain how it works.

http://www.xtravirt.com/index.php?option=com_remository&Itemid=75&func=fileinfo&id=29

Summary: The Xtravirt Virtual SAN (XVS) appliance for VMware ESX3 Server is a free solution that provides the benefits of shared VMFS storage without the cost of a SAN. It uses otherwise unused local storage in the ESX server to enable enterprise-level features such as vMotion, DRS and HA, which are normally only available with a shared storage device. All volume data is synchronously replicated between hosts, providing full fail-over capability with data integrity in the event of host, disk or appliance failure. The appliance is menu driven and has been designed to be as easy to configure as possible, and full documentation on the implementation process is provided.

Questions/feedback welcome in this thread.

Cheers,

Alex

TechFan · ‎11-09-2008

Ok. So that explains why it wasn't taking the IP address I was trying to use. . .it has a 0. We have been testing with the default IP configuration and it was working okay until I updated one of the test ESX servers and restarted it. . .and didn't stop the services first. Now they both seem to "sync" forever. . .never finish.

Is there any way to get to the data again or is it all lost? That is a bit scary for real use if we lose complete access because one node goes down uncleanly. . .

I don't see any updated releases. . .

Ok, I am not sure if this is the only way, but I deleted the VMDK on the secondary and created a new one, then set them to do an initial sync. That is going to be a very long restore process if we have a full 400GB in production. . .I hope there is another way to get them back in sync.

TechFan · ‎11-09-2008

Another question I just thought of. . .does this Virtual Appliance have jumbo frame support on the guest OS level? Apparently, ESXi doesn't support it at the kernel level, only for guest OS's directly. Could make a big difference in sync speeds. . .

I should also ask here if anyone has seen a similar appliance providing synchronized NFS-accessed data stores. . .

rvsharpe · ‎11-10-2008

Techfan,

I have run into the issue of not syncing even after a "graceful" shutdown, i.e. shutting down the Xtravirt services on one of the nodes. According to the Xtravirt folks, the application should fail over to the other SAN appliance, but I have yet to experience a successful failover. I usually have to run the initial synchronization on one or sometimes both of the nodes to get synchronization to start after shutting down one of the nodes, either gracefully or by pulling the plug to simulate a disaster. I have to agree though, synchronization is slow using this recovery method.

As far as recovering your data, I have successfully been able to access data once the sync completes.

One good sign though is that I did receive a response from Xtravirt when I left an email using the support link on their website. The contact there instructed me to send logs and even offered to look at my servers remotely. Unfortunately I have not had the chance to continue troubleshooting. I eventually moved my VMs back to a network SAN device for now.

I suggest that you send Xtravirt support an email regarding this issue.

In closing, I removed the application with the intention of reinstalling shortly. Let me know how you make out.

Ramon

TechFan · ‎11-11-2008

Is there any way to get these to be thin provisioned disks. . .like replacing the empty disk after initial synchronization? I am finding that one disk is staying thin, but the other is fully expanded.

I actually have only had that resync issue once so far. All my other power downs and shutdowns have come back online. That first one, I just added a new disk and did a resynchronization.

What I really want to know is how to force mount a single part, so I can provide access to the data until a planned support window to resynchronize. . .I'll have to write them.

mfoley · ‎11-13-2008

I get the split brain situation as well and am using a cross-over cable. I don't see how this could be the problem but has anyone else got a split brain when they didn't use the crossover?

I contacted Xtravirt for support and they helped me fix the problem but I'd like it to be more reliable...and I wouldn't mind paying for commercial support.

glazgb · ‎11-24-2008

Hi!

The documentation "XVS Installation Guide.doc" says:

2.6 Recovering from an XVS appliance node shutdown or failure

If one of the nodes is lost for any reason, including ESX host failure, the other node will take over all storage operations and the volume will continue to function. When the node is restored to operation it is necessary to re-sync the volume to resume full failover capability of the SAN.

Power on the XVS node that was disconnected and allow it to boot. Select Option 2 at the Main Menu to enter the Services Menu.

At the Services Menu select Option 1 to start the XVS services.

Assuming network connectivity between the nodes is functional, you should be presented with a resynchronisation progress screen. This will automatically exit when resynchronisation is complete.

How do I get this "automatic resynchronisation" after power-up?

rvsharpe · ‎12-02-2008

I get the split brain situation with or without a crossover cable. By the way, can you explain how you resolved the split brain situation?

Ramon

admin · ‎12-03-2008

Hi Guys,

Afraid I've currently ceased development of the XVS due to time constraints, so I'd suggest that it is only used for dev/testing unless you have a sound knowledge of Linux/DRBD/Heartbeat.

The resolution for split-brain is to invalidate one of the nodes; if you google for drbd split-brain recovery you'll find a lot of information.

Basically you just need to select one node and force it to overwrite the other, the command for this would be something like:

On the master:

drbdadm primary all

On the node to be overwritten

drbdadm secondary all

drbdadm invalidate all

drbdadm connect all

On the master:

drbdadm connect all

This should then cause a full re-sync, be careful about doing it the correct way around!

When the sync is done, do the following on the overwritten node.

drbdadm primary all

service heartbeat start

service iscsi-target start

If anyone needs any help/advice I'm happy to advise on a best-effort basis, either PM me on here or email me at support (at) xtravirt.com.

Cheers,

Alex
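
For anyone worried about running the split-brain steps above "the wrong way around": a minimal sketch that only prints which command goes on which node and runs nothing. The node names xvs1 and xvs2 and the function name split_brain_plan are made up for illustration; the commands themselves are the ones from Alex's post.

```shell
#!/bin/sh
# Sketch only: print the split-brain recovery plan, one line per command,
# each prefixed with the node it should be typed on. Nothing is executed.
split_brain_plan() (
    survivor="$1"   # node whose data is kept (the "master" in the post)
    victim="$2"     # node whose data will be overwritten
    if [ -z "$survivor" ] || [ -z "$victim" ] || [ "$survivor" = "$victim" ]; then
        echo "usage: split_brain_plan SURVIVOR VICTIM (two different nodes)" >&2
        exit 1
    fi
    echo "[$survivor] drbdadm primary all"
    echo "[$victim] drbdadm secondary all"
    echo "[$victim] drbdadm invalidate all"
    echo "[$victim] drbdadm connect all"
    echo "[$survivor] drbdadm connect all"
)
```

Calling split_brain_plan xvs1 xvs2 lists five lines, so the plan can be read back and checked before anything is typed on the appliances.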

TechFan · ‎12-03-2008

Alex,

Thanks for the response. I figured there wasn't anything more happening on this when I heard xtravirt was bought out. That is unfortunate but understandable.

So, basically what you are saying is that the only way to recover is to completely resync. . .so, resync 250GB if that is the size of the space? Not much different than just deleting a virtual disk on one side and resyncing. So, if both hosts go down, then it is likely to be a while before you can get them back up. . .good reason not to use it in production. We were just about to start using it as one type of storage. . .NFS, Local, and VSAN mixed as needed.

Thanks again for the great tool.
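
To put a rough number on a full resync: the time is roughly the volume size divided by the sustained replication throughput. A small sketch, assuming a 30 MB/s sync rate, which is purely a placeholder (the real rate depends on the DRBD syncer setting, the disks and the link between the nodes); resync_minutes is a hypothetical helper name.

```shell
#!/bin/sh
# Rough full-resync estimate in whole minutes (integer arithmetic, rounded down).
#   $1 = volume size in GB
#   $2 = assumed sustained sync rate in MB/s
resync_minutes() (
    size_gb="$1"
    rate_mb_s="$2"
    # GB -> MB, divide by rate to get seconds, then convert to minutes.
    echo $(( size_gb * 1024 / rate_mb_s / 60 ))
)
```

At the assumed 30 MB/s, resync_minutes 250 30 prints 142, i.e. a little under two and a half hours for a 250GB volume.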

admin · ‎12-03-2008

Ugh I remember why I gave up using these forums now, the software is so bad.

I typed a long reply to this twice now only to get it lost with an HTTP 500 Internal Server Error, and replies don't seem to be showing up in the threads properly; half the time rvsharpe's post is the last in the thread when there are two more replies!

Anyway, for the third time...

You only need to do a full re-sync on a split-brain, which should only happen if you lose the network connectivity between the two nodes when they are both live. If one of the nodes, or the host it is on, crashes for any reason, then only an incremental sync is necessary. DRBD maintains meta-data so this process is automatic. All you need to do when the crashed node comes back up is type "drbdadm connect all" and it should begin the resync. When it's complete, type "drbdadm primary all", then "service heartbeat start", wait 2 mins, then "service iscsi-target start", and everything should be up and running again.

If the initial drbdadm command doesn't work, try a "drbdadm outdate all" (instead of invalidate, which causes a full re-sync) on the node, then try again. I've tried this out many times with hard powering off a node and DRBD does an incremental resync successfully every time. The "start services" menu option should automate these commands, but it seems there may be a bug in there that is preventing it from happening, so I suggest using the command line.

(select all copy - damn forum)
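
The crashed-node sequence from the post above can be sketched as a script. This is an illustration under assumptions, not the appliance's own "start services" logic: the run helper, the DRY_RUN and POLL_SECONDS variables and the function names are invented here, and the "in sync" test assumes a single volume whose /proc/drbd line shows ds:UpToDate/UpToDate once the resync is done. By default nothing is executed; each command is only printed.

```shell
#!/bin/sh
# Dry-run by default: commands are printed with a "+ " prefix.
# Set DRY_RUN=0 to actually execute them (at your own risk).
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Succeeds when the status file contains a device line with both
# disks UpToDate (single-volume assumption).
drbd_in_sync() {
    grep -q 'ds:UpToDate/UpToDate' "${1:-/proc/drbd}"
}

# Run on the node that crashed, once it has booted again.
recover_after_crash() (
    status_file="${1:-/proc/drbd}"
    run drbdadm connect all
    # Poll until the incremental resync has finished.
    until drbd_in_sync "$status_file"; do
        sleep "${POLL_SECONDS:-10}"
    done
    run drbdadm primary all
    run service heartbeat start
    run sleep 120                     # the "wait 2 mins" from the post
    run service iscsi-target start
)
```

Running recover_after_crash with no arguments reads /proc/drbd; passing a file path instead makes it possible to try the logic against a saved copy of the status output first.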

dilidolo · ‎12-03-2008

So this is basically IET + Heartbeat + DRBD?

Good for someone without a dedicated iSCSI device for testing.

glazgb · ‎12-03-2008

mittell,

You wrote that after a crash only an incremental sync is necessary: type "drbdadm connect all" when the crashed node comes back up, and once the resync is complete, "drbdadm primary all", "service heartbeat start", wait 2 mins, "service iscsi-target start".

How can this be made automatic when XVS starts?

With a script in /etc/rc3.d?

TechFan · ‎12-03-2008

Thanks again for the additional info. I will play with it more, but the recovery time in case of failure is what has kept me from deciding to deploy anything live on my XVS setup so far and caused me to decrease the volume to 250GB instead of 500GB.

O/T: I also noticed the email replies are very delayed. . .and that your reply I got via email notification wasn't on the forum thread for a while. . .neither was my reply. . .lol

Again, I really appreciate all you have done with this, it is a great tool.

TechFan · ‎12-09-2008

Ok. I had to shut down both ESX servers tonight. . .and I even shut down both sides of the XVS setup. After restarting the servers. . .they again will not sync. . .I tried the outdate all. . .didn't work either. I thought I would write you before I go invalidating all. . .

When I try to run any of the following commands, I get an error (No response from the DRBD driver! Is the module loaded?...terminated with exit code 20):

drbdadm primary all

drbdadm secondary all

drbdadm connect all

this one doesn't complain:

drbdadm invalidate all

Any advice. . .ideas?

admin · ‎12-09-2008

Sounds like the DRBD service hasn't auto-started. If you cat /proc/drbd do you get a file not found error?

Try a "service drbd start" then those other drbdadm commands.

Alex
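
The check Alex describes can be wrapped in a tiny helper. A minimal sketch: drbd_module_hint is a made-up name, and it assumes the appliance ships the usual drbd init script so that "service drbd start" works. The status path is a parameter only so the function can be tried against a test file; on a real node it defaults to /proc/drbd.

```shell
#!/bin/sh
# Report whether the DRBD status file is present and readable.
# A missing /proc/drbd usually means the kernel module is not loaded.
drbd_module_hint() (
    status_file="${1:-/proc/drbd}"
    if [ -r "$status_file" ]; then
        echo "loaded"
    else
        echo "not loaded: try 'service drbd start'"
    fi
)
```

If it reports "not loaded", start the service first and only then retry the drbdadm commands that failed with "No response from the DRBD driver".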

TechFan · ‎12-09-2008

I might have had that earlier, but I started the service first. . .it started and now the file is found. The service start command made progress, but the one I want to be primary won't accept the drbdadm primary all. The secondary takes its commands properly, but the primary complains: drbd0: Not outdating peer, since I am diskless.<4> State change failed: (-2) Refusing to be primary without at least one UpToDate disk. Command 'drbdsetup /dev/drbd0 primary' terminated with exit code 11.

admin · ‎12-11-2008

"drbdadm up all" on the primary first.

Tip: if you do "cat /proc/drbd" it'll give you the current status. If you see diskless, do a drbdadm attach all then drbdadm connect all. (NOTE: drbdadm up all does both actions in one command.)
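
The tip above amounts to reading two fields from the status line. A rough sketch that turns them into a suggested next command; it assumes the DRBD 8.x /proc/drbd layout with a cs: (connection state) field and a ds: (local/peer disk state) field on the device line, looks only at the first device, and drbd_next_step is a hypothetical name. It prints a suggestion and runs nothing.

```shell
#!/bin/sh
# Suggest the next drbdadm step from the first device line of the status file.
drbd_next_step() (
    status_file="${1:-/proc/drbd}"
    line=$(grep 'cs:' "$status_file" 2>/dev/null | head -n 1)
    if [ -z "$line" ]; then
        echo "no device line found (is the drbd service started?)"
        exit 0
    fi
    # Connection state, e.g. Connected, StandAlone, WFConnection.
    cs=$(printf '%s\n' "$line" | sed -n 's/.*cs:\([A-Za-z]*\).*/\1/p')
    # Local disk state is the part of ds: before the slash.
    ds=$(printf '%s\n' "$line" | sed -n 's/.*ds:\([A-Za-z]*\)\/.*/\1/p')
    if [ "$ds" = "Diskless" ]; then
        echo "drbdadm attach all; drbdadm connect all"
    elif [ "$cs" = "StandAlone" ]; then
        echo "drbdadm connect all"
    else
        echo "no action suggested (cs=$cs ds=$ds)"
    fi
)
```

This only covers the two cases mentioned in the tip; anything else is reported back with the raw states so it can be looked up.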

TechFan · ‎12-11-2008

Lol. It is pretty obvious I haven't done this drbd stuff before. Unfortunately, the setup doesn't want to co-operate still.

I started the service on both sides. I then tried to do drbdadm up all on the primary. It complains that it is attached to a disk: (Failure: 124) Device is attached to disk (use detach first). . .it echoes the command it is trying to run. . .then says terminated with exit code 10. So, I then try drbdadm primary all. It fails again with that message in my previous post. So, I tried detach all, then disconnect all. . .then I tried up all again. . .says disk is still attached, so then detach all, then up all again. . .no errors this time. . .now primary all. . .it gives me the exact same error it did in my previous post again. . .

So, stuck. It thinks it is diskless. . .and has a disk at the same time. Not sure how that works. . .thanks for the tips.

TechFan · ‎12-11-2008

FYI. I got it. I was trying to see if I could manually set the disk state with dstate (if that was what it even meant) and I found this link:

The key commands are:

drbdadm secondary resource

drbdadm -- --discard-my-data connect resource

In our case resource=all. There is a bit more info as well, so read it if any of the rest of you are stuck.

That one extra step fixed it and allowed it to resync only a tiny portion of the 250GB volume, so it only took a short amount of time in comparison. Thanks for your help.
