Solved: Virtualizing MSCS Server

WadeG · ‎08-12-2008

Hi there, I have an ESX 3.5 HA environment and want to do away with my MSCS cluster. Does anyone have any suggestions, tips, links, etc on what needs to be done to remove MSCS, keeping shares etc so it can be virtualized?

Thanks in advance!

Wade

fejf · ‎08-12-2008

Perhaps you can start with the VMware paper about MSCS: "Setup for Microsoft Cluster Service" at the end of the page: http://www.vmware.com/support/pubs/vi_pages/vi_pubs_35u2.html

Direct link: http://www.vmware.com/pdf/vi3_35/esx_3/r35u2/vi3_35_25_u2_mscs.pdf

--

There are 10 types of people. Those who understand binary and the rest. And those who understand gray-code.


WadeG · ‎08-12-2008

Giving it a read right now, thank you for the swift response!

Edit:

After reading through the PDF briefly, it only talks about running MSCS with VMware; I want to do away with it entirely. The problem is that the VM infrastructure is using iSCSI and the 2-node MSCS cluster is using a DAS array. I guess what I want to do is remove MSCS (but preserve my shares etc.) and then virtualize the single remaining host.

Thoughts?

Message was edited by: WadeG

yorkie · ‎08-12-2008

Hey Wade - so your question is - how can I run my applications on VMware HA instead of MSCS? You want to get rid of MSCS, right?

Is there a particular reason for this (I'm incredibly nosey!)...

What application are you running on MSCS - because I think the answer on whether to move to VMware HA depends on the app + the RTO/RPO for that app?

I'm no MSCS expert, but I know a few who are :smileyblush: and I'm sure they'll be along in a minute to throw their hat into the ring...

I would be interested in capturing the answer for this, and the solution into a VIOPS proven practice (e.g title "Migrating <application> from MSCS to VMware HA").

Cheers

Steve

TomHowarth · ‎08-13-2008

Now I am a little bit confused.

Do you (A) want to remove MSCS completely, run a single virtualised machine, and rely on HA for availability?

Or (B) do you want to move your current physical MSCS cluster into ESX and use HA as a further layer of resilience?

If it is A, remember that HA is not true redundancy and it is not guest-level monitoring. It will only kick in when a host (read: ESX server) fails. That leads to an outage of the guest systems while they are started up on another node in the cluster (read: bunch of ESX hosts). If you need guest-level resilience, you should keep your MSCS.

If it is B, then read the document posted earlier.

Tom Howarth

VMware Communities User Moderator

Tom Howarth VCP / VCAP / vExpert
VMware Communities User Moderator
Blog: http://www.planetvm.net
Contributing author on VMware vSphere and Virtual Infrastructure Security: Securing ESX and the Virtual Environment
Contributing author on VCP VMware Certified Professional on VSphere 4 Study Guide: Exam VCP-410

ChrisDearden · ‎08-13-2008

Tom, doesn't Update 2 support guest-level HA with a heartbeat to the VMware Tools install?

"VMware High Availability (HA)

VirtualCenter 2.5 update 2 adds full support for monitoring individual virtual machine failures based on VMware tools heartbeats. This release also extends support for clusters containing mixed combinations of ESX and ESX Server 3i hosts, and minimizes previous configuration dependencies on DNS."

http://www.vmware.com/support/vi3/doc/vi3_esx35u2_vc25u2_rel_notes.html

If this post has been useful , please consider awarding points. @chrisdearden http://jfvi.co.uk http://vsoup.net

fejf · ‎08-13-2008

Misunderstood what you wanted

But anyway, you should be aware that VMware HA is something completely different from MSCS. MSCS tries to provide 100% uptime by switching services from one cluster node to another if there's a problem with one node. With VMware HA there's only ONE node (aka virtual machine), and if that VM fails (it doesn't matter whether the complete host fails or, as ChrisDearden mentioned, with 3.5 Update 2 a single machine fails) it is restarted on another ESX host. This means that it's as if someone pressed the reset button of a physical host.

So with VMware HA you have the downtime of at least the time it takes to boot the VM - with big databases, tomcats etc this can take some time before the service is available again.


ChrisDearden · ‎08-13-2008

Then again, with MSCS you still have the time it takes to bounce a service, which is around 30 seconds or so. It's far from continuous availability (which I understand is something to look for in a future release). In some ways MSCS allows for higher service uptime when you have to patch the underlying OS (almost like VMotion).

From a hardware perspective, I can't think of many situations recently where we've had anything other than a PSU or disk fail on a modern cluster; hardware failure is less common than it used to be. If your service can withstand a little more downtime, then there is no reason not to go for a single guest on an HA cluster. This approach was considered by some of our application teams who wanted to add more resilience to a number of standalone SQL boxes. The service still has to be taken down for guest reboots, but a hardware failure would have less impact (given that a VM guest generally restarts faster than a physical server).


yorkie · ‎08-13-2008

You lose "in flight" transactions with MSCS, right? If MSCS has to fail over an instance, it effectively restarts the service on another server, which you say is faster than a VM start-up. But is it a similar operation?

In both cases, say you have a DB that is clustered, then the in-flight transaction @ time of failure is "lost"... right? In a DB case, when the instance restarts the first thing it has to do is recover back using its logs or whatever, but minus the "in flight" transaction?

So the difference between MSCS and HA is the speed of recovery? The other differences would be cost, complexity of design / build / operation, etc.

I'm no MSCS expert as you can see... always willing to learn the nuances though... a little knowledge is dangerous as they say... :smileylaugh:

Steve

ChrisDearden · ‎08-13-2008

My experience is primarily with SQL clusters, so I can only draw on that. In the event of a non-graceful failover, SQL would recover to the last committed transaction.

Both systems would possibly require a consistency check as part of the service startup process, which could well take a while for larger DBs.

The other disadvantage of using HA as your only resilience is that if an entire host has to fail over, restarting a number of guests simultaneously on another host can take quite some time. (Planning your autostart settings may well be required to ensure that essential boxes start up first.)


yorkie · ‎08-13-2008

Hey Chris,

I think I can sense a showdown test in the offing :smileydevil:

Apologies for being incredibly nosey but are all of your instances (a) live/live - as in, any instance can serve any query - or (b) is each instance serving its own clients?

The reason I ask is, if it's (a) then you only lose capacity whilst the recovery is underway, right?... whereas if it's (b) then you lose service...

And the reason I want to know that is, what are the OLAs you have to recover, what metric is at risk... because if you have an hour to recover then HA might be "good enough"?

Cheers

Steve

WadeG · ‎08-13-2008

Hi everyone,

Basically I inherited a MSCS 2 node cluster that serves up file services only.

So some down time would be fine in an emergency. So really I just want to turf the MSCS and put a single guest into esx. With DRS and HA I think that would be a good solution and allow me to rip out another 2 older servers.

Does this help?


yorkie · ‎08-13-2008

Yes, that's cool. So you want to create a single VM in a VMware HA cluster that has guest-level monitoring (i.e. ESX 3.5 U2), that can access the files on your DAS and provide the same services... Here are my random thoughts on the topic...

If you want to reuse the DAS, I don't think you can use it directly with HA, but you could front it with an iSCSI or NFS device (e.g. a Linux server running this kind of software which is plugged into your DAS).

P2V doesn't seem right because you don't want to migrate the data from the DAS...

So it sounds like one option might be...

1. Build a server with a Linux OS and iSCSI or NFS software on it.
2. Turn off the MSCS host, disconnect it from the DAS, connect your new Linux image to it and see if you can see the files. This has to be non-disruptive, i.e. when Linux sees the disks it doesn't try to signature them or anything crazy.
3. To be safe, turn off your Linux host and plug the MSCS one back in. Check everything is cool and wait for the next change window.
4. Build your new VM with a new ID and IP (do users connect using a naming service? That would make it easy to switch over when you are ready). This will only have a system disk for now, on the normal cluster VMFS.
5. In the next change window, bring down your MSCS box, plug in the Linux box, check it works, then configure your VM to access the DAS via the Linux box over iSCSI or NFS.

I just made all of that up and there are bound to be complications (I'm thinking twice about how the VM accesses the DAS LUNs)... no doubt an expert will be along any minute to fix my mistakes or provide a better answer :smileygrin:

TomHowarth · ‎08-13-2008

Good point, however there was no update level posted in the original post, so I was hedging my bets, honest guv'

The question is now what would offer the quicker recovery time.

Tom Howarth


yorkie · ‎08-13-2008

instead of the quickest, surely the most appropriate :smileycool:

If your recovery window is 1 hour, then both are suitable? No need to shell out for a 30s recovery time when anything less than 30 mins will do...

I would measure the recovery time as the whole process from incident alert all the way through to service recovered... of which the technology bit is just a small, but crucial, piece of that process...

So the question is - is MSCS overkill and HA is good enough?

I might do a quick poll of 20 customers and see what services they run on MSCS and what their recovery times are.... should be interesting data (anyone got any data to share now?)

Ciao

Steve
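
A minimal sketch of the comparison suggested above: add up the whole recovery process, not just the technology step, and check it against the recovery window. The class name and every timing below are illustrative assumptions (only the roughly 30-second MSCS failover and the 1-hour and 30-minute windows are figures mentioned in this thread); substitute measured values.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPath:
    """One way of recovering the service (hypothetical helper, not a VMware or MSCS API)."""
    name: str
    detect_s: int      # time until the incident is alerted and noticed
    respond_s: int     # time to decide and act (0 if fully automatic)
    technology_s: int  # the failover or VM restart itself

    @property
    def total_s(self) -> int:
        # End-to-end recovery time: alert through to service recovered.
        return self.detect_s + self.respond_s + self.technology_s

    def meets(self, recovery_window_s: int) -> bool:
        return self.total_s <= recovery_window_s


# Placeholder numbers: 60 s detection and a 120 s VM restart are assumptions.
mscs = RecoveryPath("MSCS failover", detect_s=60, respond_s=0, technology_s=30)
ha = RecoveryPath("VMware HA restart", detect_s=60, respond_s=0, technology_s=120)

for window_s in (3600, 1800, 120):
    for path in (mscs, ha):
        verdict = "ok" if path.meets(window_s) else "too slow"
        print(f"{path.name}: {path.total_s}s vs {window_s}s window -> {verdict}")
```

With these placeholder figures both options fit a 1-hour or 30-minute window, and they only diverge once the window shrinks to a couple of minutes, which is the "good enough" argument in numeric form.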

WadeG · ‎08-13-2008

Hi Tom and others,

Sorry for the confusion, I haven't posted often enough yet to get the gist of providing lots of detail...

The situation is that I have a 2-node MSCS cluster serving files right now: 2 servers using a DAS array. We have a couple of EQL-5000e SANs using iSCSI for the ESX servers. I don't want to reuse the DAS; in fact I want to pull everything into the VMware environment, or perhaps move it onto an iSCSI volume so I can replicate it, etc.

So my question is (not knowing much about MSCS): how can I remove MSCS from the current physical servers and still keep my shares, etc.? Or would it be better to simply replicate them off to a volume on the SAN and create a new guest to serve them out? Talking it through, that may be the better idea.

Anyway, I'm just looking for insight from anyone with MSCS experience; I don't think its use is warranted here...

Thanks

alecprior · ‎08-13-2008

We did exactly what you're aiming to do when we moved to ESX last year: a 2-node MSCS cluster running file and print services (also Exchange, but that's outside the scope of your question).

We created a single fileserver VM. We had the luxury of 2 hours of downtime available, so we restored a backup onto the new server overnight, then used a scheduled robocopy task to keep it updated during uptime. We then did one last copy during the downtime, stopped the cluster services (kept the servers on, though) and created a DNS record with the old cluster name pointing at the new fileserver VM. We created the shares manually, and the users carried on as normal.

Our choice to ditch MSCS was based on ESX providing hardware resilience of a much higher standard than MSCS. We have had hardware failures on the MSCS cluster in the past that the cluster didn't react to quickly enough, resulting in between 30 seconds and 2 minutes or so of downtime. We can reset the new VM fileserver and it'll be serving files again in a similar timeframe, so it has no disadvantage as far as we're concerned. Plus we still have the cluster services (file/print/mail) on different servers using affinity rules on DRS.
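
A minimal sketch of the scheduled robocopy pass and the share re-creation described in the post above, written as a Python helper that only builds the command lines. The paths, share name and function names are hypothetical, and the particular robocopy switches are an assumption (the post does not say which were used); check them against your own requirements before running anything.

```python
from typing import List


def robocopy_command(source: str, dest: str, log_file: str) -> List[str]:
    """Build the argument list for one sync pass (scheduled or final)."""
    return [
        "robocopy", source, dest,
        "/MIR",      # mirror the tree, including deletions on the destination
        "/COPYALL",  # copy data, attributes, timestamps, NTFS ACLs, owner, auditing
        "/R:1",      # retry a failed file once instead of the very large default
        "/W:1",      # wait 1 second between retries
        "/NP",       # no percentage progress, keeps the log readable
        f"/LOG+:{log_file}",  # append to the log across scheduled runs
    ]


def robocopy_succeeded(exit_code: int) -> bool:
    """Robocopy exit codes below 8 mean no copy failures occurred."""
    return exit_code < 8


def net_share_command(share_name: str, path: str) -> List[str]:
    """Build a 'net share' command that re-creates one share on the new VM."""
    return ["net", "share", f"{share_name}={path}"]


if __name__ == "__main__":
    # Hypothetical source (old cluster node) and destination (new fileserver VM).
    sync = robocopy_command(r"\\oldnode\e$\shares", r"E:\shares", r"C:\logs\sync.log")
    print(" ".join(sync))
    print(" ".join(net_share_command("Finance", r"E:\shares\Finance")))
```

On the Windows fileserver the lists can be passed to `subprocess.run(cmd)`; avoid `check=True`, because robocopy returns non-zero codes even for successful copies, and test the return code with `robocopy_succeeded` instead. Share permissions are not handled here and would still need to be set to match the old cluster shares.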

WadeG · ‎08-13-2008

Hi Alec,

Thanks for the info. I was leaning that way, but wanted to see if there was another approach I should consider. I'll head this way though, as it really would be the cleanest.

Cheers and thanks so much!

yorkie · ‎08-13-2008

Hey Alec, when you restored a backup did it include all the MSCS stuff - did you have to disable this in some way?

Cheers

Steve

alecprior · ‎08-14-2008

No, just a file level backup of the data from the shares. No system state at all.
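
Since only the share data is carried over at file level, a quick comparison of the old and new trees before cutover can catch anything the copy missed. The following is a rough sketch using Python's standard `filecmp` module; the function name is hypothetical, and it compares file presence and content only, not NTFS permissions.

```python
import filecmp
import os
from typing import List, Tuple


def tree_differences(src: str, dst: str) -> List[Tuple[str, str]]:
    """Return (problem, relative_path) pairs for files missing, extra or different."""
    problems: List[Tuple[str, str]] = []

    def walk(cmp: filecmp.dircmp, rel: str) -> None:
        for name in cmp.left_only:
            problems.append(("missing_in_dest", os.path.join(rel, name)))
        for name in cmp.right_only:
            problems.append(("extra_in_dest", os.path.join(rel, name)))
        for name in cmp.diff_files:
            problems.append(("differs", os.path.join(rel, name)))
        # Recurse into directories that exist on both sides.
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(rel, name))

    walk(filecmp.dircmp(src, dst), "")
    return sorted(problems)
```

An empty list means the two trees match at this level. Note that `dircmp` treats files with the same size and modification time as equal without reading them, so it is a sanity check rather than a byte-for-byte guarantee.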
