VMware
17 Replies. Last post: Nov 6, 2007 4:46 PM by jbaxter714

Some musings on MSCS limitations posted: Sep 22, 2007 3:43 PM

oschistad (Hot Shot, 126 posts since Jun 28, 2004)
As many of us already know, there are issues with running Microsoft clusters in VMware that warrant some planning.

First of all there's the requirement for storing the boot disk of the VM on a _local_ (as in Direct Attached) file system, which is a rather severe limitation.


This in turn makes it impossible to run MSCS on a boot-from-SAN ESX server, since such a server has no local drives.

Little information has been provided by VMware as to why you aren't allowed to store the boot disk of a cluster node on SAN, but in this thread a VMware Technician explains:


if we have swapped out a page for your VM to SAN and there is a failover or some other event we have no option but to stall the VM (e.g. it will not get any CPU time) until the page has been brought in


If I understand that statement correctly, we are talking about memory overcommitment and VMware swap here. If an ESX server is overcommitting memory and the MSCS node has active memory in swap, the VM must be paused while the memory is fetched from disk. This is not normally a problem, BUT: in a SAN with multiple paths, you sometimes experience path failovers, which ESX typically takes 30-45 seconds to handle. If the swapped-out pages needed to populate the active memory of a cluster node happened to reside on this temporarily unavailable disk, the VM in question would be stalled for the entire 45 seconds - and that is long enough for the sister cluster node to detect a cluster failure and initiate failover actions.
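
A minimal sketch of this timing argument, assuming the sister node declares a failure once a fixed number of heartbeats have been missed. The function name, the heartbeat interval and the miss limit are all hypothetical values chosen for illustration, not documented MSCS settings:

```python
def peer_starts_failover(stall_s, heartbeat_interval_s=1.0, missed_limit=10):
    """Return True if a node stalled for stall_s seconds is declared dead.

    heartbeat_interval_s and missed_limit are hypothetical values chosen
    for illustration, not documented MSCS settings.
    """
    detection_window_s = heartbeat_interval_s * missed_limit
    return stall_s >= detection_window_s


# A short stall goes unnoticed; a 30-45 second path failover does not.
for stall_s in (2, 30, 45):
    print(stall_s, peer_starts_failover(stall_s))
```

Whatever the real detection window is, the point is the comparison: any stall that outlasts it looks like a dead node to the peer.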


The scenario above is my personal best guess as to what the actual reason for VMware's rather strange requirement might be.


However, if the assumptions I describe above are correct, then the root cause of the problem is memory overcommitment and the likelihood of path switching in a SAN as opposed to a local storage environment.


If so, then one way to work around the problem would simply be to reserve *all* the memory of the VM. By doing so you guarantee that VMware will never swap out any of its pages to VM swap, and hence a path failover will not cause the VM to pause.
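
A small sketch of why a full reservation would remove the swap exposure, assuming the hypervisor-level swap for a VM is sized as configured memory minus reserved memory (virtualization overhead memory is ignored here). The function name is hypothetical:

```python
def hypervisor_swap_mb(configured_mb, reserved_mb):
    """Memory (in MB) that the hypervisor could still swap out for one VM.

    Assumes swap exposure = configured memory - reservation, ignoring
    virtualization overhead memory.
    """
    if not 0 <= reserved_mb <= configured_mb:
        raise ValueError("reservation must be between 0 and configured memory")
    return configured_mb - reserved_mb


print(hypervisor_swap_mb(4096, 1024))  # partial reservation: some memory can be swapped
print(hypervisor_swap_mb(4096, 4096))  # full reservation: nothing left to swap
```

With the reservation equal to the configured size, there are no pages left that could end up in VM swap, which is the condition the workaround relies on.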


Furthermore, the restrictions imposed by VMware to support MSCS in a virtualized environment are in and of themselves an increased risk to your platform! If you think about it, the reason why failovers occur in a SAN is that some component just failed. If the same happened in a local storage system, your disk would most likely be gone - period. And any failure which causes the ESX server to go offline will take your cluster node down too - with no way of starting the VM from another ESX host, since it resides on a local disk. Lastly, there's the fact that ESX servers by design contain no vital data locally and thus do not require any specific data protection measures beyond the obvious - this all changes when you start hosting production VMs on local storage.


Personally, even for a production platform, I am far from convinced that going with local storage for the boot disk is the correct solution to this problem.

Re: Some musings on MSCS limitations

1. May 7, 2007 1:24 PM in response to: oschistad
Mike_Laverick (Virtuoso, 4,063 posts since Jan 5, 2004)
This is an interesting post - and I'm surprised no one has responded to it... I recently blogged about my MSCS experiences with everything stored in VMDK on the SAN, coupled with VMware HA...

http://www.rtfm-ed.co.uk/?p=373

It's an unsupported configuration for all the reasons you have already outlined.

From the VMware people I have spoken to, they have said that the reason they don't support this configuration is:

Firstly, if your boot disks, quorum and shared disks are all on the SAN and SAN connectivity fails, the “Active” node can become confused about what has occurred. From a troubleshooting perspective it is easier if local storage is used.

Secondly, time-out values in the Windows registry and latency on the SAN fabric could cause unnecessary triggering of node fail-over.

In a way I agree with you - there does seem to be a "gap" between what the technology can do and what is actually supported by VMware. But I believe there are good reasons why they don't support the configuration...

Regards
Mike

Re: Some musings on MSCS limitations

2. May 7, 2007 3:49 PM in response to: Mike_Laverick
Anders (Expert, 1,301 posts since Oct 3, 2003)
Hi Mike, it's not simply a troubleshooting issue.

And it has nothing to do with paging either.

Simply put, MSCS is a bit "itchy" when it comes to timeouts.
If there is a SAN fabric event like a fail-over,
we have to put the VM to sleep until we can handle the IO.
(remember, no caching in the vmkernel)
This can take longer than the failover time, causing the standby node to try to take control of the cluster resources.
When the originating cluster owner is rescheduled to run again,
it has no concept of time having passed, nor that its resources are now owned by the other server (if that server was successful in grabbing them).

We then have a split brain cluster.
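
A toy model of that sequence, assuming a two-node cluster where each node simply remembers whether it believes it owns the resources. The class, the function and the 15-second takeover threshold are hypothetical and are not taken from MSCS:

```python
class Node:
    def __init__(self, name, believes_owner=False):
        self.name = name
        self.believes_owner = believes_owner


def owners_after_stall(stall_s, takeover_after_s=15):
    """Return the names of nodes that believe they own the cluster resources.

    takeover_after_s is a hypothetical threshold, not an MSCS default.
    """
    active = Node("A", believes_owner=True)
    standby = Node("B")

    # While A is descheduled, B sees no sign of life and may take over.
    if stall_s >= takeover_after_s:
        standby.believes_owner = True

    # A resumes with its state frozen: nothing tells it that time has
    # passed, so it still believes it owns the resources.
    return [n.name for n in (active, standby) if n.believes_owner]


print(owners_after_stall(5))   # only A believes it is the owner
print(owners_after_stall(45))  # both A and B believe they are the owner
```

Two names in the result is the split-brain condition: nothing in the resumed node's own state reflects the takeover.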

Now this is an unforeseen side effect of virtualization,
but MS puts similar restrictions on regular HW.
Not 100% sure about boot from SAN,
but you do need separate HBAs for OS and data/quorum.

MS is moving away from SCSI resets as a control mechanism for this exact reason in Longhorn.
"No longer uses SCSI Bus Resets which can be disruptive on a SAN"

- Anders

Re: Some musings on MSCS limitations

3. May 7, 2007 7:33 PM in response to: oschistad
PatrickMSlattery (Enthusiast, 26 posts since Mar 14, 2006)
The simplest way around this that I have found is to use software initiator based iSCSI LUNs for the cluster resources. It's 100% supported by MS under Windows 2003.
Admittedly most of my MSCS clusters are using RDM LUNs. I use WAFL snapshots from my NetApp storage array to bring up clusters at different points for testing.
I certainly would not advocate the use of MSCS clusters under ESX for anything other than testing and development purposes, but for test and dev it's an ideal clustering environment for my uses.

Re: Some musings on MSCS limitations

5. Jun 8, 2007 1:51 PM in response to: oschistad
jasonboche (Champion, 5,895 posts since Jan 7, 2004)
Also, as to split brain - that is *exactly* why there
is a Quorum disk in most clusters, to ensure that
only one node can "win" the fight for ownership of
the cluster. Sounds like MSCS has some holes in its
logic in handling split brain - if you used to be the
active node and you suddenly lose your quorum disk,
you are probably not master any more....

The answer is that the quorum logic wasn't designed with virtualization in mind, or for the scenarios that a virtualized environment can introduce but a physical one cannot. MSCS quorum works (usually...), but not entirely in a VM environment, for the reasons already explained above.

Re: Some musings on MSCS limitations

6. Aug 20, 2007 7:26 PM in response to: oschistad
MattG (Expert, 527 posts since Jun 21, 2004)
So does it make sense to consider VI3 HA sufficient for Exchange 2003 or SQL 2005 failover instead of going with a VM MSCS cluster?

What are the advantages/disadvantages of this scenario?

Thanks,

-MattG

Re: Some musings on MSCS limitations

7. Aug 20, 2007 7:37 PM in response to: oschistad
RParker (Champion, 5,270 posts since Dec 6, 2006)
"First of all there's the requirement for storing the boot disk of the VM on a _local_ (as in Direct Attached) file system, which is a rather severe limitation. "

This isn't a requirement; it only applies *IF* you want VMware to support your setup.

Re: Some musings on MSCS limitations

8. Aug 20, 2007 7:41 PM in response to: MattG
RParker (Champion, 5,270 posts since Dec 6, 2006)
Basic common sense says that VMware isn't the REPLACEMENT for everything computer; it attempts to consolidate solutions, and it can't do EVERYTHING.

For what it does, it does it very well. For everything else there are always physical servers. No one said we HAVE to use VMware.

This is why solutions for clustering, file servers, and email should be handled by physical machines, and not in a VM.

Since we know there are limitations, why force it? Just accept what is, and move on.

Re: Some musings on MSCS limitations

9. Aug 20, 2007 8:47 PM in response to: MattG
wcrahen (Expert, 353 posts since Sep 24, 2004)
HA is "clustering" for the ESX host itself, not the VMs or the apps running in those VMs. So it is possible to have an Exchange or SQL VM fail and HA will do nothing about it. HA is great for physical failures, while MSCS is good for those plus server/application failures. My opinion: HA will often satisfy most needs and MSCS might not be needed.

Re: Some musings on MSCS limitations

10. Aug 21, 2007 2:18 AM in response to: MattG
Anders (Expert, 1,301 posts since Oct 3, 2003)
So does it make sense to consider VI3 HA sufficient
for Exchange 2003 or SQL 2005 failover instead of
going with a VM MSCS cluster?

What are the advantages/disadvantages of this
scenario?


I'd say the most important thing you lose using VMware HA is rolling upgrades/patching.
With MSCS you can patch the second node, fail over, then patch the first node.
The same goes for upgrades.
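
As a sketch, that rolling procedure for a two-node cluster can be written out as an ordered plan. The function and node names are hypothetical, and the steps are plain strings rather than real cluster commands:

```python
def rolling_patch_plan(active, passive):
    """Return the patch order for a two-node active/passive cluster."""
    return [
        f"patch {passive}",                    # the service keeps running on the active node
        f"fail over from {active} to {passive}",
        f"patch {active}",                     # the former active node is now idle
    ]


for step in rolling_patch_plan("node1", "node2"):
    print(step)
```

The service is only interrupted for the duration of the single failover in the middle step, which is what a VMware HA-only setup cannot offer.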

You also have faster failover using MSCS.

That has to be weighed up against the simplicity of HA vs the complexity of MSCS.
There is no universal answer to this; each customer's requirements are different.
I have several customers who've ditched MSCS as well as several who found VMware HA not meeting their needs.

- Anders

Re: Some musings on MSCS limitations

11. Aug 21, 2007 3:03 AM in response to: oschistad
bertdb (Master, 1,332 posts since Sep 13, 2005)
NPIV might bring some interesting changes to the MSCS support in the future. If every node has its own WWN identity on the fabric, SCSI reservations can be maintained even after VMotion. This is one of the reasons why VMotion of MSCS nodes isn't possible: the master node would lose its status after VMotion.

Of course, the SAN timeout being greater than the cluster timeout remains an issue that also blocks VMotion of cluster nodes.

Re: Some musings on MSCS limitations

12. Aug 21, 2007 6:46 AM in response to: Anders
MattG (Expert, 527 posts since Jun 21, 2004)
So how does HA for a Database (SQL) compare to MSCS with regard to data consistency when a host fails?

While MSCS is a cluster, I am assuming that since a host failed, there will be transactions that aren't committed that would be committed on failover and transactions that aren't complete that would be tossed?

If this is the case, is this any different from what HA provides in terms of data consistency?

Thanks,
-MattG

Re: Some musings on MSCS limitations

13. Aug 21, 2007 10:13 AM in response to: MattG
BUGCHK (Master, 953 posts since Nov 7, 2005)
In both cases, the database must roll back uncommitted transactions.
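
A heavily simplified sketch of that recovery step, assuming a hypothetical in-memory log of `write` and `commit` records. Real database engines use write-ahead logging with separate undo and redo phases; this only shows that the outcome depends on the log, not on whether MSCS or HA restarted the database:

```python
def recover(log):
    """Rebuild state from a log, keeping only writes of committed transactions.

    The log format is hypothetical:
    ("write", txn_id, key, value) or ("commit", txn_id).
    """
    committed = {txn for op, txn, *_ in log if op == "commit"}
    state = {}
    for op, txn, *rest in log:
        if op == "write" and txn in committed:
            key, value = rest
            state[key] = value
    return state


log = [
    ("write", 1, "balance", 100),
    ("commit", 1),
    ("write", 2, "balance", 250),  # host failed before transaction 2 committed
]
print(recover(log))  # transaction 2's write is discarded
```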

Re: Some musings on MSCS limitations

14. Nov 5, 2007 2:38 PM in response to: Mike_Laverick
sean1017 (Enthusiast, 36 posts since Sep 6, 2007)
Here's my situation... We are implementing VI3 for DR purposes. Once all our production machines are converted to VMDKs on the SAN, they'll sync to another SAN in the DR site. So in order for this to work, all our machines must be on the SAN.

This is a problem because we also have 3 MSCS clusters that we need to virtualize. If I go the supported route and put the system vmdks on local storage, then there's no way to replicate those vmdks to the DR SAN.

So, is anybody running MSCS with the system vmdks on SANs? If so, how has it behaved?

I currently have a file serving cluster virtualized with both nodes' system vmdks stored on the SAN and I don't appear to be having problems, but I'm worried about putting this into production.

Am I missing another option that would satisfy our DR plan? Any advice?

Thanks,
Sean

p.s. how do you just add a post to the topic instead of having to reply to a specific post?
