oschistad
Enthusiast

Some musings on MSCS limitations

As many of us already know, there are issues concerning running Microsoft clusters on VMware that warrant some planning.

First of all, there's the requirement to store the boot disk of the VM on a *local* (as in direct-attached) file system, which is a rather severe limitation.

This in turn makes it impossible to run MSCS on a boot-from-SAN ESX server, there being no local drives in such a scenario.

VMware has provided little information as to why you aren't allowed to store the boot disk of a cluster node on the SAN, but in this thread a VMware technician explains:

"if we have swapped out a page for your VM to SAN and there is a failover or some other event we have no option but to stall the VM (e.g. it will not get any CPU time) until the page has been brought in"

If I understand that statement correctly, we are talking about overcommitment of memory and VMware swap here. If an ESX server is overcommitting memory and the MSCS node has active memory in swap, the VM must be paused while that memory is fetched from disk. This is not normally a problem, BUT: in a SAN with multiple paths, you sometimes experience path failovers, which ESX typically takes 30-45 seconds to handle. If the swapped-out pages required to populate the active memory of a cluster node happened to reside on this temporarily unavailable disk, the VM in question would be stalled for the entire 45 seconds - and that is long enough for the sister cluster node to detect a cluster failure and initiate failover actions.

The scenario above is my personal best guess as to the actual reason for VMware's rather strange requirement.

However, if the assumptions I describe above are correct, then the root cause of the problem is memory overcommitment combined with the likelihood of path switching in a SAN as opposed to a local storage environment.

If so, then one way to work around the problem would simply be to reserve *all* of the VM's memory. By doing so you guarantee that VMware will never swap any of its pages out to VM swap, and hence a path failover will not cause the VM to pause.
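For anyone who would rather script that reservation than set it by hand, a rough sketch along these lines should do it. This assumes the later pyVmomi bindings and uses placeholder connection details and a hypothetical VM name ("mscs-node1"); it is not anything prescribed in this thread.

```python
# Sketch: reserve all of a VM's configured memory so the VMkernel never
# swaps its pages out. pyVmomi assumed; host, credentials and the VM name
# "mscs-node1" are placeholders for your own environment.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret")
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "mscs-node1")

    spec = vim.vm.ConfigSpec()
    spec.memoryAllocation = vim.ResourceAllocationInfo(
        reservation=vm.config.hardware.memoryMB)  # reserve 100% of configured RAM
    vm.ReconfigVM_Task(spec=spec)
    # ... wait for the task to complete before relying on the reservation
finally:
    Disconnect(si)
```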

Furthermore, the restrictions imposed by VMware to support MSCS in a virtualized environment are in and of themselves an increased risk to your platform! If you think about it, the reason failovers occur in a SAN is that some component just failed. If the same happened in a local storage system, your disk would most likely be gone - period. And any failure which causes the ESX server to go offline will take your cluster node down too - with no way of starting the VM from another ESX host, since it resides on a local disk. Lastly, there's the fact that ESX servers by design contain no vital data locally and thus do not require any specific data protection measures beyond the obvious - all of which changes when you start hosting production VMs on local storage.

Personally, even for a production platform, I am far from convinced that going with local storage for the boot disk is the correct solution to this problem.

Michelle_Laveri
Virtuoso

This is an interesting post - and I'm surprised no one has responded to it... I recently blogged about my MSCS experiences with everything stored in VMDKs on the SAN, coupled with VMware HA...

http://www.rtfm-ed.co.uk/?p=373

It's a non-supported configuration, for all the reasons you have already outlined.

The VMware people I have spoken to have said that the reasons they don't support this configuration are:

Firstly, if your boot disks, quorum and shared disks are all on the SAN and SAN connectivity fails, the “active” node can become confused about what has occurred. From a troubleshooting perspective it is easier if local storage is used.

Secondly, time-out values in the Windows registry and latency on the SAN fabric could cause unnecessary triggering of node fail-over.
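For reference, the registry time-out usually being talked about here is the guest's disk I/O time-out. A minimal sketch using Python's standard winreg module, run inside the Windows guest with admin rights, would look like the following; the 60-second value is the commonly cited guidance rather than a figure from this thread.

```python
# Sketch: raise the Windows guest's disk I/O time-out so a SAN path
# failover is less likely to be reported to MSCS as a dead disk.
# Run inside the guest as administrator. The 60-second value is the
# commonly cited recommendation, not something stated in this thread.
import winreg

key = winreg.OpenKey(
    winreg.HKEY_LOCAL_MACHINE,
    r"SYSTEM\CurrentControlSet\Services\Disk",
    0,
    winreg.KEY_SET_VALUE,
)
winreg.SetValueEx(key, "TimeoutValue", 0, winreg.REG_DWORD, 60)
winreg.CloseKey(key)
```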

In a way I agree with you - there does seem to be a "gap" between what the technology can do and what is actually supported by VMware. But I believe there are good reasons why they don't support the configuration...

Regards

Mike

Regards
Michelle Laverick
@m_laverick
http://www.michellelaverick.com
Anders
Expert

Hi Mike, it's not simply a troubleshooting issue, and it has nothing to do with paging either.

Simply put, MSCS is a bit "itchy" when it comes to timeouts. If there is a SAN fabric event like a fail-over, we have to put the VM to sleep until we can handle the IO (remember, no caching in the vmkernel). This can take longer than the cluster's failover time, causing the standby node to try to take control of the cluster resources. When the original cluster owner is rescheduled to run again, it has no concept of the time that has passed, nor that its resources are now owned by the other server, if it was successful in grabbing them. We then have a split-brain cluster.

Now, this is an unforeseen side effect of virtualization, but MS puts similar restrictions on regular HW. Not 100% sure about boot from SAN, but you do need a separate HBA for OS and data/quorum.

MS is moving away from SCSI resets as the control mechanism for this exact reason in Longhorn:

"No longer uses SCSI Bus Resets which can be disruptive on a SAN"

- Anders

PatrickMSlatter
Enthusiast

The simplest way around this that I have found is to use software-initiator-based iSCSI LUNs for the cluster resources. It's 100% supported by MS under Windows 2003.

Admittedly most of my MSCS clusters are using RDM LUNs. I use WAFL snapshots from my NetApp storage array to bring up clusters at different points for testing.
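For completeness, here is a rough sketch of how attaching such a physical-mode RDM to a node can be scripted. pyVmomi is assumed, and the device path, LUN size, controller key and unit number are all placeholders - the controller itself would need physical bus sharing configured, which is not shown here.

```python
# Sketch: attach a LUN to a VM as a physical-mode raw device mapping (RDM),
# the way shared MSCS data/quorum disks are usually presented. pyVmomi is
# assumed; the device path, capacity, controller key and unit number are
# placeholders for your own environment.
from pyVmomi import vim

def physical_rdm_spec(lun_device_name, lun_capacity_kb, controller_key, unit_number):
    backing = vim.vm.device.VirtualDisk.RawDiskMappingVer1BackingInfo()
    backing.deviceName = lun_device_name        # e.g. "/vmfs/devices/disks/vml.0200..."
    backing.compatibilityMode = "physicalMode"  # SCSI commands pass through to the array
    backing.diskMode = "independent_persistent"
    backing.fileName = ""                       # mapping file created next to the VM's files

    disk = vim.vm.device.VirtualDisk()
    disk.backing = backing
    disk.capacityInKB = lun_capacity_kb
    disk.controllerKey = controller_key         # a SCSI controller set to physical bus sharing
    disk.unitNumber = unit_number

    change = vim.vm.device.VirtualDeviceSpec()
    change.operation = vim.vm.device.VirtualDeviceSpec.Operation.add
    change.fileOperation = vim.vm.device.VirtualDeviceSpec.FileOperation.create
    change.device = disk
    return vim.vm.ConfigSpec(deviceChange=[change])

# Usage (node_vm is a vim.VirtualMachine already looked up):
# node_vm.ReconfigVM_Task(spec=physical_rdm_spec(
#     "/vmfs/devices/disks/vml.0200...", 10485760, 1000, 1))
```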

I certainly would not advocate the use of MSCS clusters under ESX for anything other than testing and development purposes, but for test and dev it's an ideal clustering environment for my uses.

oschistad
Enthusiast

Thanks for the answer, Anders - very illuminating.


The use of the word "page" led me to believe that this was specifically related to VM swap and not general I/O.


However, this has now got me wondering what the difference between the boot disk and the cluster disks is, in relation to the issue at hand. If a path failure in the SAN leads to a clustered disk being unavailable, won't the VMkernel handle things in exactly the same fashion as when the boot VMDK is unavailable, and suspend processing for the VM? Or is this one of the differences between a VMDK and an RDM?


Also, as to split brain - that is *exactly* why there is a quorum disk in most clusters, to ensure that only one node can "win" the fight for ownership of the cluster. Sounds like MSCS has some holes in its logic for handling split brain - if you used to be the active node and you suddenly lose your quorum disk, you are probably not master any more...

Anyway, this is a very interesting and important topic as it directly affects how I need to plan our clusters...

jasonboche
Immortal

"Also, as to split brain - that is *exactly* why there is a quorum disk in most clusters, to ensure that only one node can "win" the fight for ownership of the cluster. Sounds like MSCS has some holes in its logic for handling split brain - if you used to be the active node and you suddenly lose your quorum disk, you are probably not master any more..."

The answer is that the quorum logic wasn't designed with virtualization in mind, nor with the scenarios that a virtualized environment can introduce versus a physical one. MSCS quorum works (usually...), but not entirely in a VM environment, for the reasons already explained above.

VCDX3 #34, VCDX4, VCDX5, VCAP4-DCA #14, VCAP4-DCD #35, VCAP5-DCD, VCPx4, vEXPERTx4, MCSEx3, MCSAx2, MCP, CCAx2, A+
MattG
Expert

So does it make sense to consider VI3 HA sufficient for Exchange 2003 or SQL 2005 failover instead of going with a VM MSCS cluster?

What are the advantages/disadvantages of this scenario?

Thanks,

-MattG

-MattG If you find this information useful, please award points for "correct" or "helpful".
RParker
Immortal

"First of all there's the requirement for storing the boot disk of the VM on a \_local_ (as in Direct Attached) file system, which is a rather severe limitation. "

This isn't a requirement; this is only *if* you want VMware to support your setup.

RParker
Immortal

Basic common sense says that VMware isn't the REPLACEMENT for everything in computing; it attempts to consolidate solutions, and it can't do EVERYTHING.

For what it does, it does it very well. For everything else there are always physical servers. No one said we HAVE to use VMware.

This is why solutions for clustering, file servers, and email should be handled by physical machines, and not in a VM.

Since we know there are limitations, why force it? Just accept what is, and move on.

wcrahen
Expert

HA is "clustering" for the ESX host itself, not the VMs or the apps running in those VMs. So, it is possible to have a Exchange or SQL VM fail and HA will do nothing about it. HA is great for physical failures while MSCS is good for that plus server/application failures. My opion, HA will often satify most needs and MSCS might not be needed.

Anders
Expert

"So does it make sense to consider VI3 HA sufficient for Exchange 2003 or SQL 2005 failover instead of going with a VM MSCS cluster? What are the advantages/disadvantages of this scenario?"

I'd say the most important thing you lose with VMware HA is rolling upgrades/patching. With MSCS you can patch the second node, fail over, then patch the first node. Same goes for upgrades. You also get faster failover with MSCS.

That has to be weighed against the simplicity of HA versus the complexity of MSCS. There is no universal answer to this; each customer's requirements are different. I have several customers who've ditched MSCS, as well as several who found VMware HA did not meet their needs.

- Anders

bertdb
Virtuoso

NPIV might bring some interesting changes to MSCS support in the future. If every node has its own WWN identity on the fabric, SCSI reservations can be maintained even after VMotion. This is one of the reasons why VMotion of MSCS nodes isn't possible: the master node would lose its status after VMotion.

Of course, the SAN timeout being greater than the cluster timeout remains an issue that also blocks VMotion of cluster nodes.

MattG
Expert

So how does HA for a Database (SQL) compare to MSCS with regard to data consistency when a host fails?

While MSCS is a cluster, I am assuming that since a host failed, there will be transactions that weren't yet written out that get committed on failover, and transactions that weren't complete that get tossed?

If this is the case, is this no different than what HA provides in terms of data consistency?

Thanks,

-MattG

-MattG If you find this information useful, please award points for "correct" or "helpful".
BUGCHK
Commander

In both cases, the database must roll back uncommitted transactions.
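To illustrate the principle (this is purely an illustration, using SQLite as a stand-in for SQL Server's log-based recovery; the file and table names are made up):

```python
# Illustration only: whether the database comes back via an MSCS failover
# or a VMware HA restart, it recovers to the last committed state and
# uncommitted work is discarded. SQLite stands in for SQL Server here.
import sqlite3

DB = "failover-demo.db"

setup = sqlite3.connect(DB)
setup.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER, balance INTEGER)")
setup.execute("DELETE FROM accounts")
setup.execute("INSERT INTO accounts VALUES (1, 100)")
setup.commit()
setup.close()

# "Host failure": a transaction starts but is never committed.
node = sqlite3.connect(DB)
node.execute("UPDATE accounts SET balance = 0 WHERE id = 1")
node.close()  # abandoned without commit - recovery discards this work

survivor = sqlite3.connect(DB)
print(survivor.execute("SELECT balance FROM accounts WHERE id = 1").fetchone())  # (100,)
survivor.close()
```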

sean1017
Contributor

Here's my situation... We are implementing VI3 for DR purposes. Once all our production machines are converted to VMDKs on the SAN, they'll sync to another SAN in the DR site. So in order for this to work, all our machines must be on the SAN.

This is a problem because we also have 3 MSCS clusters that we need to virtualize. So if I go the supported route and put the system VMDKs on local storage, then there's no way to replicate those VMDKs to the DR SAN.

So, is anybody running MSCS with the system VMDKs on SANs? If so, how has it behaved?

I currently have a file-serving cluster virtualized with both nodes' system VMDKs stored on the SAN, and I don't appear to be having problems, but I'm worried about putting this into production.

Am I missing another option that would satisfy our DR plan? Any advice?

Thanks,

Sean

P.S. How do you just add a post to the topic instead of having to reply to a specific post?

JBraes
Enthusiast

Sean,

There really is no way around this problem.

If you store everything on your SAN, it will just work fine, only it is not supported.

Now for your DR-replication issues.

I know this is an additional cost, but you could have a look at Double-Take to satisfy your replication needs.

I also have another idea in mind and maybe this could help.

You said that all your VMs are VMDK-based, so why don't you copy your VMDKs from local disk to the SAN (using a scripted snapshot) and let the replication do the rest?

I never tried it, but this is how I would try to work around these cluster issues. Also, how many changes happen to your C: partitions? Not a lot, I suppose, so regular snapshot replication should be enough for a DR site.
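As an untested sketch of that idea (pyVmomi assumed; the datastore names, VMDK paths and snapshot name are all placeholders, and real code would wait for each task to finish):

```python
# Untested sketch: snapshot the VM so its base boot VMDK stops changing,
# then copy the base VMDK onto the SAN datastore that the array replicates
# to the DR site. pyVmomi assumed; names and paths are placeholders.
from pyVmomi import vim

def copy_boot_disk(si, vm, datacenter):
    # 1. Snapshot without memory: subsequent writes go to the delta file,
    #    leaving the base VMDK quiescent and safe to copy.
    vm.CreateSnapshot_Task(name="dr-copy", description="DR copy point",
                           memory=False, quiesce=True)

    # 2. Copy the now-stable base disk to the replicated SAN datastore.
    disk_mgr = si.RetrieveContent().virtualDiskManager
    task = disk_mgr.CopyVirtualDisk_Task(
        sourceName="[local-storage] mscs-node1/mscs-node1.vmdk",
        sourceDatacenter=datacenter,
        destName="[san-replicated] mscs-node1/mscs-node1.vmdk",
        destDatacenter=datacenter,
        force=False)
    # 3. Afterwards, remove the snapshot so the delta gets consolidated.
    return task
```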

sean1017
Contributor

JBraes,

Thanks for your reply. That's a good idea (copying VMDKs to the SAN), especially since the actual system drive rarely changes (only when I apply updates/SPs etc.). Worst case scenario, I'll give that a try.

But this line, "If you store everything on your SAN, it will just work fine, only it is not supported", has me tempted to just keep it as is. I'm not overly concerned about support in this case because I could just move it to local storage before calling if I need support. This is, of course, assuming that the problem isn't caused by not running the C: drives on local storage.

That's why I was wondering how many others are running MSCS boxes on SANs.

jbaxter714
Contributor

Do the new cluster modes (LCR and CCR) in Exchange 2007 change things at all? It seems that with a separate witness server you'd be able to avoid split-brain and such...

kghammond2009
Enthusiast

I am not sure if anyone is following this thread anymore. But I have two additional thoughts.

This thread emphasizes that there are potential issues with a cluster not behaving properly if the quorum drive is not located on properly supported shared storage.

As of vSphere Update 1, the quorum drive is supported on an FC SAN, correct?

A quorum drive is not supported on VMware iSCSI/NFS storage.

But can't you use a File Share Witness to get around this issue? If I understand this thread correctly, the concern is not with the actual shared storage misbehaving; the issue is the quorum drive on shared storage getting confused due to the virtualization layer.

Alternatively, as also stated, you could place the quorum drive on an iSCSI LUN using the MS iSCSI initiator, correct?

Once the quorum drive is dealt with, you can either place the shared storage on the MS iSCSI initiator to have it supported, or use an FC SAN. If you use VMware iSCSI/NFS storage, you will be unsupported, but this should work just fine, correct?
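For what it's worth, presenting a LUN through the MS iSCSI initiator inside the guest can be scripted with the built-in iscsicli tool. A rough sketch (Python wrapping iscsicli, run inside the Windows guest; the portal address and target IQN are placeholders):

```python
# Sketch: present the quorum/shared LUN to the guest with the Microsoft
# iSCSI initiator rather than through the ESX storage stack. Runs inside
# the Windows guest; portal address and target IQN are placeholders.
import subprocess

PORTAL = "192.168.10.50"                       # iSCSI target portal (placeholder)
TARGET = "iqn.1992-08.com.example:quorum-lun"  # target IQN (placeholder)

subprocess.run(["iscsicli", "QAddTargetPortal", PORTAL], check=True)
subprocess.run(["iscsicli", "QLoginTarget", TARGET], check=True)
# Then bring the disk online and assign a letter in Disk Management or
# diskpart, and point the cluster's quorum resource at it.
```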

Thank you,

Kevin
