One of the most common questions I get from customers goes something like this: how can we guarantee the consistency of our data and ensure we capture all transactions during a vMotion event, especially for a high-throughput application?
The answer, of course, lies in how ESX and Virtual Center execute the transaction, and in the understanding that when a vMotion event occurs, CPU instructions are not being mirrored or moved or shunted to the new system in any way. We are simply performing a copy of the data held in memory and the
# How can we guarantee consistency of the database and ensure we capture all transactions during a vMotion event, especially for a high-throughput application?
When an administrator issues a vMotion command, ESX and Virtual Center do the following:
Provision new VM on target host.
Precopy memory from source to target, with ongoing memory changes logged in a memory bitmap
Quiesce VM on the source host and copy memory bitmap to target host
Start VM on target host
"Demand page" the source VM when applications attempt to read/write modified memory
Note: the new VM comes up before all of the memory has been copied over. We can do this because, once the initial copy has taken place, subsequent changes made by the application touch only a fraction of the memory pages.
"Background page" the source VM until all memory has been successfully copied
Delete VM from source host
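To make the sequence above a little more concrete, here is a toy simulation in Python. It is only a sketch of the steps as I have described them, not how ESX is actually implemented: the function name `simulate_vmotion`, the page counts, and the use of a Python set to stand in for the memory bitmap are all assumptions made for illustration.

```python
import random


def simulate_vmotion(num_pages=1000, writes_during_precopy=50,
                     accesses_after_start=200, seed=0):
    """Toy model of the steps above. All names and numbers are hypothetical."""
    rng = random.Random(seed)
    source = [0] * num_pages      # page contents on the source host
    target = [None] * num_pages   # page contents on the target host
    dirty = set()                 # stand-in for the memory bitmap

    # Precopy: copy every page once while the guest keeps writing on the source.
    # Roughly `writes_during_precopy` writes are spread across the copy.
    step = max(1, num_pages // max(1, writes_during_precopy))
    for page in range(num_pages):
        target[page] = source[page]
        if writes_during_precopy and page % step == 0:
            written = rng.randrange(num_pages)
            source[written] += 1
            dirty.add(written)    # log the change; the target copy may be stale

    # Quiesce: the source stops running, so its memory is frozen from here on.
    expected = list(source)
    dirty_at_switchover = len(dirty)

    # Start on target: demand-page any flagged page the guest touches.
    demand_paged = 0
    for _ in range(accesses_after_start):
        page = rng.randrange(num_pages)
        if page in dirty:
            target[page] = source[page]
            dirty.discard(page)
            demand_paged += 1
        # The guest never sees a stale page.
        assert target[page] == expected[page]

    # Background paging: pull whatever is still flagged, then we are done.
    background_paged = len(dirty)
    for page in sorted(dirty):
        target[page] = source[page]
    dirty.clear()

    return {
        "dirty_at_switchover": dirty_at_switchover,
        "demand_paged": demand_paged,
        "background_paged": background_paged,
        "remaining_dirty": len(dirty),
        "consistent": target == expected,
    }


if __name__ == "__main__":
    print(simulate_vmotion())
```

The point of the sketch is the bookkeeping: only the pages flagged during the precopy ever need to be fetched again, and every one of them is fetched either on demand or in the background before the source VM is deleted.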
Are there in fact situations in which we can't vMotion because there are too many transactions?
Theoretically, we agree there may be applications that change memory too fast for vMotion to keep up with, in which case ESX may never actually finish the memory copy from one ESX server to the other. Matt and I both agree, however, that this is extremely rare, and we ought to be able to architect around it. If you are seeing this, Mike, we should examine the situation in detail to make sure the implemented solution matches the business goals.
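A quick back-of-the-envelope model shows why "too fast to keep up" is a question of rates. The sketch below assumes the copy happens in repeated rounds, where each round re-sends the pages dirtied during the previous one. That round-based model, the function name `rounds_to_converge`, and every number in it are assumptions for illustration, not measured ESX behavior.

```python
def rounds_to_converge(total_pages, copy_rate, dirty_rate,
                       threshold=100, max_rounds=50):
    """Return how many copy rounds it takes until at most `threshold` pages
    remain, or None if the copy never gets there within `max_rounds`.

    copy_rate and dirty_rate are in pages per second (hypothetical values).
    """
    remaining = total_pages
    for round_no in range(1, max_rounds + 1):
        # Copying `remaining` pages takes remaining / copy_rate seconds; the
        # guest dirties dirty_rate pages per second during that time. Integer
        # arithmetic keeps the result exact.
        remaining = min(total_pages, remaining * dirty_rate // copy_rate)
        if remaining <= threshold:
            return round_no
    return None


if __name__ == "__main__":
    # Copy link much faster than the application dirties memory: converges.
    print(rounds_to_converge(1_000_000, copy_rate=100_000, dirty_rate=10_000))
    # Application dirties memory as fast as we can copy it: never converges.
    print(rounds_to_converge(1_000_000, copy_rate=100_000, dirty_rate=100_000))
```

In this model the outstanding set shrinks by the ratio of dirty rate to copy rate each round, so the copy only fails to finish when the application dirties pages about as fast as the link can move them.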
Can I use vMotion with instances of RAC that are clustered?
Absolutely. You can vMotion an instance of a clustered Oracle database; there is nothing inherent in the design of the application that would prevent this.
Does all of this apply to SQL as well (as far as you know)?
There are, of course, many differences in the way Microsoft and Oracle have implemented clustering, and therefore some considerations regarding design, implementation and support, but in general the same discussion applies, yes.
So there are a couple of take-aways from this:
The original VM is used for a brief period of time to supply memory pages (if necessary) that haven't yet been copied over.
The copy process continues in the background while the new VM is up and running.
We are not touching the data; we only copy the active memory pages from one VM to another. This is significant because, as Matt will tell you, Oracle consistency doesn't depend on what is in memory but on the logs on disk, which don't move during a vMotion event. Therefore:
If the transaction completes, it will be on disk. Case closed.
In the very rare event that a transaction is sitting in memory and hasn't been flushed to disk, the memory bitmap will refer us to the source VM for that page or set of pages.
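The "if it completes, it is on disk" point is the general write-ahead logging idea, and a tiny sketch may help. This is not Oracle's implementation; the `TinyLog` class and `recover` function are hypothetical names, and the only real calls used are Python's standard `os.write` and `os.fsync`.

```python
import os
import tempfile


class TinyLog:
    """Hypothetical append-only log: a commit returns only after the record
    has been forced to disk."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

    def commit(self, record):
        os.write(self.fd, (record + "\n").encode("utf-8"))
        os.fsync(self.fd)  # do not acknowledge the commit before this returns

    def close(self):
        os.close(self.fd)


def recover(path):
    """Rebuild the list of committed records from disk alone."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "redo.log")
        log = TinyLog(path)
        log.commit("txn 1: debit A, credit B")
        log.commit("txn 2: debit B, credit C")
        log.close()
        print(recover(path))
```

Nothing in `recover` looks at memory, which is the whole point: whatever happens to the memory pages during the move, a committed transaction can be rebuilt from the log on disk, and the disk does not move.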
Long story short, it works. If any of you do in fact have databases that you haven't been able to vMotion, I would be very interested to know more about that VM and the ESX hosts it is running on. Where are we seeing the traffic? Disk I/O? Network I/O? What caused the timeout? You get my drift.
Thanks, and happy computing!
Sr. Systems Engineer
VMware, Inc. - Southern VA