Solved: Re: How safe is it to snapshot live Linux servers

subzero1697 · ‎06-27-2020

I have been working with VMware vSphere since 6.5.

I am certain that snapshots work very well when the VM in question is powered off.

I sometimes even power off VMs, take snapshots, power them back on again and make changes.

I have also occasionally snapshotted VMs while they were powered on, and I haven't had any problems with that to this day.

However, I am still unsure when it comes to this.

My question is: how safe is it to snapshot live (powered-on) servers that have active connections (transactions)?

Can something happen to these connections while the snapshot is being taken? Can data corruption occur somehow?
Does the type of server matter?
- OS: Linux/Windows
- Software: Database (postgresql, mysql, Oracle), Application with Frontend and Backend, applications that read/write from/to files, applications that read/write from/to databases, ...
If issues arise during the changes that I am making, will I be able to safely restore the server to the snapshot without any issues, regardless of the type of server I am running?

Thanks

daphnissov · ‎06-27-2020

Lots of different questions here so let's go through them one by one. Before that, however, the general response is, yes, it's safe to snapshot VMs when they're running and this is done millions of times a day for even the most critical workloads. The interesting parts arise when you need to then perform an action with that snapshot.

Can something happen to these connections while the snapshot is being taken? Can data corruption occur somehow?

Generally speaking, no, especially to the second question. Connections should also be maintained because the snapshot process completes extremely quickly on most VMs. The one case in which connections may be dropped is when you have a large number of virtual disks, poorly performing backend storage, and you quiesce a system with heavy transactional I/O.

Does the type of server matter?

Not really. Linux and Windows both snapshot similarly; however, different mechanisms are used to quiesce the system depending on the OS type. More below.

Software: Database (postgresql, mysql, Oracle), Application with Frontend and Backend, applications that read/write from/to files, applications that read/write from/to databases, ...

This is one of the rubs, not with taking the snapshot or even deleting it, but with reverting to it. When taking a snapshot of a powered-on system, the "quiesce" option is not enabled by default. With this option disabled, the disks are snapshotted without any coordination inside the guest. When reverted, the guest comes back up exactly as if there had been a power outage or a cut. In most cases this is fine, even with some databases that write to a transaction log first, like Postgres.

Other databases like MySQL and Oracle can be less tolerant of this and are better snapshotted with quiescence. This process communicates with a piece of software inside the guest (VMware Tools) and coordinates with the applications that respond to these quiesce requests so that they flush any in-memory data buffers to backend disk. Once this flush is done, the databases are said to be in a "consistent" state, at which point the snapshot is taken. This ensures that if the snapshot must be reverted, the system returns to a known good state.

Of course, that's not to say the system will only return to operation if the quiesce option was enabled, but it is the safest way, especially for systems that have a transactional database installed. This safety is especially important for VM backups, as they leverage snapshots to prepare the system. Some vendors have even gone so far as to write their own quiescence drivers rather than rely on those that come from VMware.
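For anyone scripting this rather than using the client, here is a minimal sketch of requesting a quiesced snapshot. It assumes `vm` is a pyVmomi `vim.VirtualMachine` object you have already looked up (connecting and finding the VM are not shown), and the helper name `take_snapshot` and its defaults are my own choices for illustration. It takes a disk-only snapshot (no memory), so a revert boots the guest from disk as described above.

```python
def take_snapshot(vm, name, description="", quiesce=True):
    """Ask vSphere to snapshot `vm`, quiescing the guest by default.

    `vm` is assumed to expose pyVmomi's CreateSnapshot_Task method.
    Quiescing only has an effect if the VM is powered on and
    VMware Tools is running inside the guest.
    """
    return vm.CreateSnapshot_Task(
        name=name,
        description=description,
        memory=False,     # disk-only: a revert behaves like a fresh boot
        quiesce=quiesce,  # True = coordinate with the guest before snapshotting
    )
```

The call returns a task object, so a real script would still need to wait for that task to finish and check its result before making changes on the VM.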

If issues arise during the changes that I am making, will I be able to safely restore the server to the snapshot without any issues, regardless of the type of server I am running?

Snapshot reversion is answered above. The other operation is a delete, which is unrelated to the quiesce process. A delete merges the blocks that changed after the snapshot was taken back into the base disk, so you accept every change made since the point of the snapshot.
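To make the revert/delete distinction concrete, here is a toy model in Python. It is only an illustration under simplifying assumptions: one snapshot, disks modelled as dictionaries of blocks, and a made-up `ToyDisk` class. It does not reflect VMware's actual on-disk format.

```python
class ToyDisk:
    """Toy model: a base disk plus at most one snapshot delta."""

    def __init__(self, blocks):
        self.base = dict(blocks)
        self.delta = None  # None means no snapshot exists

    def take_snapshot(self):
        # From now on, writes land in the delta and the base is frozen.
        self.delta = {}

    def write(self, block, data):
        target = self.base if self.delta is None else self.delta
        target[block] = data

    def read(self, block):
        # Blocks changed since the snapshot shadow the base disk.
        if self.delta is not None and block in self.delta:
            return self.delta[block]
        return self.base[block]

    def revert(self):
        # Revert: throw away everything written since the snapshot.
        if self.delta is None:
            raise RuntimeError("no snapshot to revert to")
        self.delta = {}

    def delete_snapshot(self):
        # Delete: merge the changed blocks into the base and keep them.
        if self.delta is None:
            raise RuntimeError("no snapshot to delete")
        self.base.update(self.delta)
        self.delta = None


disk = ToyDisk({0: "old"})
disk.take_snapshot()
disk.write(0, "new")
disk.revert()
print(disk.read(0))  # old

disk.write(0, "new")
disk.delete_snapshot()
print(disk.read(0))  # new
```

So a revert discards the changes and a delete accepts them, which is why deleting a snapshot is not a way to undo anything.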

Hope this answers your questions.

------------------
How to Ask for Help on Tech Forums
https://neonmirrors.net

subzero1697 · ‎06-27-2020

Thanks for sharing this information.

It helps clear some uncertainty.
