Re: SAN snapshots of LUN

gumaheru · ‎10-03-2006

Hey guys,

I have VMs that use raw device mappings pointing to LUNs on a NetApp filer, and I want to snapshot those LUNs with the NetApp snapshot feature. What I was hoping to do was use VMware ESX 3.0 to take a snapshot of the VM from the command line, just like in the GUI, and then take a snapshot of the LUN. I am trying to use vcbSnapshot, but can't get the syntax right. The reason I want to do it this way is that with the ESX snapshot in place I can guarantee that there aren't any writes to the disk. The filer doesn't quiesce the disks first; it just takes a snap of the LUN. This has caused some problems, and I have had to revert to an older snapshot than I wanted to. What would be the best way to go about doing this?

vmmeup · ‎10-03-2006

You can use "vcbSnapAll -a poweronstate:on -r /backup_location" at the command line to snapshot all your running vm's to a location which could be another lun where you store your snaps and then snapshot that lun. I also found a pretty handy script that someone wrote that works pretty well. I took this script and modified it to fit my needs and it does a good job.

http://www.tooms.dk/?page=http%3A//www.tooms.dk/forum/topic.asp%3FTOPIC_ID%3D128

gumaheru · ‎10-04-2006

vmmeup,

I was using vcbSnapAll before. It is a nice tool, but I still wish to go with the setup I described above. That way I don't have to allocate more space for another LUN; I just take a snapshot of the LUN changes. That is why I want to create a snapshot like the GUI does and then take a snapshot using the filer. Then I can copy the snapshot on the filer and just restart the VM host. This is faster than doing the vcbRestore.

Eddy · ‎10-04-2006

Don't you have the Snap Agents installed on your VMs?

We use IPstor (virtual storage), which works similarly to NetApp... We snap and replicate our LUNs across the WAN to our DR sites...

The Snap Agents quiesce the LUNs for us...

Go Virtual!

admin · ‎10-04-2006

(like Eddy said...)

vcbSnap* will not come into play for your particular RDM case. RDMs were designed to give the VM a dedicated control and data channel to the LUN. As a consequence, you need to provide your own SAN management, I/O quiescing and LUN snapshotting software inside the VM when using physical compatibility RDMs. In other words, you can't mix and match hardware-level functions with VMFS snapshotting functions. You get one or the other.

Note that you can use vcbSnap* to make a VMFS-based snapshot of your virtual compatibility RDM. But clearly that would defeat your purpose, which is to use NetApp snapshots.

Javaman310 · ‎10-18-2006

I have the same scenario. NetApp has a good paper on how to use their LUN snapshots for RDM disks. Unfortunately, it was written for ESX 2.5, and I am trying to achieve the same result in 3.0. The vmware-cmd utility no longer supports "addredo", which was NetApp's recommended method of quiescing the vmdk. Does anyone have ideas on this? I am experimenting with vcbsnapshot, but previous posts in this thread lead me to believe that the snapshot data is not stored in the RDM LUN, but rather in the VMFS volume.

gumaheru · ‎10-19-2006

I placed a ticket with Netapp and IBM. Netapp and IBM both responded with the following:

See the document "VMware Scripting API" - http://www.vmware.com/pdf/Scripting_API_23.pdf. This release of the scripting APIs introduces compatibility with ESX Server 3. New operations are introduced that create and manage snapshots for the virtual machine as a unit. The VmPerl API names for the new operations are:

• VmPerl::VM::create_snapshot(name, description, quiesce, memory)

• VmPerl::VM::revert_to_snapshot()

• VmPerl::VM::remove_snapshot()

• VmPerl::VM::has_snapshot()

charlesgage · ‎11-30-2006

Did you have any luck with this? We're in a very similar position, using RDM luns on NetApp. I want to use NetApp snapshots, but need a way to quiesce the VM first.

As far as I can tell, the API operations create a VMware snapshot, which will use additional space.

All I want to do is:

quiesce VM IO

take NetApp snapshot

resume normal operation

Thanks.

wolfwolf · ‎11-30-2006

Hi,

We're using a few VMs on an ESX Server 3.0.1 host with raw devices on an EqualLogic iSCSI array, and we're also looking for a way to take advantage of the EqualLogic array snapshot capabilities.

Any suggestions?

Thanks.

Javaman310 · ‎11-30-2006

"Did you have any luck with this? We're in a very similar position, using RDM luns on NetApp. I want to use NetApp snapshots, but need a way to quiesce the VM first. All I want to do is: quiesce VM IO, take NetApp snapshot, resume normal operation."

After speaking to the VCB design team lead at VMworld, I am now confident that I have this problem licked.

Using NetApp's article #3393 which was written for ESX 2.5 as a guide, I have updated the process to work with ESX 3.0. Of course, there is a rumor that NetApp is developing a "Snapmanager for VMware" which may replace this entire process.

Below is my complete script that quiesces the RDM LUN, takes the NetApp snapshot, then unquiesces the RDM LUN. But first, some important background information:

Although VMware changed some names and commands between ESX 2.5 and 3.0, the concepts and functionality are the same. Taking a VMware snapshot, whether from the GUI or the command line, does indeed quiesce the vmdk.

In the case of RDM LUNs, the metadata and change log for the snapshot are stored in the VMFS volume where the vmx file resides. But that’s ok- if you had to do a restore, you would be recovering to the point in time when the VMware snapshot was taken. Incidentally, even without quiescing the vmdk, your SAN snapshot copy would be crash consistent. For a Windows VM, Checkdisk would run if you booted from that state- but it would work.

In ESX 2.5, the change log was called a REDO log. ESX 3.0 calls them COW files. COW files store a copy of each changed block in 16MB increments. So for multiple VMware snapshots or VM’s with heavy disk I/O, the COW files can grow large quite quickly. But for our purpose of taking a SAN snapshot, this shouldn’t be a concern.

There IS a concern with the number of simultaneous VMware snapshots on a single VMFS volume on the SAN, with multiple ESX servers connecting to it. According to the VCB design team, too many simultaneous disk writes to VMFS from multiple ESX servers is BAD. I couldn’t get them to go on record with a specific number of VM’s, but it was suggested that above 50 VM’s on the same VMFS volume, all with VMware snapshots, you might see performance degradation, or worse: disk corruption or even ESX crashing. You could use DRS affinity rules to group VM’s, or use some other grouping criteria in SQL when you run your quiescing script.

Another problem to consider when automating VMware snapshots is that in the world of VMotion, DRS, and HA, the host ESX server for a particular VM can change. The script needs to send the quiescing command to the correct host, but how will it know which host has a particular VM?

My script leverages the features of several utilities.

First, VCB comes with several nifty command-line utilities. In fact, the VCB product is just a set of scripts designed to plug in to your backup software. But those scripts all use these core utilities. You can write your own scripts to use the VCB utilities for other purposes. The VCB utilities access Virtual Center’s SQL database to get information about the target VM. In particular we will use vcbsnapshot. This command locates the proper host for the VM and creates the VMware snapshot:

vcbsnapshot -h localhost -u username -p password -c moref:vm-611 quiesce

In my setup, I have VCB installed on the same Windows server as my Virtual Center (thus the localhost). But the target hostname should be your Virtual Center server. The username and password must be an account with appropriate permissions in Virtual Center. You can store the username and password in a config file so that you don’t have to write them on the command line every time. The target VM is formatted “moref:vm-611”, where moref stands for “managed object reference” and 611 is the record ID in SQL. “quiesce” is the name of the snapshot. Because you are sending the command to Virtual Center, and Virtual Center knows on which host the VM resides, it automatically forwards the command to the appropriate ESX host.

Then to remove the snapshot created above, the command is:

vcbsnapshot -h localhost -u username -p password -d moref:vm-611 ssid:snapshot-619

Yes, you DO need to know the snapshot ID.

So how do we get these ID numbers? The answer is OSQL. OSQL is a command-line utility that comes with SQL Enterprise Manager. You only need the osql.exe file. Just copy it from your installation of SQL. You can use osql to query the Virtual Center database and get the necessary ID numbers for both the VM’s and the snapshots:

osql -U username -P password -D VirtualCenter -Q "SELECT ID FROM vc_user.VPX_VM WHERE IS_TEMPLATE = 0"

The above command connects to the database referenced by the System DSN (in your ODBC connectors) called “VirtualCenter”. It logs into the database with the listed username and password, which must have rights on the database in SQL. Then the actual SQL query pulls the record ID number for each VM that is not a template. (You can customize the filter criteria here). In my example, “vc_user” is the database owner in SQL. VPX_VM is the SQL table that stores the list of all VM's.

The output from the above command looks like this:

ID

----

108

110

…

1226

(19 rows affected)

Similarly, to get the snapshot ID, we use:

osql -U username -P password -D VirtualCenter -Q "select VM_ID,ID from vc_user.VPX_SNAPSHOT where SNAPSHOT_NAME = 'Quiesce'"

This query returns a list of the VM ID and Snapshot ID for each VM that has a snapshot named “quiesce”. The output looks like this:

VM_ID ID

----- ----

108 1228

110 1229

… …

1226 1246

(19 rows affected)

Now all we need to do is clean this up so that it can be used in a script. Add the -o switch to direct output to a text file. Add the switch “-h-1” to remove the column headers. Then, when you run the vcbsnapshot command in a FOR loop, you set the “(” character as the EOL character. That will cause the FOR loop to skip the last line of your output that says “(## Rows Affected)”. More on the FOR loop later:

osql -U vc_user -P vmware -D VirtualCenter -h-1 -Q "SELECT ID FROM vc_user.VPX_VM WHERE IS_TEMPLATE = 0" -o output.txt
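For anyone consuming that output file from a Linux box rather than a Windows FOR loop, the same cleanup can be sketched in shell. This is only an illustration: it assumes the file looks like the listings above (one ID per line, blank lines, a trailing "(N rows affected)" line, possibly Windows line endings), and the function name read_ids is made up for the example.

```sh
#!/bin/sh
# read_ids FILE
# Print one numeric ID per line from an "osql -h-1" output file.
# Assumption: IDs are whole numbers, possibly padded with spaces. Every
# other line (blank, "(19 rows affected)", stray dashes) is ignored.
read_ids() {
    # Strip carriage returns first, since the file comes from Windows.
    tr -d '\r' < "$1" | while read -r first _rest || [ -n "$first" ]; do
        case $first in
            ''|*[!0-9]*) continue ;;   # skip empty or non-numeric first field
        esac
        printf '%s\n' "$first"
    done
}
```

Filtering on "first field is a number" is slightly stricter than the eol=( trick: it also drops any header or separator line that slips through.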

The last component is the set of commands to manage the NetApp snapshots. I use RSH to connect to NetApp. You will need a user account with RSH permissions in NetApp. It can just be a local user account on the Windows box where you run the script. Modify /etc/hosts.equiv on NetApp to grant RSH permission.

RSH nameofNetApp snap create volname snapshotname

Similarly, there is also “snap rename” and “snap delete” to manage old snapshots.
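The way these three commands combine into a rotation can be sketched as a dry run. The snapshot names in the usage line are the ones used in the script in this post; the function name rotation_commands and the idea of printing the commands instead of running them are assumptions made for illustration.

```sh
#!/bin/sh
# rotation_commands FILER VOLUME OLDEST [NEWER ...] NEWEST
# Print (not run) the rsh commands that age a set of snapshots:
# delete the oldest, shift every other snapshot down one slot,
# then create a fresh snapshot under the newest name.
rotation_commands() {
    filer=$1
    volume=$2
    shift 2
    older=$1                 # name of the slot being overwritten
    shift
    echo "rsh $filer snap delete $volume $older"
    for name in "$@"; do
        echo "rsh $filer snap rename $volume $name $older"
        older=$name
    done
    echo "rsh $filer snap create $volume $older"
}

# Example: the four-deep set used in this thread.
rotation_commands nameofNetApp volname vmware_snap4 vmware_snap3 vmware_previous vmware_recent
```

Note that in the real script the create step does not run right after the renames: the VMs are quiesced in between.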

Now, let’s put it all together. Here is the script:

---- BEGIN SCRIPT ----

@echo off

echo Script Execution Commenced at %date% %time% >> snapall.log

c:

cd "\Program Files\VMware\VMware Consolidated Backup Framework"

REM Manage old snapshots on NetApp

RSH nameofNetApp snap delete volname vmware_snap4 >> snapall.log

RSH nameofNetApp snap rename volname vmware_snap3 vmware_snap4 >> snapall.log

RSH nameofNetApp snap rename volname vmware_previous vmware_snap3 >> snapall.log

RSH nameofNetApp snap rename volname vmware_recent vmware_previous >> snapall.log

REM Get list of VM's from SQL database

osql -U vc_user -P password -D VirtualCenter -h-1 -Q "SELECT ID FROM vc_user.VPX_VM WHERE IS_TEMPLATE = 0" -o vmlist.txt

REM quiesce all VM's by creating vmware snapshots.

for /F "eol=(" %%i in (vmlist.txt) do vcbsnapshot -h localhost -u username -p password -c moref:vm-%%i Quiesce >> snapall.log

REM "if errorlevel 1" is true for any exit code of 1 or higher. After a FOR loop it reflects only the last command run.

if errorlevel 1 goto Error

REM create new snapshot on NetApp

RSH nameofNetApp snap create volname vmware_recent >> snapall.log

if errorlevel 1 goto Error

REM Get list of vmware snapshot ID's from SQL database

osql -U username -P password -D VirtualCenter -h-1 -Q "select VM_ID,ID from vc_user.VPX_SNAPSHOT where SNAPSHOT_NAME = 'Quiesce'" -o sslist.txt

REM unquiesce all VM's by removing all vmware snapshots.

for /F "eol=( tokens=1,2" %%i in (sslist.txt) do vcbsnapshot -h localhost -u username -p password -d moref:vm-%%i ssid:snapshot-%%j >> snapall.log

if errorlevel 1 goto Error

goto End

:Error

echo An error occurred during script execution. See snapall.log for details.

:End

echo Script Execution complete at %date% %time% >> snapall.log

---- END SCRIPT ----

Works like a champ!

You can now use a LUN clone to do a machine-level restore, or mount the LUN clone with SnapDrive to do file-level restores.
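One caveat on the flow: the batch script is written to jump to :Error when a step fails, which means a failed NetApp snapshot would leave the VMware snapshots in place. A minimal shell sketch of the same quiesce / snapshot / unquiesce ordering, where the unquiesce step always runs, might look like this. The three step functions are hypothetical stand-ins that only print; they are not real VCB or NetApp commands.

```sh
#!/bin/sh
# Hypothetical stand-ins for the real steps. In practice these would wrap
# the vcbsnapshot create loop, the filer's "snap create", and the
# vcbsnapshot delete loop. Here they only print what they would do.
quiesce_all()   { echo "quiesce"; }
san_snapshot()  { echo "snapshot"; return "${SNAP_RC:-0}"; }  # SNAP_RC fakes a failure
unquiesce_all() { echo "unquiesce"; }

# Run the three steps in order. The SAN snapshot is skipped if quiescing
# failed, the unquiesce step always runs, and a non-zero exit code is
# returned if any step failed.
run_backup() {
    rc=0
    if quiesce_all; then
        san_snapshot || rc=$?
    else
        rc=$?
    fi
    unquiesce_all || rc=$?
    return "$rc"
}
```

Calling run_backup with SNAP_RC=3 set simulates a failed filer snapshot: the function still prints "unquiesce" and returns 3.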

cmauro · ‎12-01-2006

Very nice Javaman. I will try.

charlesgage · ‎12-08-2006

This is fantastic! Exactly what I was looking for! Many thanks for taking the time to help!

I've got the script working just fine now, I'm not seeing any COW files, but I am getting the delta.vmdk files which I assume is the redo log.

I haven't yet decided whether to use RDM or VMFS's in our production environment, but going by NetApp best practice, we'd need to create one flexvol per VM (in either case), which means the script would somehow have to work out which flexvol to snapshot. I've been thinking about using the notes field in VC to store the NetApp filer & flexvol names, rather than trying to work it out from SAN paths which I think would be too complex for a scripting novice like myself.
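A rough sketch of the notes-field idea, assuming the notes hold space-separated key=value pairs such as "filer=fas01 flexvol=vm_web01". That format, the key names, and the function name note_value are all hypothetical; getting the notes text out of VC is a separate problem not covered here.

```sh
#!/bin/sh
# note_value KEY NOTES
# Print the value of the first KEY=value token found in NOTES.
# Returns 1 if the key is not present.
# Assumption: values contain no spaces or shell wildcard characters.
note_value() {
    nv_key=$1
    nv_notes=$2
    for nv_token in $nv_notes; do          # intentional word splitting
        case $nv_token in
            "$nv_key"=*)
                printf '%s\n' "${nv_token#*=}"   # strip "key=" prefix
                return 0
                ;;
        esac
    done
    return 1
}

# Example with made-up names:
note_value flexvol "filer=fas01 flexvol=vm_web01"
```

The script could then build its per-VM "snap create" call from the two values instead of a single hard-coded volume name.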

Our NetApp TAM is suggesting that we'd be better using VMFS's for VM boot LUNs and RDM's for data disks - not sure why though. Either way, management is going to be tricky if we need a 1:1 flexvol, VM ratio as we're looking at 100's of LUNs, VMFS's and vols to manage. I've yet to see a large scale VI3 on NetApp FCP implementation.

Thanks again for sharing your script!!!!

Javaman310 · ‎12-08-2006

I'm glad it was helpful.

Check out NetApp's article #3393. Based on that, the way I have my storage set up is that all my RDM LUN's are in one flexvol. Then each VM has its own RDM LUN as its boot disk. Additional disks for a VM are also RDM LUN's. The RDM's must be run in virtual mode so that VMware snapshot capability is available. (Although if you needed to cluster a VM with a physical server, you would need to run the RDM in physical mode or use the iSCSI initiator inside the VM).

When you take a NetApp Snapshot, you are getting the whole volume. That's why my script quiesces all the VM's at the same time.

If your VM boot disk is in VMFS, you will still be able to do machine-level restore, but you lose the ability to do file-level restore. It also makes machine-level restores take longer because you would have to copy the vmdk from the snapshot back into a production vmfs volume.

With RDM LUN's as the primary disk for a VM, to restore you can literally connect the VM to a LUN clone and voila! You're up! The time it takes to restore is the time it takes to remember the LUN clone command syntax! In my testing... it was about four minutes.

The really amazing part of this process is the ability to do file-level restores. Because with RDM the contents of the LUN are formatted by the guest OS, you can connect that LUN to another machine, physical or virtual, and access the contents. You could either attach the LUN clone to another VM as an RDM, or you could mount it on a physical server with SnapDrive. I have tested both and it works flawlessly.

Again, this backup/restore scenario requires no network traffic, and no CPU load on any server.

MikeAvery · ‎02-21-2007

Very helpful Javaman, thanks!

I have some questions on this and similar topics..

My test setup uses a FAS3020 with one volume to store VM configuration over NFS to ESX hosts. One volume per VM, with an iSCSI LUN for each drive within the volume. These LUNs are RDM, virtual access.

I am looking to quiesce these volumes using VCB and VC facilities and use snapshots. Is there wisdom anyone would impart?

duane · ‎07-09-2007

This is how I have mine set up. I can afford to reboot my VMs, so this script does reboot them.

It also reboots all VMs. I used vi to create the scripts.

I have a script that runs from the crontab called snapvm.sh in /usr/sbin

snapvm.sh

/usr/sbin/vmshutdown>>/var/log/reboot.log 2>&1

ssh root@vmware2.domain.com /usr/sbin/vmshutdown>>/var/log/reboot.log 2>&1

ssh root@vmware3.domain.com /usr/sbin/vmshutdown>>/var/log/reboot.log 2>&1

ssh root@vmware4.domain.com /usr/sbin/vmshutdown>>/var/log/reboot.log 2>&1

sleep 600

sync

ssh root@vmware2.domain.com /bin/sync

ssh root@vmware3.domain.com /bin/sync

ssh root@vmware4.domain.com /bin/sync

echo ' Starting ---> '`date`

ssh root@netapp01.domain.com snap delete vmware old3 >>/var/log/netapp.log 2>&1

ssh root@netapp01.domain.com snap rename vmware old2 old3 >>/var/log/netapp.log 2>&1

ssh root@netapp01.domain.com snap rename vmware old1 old2 >>/var/log/netapp.log 2>&1

ssh root@netapp01.domain.com snap rename vmware new old1 >>/var/log/netapp.log 2>&1

ssh root@netapp01.domain.com snap create vmware new >>/var/log/netapp.log 2>&1

echo ' Finished ---> '`date`

/usr/sbin/vmstartup>>/var/log/reboot.log 2>&1

ssh root@vmware2.domain.com /usr/sbin/vmstartup>>/var/log/reboot.log 2>&1

ssh root@vmware3.domain.com /usr/sbin/vmstartup>>/var/log/reboot.log 2>&1

ssh root@vmware4.domain.com /usr/sbin/vmstartup>>/var/log/reboot.log 2>&1

It calls vmshutdown on each local ESX box

vmshutdown

#!/bin/bash

#####################################################################

# set the paths that the vmware tools need

PATH="/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin"

#####################################################################

# try to do a nice shutdown of every VM that is powered on

count_vm_on=0

for vm in `vmware-cmd -l` ; do

#echo "VM: " $vm

for VMstate in `vmware-cmd "$vm" getstate` ; do

#echo $VMstate

# if the VM is powered ON

if [ "$VMstate" = "on" ] ; then

echo " "

echo "VM: " $vm

echo "State: is on and will now tell it to shut down"

echo "Shutting down: " $vm

vmware-cmd "$vm" stop trysoft

vmwarecmd_exitcode=$?

if [ $vmwarecmd_exitcode -ne 0 ] ; then

echo "exitcode: $vmwarecmd_exitcode so will now turn it off hard"

vmware-cmd "$vm" stop hard

fi

count_vm_on=$((count_vm_on + 1))

sleep 2

# if the VM is powered OFF

elif [ "$VMstate" = "off" ] ; then

echo " "

echo "VM: " $vm

echo "State: is off, so i skip it"

# if the VM is suspended

elif [ "$VMstate" = "suspended" ] ; then

echo " "

echo "VM: " $vm

echo "State: is suspended, so i skip it"

# any other word of the getstate output, such as "getstate()" or "="

else

printf ""

#echo "unknown state: " $VMstate

fi

done

done

echo "$(date): ---- Shutdown Done"
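The inner loop in vmshutdown iterates over every word that vmware-cmd getstate prints, which is why it needs a catch-all branch for the words that are not a state. A small sketch of an alternative is to pick out only the last word of the line. This assumes the output has the form "getstate() = on", as the catch-all comment in the script suggests; the function name vm_state is invented for the example.

```sh
#!/bin/sh
# vm_state LINE
# Print the last whitespace-separated word of a getstate output line.
# Assumption: the state ("on", "off", "suspended") is the final word.
vm_state() {
    printf '%s\n' "$1" | awk '{ print $NF }'
}

# Example with a literal line instead of a live vmware-cmd call:
vm_state "getstate() = on"
```

On an ESX host this would be called as state=$(vm_state "$(vmware-cmd "$vm" getstate)"), after which a single if or case on $state replaces the inner loop.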

It then snaps with netapp, which is very quick

Then it calls vmstartup on each local ESX box

#!/bin/sh

#

#####################################################################

# set the paths that the vmware tools need

PATH="/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin"

#####################################################################

# try to do a nice power on of VMs

count_vm_on=0

for vm in `vmware-cmd -l` ; do

#echo "VM: " $vm

for VMstate in `vmware-cmd "$vm" getstate` ; do

#echo $VMstate

# if the VM is powered OFF

if [ "$VMstate" = "off" ] ; then

echo " "

echo "VM: " $vm

echo "State: is off and will now tell it to turn on"

echo "Turning on: " $vm

vmware-cmd "$vm" start trysoft

sleep 30

fi

done

done

echo "$(date): ---- Startup Done"

I then backup my netapp snapshot to tape.

I'm sure this script can be greatly improved on. I'm just posting this info to give some insight into how it could be done.