Hey guys,
I have VMs that use raw device mappings pointing to LUNs on a NetApp filer. I want to take snapshots of the LUNs using the NetApp snapshot feature. What I was hoping to do was use VMware ESX 3.0 to take a snapshot of the system from the command line, just like in the GUI, and then take a snapshot of the LUN. I am trying to use vcbSnapshot, but can't get the syntax right. The reason I want to do it this way is that with the ESX snapshot in place I can guarantee that there aren't any writes to the disk. The filer doesn't quiesce the disks before it takes a snap of the LUN. This has caused some problems, and I have had to revert to an older snapshot than I wanted to. What would be the best way to go about doing this?
You can use "vcbSnapAll -a poweronstate:on -r /backup_location" at the command line to snapshot all your running vm's to a location which could be another lun where you store your snaps and then snapshot that lun. I also found a pretty handy script that someone wrote that works pretty well. I took this script and modified it to fit my needs and it does a good job.
http://www.tooms.dk/?page=http%3A//www.tooms.dk/forum/topic.asp%3FTOPIC_ID%3D128
vmmeup,
I was using vcbSnapAll before. It is a nice tool. I still wish to go with my desired setup above. This way I don't have to allocate more space for another LUN. I just want to take a snapshot of the LUN changes. That is why I want to create a snapshot like the GUI does and then take a snapshot using the filer. This way I can copy the snapshot on the filer and just restart the VM host. This is faster than doing the vcbRestore.
Don't you have the Snap Agents installed on your VMs?
We use IPstor (Virtual Storage), which works similarly to NetApp... We snap and replicate our LUNs across the WAN to our DR sites...
The Snap Agents quiesce the LUNs for us...
(like Eddy said...)
vcbSnap* will not come into play for your particular RDM case. RDMs were designed to give the VM a dedicated control and data channel to the LUN. As a consequence, you need to provide your own SAN mgmt, IO quiescing and LUN snapshotting software inside the VM when using physical compatibility RDMs. In other words, you can't mix and match hardware level functions with VMFS snapshotting functions. You get one or the other.
Note that you can use vcbSnap* to make a VMFS-based snapshot of your virtual compatibility RDM. But clearly that would defeat your purpose, which is to use NetApp snapshots.
I have the same scenario. NetApp has a good paper on how to use their LUN snapshots for RDM disks. Unfortunately, it was written for ESX 2.5. I am trying to achieve the same result in 3.0. The vmware-cmd utility no longer supports "addredo", which was NetApp's recommended method of quiescing the vmdk. Does anyone have ideas on this? I am experimenting with vcbsnapshot, but previous posts in this thread lead me to believe that the snapshot data is not stored in the RDM LUN, but rather in the VMFS volume.
I placed a ticket with Netapp and IBM. Netapp and IBM both responded with the following:
The document "VMware Scripting API" - http://www.vmware.com/pdf/Scripting_API_23.pdf. This release of the scripting APIs introduces compatibility with ESX Server 3. New operations are introduced that create and manage snapshots for the virtual machine as a unit. The VmPerl API names for the new operations are:
• VmPerl::VM::create_snapshot(name, description, quiesce,memory)
• VmPerl::VM::revert_to_snapshot()
• VmPerl::VM::remove_snapshot()
• VmPerl::VM::has_snapshot()
Did you have any luck with this? We're in a very similar position, using RDM luns on NetApp. I want to use NetApp snapshots, but need a way to quiesce the VM first.
As far as I can tell, the API operations create a VMware snapshot, which will use additional space.
All I want to do is:
quiesce VM IO
take NetApp snapshot
resume normal operation
Thanks.
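For what it's worth, the three steps listed above can be wrapped so that the "resume" step always runs, even when the snapshot step fails. This is only a sketch in POSIX shell: `run_consistent_snapshot` and the three `demo_*` hooks are made-up names, and the hooks just print text here. In a real setup they would call whatever quiesce, filer snapshot, and unquiesce commands you end up using.

```shell
#!/bin/sh
# Run quiesce -> snapshot -> resume, and always attempt the resume step.
# The three arguments are names of commands or shell functions (hypothetical hooks).
run_consistent_snapshot() (
    quiesce_cmd=$1
    snapshot_cmd=$2
    resume_cmd=$3
    status=0
    if "$quiesce_cmd"; then
        # Only take the array snapshot if quiescing reported success.
        "$snapshot_cmd" || status=1
    else
        status=1
    fi
    # Resume regardless of what happened above, so VMs are not left quiesced.
    "$resume_cmd" || status=1
    return "$status"
)

# Placeholder hooks that only print what they would do.
demo_quiesce()  { echo "quiesce VM IO"; }
demo_snapshot() { echo "take NetApp snapshot"; }
demo_resume()   { echo "resume normal operation"; }

run_consistent_snapshot demo_quiesce demo_snapshot demo_resume
```

The function returns non-zero if any step failed, so a calling script can log the failure without skipping the resume.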
Hi,
We're using a few VMs on an ESX Server 3.0.1 host with raw devices on an EqualLogic iSCSI array, and we're also looking into a way to take advantage of the EqualLogic array snapshot capabilities.
Any suggestions?
Thanks.
After speaking to the VCB design team lead at VMworld, I am now confident that I have this problem licked.
Using NetApp's article #3393 which was written for ESX 2.5 as a guide, I have updated the process to work with ESX 3.0. Of course, there is a rumor that NetApp is developing a "Snapmanager for VMware" which may replace this entire process.
Below is my complete script that quiesces the RDM LUN, takes the NetApp snapshot, then unquiesces the RDM LUN. But first, some important background information:
Although VMware changed some names and commands between ESX 2.5 and 3.0, the concepts and functionality are the same. Taking a VMware snapshot, whether from the GUI or the command line, does indeed quiesce the vmdk.
In the case of RDM LUNs, the metadata and change log for the snapshot are stored in the VMFS volume where the vmx file resides. But that's OK: if you had to do a restore, you would be recovering to the point in time when the VMware snapshot was taken. Incidentally, even without quiescing the vmdk, your SAN snapshot copy would be crash consistent. For a Windows VM, Checkdisk would run if you booted from that state, but it would work.
In ESX 2.5, the change log was called a REDO log. ESX 3.0 calls them COW files. COW files store a copy of each changed block in 16MB increments. So for multiple VMware snapshots or VMs with heavy disk I/O, the COW files can grow large quite quickly. But for our purpose of taking a SAN snapshot, this shouldn't be a concern.
There IS a concern with the number of simultaneous VMware snapshots on a single VMFS volume on the SAN, with multiple ESX servers connecting to it. According to the VCB design team, too many simultaneous disk writes to VMFS from multiple ESX servers is BAD. I couldn't get them to go on record with a specific number of VMs, but it was suggested that above 50 VMs on the same VMFS volume, all with VMware snapshots, you might see performance degradation, or worse: disk corruption or even ESX crashing. You could use DRS affinity rules to group VMs, or use some other grouping criteria in SQL when you run your quiescing script.
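One way to act on the grouping idea is to split the VM ID list into small batches and handle one batch at a time. The sketch below is only an illustration in POSIX shell: `batch_ids` is a made-up helper, and the batch size of 2 is a placeholder for the demo, not a number anyone at VMware endorsed.

```shell
#!/bin/sh
# Read VM IDs (one per line) on stdin and print them in groups of at most
# $1 IDs, one space-separated group per output line.
batch_ids() (
    size=${1:-10}
    awk -v size="$size" '
        NF { printf "%s%s", (n % size ? " " : (n ? "\n" : "")), $1; n++ }
        END { if (n) printf "\n" }'
)

# Demo with IDs in the same form the SQL query returns.
printf '108\n110\n1226\n' | batch_ids 2 | while read -r batch; do
    echo "would snapshot this group together: $batch"
done
```

Each output line is one group, so the outer loop is the natural place to quiesce a group, wait, and move on to the next.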
Another problem to consider when automating VMware snapshots is that in the world of VMotion, DRS, and HA, the host ESX server for a particular VM can change. The script needs to send the quiescing command to the correct host, but how will it know which host has a particular VM?
My script leverages the features of several utilities.
First, VCB comes with several nifty command-line utilities. In fact, the VCB product is just a set of scripts designed to plug in to your backup software. But those scripts all use these core utilities. You can write your own scripts to use the VCB utilities for other purposes. The VCB utilities access Virtual Center's SQL database to get information about the target VM. In particular we will use vcbsnapshot. This command locates the proper host for the VM and creates the VMware snapshot:
vcbsnapshot -h localhost -u username -p password -c moref:vm-611 quiesce
In my setup, I have VCB installed on the same Windows server as my Virtual Center (thus the localhost). But the target hostname should be your Virtual Center server. The username and password must be an account with appropriate permissions in Virtual Center. You can store the username and password in a config file so that you don't have to write them on the command line every time. The target VM is given as moref:vm-611; moref stands for managed object reference, and 611 is the record ID in SQL. quiesce is the name of the snapshot. Because you are sending the command to Virtual Center, and Virtual Center knows on which host the VM resides, it automatically forwards the command to the appropriate ESX host.
Then to remove the snapshot created above, the command is:
vcbsnapshot -h localhost -u username -p password -d moref:vm-611 ssid:snapshot-619
Yes, you DO need to know the snapshot ID.
So how do we get these ID numbers? The answer is OSQL. OSQL is a command-line utility that comes with SQL Enterprise Manager. You only need the osql.exe file. Just copy it from your installation of SQL. You can use osql to query the Virtual Center database and get the necessary ID numbers for both the VMs and the snapshots:
osql -U username -P password -D VirtualCenter -Q "SELECT ID FROM vc_user.VPX_VM WHERE IS_TEMPLATE = 0"
The above command connects to the database referenced by the System DSN (in your ODBC connectors) called VirtualCenter. It logs into the database with the listed username and password, which must have rights on the database in SQL. Then the actual SQL query pulls the record ID number for each VM that is not a template. (You can customize the filter criteria here). In my example, vc_user is the database owner in SQL. VPX_VM is the SQL table that stores the list of all VM's.
The output from the above command looks like this:
ID
----
108
110
1226
(19 rows affected)
Similarly, to get the snapshot ID, we use:
osql -U username -P password -D VirtualCenter -Q "select VM_ID,ID from vc_user.VPX_SNAPSHOT where SNAPSHOT_NAME = 'Quiesce'"
This query returns a list of the VM ID and Snapshot ID for each VM that has a snapshot named quiesce. The output looks like this:
VM_ID ID
----- ----
108 1228
110 1229
1226 1246
(19 rows affected)
Now all we need to do is clean this up so that it can be used in a script. Add the -o switch to direct output to a text file. Add the switch -h-1 to remove the column headers. Then, when you run the vcbsnapshot command in a FOR loop, you set the ( character as the EOL character. That will cause the FOR loop to skip the last line of your output that says (## rows affected). More on the FOR loop later:
osql -U vc_user -P vmware -D VirtualCenter -h-1 -Q "SELECT ID FROM vc_user.VPX_VM WHERE IS_TEMPLATE = 0" -o output.txt
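For anyone doing the same cleanup from a Unix-style shell rather than a Windows FOR loop, a rough equivalent is to keep only the lines made up entirely of numbers. This is a sketch, not part of the original setup: `extract_ids` is a hypothetical helper, and the sample input just mimics the shape of the osql output described above.

```shell
#!/bin/sh
# Keep only lines whose fields are all integers. This drops blank lines,
# column headers, dashed separators and the "(N rows affected)" trailer.
# Surviving lines are re-emitted with single spaces between fields.
extract_ids() {
    awk 'NF > 0 {
             ok = 1
             for (i = 1; i <= NF; i++) if ($i !~ /^[0-9]+$/) ok = 0
             if (ok) { $1 = $1; print }
         }' "$@"
}

# Sample input shaped like the two-column snapshot query output.
printf ' 108 1228\n 110 1229\n 1226 1246\n\n(3 rows affected)\n' | extract_ids
```

Filtering on "all fields numeric" works for both the one-column VM list and the two-column VM/snapshot list, so one helper covers both queries.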
The last component is the set of commands to manage the NetApp snapshots. I use RSH to connect to NetApp. You will need a user account with RSH permissions in NetApp. It can just be a local user account on the Windows box where you run the script. Modify /etc/hosts.equiv on NetApp to grant RSH permission.
RSH nameofNetApp snap create volname snapshotname
Similarly, there is also snap rename and snap delete to manage old snapshots.
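To make the rotation order easier to see, here is a small dry-run sketch in POSIX shell that only prints the `snap` commands for a given list of names, newest first. `rotation_commands` is a made-up helper for illustration. It does not contact the filer, and in a real run the create step would come after the VMs are quiesced.

```shell
#!/bin/sh
# Print (do not run) the snap commands that rotate a set of snapshot names.
# Usage: rotation_commands <volume> <newest> [<older> ...]
# The oldest name is deleted, every other name shifts one slot older,
# and the newest name is created again.
rotation_commands() (
    volume=$1
    shift
    [ $# -ge 1 ] || return 1
    newest=$1
    prev=$1
    shift
    renames=""
    # Walk newest -> oldest, prepending each rename so that the oldest
    # pair is renamed first and no name is overwritten.
    for name in "$@"; do
        renames="snap rename $volume $prev $name
$renames"
        prev=$name
    done
    echo "snap delete $volume $prev"
    printf '%s' "$renames"
    echo "snap create $volume $newest"
)

rotation_commands volname vmware_recent vmware_previous vmware_snap3 vmware_snap4
```

Printing the commands first is a cheap way to check the order before prefixing each line with RSH and the filer name.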
Now, let's put it all together. Here is the script:
--
@echo off
echo Script Execution Commenced at %date% %time% >> snapall.log
c:
cd "\Program Files\VMware\VMware Consolidated Backup Framework"
REM Manage old snapshots on NetApp
RSH nameofNetApp snap delete volname vmware_snap4 >> snapall.log
RSH nameofNetApp snap rename volname vmware_snap3 vmware_snap4 >> snapall.log
RSH nameofNetApp snap rename volname vmware_previous vmware_snap3 >> snapall.log
RSH nameofNetApp snap rename volname vmware_recent vmware_previous >> snapall.log
REM Get list of VM's from SQL database
osql -U vc_user -P password -D VirtualCenter -h-1 -Q "SELECT ID FROM vc_user.VPX_VM WHERE IS_TEMPLATE = 0" -o vmlist.txt
REM quiesce all VM's by creating vmware snapshots.
for /F "eol=(" %%i in (vmlist.txt) do vcbsnapshot -h localhost -u username -p password -c moref:vm-%%i Quiesce >> snapall.log
if errorlevel 1 goto :Error
REM create new snapshot on NetApp
RSH nameofNetApp snap create volname vmware_recent >> snapall.log
if errorlevel 1 goto :Error
REM Get list of vmware snapshot ID's from SQL database
osql -U username -P password -D VirtualCenter -h-1 -Q "select VM_ID,ID from vc_user.VPX_SNAPSHOT where SNAPSHOT_NAME = 'Quiesce'" -o sslist.txt
REM unquiesce all VM's by removing all vmware snapshots.
for /F "eol=( tokens=1,2" %%i in (sslist.txt) do vcbsnapshot -h localhost -u username -p password -d moref:vm-%%i ssid:snapshot-%%j >> snapall.log
if errorlevel 1 goto :Error
goto End
:Error
echo An error occurred during script execution. See snapall.log for details.
:End
echo Script Execution complete at %date% %time% >> snapall.log
--
Works like a champ!
You can now use a LUN clone to do a machine-level restore, or mount the LUN clone with SnapDrive to do file-level restores.
Very nice Javaman. I will try.
This is fantastic! Exactly what I was looking for! Many thanks for taking the time to help!
I've got the script working just fine now, I'm not seeing any COW files, but I am getting the delta.vmdk files which I assume is the redo log.
I haven't yet decided whether to use RDM or VMFS's in our production environment, but going by NetApp best practice, we'd need to create one flexvol per VM (in either case), which means the script would somehow have to work out which flexvol to snapshot. I've been thinking about using the notes field in VC to store the NetApp filer & flexvol names, rather than trying to work it out from SAN paths which I think would be too complex for a scripting novice like myself.
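If the notes-field idea is the route taken, the parsing side can stay very simple. The sketch below assumes a tag of the form `netapp=<filer>:<flexvol>` somewhere in the notes text. That tag format, the helper name `parse_netapp_note`, and the sample names are all invented for illustration, and getting the notes text out of VC is not covered here.

```shell
#!/bin/sh
# Pull "filer flexvol" out of a notes string that contains a tag such as
# netapp=filer01:vm_web01 (hypothetical tag format). Prints nothing if
# the tag is absent.
parse_netapp_note() (
    printf '%s\n' "$1" |
        sed -n 's/.*netapp=\([^: ]*\):\([^ ]*\).*/\1 \2/p'
)

# Example with an invented notes string.
parse_netapp_note "Exchange server netapp=filer01:vm_exch01 owner=ops"
```

The two words it prints can be fed straight into the filer and volume positions of the `snap create` command.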
Our NetApp TAM is suggesting that we'd be better using VMFS's for VM boot LUNs and RDM's for data disks - not sure why though. Either way, management is going to be tricky if we need a 1:1 flexvol, VM ratio as we're looking at 100's of LUNs, VMFS's and vols to manage. I've yet to see a large scale VI3 on NetApp FCP implementation.
Thanks again for sharing your script!!!!
I'm glad it was helpful.
Check out NetApp's article #3393. Based on that, the way I have my storage set up is that all my RDM LUN's are in one flexvol. Then each VM has its own RDM LUN as its boot disk. Additional disks for a VM are also RDM LUN's. The RDM's must be run in virtual mode so that VMware snapshot capability is available. (Although if you needed to cluster a VM with a physical server, you would need to run the RDM in physical mode or use the iSCSI initiator inside the VM).
When you take a NetApp Snapshot, you are getting the whole volume. That's why my script quiesces all the VM's at the same time.
If your VM boot disk is in VMFS, you will still be able to do machine-level restore, but you lose the ability to do file-level restore. It also makes machine-level restores take longer because you would have to copy the vmdk from the snapshot back into a production vmfs volume.
With RDM LUN's as the primary disk for a VM, to restore you can literally connect the VM to a LUN clone and voila! You're up! The time it takes to restore is the time it takes to remember the LUN clone command syntax! In my testing... it was about four minutes.
The really amazing part of this process is the ability to do file-level restores. Because with RDM the contents of the LUN are formatted by the guest OS, you can connect that LUN to another machine, physical or virtual, and access the contents. You could either attach the LUN clone to another VM as an RDM, or you could mount it on a physical server with SnapDrive. I have tested both and it works flawlessly.
Again, this backup/restore scenario requires no network traffic, and no CPU load on any server.
Very helpful Javaman, thanks!
I have some questions on this and similar topics..
My test setup uses a FAS3020 with one volume to store VM configuration over NFS to ESX hosts. One volume per VM, with an iSCSI LUN for each drive within the volume. These LUNs are RDM, virtual access.
I am looking to quiesce these volumes using VCB and VC facilities and use snapshots. Is there any wisdom anyone would impart?
This is how I have mine set up. I can reboot my VMs, so this script does reboot the VMs.
It also reboots all VMs. I used vi to create the scripts.
I have a script that runs from the crontab called snapvm.sh in /usr/sbin
snapvm.sh
/usr/sbin/vmshutdown>>/var/log/reboot.log 2>&1
ssh root@vmware2.domain.com /usr/sbin/vmshutdown>>/var/log/reboot.log 2>&1
ssh root@vmware3.domain.com /usr/sbin/vmshutdown>>/var/log/reboot.log 2>&1
ssh root@vmware4.domain.com /usr/sbin/vmshutdown>>/var/log/reboot.log 2>&1
sleep 600
sync
ssh root@vmware2.domain.com /bin/sync
ssh root@vmware3.domain.com /bin/sync
ssh root@vmware4.domain.com /bin/sync
echo ' Starting ---> '`date`
ssh root@netapp01.domain.com snap delete vmware old3 >>/var/log/netapp.log 2>&1
ssh root@netapp01.domain.com snap rename vmware old2 old3 >>/var/log/netapp.log 2>&1
ssh root@netapp01.domain.com snap rename vmware old1 old2 >>/var/log/netapp.log 2>&1
ssh root@netapp01.domain.com snap rename vmware new old1 >>/var/log/netapp.log 2>&1
ssh root@netapp01.domain.com snap create vmware new >>/var/log/netapp.log 2>&1
echo ' Finished ---> '`date`
/usr/sbin/vmstartup>>/var/log/reboot.log 2>&1
ssh root@vmware2.domain.com /usr/sbin/vmstartup>>/var/log/reboot.log 2>&1
ssh root@vmware3.domain.com /usr/sbin/vmstartup>>/var/log/reboot.log 2>&1
ssh root@vmware4.domain.com /usr/sbin/vmstartup>>/var/log/reboot.log 2>&1
It calls vmshutdown on each local ESX box
vmshutdown
#!/bin/bash
#####################################################################
# set the paths that the vmware tools need
PATH="/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin"
#####################################################################
# try to do a nice shutdown of every VM that is powered on
count_vm_on=0
for vm in `vmware-cmd -l` ; do
    #echo "VM: " $vm
    # getstate prints "getstate() = on", so this loops over each word of that output
    for VMstate in `vmware-cmd "$vm" getstate` ; do
        #echo $VMstate
        # if the VM is powered ON
        if [ "$VMstate" = "on" ] ; then
            echo " "
            echo "VM: " $vm
            echo "State: is on and will now tell it to shut down"
            echo "Shutting down: " $vm
            vmware-cmd "$vm" stop trysoft
            vmwarecmd_exitcode=$?
            if [ $vmwarecmd_exitcode -ne 0 ] ; then
                echo "exitcode: $vmwarecmd_exitcode so will now turn it off hard"
                vmware-cmd "$vm" stop hard
            fi
            count_vm_on=$((count_vm_on+1))
            sleep 2
        # if the VM is powered OFF
        elif [ "$VMstate" = "off" ] ; then
            echo " "
            echo "VM: " $vm
            echo "State: is off, so i skip it"
        # if the VM is suspended
        elif [ "$VMstate" = "suspended" ] ; then
            echo " "
            echo "VM: " $vm
            echo "State: is suspended, so i skip it"
        # if the word is "getstate()" or "=", ignore it
        else
            printf ""
            #echo "unknown state: " $VMstate
        fi
    done
done
echo "$(date):---- Shutdown Done";
It then snaps with netapp, which is very quick
Then it calls vmstartup on each local ESX box
#!/bin/sh
#
#####################################################################
# set the paths that the vmware tools need
PATH="/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin"
#####################################################################
# try to do a nice power on of VMs
count_vm_on=0
for vm in `vmware-cmd -l` ; do
    #echo "VM: " $vm
    for VMstate in `vmware-cmd "$vm" getstate` ; do
        #echo $VMstate
        # if the VM is powered OFF
        if [ "$VMstate" = "off" ] ; then
            echo " "
            echo "VM: " $vm
            echo "State: is off and will now tell it to turn on"
            echo "Turning on: " $vm
            vmware-cmd "$vm" start trysoft
            sleep 30
        fi
    done
done
echo "$(date):---- Startup Done";
I then back up my NetApp snapshot to tape.
I'm sure this script can be greatly improved on; I'm just posting this info
to give some insight on how it could be done.
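One possible improvement is to replace the fixed `sleep 600` in snapvm.sh with a loop that polls until no VM reports "on", with a timeout as a safety net. The sketch below is only an assumption about how that could look: `wait_for_all_off` and the `demo_states` hook are made-up names, and the hook is expected to print one bare state per line (on a real ESX host it would have to wrap the `vmware-cmd ... getstate` output, which is not shown here).

```shell
#!/bin/sh
# Poll a state-listing command until no line is exactly "on".
# Usage: wait_for_all_off <list_states_cmd> [timeout_seconds] [interval_seconds]
# Returns 0 once every VM is off or suspended, 1 if the timeout is reached.
wait_for_all_off() (
    list_states=$1
    timeout=${2:-600}
    interval=${3:-10}
    waited=0
    while "$list_states" | grep -qx 'on'; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
)

# Placeholder hook for the demo: pretends every VM is already down.
demo_states() { printf 'off\nsuspended\n'; }

wait_for_all_off demo_states 30 5 && echo "all VMs are off, safe to snapshot"
```

With a check like this the filer snapshot can start as soon as the last VM is down, and a non-zero return gives the script a reason to skip the snapshot instead of taking it while a VM is still running.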