NetApp snapshots of VMGuests on NFS datastores in clustered environment

bjaming · 08-22-2008

Hello VMheads,

I'm currently deploying an ESX infrastructure here in sunny San Diego, and I have a question about NetApp snapshot technology and the scripted method of creating snapshots of VMGuests. I'm referring, of course, to the scripts in NetApp's TR-3428: http://media.netapp.com/documents/tr-3428.pdf

It's clear from the script that I need to define the filers, the volumes, and the location of the VMGuest configuration files. Now here comes the question...

Since my guests could run on any one of the three clustered ESX servers I am setting up, and could be reassigned between hosts dynamically by HA, DRS, VMotion, etc., is there a way to discover which guests exist, determine which filer and volume each one lives on, compile that information, and feed it into the snapshot script (referenced above on the NetApp site) every time a snapshot is created?

One idea is to use vmware-cmd -l (which lists the locations of the .vmx files of the registered guests), pipe that into awk to extract the output we want, write the result to a flat file, and reference that flat file as the variable in the NetApp snapshot script. We'll also need to know which volume each guest resides in and which other guests share that volume, since a snapshot captures every guest in the volume (i.e., they all need to be suspended and quiesced before the snapshot is taken, and then all need to be restarted afterwards).
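A minimal Python sketch of the "vmware-cmd -l piped into awk" idea, grouping guests by the filer volume a single snapshot would capture. It assumes `vmware-cmd -l` prints one .vmx path per line in the form /vmfs/volumes/<datastore>/<vm dir>/<vm>.vmx, with the datastore shown by its readable label; `DATASTORE_MAP` and every name in it are hypothetical placeholders for a datastore-to-(filer, volume) table you would maintain yourself.

```python
from collections import defaultdict

# Hypothetical table: ESX datastore label -> (filer, volume) backing it.
# The names are placeholders, not real defaults of any product.
DATASTORE_MAP = {
    "nfs_win01": ("filer1", "vm_windows"),
    "nfs_nix01": ("filer2", "vm_unix"),
}


def parse_vmx_list(output):
    """Return the .vmx paths from `vmware-cmd -l` output (assumed: one path per line)."""
    return [line.strip() for line in output.splitlines()
            if line.strip().endswith(".vmx")]


def datastore_of(vmx_path):
    """Extract <datastore> from a path shaped like /vmfs/volumes/<datastore>/.../<vm>.vmx."""
    parts = vmx_path.split("/")
    if len(parts) < 5 or parts[:3] != ["", "vmfs", "volumes"]:
        raise ValueError(f"unexpected .vmx path: {vmx_path}")
    return parts[3]


def group_by_volume(vmx_paths, datastore_map=DATASTORE_MAP):
    """Group guests by the (filer, volume) that one filer snapshot would capture."""
    groups = defaultdict(list)
    for path in vmx_paths:
        ds = datastore_of(path)
        if ds not in datastore_map:
            # Fail loudly instead of silently skipping a guest nobody backs up.
            raise KeyError(f"no filer/volume mapping for datastore {ds!r}")
        groups[datastore_map[ds]].append(path)
    return dict(groups)


if __name__ == "__main__":
    sample = (
        "/vmfs/volumes/nfs_win01/web01/web01.vmx\n"
        "/vmfs/volumes/nfs_nix01/db01/db01.vmx\n"
        "/vmfs/volumes/nfs_win01/dc01/dc01.vmx\n"
    )
    for (filer, volume), guests in group_by_volume(parse_vmx_list(sample)).items():
        print(f"{filer}:{volume} -> {guests}")
```

Raising on an unmapped datastore is a deliberate choice: a new datastore that nobody added to the table should stop the run, not quietly drop guests out of the backup.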

Then, during a discussion, someone pointed out that we don't want to just turn a "dumb script" loose on the infrastructure; we need error checking. What should we do if a guest doesn't get properly suspended? Abort the operation? Snapshot anyway? And what if a guest doesn't come back after being suspended? Then what do we do? Reset it?
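The questions above amount to an error-handling policy. A minimal Python sketch of one such policy, assuming the actual suspend, resume, reset, and filer-snapshot operations are wrapped in hypothetical caller-supplied functions that return True on success (for example, thin wrappers around vmware-cmd and the filer); `OnSuspendFailure` and `run_backup` are illustrative names, not part of the NetApp scripts.

```python
from enum import Enum


class OnSuspendFailure(Enum):
    ABORT = "abort"                      # skip the filer snapshot, resume everything
    SNAPSHOT_ANYWAY = "snapshot_anyway"  # take it; failed guests are only crash-consistent


def run_backup(guests, suspend, resume, reset, take_snapshot,
               policy=OnSuspendFailure.ABORT, resume_attempts=2):
    """Suspend guests, take one volume snapshot, then bring every guest back.

    suspend/resume/reset take a guest and return True on success;
    take_snapshot takes no arguments and returns True on success.
    All four are hypothetical hooks supplied by the caller.
    """
    suspended, suspend_failed = [], []
    for guest in guests:
        (suspended if suspend(guest) else suspend_failed).append(guest)

    snapshot_taken = False
    was_reset, unrecovered = [], []
    try:
        if not suspend_failed or policy is OnSuspendFailure.SNAPSHOT_ANYWAY:
            snapshot_taken = bool(take_snapshot())
    finally:
        # Runs even if take_snapshot() raises, so no guest is left suspended.
        for guest in suspended:
            # any() stops at the first successful resume attempt.
            if any(resume(guest) for _ in range(resume_attempts)):
                continue
            # Last resort: reset. An administrator should review both lists.
            (was_reset if reset(guest) else unrecovered).append(guest)

    return {
        "snapshot_taken": snapshot_taken,
        "suspend_failed": suspend_failed,
        "reset": was_reset,
        "unrecovered": unrecovered,
    }
```

The returned report is what a wrapper would mail or log; anything in "suspend_failed", "reset", or "unrecovered" is a signal for a human rather than something the script should paper over.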

So there are obviously some issues that I don't think the NetApp scripts address sufficiently. Instead of beating my brains into pulp trying to figure out how to make the script a little more "safe" to use, I thought I would come here, to the place the VMware evangelists hang out, and ask them. Who better to ask, right?

Since this is a long post and some people have very little patience, here are some CliffsNotes!

-NetApp NFS datastores

-want to use NetApp snapshots

-have the scripts from NetApp, but have concerns

-no "sanity checking"

-guest location is "dynamic" and the script is "static"

-HALP HALP!!!

Thanks, pals

kjb007 · 08-22-2008

First, since you're talking about the storage side, remember that your storage is centralized on the NetApp, so it doesn't matter which host a VM is actually running on. Second, the script takes a snapshot of all VMs on the host it runs on, so you would need to run it on every host to get consistent snapshots of all your VMs. Once all your VMs have VMware snapshots, you snapshot the storage, and you have a quiesced snapshot of all your VMs that can be recovered. The script then removes the VMware snapshots, so you're good to go.
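The sequence described above (VMware snapshot of each VM, one filer snapshot, then remove the VMware snapshots) can be sketched in Python as an ordered list of commands for one volume. The command syntax in the docstring is an assumption based on ESX 3.x `vmware-cmd` and Data ONTAP 7-mode `snap create`, not copied from TR-3428, so verify it against the documentation for your versions; `backup_plan` and `run` are hypothetical helpers.

```python
import shlex
import subprocess


def backup_plan(vmx_paths, filer, volume, snap_name):
    """Return the ordered commands for one volume, as argv lists.

    Assumed syntax (check against your ESX and Data ONTAP documentation):
      vmware-cmd <vmx> createsnapshot <name> <description> <quiesce> <memory>
      vmware-cmd <vmx> removesnapshots
      rsh <filer> snap create <volume> <snapshot>
    """
    plan = []
    # 1. VMware snapshot of each guest (quiesce=1, memory=0), so the base
    #    disks stop changing while the filer snapshot is taken.
    for vmx in vmx_paths:
        plan.append(["vmware-cmd", vmx, "createsnapshot", snap_name,
                     "pre-filer snapshot", "1", "0"])
    # 2. One filer snapshot captures the whole volume.
    plan.append(["rsh", filer, "snap", "create", volume, snap_name])
    # 3. Drop the VMware snapshots; the filer copy keeps the quiesced state.
    for vmx in vmx_paths:
        plan.append(["vmware-cmd", vmx, "removesnapshots"])
    return plan


def run(plan, dry_run=True):
    """Print each command (dry run) or execute it, stopping at the first failure."""
    for argv in plan:
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)
```

Two caveats worth knowing: `removesnapshots` removes all of a VM's snapshots, not only the one created here, and `run` stops at the first failing command without cleaning up VMware snapshots already taken, so a real script needs the kind of error handling discussed earlier in the thread.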

You may want to talk to your rep about SnapManager for VI. It integrates NetApp snapshots with VM snapshots for easy recovery, so you don't have to do the manual scripting, and it integrates directly with VirtualCenter (VC).

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB

bjaming · 08-22-2008

Actually, it is very important which ESX host the guest is registered on, because that's the only ESX host that can place the guest in suspended mode. I am familiar with NetApp's SnapManager product line and currently use it for SQL and Exchange deployments, and unless I'm mistaken it's primarily focused on SAN environments (it leverages SnapDrive, etc., over iSCSI or FCP). Since my deployment uses neither, I doubt there is much value there.
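Because only the owning host can suspend a guest, the script has to rediscover ownership on every run. A minimal Python sketch, assuming a per-host inventory has already been collected just before the backup (for example by running `vmware-cmd -l` on each host; how you collect it is up to you). `owner_map` and `missing_guests` are hypothetical helpers.

```python
def owner_map(inventories):
    """Map each .vmx path to the single ESX host it is registered on.

    inventories: {host: [vmx paths registered on that host]}.
    Returns (owners, conflicts). Guests seen on more than one host go into
    conflicts and are left out of owners, so the script never acts on an
    ambiguous registration.
    """
    seen = {}
    for host, paths in inventories.items():
        for vmx in paths:
            seen.setdefault(vmx, set()).add(host)
    owners = {vmx: next(iter(hosts)) for vmx, hosts in seen.items() if len(hosts) == 1}
    conflicts = {vmx: hosts for vmx, hosts in seen.items() if len(hosts) > 1}
    return owners, conflicts


def missing_guests(expected, owners, conflicts):
    """Guests the schedule expects but that no host has registered right now."""
    return [vmx for vmx in expected if vmx not in owners and vmx not in conflicts]
```

Building this map at the start of each run, rather than storing it in the script, is what keeps the backup working after VMotion, DRS, or HA moves a guest from ESX01 to ESX02.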

As for snapping all the guests within a given volume, you're partially correct: the guests will live in shared volumes. However, I may have multiple volumes for "like" guests, for example one for Windows guests and others for *nix guests. Consider the following:

An ESX host has several guests registered, some Windows and some Linux. To leverage A-SIS (deduplication) to its full potential, and for performance and other reasons (updates, security policy, etc.), I have grouped them into different volumes. I decide to implement one snapshot schedule for one of those volumes and an entirely different schedule for the other. During the snapshot process I could end up with guests being suspended but not snapshotted. For example, it's time to snapshot the Windows guests, but the ESX server also has Linux guests registered, and because of a scripting error the Linux guests are suspended while the Windows volume is backed up via a snapshot.
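One way to avoid suspending the wrong guests is to pick targets by volume first and only then group them by owning host. A minimal Python sketch, assuming an ownership map (.vmx path to current host) built at run time and a hypothetical datastore-to-(filer, volume) table; it also assumes .vmx paths of the form /vmfs/volumes/<datastore>/... . `targets_for_volume` is an illustrative name.

```python
from collections import defaultdict


def targets_for_volume(owners, datastore_map, filer, volume):
    """Return {host: [vmx, ...]} for the guests stored on (filer, volume).

    owners:        {vmx path: ESX host it is registered on right now}
    datastore_map: {datastore label: (filer, volume)}  (hypothetical table)
    Guests on any other volume are never returned, so a Windows-volume run
    cannot suspend Linux guests that happen to share the same host.
    """
    by_host = defaultdict(list)
    for vmx, host in sorted(owners.items()):
        datastore = vmx.split("/")[3]  # assumes /vmfs/volumes/<datastore>/...
        if datastore_map.get(datastore) == (filer, volume):
            by_host[host].append(vmx)
    return dict(by_host)
```

Each host in the result then receives only its own list of guests to suspend, which matches the constraint that ESX01 cannot suspend a guest registered on ESX02.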

And once those guests have migrated to a new ESX host, the script no longer works, because I can't suspend them from the original host: I can't suspend a guest registered on ESX02 from ESX01. I hope I'm explaining this correctly and it makes sense.

I very much want to NOT limit the dynamic nature of ESX, and to ensure that when a guest moves from one server to another via VMotion, the backup process continues uninterrupted and doesn't require manual administrative intervention to "work".

Hence the requirement for a modicum of intelligence in the scripts used to back up the guests.

Centralized storage doesn't necessarily mean everything is in one volume.

bjaming · 08-22-2008

I should really use spell check or something lol

kjb007 · 08-22-2008

SnapManager supports FC, iSCSI, and NFS. It looks at storage from the storage side and does not care how it is presented. It ties into VC and snaps the datastore; whether that datastore is VMFS or NFS does not matter. As far as NetApp is concerned, a datastore is a datastore, and again, because it is on the NetApp, the array can snap it.

As for running snapshots on Linux vs. Windows, this is where naming conventions come into play. I assume you can differentiate between Windows and Linux guests with your naming standards. Again, the script generates a generic listing of all VMs running on a host and builds the list of VMs to snapshot from it. You don't have to suspend a VM to take a snapshot; if you do, that is your choice. Running a general snapshot, especially the way the script does it, does not suspend the VM; it just takes a snapshot. Do not confuse the suspend state with the snapshot action.
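A minimal Python sketch of the naming-convention approach, assuming a hypothetical standard in which every guest name begins with an OS tag such as "win-" or "lnx-" and the .vmx file is named after the guest. The pattern and the helper names are illustrative; substitute your own convention.

```python
import posixpath
import re

# Hypothetical naming standard: every guest name starts with an OS tag,
# e.g. "win-web01" or "lnx-db01". Adjust the pattern to your own convention.
NAME_TAG = re.compile(r"^(win|lnx)-", re.IGNORECASE)


def os_tag(vmx_path):
    """Return 'win', 'lnx', or None for a path like /vmfs/volumes/ds/win-web01/win-web01.vmx."""
    name = posixpath.splitext(posixpath.basename(vmx_path))[0]
    match = NAME_TAG.match(name)
    return match.group(1).lower() if match else None


def select_by_tag(vmx_paths, tag):
    """Split guests into (selected, untagged); untagged guests need a human decision."""
    selected, untagged = [], []
    for path in vmx_paths:
        found = os_tag(path)
        if found is None:
            untagged.append(path)
        elif found == tag:
            selected.append(path)
    return selected, untagged
```

Returning untagged guests separately, instead of lumping them into one group, makes a naming slip visible rather than letting a guest be suspended, or skipped, by accident.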

Whether or not a VM has a VMware snapshot when you snapshot your datastore, it is the datastore volume(s) on the NetApp that get snapshotted, not an individual VM.

Good luck,

-KjB

bjaming · 08-23-2008

SnapManager is great, but when you have to deal with a management team that trips over dollars to save pennies...

So I'm obviously looking for something else, and the most elegant solution I can find is scripting. As for "it doesn't matter if the guest is running", at best those are crash-consistent backups. I would be remiss in my duties if I told the "superiors" that we're golden without being sufficiently confident in the solution.

In addition, the notion that the script doesn't pause or "suspend" a VM guest, and that only the storage side of the equation needs to be considered, is dangerous and wrong. I'm sorry, dude, but you are totally off base in saying so. The only way to get a "verifiable" snap of a VM guest is to perform a cold snap: shut down and quiesce the guest, then take the snapshot.

I don't want to sound ungrateful for your input, but seriously, man, I'm looking for real solutions, not opinions. Your answers are not helpful at all.
