VMware Cloud Community
vmproteau
Enthusiast

SAN replication only for VM Disaster Recovery

I am part of talks for a DR site that will only have SAN replicated datastores. So we'll have an essentially empty ESXi environment with datastore LUNs replicating. In the event of a DR, the plan would be to start importing the VMs into the DR ESX environment. I have never seen strict SAN replication offered up as a DR solution. Ignore the logistics around IP, naming, etc. I want to focus on the consistency of VMs brought up that were replicated like this.

  1. My assumption is the VMs would likely come up in a crash-consistent state. Is that accurate, or could I not confidently say the VM would be healthy at all?
  2. If the VM does come up, I assume integrity of any application or DB data on those VMs would be suspect.

Like I said, I've never contemplated this scenario. Most solutions I've seen use some application layer to maintain consistency at different intervals. Just curious if anyone has ever had experience with this kind of recovery.

Accepted Solutions
grasshopper
Virtuoso

vmproteau wrote:

Just curious if anyone has ever had experience with this kind of recovery.

Hi vmproteau,

This works perfectly. I've done thousands of these VM moves in a crash-consistent fashion with a 100% success rate. The portability we all talk about is real. This has been working for me just fine since 2005, when I first started using EMC's SRDF. I've also had tremendous success with NetApp's SnapMirror. If you have enterprise storage of any kind, you should be fine. Also, modern guest operating systems are extremely resilient to crash-consistent power-on (e.g. 2008 R2 is a dream come true). Additionally, I have never had a problem with NT 4.0, 2000 or 2003 in my datacenter migrations or DR exercises.

Nowadays, most DBAs are using something like SQL LiteSpeed or Idera to create local backups on additional VMDKs attached to the VM. If a single table needs to be recovered, for example, they can get that granular coverage using such products. Again, I have never had a problem with SQL coming online following a crash-consistent power-on (e.g. a DR test, DC move, or HA event). Of course YMMV and you should test for your specific environment.

As for RPO, it is up to you to determine how often the array replicates. More often = more money, of course. You can also get more granular with the SQL backup points. For many, one backup per day is fine. Others are taking Idera SQL backups every 15 minutes, for example (a bit overkill!).
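
To put rough numbers on that trade-off, here is a minimal sketch. It assumes asynchronous, interval-based replication where the worst-case loss is about one replication interval plus the time a cycle takes to land at the DR site. The function name and all figures are hypothetical and not taken from any array vendor, so treat the result as a planning estimate only.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta, transfer_time: timedelta) -> timedelta:
    """Rough upper bound on data loss for interval-based async replication.

    Assumption: a failure just before a cycle finishes landing at DR loses
    everything written since the previous completed cycle started.
    """
    return interval + transfer_time


# Hypothetical figures: the array replicates every 15 minutes, a cycle takes 5.
array_rpo = worst_case_data_loss(timedelta(minutes=15), timedelta(minutes=5))

# Hypothetical daily SQL backup stored on a replicated vmdk: if the database
# itself cannot be recovered, you fall back to the last backup in the replica.
backup_rpo = timedelta(hours=24) + array_rpo

print(f"array-level worst case: {array_rpo}")
print(f"fallback to last daily backup: {backup_rpo}")
```

The second figure is the one to discuss with the DBAs: the array interval bounds what a crash-consistent power-on can lose, while the backup interval bounds what a restore can lose.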

On to RTO... The greatest risk to your RTO is lack of organization and preparation (update the run books and document the scripts!). Also, having good PMs to manage the application mapping and interdependencies is critical. You'd be surprised how many application tests fail due to hosts files, DNS, or the startup order of the VMs (ideally you should bring up AD/DNS, then SQL, then App, then Web). Practice makes perfect. PowerCLI is your friend here.
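
The startup order is easy to capture in a run book as data. The sketch below is in Python rather than PowerCLI and only computes the power-on waves; the tier labels, VM names and the `boot_waves` helper are placeholders for illustration, and the actual power-on step is left to whatever tooling you use.

```python
from itertools import groupby

# Tier order from the post above: AD/DNS first, then SQL, then App, then Web.
TIER_ORDER = ["ad-dns", "sql", "app", "web"]


def boot_waves(vm_tiers: dict[str, str]) -> list[list[str]]:
    """Group VM names into power-on waves, one wave per tier, in tier order."""
    unknown = sorted(name for name, tier in vm_tiers.items() if tier not in TIER_ORDER)
    if unknown:
        raise ValueError(f"VMs with no known tier: {unknown}")
    # Sort by tier position, then by name so the output is deterministic.
    ranked = sorted(vm_tiers.items(), key=lambda item: (TIER_ORDER.index(item[1]), item[0]))
    return [[name for name, _ in group]
            for _, group in groupby(ranked, key=lambda item: item[1])]


# Hypothetical inventory; names and tiers are placeholders.
inventory = {"web01": "web", "sql01": "sql", "dc01": "ad-dns", "app01": "app", "dc02": "ad-dns"}
waves = boot_waves(inventory)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Rejecting VMs with no tier is deliberate: a VM nobody classified is exactly the kind of gap that shows up as a failed application test.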

Unless the plan is to re-IP everything, this DR location should be built as totally isolated, with dedicated physical firewalls (or lots of sniffer sessions and excellent ACLs). If you choose SRM (as you should!), much of that risk is averted since you can test in the bubble. Many companies slack on proper isolation and end up hurting prod while testing. Don't do that. Physical Citrix is especially vulnerable to this and is often required for DR, since many apps are published exclusively via Citrix. Often the Citrix server will end up multi-homed (so the company doesn't have to pay for additional SQL and Citrix licensing servers, etc.), straddling the prod and isolated environments. By default the Citrix user gets the same routing as the underlying Citrix server, so the risk is that they inadvertently connect to prod. You must study this well.

Besides Citrix, the other considerations are ensuring proper VPN and other connection mechanisms to the DR site (these will be used for isolated app testing and for actual DR). You may need multiple new concentrators depending on business and network requirements. This, along with time sync, often gets saved for the end, but both are important to your success and should be reviewed ASAP. Also, keep the volumes you will replicate clean (no orphaned straggler VMDKs, etc.). Use RVTools or vHealth check scripts to stay vigilant on this. What gets replicated to the other side should be clean: only what you need. Every vmx that's there should be the one that gets registered.
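
The "every vmx should be a registered one" rule boils down to a set comparison. A minimal sketch, assuming you can export two lists of .vmx paths (what is on the replicated datastores, and what is registered in inventory) from RVTools or your own scripts; the function name and the sample paths are made up for illustration.

```python
def find_stragglers(vmx_on_datastore: set[str], vmx_registered: set[str]) -> dict[str, list[str]]:
    """Compare .vmx paths found on the replicated LUNs with those in inventory."""
    return {
        # On disk but not registered: candidates for cleanup before replication.
        "unregistered": sorted(vmx_on_datastore - vmx_registered),
        # Registered but not on the replicated LUNs: would not arrive at DR.
        "not_replicated": sorted(vmx_registered - vmx_on_datastore),
    }


# Hypothetical exports; datastore and VM names are placeholders.
on_disk = {
    "[ds01] web01/web01.vmx",
    "[ds01] web01_old/web01.vmx",
    "[ds01] sql01/sql01.vmx",
}
registered = {
    "[ds01] web01/web01.vmx",
    "[ds01] sql01/sql01.vmx",
    "[ds02] app01/app01.vmx",
}
report = find_stragglers(on_disk, registered)
for kind, paths in report.items():
    print(kind, paths)
```

Running a check like this on a schedule keeps the cleanup from being left until the week before a DR test.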

Anyway, with array replication this project should be a slam dunk.  Don't worry so much about the crash-consistent power-on of your guests.  It just works.

9 Replies
depping
Leadership

vmproteau wrote:

  1. My assumption is the VMs would likely come up in a crash consistent state. Is that accurate or could I not confidently say the VM would be healthy at all.
    1. Correct, crash consistent state
  2. If the VM does come up, I assume integrity of any application or DB data on those VMs would be suspect.
    1. Correct, but most environments will be able to recover from this

depping
Leadership

PS: there are thousands of environments using the mechanism you described above. It is a perfectly viable option for environments where resilience and recoverability are not built into the application layer.

vmproteau
Enthusiast

Appreciate the response Duncan. So the only question will be what RPO/RTO are realistic with this type of DR methodology.

RTO: Given the resiliency of most guest operating systems, I think I could be comfortable with more aggressive RTO requirements, at least for getting the VM powered up and on the network. Anything beyond that, up to application recovery, is a more detailed, case-by-case calculus.

RPO: To me this is more of a concern. I suppose this depends on the configuration and resiliency of the application. I don't think I'd be comfortable communicating strong confidence in DB consistency once a VM was back up. I'd be relatively confident they could recover, but to what point? With the DR methodology described and, say, a SQL VM, I couldn't rule out that a DB restore from a previous backup would be required, could I? I'll be speaking with my DBA, but instinctively it seems I could only commit to an RPO for the DB as of the most recent backup.

vmproteau
Enthusiast

Apologize for the delayed response grasshopper. Appreciate the detailed reply. Exactly what I was looking for and in line with my expectations of this type of solution. Just wanted to make sure I was setting expectations correctly.

vmproteau
Enthusiast

I had another question related to this process. We have a team that uses MAC addresses for naming purposes. After importing the VMs into the destination vCenter and bringing them online, will the source MAC be maintained, or will a new MAC be generated?

grasshopper
Virtuoso

Hi vmproteau,

The virtual MAC is only guaranteed to be unique within a single vCenter.  If you are bringing up VMs on a new vCenter / VCDB they will likely get new MACs.  Also, power on order could have an impact as well since vCenter only guarantees unique dynamically assigned addresses for powered on VMs.  If a VM is offline, then powers up later in the site bring-up, it could very possibly get a new MAC generated.  This could happen even in your source datacenter without moving VMs at all.

One option is to use static MAC addresses. Most folks try to avoid this because of the complexity of managing them; it is almost never done unless an archaic application is licensed by MAC address. If you choose to go with static MACs, I recommend a standard that includes a custom Annotation (now known as 'tags') holding a key-value pair (e.g. StaticMAC, then the MAC address). This ensures that the desired MAC is known and can be tracked for accuracy. The information is kept in the vCenter database. Human intervention (or automation) would be required to set a VM's MAC back to the documented MAC held in the tags, but it could be done. [some nice PowerCLI tag write-ups here and here]
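
The tracking part of that standard can be automated with a simple drift check. This is a minimal sketch in Python, assuming you have already exported two mappings: the documented MAC per VM (from the StaticMAC tag or annotation) and the MAC each VM currently has. The helper names and sample values are hypothetical, and nothing here talks to vCenter.

```python
import re

MAC_PATTERN = re.compile(r"^[0-9a-f]{2}(:[0-9a-f]{2}){5}$")


def normalize_mac(mac: str) -> str:
    """Lower-case a MAC and use ':' separators so comparisons are reliable."""
    cleaned = mac.strip().lower().replace("-", ":")
    if not MAC_PATTERN.match(cleaned):
        raise ValueError(f"not a MAC address: {mac!r}")
    return cleaned


def mac_drift(documented: dict[str, str], actual: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {vm: (documented, actual)} for VMs whose MAC no longer matches."""
    drift = {}
    for vm, wanted in documented.items():
        current = actual.get(vm)
        if current is None or normalize_mac(current) != normalize_mac(wanted):
            drift[vm] = (wanted, current or "missing")
    return drift


# Hypothetical exports: documented values from the tag, actual values from inventory.
documented = {"lic01": "00:50:56:00:10:01", "lic02": "00-50-56-00-10-02"}
actual = {"lic01": "00:50:56:00:10:01", "lic02": "00:50:56:9A:3B:7C"}
drifted = mac_drift(documented, actual)
print(drifted)
```

Anything the check reports is a VM where someone (or a script) needs to set the MAC back to the documented value.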

On a side note, keep in mind that the common best practice is to keep the vSphere display name, the guest operating system hostname, and the underlying VM files (VMDKs, vmx, etc.) all using the same name. If your machines are dynamically changing their GOS names, ensure that the DisplayName reflects this as well. As a standard, Storage vMotion the VM if the DisplayName changes. This will save valuable time during troubleshooting when accuracy counts (e.g. vmkfstools activities against disks, or other recovery activities). Note that this requires a suitable vCenter version and an advanced config setting in vCenter to ensure that svMotion actually does the renaming.

I'm not totally clear on that business unit's requirements for the renames, but hopefully some of the above is helpful in considering the elements involved in creating a solid standard for this case.  Please consult the "vSphere Networking > MAC Address Management" section of the vSphere Documentation Center as the canonical source [don't refer to KBs] for any MAC address customizations.

Best of luck and let us know if anything else is needed.

vmproteau
Enthusiast

Thanks grasshopper. This is what I had expected, but I wanted some independent validation since I hadn't reviewed it in a while. As you said, I generally avoid relying on guest MAC addresses for anything in the virtual environment, and I'm not 100% clear on how this particular group is using them here. Also, yes, we monitor and maintain consistency for display name, guest OS name, and VM file names. As always, I appreciate the detailed response.

vmproteau
Enthusiast

I had opened an SR as well as posting here, and I just received VMware's take, which differs from yours. My instinctive thoughts matched your own; however, the VMware engineer states that the MACs would not change. This was an email reply, so I haven't spoken with him yet.

Under no circumstances does the MAC address change during the lifespan of a VM. The MAC address is stored in the .vmx file, and the .vmx file is the identifier/config file for this VM. Even if you remove the VM from inventory, move the files somewhere else and then re-import the VM from the .vmx file, you will not change the MAC. The only time it will change is if you have cloned the VM (creating a new .vmx file) or if, when importing the VM, you were prompted as to whether you had copied or moved it and chose the 'copied' option (if you say you copied a VM, it assumes this is a new .vmx file and so sees it as a new machine).
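
One way to check this for a given VM during a DR test is to read the MAC recorded in its .vmx before and after registering it at the DR site. Below is a minimal Python sketch that parses .vmx text. It assumes the usual `key = "value"` layout and the `ethernetN.addressType`, `ethernetN.generatedAddress` and `ethernetN.address` keys; verify those against your own files. The sample fragment and the helper names are made up for illustration.

```python
import re

# Matches lines of the form: key = "value"
VMX_LINE = re.compile(r'^\s*([^#\s=]+)\s*=\s*"(.*)"\s*$')


def parse_vmx(text: str) -> dict[str, str]:
    """Parse key = "value" lines from .vmx text into a dict (keys lower-cased)."""
    settings = {}
    for line in text.splitlines():
        match = VMX_LINE.match(line)
        if match:
            settings[match.group(1).lower()] = match.group(2)
    return settings


def nic_mac(settings: dict[str, str], nic: str = "ethernet0") -> tuple[str, str]:
    """Return (address type, MAC) for one virtual NIC."""
    address_type = settings.get(f"{nic}.addresstype", "unknown")
    # Assumption: static MACs are stored under .address, others under .generatedAddress.
    key = "address" if address_type == "static" else "generatedaddress"
    return address_type, settings.get(f"{nic}.{key}", "")


# Hypothetical .vmx fragment for illustration only.
sample = '''
displayName = "sql01"
ethernet0.addressType = "vpx"
ethernet0.generatedAddress = "00:50:56:9a:3b:7c"
'''
print(nic_mac(parse_vmx(sample)))
```

Capturing this value at the source site and again after power-on at DR shows which of the two answers in this thread holds in your environment.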