VMware

ronzo

ronzo's Profile

  • Name: ron mckelvey 
  • Email: rmckelvey@p2v.net
  • Member Since: Nov 11, 2003
  • Last Logged In: Sep 9, 2008 11:04 PM
  • Status Level: Hot Shot (546 points)
  • Location: NJ
  • Occupation: Linux/VMware Consultant
  • Homepage: http://www.esxpress.com
  • Biography: I've been using VMware since late 2000, and we started using GSX in production in 2001. We've been doing hot backups of ESX on our Prod hosts since Dec 2003. We've done a number of DRPs with clients using ESX and hot backups, always 100% successful. Gotta luv VMware.

ronzo's Latest Content

(Yippie, first BLOG post of the year 2008)

We've been testing the GA release of ESX 3.5 almost continuously since it came out and except for our GUI problem, the rest is looking good! Actually it seems a little more peppy. The client over VPN is faster and more responsive, at least for me.

Here is the link explaining the GUI problem:
http://communities.vmware.com/thread/117115

Besides the problem with serial ports as a pipe, the rest was good. We made a change to put the serial pipes in another path that is allowed by /etc/vmware/configrules, and disabled the GUI for now, and it all works.

Because most of what we do is actually in a VM (the VBA), we had very little to change. Even in the actions we do in the console of the host (creating the VBAs and such), nothing needed to be changed (except for the serial pipe). This makes esXpress pretty flexible, as it will run on any ESX version and patch level. So far, VMware API changes have not affected us.

In our version 2, we ran in the console space, and when we moved to ESX 3.x we moved the backup itself into a VBA. The whole VBA idea was a good one; it was the next evolution of esXpress.

Our next version under development right now will take this to a new level.

People always ask us about the VBAs, why are we the only one doing backups like this?

How did we come up with this idea?

Even though the VBA was my idea for a change (most of the other good ideas came from Caleb Shay), I cannot take total credit for it. If it were not for a conversation here on the VMTN (http://communities.vmware.com/message/422323), we would not have been pushed to come up with a better solution than running in the console, as we were doing at the time.

I remember the day, it was rainy out, and we (Ken and Caleb too) were at a customer site, and after reading that post we had to go out for a smoke and a meeting. We were sitting in my truck (Ford F250 Harley Edition) and discussing... I was like "Let's take Jason at face value and assume he is correct. What can we do?". After about 10 minutes of talking (quizzing Caleb, our Linux expert) I came up with running it all in a tiny Linux VM, which we coined as a VBA.

Thank you Jason for pushing us into this solution using VBAs!
And we just passed the 1,000 customer mark, and still going. If anything, our development is moving faster.

Since testing has been good (for the most part) with ESX 3.5, we finally released our esXpress version 3.1. GA was released today (Jan 1, 2008).

Thank you to everyone who has helped with the testing of esXpress, we really do appreciate all the feedback we get.

thanks
ron


I've been testing on a Host (HP585, 4 CPU, Dual Core) that has about 100 VMs at any given time.
Right now there are 83 on the host (not counting the esXpress VBAs).

The customer uses this Host for mostly development, and at any given time, 30 - 50 of the VMs are running while the rest are powered off.

On Sat night we ran FULLs for the first time here, and backed up 63 of the 83 VMs. The VMs we did not back up are ones with existing snapshots, which the customer is not ready to start backing up yet.

2007-12-08 10:49:29o OK 'xxxxxx' - 2/2/2 disks, (100%) 37g/37g/37g (8.3% Data), Act: 1h:02m:35s 10mb/s (35gb/hr) vs Vrt: 30m:52s 20mb/s (70gb/hr), sent 1.60g VM 65/83 OFF -
============================================================================================================================================
2007-12-08 10:49:35o ALL TOTAL: 63 vms 103/117/117 disks, (100%) 2282g/2282g/2534g (45% Data), Act: 14h:48m:24s 43mb/s (151gb/hr) vs Vrt: 37h:39m:08s 17mb/s (59gb/hr), sent 400.8g
2007-12-08 10:52:51o ====Completed ONCE A DAY BACKUP - 07/12/07

The ALL TOTAL line gives the complete stats for the backup run:

2007-12-08 10:49:35o ALL TOTAL: 63 vms 103/117/117 disks - This means we attempted to back up 63 VMs. We successfully backed up 103 VMDKs, out of 117 attempted VMDKs, out of a total of 117 VMDKs. When we started this backup run we had some initial failures due to network problems, but at the end the RETRY got them all.

(100%) 2282g/2282g/2534g (45% Data) - We successfully backed up 2282 GB out of a total of 2534 GB, and of those VMDKs, 45% of the 2282 GB was data, which means that there is 55% whitespace in those VMDKs.

Act: 14h:48m:24s 43mb/s (151gb/hr) - The actual backup time was 14 hours 48 minutes. This is from when the initial backup started until the run was complete. This includes the time to add and remove snapshots, along with all other processing. Averaging 43 mb/sec is pretty good.

vs Vrt: 37h:39m:08s 17mb/s (59gb/hr) - The virtual time is what it sounds like, the amount of time taken in the virtual VBA. This does not include snapshot time, just the pure backup time. If you added up the time from all the VBAs, this is the total. So if you were to run single threaded (one VBA or one backup in the Console) this is how long it would probably take.

sent 400.8g - And finally, how much was sent to the backup targets in total. For this FULL run, we backed up 2282 GB, but only sent 400 GB to the Backup Targets. Pretty good compression.
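
To make the field breakdown above concrete, here is a minimal Python sketch that parses an ALL TOTAL line and recomputes the derived figures. It is not esXpress code. The regular expression is written only against the log excerpt shown in this post, and two things are assumptions: that 1 GB is 1024 MB, and that the mb/s figures are computed from the middle (attempted) size field, which is what reproduces the logged 43 and 17 mb/s when rounded down.

```python
import re

# Pattern written against the ALL TOTAL line shown in this post (an assumption,
# not a documented esXpress log format).
TOTAL_RE = re.compile(
    r"ALL TOTAL: (?P<vms>\d+) vms "
    r"(?P<disks_ok>\d+)/(?P<disks_tried>\d+)/(?P<disks_total>\d+) disks, "
    r"\((?P<pct>[\d.]+)%\) "
    r"(?P<gb_done>[\d.]+)g/(?P<gb_tried>[\d.]+)g/(?P<gb_total>[\d.]+)g "
    r"\((?P<data_pct>[\d.]+)% Data\), "
    r"Act: (?P<act>\S+) \d+mb/s \(\d+gb/hr\) "
    r"vs Vrt: (?P<vrt>\S+) \d+mb/s \(\d+gb/hr\), "
    r"sent (?P<sent_gb>[\d.]+)g"
)


def to_seconds(hms):
    """Convert a duration such as '14h:48m:24s' to seconds."""
    h, m, s = (int(part[:-1]) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def summarize(line):
    """Parse one ALL TOTAL line and recompute throughput and speedup."""
    m = TOTAL_RE.search(line)
    if m is None:
        raise ValueError("not an ALL TOTAL line")
    gb_tried = float(m["gb_tried"])
    act = to_seconds(m["act"])
    vrt = to_seconds(m["vrt"])
    return {
        "vms": int(m["vms"]),
        "disks_ok": int(m["disks_ok"]),
        "act_seconds": act,
        "vrt_seconds": vrt,
        # Assumption: throughput is based on the attempted VMDK size.
        "act_mb_per_s": gb_tried * 1024 / act,
        "vrt_mb_per_s": gb_tried * 1024 / vrt,
        # How much faster the parallel run was than the summed VBA time.
        "parallel_speedup": vrt / act,
        # Fraction of the backed-up size that actually went over the wire.
        "sent_fraction": float(m["sent_gb"]) / float(m["gb_done"]),
    }


FULL_LINE = (
    "2007-12-08 10:49:35o ALL TOTAL: 63 vms 103/117/117 disks, (100%) "
    "2282g/2282g/2534g (45% Data), Act: 14h:48m:24s 43mb/s (151gb/hr) "
    "vs Vrt: 37h:39m:08s 17mb/s (59gb/hr), sent 400.8g"
)

if __name__ == "__main__":
    for key, value in summarize(FULL_LINE).items():
        print(key, value)
```

For the FULL run above, the ratio of virtual to actual time comes out at roughly 2.5, which is one way to read how much the parallel VBAs shortened the run compared with a single-threaded backup.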

After the initial pass was done, esXpress went back and tried to back up the failed VMs again:

2007-12-08 12:34:11 ALL Machine RETRY Backup Starting-----
2007-12-08 12:40:50a ERR 'yyyyyyyy' - Create Snapshot FAILED, Backup Failed VM 6/12 ON -a
2007-12-08 16:18:54a OK 'xxxxxxxxx' - 2/2/2 disks, (100%) 154g/154g/154g (86% Data), Act: 2h:28m:22s 17mb/s (59gb/hr) vs Vrt: 2h:10m:29s 20mb/s (70gb/hr), sent 46.1g VM 8/12 OFF -
2007-12-08 16:19:02a ==================================================================================================
2007-12-08 16:19:02a ALL TOTAL: 11 vms 22/22/22 disks, (100%) 536g/536g/536g (67% Data), Act: 3h:44m:51s 40mb/s (140gb/hr) vs Vrt: 11h:04m:40s 13mb/s (45gb/hr), sent 110.2g

So in the RETRY we successfully backed up 11 VMs, 22 VMDKs and 536 GB of VMDK.

In the end here, it took 14h:48m:24s plus 3h:44m:51s (about 18.5 hours in total) to back up 2818 GB on one host. Not so bad.

Now for the next night, we ran Delta backups. Of those 83 VMs, only 65 can be backed up, and we backed up 30 of them.

We SKIPPED the other 35 VMs because they are powered off and have not changed since the last backup. No sense backing them up again.

2007-12-09 03:43:11o skipped 'zzzzzzzz' - '00-zzzzzzz.vmdk' - Not Changed since 2007-08-07 16:45:57, Delta Skipped -
2007-12-09 03:43:15o skipped 'zzzzzzzz' - '01-zzzzzzz_1.vmdk' - Not Changed since 2007-08-07 16:45:57, Delta Skipped -
2007-12-09 03:43:15o SKIP 'xxxxxxx' - No Disk to backup, Skipped 2 disks VM 61/83 OFF -
==================================================================================================
2007-12-09 04:04:01o ALL TOTAL: 30 vms 51/51/51 disks, (0.5%) 6.2g/1335g/1335g (43% Data), Act: 8h:02m:51s 47mb/s (165gb/hr) vs Vrt: 18h:17m:34s 20mb/s (70gb/hr), sent 2.18g

(0.5%) 6.2g/1335g/1335g - This is a little different from a FULL backup run. Here our Percent is only 0.5%. This means out of the 1335 GB of VMDKs, we only took 6.2 GB worth of delta blocks and sent only 2.18 GB to the backup server.
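
The skip rule behind the "Delta Skipped" log lines above (VM powered off, disk not changed since the last backup) can be sketched in a few lines of Python. This is only an illustration of the rule as described in this post, not the esXpress implementation; the `Disk` type, its fields, and the function name are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Disk:
    name: str
    last_modified: datetime          # hypothetical: when the VMDK last changed
    last_backup: Optional[datetime]  # hypothetical: None if never backed up


def should_skip_delta(powered_on: bool, disk: Disk) -> bool:
    """Skip only a powered-off VM's disk that is unchanged since its last backup."""
    if powered_on or disk.last_backup is None:
        # A running VM, or a disk with no earlier backup, always gets backed up.
        return False
    return disk.last_modified <= disk.last_backup


if __name__ == "__main__":
    disk = Disk(
        name="00-zzzzzzz.vmdk",
        last_modified=datetime(2007, 8, 7, 16, 45, 57),
        last_backup=datetime(2007, 12, 8, 10, 49, 29),
    )
    print(should_skip_delta(powered_on=False, disk=disk))
```

A VM whose disks are all skipped would then be reported as a whole-VM SKIP, as in the "No Disk to backup" line.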

On this host, the overall bottleneck is Disk I/O. The Fulls averaged about 40 mb/sec and the Deltas, just a little faster, 47 mb/sec.

I can't wait for bigger hardware!


It's been a long 2 weeks, with a lot of testing. We've been doing the QA testing on esXpress v3.1 RC-8 and we are getting close to a final release of version 3.1. Version RC-8 just went up.

A little history on our releases:
2006-03, First Beta of esXpress v2
2006-05, GA of esXpress v2
2006-08, Beta of esXpress v3.0
2006-12, GA of esXpress v3.0
2007-06, Beta of esXpress v3.1
2007-12, GA of esXpress v3.1 (prediction)

Let’s talk about software development cycles.

Our version 2 of esXpress was originally released in March 2006 for public beta and GA release was May 6, 2006.

We released a few maintenance versions after that, but we have not updated the code since Dec 30, 2006, when we released esXpress 2.3-5 for ESX 2. We only released this version because we learned some things while creating version 3 of esXpress and back-ported some of the features.

In terms of software reliability, version 2 of esXpress is pretty stable. It has been almost a year since any changes have been made to the code, and we know a lot of people are still using it. You can install version 2 of esXpress onto any version of VMware ESX 2.x and it works. We want that same level of comfort with our version 3 of esXpress. We believe in solid releases that work, not the version of the day. It's pretty much impossible to call a 2 week beta cycle a real beta test.

Now back to software testing. I’ve been doing the QA for the 3.1 versions, and I’ve let a few bugs slip by. But in the past few weeks we have found and squashed a few old bugs, including a divide-by-zero error in the stats that shows up when a VM backup runs in parallel mode and all disks fail.
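
For readers wondering what this class of bug looks like, here is a generic Python sketch of guarding stats calculations against a zero denominator. It is not the esXpress code, and the function names are made up for illustration. The assumption is that when every disk fails, a count or elapsed time used as a divisor ends up as zero.

```python
def percent_ok(disks_ok: int, disks_attempted: int) -> float:
    """Percentage of attempted disks that succeeded; 0.0 if none were attempted."""
    if disks_attempted == 0:
        return 0.0
    return 100.0 * disks_ok / disks_attempted


def throughput_mb_per_s(mb_done: float, seconds: float) -> float:
    """Average rate; 0.0 when no time was spent (for example, every disk failed at once)."""
    if seconds <= 0:
        return 0.0
    return mb_done / seconds


if __name__ == "__main__":
    # The all-disks-failed case no longer raises ZeroDivisionError.
    print(percent_ok(0, 0), throughput_mb_per_s(0.0, 0.0))
```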

The best part about VMware: we created a monstrous QA test using literally 100 VMs that cover just about every scenario that esXpress might encounter. VMware even makes testing VMware third party products easier!

On my dev box I run about 8-12 VMs at any time for my testing before moving to the big box. Just to give you an idea of how much we test, I personally have run 687 backups in the past 9 days on this box. This does not even count the other testing others have done before it was turned over to me. This is just to certify the 3.1RC-8 version is good.

This RC-8 version has some minor enhancements over the RC-7 version. We have streamlined the backup process to make the 'Backup All' better. If you have 8 VBAs enabled, we now try to achieve that and get 8 of them running at a time, and keep 8 of them running for the entire backup run.

Previously, new backups were not started as quickly as they needed to be. It worked well here on my dev box with 12 VMs, but at one of our test sites it did not. When you have Hosts that are really running 100 VMs, things work a little differently than on a Host with a couple of VMs.
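
The "keep 8 running for the entire run" behavior is the classic fixed-size worker pool. The Python sketch below shows the idea using the standard library's `ThreadPoolExecutor`, where a new job starts as soon as a slot frees up. It is an analogy only: real VBAs are virtual machines, not threads, and `run_backups` and its parameters are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_backups(vm_names, slots=8, backup_one=None):
    """Run backup_one(vm) for every VM, keeping up to `slots` running at once.

    Returns the results in input order plus the peak concurrency observed.
    """
    lock = threading.Lock()
    running = 0
    peak = 0

    def default_backup(vm):
        time.sleep(0.01)  # stand-in for the real backup work
        return vm

    work = backup_one or default_backup

    def tracked(vm):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            return work(vm)
        finally:
            with lock:
                running -= 1

    # The pool hands the next VM to a worker the moment one finishes,
    # so the slots stay full until the queue runs dry.
    with ThreadPoolExecutor(max_workers=slots) as pool:
        results = list(pool.map(tracked, vm_names))
    return results, peak


if __name__ == "__main__":
    done, peak = run_backups([f"vm{i:03d}" for i in range(100)], slots=8)
    print(len(done), "backups, peak concurrency", peak)
```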

We also added some fuzzy logic to the backup ordering. When you have to back up 100 VMs (remember, this is on one host) you need to be able to automatically figure out the best order without having to manually manipulate each of the VMs. We give each VM a weight, and for each different status (such as vm_order, power, snapshot, version) we add extra weight. The more weight a VM gets, the further it sinks toward the bottom of the backup order list. (Note: this is another good feature to steal.) This way VMs that will probably be skipped will be at the bottom of the list and attempted last.
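
A minimal Python sketch of this kind of weighted ordering follows. The post names the factors but not the numbers, so the penalty values here are invented, the `VM` type is hypothetical, and the "version" factor is left out because its meaning is not spelled out above. Treat it as an illustration of the idea, not the esXpress algorithm.

```python
from dataclasses import dataclass

# Hypothetical penalty values; the real weights are not given in the post.
WEIGHT_POWERED_OFF = 10
WEIGHT_HAS_SNAPSHOT = 100


@dataclass
class VM:
    name: str
    vm_order: int = 0          # manual priority, lower runs earlier (assumed)
    powered_on: bool = True
    has_snapshot: bool = False


def weight(vm: VM) -> int:
    """Add up penalties; a heavier VM sinks toward the end of the run."""
    w = vm.vm_order
    if not vm.powered_on:
        w += WEIGHT_POWERED_OFF
    if vm.has_snapshot:
        w += WEIGHT_HAS_SNAPSHOT
    return w


def backup_order(vms):
    """Return VMs lightest first; sorted() is stable, so ties keep input order."""
    return sorted(vms, key=weight)


if __name__ == "__main__":
    vms = [
        VM("has-snapshot", has_snapshot=True),
        VM("powered-off", powered_on=False),
        VM("running"),
    ]
    print([vm.name for vm in backup_order(vms)])
```

With these made-up weights, a running VM with no snapshot goes first, and a VM that already has a snapshot (and so is likely to be skipped) goes last.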

Sometimes it seems like most of the work we do is at a level that most people will never even know exists. But that is a good thing; it means that esXpress will probably mostly work unless you have some type of environmental failure in the VMware, Storage, Network or Backup space. We've spent the last week really working the obscure bugs out of the engine that runs the VBAs. We even added another layer that has the backup try again on failure, if the failure was something immediate.

Besides running hundreds of tests on my dev server, I’ve also been running them at a customer’s location for a reality check. I’m backing up 8 hosts that represent some 400 VMs, and 12 TB of data. All this gets backed up nightly. Two of those hosts have 100 VMs each.

This customer used to be the biggest customer of those other guys (200 CPU), but not anymore. And what makes it more amusing, he gave us a list of feature requests earlier this year, and that is what drove most of the development for version 3.1. Gotta luv users who know what they want. This is just one customer among many who let us use their environments for our QA testing (in return, they get personal support). It is nice when customers trust you enough to give you root access into their systems.

Hopefully we did enough testing on version 3.1 to make it as reliable as possible, but only time will tell.
