VI 3.5 Update 3 and VC 2.5 Update 4 (but we have had this issue for some months, even when we were on Update 2 on both ESX and VC).
We have virtualized both vCenter and the vCenter database, each in its own VM, with a DRS affinity rule to keep them on the same ESX host; DRS is set to manual just to be safe. Both VMs are also on the same subnet. The vCenter database VM runs Oracle and has a 120 GB vmdk and an 80 GB vmdk. We use PHD's esXpress to create weekly backups of delta changes and monthly full backups. For those not familiar with esXpress, it uses snapshots to create a compressed tarball archive that can be transferred with FTP or SCP. We use that for DR purposes and off-site backup.
As the number of VMs managed by vCenter has grown, so, obviously, has the database. Over the last several months we have had "mysterious" disconnects between vCenter and the vCenter database on a weekly basis, which we have finally tracked down to the time when the snapshot of the database VM is committed at the end of the esXpress backup. Recent full backups have run as long as 6 hours and deltas 2 hours, and they run concurrently with other backups. We do not know the size of the snapshots being committed, but it takes anywhere from 5-10 minutes for them to commit once esXpress issues the snapshot delete.
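One way to get a number on the unknown snapshot size is to sum the delta files in the VM's directory just before esXpress issues the delete. A minimal sketch; the datastore path and VM directory name are placeholders, and it assumes standard `-delta.vmdk` naming:

```shell
#!/bin/sh
# Sum the snapshot delta files for a VM just before the snapshot delete.
# VMDIR is a placeholder; substitute your actual datastore path.
VMDIR=${1:-/vmfs/volumes/vcdb_lun/vcdb}
ls -l "$VMDIR"/*-delta.vmdk 2>/dev/null |
  awk '{ sum += $5 } END { printf "%d bytes in delta files\n", sum + 0 }'
```

Run a few times during the backup window, this would also show how fast the deltas grow while the other concurrent backups are running.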
We have seen snapshot commits cause network drops before, so there may be nothing to do but work around this. Both vCenter and the vCenter database are on their own LUNs on an FC-attached Hitachi USP-V. If we cannot think of any other tuning to help, we will manually manage the backups so that they occur during a planned maintenance window. Has anyone else seen this, or have other ideas to keep the network responding during a commit? We are going to migrate the vCenter database from RHEL 3 32-bit to RHEL 5 64-bit and shrink the OS disk, backing up only that disk; that will reduce the backup times and perhaps stop the network drops. We had been running well with the snapshots for some time before, so I assume this change in behavior is due to the size of, or volume of activity within, the vCenter database.
It's interesting to read about the issue you're having, as we have had a similar issue over the past few months.
We have a couple of larger, busy VMs (one with 12 GB RAM and the other with 4 vCPUs) that, when committing snapshots, lose network connectivity for a fair amount of time. I came across the following KB article that seemed to describe the symptoms I'm seeing:
and that patch is included in 3.5 Update 3. As we're currently running 3.5 Update 1, I did some testing with 3.5 U3 to see if it would make a difference. Initial tests didn't seem to show any improvement, but I am going to revisit the testing with 3.5 U4 just in case. As you're already on 3.5 U3, that fix should have covered you, but obviously it hasn't.
I'll let you know if Update 4 proves any better; it'd be good to hear how you get on.
I do not believe VMware ESX snapshots are designed for dealing with large data volumes and significant changes to those volumes. SANs have enough control over the underlying disk hardware to implement complex logic (e.g. copy-on-write) to rapidly create and commit snapshots, but at the level where ESX has to work independently of the underlying storage system, there is no efficient way to maintain and commit large snapshot delta files. Since you have a large VirtualCenter DB, it sounds like you have a large environment that may be using SAN storage. In that case, look at whatever SAN snapshot functionality your arrays offer for snapshotting the data drives, if you have a real need to snapshot the DB volumes rather than use other incremental DB backup methods.
A product like esXpress should mostly be used only for backing up small OS root volumes, as you indicate you are considering. The other data disks for the VM should be set as Independent and Persistent in the VM disk configuration; with those settings, no snapshot or delta file is created for those disks. Assuming the OS root volume is small with few changes during the backup process, both the backup job and the commit should be relatively quick.
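In the VI client this is the "Independent" + "Persistent" option on each disk; in the .vmx file it appears as a per-disk mode setting. A sketch of what that looks like; the scsi0:1 device number and the vmdk file name here are examples for a second, data disk:

```
# Example .vmx entries for a data disk excluded from snapshots.
# scsi0:1 is an assumed device number; match it to your data VMDK.
scsi0:1.present = "true"
scsi0:1.fileName = "vcdb_data.vmdk"
scsi0:1.mode = "independent-persistent"
```

With the mode set to independent-persistent, snapshot operations skip the disk entirely, so only the small OS disk contributes to the delta and the commit.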
Thank you for sharing your experience. I can tell you that Update 4 on vCenter does not fix the problem and will keep you informed as to whether getting all the ESX hosts to Update 4 helps. I have a case open with VMware on this and will post any findings.
Thank you and I agree with your technical point that SAN snapshots are superior to what ESX can do. That is why we typically only use esXpress for vmdk files that are 35G or so in size. It has worked well for us over the years and we are probably pushing it to have snapshots on two vmdk files that total 200G. Unless VMware has suggestions or planned fixes we will schedule this backup with snapshots during a planned maintenance window.
I definitely agree that SAN-based backups are best. We experienced issues in our development VMware environment. The VMs were stored on an older NetApp FAS3020 over 2 Gb FC, and the LUNs were also being deduplicated with A-SIS. I know, not the best scenario, but it was well worth it when it was stable. The main problem was that the VCB software would try to delete the snapshot, and the process would take so long that it would time out in VC, while in actuality it was still working. When the NetApp spiked to ~70% CPU with multiple snapshot operations going, it would lose network connectivity. This took down not only Ethernet (LAN and iSCSI) but also FC-connected servers. A few things to check:
- The IO that you are pushing to the storage system
- VMFS and VMDK alignment.
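On the alignment point: the partition start sector can be checked with `fdisk -lu` from the guest or service console. A rough sketch of the check, assuming a 64 KB stripe boundary (128 x 512-byte sectors); the start sector of 63 used as the default here is the classic misaligned value from older partitioning tools:

```shell
#!/bin/sh
# Check whether a partition's starting sector falls on a 64 KB boundary
# (128 sectors of 512 bytes). The sector number would come from the
# output of `fdisk -lu /dev/sda`; pass it as the first argument.
START=${1:-63}
if [ $((START % 128)) -eq 0 ]; then
  echo "sector $START: aligned"
else
  echo "sector $START: misaligned"
fi
```

A start sector of 63 reports misaligned, while 128 or 2048 would report aligned.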
We moved the same VMs over to our EMC storage systems, and a snapshot delete that took 3 hours on the NetApp took 4 minutes. Granted, the EMCs are not doing deduplication and are designed for speed, as well as being quad 4 Gb fibre connected.
Linux / SAN / Virtualization
Thank you for your thoughts. I do not think IO is the bottleneck at the SAN layer, since it is a high-end Hitachi USP-V with a large cache, but I will check with the SAN admins to see what the logs look like. Our ESX hosts are isolated on their own SAN ports and watched closely for performance issues. I do not think alignment is an issue either, since allocations are done through vCenter and the few I have spot-checked are all good. This particular ESX host has only a 2 Gb HBA, though, so that is something I can look into.
Part of what you're describing sounds like SOAP timeouts. From what I recall, we saw it from versions 2 through 4. Updating to VirtualCenter 2.5 Update 5 helped somewhat, but it wasn't fully resolved until we went to ESX 4/vCenter 4. You can update to vCenter 4 without upgrading the ESX servers and VMs/Tools. Hardware version 7 VMs (after the ESX 4.0 upgrade) don't seem to have as many of these types of problems as hardware version 4 VMs.
It may not be a good idea to snapshot such a large VM in production, though. Maybe snapshot it only before/after each round of updates, and back up the database daily using a backup client?
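For the daily database backup, one option is a nightly incremental with Oracle's RMAN, scheduled from cron, so no VM snapshot is involved at all. A sketch only; the cron schedule, the `oracle` OS user, and the wrapper script path are all placeholders:

```
# /etc/cron.d/vcdb-backup -- example schedule (script path is a placeholder)
# Runs a level-1 incremental RMAN backup of the vCenter DB each night.
30 1 * * * oracle /usr/local/bin/vcdb_rman_backup.sh

# vcdb_rman_backup.sh would wrap something along the lines of:
#   rman target / <<EOF
#   backup incremental level 1 database plus archivelog;
#   EOF
```

Paired with an occasional level-0 (full) RMAN backup, this covers the data without any snapshot commit touching the 200 GB of vmdk.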