We have a vSphere 4 environment with 8 ESXi hosts per cluster. In this cluster we have one 512GB datastore that holds all the linked clones and master VMs. This is of course a test environment....
We need to boot 30 VMs at a time to perform automated testing. In the past these 30 VMs would boot quickly and without issues. We recently started telling the VMs (through the VMware API) to revert to their last snapshot prior to booting. This has made the VM boot process EXTREMELY slow and is causing us issues.
So, I am guessing that reverting to the last snapshot is what's causing the issue? My questions are:
1) Is it correct that the snapshot revert is causing it?
2) What actually happens when you boot a VM and it discards the changes made while it was last powered on and just reverts?
3) Is this just killing our I/O? What else is this doing?
If this is a SAN I/O issue then I can look at balancing out our LUN load, etc., but first I need to know what's really happening under the hood to properly address this. Is there another method of meeting our ultimate goal without causing such performance problems?
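For reference, here is roughly what our automation does, as a minimal pyVmomi sketch (the vCenter host, credentials, and VM name below are hypothetical placeholders, not our real environment):

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask

# Test environment only: skip certificate validation.
context = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.local",   # hypothetical vCenter
                  user="automation", pwd="secret",
                  sslContext=context)

content = si.RetrieveContent()
vm = content.searchIndex.FindByDnsName(datacenter=None,
                                       dnsName="testvm-01",  # hypothetical VM
                                       vmSearch=True)

# Roll the VM back to its most recent snapshot, then power it on.
WaitForTask(vm.RevertToCurrentSnapshot_Task())
WaitForTask(vm.PowerOnVM_Task())

Disconnect(si)
```

Our real script does this for all 30 VMs at once.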
A simpler solution would be to set the VM disks to NON-persistent. As long as the VMs are running, data is current. As soon as a VM is powered OFF, the changes are gone, and you can boot again at the point where you want to start. So before switching to non-persistent, make SURE the VM is in the state you want.
So rather than revert, just make the changes to all the VMs and you end up with the same thing. That gives you the same result without the I/O concerns or the time it takes to revert changes.
Trying to do 30 VMs simultaneously is a HUGE hit on any file system.
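For what it's worth, a hedged pyVmomi sketch of that disk change (assumes a `vm` object located as in the earlier sketch, and that the VM is powered off before reconfiguring):

```python
from pyVmomi import vim
from pyVim.task import WaitForTask

# Flip every virtual disk on the powered-off VM to independent_nonpersistent:
# writes go to a redo log that is discarded at power-off, while a plain guest
# reboot keeps the data.
spec = vim.vm.ConfigSpec()
changes = []
for device in vm.config.hardware.device:
    if isinstance(device, vim.vm.device.VirtualDisk):
        device.backing.diskMode = "independent_nonpersistent"
        edit = vim.vm.device.VirtualDeviceSpec()
        edit.operation = vim.vm.device.VirtualDeviceSpec.Operation.edit
        edit.device = device
        changes.append(edit)

spec.deviceChange = changes
WaitForTask(vm.ReconfigVM_Task(spec=spec))
```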
The issue with this is that we have tests (and testers in non-automated scenarios) that need to install software and then reboot, etc., so we cannot make the disks non-persistent.
So what exactly happens on the storage side with I/O and files shrinking/growing when a revert to snapshot happens? Is it really doubling I/O, like I'm reading?
mujmuj wrote:
The issue with this is that we have tests (and testers in non-automated scenarios) that need to install software and then reboot, etc., so we cannot make the disks non-persistent.
So what exactly happens on the storage side with I/O and files shrinking/growing when a revert to snapshot happens? Is it really doubling I/O, like I'm reading?
I understand your testing. A reboot will NOT destroy data; the VM would need to be powered off for the data to go away. So it's still safe for testing and scripts.
Test it, see if you can even use it. You may find it's easier than snapshots.
It's not the I/O per se; the problem is that you are doing 30 concurrent transactions with a lot of I/O at the SAME time. 30 disks ALL doing heavy I/O (> 10 MB/s each) is a TON of I/O and commits. It's not the underlying SAN, it's the datastore: you will get a LOT of lag and latency doing this.
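Just to illustrate that concurrency point (not something you have to do): doing the reverts and boots in small batches instead of all 30 at once bounds how much work hits the one datastore at the same moment. The batch size here is an assumed knob, and the VM list is whatever your script already has:

```python
from pyVim.task import WaitForTask

BATCH_SIZE = 5  # assumed value; tune to what the datastore can absorb

def revert_and_boot_in_batches(vms, batch_size=BATCH_SIZE):
    """Revert and power on VMs a few at a time instead of all at once."""
    for start in range(0, len(vms), batch_size):
        batch = vms[start:start + batch_size]
        # Kick off the reverts for this batch only, wait for them to finish,
        # then power the batch on before moving to the next one.
        for task in [vm.RevertToCurrentSnapshot_Task() for vm in batch]:
            WaitForTask(task)
        for task in [vm.PowerOnVM_Task() for vm in batch]:
            WaitForTask(task)
```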
Well, there are about 128 ESXi hosts with thousands of test VMs. We unfortunately need a "one size fits all" approach to our infrastructure as much as we can. In this case non-persistent disks mean that if the VM is shut down (and/or rebooted too, I thought?) it will lose any new data or changes. In our case custom code is telling the VM when to revert to a snapshot, so it suits our needs much better than trying to make some disks non-persistent and not others. We could have testers take another snapshot once the software is installed and the VM customized so that a future shutdown is OK, but management doesn't want to go that route...
OK, well thank you for the input. What you are saying makes sense. So if we want to resolve this, it isn't our SAN causing the issue (i.e., too much I/O on the SAN), it is the datastore having too much going on at once. Really our only solution to improve things is to spread the load among additional datastores, correct?
Yes, that's what I would say. Spread your snapshot VMs across different datastores as much as possible; that may help with your I/O distribution.
Failing that, I don't see much else you can do. Given your criteria for the simpler approach and scripts, I see your point about keeping things simple and unified. I wish I had a better solution for you.
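If it helps, here is a rough pyVmomi sketch of moving a VM's disks to another datastore with RelocateVM_Task (Storage vMotion). The datastore and VM names are hypothetical, `si` comes from an existing connection, and licensing or the linked-clone layout may require doing this as a cold migration instead:

```python
from pyVmomi import vim
from pyVim.task import WaitForTask

content = si.RetrieveContent()
vm = content.searchIndex.FindByDnsName(datacenter=None,
                                       dnsName="testvm-01",  # hypothetical VM
                                       vmSearch=True)

# Pick a target datastore visible to the VM's current host (name is a placeholder).
target = next(ds for ds in vm.runtime.host.datastore
              if ds.name == "datastore-clones-02")

spec = vim.vm.RelocateSpec(datastore=target)
WaitForTask(vm.RelocateVM_Task(spec=spec))
```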