VMware Cloud Community
vmsysadmin20111
Enthusiast

Cloning operation is slow on vSAN

Hi all,

Would setting the "IOPS limit for object" rule in a VM storage policy affect cloning operations on a vSAN datastore?

The reason I'm asking is that I'm observing the following issue: on an all-flash vSAN datastore, the customer starts a cloning operation for a very large VM (>1.5 TB). The VM spills over the capacity SSD boundary, so it is automatically striped (I'm assuming; I have not had a chance to verify this yet). This cloning operation severely impacts other operations in the environment, such as VM memory snapshots (perhaps the clone operation is hitting the same disks). The cloning is also very slow; the customer is particularly upset that cloning on his all-flash vSAN is slower than on his old spinning-disk array. I'm hesitant to suggest increasing the stripe width to improve performance, since that would use even more capacity disks and impact even more VMs.

Current setup: 4 nodes (all-flash), RAID 1, FTT=1, stripe width=1

Any way to throttle the cloning operation so other objects are not impacted?

Thanks!

6 Replies
TheBobkin
Champion

Hello,

There are many unknowns at play here but here's a stab at it:

- Any object over the max component size of 255 GB gets striped automatically; however, this forced striping does not always place components as sensibly as striping applied via a rule in the Storage Policy for that Object.

So if you have an FTT=1 1.5 TB vmdk (I assume there is a boot/OS disk plus at least one data disk, but let's assume just one since you did not specify), this Object will be split into a minimum of 12 ~250 GB components (6+6 across 2 hosts). However, if this is 'force-striped' (due to Object size), it *may* clump multiple components on single capacity drives, whereas when you apply striping via a Storage Policy, vSAN will aim to spread the components across as many different disks as it can (which will likely improve performance). So basically, if you know something is going to get striped (by its size), you might as well apply a higher stripe-width policy accordingly.
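The component arithmetic above can be sketched roughly as follows (`component_layout` is just a name for this illustration; the model ignores witness components and the exact split sizes vSAN actually chooses):

```python
import math

MAX_COMPONENT_GB = 255  # vSAN auto-splits any component larger than this


def component_layout(vmdk_gb, ftt=1, stripe_width=1):
    """Rough component count for a RAID-1 vSAN object (sketch only:
    ignores witness components and vSAN's real placement logic)."""
    replicas = ftt + 1  # RAID-1 mirroring: FTT=1 means 2 full copies
    # Each replica is split so that no component exceeds 255 GB,
    # or into more pieces if the policy stripe width demands it.
    per_replica = max(stripe_width, math.ceil(vmdk_gb / MAX_COMPONENT_GB))
    return replicas, per_replica, replicas * per_replica


print(component_layout(1500))  # ~1.5 TB vmdk, FTT=1 → (2, 6, 12)
```

That matches the "minimum of 12 ~250 GB components (6+6 across 2 hosts)" figure above.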

- Are you trying to clone this large VM while doing synchronous snapshot-based backups of a ton of VMs?

If so, then avoid this time-frame to get a better metric of whether this is 'slow' or not. Depending on the number of VMs and their loads, this backup activity can generate a ton of IO.

- But back to your question of whether throttling the IO of these Objects would reduce their impact:

If this is what is actually causing the sluggishness in the environment, then sure, this may reduce the impact, as the read IOPS for the clone job would be throttled (provided you limit it enough).

(This will of course make this job even slower again so it may be a 'speed or stability, pick one' situation)
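To put rough numbers on the trade-off: as far as I recall, vSAN normalizes IOPS-limit accounting to 32 KB IOs, so a limit effectively caps throughput at roughly limit × 32 KB/s. Under that assumption (and it is an assumption; `clone_time_hours` is just an illustrative name), a back-of-the-envelope estimate:

```python
NORMALIZED_IO_KB = 32  # assumption: vSAN counts IOPS in normalized 32 KB units


def clone_time_hours(vmdk_gb, iops_limit):
    """Rough lower bound on the time just to read the source object if it
    is capped by an 'IOPS limit for object' rule. Sketch only: ignores
    caching, the write side, and the actual IO sizes the clone engine uses."""
    throughput_mb_s = iops_limit * NORMALIZED_IO_KB / 1024
    return (vmdk_gb * 1024 / throughput_mb_s) / 3600


print(round(clone_time_hours(1500, 5000), 1))  # 5000-IOPS cap → ~2.7 h for 1.5 TB
```

So a tight limit protects the other VMs but can easily push the clone into multi-hour territory: 'speed or stability, pick one' indeed.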

A few other relevant questions:

Is this VM operational or just used to clone off?

What kind of Application and does it require high IO?

How often is it cloned?

Are you *POSITIVE* the resulting clone isn't being created as Thick? (Check via RVC using vsan.vm_object_info <vm> and look for proportionalCapacity = 100; if it is, then maybe you are cloning 3 TB instead of 200 GB.)

Are there any other possible problems in the environment? (E.g.: using SATA and/or small SSDs for cache and/or have a terrible cache:cap ratio)

Bob

-o- If you found this comment useful or answer please select as 'Answer' and/or click the 'Helpful' button, please ask follow-up questions if you have any -o-

vmsysadmin20111
Enthusiast

Hi Bob,

Thanks for the suggestions, much appreciated! Good point about the max component size. This environment is not in production yet, but backup performance is definitely a concern. I'm not sure how frequently they are planning to run cloning operations in this environment; I believe they are just running it as a test (I did suggest that they use a more meaningful testing method such as HCIBench). No high IO in the cloned VM afaik, it's just very large. I will check the disk format.

GreatWhiteTec
VMware Employee

If you are just testing things out, you could test cloning operations with checksums disabled. In my experience this significantly speeds up cloning; however, we do recommend keeping checksums enabled in production environments unless the applications do their own checks. The cool thing about SPBM is that you can assign different policies at the object level, so I think you know where I am going with this...

Definitely recommend HCIBench for testing. 

vmsysadmin20111
Enthusiast

Well, the mystery is solved... The customer was comparing vSAN performance to SimpliVity using the VM cloning process. On SimpliVity, clone operations appear to be "instant" from the vCenter point of view, since no data is being copied. On vSAN, the cloning actually copies the data, so it takes 30 mins to clone a 400 GB VM. Thanks, all!

TheBobkin
Champion

Thanks for the update. 400 GB (800 GB, as it is FTT=1, or were you including that?) in 30 minutes is not bad, assuming it is thick or a fairly full VM.

Hehe...funny that it was a case of "Those are some nice apples, but why do they not look like those oranges?"

Bob

vmsysadmin20111
Enthusiast

400 GB total (thick provisioned); FTT=1 is taken into account, but that is still a good number. The HCIBench default easyrun 70%/30% random read/write test showed an amazing 65,000 IOPS.
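For anyone sanity-checking these numbers: a clone on an FTT=1 datastore reads the source once and writes two replica copies, so the backend moves roughly three times the vmdk size (a rough model; it ignores cache effects and checksum overhead):

```python
def clone_throughput_mb_s(vmdk_gb, minutes, ftt=1):
    """Effective backend throughput of a vSAN clone: 1x read of the source
    plus (ftt + 1) replica writes. Sketch only; ignores caching/checksums."""
    total_gb = vmdk_gb * (1 + (ftt + 1))  # 1x read + 2x write for FTT=1
    return total_gb * 1024 / (minutes * 60)


print(round(clone_throughput_mb_s(400, 30)))  # 400 GB in 30 min → ~683 MB/s
```

Roughly 683 MB/s of combined read+write across a 4-node cluster, which lines up with "still a good number."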
