XwebNetwork
Contributor

VM Disk Speed Degraded after vMotion to a New, Identically Spec'd Host

So this is a weird one. I have two servers with identical hardware. I vMotioned a VM from server A to server B. The process seemingly went smoothly, but for whatever reason this particular VM's disk reads and writes are way off. They should be in line with the following, which is from another VM on the same server.

XwebNetwork_0-1604753379623.png

But the tests render the following:

XwebNetwork_1-1604753729534.png

 

Now, I've checked everything, and everything is up to date. The VM did report proper numbers on the prior server, so something must have happened during the move. I'm really not sure, but maybe someone has run into a similar issue? Reverting back is unfortunately not an option due to the way the backup cycles landed.

Anyways, thank you in advance!

 

Physical specs for both servers:

HPE ProLiant DL360 Gen10

2 x Intel Xeon Platinum 8160

192GB DDR4 RAM

6 x 2TB SSD in RAID 10 ADM with caching via a P408i controller

 

6 Replies
conyards
Expert

You may need to dig a little deeper into host performance and configuration.  There are myriad things between the VM and storage that could impact performance.

I'd suggest starting with esxtop and reviewing the KAVG, DAVG and GAVG counters, to try and ascertain whether latency is being introduced at the kernel or device layer.

Here's a KB on how to interpret esxtop:

https://kb.vmware.com/s/article/1008205

Here is a link to an excellent blog series for storage troubleshooting in general.

https://blogs.vmware.com/vsphere/2012/05/troubleshooting-storage-performance-in-vsphere-part-1-the-b...

Once you've got that information, you can point troubleshooting in the right direction. For example, if it points toward DAVG, troubleshoot the path between the host and the array, and the array itself; if it's KAVG, then the issue is likely resident in the host.
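
As a rough illustration of that split, here's a small sketch; the threshold numbers are rule-of-thumb values rather than official limits, and the helper function is just made up for the example:

```python
# Rough triage of esxtop storage latency counters (values in milliseconds).
# Thresholds are rule-of-thumb numbers, not official limits; tune for your array.

def triage_latency(davg_ms: float, kavg_ms: float) -> str:
    """Suggest where to focus based on device (DAVG) and kernel (KAVG) latency."""
    gavg_ms = davg_ms + kavg_ms  # GAVG is roughly DAVG + KAVG, the latency the guest sees
    if kavg_ms > 2.0:
        return (f"GAVG ~{gavg_ms:.1f} ms: KAVG is high; look at the host "
                "(queue depths, adapter limits, kernel-level contention).")
    if davg_ms > 25.0:
        return (f"GAVG ~{gavg_ms:.1f} ms: DAVG is high; look at the path to "
                "the array and at the array/controller itself.")
    return f"GAVG ~{gavg_ms:.1f} ms: both layers look healthy in this sample."

# Example with made-up numbers:
print(triage_latency(davg_ms=1.2, kavg_ms=0.1))
```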

Hopefully that makes sense.

Simon

https://virtual-simon.co.uk/
XwebNetwork
Contributor

Thanks for the speedy reply, man! I'll definitely look into this.

XwebNetwork
Contributor

So I finally had a quick moment to sit down and try this. I loaded esxtop, hit 'v' to see the virtual disks, and ran CrystalDiskMark again on two of the VMs. Right away I noticed roughly 10x higher read latency (LAT/rd) on the affected VM compared to the other. Will continue to investigate.

conyards
Expert

Out of interest, was the latency being reported at the Kernel or Device layer?

https://virtual-simon.co.uk/
XwebNetwork
Contributor

Unless I'm reading it wrong, it doesn't seem to show up on either. It only shows on the 'v' screen, so I'm guessing that must mean it's at the VM level?

XwebNetwork_0-1604792145528.png

XwebNetwork_1-1604792196819.png

 

 

 

XwebNetwork
Contributor

Huh, so I got it... but not what I would have thought. I noticed that there was a temporary snapshot that had not been removed after the transfer. I deleted it and, bam, back in business. Looking into why that would be the case, it makes perfect sense.

When you create a snapshot, the original disk image is "frozen" in a consistent state, and all writes from then on go to a new differential image. Even worse, as explained here and here, the differential image takes the form of a change log that records every change made since the snapshot was taken. This means that reads may have to touch not just one file but all of the difference data as well (the original data plus every change made to it). The overhead increases even more when you cascade snapshots.
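
For anyone who lands on this thread later: one quick way to spot leftover snapshots across an inventory is a short script. The sketch below uses pyVmomi, which isn't something discussed in this thread, and the vCenter hostname and credentials are placeholders:

```python
# Sketch: list every VM that still has snapshots attached (e.g. left over from
# a migration or a backup job). Assumes pyVmomi is installed; the hostname and
# credentials below are placeholders.
import atexit
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # only for lab/self-signed certificates
si = SmartConnect(host="vcenter.example.local",
                  user="administrator@vsphere.local",
                  pwd="changeme",
                  sslContext=ctx)
atexit.register(Disconnect, si)

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)

for vm in view.view:
    if vm.snapshot:  # non-None means the VM has at least one snapshot
        names = [snap.name for snap in vm.snapshot.rootSnapshotList]
        print(f"{vm.name}: snapshots {names}")

view.Destroy()
```

Deleting (consolidating) the snapshot is what merges the delta disks back into the base VMDK and brings read latency back in line.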
