How to get more info on what is happening during a...

Munners66 · ‎10-06-2020

I have two identical VMs each with three disks.

When they are backed up, a snapshot is created, they are backed up and then the snapshot is deleted.

One of the VMs the snapshot deletion takes 2 seconds, the other takes over 2 minutes.

I have looked in the VMware.log file for both of these VMs, and I can see that the one that takes over 2 minutes has three 40 second pauses in the log file where nothing is logged as follows:

2020-10-05T02:12:10.996Z| vcpu-0| I125: DISKLIB-CBT   : Initializing ESX kernel change tracking for fid 1093275682.
2020-10-05T02:12:10.996Z| vcpu-0| I125: DISKLIB-CBT   : Successfuly created cbt node 412a1022-cbt.
2020-10-05T02:12:10.996Z| vcpu-0| I125: DISKLIB-CBT   : Opening cbt node /vmfs/devices/cbt/412a1022-cbt
2020-10-05T02:12:10.996Z| vcpu-0| I125: DISKLIB-LIB   : Opened "/vmfs/volumes/03f4c58a-7cd9305c/SVR02/SVR02-000001.vmdk" (flags 0x8, type vmfsSparse).

2020-10-05T02:12:51.040Z| vcpu-0| I125: DISKLIB-VMFS : "/vmfs/volumes/03f4c58a-7cd9305c/SVR02/SVR02-flat.vmdk" : open successful (522) size = 107374182400, hd = 1318850598. Type 3
2020-10-05T02:12:51.040Z| vcpu-0| I125: DISKLIB-DSCPTR: Opened [0]: "SVR02-flat.vmdk" (0x20a)
2020-10-05T02:12:51.040Z| vcpu-0| I125: DISKLIB-LINK : Opened '/vmfs/volumes/03f4c58a-7cd9305c/SVR02/SVR02.vmdk' (0x20a): vmfs, 209715200 sectors / 100 GB.

.

2020-10-05T02:12:51.729Z| vcpu-0| I125: DISKLIB-CBT   : Initializing ESX kernel change tracking for fid 1384386613.
2020-10-05T02:12:51.729Z| vcpu-0| I125: DISKLIB-CBT   : Successfuly created cbt node 52841035-cbt.
2020-10-05T02:12:51.729Z| vcpu-0| I125: DISKLIB-CBT   : Opening cbt node /vmfs/devices/cbt/52841035-cbt
2020-10-05T02:12:51.729Z| vcpu-0| I125: DISKLIB-LIB   : Opened "/vmfs/volumes/3d6ecc01-4726715a/SVR02/SVR02-000002.vmdk" (flags 0x8, type vmfsSparse).

2020-10-05T02:13:31.774Z| vcpu-0| I125: DISKLIB-VMFS : "/vmfs/volumes/3d6ecc01-4726715a/SVR02/SVR02-flat.vmdk" : open successful (522) size = 64424509440, hd = 1251151928. Type 3
2020-10-05T02:13:31.774Z| vcpu-0| I125: DISKLIB-DSCPTR: Opened [0]: "SVR02-flat.vmdk" (0x20a)
2020-10-05T02:13:31.774Z| vcpu-0| I125: DISKLIB-LINK : Opened '/vmfs/volumes/3d6ecc01-4726715a/SVR02/SVR02.vmdk' (0x20a): vmfs, 125829120 sectors / 60 GB.

.

2020-10-05T02:13:32.219Z| vcpu-0| I125: DISKLIB-CBT   : Initializing ESX kernel change tracking for fid 1384648757.
2020-10-05T02:13:32.220Z| vcpu-0| I125: DISKLIB-CBT   : Successfuly created cbt node 52881035-cbt.
2020-10-05T02:13:32.220Z| vcpu-0| I125: DISKLIB-CBT   : Opening cbt node /vmfs/devices/cbt/52881035-cbt
2020-10-05T02:13:32.220Z| vcpu-0| I125: DISKLIB-LIB   : Opened "/vmfs/volumes/694241c7-23b3bcf4/SVR02/SVR02-000002.vmdk" (flags 0x8, type vmfsSparse).

2020-10-05T02:14:12.269Z| vcpu-0| I125: DISKLIB-VMFS : "/vmfs/volumes/694241c7-23b3bcf4/SVR02/SVR02-flat.vmdk" : open successful (522) size = 107374182400, hd = 1215434809. Type 3
2020-10-05T02:14:12.269Z| vcpu-0| I125: DISKLIB-DSCPTR: Opened [0]: "SVR02-flat.vmdk" (0x20a)
2020-10-05T02:14:12.269Z| vcpu-0| I125: DISKLIB-LINK : Opened '/vmfs/volumes/694241c7-23b3bcf4/SVR02/SVR02.vmdk' (0x20a): vmfs, 209715200 sectors / 100 GB.

How can I find out what is happening during these 40 second gaps?

Thanks.

Lalegre · ‎10-06-2020

Hey Munners66,

This depends of the size of the snapshots and how much changes did you have since the backup operation started, as the snapshot is used for the VM to be able to keep Writing on the disk without be interrupted.

You can follow the next KB Article to watch how the sizes of the delta files change during the Deletion Operation which means is consolidating all the changes to the VMDKs: VMware Knowledge Base

Munners66 · ‎10-06-2020

Hi Lalegre,

Thanks for you reply.

I understand that snapshot deletion can take some time depending on the activity during the life of the snapshot, but these two servers are of identical build, the only difference is one is active while the other is doing nothing. The active one deletes its snapshot in 2 seconds while the inactive one takes over two minutes, and has three 40 second pauses in its log file where it looks like it is doing nothing.

I was trying to see if I could find out what is happening during these long pauses as nothing is in the VMware.log file.

Thanks,

nachogonzalez · ‎10-06-2020

Hey Hope you are doing fine:
Let me ask you some questions?

- Those servers that are the same, are in a cluster?
In that case is possible that server B takes more time because is waiting for server A to finish.

- Are backup jobs the same for both servers?

- Are both servers on the same host? What happens if you swap VM's hosts?
- Have you checked Hosts resources? (CPU, Memory, network/HBA consumption)
- If you take a backup with both servers powered off, what is the downtime?

Let me know how can I assist

Blog Nachogonzalez.com.ar Twitter @nachogon_

Munners66 · ‎10-06-2020

Hi,

Yes, the two servers are in a cluster, and it's the passive node that takes over two minutes to delete the snapshot.

The backup jobs are the same for both servers and do not run at the same time.

The two nodes are on different hosts, and as a test, I migrated the passive node to the least resource used host to see if that would make a difference, but it didn't.

These are production servers, so cannot really power them down to test if the snapshot deletion is quicker.

Thanks.

scott28tt · ‎10-06-2020

What sort of cluster?

I wonder if something about the passive node is generating excessive IO requests or if it's some sort of storage locking issue?

-------------------------------------------------------------------------------------------------------------------------------------------------------------

Although I am a VMware employee I contribute to VMware Communities voluntarily (ie. not in any official capacity)
VMware Training & Certification blog

Munners66 · ‎10-06-2020

It's a Windows Server Failover Cluster.

Looking at the vmware.log file, it is taking snapshots of the three local disks and not touching the Cluster Shared Volumes.

The backup only takes about 4 mins for each node. I just don't understand the three 40 second pauses in the log.

Is there any way to see what is going in in this time?

nachogonzalez · ‎10-06-2020

Quick question, in the same timerange what is happening in the OS?
Is the Operating System Aware of that latency?

Also this KB shows best practices for WSFC clusters, are you sure everything is according to what is described here:
VMware Knowledge Base

Blog Nachogonzalez.com.ar Twitter @nachogon_

Munners66 · ‎10-20-2020

Well, it would appear that the cause of these delays is explained in this article:

VMware Knowledge Base

This issue occurs when the target virtual machine and the backup appliance reside on two different hosts, and the NFSv3 protocol is used to mount NFS datastores. A limitation in the NFSv3 locking method causes a lock timeout, which pauses the virtual machine being backed up.

The backup is being modified to use NBD and will see how it runs tonight.

All

How to get more info on what is happening during a snapshot deletion.