8 Replies Latest reply on Oct 10, 2019 7:42 AM by AndreasAPS

    6.7.0 snapshot removal freezes vm

    AndreasAPS Lurker

      We are currently facing the following issue:

       

      - We have a VM acting as a Samba file server (for about 50 clients).

      - It is backed up using Veeam B&R. Veeam uses the VMware API to create a snapshot and removes it after the backup is done.

      - During snapshot removal the VM freezes and is completely unavailable for 40 seconds, but sometimes for more than 60 seconds and up to 20 minutes. (60 seconds is critical because it impacts our users.)

      - The snapshot removal progress is not smooth; there are long periods of no progress at all.

      - When I try to ssh to the ESX host at that time (e.g. to cd or ls the VM directory), I get an unresponsive ssh session or "Device or resource busy" timeouts. (I don't know if this is normal, as I have seen such waits even when no backup was running.)

       

      - The VM has 5 hard disks (about 9 TB in total): 4 are on Datastore2 and 1 (1 TB) is on Datastore1.

      - Our cluster consists of 2 ESX hosts with shared storage (FC direct-attached HPE MSA5020) that provides Datastore1 and Datastore2.

      - Datastore2 is dedicated to this VM.

       

      When I grep vmware.log, I see many entries showing that the VM was stopped for snapshot removal:

       

       

      2019-10-08T10:00:52.645Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1177715 us

      2019-10-08T10:04:23.472Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 12895 us

      2019-10-08T10:04:38.091Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 155503 us

      2019-10-08T10:05:17.391Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 30895458 us

      2019-10-08T10:05:22.945Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 166767 us

      2019-10-08T10:05:25.206Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 144651 us

      2019-10-08T11:00:55.800Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1268898 us

      2019-10-08T11:04:31.733Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 13710 us

      2019-10-08T11:04:51.409Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 18886631 us

      2019-10-08T11:04:53.100Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 197801 us

      2019-10-08T11:05:09.481Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 8160928 us

      2019-10-08T11:05:15.162Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 187385 us

      2019-10-08T11:05:17.492Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 164755 us

      2019-10-08T12:00:52.716Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1238173 us

      2019-10-08T12:04:08.472Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 10503 us

      2019-10-08T12:04:09.289Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 108357 us

      2019-10-08T12:04:26.429Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 16180417 us

      2019-10-08T12:05:00.877Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 25686061 us

      2019-10-08T12:05:46.096Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 39624419 us

      2019-10-08T12:05:48.275Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 126040 us

      2019-10-08T13:00:57.400Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1269665 us

      2019-10-08T13:04:43.269Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 11101 us

      2019-10-08T13:05:17.012Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 32848352 us

      2019-10-08T13:05:18.540Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 159029 us

      2019-10-08T13:05:26.983Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 147776 us

      2019-10-08T13:05:58.774Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 26297009 us

      2019-10-08T13:06:01.072Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 158871 us

      2019-10-08T14:00:53.754Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1084127 us

      2019-10-08T14:04:09.445Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 11057 us

      2019-10-08T14:07:20.351Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 189794746 us

      2019-10-08T14:07:22.001Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 176603 us

      2019-10-08T14:12:58.620Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 326074922 us

      2019-10-08T14:14:10.905Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 66778768 us

      2019-10-08T14:14:13.307Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 152553 us

      2019-10-09T12:17:22.774Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 1902985 us

      2019-10-09T12:23:22.198Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 13032 us

      2019-10-09T12:32:32.823Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 549832173 us

      2019-10-09T12:32:34.558Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 390240 us

      2019-10-09T12:33:56.769Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 73481855 us

      2019-10-09T12:42:53.251Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 530946216 us

      2019-10-09T12:42:55.595Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 167243 us
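
The durations in these entries are in microseconds, which makes the long stuns hard to spot by eye. A minimal Python sketch for converting them to seconds and listing only the long ones could look like this. It assumes the line format shown above; the names `parse_unstuns` and `long_stuns` and the 5-second cutoff are illustrative choices, not anything provided by VMware.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2019-10-08T10:05:17.391Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 30895458 us
UNSTUN_RE = re.compile(
    r"^\s*(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\|.*"
    r"Checkpoint_Unstun: vm stopped for (?P<us>\d+) us"
)


def parse_unstuns(lines):
    """Yield (timestamp, stun_seconds) for every Checkpoint_Unstun line."""
    for line in lines:
        m = UNSTUN_RE.search(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
            yield ts, int(m.group("us")) / 1_000_000


def long_stuns(lines, threshold_s=5.0):
    """Return only the stuns longer than threshold_s (an arbitrary cutoff)."""
    return [(ts, s) for ts, s in parse_unstuns(lines) if s > threshold_s]


if __name__ == "__main__":
    # Two entries copied from the log excerpt above.
    sample = [
        "2019-10-08T10:04:38.091Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 155503 us",
        "2019-10-08T10:05:17.391Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 30895458 us",
    ]
    for ts, seconds in long_stuns(sample):
        print(f"{ts.isoformat()}  stunned {seconds:.1f} s")
```

The same functions accept an open file handle for vmware.log instead of the sample list.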

       

      I can hardly believe that this is "normal" behaviour. Does anybody have ideas how to narrow down the issue?

       

      regards

       

      Andreas

       

      Message edited by Andreas Baier

        • 1. Re: 6.7.0 snapshot removal freezes vm
          Tayfun DEGER Hot Shot
          vExpert

          There are multiple reasons why a virtual machine may freeze during a snapshot operation. It is sometimes caused by a bug in ESXi and sometimes by the virtual machine configuration.

           

          First of all, can you answer my questions below?

           

          ESXi version? Build number?

          What is the virtual machine hardware?

          What is the brand and model of the physical equipment?

          --
          Blog: https://www.tayfundeger.com
          Twitter: https://www.twitter.com/tayfundeger

          vBlogger, vExpert, Cisco Champions

          If this reply helped with your problem, please mark it "Helpful"; if it solved your problem, mark it "Correct Answer".
          • 2. Re: 6.7.0 snapshot removal freezes vm
            AndreasAPS Lurker

            We are using:

             

            ESX Version:

            • ESXi 6.7.0 Build 13006603

            VM Hardware:

            • 2 CPUs, 4 GB RAM, 5 hard disks (16 GB, 500 GB, 4.88 TB, 2.93 TB, 1 TB). (What else do you need to know?)

            Phys Equipment:

            • 2x ProLiant DL360 Gen10 each with 128 GB Ram and 2 x 12 Core Intel(R) Xeon(R) Gold 5118
            • 3. Re: 6.7.0 snapshot removal freezes vm
              Tayfun DEGER Hot Shot
              vExpert

              During snapshot operations, the virtual machine is briefly frozen. If the CPU and memory resources of the virtual machine are insufficient at that time, performance problems may occur during the snapshot. Can you increase the CPU and memory resources? Can you also check the latency of the datastores in the ESXi performance monitor?

               

              Also, which guest OS are you using? Which version of VMware Tools is installed on the virtual machine?

              • 4. Re: 6.7.0 snapshot removal freezes vm
                AndreasAPS Lurker

                Hi,

                 

                • The VM is running CentOS 7. The VMware Tools version is "open-vm-tools.x86_64 10.2.5-3.el7".
                • Increasing the memory and CPU of the VM? We could increase the resources, yes. But does a snapshot removal really consume the VM's memory and CPU resources?
                • Regarding the latency of the datastores: I could not find where to check that in vCenter; it tells me "No performance data is available for the currently selected metrics". But I had a look at Veeam ONE and could see that the latency is not increased at the time it happens. Write latency was 0 and read latency was around 0.5 to 1 ms for both datastores.

                 

                 

                We installed some updates in the meantime, now the build is VMware ESXi, 6.7.0, 14320388

                • 5. Re: 6.7.0 snapshot removal freezes vm
                  depping Champion
                  VMware EmployeesUser Moderators

                  Nah. Increasing the resources of the VM won't make a big difference. It usually has to do with the change rate on disk (I/O intensity) and the storage system itself (how fast is it?). It could also be that the host is low on resources and the resources needed to merge the snapshot with the base disk are the limitation. Try moving the VM to another host to see if that makes a difference.

                  • 6. Re: 6.7.0 snapshot removal freezes vm
                    AndreasAPS Lurker

                    The ESX hosts were doing almost nothing. CPU was at about 8%. Memory was at 30% (max). I can't see any increase in memory, CPU, or I/O during the backup / snapshot removal.

                    • 7. Re: 6.7.0 snapshot removal freezes vm
                      continuum Guru
                      User ModeratorsvExpertCommunity Warriors

                      We can't find the serial killer when all you give us is the time of the last kills.

                      The log entries in between those events are the useful ones.
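
One way to pull out those in-between entries is to take each long stun and collect everything logged while the VM was stopped. The Python sketch below does that under two assumptions: every line of interest starts with the same timestamp prefix as the excerpt in the first post, and the Checkpoint_Unstun line is written at the end of the stun, so the window runs backwards from its timestamp. The name `stun_context` and both default parameters are hypothetical.

```python
import re
from datetime import datetime, timedelta

TS_RE = re.compile(r"^\s*(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\|")
UNSTUN_RE = re.compile(r"Checkpoint_Unstun: vm stopped for (\d+) us")


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    m = TS_RE.match(line)
    return datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f") if m else None


def stun_context(lines, min_stun_s=5.0, margin_s=2.0):
    """For each long stun, collect the log lines written while the VM was stopped.

    The window runs from (unstun time - stun duration - margin) up to and
    including the unstun line. Lines without a timestamp are skipped.
    Returns a list of (unstun_line, [context_lines]).
    """
    stamped = [(parse_ts(line), line.rstrip("\n")) for line in lines]
    stamped = [(ts, line) for ts, line in stamped if ts is not None]
    windows = []
    for ts, line in stamped:
        m = UNSTUN_RE.search(line)
        if not m:
            continue
        duration = timedelta(microseconds=int(m.group(1)))
        if duration.total_seconds() < min_stun_s:
            continue
        start = ts - duration - timedelta(seconds=margin_s)
        context = [text for t, text in stamped if start <= t <= ts]
        windows.append((line, context))
    return windows


# Usage sketch (the path is an assumption):
# with open("vmware.log", errors="replace") as fh:
#     for unstun, context in stun_context(fh):
#         print("=" * 72)
#         print("\n".join(context))
```

Posting the output of such a window for one of the multi-minute stuns is usually more useful than the stun lines alone.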

                      • 8. Re: 6.7.0 snapshot removal freezes vm
                        AndreasAPS Lurker

                        I attached vmware logs. 

                         

                        • Until 8-Oct we were running backups hourly. Then we stopped the backups to keep the VM running.
                        • 9-Oct-19 12:16 - 12:43 (UTC) there was a backup started manually which took 19:50 minutes to remove a snapshot
                        • 8-Oct-19 14:00 - 14:14 (UTC) a snapshot removal took 10:36 mins
                        • 8-Oct-19 13:00 - 13:06 (UTC) a snapshot removal took 1:35 mins
                        • 8-Oct-19 12:00 - 12:06 (UTC) a snapshot removal took 1:58 mins
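
To compare these removal times with the time the VM was actually stopped, the stun durations can be totalled per backup run. The Python sketch below groups Checkpoint_Unstun entries into runs whenever consecutive entries are less than 30 minutes apart. That gap is an assumption that happens to fit the hourly schedule described here, and the names `stun_totals` and `fmt` are illustrative.

```python
import re
from datetime import datetime, timedelta

UNSTUN_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\|.*"
    r"Checkpoint_Unstun: vm stopped for (\d+) us"
)


def stun_totals(lines, max_gap=timedelta(minutes=30)):
    """Group unstun events into runs and total the stunned time per run.

    Returns a list of (first_ts, last_ts, total_stun_seconds).
    """
    runs = []  # each entry: (first_ts, last_ts, total_microseconds)
    for line in lines:
        m = UNSTUN_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        us = int(m.group(2))
        if runs and ts - runs[-1][1] <= max_gap:
            first, _, total = runs[-1]
            runs[-1] = (first, ts, total + us)  # extend the current run
        else:
            runs.append((ts, ts, us))  # start a new run
    return [(first, last, total / 1_000_000) for first, last, total in runs]


def fmt(seconds):
    """Format a duration in seconds as m:ss."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}:{secs:02d}"


# Usage sketch (the path is an assumption):
# with open("vmware.log", errors="replace") as fh:
#     for first, last, total in stun_totals(fh):
#         print(f"{first:%Y-%m-%d %H:%M} to {last:%H:%M}  stunned {fmt(total)}")
```

Applied to the seven 9-Oct entries quoted in the first post, the total comes to about 1157 seconds (19:17), against the 19:50 reported for the snapshot removal. That total includes the short stun at 12:17, which probably belongs to snapshot creation rather than removal, but it suggests the VM was stopped for nearly the whole removal.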