Hi,
I've used Nakivo Backup & Replication with ESXi 5.5 for a long time without any issues. I've recently upgraded to a new host running ESXi 6.5 U1, and now I'm having problems backing up my guests.
The problem is that the full initial backup works 100%, but the following incremental backups almost always result in the guest shutting down. Nakivo tries to remove the snapshot after a completed backup job while the guest is still using that snapshot. This gives me an error that the *-00001.vmdk is corrupt, and the guest cannot boot.
I've tried everything, even a support call with two techs from Nakivo, and they believe this is a problem with VMware and not Nakivo.
Please help...
The first thing is to make sure the backup solution is supported with your version of vCenter and ESXi...
There has to be a proper explanation from Nakivo support..
If you take manual snapshots and delete them, I'm sure you will have no issues.
It seems like the backup team doesn't want to address the situation or take it up with VMware SDK support, and hence is pointing to a VMware issue.
Well, a manual snapshot and deletion also "crashes" the guest if the guest is running. It seems like it's the consolidation that fails.
Well then, if it is OK, could you share the latest vmware.log file from the VM?
Do you have the issue with one VM or with all of them? Any particular OS?
Hi,
Well, it happens to all of my guests regardless of OS. It happens most to one Ubuntu 16.04 LTS guest and to a Windows Server 2008 guest.
The attached log is from the Server 2008 guest; you can see it happen:
2017-10-21T12:08:00.559Z| vcpu-0| I125: SnapshotVMXTakeSnapshotComplete: Done with snapshot 'Temporary snapshot 8cc7518c-923d-4b38-8cda-77246cf9404f': 3
2017-10-21T12:08:00.559Z| vcpu-0| I125: VigorTransport_ServerSendResponse opID=vim-cmd-30-15e3 seq=1909: Completed Snapshot request.
2017-10-21T12:08:04.167Z| vcpu-0| I125: HBACommon: First write on scsi0:0.fileName='/vmfs/volumes/59dcf422-93efabc2-e7f4-a0369f330054/vSphere-dc01.int.effelito.net/2f0c4509-b6a4-4c5e-bdc1-9060b0831b6b-000001.vmdk'
2017-10-21T12:08:04.168Z| vcpu-0| I125: DDB: "longContentID" = "75f5323b76629679834556e6594e20c4" (was "93a7291a1e22a2a586ef99d00404b7a7")
2017-10-21T12:08:04.175Z| vcpu-0| I125: DISKLIB-CHAIN : DiskChainUpdateContentID: old=0x404b7a7, new=0x594e20c4 (75f5323b76629679834556e6594e20c4)
2017-10-21T12:09:28.547Z| vcpu-0| I125: Msg_Question:
2017-10-21T12:09:28.547Z| vcpu-0| I125: [msg.hbacommon.corruptredo] The redo log of '2f0c4509-b6a4-4c5e-bdc1-9060b0831b6b-000001.vmdk' is corrupted. If the problem persists, discard the redo log.
2017-10-21T12:09:28.547Z| vcpu-0| I125: ----------------------------------------
2017-10-21T12:10:44.648Z| vcpu-0| I125: VigorTransportProcessClientPayload: opID=243b17cd seq=2163: Receiving Bootstrap.MessageReply request.
2017-10-21T12:10:44.649Z| vcpu-0| I125: VigorTransport_ServerSendResponse opID=243b17cd seq=2163: Completed Bootstrap request.
2017-10-21T12:10:44.649Z| vcpu-0| I125: MsgQuestion: msg.hbacommon.corruptredo reply=0
2017-10-21T12:10:44.649Z| vcpu-0| E105: PANIC: Exiting because of failed disk operation.
2017-10-21T12:10:45.448Z| vcpu-0| W115: A core file is available in "/vmfs/volumes/59dcf422-93efabc2-e7f4-a0369f330054/vSphere-dc01.int.effelito.net/vmx-zdump.001"
2017-10-21T12:10:45.448Z| mks| W115: Panic in progress... ungrabbing
This is basically the same error on all my machines, whether the snapshot is manual or Nakivo-initiated.
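The failure signature is the same in every log: a `msg.hbacommon.corruptredo` question followed by the VMX panic. A minimal sketch of a scan that checks a vmware.log for exactly those two markers (the sample lines are copied from the log above; the function name is my own, not from any tool):

```python
import re

# Markers of the failure mode seen in this thread: the redo-log
# corruption question, and the VMX panic that follows it.
CORRUPT_REDO = re.compile(r"msg\.hbacommon\.corruptredo")
PANIC = re.compile(r"PANIC: Exiting because of failed disk operation")

def scan_vmware_log(lines):
    """Return which failure markers appear in an iterable of log lines."""
    hits = {"corruptredo": False, "panic": False}
    for line in lines:
        if CORRUPT_REDO.search(line):
            hits["corruptredo"] = True
        if PANIC.search(line):
            hits["panic"] = True
    return hits

# Sample lines taken from the vmware.log posted above.
sample = [
    "2017-10-21T12:09:28.547Z| vcpu-0| I125: [msg.hbacommon.corruptredo] "
    "The redo log of '...-000001.vmdk' is corrupted.",
    "2017-10-21T12:10:44.649Z| vcpu-0| E105: PANIC: Exiting because of "
    "failed disk operation.",
]
print(scan_vmware_log(sample))  # → {'corruptredo': True, 'panic': True}
```

Running this over each VM's log (e.g. `scan_vmware_log(open(path))`) is a quick way to confirm every affected guest is failing the same way.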
The only change is that you have upgraded to VMFS6, and the snapshots are all SEsparse.
If you have a datastore which is still VMFS5, you can try to migrate the VM there, take snapshots, and check..
Check if you have a local datastore which can be used for testing..
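To see which datastores are still VMFS5, `esxcli storage filesystem list` on the host shows the filesystem type of each volume. A rough sketch that filters a pasted copy of that output (the sample rows below are illustrative, not taken from the actual host; the `local-test` datastore is hypothetical):

```shell
# Hypothetical excerpt of `esxcli storage filesystem list` output,
# saved to a file so the filter below can be tried offline.
cat > /tmp/fslist.txt <<'EOF'
Mount Point                                        Volume Name  UUID                                 Mounted  Type      Size          Free
-------------------------------------------------  -----------  -----------------------------------  -------  ------  ------------  ------------
/vmfs/volumes/59dcf422-93efabc2-e7f4-a0369f330054  datastore1   59dcf422-93efabc2-e7f4-a0369f330054  true     VMFS-6  991937134592  626866061312
/vmfs/volumes/5a1b2c3d-00000000-0000-000000000000  local-test   5a1b2c3d-00000000-0000-000000000000  true     VMFS-5  107374182400   96636764160
EOF

# Print only the VMFS-5 volumes - candidates for the snapshot test.
awk '$5 == "VMFS-5" {print $2}' /tmp/fslist.txt
```

Any volume this prints is a candidate to Storage vMotion the test VM onto before trying a manual snapshot and delete.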
Hi,
I don't really know what SEsparse means, but do you think VMFS6 could be the problem? The guests were moved from a VMFS5 datastore to VMFS6.
I have a second datastore that I can format as VMFS5.
Please do so and check..
We will be able to narrow down the issue.
Hi,
I think that did the trick. I've done a full backup and an incremental backup on a Windows Server 2008 guest and a small Linux guest. All went great, and the snapshots were removed. The question is: why is this an issue on VMFS6? Is it because the guests were built on VMFS5 from the beginning, or is there something in VMFS6 that I'm not aware of?
There is a difference in the way snapshots are handled on VMFS6 from vSphere 6.5 onwards.
Nakivo backup does not seem to be aware of/compatible with that..
FYI, some explanation is given at the end of this article regarding the difference:
Deep Dive - The Ultimate Guide to Master VMware Snapshot
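One quick way to confirm which snapshot format a VM is actually using is to look at the delta disk's descriptor file: on VMFS5 the `createType` is `vmfsSparse` (the classic redo log), while on VMFS6 it is `seSparse`. A minimal offline sketch (the descriptor contents below are a hypothetical example; the CID values are borrowed from the log earlier in the thread):

```shell
# Hypothetical delta-disk descriptor, as found next to the base vmdk
# after taking a snapshot on a VMFS6 datastore.
cat > /tmp/delta-descriptor.vmdk <<'EOF'
# Disk DescriptorFile
version=1
CID=594e20c4
parentCID=0404b7a7
createType="seSparse"
parentFileNameHint="vSphere-dc01.vmdk"
# Extent description
RW 8388608 SESPARSE "vSphere-dc01-000001-sesparse.vmdk"
EOF

# Extract the snapshot format; "seSparse" here, "vmfsSparse" on VMFS5.
grep -o 'createType="[^"]*"' /tmp/delta-descriptor.vmdk
```

On a real host you would run the `grep` against the VM's `*-000001.vmdk` descriptor in its datastore folder.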
Thank you so much for your guidance. I will inform Nakivo of this, since they haven't encountered this problem before.
Cheers!
If you could just update this thread once you get a response from Nakivo.. I would like to know their stand.
Will do, they sent our findings to their developers.
In the settings of a backup job, you need to uncheck "Use existing backup as a target" in order to create a new backup chain for the VM.
It works, as before, without problems. I have checked this on several hosts.
Aha, so it's mandatory to start a new backup chain when you change the filesystem? I'll have to try that. Thanks!
No solution was provided by Nakivo and I was just happy I got it working.
I have the same issue, BUT recreating the backup job entirely in Nakivo still does not resolve it.
All was fine until I refreshed my environment, migrating all hosts' compute and storage from vSphere 5.5 with a VMFS5 datastore to brand-new hosts running 6.5 with a VMFS6 datastore. All my other hosts don't seem to have problems (at least 15 hosts are backing up using the same settings in Nakivo). The issue only persists on the VM that acts as my file server, with one virtual disk that is GPT (NTFS) and 4 TB. That virtual disk WAS 2 TB when it was on the old VMFS5 datastore.
Nakivo (despite multiple recreations of the backup job from scratch, i.e. not using the existing backup chain) will successfully run the initial backup, but then at some point in the following couple of weeks (running incrementals with CBT every 2 hours) will randomly take 6+ hours to run a backup (despite the usual 5-minute incremental process it had been running since the initial backup) and then freeze on snapshot removal at around 45%.
Unfortunately, in my case, this actually LOCKS UP THE ENTIRE HOST that the VM's compute resources are assigned to (including the other VMs on that host, which become entirely unreachable) until I HARD-REBOOT that host.
I'm including my input here in the hope that we can all come to some kind of conclusion about what is going on. I'm currently in the process of bringing up an entirely new file server to replace this VM, creating it from scratch on the new VMFS6 datastore, in the hope that the issue is simply due to the migration from VMFS5 to 6. I also don't believe this is a problem with Nakivo, since Nakivo simply sends an API call to vSphere; at that point, everything is in VMware's hands. So I believe the issue must lie within VMware itself.
Please reply or post if you have any insight on this issue.
Thanks so much.
Hi JPCJon,
Did you ever find a solution?
I just reinstalled my 6.5 host due to a corrupt USB bootbank, and I thought I'd give VMFS6 a go again. Sadly, most of my guests had crashed during last night's backup in the same way as before. I did start a new backup chain, but as in your case, it didn't work.
Updated to 6.7 and did a test on a simple Linux guest. The full initial backup works great; the second, incremental backup fails because the snapshots are not consolidated after the first backup.