VMware Cloud Community
gbentz
Contributor

Server replication snapshot deletion shutting down my servers

So we have 3 hosts:

6.7 U2
6.7 U2
6.5 U2

One of the 6.7 hosts is brand new and much newer than the other two. It also has about 4 TB of free storage; the others have around 1 TB left, depending on the datastore. We have been using Veeam for about 4 years now without any major issues.

Since we set up this new host, things have started failing. First, our SQL server shut itself down during a Veeam replication because it failed to consolidate the snapshot within a given amount of time. I thought it was a fluke, so I set the D drive, which holds all the DBs, to independent persistent and excluded it in Veeam for now. The next day the Linux CentOS server hit the same redo log error. Now my Exchange server, which has been using Veeam since day one, just did the exact same thing. It's now on this new host, so I have narrowed the problem down to this host, but every setting is the same across the board. They both use the same repos during the replication jobs, sit on the same switch, and so forth.

At this point I am stuck on what in the world is causing this to happen on this host only. This is the error that happened a bit ago:

An application (/bin/vmx) running on ESXi host has crashed (8 time(s) so far). A core file might have been created at

The redo log of 'ourserver-000001.vmdk' is corrupted. If the problem persists, discard the redo log.

The server is powered off, etc.

7 Replies
continuum
Immortal

Sounds ugly.

First of all: what is your current state?
Do you need to fix any damaged VMs first?
For the moment, stop all Veeam jobs for the misbehaving VMs.

Make sure that the Veeam VM unmounts all disks it may still have in use. CHECK THIS!!! Do not skip this step.

Provide the vmware.log files of the affected VMs, and find the Veeam logs for those VMs.
Tell us what you need first, so that we can tell you what else we need to know.

Ulli


________________________________________________
Do you need support with a VMFS recovery problem ? - send a message via skype "sanbarrow"
I do not support Workstation 16 at this time ...

gbentz
Contributor

At the moment all the servers are up and stable. All Veeam jobs have been disabled for now, and any snapshots have been manually consolidated. I will pull the logs now and get them posted.
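A quick way to double-check that nothing was left behind after a manual consolidation is to look for snapshot disk names in a listing of the VM folder. This is only a sketch: it assumes the usual naming seen in this thread (`ourserver-000001.vmdk`), plus the `-delta` and `-sesparse` data files that normally accompany a snapshot descriptor, and `leftover_snapshot_disks` is a hypothetical helper name.

```python
import re

# Matches names like "ourserver-000001.vmdk". The optional suffixes cover
# the delta/sesparse data files that usually sit next to the descriptor.
# The pattern is an assumption based on the file names quoted in this thread.
SNAPSHOT_DISK_RE = re.compile(r"-\d{6}(?:-delta|-sesparse)?\.vmdk$")


def leftover_snapshot_disks(filenames):
    """Return the sorted file names that look like snapshot (redo log) disks."""
    return sorted(name for name in filenames if SNAPSHOT_DISK_RE.search(name))
```

Feed it the file names from the VM's datastore folder; an empty result suggests no snapshot disks remain, although the VM's own snapshot view stays the authoritative check.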

gbentz
Contributor

Here are the log files from the host and a screenshot of the error Veeam throws. The KB says it's a space issue, but all my hosts have plenty of space for these snapshots and replicas.

gbentz
Contributor

Here are the Veeam logs.

continuum
Immortal

I found hints that automated backup tools may have issues with VMDKs in mixed modes.
This can have the unexpected result that the independent flag is ignored and that removing snapshots after the backup takes much longer than expected.
That would explain some of your issues. However, your VMware logs do not cover any snapshot automation.

Can you please check whether you have vmware.log files that were active while Veeam was running a backup job against the VM?
While searching for the message "

I found 1022411's Blog: VMware Snapshots and Oracle non... | Oracle Community

Please read that and compare your symptoms.
A vmkernel.log would be helpful as well.
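To narrow down which vmware.log files are worth posting, a small script can flag the ones that mention the redo log problem. This is a minimal sketch, assuming the logs have been copied off the host into a local folder; the folder layout and the function names are hypothetical, and the search strings are simply taken from the error text quoted in this thread.

```python
from pathlib import Path

# Substrings taken from the error text quoted in this thread.
MARKERS = ("corruptredo", "redo log")


def find_redo_errors(lines):
    """Return (line_number, text) for every line mentioning a redo log problem."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(marker in lowered for marker in MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_folder(folder):
    """Scan every vmware*.log below folder (a hypothetical local copy of the logs)."""
    results = {}
    for path in sorted(Path(folder).rglob("vmware*.log")):
        with path.open(errors="replace") as handle:
            hits = find_redo_errors(handle)
        if hits:
            results[str(path)] = hits
    return results
```

Calling `scan_folder` on the download folder returns only the files with matches, which makes it easier to pick the log that was active during the failed job.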



gbentz
Contributor

Let me look for some more logs. One more note: the servers in question were on my other hosts for years and never had this issue. I migrated them over to this new host because it's latest gen and much faster than the other ones. Since that migration we have been seeing these replication shutdowns. All disks were always dependent. I only switched them as a test, thinking maybe the SQL data drive and Veeam didn't play well together, but then it happened to the mail server, so that theory went out the door.

gbentz
Contributor

Going by the modified date, the mailsvr log in the zip should be the one that reflects this error on the replication job in Veeam:

5/14/2019 12:20:51 PM :: Removing VM snapshot... (0% done) Details: The operation cannot be allowed at the current time because the virtual machine has a question pending:

'msg.hbacommon.corruptredo:The redo log of 'jmfxexch2016-000001.vmdk' is corrupted. If the problem persists, discard the redo log.

'. 

This is the same error I've been seeing on each of my servers on this new host, with the exception of my small ESET virtual appliance.
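To line up failures across several job reports, the timestamp and disk name can be pulled out of a message shaped like the one above. This is only a sketch: the regular expressions assume the exact layout of the lines quoted in this post, other Veeam messages may look different, and `parse_job_message` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Disk name inside the corruptredo message quoted above.
DISK_RE = re.compile(r"The redo log of '([^']+\.vmdk)' is corrupted")
# Leading timestamp, e.g. "5/14/2019 12:20:51 PM ::".
STAMP_RE = re.compile(r"^(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M) ::")


def parse_job_message(text):
    """Return (timestamp, disk_name); either part is None if it is not found."""
    stamp_match = STAMP_RE.search(text)
    disk_match = DISK_RE.search(text)
    when = None
    if stamp_match:
        # Month/day/year with a 12-hour clock, as in the quoted job report.
        when = datetime.strptime(stamp_match.group(1), "%m/%d/%Y %I:%M:%S %p")
    disk = disk_match.group(1) if disk_match else None
    return when, disk
```

Running this over each failed job entry gives a short list of times and disks to compare against the vmware.log entries for the same VM.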
