VMware Cloud Community
moracius
Enthusiast

'Optimistic locking failure' after failed clean-up

Hello,

I have five VMs configured successfully with vSphere Replication (VR). One of them, which had been replicating OK, failed during the clean-up after a recovery test with a "time-out error, more than 900 sec" in the storage phase. The Windows VM has 17 vDisks and about 4.5TB of data.

After this, I tried to execute a new test, and the recovery plan failed with the message "Error creating test bubble image for group...". Searching on this message, I found the following in http://www.vmware.com/support/srm/srm_releasenotes_5_0_0.html:

To resolve this issue, you must reconfigure replication using the following procedure:

  1. Clean up the recovery plan that has just failed.
  2. Start the Reconfigure wizard for the affected protection group.
  3. Change the location of files that are on the datastore that has been disconnected. Select the same datastore and folder locations.
  4. Agree to reuse the existing disks, as suggested by the wizard. Reconfigure the virtual machine.
  5. The protection group enters a full sync state, during which data consistency is checked. Wait for the process to complete.

Despite the fact that I wasn't aware of any storage disconnect, I tried the procedure. Oddly, for each one of the 17 disks it asked me twice whether to reuse the disk ("duplicate file found"). After this, according to the release notes, it should enter a sync state; however, I noticed in the Recent Tasks that it actually deleted all vDisks on the replicated datastore and failed the initial sync.
Having no other option I was aware of, I stopped replication and tried to reconfigure again. When I did that, I got the error "VRM Server generic error. Please check the documentation for any troubleshooting information. The detailed exception is: 'Optimistic locking failure'."

So I uninstalled everything from the protected and recovery sites (by everything I mean the SRM server, the VRM server, and the VR server on both sites), reinstalled it all, reconfigured, and tried to enable VR on the same 17-disk VM, but I am still getting the 'Optimistic locking failure'.

I tested with other VMs and it is working.

I have seen many hints in this community to remove and re-add the VM in the inventory, or even to remove the VM, create a new VMX file, and re-attach all the disks, but I can't do that now, so hopefully someone has an idea on how to fix this without having to power down the VM in question.

My environment is using vCenter 5.0 U1 and SRM 5.0.1.

Thanks for any hints,

Regards,

mvalkanov
VMware Employee

Hi,

Please open an SR (if you haven't already) and upload SRM + vSphere Replication support bundles for both protection and recovery sites.

The VRMS logs are needed to troubleshoot the "Optimistic locking failure".

About the issues experienced:

- there should be details about the real cause somewhere at the end of the "Error creating test bubble image for group ..." message, both in the UI and in the VRMS logs (/opt/vmware/hms/logs on the recovery-site VRMS)

- failing to create the test bubble replica image is not necessarily due to a disconnected and re-connected datastore

- reconfiguring the replication group detected either a changed target path for one or more of the disks, or a changed path for the replica of the config files, and internally triggered a replication unconfigure + configure (this is a known issue with vSphere Replication and is being addressed in a future release). The unconfigure part removed the placeholder disks at the recovery site, and the configure probably failed because it expected to find the initial copies. If the disk paths hadn't changed, the unconfigure and configure wouldn't have happened and VRMS would simply have updated the datastore managed object ID value.

- the appliances have been reinstalled, but perhaps the VRMS database was preserved and VRMS was configured from the existing database; simply re-installing VRMS and keeping the database won't change anything for a VM's replication configuration

To be able to troubleshoot the "Optimistic locking failure" we need to take a look at the VRMS logs and see what is wrong with the replication settings/state for that particular VM.

Regards,

Martin

moracius
Enthusiast

Hi Martin,

Thanks for the help. I'll look into opening an SR with VMware. In the meantime, it appears I have successfully found a workaround.

After the "optimistic" failure, I couldn't get that particular VM to be VR configured again, because I was receiving another kind of error: [Call "HmsGroup.CurrentSpec" for object "GID-aef7837-8dd7-8dd7-4aa2-bc7d-8d67cba47363" on Server "vrms-server.local.com" failed. An unknown error has occurred.]

Looking around the Internet, I found several people who have run into this GID error. Much of the advice on the forums is to remove the VM from the inventory and then re-add it. However, doing that requires shutting down the VM, which I wasn't keen to do.


I think the remove/re-add from inventory action works because, when we do that, vCenter generates a different Managed Object ID (MoID) for the VM. So if the VM's MoID was "vm-3882", when it is re-added it will be something different, and consequently the GID will also end up different. More information on Managed Object References (MoRefs) in vCenter Server can be found here: http://kb.vmware.com/kb/1017126.

The GID, I believe, is a VR ID for the VM, so when the first error hit (in my case, the "optimistic" failure), somehow that GID got "jinxed" in the VRMS DB, and SRM will never even try to configure VR on it again. The trick I found was, instead of creating a new MoID by re-adding the VM to the inventory, to simply delete the DB rows containing the problematic GID. I found such rows in two tables of the VRMS database that match the GID in the error message. The tables are:

  • [GrpSpecVM]
  • [PrimaryGroupEntity]
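
Before touching anything, it is worth locating the offending rows. A minimal T-SQL sketch follows; the `GroupID` column name is my assumption (the VRMS schema is undocumented), so inspect the actual column names first:

```sql
-- Sketch only: column names are guesses, so check the real schema first
-- (e.g. with sp_help). Use the GID from your own error message.
SELECT * FROM [GrpSpecVM]
WHERE  GroupID = 'GID-aef7837-8dd7-8dd7-4aa2-bc7d-8d67cba47363';

SELECT * FROM [PrimaryGroupEntity]
WHERE  GroupID = 'GID-aef7837-8dd7-8dd7-4aa2-bc7d-8d67cba47363';
```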

So, I used T-SQL scripts to delete the row where the relationship between the VM MoID and the GID is recorded, in [GrpSpecVM]; after that, I deleted the row where the relationship between the GID and the ConfigurationState is recorded, in [PrimaryGroupEntity]. With these actions I managed to right-click the VM and start VR again without receiving the GID error, and with the VM powered on the whole time. I'm aware that this row deletion is not supported by VMware (I haven't received this information from them), and I have no idea if it is going to cause any trouble in the future, but it sure fixed the GID error.
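
The deletion itself could look like the sketch below, again with a hypothetical `GroupID` column name. Modifying the VRMS database is unsupported, so take a full backup first:

```sql
-- Unsupported modification: back up the VRMS database before running.
-- 1. Remove the VM MoID <-> GID mapping.
DELETE FROM [GrpSpecVM]
WHERE  GroupID = 'GID-aef7837-8dd7-8dd7-4aa2-bc7d-8d67cba47363';

-- 2. Remove the GID <-> ConfigurationState row.
DELETE FROM [PrimaryGroupEntity]
WHERE  GroupID = 'GID-aef7837-8dd7-8dd7-4aa2-bc7d-8d67cba47363';
```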

Now back to my original problem ("optimistic" failure).

I also found a number of tips and hints about that; I tried all of them, and nothing worked. There was, however, one tip, available at http://vblitz.joelprophoto.com/blitz-setup-vsphere-replication-srm-5-0/, that I hadn't tried yet. It says:

  1. If initial configuration fails with an “optimistic locking failure”, immediately try to reconfigure again.

I hadn't paid attention to that tip, because I didn't see much sense in IMMEDIATELY doing the same thing without changing anything and expecting a different result, but in this case it actually worked! I had to repeat the same VR configuration three times on the same VM, and I noticed that each attempt left the configuration in a slightly different, not 100% complete state; I guess after the 3rd attempt the configuration finally "stuck" 100% in the VRMS database, and the initial sync started.

Anyway, I just wanted to post some information here about the new workaround I found for the GID error, and also to thank the http://vblitz.joelprophoto.com post for the tip I was initially reluctant to follow and that, in the end, was what actually worked.

Regards,

Moracy

