Solved: VMware SRM Reprotect Fails with Peer array ID provided in the SRM input is incorrect

cjscol · ‎07-18-2013

I have the following setup

Site-A

VMware vCenter 5.1 U1

VMware SRM 5.1.1

NetApp SRA 2.0.1P2

FAS3140C - Data ONTAP 8.1.2P4

Site-B

VMware vCenter 5.1 U1

VMware SRM 5.1.1

NetApp SRA 2.0.1P2

FAS3140C - Data ONTAP 8.1.2P4

I have created a basic Protection Group at Site-A containing a single VM with a single vmdk hard disk on an NFS volume.

The NFS volume is SnapMirrored to Site-B.

I can perform a planned migration to Site-B, reprotect, and then perform another planned migration back to Site-A. However, when I attempt to reprotect again so that I am ready for another recovery to Site-B, the reprotect fails on the first step, Configure Storage to Reverse Direction, with "Error - Failed to reverse replication for failed over devices. SRA command 'prepareReverseReplication' failed. Peer array ID provided in the SRM input is incorrect Check the SRM logs to verify correct peer array ID."

I cannot see anything in the filer logs or in the SRM logs to indicate what the issue is.

This happens for every Protection Group I create so it is not isolated to just this one volume. I have also tried with iSCSI VMFS volumes and get exactly the same results.

If I create a Protection Group at Site-B, I can recover to Site-A but cannot reprotect it to fail it back to Site-B.

Initially I thought the issue was that I could do a failback but couldn't perform a second reprotect because the SnapMirrors were left in the wrong state. I can now see that the real issue is that I cannot perform a reprotect from Site-A to Site-B.

I have completely un-installed SRM at both locations, removed the SRM database at both locations and started again but still get the same issue.

I've actually got IBM N series N6040 controllers and am using the IBM branded Data ONTAP and SRA. I have a call open with VMware and IBM but not getting very far.

Has anyone seen this issue before and got a solution?

Calvin Scoltock VCP 2.5, 3.5, 4, 5 & 6 VCAP5-DCD VCAP5-DCA http://pelicanohintsandtips.wordpress.com/blog LinkedIn: https://www.linkedin.com/in/cscoltock

cjscol · ‎07-25-2013

I have now identified what is causing this issue and can work around it until there is a fix from NetApp available.

The issue was that the filer name at one site is the filer name at the other site with a prefix added: in my case the Recovery Site filer was named NSERIES01 and the Protected Site filer was DRNSERIES01. Remember I had already performed a fail-over and fail-back, so the Protected Site was my original Recovery Site. So yes, the filer on the Protected Site for this Protection Group is DRNSERIES01 and the Recovery Site has NSERIES01 on it.

When the Reprotect task is run, the first step is to call the SRA with the command prepareReverseReplication. This calls reverseReplication.pl, which attempts to check that the SnapMirror is broken off. It gets the status of all of the SnapMirrors from the filer at the Recovery Site, i.e. in this case NSERIES01. It then goes through each of these looking for a match of the local-filer-name:volume-name in the source of the SnapMirror; for my test group it was attempting to match NSERIES01:NFS_VMware_Test. At this point the source of the SnapMirror is DRNSERIES01:NFS_VMware_Test, which is correct, but because the script is using a pattern-matching test it matches NSERIES01:NFS_VMware_Test to DRNSERIES01:NFS_VMware_Test, as NSERIES01:NFS_VMware_Test is contained within DRNSERIES01:NFS_VMware_Test. It then checks whether the destination of the SnapMirror matches the peerArrayID (in this case DRNSERIES01), which it does not, as the destination, correctly, is NSERIES01, and it then reports that the peerArrayID is incorrect.

If there is no match on the local-filer-name:volume-name in the source of the SnapMirror, the script goes on to check the destination of the SnapMirror. When it finds a match there, it checks whether the peerArrayID matches the source of the SnapMirror and, if it does, it then checks that the status of the SnapMirror is broken-off.

I never hit the issue with the first reprotect because DRNSERIES01:NFS_VMware_Test is not contained within the source of the SnapMirror (NSERIES01:NFS_VMware_Test). The script therefore goes on to the next test of checking for DRNSERIES01:NFS_VMware_Test in the destination of the SnapMirror, which it finds. It then checks the peerArrayID (NSERIES01) against the source, which also matches, and finally confirms that the SnapMirror relationship is broken-off.

I had changed the volume name on DRNSERIES01 a while ago because I thought the issue might have been due to the volume names being the same. However, I had changed it by putting a suffix of _Repl on the end, and therefore the script was still matching NSERIES01:NFS_VMware_Test to DRNSERIES01:NFS_VMware_Test_Repl.
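To illustrate the matching problem, here is a minimal Python sketch of the flow described above. It is a reconstruction from the behaviour seen in the logs, not the SRA's actual Perl code: the function names, the return strings and the (source, destination, state) tuple layout are all made up for the example.

```python
import operator


def check_reverse_replication(local_filer, volume, peer_array_id,
                              relationships, matches):
    """Walk the SnapMirror list the way the post describes.

    relationships: iterable of (source, destination, state) tuples, where
    source and destination are "filer:volume" strings (assumed layout).
    matches: function(needle, haystack) -> bool used to compare names.
    """
    needle = f"{local_filer}:{volume}"
    for source, destination, state in relationships:
        if matches(needle, source):
            # Local volume looks like the source, so the peer must be the destination.
            if destination.split(":", 1)[0] != peer_array_id:
                return "peer array ID incorrect"
            return state
        if matches(needle, destination):
            # Local volume is the destination, so the peer must be the source.
            if source.split(":", 1)[0] != peer_array_id:
                return "peer array ID incorrect"
            return state
    return "no matching relationship"


def substring_match(needle, haystack):
    # Unanchored test: "NSERIES01:..." is found inside "DRNSERIES01:...".
    return needle in haystack


# State after fail-over and fail-back: the mirror still points DR -> primary.
after_failback = [
    ("DRNSERIES01:NFS_VMware_Test", "NSERIES01:NFS_VMware_Test", "broken-off"),
]

buggy = check_reverse_replication(
    "NSERIES01", "NFS_VMware_Test", "DRNSERIES01", after_failback, substring_match)
exact = check_reverse_replication(
    "NSERIES01", "NFS_VMware_Test", "DRNSERIES01", after_failback, operator.eq)

print(buggy)  # peer array ID incorrect
print(exact)  # broken-off
```

With the substring test the local name is wrongly found in the source, so the peer check runs against the wrong end of the relationship. Comparing the whole filer:volume string sends the same input down the destination branch instead.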

I have now configured my test group as follows:

Protected Site

Filer Name = NSERIES01

Volume Name = NFS_VMware_Test_HMR

Recovery Site

Filer Name = DRNSERIES01

Volume Name = NFS_VMware_Test_EWC

I can now recover to the Recovery Site, reprotect, fail back, reprotect and fail over again, and continue performing recoveries and reprotects over and over again, as often as you can re-record on a Scotch VHS tape!

If the filer on Site-B had a suffix on its name instead of a prefix, e.g. it was named NSERIES01DR, or had a completely different name, then I would never have hit this bug in the SRA. I will be waiting for NetApp to fix the SRA. In the meantime I will be renaming all of my volumes at the recovery site to avoid this issue.
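As a rough pre-flight check before renaming, the following Python sketch flags any SnapMirror pair where one filer:volume string is contained within the other, which is the condition that triggered the failure for me. The function name and the list of pairs are hypothetical; feed it whatever source and destination names your own snapmirror status output shows.

```python
def find_risky_relationships(relationships):
    """Return (source, destination) pairs where one name contains the other.

    relationships: iterable of (source, destination) "filer:volume" strings.
    """
    risky = []
    for source, destination in relationships:
        if source != destination and (source in destination or destination in source):
            risky.append((source, destination))
    return risky


planned = [
    # Original naming: prefix on the DR filer, same volume name.
    ("NSERIES01:NFS_VMware_Test", "DRNSERIES01:NFS_VMware_Test"),
    # Suffix on the DR volume only: still contained.
    ("NSERIES01:NFS_VMware_Test", "DRNSERIES01:NFS_VMware_Test_Repl"),
    # Different suffix on each volume: no containment.
    ("NSERIES01:NFS_VMware_Test_HMR", "DRNSERIES01:NFS_VMware_Test_EWC"),
    # Suffix on the DR filer name instead of a prefix: no containment.
    ("NSERIES01:NFS_VMware_Test", "NSERIES01DR:NFS_VMware_Test"),
]

for source, destination in find_risky_relationships(planned):
    print(f"rename needed: {source} <-> {destination}")
```

Only the first two pairs are reported. The filer-suffix case is safe because the colon after the filer name stops NSERIES01: from appearing inside NSERIES01DR:.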

Smoggy · ‎07-18-2013

I've seen this before but I'm not 100% sure there is ever a single cause for this. Before you do anything else you need to be completely sure that your SnapMirror setup across both sites is valid, and I mean valid according to the install and admin guide that ships with the SRA and also according to this guide: http://media.netapp.com/documents/tr-4064.pdf

There are many ways to configure SnapMirror, many options that the SRA has for enumerating SnapMirror (e.g. ip_hostname_mapping), and different ways of defining connection names in snapmirror.conf. On top of that there are two sites, so two sets of all of these configurations. You need to check that your setup is consistent with the docs and also consistent at each site. Most NetApp "odd" issues I see relate to a mistake somewhere in the config, or simply a missed setting (such as ip_hostname_mappings) that needs to be in place for things to work with the customer's chosen naming conventions (long / short / FQDN or IP).

In the original NetApp 2.0 SRA there were some "Reprotect" bugs but those were fixed in later releases. There was also an issue some customers hit where they turned on "autodelete" for the SnapMirror volumes, which caused reprotect to fail. This is mentioned in the SRA admin guide; search for the keyword "autodelete".

If you have time please go through your config. Are you using qtrees here as separate exports or simply the whole volume? In all of the above I'm assuming you've deployed a fairly simple SnapMirror configuration. I have seen some odd multipath SnapMirror setups with multiple destinations that have caused odd errors as well. Are you creating the SnapMirror entries manually or via Ops Mgr?

I think you said you have logged an SR with VMware and IBM? Did you log an SR with each company or is the initial call to IBM? It would be useful to know the VMware ticket number.

cjscol · ‎07-18-2013

Thanks for this, Lee. I have checked and double-checked the configuration, as I had before, and I had already read NetApp TR-4064. I still cannot see what the issue is.

All of the SnapMirror relationships are using the host name for the source and destination Filers, no IP addresses used in snapmirror relationships, no FQDN and no connection names used.

No qtrees.

SnapMirror relationships have been created with Ops Manager 2.0R1, but they look good, as does snapmirror.conf.

Original Volume created with Virtual Storage Console with autodelete NOT enabled.

I raised an SR with VMware (13345880607), who identified the issue to be with the SRA and recommended raising an SR with IBM. I can now see that the error is coming from the SRA, so I guess it is down to IBM to help me fix it.

This is what I have discovered: when the SRA runs the reverseReplication.pl script with the command prepareReverseReplication, it is not detecting the broken-off SnapMirror. I am seeing the following messages in the SRM logs:

validating if path NFS_VMware_Test is valid mirrored device for given peerArrayId

curr state = invalid, prev state = ,source name:DRNSERIES02:NFS_VMware_Test, destination:NSERIES02:NFS_VMware_Test, path=NFS_VMware_Test, arrayID:NSERIES02

Skipping /vol/NFS_VMware_Test as peerArrayId DRNSERIES02 is not valid

If I run a snapmirror status on the controller at the site I have just failed over to, i.e. Site-A, I see the snapmirror is broken off and the Source and Destination Filers are listed as the Filer host names and not an IP address, FQDN or Connection Name.

When I do a reprotect after a failover to Site-B, I see the following equivalent messages in the SRM logs:

validating if path NFS_VMware_Test is valid mirrored device for given peerArrayId

curr state = , prev state = broken-off,source name:NSERIES02:NFS_VMware_Test, destination:DRNSERIES02:NFS_VMware_Test, path=NFS_VMware_Test, arrayID:DRNSERIES02

I'm wondering if it has got something to do with the length of the peerArrayId name, but surely at 11 characters this is not the issue.
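To compare the failing and working log lines field by field, a small parser helps. This is a minimal Python sketch that assumes the "curr state = ..." lines always have the layout quoted above; the regular expression and the function name are made up for the example, not part of the SRA or SRM.

```python
import re

# Layout taken from the two log lines quoted in this post (assumed stable).
LINE_RE = re.compile(
    r"curr state = (?P<curr>[^,]*), prev state = (?P<prev>[^,]*),"
    r"\s*source name:(?P<source>[^,]+), destination:(?P<destination>[^,]+),"
    r" path=(?P<path>[^,]+), arrayID:(?P<array_id>\S+)"
)


def parse_state_line(line):
    """Split one 'curr state = ...' log line into a dict of its fields."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    fields = {key: value.strip() for key, value in match.groupdict().items()}
    # Filer name is the part before the first colon in "filer:volume".
    fields["source_filer"] = fields["source"].split(":", 1)[0]
    fields["destination_filer"] = fields["destination"].split(":", 1)[0]
    return fields


failing = parse_state_line(
    "curr state = invalid, prev state = ,source name:DRNSERIES02:NFS_VMware_Test, "
    "destination:NSERIES02:NFS_VMware_Test, path=NFS_VMware_Test, arrayID:NSERIES02"
)
working = parse_state_line(
    "curr state = , prev state = broken-off,source name:NSERIES02:NFS_VMware_Test, "
    "destination:DRNSERIES02:NFS_VMware_Test, path=NFS_VMware_Test, arrayID:DRNSERIES02"
)

for key in ("curr", "prev", "source_filer", "destination_filer", "array_id"):
    print(f"{key}: failing={failing[key]!r} working={working[key]!r}")
```

In both lines the arrayID equals the destination filer, so the two relationships have the same shape. The only difference is in the state fields: invalid in the failing case, broken-off in the working one.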
