Last week I did a disaster recovery test in which I failed over to a secondary group and then failed back to the primary. Our environment consists of ESX servers connected to two EqualLogic groups, which are replication partners of each other.
During the test, it became clear that ESX did not resignature the volumes, although I had set LVM.EnableResignature to 1.
Is there a way to force ESX to resignature a volume?
First, the ESX server has to see the LUNs. If it does, then after a rescan you will see messages in the vmkernel log to that effect: snapshot LUN seen, disabling access, see resignature section. After you enable resignaturing, you have to force a rescan of your HBAs. Otherwise, for DR purposes, I set LVM.DisallowSnapshotLun to 0, since I want to use the snapshot LUNs and don't want to resignature and rename my datastore.
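To make the order of operations concrete, here is a minimal Python sketch that only builds the service console command lines for "enable resignaturing, then rescan". It assumes an ESX 3.x service console with esxcfg-advcfg and esxcfg-rescan available; the adapter name vmhba32 is a placeholder, not taken from this thread, and the function name is made up for illustration. Nothing is executed.

```python
def resignature_commands(adapters, enable=True):
    """Return the command lines, in order, as argument lists.

    Assumption: ESX 3.x service console tools. The commands are only
    built here, never run, so they can be reviewed first.
    """
    value = "1" if enable else "0"
    # Set the advanced option first ...
    commands = [["esxcfg-advcfg", "-s", value, "/LVM/EnableResignature"]]
    # ... then rescan every adapter so the new setting is applied to the LUNs.
    commands += [["esxcfg-rescan", adapter] for adapter in adapters]
    return commands


def as_text(commands):
    """Join each argument list into a printable command line."""
    return [" ".join(cmd) for cmd in commands]
```

For example, as_text(resignature_commands(["vmhba32"])) gives the two lines to type on the console. Substitute the adapter names that your own host reports.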
My issue is that I want to force my LUNs to be resignatured, so I set LVM.EnableResignature to 1. The reason is that all my ESX servers see both the primary and the secondary SAN, to enable Storage VMotion from the first SAN to the second and vice versa. What I saw in my DR test is that the moment the primary SAN came back online, the ESX servers immediately started to write to the original volumes instead of the promoted replica volumes (sorry for the EqualLogic terminology). This led to corruption of various VMs.
Since, for some reason, my volumes were not resignatured even with LVM.EnableResignature set to 1, I want to force it.
BTW: Is there any description available of how ESX decides to perform a resignature?
Here's a good doc that may help: http://communities.vmware.com/servlet/JiveServlet/download/781793-3097/PS_TA48_288131_166-1_FIN_v2.p...
Does ESX see those LUNs as snapshots and report them as such in the vmkernel log?
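One way to check is to filter the vmkernel log for the relevant lines. A minimal Python sketch follows, assuming a copy of the log (typically /var/log/vmkernel on the service console) is read on a workstation with Python 3. The exact message text is not quoted in this thread, so the keywords "snapshot" and "resignatur" are a guess and may need adjusting.

```python
import re

# Assumed keywords; the real vmkernel wording may differ.
KEYWORDS = re.compile(r"snapshot|resignatur", re.IGNORECASE)


def snapshot_lines(lines):
    """Keep only the lines that mention snapshots or resignaturing."""
    return [line.rstrip("\n") for line in lines if KEYWORDS.search(line)]


def scan_file(path):
    """Read a log file and return the matching lines."""
    with open(path, errors="replace") as log:
        return snapshot_lines(log)
```

Calling scan_file("vmkernel") on a copy of the log taken right after the rescan shows whether the LUN was reported as a snapshot at all.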
The link to the PDF was very helpful, thanks.
I checked the vmkernel log, and the LUN did not show up as a snapshot when I did my DR test. I decided to do another test where I left the SAN on but took a volume offline and brought the replicated volume online on the other SAN. I did a rescan, and in this case the volume was resignatured.
As I thought that taking a volume offline might be different from cutting the power to the SAN, I did another test where I cut off the network access from my test ESX server to the SAN where the volume resides (we are using iSCSI) and brought the replica online. After that I did a rescan, and again ESX decided to do a resignature.
I am puzzled why the volumes were not resignatured in the original test but were in the last two tests. Can there be some timing involved? During the last two tests the time between cutting off access and the rescan was < 15 min, whereas in the original test it was around 2 hours.
I decided to rerun the original test next week (cutting off the power to the primary SAN and bringing the replicated volumes online on the secondary SAN). What information should I capture during the test to get a good understanding of what is happening?
There should be no issue regarding timing. As long as the LUN has the same VMFS header but is being presented through another target and/or under another LUN ID, it should be seen as a snapshot, and resignaturing should occur.
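The rule described here can be written down as a small decision model. This is a simplified Python sketch of the reply's description combined with the two LVM settings mentioned earlier in the thread, not the actual ESX implementation; what exactly ESX records in the VMFS header and compares is not stated here, and the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LunIdentity:
    """How a LUN is (or was) presented: target plus LUN ID."""
    target: str
    lun_id: int


def is_snapshot(recorded, presented):
    """Same VMFS header, but a different target and/or LUN ID."""
    return recorded != presented


def action(recorded, presented, enable_resignature, disallow_snapshot_lun):
    """Return what the host would be expected to do on a rescan."""
    if not is_snapshot(recorded, presented):
        return "mount"
    if enable_resignature:       # LVM.EnableResignature = 1
        return "resignature"
    if disallow_snapshot_lun:    # LVM.DisallowSnapshotLun = 1
        return "disable access"
    return "mount as-is"         # both settings 0: use the snapshot LUN unchanged
```

With a different target and LVM.EnableResignature set to 1, this model returns "resignature", which is the expected behavior, so a case where the volume is not resignatured means the host did not detect the LUN as a snapshot in the first place.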
Today I did some tests:
1. Switched the primary SAN off
2. Promoted the replicated volumes (on SAN)
3. Did a rescan on an ESX server. The volume was not resignatured
4. Put a snapshot of the promoted volume online (on the SAN)
5. Did a rescan. This volume was resignatured
6. Demoted the volume back to replica (on the SAN)
7. Promoted it again (on the SAN)
8. Did a rescan. This time, the promoted volume was resignatured!!!
When I skip step 5, the final result is that the volume is not resignatured, so it seems to be an ESX issue.
I am opening a case with VMware for this.
When the primary SAN was turned off, did the LUN ID and/or target change for the promoted volume?
The iSCSI target ID for the promoted volume is different from the original volume. The LUN ID remains the same (I do not have any control over this).