VMware Cloud Community
rreynol

LUN resignature question

We have 21 ESX servers all sharing the same set of LUNs. When the 21st ESX server was added some weeks ago, an error in one of the LUN ID settings on the SAN side meant that this one LUN was presented to the 21st ESX server with a different signature from the one the other 20 ESX servers were already using. We did not discover the error right away.

The 21st ESX server generated an error because it saw the different signature, and it disabled access to the LUN. Over this past weekend we had some major SAN maintenance that caused all the ESX servers to go through failover on the HBAs. The net result is that only 3 ESX servers can still see the LUN (they are the ESX servers hosting the VMs on this one LUN); all the other ESX servers show a broken link. We have corrected the LUN ID for the 21st ESX server on the SAN side, but reboots and rescans of that server still do not clear up the problem.

VMware support suggests that we power off all the VMs on the LUN in question, turn on resignaturing on the 21st server, rescan the 21st server, turn off resignaturing on the 21st server, and then rescan all the other 20 ESX servers. The VMs on the LUN will then be orphaned, so we will have to add them back to inventory before we can power them on again.

I have two questions. Is there any solution that would not require downtime for the VMs? And how is it that this one LUN ID problem would impact all the other ESX servers, and not just the 21st server?

-Robert

8 Replies
mike008

If the disk signature does not match the characteristics of the LUN as it is presented to the host (flags, LUN ID, etc.), the ESX host will generally view it as a snapshot LUN. If you set LVM.DisallowSnapshotLun to 0 under Advanced Settings --> LVM on the host, you should then be able to see the LUNs. There may be other issues, though. I have had to deal a lot with resignaturing and snapshot LUNs, and it can be a bit of a headache.
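The behavior described above can be sketched as a small decision table. This is only an illustrative model, not VMware code: the names `Outcome`, `LvmSettings`, and `evaluate` are made up for this example, the comparison is reduced to the LUN ID alone (the real check involves more characteristics), and the defaults and the precedence of the two settings are assumptions based on how ESX 3.x is generally documented to behave.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    MOUNTED = "mounted normally"
    ACCESS_DISABLED = "treated as snapshot, access disabled"
    MOUNTED_AS_SNAPSHOT = "treated as snapshot, mounted with existing signature"
    RESIGNATURED = "treated as snapshot, new signature written"


@dataclass(frozen=True)
class LvmSettings:
    # Assumed ESX 3.x defaults: snapshot LUNs disallowed, resignaturing off.
    disallow_snapshot_lun: int = 1
    enable_resignature: int = 0


def evaluate(recorded_lun_id: int, presented_lun_id: int,
             settings: LvmSettings) -> Outcome:
    """Simplified model: compare the LUN ID recorded in the volume header
    with the LUN ID the host is actually presented with."""
    if recorded_lun_id == presented_lun_id:
        return Outcome.MOUNTED
    # Mismatch: the host views the volume as a snapshot LUN.
    # Assumption: resignaturing, when enabled, takes precedence.
    if settings.enable_resignature:
        return Outcome.RESIGNATURED
    if settings.disallow_snapshot_lun:
        return Outcome.ACCESS_DISABLED
    return Outcome.MOUNTED_AS_SNAPSHOT
```

With the assumed defaults, a mismatched LUN ID gives `Outcome.ACCESS_DISABLED`, which matches the broken link seen on the hosts in this thread. Only the `RESIGNATURED` branch changes what is on disk, which is why it can affect every other host sharing the LUN.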

Although I feel that I don't quite have all the info on your situation to accurately give my $.02, I'll try. After re-reading your post: unless you can present the LUN on the same ID, the same way as before, I don't think there is any long-term solution other than what VMware support suggests. You will have to resignature it at some point to match its presentation. With regard to your second question, the key is that the signature must match the presentation. Either the presentation was changed for all the other servers at the same time, or the disk was resignatured to match the presentation on that one server, breaking it for everyone else. One thing I can advise, though: if you have VMs running on that LUN right now (on those three hosts), don't rescan those hosts until you have a plan of attack for this situation, unless you can afford the downtime.

Mike

P.S. Check out the VMworld 2007 breakout session slides for "Top Support Issues & How to Troubleshoot Them (Part I)", Issue #2: VMFS Volumes and Snapshots. They may be helpful. I'm not sure if I can post them here. If I can't, moderators, please remove the attachment.

rreynol

Thank you, Mike. I did try changing the Advanced Settings, but none of the combinations made any difference. I expect that is because the LUN is in use by the 3 VMs running on it. The attachment is helpful for understanding the details of what is going on. It is striking that one mistake in a LUN ID can disrupt so many ESX servers, although this may have been complicated by the SAN maintenance and by the fact that we did not catch the LUN ID problem until after the other ESX servers had effectively done a rescan due to the maintenance.

If no one else chimes in I will give you the rest of the points. I appreciate the reply.

-Robert

rreynol

Just another thought. I have the original LUN ID and GUID of the volume, which are still valid for the three ESX servers. Is there no file on ESX that I can modify on the other hosts, particularly the 21st host that went bad, to set this all back to the way it was originally? What if we removed this LUN from the ESX hosts that do not see it properly, did a rescan, and then added it back and did another rescan? I do not want to do anything that may cause more harm; I am just trying to think of ways to avoid downtime for the VMs.

mike008

Assuming the LUN has NOT been resignatured, then yes, in theory I believe it can be presented back to the servers the same way it was. Rescan, and if the data in the LVM header matches what is returned to the server, then voila, you should be back to the original configuration. The big question is: what changed about the LUN presentation? Is it just the ID? Something else? This is the area where I didn't feel like I quite had all the info from the original post.

Mike

rreynol

The original LVM looks like this, on an ESX server that still sees the LUN since it is hosting a VM on the LUN:

# ls -la 06\:30

lrwxr-xr-x 1 root root 35 Apr 14 16:37 06:30 -> 47506876-0826c591-23f1-0018fe76af06

This is what the LVM looks like on the 21st ESX server, where the mistake was made in the LUN id when the SAN admin assigned it to the 21st ESX host. Since then the mistake has been corrected and the 21st ESX server rebooted.

lrwxr-xr-x 1 root root 35 Apr 14 16:40 snap-00000003-06:30 -> 48037382-a05b450c-6b1c-0017a44c7a51

Does this mean that LUN 06:30 has a new signature and there is not much else to be done other than what VMware support suggests?
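As a rough aid for comparing the two listings above, the sketch below pulls the volume name and the symlink target out of each `ls -la` line and checks whether the targets agree. The helper name `symlink_target` is invented for this example, and it assumes volume names without spaces. A differing target shows that the two hosts are not resolving the volume to the same identifier; on its own it is not treated here as proof of what was written to the LUN.

```python
from typing import Tuple


def symlink_target(ls_line: str) -> Tuple[str, str]:
    """Return (link name, link target) from one `ls -la` symlink line."""
    left, sep, target = ls_line.rpartition(" -> ")
    if not sep:
        raise ValueError("not a symlink listing line")
    # The link name is the last whitespace-separated field before ' -> '.
    name = left.split()[-1]
    return name, target.strip()


# The two lines quoted in this post.
working = ("lrwxr-xr-x 1 root root 35 Apr 14 16:37 06:30 -> "
           "47506876-0826c591-23f1-0018fe76af06")
host21 = ("lrwxr-xr-x 1 root root 35 Apr 14 16:40 snap-00000003-06:30 -> "
          "48037382-a05b450c-6b1c-0017a44c7a51")

name_a, uuid_a = symlink_target(working)
name_b, uuid_b = symlink_target(host21)
print(name_a, name_b, uuid_a == uuid_b)  # prints: 06:30 snap-00000003-06:30 False
```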

mike008

Unfortunately, that doesn't tell us whether or not the LUN was resignatured. It is being viewed by the 21st ESX server as a snapshot LUN (hence the prefixed name assigned to it), so LVM.DisallowSnapshotLun must have been changed from its default setting of enabled. Hopefully LVM.EnableResignature was not also changed and enabled. The rest of the ls -la output is just a GUID, not the header info we are looking for. Unless you have a backup of the VMFS header, we have nothing to compare it to, so the only way to tell is to rescan the LUN from a server that until then saw the old signature (which will bring down your working servers, I'm pretty sure) and then look at the vmkernel log. Can you sacrifice 1 of 3? Probably not. Otherwise, I am not sure how we would determine whether the signature was rewritten. I think I determined it before by looking through the vmkernel logs of the server I suspected as the culprit. If you know when the 21st server was added, the log from around that time should say that it saw the LUN as a snapshot and, because resignaturing was enabled, that it was resignaturing it (or something like that).
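The log search described above could be narrowed down with a small filter like the following. It is a minimal sketch: the function name `find_lvm_events` and the keyword pattern are assumptions for illustration, since the exact wording of the vmkernel messages is not given in this thread, so the match is deliberately loose.

```python
import re
from typing import Iterable, List

# Loose, case-insensitive match; the exact log wording is not known here.
PATTERN = re.compile(r"snapshot|resignatur", re.IGNORECASE)


def find_lvm_events(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention snapshots or resignaturing."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]
```

It accepts any iterable of lines, so an open file handle for a copy of the suspect host's vmkernel log can be passed in directly. The matching lines can then be checked against the date the 21st server was added.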

rreynol

Thank you Mike for all of your help. I will take the better safe than sorry approach and follow VMware's advice.

rreynol

We completed the resignature task successfully. Words of advice to anyone searching the forums about how to handle resignaturing: DO NOT change the Advanced Settings for resignaturing or for presentation of snapshot LUNs without first calling VMware support. You could run the risk of losing all the other LUNs in your cluster. Support told me some horror stories about those who did it on their own and paid the price. I suppose if you really know what you are doing then it is possible to make the attempt, but the risk is great if you get it wrong.
