VMware Cloud Community
AllBlack
Expert

Path issues, LUN gone

Hi there,

I was in the process of creating a new VM when I noticed one of my LUNs had disappeared from host 1 in my cluster.

It was still available on host 2.

I checked the event log and saw the following error:

SCSI: 4473: Cannot find a path to device vmhba32:3:2 in a good state. Trying path vmhba32:3:2.

I don't have a clue what has caused this as it was working fine before the weekend.

As there was nothing on the LUN, I decided to remove it and add it again.

The storage was connected and I created my VM. Afterwards I simulated a failover on my cluster.

When host 1 came back up, my storage was gone again.

What can cause this?

My storage adapter still lists the path, but the LUN will not appear under Storage.

I also see an error that says: LVM: 4469: vml.<somelongstringhere>:1 may be snapshot: disabling access. See resignaturing section in SAN config guide.

Where do I start troubleshooting?

cheers

Please consider marking my answer as "helpful" or "correct"
snapper
Enthusiast

The key here would be the 'simulated failover on my cluster'.

Does this have anything to do with snapshotting the LUN and/or presenting it via another path?

Each LUN has a signature, which incorporates a number of factors into the naming, such as serial/path. If it sees an identical signature but via a different path it detects it as a 'snapshot', and prevents access to the second one. It's something like a split brain condition.

There are numerous forum posts about this if you search for 'resignaturing' or snapshot LUNs.

The reference is in the SAN Configuration Guide, in the 'resignaturing section'

www.vmware.com/pdf/esx25_san_cfg.pdf

If you find the need to 'resignature', keep in mind you will have to re-register your VMs, which is a major pain if you have resource pools / folders etc. It's not stated in the doco, but re-registering a VM gives it a new vm_id in the database, which means you will lose things like affinity rules, task and event history etc.
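If it comes to that, a loop from the service console saves some clicking. A rough sketch (the snap-* datastore layout and the dry-run 'echo' are my assumptions; drop the echo on a real ESX 3.x host to actually register):

```shell
# Dry run: print the vmware-cmd register command for each .vmx found on
# resignatured (snap-*) datastores under the given volumes directory.
# The paths here are hypothetical; drop the 'echo' on a real ESX 3.x host.
register_all() {
    for vmx in "$1"/snap-*/*/*.vmx; do
        [ -e "$vmx" ] || continue              # glob matched nothing
        echo vmware-cmd -s register "$vmx"
    done
}

register_all /vmfs/volumes
```

Remember this only puts the VMs back in the inventory; resource pool placement, affinity rules and so on still have to be redone by hand.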

Cheers,

SP

Don't forget to award points where appropriate :slightly_smiling_face:
kjb007
Immortal

Check your array as well. You may have storage-processor LUN thrashing, which can cause the LUN to be seen as different.

-KJB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
AllBlack
Expert

As far as I know nothing changed. I noticed the error as explained above so I added the storage again.

I thought everything was OK again, but when I later turned one of my hosts off as part of the simulation I got the same error.

So the initial error should not have anything to do with the failover. I think I really need to figure out what caused it.

Will look into the resignaturing of the LUN.

cheers

Please consider marking my answer as "helpful" or "correct"
AllBlack
Expert

Hi Kjb,

I am not up to speed with it all just yet. We have a CLARiiON CX3-20c. That makes it an active/passive array, and it is still using the MRU policy.

Any ideas how I can check into it on the SAN side? I will do some investigation in the meantime.

Also, as I said, the LUN is detected under my storage adapter but not under Storage. When I try to add it, the device shows up in the storage wizard but its availability says "None".

Maybe that helps in troubleshooting. Thanks guys

Please consider marking my answer as "helpful" or "correct"
snapper
Enthusiast

That definitely sounds symptomatic of a snapshot LUN. It will display under the storage configuration, but it won't allow you to view or create a VMFS volume on it, which you really don't want to do if it's a presentation of the same LUN.

Is the LUN being presented with the same LUN number on each host? If not, re-present it with the same LUN number on the host where it isn't visible.

I've used the VMware doco hds_svd_technote.pdf (www.vmware.com/pdf/hds_svd_technote.pdf) to fix this issue in the past, presuming that it is a snapshot LUN and not one of the other issues suggested, such as issues at the array/fabric level.

The ESX logfiles should display some information about what it thinks the problem is.

1. Do a rescan on each adapter for new storage

2. Check for snapshot messages from the service console on each esx host:

grep -i snapshot /var/log/vmkwarning*

grep -i snapshot /var/log/vmkernel*

If it is being viewed as a snapshot LUN, it should be displayed in these logfiles and on the VC console, and possibly in /var/log/messages as well
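For what it's worth, this is the kind of line those greps should turn up. A self-contained sample (the timestamp, host name and vml string below are made up for illustration):

```shell
# Write a hypothetical sample of the vmkernel alert to a temp file and run
# the same grep against it; on a real host you'd grep /var/log/vmkernel*
log=$(mktemp)
cat > "$log" <<'EOF'
Apr 22 08:17:27 esx1 vmkernel: 0:15:13:08.012 cpu0:1040) ALERT: LVM: 4469: vml.0200...:1 may be snapshot: disabling access. See resignaturing section in SAN config guide.
EOF
grep -i snapshot "$log"
```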

Don't forget to award points where appropriate :slightly_smiling_face:
kjb007
Immortal

Looking at your HBA, it looks like you're using iSCSI. From what I've seen, in an active/passive array one SP owns the LUN and presents it to hosts. If something happens and that SP loses the LUN, it fails over and ownership passes to the second SP. This is where things usually go awry. Since the same LUN is now presented to the server via a different path, and possibly with a different ID, ESX sees that the LUN already has a VMFS on it, and to avoid the risk of writing to a LUN that is not its own, the host disables access to it instead. This is where the snapshot LUN issue comes up, which is what your log is stating.

Usually, in an active/passive array, once the original owning SP is back up, the disk is failed back to the original, and ownership is returned to the original. If this ownership change is happening constantly, then this is what is called SP LUN thrashing, and causes, at best slowness, and at worst, LUNs going away. I would make sure there isn't a deeper issue with your array, and that it looks healthy. If it is healthy, then I would follow instructions already posted to Resignature the LUN with a new signature, so you can once again mount it and re-register your VM's, if any.

Hope that helps.

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
AllBlack
Expert

Talk about a steep learning curve! I don't think anything has gone wrong on the SAN. I have no alerts whatsoever regarding an SP failover.

Yes, I am using the iSCSI software adapter.

This is the info I get under storage adapters on both hosts. It is identical:

Path: vmhba32:3:2

Canonical Path: vmhba32:3:2

Type: disk

Capacity: 512 GB

LUN ID: 2

Will try and re-signature.

Please consider marking my answer as "helpful" or "correct"
AllBlack
Expert

Plenty of those entries in the log files on both hosts.

/var/log/vmkernel.3:Apr 22 08:17:27 tur-esx1 vmkernel: 0:15:13:08.012 cpu0:1040) ALERT: LVM: 4469: vml.020002000060060160a2a01a00268cd81bf00cdd11524149442035:1 may be snapshot: disabling access. See resignaturing section in SAN config guide.

It seems to have started a few days ago, but I cannot really remember what I did back then.

I wonder now whether I deleted the original LUN and created a new one with the same name but smaller in size.

I followed the resignaturing steps. I did a rescan, and now I have a datastore called snap-000000002-mystorage. I renamed it.
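For anyone who hits this later, the sequence I followed boils down to three service-console commands. Sketched here as a dry run that just echoes them (the vmhba32 adapter name is from my setup; run the commands for real on the affected host):

```shell
# Dry-run sketch of the ESX 3.x resignature procedure; each step is
# echoed rather than executed so it can be reviewed before running.
resignature_steps() {
    echo "esxcfg-advcfg -s 1 /LVM/EnableResignature"   # allow resignaturing
    echo "esxcfg-rescan vmhba32"                       # rescan; volume returns as snap-XXXXXXXX-<label>
    echo "esxcfg-advcfg -s 0 /LVM/EnableResignature"   # switch it back off afterwards
}

resignature_steps
```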

Looks like I am back in business. I am a bit confused about the entire "snapshot" thing though

Please consider marking my answer as "helpful" or "correct"
kjb007
Immortal

As well you should be. The 'snapshot' name can be a bit of a misnomer, but the intent, from what I remember, dates from when LUNs used to be mirrors of other LUNs. In that sense, they were a snapshot of another LUN. To keep the host from mounting an original LUN alongside its snapshot, and possibly destroying both, the server would hide what it thought was the 'snapshot'. That's all well and good, but when a LUN that was previously mounted shows up again under a different path, and possibly a different LUN id, the old rule of "this LUN must be a snapshot of my other LUN, so I'll make sure I don't corrupt it" kind of bites us in the bu!@.
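A toy way to picture the check (the numbers are made up, and the real comparison also factors in the SCSI serial/path, not just the LUN id):

```shell
# Simplified illustration: the VMFS header records the LUN id the volume
# was created at; if the id the array presents now differs, ESX flags the
# volume as a possible snapshot and disables access.
snapshot_check() {
    if [ "$1" -ne "$2" ]; then
        echo "may be snapshot: disabling access"
    else
        echo "ok"
    fi
}

snapshot_check 2 7   # hypothetical: formatted at LUN 2, now presented at LUN 7
```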

Better safe than sorry though, since there are so many threads in these forums on how to fix it.

Glad you're back in business though.

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
soudertech
Enthusiast

I am having a very similar problem, except that I am not using iSCSI. The biggest difference is that I have not lost a LUN or any other visibility... I just noticed this issue in the last few days, following a FLARE code upgrade, which requires one fibre switch to go down at a time.

The error is as follows:

7:23:59.659 cpu15:1049) SCSI: 4473: Cannot find a path to device vmhba1:0:7 in a good state. Trying path vmhba2:0:7

I am receiving this error on 3 of 5 hosts... All seems well, no latency or anything, just a blinking amber light on the front of the host (DL585 Opteron).

Any ideas..?

CLARiiON CX300, 2 Gb

5 x DL585, each with 2 QLogic 4 Gb HBAs

Live long and virtualize...!
rmcclinnis
Contributor

I am on ESX 3.0.5 and we had the same issue; it appears that ESX decided it didn't want to recognize the partition anymore. We had about 8 production servers on the LUN, and they have vanished. Does anyone know of a command to try to get the partition back?
