VMware Cloud Community
pvenkat14
Contributor

Not able to recognize storage LUNs from ESX 3.5 server

Hi,

We have 3 ESX 3.5 servers that are identical in terms of hardware configuration. 4 LUNs are presented to all 3 ESX servers through QLogic QLA2432 interfaces. The issue is that we can see the SAN LUNs from 2 of the servers but not from the 3rd. After a little investigation, we noticed the following difference: on the two hosts where the storage LUNs are visible, we see these two messages in dmesg and /var/log/messages:

scan_scsis starting finish

scan_scsis done with finish

Whereas on the host where we cannot see the storage LUNs, we do not see these log entries at all.

Can anybody help us troubleshoot this issue? Is there a command on ESX to see the status of the Fibre Channel cards, i.e. whether they are online or offline?

Thanks in Advance

dickybird
Enthusiast

We had a similar issue and it was resolved by rebooting the ESX server after migrating all VMs to other ESX hosts.

If you cannot reboot, you can try rescanning the LUNs from the command line.

To rescan the LUNs on an adapter:

# esxcfg-rescan vmhba1
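
To answer the other question about checking whether the fibre cards are online: with the QLogic driver on ESX 3.x you can usually read the driver's proc node from the service console. This is only a sketch from memory, and the exact directory name depends on your driver version, so adjust the adapter number to match your host:

# ls /proc/scsi/                 (there should be a qla2300-style directory for the QLogic driver)
# cat /proc/scsi/qla2300/1       (look for the link/loop state lines for that HBA)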

pvenkat14
Contributor

We tried rebooting without any luck. The main difference I noticed is that whenever you run a rescan on an ESX host, /var/log/messages gets updated with a lot of messages.

Please see the logs below from the successful server (where I can see the storage LUNs) and the failed server (where I cannot see the storage LUNs).

_Successful server_

Jul 28 08:58:52 usnovpesx1 kernel: VMWARE: Unique Device attached as scsi disk sdb at scsi1, channel 0, id 0, lun 0

Jul 28 08:58:52 usnovpesx1 kernel: Attached scsi disk sdb at scsi1, channel 0, id 0, lun 0

Jul 28 08:58:52 usnovpesx1 kernel: scan_scsis starting finish

Jul 28 08:58:52 usnovpesx1 kernel: SCSI device sdb: 1070287500 512-byte hdwr sectors (547987 MB)

Jul 28 08:58:52 usnovpesx1 kernel: sdb:

Jul 28 08:58:52 usnovpesx1 kernel: scan_scsis done with finish

Jul 28 08:58:52 usnovpesx1 kernel: scsi singledevice 1 0 0 1

Jul 28 08:58:52 usnovpesx1 kernel: Vendor: HITACHI Model: OPEN-V*15 Rev: 5009

Jul 28 08:58:52 usnovpesx1 kernel: Type: Direct-Access ANSI SCSI revision: 02

Jul 28 08:58:52 usnovpesx1 kernel: VMWARE SCSI Id: Supported VPD pages for sdc : 0x0 0x80 0x83 0xc0 0xe0

Jul 28 08:58:52 usnovpesx1 kernel: VMWARE SCSI Id: Device id info for sdc: 0x2 0x1 0x0 0x14 0x48 0x49 0x54 0x41 0x43 0x48 0x49 0x20 0x52 0x35 0x30 0x30 0x32 0x39 0x35 0x35 0x30 0x33 0x30 0x35 0x1 0x10 0x0 0x2 0x2b 0x2 0x1 0x3 0x0 0x10 0x60 0x6 0xe 0x80 0x4 0x29 0x55 0x0 0x0 0x0 0x29 0x55 0x0 0x0 0x3 0x5

Jul 28 08:58:52 usnovpesx1 kernel: VMWARE SCSI Id: Id for sdc 0x60 0x06 0x0e 0x80 0x04 0x29 0x55 0x00 0x00 0x00 0x29 0x55 0x00 0x00 0x03 0x05 0x4f 0x50 0x45 0x4e 0x2d 0x56

Jul 28 08:58:52 usnovpesx1 kernel: VMWARE: Unique Device attached as scsi disk sdc at scsi1, channel 0, id 0, lun 1

Jul 28 08:58:52 usnovpesx1 kernel: Attached scsi disk sdc at scsi1, channel 0, id 0, lun 1

Jul 28 08:58:52 usnovpesx1 kernel: scan_scsis starting finish

Jul 28 08:58:52 usnovpesx1 kernel: SCSI device sdc: 1070287500 512-byte hdwr sectors (547987 MB)

Jul 28 08:58:52 usnovpesx1 kernel: sdc: unknown partition table

Jul 28 08:58:52 usnovpesx1 kernel: scan_scsis done with finish

Jul 28 08:58:52 usnovpesx1 kernel: scsi singledevice 1 0 0 2

Jul 28 08:58:52 usnovpesx1 kernel: Vendor: HITACHI Model: OPEN-V*15 Rev: 5009

Jul 28 08:58:52 usnovpesx1 kernel: Type: Direct-Access ANSI SCSI revision: 02

Jul 28 08:58:53 usnovpesx1 kernel: VMWARE SCSI Id: Supported VPD pages for sdd : 0x0 0x80 0x83 0xc0 0xe0

Jul 28 08:58:53 usnovpesx1 kernel: VMWARE SCSI Id: Device id info for sdd: 0x2 0x1 0x0 0x14 0x48 0x49 0x54 0x41 0x43 0x48 0x49 0x20 0x52 0x35 0x30 0x30 0x32 0x39 0x35 0x35 0x30 0x43 0x45 0x35 0x1 0x10 0x0 0x2 0x2b 0x2 0x1 0x3 0x0 0x10 0x60 0x6 0xe 0x80 0x4 0x29 0x55 0x0 0x0 0x0 0x29 0x55 0x0 0x0 0xc 0xe5

Jul 28 08:58:53 usnovpesx1 kernel: VMWARE SCSI Id: Id for sdd 0x60 0x06 0x0e 0x80 0x04 0x29 0x55 0x00 0x00 0x00 0x29 0x55 0x00 0x00 0x0c 0xe5 0x4f 0x50 0x45 0x4e 0x2d 0x56

Jul 28 08:58:53 usnovpesx1 kernel: VMWARE: Unique Device attached as scsi disk sdd at scsi1, channel 0, id 0, lun 2

Jul 28 08:58:53 usnovpesx1 kernel: Attached scsi disk sdd at scsi1, channel 0, id 0, lun 2

Jul 28 08:58:53 usnovpesx1 kernel: scan_scsis starting finish

Jul 28 08:58:53 usnovpesx1 kernel: SCSI device sdd: 1070287500 512-byte hdwr sectors (547987 MB)

Jul 28 08:58:53 usnovpesx1 kernel: sdd: sdd1

Jul 28 08:58:53 usnovpesx1 kernel: scan_scsis done with finish

Jul 28 08:58:53 usnovpesx1 kernel: scsi singledevice 1 0 0 3

Jul 28 08:58:53 usnovpesx1 kernel: Vendor: HITACHI Model: OPEN-V*15 Rev: 5009

Jul 28 08:58:53 usnovpesx1 kernel: Type: Direct-Access ANSI SCSI revision: 02

Jul 28 08:58:53 usnovpesx1 kernel: VMWARE SCSI Id: Supported VPD pages for sde : 0x0 0x80 0x83 0xc0 0xe0

Jul 28 08:58:53 usnovpesx1 kernel: VMWARE SCSI Id: Device id info for sde: 0x2 0x1 0x0 0x14 0x48 0x49 0x54 0x41 0x43 0x48 0x49 0x20 0x52 0x35 0x30 0x30 0x32 0x39 0x35 0x35 0x30 0x43 0x45 0x38 0x1 0x10 0x0 0x2 0x2b 0x2 0x1 0x3 0x0 0x10 0x60 0x6 0xe 0x80 0x4 0x29 0x55 0x0 0x0 0x0 0x29 0x55 0x0 0x0 0xc 0xe8

Jul 28 08:58:53 usnovpesx1 kernel: VMWARE SCSI Id: Id for sde 0x60 0x06 0x0e 0x80 0x04 0x29 0x55 0x00 0x00 0x00 0x29 0x55 0x00 0x00 0x0c 0xe8 0x4f 0x50 0x45 0x4e 0x2d 0x56

Jul 28 08:58:53 usnovpesx1 kernel: VMWARE: Unique Device attached as scsi disk sde at scsi1, channel 0, id 0, lun 3

Jul 28 08:58:53 usnovpesx1 kernel: Attached scsi disk sde at scsi1, channel 0, id 0, lun 3

Jul 28 08:58:53 usnovpesx1 kernel: scan_scsis starting finish

Jul 28 08:58:53 usnovpesx1 kernel: SCSI device sde: 1070287500 512-byte hdwr sectors (547987 MB)

Jul 28 08:58:53 usnovpesx1 kernel: sde: sde1

Jul 28 08:58:53 usnovpesx1 kernel: scan_scsis done with finish

Jul 28 08:58:54 usnovpesx1 kernel: scsi5: remove-single-device 0 0 0 failed, device busy(6).

Jul 28 08:58:54 usnovpesx1 kernel: scsi singledevice 5 0 0 0

Jul 28 09:13:58 usnovpesx1 sshd[17712]: Connection from 161.89.145.221 port 34806

Jul 28 09:14:00 usnovpesx1 sshd(pam_unix)[17712]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=161.89.145.221 user=aoadmin

Jul 28 09:14:02 usnovpesx1 sshd[17712]: Failed password for aoadmin from 161.89.145.221 port 34806 ssh2

Jul 28 09:14:04 usnovpesx1 sshd[17712]: Accepted password for aoadmin from 161.89.145.221 port 34806 ssh2

Jul 28 09:14:04 usnovpesx1 sshd(pam_unix)[17714]: session opened for user aoadmin by (uid=0)

Jul 28 09:14:07 usnovpesx1 su(pam_unix)[17749]: session opened for user root by aoadmin(uid=500)

_Failed server_

Jul 28 09:04:58 usnovpesx2 kernel: scsi5: remove-single-device 0 0 0 failed, device busy(6).

Jul 28 09:04:58 usnovpesx2 kernel: scsi singledevice 5 0 0 0

Jul 28 09:28:06 usnovpesx2 kernel: scsi5: remove-single-device 0 0 0 failed, device busy(6).

Jul 28 09:28:06 usnovpesx2 kernel: scsi singledevice 5 0 0 0

epoh
Contributor

I am working on the server mentioned in the previous post. I just wanted to add that I noticed modprobe failing on boot:

root@usnovpesx2 sbin# dmesg | grep scsi

kmod: failed to exec /sbin/modprobe -s -k scsi_hostadapter, errno = 2

kmod: failed to exec /sbin/modprobe -s -k scsi_hostadapter, errno = 2

scsi0 : vsd

scsi1 : qla2300_707_vmw

scsi2 : qla2300_707_vmw

scsi3 : qla2300_707_vmw

scsi4 : qla2300_707_vmw

scsi5 : megaraid_sas

When I manually run modprobe I do not see any errors returned. ETA: after looking at /var/log/messages following the manual modprobe, I see this error: "modprobe: modprobe: Can't locate module qla2300_conf"
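
For comparison I'm also going to check what the vmkernel actually has loaded on the good host versus this one (commands from memory, so double-check them on your release):

# vmkload_mod -l | grep qla              (is the qla2300_707_vmw module actually loaded in the vmkernel?)
# grep -i qla /etc/vmware/esx.conf       (does the boot configuration reference the QLogic module/device at all?)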

Phaedrus1
Enthusiast

Are you sure that the host is zoned properly to the SAN? Verify this in your SAN fabric switching.

You may present storage to a server, but without the proper zoning the host will not be able to see it.
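
From the ESX side, a quick way to tell whether anything is being presented at all is to list the paths the host currently sees (the adapter names are just examples from this thread):

# esxcfg-mpath -l        (lists every storage path; no paths on the FC HBAs usually points at zoning or LUN presentation)
# esxcfg-vmhbadevs       (maps the visible vmhba targets to their /dev/sd devices in the service console)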

epoh
Contributor

Well, I'm not the SAN administrator, so I can't be 100% certain it's correct, but our SAN admins have looked it over several times, and at one point even completely removed the hosts and rebuilt everything on the SAN side. So I'd hope that's correct. And I can see all 4 LUNs from both boxes, so I would think that indicates the zoning is correct.

/var/log/messages does not show any errors after doing a rescan. /var/log/vmkernel shows:

Jul 28 11:05:11 usnovpesx1 vmkernel: 32:01:23:35.520 cpu6:1030)StorageMonitor: 196: vmhba1:0:0:0 status = 2/0 0x5 0x24 0x0

Jul 28 11:05:11 usnovpesx1 vmkernel: 32:01:23:35.521 cpu6:1030)StorageMonitor: 196: vmhba1:0:2:0 status = 2/0 0x5 0x24 0x0

Jul 28 11:05:11 usnovpesx1 vmkernel: 32:01:23:35.522 cpu6:1030)StorageMonitor: 196: vmhba1:0:3:0 status = 2/0 0x5 0x24 0x0

Jul 28 11:05:11 usnovpesx1 vmkernel: 32:01:23:35.523 cpu2:1026)StorageMonitor: 196: vmhba5:0:0:0 status = 2/0 0x5 0x24 0x0

Jul 28 11:05:11 usnovpesx1 vmkernel: 32:01:23:35.524 cpu6:1030)StorageMonitor: 196: vmhba1:0:1:0 status = 2/0 0x5 0x24 0x0

Jul 28 11:05:16 usnovpesx1 vmkernel: 32:01:23:41.212 cpu6:1030)StorageMonitor: 196: vmhba1:0:0:0 status = 2/0 0x5 0x24 0x0

Jul 28 11:05:16 usnovpesx1 vmkernel: 32:01:23:41.213 cpu6:1030)StorageMonitor: 196: vmhba1:0:2:0 status = 2/0 0x5 0x24 0x0

Jul 28 11:05:16 usnovpesx1 vmkernel: 32:01:23:41.214 cpu6:1030)StorageMonitor: 196: vmhba1:0:3:0 status = 2/0 0x5 0x24 0x0

Jul 28 11:05:16 usnovpesx1 vmkernel: 32:01:23:41.215 cpu2:1026)StorageMonitor: 196: vmhba5:0:0:0 status = 2/0 0x5 0x24 0x0

Jul 28 11:05:16 usnovpesx1 vmkernel: 32:01:23:41.216 cpu6:1030)StorageMonitor: 196: vmhba1:0:1:0 status = 2/0 0x5 0x24 0x0

Phaedrus1
Enthusiast

Since you say you can see the LUNs on the back end, have you tried going to the Configuration tab, selecting Storage, clicking Add Storage in the upper right, then choosing Disk/LUN and clicking Next to see if the storage shows up?

If you do see the storage there, don't go any further; cancel out of it. This may mean that you have a disk resignature issue: the LUNs are visible, but they are not registered with VC as shared volumes and look new to the server.

If you have this issue we can talk further.
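
In the meantime you can at least check the relevant resignature settings from the service console. This is just a sketch of what to look at; don't change these values until you understand the impact on every host sharing the volumes:

# esxcfg-advcfg -g /LVM/EnableResignature      (0 by default; 1 makes ESX write a new signature to volumes it treats as snapshots)
# esxcfg-advcfg -g /LVM/DisallowSnapshotLun    (1 by default; 0 lets ESX mount a volume it treats as a snapshot without resignaturing)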

epoh
Contributor
Contributor

Yes, that is what's happening. What next?

(Thanks a bunch for your help, btw.)

Phaedrus1
Enthusiast

Please answer this first: what SAN are you using, and are these 4 LUNs newly presented?

If you have a SAN that will allow you to assign the same LUN number twice on a target, then you could have the following problem.

Here is what happened to me.

I presented a LUN to an ESX host; it labels the SCSI target vmhba0:0:1:0 (behind the scenes a GUID is assigned to it). I check it over and all is well. Now I go to another existing host that has a different, NEW LUN, but because the SAN allowed me to also number that LUN as 1, it also gets the SCSI target label vmhba0:0:1:0 (behind the scenes a DIFFERENT GUID is assigned to it), with the 1 representing LUN number 1. That is perfectly legal when the ESX hosts never have to share those LUNs; once they do, though, the problem crops up. Now when you go to present that LUN to the other ESX host, it looks different on the back end.

This comes down to the ATLP (Adapter:Target:LUN:Partition) path name, i.e. vmhbaA:T:L:P.

e.g.

esxhost1 - LUN presented - vmhba0:0:1:0 - GUID assigned

esxhost2 - NEW LUN presented - vmhba0:0:1:0 - DIFFERENT GUID assigned

Now present esxhost1's LUN to esxhost2... you will get a response similar to what you are experiencing. Even though the vmhbaA:T:L:P paths look the same in the Storage Adapters view, you really have two unique GUIDs on the LUNs. Therefore, when a LUN is shared to the other host, that host thinks it is new or unknown storage.

One thing you must do in an ESX environment on the SAN side is make sure that all VMware LUNs have unique LUN numbers. This avoids any confusion on the backend as to which LUN is which.

e.g.

Production = LUN 1 - 20

Stage = LUN 21-40

etc...

Fortunately, I have been working on an HP SAN lately, where a LUN number can only be used once per array. We do still see the problem crop up when another array has the same LUN number, which is why we went to a LUN tracking system.
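
A quick way to compare what each host actually sees is to dump the device-to-VMFS mapping on both hosts and diff the output (sketch only, run from the service console on each host):

# esxcfg-vmhbadevs -m      (shows each vmhba path, its /dev/sd device and the VMFS UUID on it; for a genuinely shared LUN the UUID should match across hosts)
# ls -l /vmfs/volumes/     (the symlinks map datastore labels to those same UUIDs)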

Good luck.

epoh
Contributor

I was afraid that was the problem as well, but I verified with the SAN admin that the LUNs are being presented in the same order to both servers, and that the WWID of each LUN matches on both servers.

epoh
Contributor

So, in the interest of closure: we actually had 4 separate problems going on:

One ESX host did not install properly and could not load the kernel module for the HBA; it will have to be reloaded with ESX.

One of the datastores had its partition table accidentally deleted. The VMware tech helped restore it.

Another issue was with a LUN that had an AIX partition table on it; the VMware tech reminded me that dd works on ESX, DUH (rough sketch of that step below). So that LUN is good to go now.

The last issue we had was the signature problem. We're going to remove the LUN from both servers, do a rescan, have the SAN admins destroy the LUN and re-present it, and then we'll rescan again.
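
In case anyone else hits the AIX-label thing later, it boils down to zeroing out the foreign label sector with dd, something like the line below. The device name is just a placeholder; verify the right /dev/sd mapping with esxcfg-vmhbadevs first, because this destroys whatever partition table is on that LUN:

# dd if=/dev/zero of=/dev/sdX bs=512 count=1      (wipes the first sector so the LUN can be repartitioned for VMFS)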

Now I'm going to go lay down.

Phaedrus1
Enthusiast

A nap sounds real good about now... lol

Good luck with the rest of it.
