Scary.
This is a continuation of https://www.vmware.com/community/message.jspa?messageID=644419
In other words, at the moment I have 5 ESX hosts in a cluster and only one can see my VMs. Yesterday two could. The VMs are all served from two biggish LUNs on a NetApp.
All other iSCSI connections are fine. The problem is only on this one cluster and these two LUNs. Furthermore, my other cluster (_exactly_ the same versions of everything and the same setup) is working fine.
The problem, exactly, is that the iSCSI connection works (the LUNs appear under Storage Adapters), but ESX doesn't seem to see a filesystem on them (the disks don't appear under Storage, unless I do "Add Storage", at which point it thinks the disks are empty).
For example (/proc/partitions):
8 0 1048576000 sda 2 6 16 10 0 0 0 0 0 10 10
8 16 1048576000 sdb 1 3 8 0 0 0 0 0 0 0 0
8 32 209715200 sdc 1 3 8 0 0 0 0 0 0 0 0
8 33 209712446 sdc1 0 0 0 0 0 0 0 0 0 0 0
sda and sdb are "blank", or so ESX thinks. sdc is a working LUN that I created as a test.
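For anyone comparing their own output: that symptom can be spotted mechanically. Here is a rough sketch (not from the thread) that flags whole-disk devices with no visible partitions; the sample data mirrors the listing above, and on a live host you would feed it /proc/partitions instead (skipping its header lines):

```shell
#!/bin/sh
# Flag whole-disk devices that have no partition entries.
# Sample data mirrors the /proc/partitions listing above; on a real
# host, pipe in: tail -n +3 /proc/partitions
missing=$(awk '
  # names ending in a digit are partitions (sdc1 -> disk sdc)
  $4 ~ /[0-9]$/ { sub(/[0-9]+$/, "", $4); part[$4] = 1; next }
  $4 != ""      { disk[$4] = 1 }
  END {
    for (d in disk)
      if (!(d in part)) print d ": no partitions visible"
  }
' <<'EOF'
8 0 1048576000 sda
8 16 1048576000 sdb
8 32 209715200 sdc
8 33 209712446 sdc1
EOF
)
echo "$missing"
```

On the data above it reports sda and sdb only, since sdc has sdc1. (Device naming schemes like cciss's c0d0p1 would need a smarter pattern; this is only a sketch.)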
fdisk -l /dev/sda
Disk /dev/sda: 1073.7 GB, 1073741824000 bytes
255 heads, 63 sectors/track, 130541 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
Nothing. However, the same two devices, sda/sdb, work perfectly on another ESX host.
Strangely, though, unlike sdc here, which has a partition table (sdc1), sda and sdb don't have a partition table at all, not even on the ESX host where they are working correctly!
We have lately had a lot of iSCSI problems where it was getting stuck and required resets, so I'm not surprised that something is wrong here, but I'd love to know what it is and how to fix it.
Using my "non-VMware-world" logic, my instinct would be to run some kind of partition/filesystem recovery, but I have noticed that VMware doesn't always play by traditional rules. I'm also afraid to, as I've got about 40 VMs on those LUNs.
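One read-only check that can't make anything worse: dump the first bytes of the device to see whether there is any data there at all (a truly zeroed LUN reads very differently from one with a merely damaged partition table). The sketch below uses a scratch file as a stand-in for /dev/sda, purely so it is runnable anywhere:

```shell
#!/bin/sh
# Read-only sanity check: any non-zero data at the start of the device?
# A scratch file stands in for /dev/sda here.
dev=/tmp/fake_lun.$$
dd if=/dev/zero of="$dev" bs=512 count=4 2>/dev/null
printf 'not actually empty' | dd of="$dev" bs=1 seek=1024 conv=notrunc 2>/dev/null

# Count non-zero bytes in the first 2 KB; this only reads, never writes.
nonzero=$(dd if="$dev" bs=512 count=4 2>/dev/null | tr -d '\000' | wc -c)
if [ "$nonzero" -gt 0 ]; then
  echo "device has data ($nonzero non-zero bytes in first 2 KB)"
else
  echo "device reads as all zeros"
fi
rm -f "$dev"
```

If the real sda/sdb show non-zero data near the start, the VMFS metadata is probably still there and only the partition table view is broken, which is worth telling support.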
Any ideas?
I can give a quick and "stupid" hint: when you have one host working and one (or more) that doesn't, do a deep comparison of their configurations.
I'd start with the ESX firewall and the iSCSI configuration.
What does vmkernel log say when you force a rescan?
I have compared the configurations, and even completely deleted them and started over. I find it hard to believe it is a config error, because these hosts worked before and then lost the disks without me changing anything.
Log is very interesting:
[root@server log]# esxcfg-rescan vmhba40
Doing iSCSI discovery. This can take a few seconds ...
Rescanning vmhba40...done.
On scsi0, removing: 0:0 0:1 1:0.
On scsi0, adding: 0:0 0:1 1:0.
vmkernel:
May 19 11:23:11 server vmkernel: 2:19:22:02.048 cpu3:1033)iSCSI: bus 0 target 0 updating configuration of session 0x3cd43a78 to iqn.1992-08.com.netapp:sn.101170975:vf.26ebc256-c19a-11db-b0a8-00a09802e56e
May 19 11:23:11 server vmkernel: 2:19:22:02.048 cpu3:1033)iSCSI: bus 0 target 0 = iqn.1992-08.com.netapp:sn.101170975:vf.26ebc256-c19a-11db-b0a8-00a09802e56e
May 19 11:23:11 server vmkernel: 2:19:22:02.048 cpu3:1033)iSCSI: bus 0 target 0 portal 0 = address 10.0.0.125 port 3260 group 3
May 19 11:23:11 server vmkernel: 2:19:22:02.048 cpu3:1033)iSCSI: bus 0 target 0 configuration updated at 24252161, session 0x3cd43a78 to iqn.1992-08.com.netapp:sn.101170975:vf.26ebc256-c19a-11db-b0a8-00a09802e56e does not need to logout
May 19 11:23:11 server vmkernel: 2:19:22:02.049 cpu3:1034)iSCSI: bus 0 target 1 updating configuration of session 0x3cd4fb78 to iqn.1992-08.com.netapp:sn.101172444:vf.1f7d0d94-bcfa-11db-8d03-00a09802e51e
May 19 11:23:11 server vmkernel: 2:19:22:02.049 cpu3:1034)iSCSI: bus 0 target 1 = iqn.1992-08.com.netapp:sn.101172444:vf.1f7d0d94-bcfa-11db-8d03-00a09802e51e
May 19 11:23:11 server vmkernel: 2:19:22:02.049 cpu3:1034)iSCSI: bus 0 target 1 portal 0 = address 10.0.0.126 port 3260 group 3
May 19 11:23:11 server vmkernel: 2:19:22:02.049 cpu3:1034)iSCSI: bus 0 target 1 configuration updated at 24252162, session 0x3cd4fb78 to iqn.1992-08.com.netapp:sn.101172444:vf.1f7d0d94-bcfa-11db-8d03-00a09802e51e does not need to logout
May 19 11:23:16 server vmkernel: 2:19:22:06.758 cpu3:1034)SCSI: 8266: Starting rescan of adapter vmhba40
May 19 11:23:16 server vmkernel: VMWARE SCSI Id: Supported VPD pages for vmhba40:0:0 : 0x0 0x80 0x83 0xc0 0xc1
May 19 11:23:16 server vmkernel: VMWARE SCSI Id: Device id info for vmhba40:0:0: 0x2 0x1 0x0 0x20 0x4e 0x45 0x54 0x41 0x50 0x50 0x20 0x20 0x20 0x4c 0x55 0x4e 0x20 0x43 0x34 0x63 0x77 0x64 0x4a 0x2d 0x76 0x52 0x69 0x44 0x65 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x1 0x3 0x0 0x10 0x60 0xa9 0x80 0x0 0x4
May 19 11:23:16 server vmkernel: 3 0x34 0x63 0x77 0x64 0x4a 0x2d 0x76 0x52 0x69 0x44 0x65 0x1 0x2 0x0 0x10 0x4a 0x2d 0x76 0x52 0x69 0x44 0x65 0x0 0xa 0x98 0x0 0x43 0x34 0x63 0x77 0x64 0x1 0x13 0x0 0x10 0x60 0xa9 0x80 0x0 0x0 0x0 0x0 0x2 0x52 0xc7 0xf9 0x7d 0x0 0x0 0xc 0xbc
May 19 11:23:16 server vmkernel: VMWARE SCSI Id: Id for vmhba40:0:0 0x60 0xa9 0x80 0x00 0x43 0x34 0x63 0x77 0x64 0x4a 0x2d 0x76 0x52 0x69 0x44 0x65 0x4c 0x55 0x4e 0x20 0x20 0x20
May 19 11:23:16 server vmkernel: VMWARE SCSI Id: Supported VPD pages for vmhba40:0:1 : 0x0 0x80 0x83 0xc0 0xc1
May 19 11:23:16 server vmkernel: VMWARE SCSI Id: Device id info for vmhba40:0:1: 0x2 0x1 0x0 0x20 0x4e 0x45 0x54 0x41 0x50 0x50 0x20 0x20 0x20 0x4c 0x55 0x4e 0x20 0x43 0x34 0x63 0x77 0x64 0x4a 0x41 0x63 0x53 0x7a 0x64 0x45 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x1 0x3 0x0 0x10 0x60 0xa9 0x80 0x0 0x4
May 19 11:23:16 server vmkernel: 3 0x34 0x63 0x77 0x64 0x4a 0x41 0x63 0x53 0x7a 0x64 0x45 0x1 0x2 0x0 0x10 0x4a 0x41 0x63 0x53 0x7a 0x64 0x45 0x0 0xa 0x98 0x0 0x43 0x34 0x63 0x77 0x64 0x1 0x13 0x0 0x10 0x60 0xa9 0x80 0x0 0x0 0x0 0x0 0x2 0x52 0xc7 0xf9 0x7d 0x0 0x0 0xc 0xbc
May 19 11:23:16 server vmkernel: VMWARE SCSI Id: Id for vmhba40:0:1 0x60 0xa9 0x80 0x00 0x43 0x34 0x63 0x77 0x64 0x4a 0x41 0x63 0x53 0x7a 0x64 0x45 0x4c 0x55 0x4e 0x20 0x20 0x20
May 19 11:23:16 server vmkernel: VMWARE SCSI Id: Supported VPD pages for vmhba40:1:0 : 0x0 0x80 0x83 0xc0 0xc1
May 19 11:23:16 server vmkernel: 34 0x42 0x42 0x45 0x4c 0x79 0x32 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x1 0x3 0x0 0x10 0x60 0xa9 0x80 0x0 0x4
May 19 11:23:16 server vmkernel: 13 0x0 0x10 0x60 0xa9 0x80 0x0 0x0 0x0 0x0 0x2 0x52 0xc7 0xf9 0x7e 0x0 0x0 0xc 0xbc
May 19 11:23:16 server vmkernel: VMWARE SCSI Id: Id for vmhba40:1:0 0x60 0xa9 0x80 0x00 0x43 0x34 0x64 0x4d 0x53 0x34 0x42 0x42 0x45 0x4c 0x79 0x32 0x4c 0x55 0x4e 0x20 0x20 0x20
May 19 11:23:16 server vmkernel: 2:19:22:06.869 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 2 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.869 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 3 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.869 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 4 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 5 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 6 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 7 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 8 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 9 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 10 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 11 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 12 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 13 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 14 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 15 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 16 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.878 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 252 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.878 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 253 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.878 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 254 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.878 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 255 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.878 cpu3:1034)SCSI: 8323: Finished rescan of adapter vmhba40
This thread looks similar:
http://www.vmware.com/community/thread.jspa?messageID=628463
In that case, the cause turned out to be a networking fault.
Thanks, but unfortunately in my case I don't think it's a networking problem. The perfectly working LUNs are on the same network (and the same interfaces on the NetApp) as the non-working ones. The working ESX hosts are also connected to the same switch as the non-working ones, so as far as I can tell the only spots on the network where the fault could be are the switch port or the ESX port, and I've gone through both (and everything else) a dozen times.
Should VMFS partitions have a partition type of "fd"? Always? I'm curious because my working ones do, while the non-working ones don't (even though they work on that one ESX host).
As far as I know, all VMFS partitions are of type "fb"; see below for fdisk output from a system with a local SCSI drive (sda) and three LUNs (sdc, sdd, sdf).
One query: if sda and sdb are LUNs on your system, then which device does ESX install/boot from?
[root@esxborga root]# fdisk -l
Disk /dev/sda: 146.8 GB, 146815733760 bytes
255 heads, 63 sectors/track, 17849 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sda1 * 1 13 104391 83 Linux
/dev/sda2 14 650 5116702+ 83 Linux
/dev/sda3 651 17513 135452047+ fb Unknown
/dev/sda4 17514 17849 2698920 f Win95 Ext'd (LBA)
/dev/sda5 17514 17582 554211 82 Linux swap
/dev/sda6 17583 17836 2040223+ 83 Linux
/dev/sda7 17837 17849 104391 fc Unknown
Disk /dev/sdc: 536.8 GB, 536870912000 bytes
255 heads, 63 sectors/track, 65270 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdc1 1 65270 524281211 fb Unknown
Disk /dev/sdd: 73.9 GB, 73969696768 bytes
255 heads, 63 sectors/track, 8992 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdd1 1 8992 72228176 fb Unknown
Disk /dev/sdf: 148.7 GB, 148703805440 bytes
255 heads, 63 sectors/track, 18078 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdf1 1 18078 145211471 fb Unknown
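A quick way to audit this across hosts is to scan `fdisk -l` output for partitions whose Id is not "fb". A sketch with made-up sample rows (the sde1 entry is an intentionally non-fb example, not from this thread); on a real host you would pipe in the actual fdisk -l instead:

```shell
#!/bin/sh
# Flag partitions whose fdisk Id is not "fb" (the VMFS type).
# Sample rows are made up; on a host: fdisk -l 2>/dev/null | awk ...
flagged=$(awk '
  /^\/dev\// {
    # Id is field 5, or field 6 when the Boot column holds "*"
    id = ($2 == "*") ? $6 : $5
    if (id != "fb") print $1 " has type " id
  }
' <<'EOF'
/dev/sdc1 1 65270 524281211 fb Unknown
/dev/sdd1 1 8992 72228176 fb Unknown
/dev/sde1 1 18078 145211471 83 Linux
EOF
)
echo "$flagged"
```

Local boot disks will of course be flagged too, so read the output with that in mind; the point is just to spot a VMFS LUN whose Id has changed or vanished.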
ESX itself is on local drives. All my VMs are via iSCSI.
Yeah, my sda and sdb are definitely "empty" according to fdisk. No partitions at all.
Question is, what do I do about it?
But what local drives?
A standard install of ESX would place ESX on sda. What does the following command produce?
vdf -h
Do you have dynamic or static discovery?
Discovery Methods
To determine which storage resources on the network are available for access, the ESX Server system's iSCSI initiator uses these discovery methods:
Dynamic Discovery: The initiator discovers iSCSI targets by sending a SendTargets request to a specified target address. To use this method, enter the address of the target device so that the initiator can establish a discovery session with that target. The target device responds by forwarding a list of additional targets that the initiator is allowed to access.
Static Discovery: After the target device used in the SendTargets session sends the list of available targets, they appear on the Static Discovery list. You can manually add additional targets to this list, or remove targets you don't need. The static discovery method is available only with hardware-initiated storage.
But what local drives?
A standard install of ESX would place ESX on sda.
What does the following command produce?
vdf -h
The local drives are two mirrored 73 GB SAS drives that are used only for ESX.
Filesystem Size Used Avail Use% Mounted on
/dev/cciss/c0d0p2 4.9G 1.4G 3.2G 31% /
/dev/cciss/c0d0p1 99M 32M 63M 34% /boot
none 131M 0 131M 0% /dev/shm
/dev/cciss/c0d0p6 2.0G 51M 1.8G 3% /var/log
/vmfs/devices 129G 0 129G 0% /vmfs/devices
/vmfs/volumes/4574faed-d8dd623e-ff52-001a4ba65d2c
60G 627M 60G 1% /vmfs/volumes/4574faed-d8dd623e-ff52-001a4ba65d2c
And these use dynamic discovery.
It does find the LUNs; the NetApp sees the session get established and shows the ESX host as logged in. That's why I'm assuming it's not necessarily a problem with iSCSI itself (although iSCSI may have caused it).
Local drives are two mirrored 73gb SAS drives that are only used for ESX.
Stupid me, I forgot about raid devices.
It's possible that the VMFS info on the LUNs has been corrupted by your earlier iSCSI problems.
http://www.vmware.com/community/thread.jspa?messageID=640538
Check the answer marked correct in the above thread; it implies that once a host is up and running with a LUN, it can happily continue to use it even if the VMFS/partition info is subsequently damaged. The damage only appears to cause a problem upon a reboot (or rescan?) of the previously working host, at which point the LUN becomes unusable.
Certainly check this with VMware support, but if it were me, I would consider creating a new LUN and migrating all VMs off the problem LUNs.
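If you do go the new-LUN route, the Service Console's vmkfstools can clone virtual disks between datastores with its -i (import) option. The sketch below only prints the commands it would run, as a dry run; the datastore paths and VM names are invented examples, and each VM should be powered off before any real copy:

```shell
#!/bin/sh
# Dry run: print the vmkfstools clone commands for moving each VM's
# disk to a new datastore. Paths and VM names are invented examples.
SRC=/vmfs/volumes/old_lun
DST=/vmfs/volumes/new_lun

cmds=$(for vm in vm01 vm02; do
  # -i imports (clones) a virtual disk; review before running for real
  echo "vmkfstools -i $SRC/$vm/$vm.vmdk $DST/$vm/$vm.vmdk"
done)
echo "$cmds"
```

Generating the commands first and eyeballing them is cheap insurance when 40 VMs are at stake; you would also need to copy the .vmx and other config files and re-register the VMs afterwards.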
Good luck.
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 6 0)
What do you see under VI Client -> Configuration -> Storage (SCSI, ...)?
Also check the vmhba properties. Which state are the paths in? (Active, Standby, Disabled?)