Scary.
This is a continuation of https://www.vmware.com/community/message.jspa?messageID=644419
In other words, at the moment I have 5 ESX hosts in a cluster and only one can see my VMs. Yesterday two could. The VMs are all served from two biggish LUNs on a NetApp.
All other iSCSI connections are fine. The problem is only on this one cluster and these two LUNs. Furthermore, my other cluster (_exactly_ the same versions of everything and the same setup) is working fine.
The problem, exactly, is that the iSCSI connection works (the LUNs appear under Storage Adapters), but ESX doesn't seem to see a filesystem on them (the disks don't appear under Storage, unless I do "Add Storage", at which point it thinks the disks are empty).
For example (/proc/partitions):
8 0 1048576000 sda 2 6 16 10 0 0 0 0 0 10 10
8 16 1048576000 sdb 1 3 8 0 0 0 0 0 0 0 0
8 32 209715200 sdc 1 3 8 0 0 0 0 0 0 0 0
8 33 209712446 sdc1 0 0 0 0 0 0 0 0 0 0 0
sda and sdb are "blank", or so ESX thinks. sdc is a working LUN that I created as a test.
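For anyone comparing their own output: that symptom can be spotted mechanically. Here is a rough sketch (not from the thread) that flags whole-disk devices with no visible partitions; the sample data mirrors the listing above, and on a live host you would feed it /proc/partitions instead (skipping its header lines):

```shell
#!/bin/sh
# Flag whole-disk devices that have no partition entries.
# Sample data mirrors the /proc/partitions listing above; on a real
# host, pipe in: tail -n +3 /proc/partitions
missing=$(awk '
  # names ending in a digit are partitions (sdc1 -> disk sdc)
  $4 ~ /[0-9]$/ { sub(/[0-9]+$/, "", $4); part[$4] = 1; next }
  $4 != ""      { disk[$4] = 1 }
  END {
    for (d in disk)
      if (!(d in part)) print d ": no partitions visible"
  }
' <<'EOF'
8 0 1048576000 sda
8 16 1048576000 sdb
8 32 209715200 sdc
8 33 209712446 sdc1
EOF
)
echo "$missing"
```

On the data above it reports sda and sdb only, since sdc has sdc1. (Device naming schemes like cciss's c0d0p1 would need a smarter pattern; this is only a sketch.)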
fdisk -l /dev/sda
Disk /dev/sda: 1073.7 GB, 1073741824000 bytes
255 heads, 63 sectors/track, 130541 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
Nothing. However, the same two devices, sda/sdb, work perfectly on another ESX host.
Strangely, though, unlike sdc here, which has a partition table (sdc1), sda and sdb don't have a partition table at all, not even on the ESX host where they are working correctly!
We have lately had a lot of iSCSI problems where it was getting stuck and required resets, so I'm not surprised that something is wrong here, but I'd love to know what it is and how to fix it.
Using my "non-VMware-world" logic, my instinct would be to run some kind of partition/filesystem recovery, but I have noticed that VMware doesn't always play by traditional rules. I'm also afraid to, as I've got about 40 VMs on those LUNs.
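One read-only check that can't make anything worse: dump the first bytes of the device to see whether there is any data there at all (a truly zeroed LUN reads very differently from one with a merely damaged partition table). The sketch below uses a scratch file as a stand-in for /dev/sda, purely so it is runnable anywhere:

```shell
#!/bin/sh
# Read-only sanity check: any non-zero data at the start of the device?
# A scratch file stands in for /dev/sda here.
dev=/tmp/fake_lun.$$
dd if=/dev/zero of="$dev" bs=512 count=4 2>/dev/null
printf 'not actually empty' | dd of="$dev" bs=1 seek=1024 conv=notrunc 2>/dev/null

# Count non-zero bytes in the first 2 KB; this only reads, never writes.
nonzero=$(dd if="$dev" bs=512 count=4 2>/dev/null | tr -d '\000' | wc -c)
if [ "$nonzero" -gt 0 ]; then
  echo "device has data ($nonzero non-zero bytes in first 2 KB)"
else
  echo "device reads as all zeros"
fi
rm -f "$dev"
```

If the real sda/sdb show non-zero data near the start, the VMFS metadata is probably still there and only the partition table view is broken, which is worth telling support.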
Any ideas?
I can give a quick and "stupid" hint: when you have one host working and one (or more) that doesn't, do a deep comparison of their configurations.
I'd start with the ESX firewall and the iSCSI configuration.
What does vmkernel log say when you force a rescan?
I have compared the configurations, and even completely deleted them and started over. I find it hard to believe it is a config error, because these hosts worked before and then lost the disks without me changing anything.
Log is very interesting:
[root@server log]# esxcfg-rescan vmhba40
Doing iSCSI discovery. This can take a few seconds ...
Rescanning vmhba40...done.
On scsi0, removing: 0:0 0:1 1:0.
On scsi0, adding: 0:0 0:1 1:0.
vmkernel:
May 19 11:23:11 server vmkernel: 2:19:22:02.048 cpu3:1033)iSCSI: bus 0 target 0 updating configuration of session 0x3cd43a78 to iqn.1992-08.com.netapp:sn.101170975:vf.26ebc256-c19a-11db-b0a8-00a09802e56e
May 19 11:23:11 server vmkernel: 2:19:22:02.048 cpu3:1033)iSCSI: bus 0 target 0 = iqn.1992-08.com.netapp:sn.101170975:vf.26ebc256-c19a-11db-b0a8-00a09802e56e
May 19 11:23:11 server vmkernel: 2:19:22:02.048 cpu3:1033)iSCSI: bus 0 target 0 portal 0 = address 10.0.0.125 port 3260 group 3
May 19 11:23:11 server vmkernel: 2:19:22:02.048 cpu3:1033)iSCSI: bus 0 target 0 configuration updated at 24252161, session 0x3cd43a78 to iqn.1992-08.com.netapp:sn.101170975:vf.26ebc256-c19a-11db-b0a8-00a09802e56e does not need to logout
May 19 11:23:11 server vmkernel: 2:19:22:02.049 cpu3:1034)iSCSI: bus 0 target 1 updating configuration of session 0x3cd4fb78 to iqn.1992-08.com.netapp:sn.101172444:vf.1f7d0d94-bcfa-11db-8d03-00a09802e51e
May 19 11:23:11 server vmkernel: 2:19:22:02.049 cpu3:1034)iSCSI: bus 0 target 1 = iqn.1992-08.com.netapp:sn.101172444:vf.1f7d0d94-bcfa-11db-8d03-00a09802e51e
May 19 11:23:11 server vmkernel: 2:19:22:02.049 cpu3:1034)iSCSI: bus 0 target 1 portal 0 = address 10.0.0.126 port 3260 group 3
May 19 11:23:11 server vmkernel: 2:19:22:02.049 cpu3:1034)iSCSI: bus 0 target 1 configuration updated at 24252162, session 0x3cd4fb78 to iqn.1992-08.com.netapp:sn.101172444:vf.1f7d0d94-bcfa-11db-8d03-00a09802e51e does not need to logout
May 19 11:23:16 server vmkernel: 2:19:22:06.758 cpu3:1034)SCSI: 8266: Starting rescan of adapter vmhba40
May 19 11:23:16 server vmkernel: VMWARE SCSI Id: Supported VPD pages for vmhba40:0:0 : 0x0 0x80 0x83 0xc0 0xc1
May 19 11:23:16 server vmkernel: VMWARE SCSI Id: Device id info for vmhba40:0:0: 0x2 0x1 0x0 0x20 0x4e 0x45 0x54 0x41 0x50 0x50 0x20 0x20 0x20 0x4c 0x55 0x4e 0x20 0x43 0x34 0x63 0x77 0x64 0x4a 0x2d 0x76 0x52 0x69 0x44 0x65 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x1 0x3 0x0 0x10 0x60 0xa9 0x80 0x0 0x4
May 19 11:23:16 server vmkernel: 3 0x34 0x63 0x77 0x64 0x4a 0x2d 0x76 0x52 0x69 0x44 0x65 0x1 0x2 0x0 0x10 0x4a 0x2d 0x76 0x52 0x69 0x44 0x65 0x0 0xa 0x98 0x0 0x43 0x34 0x63 0x77 0x64 0x1 0x13 0x0 0x10 0x60 0xa9 0x80 0x0 0x0 0x0 0x0 0x2 0x52 0xc7 0xf9 0x7d 0x0 0x0 0xc 0xbc
May 19 11:23:16 server vmkernel: VMWARE SCSI Id: Id for vmhba40:0:0 0x60 0xa9 0x80 0x00 0x43 0x34 0x63 0x77 0x64 0x4a 0x2d 0x76 0x52 0x69 0x44 0x65 0x4c 0x55 0x4e 0x20 0x20 0x20
May 19 11:23:16 server vmkernel: VMWARE SCSI Id: Supported VPD pages for vmhba40:0:1 : 0x0 0x80 0x83 0xc0 0xc1
May 19 11:23:16 server vmkernel: VMWARE SCSI Id: Device id info for vmhba40:0:1: 0x2 0x1 0x0 0x20 0x4e 0x45 0x54 0x41 0x50 0x50 0x20 0x20 0x20 0x4c 0x55 0x4e 0x20 0x43 0x34 0x63 0x77 0x64 0x4a 0x41 0x63 0x53 0x7a 0x64 0x45 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x1 0x3 0x0 0x10 0x60 0xa9 0x80 0x0 0x4
May 19 11:23:16 server vmkernel: 3 0x34 0x63 0x77 0x64 0x4a 0x41 0x63 0x53 0x7a 0x64 0x45 0x1 0x2 0x0 0x10 0x4a 0x41 0x63 0x53 0x7a 0x64 0x45 0x0 0xa 0x98 0x0 0x43 0x34 0x63 0x77 0x64 0x1 0x13 0x0 0x10 0x60 0xa9 0x80 0x0 0x0 0x0 0x0 0x2 0x52 0xc7 0xf9 0x7d 0x0 0x0 0xc 0xbc
May 19 11:23:16 server vmkernel: VMWARE SCSI Id: Id for vmhba40:0:1 0x60 0xa9 0x80 0x00 0x43 0x34 0x63 0x77 0x64 0x4a 0x41 0x63 0x53 0x7a 0x64 0x45 0x4c 0x55 0x4e 0x20 0x20 0x20
May 19 11:23:16 server vmkernel: VMWARE SCSI Id: Supported VPD pages for vmhba40:1:0 : 0x0 0x80 0x83 0xc0 0xc1
May 19 11:23:16 server vmkernel: 34 0x42 0x42 0x45 0x4c 0x79 0x32 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x1 0x3 0x0 0x10 0x60 0xa9 0x80 0x0 0x4
May 19 11:23:16 server vmkernel: 13 0x0 0x10 0x60 0xa9 0x80 0x0 0x0 0x0 0x0 0x2 0x52 0xc7 0xf9 0x7e 0x0 0x0 0xc 0xbc
May 19 11:23:16 server vmkernel: VMWARE SCSI Id: Id for vmhba40:1:0 0x60 0xa9 0x80 0x00 0x43 0x34 0x64 0x4d 0x53 0x34 0x42 0x42 0x45 0x4c 0x79 0x32 0x4c 0x55 0x4e 0x20 0x20 0x20
May 19 11:23:16 server vmkernel: 2:19:22:06.869 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 2 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.869 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 3 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.869 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 4 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 5 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 6 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 7 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 8 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 9 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 10 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 11 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 12 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 13 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 14 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 15 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 16 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.878 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 252 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.878 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 253 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.878 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 254 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.878 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 255 0)
May 19 11:23:16 server vmkernel: 2:19:22:06.878 cpu3:1034)SCSI: 8323: Finished rescan of adapter vmhba40
This thread looks similar:
http://www.vmware.com/community/thread.jspa?messageID=628463
In that case, the cause turned out to be a networking fault.
Thanks, but unfortunately in my case I don't think it's a networking problem. The perfectly working LUNs are on the same network (and the same interfaces on the NetApp) as the non-working ones. The working ESX hosts are also connected to the same switch as the non-working ones, so as far as I can tell the only spots on the network where the fault could be are the switch port or the ESX port, and I've gone through both (and everything else) a dozen times.
Should VMFS partitions have a partition type of "fd"? Always? I'm curious because my working ones do, while the non-working ones don't (even though they work on that one ESX host).
As far as I know, all VMFS partitions are of type "fb"; see below for fdisk output from a system with a local SCSI drive (sda) and three LUNs (sdc, sdd, sdf).
One query: if sda and sdb are LUNs on your system, then which device does ESX install/boot from?
[root@esxborga root]# fdisk -l
Disk /dev/sda: 146.8 GB, 146815733760 bytes
255 heads, 63 sectors/track, 17849 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sda1 * 1 13 104391 83 Linux
/dev/sda2 14 650 5116702+ 83 Linux
/dev/sda3 651 17513 135452047+ fb Unknown
/dev/sda4 17514 17849 2698920 f Win95 Ext'd (LBA)
/dev/sda5 17514 17582 554211 82 Linux swap
/dev/sda6 17583 17836 2040223+ 83 Linux
/dev/sda7 17837 17849 104391 fc Unknown
Disk /dev/sdc: 536.8 GB, 536870912000 bytes
255 heads, 63 sectors/track, 65270 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdc1 1 65270 524281211 fb Unknown
Disk /dev/sdd: 73.9 GB, 73969696768 bytes
255 heads, 63 sectors/track, 8992 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdd1 1 8992 72228176 fb Unknown
Disk /dev/sdf: 148.7 GB, 148703805440 bytes
255 heads, 63 sectors/track, 18078 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdf1 1 18078 145211471 fb Unknown
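A quick way to audit this across hosts is to scan `fdisk -l` output for partitions whose Id is not "fb". A sketch with made-up sample rows (the sde1 entry is an intentionally non-fb example, not from this thread); on a real host you would pipe in the actual fdisk -l instead:

```shell
#!/bin/sh
# Flag partitions whose fdisk Id is not "fb" (the VMFS type).
# Sample rows are made up; on a host: fdisk -l 2>/dev/null | awk ...
flagged=$(awk '
  /^\/dev\// {
    # Id is field 5, or field 6 when the Boot column holds "*"
    id = ($2 == "*") ? $6 : $5
    if (id != "fb") print $1 " has type " id
  }
' <<'EOF'
/dev/sdc1 1 65270 524281211 fb Unknown
/dev/sdd1 1 8992 72228176 fb Unknown
/dev/sde1 1 18078 145211471 83 Linux
EOF
)
echo "$flagged"
```

Local boot disks will of course be flagged too, so read the output with that in mind; the point is just to spot a VMFS LUN whose Id has changed or vanished.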
ESX itself is on local drives. All my VMs are via iSCSI.
Yeah, my sda and sdb are definitely "empty" according to fdisk. No partitions at all.
Question is, what do I do about it?
But what local drives?
A standard install of ESX would place ESX on sda. What does the following command produce?
vdf -h
Do you have dynamic or static discovery?
Discovery Methods
To determine which storage resources on the network are available for access, the ESX Server system's iSCSI initiator uses these discovery methods:
Dynamic Discovery: The initiator discovers iSCSI targets by sending a SendTargets request to a specified target address. To use this method, enter the address of the target device so that the initiator can establish a discovery session with that target. The target device responds by forwarding a list of additional targets that the initiator is allowed to access.
Static Discovery: After the target device used in the SendTargets session sends the list of available targets, they appear on the Static Discovery list. You can manually add additional targets to this list, or remove targets you don't need. The static discovery method is available only with hardware-initiated storage.
But what local drives?
A standard install of ESX would place ESX on sda.
What does the following command produce?
vdf -h
The local drives are two mirrored 73 GB SAS drives that are used only for ESX.
Filesystem Size Used Avail Use% Mounted on
/dev/cciss/c0d0p2 4.9G 1.4G 3.2G 31% /
/dev/cciss/c0d0p1 99M 32M 63M 34% /boot
none 131M 0 131M 0% /dev/shm
/dev/cciss/c0d0p6 2.0G 51M 1.8G 3% /var/log
/vmfs/devices 129G 0 129G 0% /vmfs/devices
/vmfs/volumes/4574faed-d8dd623e-ff52-001a4ba65d2c
60G 627M 60G 1% /vmfs/volumes/4574faed-d8dd623e-ff52-001a4ba65d2c
And these use dynamic discovery.
It does find the LUNs; the NetApp sees the session get established and shows the ESX host as logged in. That's why I'm assuming it's not necessarily a problem with iSCSI itself (although iSCSI may have caused it).
Local drives are two mirrored 73gb SAS drives that are only used for ESX.
Stupid me, I forgot about raid devices.
It's possible that the VMFS info on the LUNs has been corrupted by your earlier iSCSI problems.
http://www.vmware.com/community/thread.jspa?messageID=640538
Check the answer marked correct in the above thread; it implies that once a host is up and running with a LUN, it can happily continue to use it even if the VMFS/partition info is subsequently damaged. The damage only appears to cause a problem upon a reboot (or rescan?) of the previously working host, at which point the LUN becomes unusable.
Certainly check this with VMware support, but if it were me, I would consider creating a new LUN and migrating all VMs off the problem LUNs.
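If you do go the new-LUN route, the Service Console's vmkfstools can clone virtual disks between datastores with its -i (import) option. The sketch below only prints the commands it would run, as a dry run; the datastore paths and VM names are invented examples, and each VM should be powered off before any real copy:

```shell
#!/bin/sh
# Dry run: print the vmkfstools clone commands for moving each VM's
# disk to a new datastore. Paths and VM names are invented examples.
SRC=/vmfs/volumes/old_lun
DST=/vmfs/volumes/new_lun

cmds=$(for vm in vm01 vm02; do
  # -i imports (clones) a virtual disk; review before running for real
  echo "vmkfstools -i $SRC/$vm/$vm.vmdk $DST/$vm/$vm.vmdk"
done)
echo "$cmds"
```

Generating the commands first and eyeballing them is cheap insurance when 40 VMs are at stake; you would also need to copy the .vmx and other config files and re-register the VMs afterwards.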
Good luck.
May 19 11:23:16 server vmkernel: 2:19:22:06.870 cpu3:1034)iSCSI: queuecommand 0x3cc06cb0 failed to find a session for HBA 0x1f63a98, (0 0 6 0)
What do you see under VI Client -> Configuration -> Storage (SCSI, ...)?
Also check the vmhba properties. Which state are the paths in? (Active, Standby, Disabled?)