VMware Cloud Community
conyards
Expert

Browsing /vmfs/volumes/ times out.

Hi all,

Yesterday I added the 50th, 51st, and 52nd LUNs to an ESX cluster. On the first server I ran the HBA/VMFS rescan from, the rescan timed out, the vpxa agent crashed, and the host lost connectivity to VirtualCenter.

I SSH'd to the host and restarted the management agents (service mgmt-vmware restart). After just under an hour the host reconnected to VC; however, when I attempted to browse to the LUNs I'd added, the VC client crashed. I carried on troubleshooting from the host and discovered I was unable to ls/ll under /vmfs/volumes, although I could cd directly into a LUN and view its contents.

Suspecting that the LUN additions had broken the ESX host, I removed these LUNs via the SAN manager and restarted the ESX host. On reboot the host took some 20 minutes to load the VMFS3 module and 10 minutes to load the VMFS2 module, and I was still unable to ls/ll under /vmfs/volumes.

All the VMs in the cluster are still online, some running live services.

The hosts are IBM LS41s attached to a DS8300 and DS4700 SAN with Cisco fabric switches.

I can't see any fault on the storage side, and neither can the storage team; however, a PMR has been opened with IBM to dig a little deeper. I also have a PMR open with IBM's VMware support, with a support dump uploading to them now.

Has anybody seen this behavior before?

Extract from /var/log/messages:

"Oct 18 12:06:05 wm-cab1-esx21 kernel: Vendor: IBM Model: 2107900 Rev: .215

Oct 18 12:06:05 wm-cab1-esx21 kernel: Type: Direct-Access ANSI SCSI revision: 05

Oct 18 12:06:05 wm-cab1-esx21 kernel: VMWARE SCSI Id: Supported VPD pages for sdbn : 0x0 0x80 0x83 0xc0

Oct 18 12:06:05 wm-cab1-esx21 kernel: VMWARE SCSI Id: Device id info for sdbn: 0x1 0x3 0x0 0x10 0x60 0x5 0x7 0x63 0x5 0xff 0xc1 0x2a 0x0 0x0 0x0 0x0 0x0 0x0 0x22 0x2 0x1 0x14 0x0 0x4 0x0 0x0 0x1 0x2 0x1 0x15 0x0 0x4 0x0 0x0 0x0 0x0

Oct 18 12:06:05 wm-cab1-esx21 kernel: VMWARE SCSI Id: Id for sdbn 0x60 0x05 0x07 0x63 0x05 0xff 0xc1 0x2a 0x00 0x00 0x00 0x00 0x00 0x00 0x22 0x02 0x32 0x31 0x30 0x37 0x39 0x30

Oct 18 12:06:05 wm-cab1-esx21 kernel: VMWARE: Unique Device attached as scsi disk sdbn at scsi3, channel 0, id 1, lun 52

Oct 18 12:06:05 wm-cab1-esx21 kernel: Attached scsi disk sdbn at scsi3, channel 0, id 1, lun 52

Oct 18 12:06:05 wm-cab1-esx21 kernel: scan_scsis starting finish

Oct 18 12:06:05 wm-cab1-esx21 kernel: SCSI device sdbn: 75497472 512-byte hdwr sectors (36864 MB)

Oct 18 12:06:05 wm-cab1-esx21 kernel: sdbn: unknown partition table

Oct 18 12:06:05 wm-cab1-esx21 kernel: scan_scsis done with finish

Oct 18 12:06:05 wm-cab1-esx21 sshd[31224]: Found matching DSA key: ********

Oct 18 12:06:05 wm-cab1-esx21 sshd[31224]: Found matching DSA key: ********

Oct 18 12:06:05 wm-cab1-esx21 sshd[31224]: Accepted publickey for root from 10...* port 4953 ssh2

Oct 18 12:06:05 wm-cab1-esx21 sshd(pam_unix)[31224]: session opened for user root by (uid=0)

Oct 18 12:13:05 wm-cab1-esx21 su(pam_unix)[31269]: session opened for user root by ********(uid=500)

Oct 18 12:17:02 wm-cab1-esx21 /usr/lib/vmware/hostd/vmware-hostd[19502]: Accepted password for user root from 127.0.0.1

Oct 18 12:17:02 wm-cab1-esx21 snmpd[19799]: SNMPVmodl: refreshing cache completed.

Oct 18 12:17:02 wm-cab1-esx21 snmpd[19799]: SNMPVmodl: start refreshing cache...

Oct 18 12:17:04 wm-cab1-esx21 watchdog-hostd: Terminating watchdog with PID 19492

Oct 18 12:17:04 wm-cab1-esx21 snmpd[19799]: SNMPVmodlRefreshCache: Examining 15 VMs found.

Oct 18 12:17:05 wm-cab1-esx21 watchdog-hostd: Signal received: exiting the watchdog

Oct 18 12:17:06 wm-cab1-esx21 /usr/lib/vmware/hostd/vmware-hostd[19502]: Accepted password for user root from 127.0.0.1

Oct 18 12:17:35 wm-cab1-esx21 watchdog-vpxa: '/usr/sbin/vpxa' exited after 1120187 seconds

Oct 18 12:17:35 wm-cab1-esx21 watchdog-vpxa: Executing '/usr/sbin/vpxa'

Oct 18 12:17:36 wm-cab1-esx21 watchdog-hostd: PID file /var/run/vmware/watchdog-hostd.PID not found

Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: VmomiClient::CollectVMStats: :vmodl.fault.HostNotReachable

Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x10b734e

Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x100df79

Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0xfc8215

Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x4a9add

Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x4aa3f2

Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x4a9e0f

Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x13b7b6b

Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x13b6e5c

Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x13ae2e2

Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x13ae804

Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x13b259c

Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x14004f6

Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x55c1977

Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x4b2894

Oct 18 12:17:36 wm-cab1-esx21 VMware[init]: Begin '/usr/sbin/vmware-hostd -u -a', min-uptime = 60, max-quick-failures = 5, max-total-failures = 1000000

Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x49b212

Oct 18 12:17:37 wm-cab1-esx21 VMware[init]: connect: No such file or directory.

Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x49a966

Oct 18 12:17:37 wm-cab1-esx21 VMware[init]: connect: No such file or directory.

Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x496e9f

Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x4ecfa5

Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x9479ac

Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x947bba

Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x4e9112

Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x94780c

Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x93eea3

Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x9401b5

Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x93fd6d

Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x93dd8d

Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x555e38

Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x556ddb

Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x557489

Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x55626c

Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x804b9f9

Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x804aca0

Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0xacc79a

Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x8049da1

Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: SNMPVmodlRefreshCache exception:vmodl.fault.HostNotReachable

Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x10b734e

Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x100df79

Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0xfc8215

Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x4a9add

Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x4aa3f2

Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x4a9e0f

Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x13b7b6b

Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x13b6e5c

Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x13ae2e2

Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x13ae804

Oct 18 12:17:40 wm-cab1-esx21 snmpd[19799]: eip 0x13b259c

Oct 18 12:17:40 wm-cab1-esx21 snmpd[19799]: eip 0x14004f6

Oct 18 12:17:40 wm-cab1-esx21 snmpd[19799]: eip 0x55dfd6d

Oct 18 12:17:40 wm-cab1-esx21 snmpd[19799]: eip 0x552ce41

Oct 18 12:17:40 wm-cab1-esx21 snmpd[19799]: eip 0x4af697

Oct 18 12:17:40 wm-cab1-esx21 snmpd[19799]: eip 0x4aee53

Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x4ae3e0

Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0xfe65dc

Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x4aa6fa

Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x4aa0c1

Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x4a71df

Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x49bede

Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x49a966

Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x496e9f

Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x4ecfa5

Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x9479ac

Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x947bba

Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x4e9112

Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x94780c

Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x93eea3

Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x9401b5

Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x93fd6d

Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x93dd8d

Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x555e38

Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x556ddb

Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x557489

Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x55626c

Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x804b9f9

Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x804aca0

Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0xacc79a

Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x8049da1

Oct 18 12:17:44 wm-cab1-esx21 snmpd[19799]: [2007-10-18 12:17:44.493 '

Oct 18 12:17:44 wm-cab1-esx21 snmpd[19799]: SNMP

Oct 18 12:17:44 wm-cab1-esx21 snmpd[19799]: ' 3076449696 info]

Oct 18 12:17:44 wm-cab1-esx21 snmpd[19799]: Disconnect: not connected.

Oct 18 12:17:44 wm-cab1-esx21 snmpd[19799]:

Oct 18 12:17:44 wm-cab1-esx21 snmpd[19799]: VmomiClient::Init: :514 Error connecting to hostd-vmdb service instance.

Oct 18 12:17:44 wm-cab1-esx21 snmpd[19799]: Could not init vmomi client to vmware-hostd.

Oct 18 12:17:44 wm-cab1-esx21 snmpd[19799]: SNMPVModlRefreshCache: Could not acquire vmomi client to vmware-hostd.

Oct 18 12:17:45 wm-cab1-esx21 snmpd[19799]: [2007-10-18 12:17:45.097 '

Oct 18 12:17:45 wm-cab1-esx21 snmpd[19799]: SNMP

Oct 18 12:17:45 wm-cab1-esx21 snmpd[19799]: ' 3076449696 info]

Oct 18 12:17:45 wm-cab1-esx21 snmpd[19799]: Disconnect: not connected.

Oct 18 12:17:45 wm-cab1-esx21 snmpd[19799]: "

https://virtual-simon.co.uk/
3 Replies
conyards
Expert

Update:

I now have an open PMR with IBM at both the SAN and ESX level for this fault.

Thoughts are still very welcome. I've got points to award!

Simon

https://virtual-simon.co.uk/
conyards
Expert

Anybody got any thoughts? The call's been sitting with IBM since last Friday, seemingly without a resolution forthcoming.

Simon

https://virtual-simon.co.uk/
conyards
Expert

My main concern revolves around the following:

For the LUN

/vmfs/volumes/470f81bf-7cffe640-2d98-00145e6d25c2

df reports (Size / Used / Avail / Use% / Mounted on):

50G 50G 1.0M 99% /vmfs/volumes/wm-dmz-servername01.ds

That shows 1.0M free; the ls -lah output from the volume is below.

total 562M

drwxrwxrwt 1 root root 1.1K Oct 12 23:56 .

drwxrwxrwx 1 root root 512 Oct 29 10:52 ..

-r-------- 1 root root 896K Oct 12 15:16 .fbb.sf

-r-------- 1 root root 62M Oct 12 15:16 .fdc.sf

-r-------- 1 root root 244M Oct 12 15:16 .pbc.sf

drwxr-xr-x 1 root root 560 Oct 12 23:56 SERVERNAME

-r-------- 1 root root 248M Oct 12 15:16 .sbc.sf

-r-------- 1 root root 4.0M Oct 12 15:16 .vh.sf

And from the VM folder:

total 51G

drwxr-xr-x 1 root root 560 Oct 12 23:56 .

drwxrwxrwt 1 root root 1.1K Oct 12 23:56 ..

-rw------- 1 root root 50G Oct 18 04:57 SERVERNAME-flat.vmdk

-rw------- 1 root root 345 Oct 16 20:26 SERVERNAME.vmdk

The above tells me the following: adding up all the files, rounding those sized in KB up to 1MB, gives 50.565GB. This LUN is listed in VC as 50.750GB in size, so this volume should have roughly 0.185GB/185MB free...
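To double-check my arithmetic mechanically, here's a quick tally (all figures copied from the directory listings above; depending on how aggressively the KB-sized entries are rounded up, it lands a little above my hand-calculated 50.565GB, but the conclusion is the same):

```python
# Tally the visible files on the volume against the capacity VC reports.
# All file sizes are copied from the ls -lah listings above; units are MB.
MB = 1.0
GB = 1024 * MB

sf_files = {".fbb.sf": 896 / 1024, ".fdc.sf": 62, ".pbc.sf": 244,
            ".sbc.sf": 248, ".vh.sf": 4.0}            # VMFS metadata files
vm_files = {"SERVERNAME-flat.vmdk": 50 * GB,
            "SERVERNAME.vmdk": 345 / (1024 * 1024)}   # VM folder contents

used = sum(sf_files.values()) + sum(vm_files.values())
capacity = 50.750 * GB                                # size reported by VC
expected_free = capacity - used

print(round(used / GB, 3), round(expected_free))      # ~50.55 GB used, ~209 MB free
```

Whether the expected free figure comes out at 185MB or 209MB, it is two orders of magnitude larger than the 1.0MB that df is actually reporting.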

Why is the volume showing only 1MB of free space? What is utilizing the 185MB? And why does ls -lah, when run in the VM's folder, report a total of 51GB used when we cannot identify the files utilizing that space?

Most importantly, why is this LUN now showing 1MB free when, the previous time I ran this operation, it reported no free space at all? What is growing, and what is shrinking, on a volume that should contain only static VMDK files, metadata (which remains static and is governed by the formula 500MB + (x - 1)(0.016KB), where x is the LUN size), and an assumed-static heartbeat region?
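For what it's worth, plugging this LUN into that metadata formula, taken exactly as I've written it above (the 0.016KB term looks suspiciously small to me, but even so), gives an essentially flat 500MB, while the .sf files on this volume actually total about 559MB. Either way, both numbers are constants for a fixed LUN size, which is exactly the point: nothing in the metadata should be moving.

```python
# VMFS3 metadata size per the rule of thumb quoted above, taken as written:
# 500MB + (x - 1) * 0.016KB, where x is the LUN size in GB.
def vmfs3_metadata_mb(lun_size_gb):
    return 500 + (lun_size_gb - 1) * 0.016 / 1024   # 0.016 KB expressed in MB

predicted = vmfs3_metadata_mb(50.750)               # effectively flat at ~500 MB
observed = 896 / 1024 + 62 + 244 + 248 + 4.0        # sum of the .sf files: 558.875 MB

print(round(predicted, 3), observed)
```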

Simon

https://virtual-simon.co.uk/