Hi all,
Yesterday I added the 50th, 51st and 52nd LUNs to an ESX cluster. The first server I ran the HBA/VMFS rescan from crashed the vpx agent and lost connectivity to VirtualCenter after the rescan process timed out.
I SSH'd to the host and restarted the management agents (service mgmt-vmware restart), and after just under an hour the host reconnected to VC. However, when I attempted to browse to the LUNs I'd added, the VC client crashed. I carried on troubleshooting from the host and discovered I was unable to ls/ll under /vmfs/volumes, although I was able to browse directly into a LUN and view its contents.
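For reference, restarting the agents from the service console on ESX 3.x looks like this (a sketch of what I ran; the vpxa restart is the extra step worth remembering, since vpxa is the VirtualCenter agent that had actually crashed):

```shell
# Restart the ESX 3.x management agents from the service console (run as root).
# mgmt-vmware restarts hostd; vmware-vpxa restarts the VirtualCenter agent.
service mgmt-vmware restart
service vmware-vpxa restart
```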
Suspecting that the LUN additions had broken the ESX host, I removed those LUNs via the SAN manager and restarted the ESX host. On reboot the host took some 20 minutes to load the VMFS3 module and 10 minutes to load VMFS2, and I was still unable to ll/ls under /vmfs/volumes.
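If anyone wants to reproduce the rescan from the console rather than through VC, the ESX 3.x commands I'd expect to use are along these lines (vmhba1 is a placeholder; check which adapter maps to your fabric, e.g. under /proc/vmware/scsi):

```shell
# Rescan an HBA for new LUNs from the ESX 3.x service console
# (substitute the adapter that maps to your fabric for vmhba1),
# then refresh the VMFS volume list.
esxcfg-rescan vmhba1
vmkfstools -V
```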
All the VMs in the cluster are still online, some running live services.
The hosts are IBM LS41s attached to a DS8300 and DS4700 SAN with Cisco fabric switches.
I can't see any fault on the storage side, and neither can the storage team; however, a PMR has been opened with IBM to dig a little deeper. I also have a PMR open with IBM's VMware support, with a support dump uploading to them now.
Has anybody seen this behavior before?
Extract from messages:
"Oct 18 12:06:05 wm-cab1-esx21 kernel: Vendor: IBM Model: 2107900 Rev: .215
Oct 18 12:06:05 wm-cab1-esx21 kernel: Type: Direct-Access ANSI SCSI revision: 05
Oct 18 12:06:05 wm-cab1-esx21 kernel: VMWARE SCSI Id: Supported VPD pages for sdbn : 0x0 0x80 0x83 0xc0
Oct 18 12:06:05 wm-cab1-esx21 kernel: VMWARE SCSI Id: Device id info for sdbn: 0x1 0x3 0x0 0x10 0x60 0x5 0x7 0x63 0x5 0xff 0xc1 0x2a 0x0 0x0 0x0 0x0 0x0 0x0 0x22 0x2 0x1 0x14 0x0 0x4 0x0 0x0 0x1 0x2 0x1 0x15 0x0 0x4 0x0 0x0 0x0 0x0
Oct 18 12:06:05 wm-cab1-esx21 kernel: VMWARE SCSI Id: Id for sdbn 0x60 0x05 0x07 0x63 0x05 0xff 0xc1 0x2a 0x00 0x00 0x00 0x00 0x00 0x00 0x22 0x02 0x32 0x31 0x30 0x37 0x39 0x30
Oct 18 12:06:05 wm-cab1-esx21 kernel: VMWARE: Unique Device attached as scsi disk sdbn at scsi3, channel 0, id 1, lun 52
Oct 18 12:06:05 wm-cab1-esx21 kernel: Attached scsi disk sdbn at scsi3, channel 0, id 1, lun 52
Oct 18 12:06:05 wm-cab1-esx21 kernel: scan_scsis starting finish
Oct 18 12:06:05 wm-cab1-esx21 kernel: SCSI device sdbn: 75497472 512-byte hdwr sectors (36864 MB)
Oct 18 12:06:05 wm-cab1-esx21 kernel: sdbn: unknown partition table
Oct 18 12:06:05 wm-cab1-esx21 kernel: scan_scsis done with finish
Oct 18 12:06:05 wm-cab1-esx21 sshd[31224]: Found matching DSA key: ********
Oct 18 12:06:05 wm-cab1-esx21 sshd[31224]: Found matching DSA key: ********
Oct 18 12:06:05 wm-cab1-esx21 sshd[31224]: Accepted publickey for root from 10...* port 4953 ssh2
Oct 18 12:06:05 wm-cab1-esx21 sshd(pam_unix)[31224]: session opened for user root by (uid=0)
Oct 18 12:13:05 wm-cab1-esx21 su(pam_unix)[31269]: session opened for user root by ********(uid=500)
Oct 18 12:17:02 wm-cab1-esx21 /usr/lib/vmware/hostd/vmware-hostd[19502]: Accepted password for user root from 127.0.0.1
Oct 18 12:17:02 wm-cab1-esx21 snmpd[19799]: SNMPVmodl: refreshing cache completed.
Oct 18 12:17:02 wm-cab1-esx21 snmpd[19799]: SNMPVmodl: start refreshing cache...
Oct 18 12:17:04 wm-cab1-esx21 watchdog-hostd: Terminating watchdog with PID 19492
Oct 18 12:17:04 wm-cab1-esx21 snmpd[19799]: SNMPVmodlRefreshCache: Examining 15 VMs found.
Oct 18 12:17:05 wm-cab1-esx21 watchdog-hostd: Signal received: exiting the watchdog
Oct 18 12:17:06 wm-cab1-esx21 /usr/lib/vmware/hostd/vmware-hostd[19502]: Accepted password for user root from 127.0.0.1
Oct 18 12:17:35 wm-cab1-esx21 watchdog-vpxa: '/usr/sbin/vpxa' exited after 1120187 seconds
Oct 18 12:17:35 wm-cab1-esx21 watchdog-vpxa: Executing '/usr/sbin/vpxa'
Oct 18 12:17:36 wm-cab1-esx21 watchdog-hostd: PID file /var/run/vmware/watchdog-hostd.PID not found
Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: VmomiClient::CollectVMStats: :vmodl.fault.HostNotReachable
Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x10b734e
Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x100df79
Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0xfc8215
Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x4a9add
Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x4aa3f2
Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x4a9e0f
Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x13b7b6b
Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x13b6e5c
Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x13ae2e2
Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x13ae804
Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x13b259c
Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x14004f6
Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x55c1977
Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x4b2894
Oct 18 12:17:36 wm-cab1-esx21 VMware[init]: Begin '/usr/sbin/vmware-hostd -u -a', min-uptime = 60, max-quick-failures = 5, max-total-failures = 1000000
Oct 18 12:17:36 wm-cab1-esx21 snmpd[19799]: eip 0x49b212
Oct 18 12:17:37 wm-cab1-esx21 VMware[init]: connect: No such file or directory.
Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x49a966
Oct 18 12:17:37 wm-cab1-esx21 VMware[init]: connect: No such file or directory.
Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x496e9f
Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x4ecfa5
Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x9479ac
Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x947bba
Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x4e9112
Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x94780c
Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x93eea3
Oct 18 12:17:37 wm-cab1-esx21 snmpd[19799]: eip 0x9401b5
Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x93fd6d
Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x93dd8d
Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x555e38
Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x556ddb
Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x557489
Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x55626c
Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x804b9f9
Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x804aca0
Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0xacc79a
Oct 18 12:17:38 wm-cab1-esx21 snmpd[19799]: eip 0x8049da1
Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: SNMPVmodlRefreshCache exception:vmodl.fault.HostNotReachable
Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x10b734e
Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x100df79
Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0xfc8215
Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x4a9add
Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x4aa3f2
Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x4a9e0f
Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x13b7b6b
Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x13b6e5c
Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x13ae2e2
Oct 18 12:17:39 wm-cab1-esx21 snmpd[19799]: eip 0x13ae804
Oct 18 12:17:40 wm-cab1-esx21 snmpd[19799]: eip 0x13b259c
Oct 18 12:17:40 wm-cab1-esx21 snmpd[19799]: eip 0x14004f6
Oct 18 12:17:40 wm-cab1-esx21 snmpd[19799]: eip 0x55dfd6d
Oct 18 12:17:40 wm-cab1-esx21 snmpd[19799]: eip 0x552ce41
Oct 18 12:17:40 wm-cab1-esx21 snmpd[19799]: eip 0x4af697
Oct 18 12:17:40 wm-cab1-esx21 snmpd[19799]: eip 0x4aee53
Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x4ae3e0
Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0xfe65dc
Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x4aa6fa
Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x4aa0c1
Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x4a71df
Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x49bede
Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x49a966
Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x496e9f
Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x4ecfa5
Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x9479ac
Oct 18 12:17:41 wm-cab1-esx21 snmpd[19799]: eip 0x947bba
Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x4e9112
Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x94780c
Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x93eea3
Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x9401b5
Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x93fd6d
Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x93dd8d
Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x555e38
Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x556ddb
Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x557489
Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x55626c
Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x804b9f9
Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x804aca0
Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0xacc79a
Oct 18 12:17:42 wm-cab1-esx21 snmpd[19799]: eip 0x8049da1
Oct 18 12:17:44 wm-cab1-esx21 snmpd[19799]: [2007-10-18 12:17:44.493 '
Oct 18 12:17:44 wm-cab1-esx21 snmpd[19799]: SNMP
Oct 18 12:17:44 wm-cab1-esx21 snmpd[19799]: ' 3076449696 info]
Oct 18 12:17:44 wm-cab1-esx21 snmpd[19799]: Disconnect: not connected.
Oct 18 12:17:44 wm-cab1-esx21 snmpd[19799]:
Oct 18 12:17:44 wm-cab1-esx21 snmpd[19799]: VmomiClient::Init: :514 Error connecting to hostd-vmdb service instance.
Oct 18 12:17:44 wm-cab1-esx21 snmpd[19799]: Could not init vmomi client to vmware-hostd.
Oct 18 12:17:44 wm-cab1-esx21 snmpd[19799]: SNMPVModlRefreshCache: Could not acquire vmomi client to vmware-hostd.
Oct 18 12:17:45 wm-cab1-esx21 snmpd[19799]: [2007-10-18 12:17:45.097 '
Oct 18 12:17:45 wm-cab1-esx21 snmpd[19799]: SNMP
Oct 18 12:17:45 wm-cab1-esx21 snmpd[19799]: ' 3076449696 info]
Oct 18 12:17:45 wm-cab1-esx21 snmpd[19799]: Disconnect: not connected.
Oct 18 12:17:45 wm-cab1-esx21 snmpd[19799]: "
Update:
I have an open PMR with IBM now at a SAN and ESX level for this fault.
Thoughts are still very welcome. I've got points to award!
Simon
Anybody got any thoughts? The call's been sitting with IBM since last Friday, seemingly without a resolution forthcoming.
Simon
My main concern revolves around the following.
For LUN:
/vmfs/volumes/470f81bf-7cffe640-2d98-00145e6d25c2
50G 50G 1.0M 99% /vmfs/volumes/wm-dmz-servername01.ds
That shows 1M free; the ll -lah output is below:
total 562M
drwxrwxrwt 1 root root 1.1K Oct 12 23:56 .
drwxrwxrwx 1 root root 512 Oct 29 10:52 ..
-r-------- 1 root root 896K Oct 12 15:16 .fbb.sf
-r-------- 1 root root  62M Oct 12 15:16 .fdc.sf
-r-------- 1 root root 244M Oct 12 15:16 .pbc.sf
drwxr-xr-x 1 root root  560 Oct 12 23:56 SERVERNAME
-r-------- 1 root root 248M Oct 12 15:16 .sbc.sf
-r-------- 1 root root 4.0M Oct 12 15:16 .vh.sf
From the VM folder also:
total 51G
drwxr-xr-x 1 root root 560 Oct 12 23:56 .
drwxrwxrwt 1 root root 1.1K Oct 12 23:56 ..
-rw------- 1 root root 50G Oct 18 04:57 SERVERNAME-flat.vmdk
-rw------- 1 root root 345 Oct 16 20:26 SERVERNAME.vmdk
The above tells me the following: the sum of all the files, rounding those sized in K up to 1M, comes to 50.565GB. This LUN is listed in VC as 50.750GB in size, so the volume should have 0.185GB/185M free...
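As a rough cross-check of that arithmetic (sizes copied straight from the listings above, treating 1G as 1024M the way ls -h does):

```shell
# Sum the metadata files and the two VMDKs from the listings above (in MB),
# then compare against the 50.750GB the LUN reports in VC.
awk 'BEGIN {
  meta = 896/1024 + 62 + 244 + 248 + 4      # .fbb .fdc .pbc .sbc .vh
  vm   = 50*1024 + 345/(1024*1024)          # SERVERNAME-flat.vmdk + descriptor
  used = (meta + vm)/1024
  printf "used %.3f GB, expected free %.3f GB\n", used, 50.750 - used
}'
```

That comes out at roughly 50.5GB used and about 0.2GB expected free, in the same ballpark as the 185M figure, so the question stands either way.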
Why is the volume showing only 1M free? What is using the 185M? And why does ll -lah, when run in the VM's folder, report 51G used when we cannot see the files consuming that space?
Most importantly, why is this LUN now showing 1MB free when the same operation previously reported no free space at all? What is growing, and what is shrinking, on a volume that should contain only static VMDK files, metadata (which remains static and is governed by the formula 500MB + (x - 1)(0.016KB), where x is the LUN size) and an assumed-static heartbeat region?
Simon