maxx22
Contributor

Running Into Some Storage Issues in a Mixed ESX4/ESX3.5/ESX3.0 Environment

We are in the process of bringing our 16 or so ESX 3/3.5 blades up to date by doing fresh installs of ESX 4.0. We currently have 3 of the 16 converted. I've moved some test VMs over to the new ESX4 blades and have noticed a problem with our storage (EMC CX300 LUNs formatted as VMFS).

When doing a rescan of HBAs/VMFS on our ESX4 blades, the connection to the SAN seems to lock up. If I SSH into an ESX4 box and run vdf against any LUN, it doesn't return until the rescan completes. Any VMs on that blade also go dead during the rescan: I can't ping, RDP, SSH, or console into them while the scan runs. I looked at the vmkwarning log and see the following errors, although pretty much always on different volumes...

Jan 20 00:41:36 esxc31 vmkernel: 1:08:56:13.973 cpu2:4109)WARNING: J3: 1350: No space for journal on volume 4af81daa-7d3cb414-b1e9-0017084f23d0 ("RG2L26 - Real Estate E Drive"). Opening volume in read-only metadata mode with limited write support

Jan 20 00:42:27 esxc31 vmkernel: 1:08:57:04.475 cpu5:4110)WARNING: J3: 1350: No space for journal on volume 4b192cf9-d96303d9-8c4b-0015600a874a ("RG4L46 - colo-hrapps d01"). Opening volume in read-only metadata mode with limited write support

Jan 20 00:42:33 esxc31 vmkernel: 1:08:57:11.150 cpu7:4111)WARNING: J3: 1350: No space for journal on volume 4a21e5f8-847d76be-2920-0015600a874a ("RG1L101 - dbtest2 d02"). Opening volume in read-only metadata mode with limited write support

Jan 20 00:42:56 esxc31 vmkernel: 1:08:57:33.465 cpu4:4163)WARNING: J3: 1644: Error freeing journal block (returned 0) <FB 5336> for 4a9802a2-c59ce5f5-6481-001635c4fc67: Lock was not free

Jan 20 00:44:10 esxc31 vmkernel: 1:08:58:48.286 cpu5:4163)WARNING: J3: 1644: Error freeing journal block (returned 0) <FB 46871> for 465b8edc-1d343b56-2600-00127990e7e9: Lock was not free

I've seen those journal errors before when not enough space was available on a LUN, but in this case it's flagging seemingly random LUNs that have at least 2 GB free, if not more. Anyone else had this problem? I have a case open with support, but it hasn't proven fruitful so far.
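As a quick cross-check of which datastores are actually tight on space, something like the following can flag near-full volumes. This is a minimal sketch: the canned df-style input here stands in for the real output of vdf/df against /vmfs/volumes on the host, and the 90% threshold is an arbitrary cutoff, not a VMware-documented limit.

```shell
# Flag VMFS volumes with more than 90% of their blocks used.
# The printf feeds canned df-style sample data (Filesystem, 1K-blocks,
# Used, Available, Use%, Mounted); on an ESX host you would pipe in
# real vdf/df output instead.
printf '%s\n' \
  'Filesystem 1K-blocks Used Available Use% Mounted' \
  '/vmfs/volumes/RG2L26 104857600 103809024 1048576 99% /vmfs/volumes/RG2L26' \
  '/vmfs/volumes/RG4L46 52428800 41943040 10485760 80% /vmfs/volumes/RG4L46' |
awk 'NR > 1 { pct = $3 / $2 * 100
              if (pct > 90) printf "%s: %.0f%% used -- journal may fail\n", $6, pct }'
# -> /vmfs/volumes/RG2L26: 99% used -- journal may fail
```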

5 Replies
Jasemccarty
Immortal

So your LUNs only have 2 GB available? What percentage of the LUN is that? It is a best practice to keep 10%-15% free.

Do you see any entries in the vmkernel log, or possibly in the vmkwarning log?

Jase McCarty

http://www.jasemccarty.com

Co-Author: VMware ESX Essentials in the Virtual Data Center (ISBN:1420070274) Auerbach

Co-Author: VMware vSphere 4 Administration Instant Reference (ISBN:0470520728) Sybex

Please consider awarding points if this post was helpful or correct

Jase McCarty - Field SA at PureStorage - @jasemccarty
maxx22
Contributor

I know that some of the LUNs it flagged had more than 10-15% free, though not all of them. Why is it best practice to leave that much free? Our environment is not 100% VM and changes way more often than I'd like. To stay as flexible as we can, we usually allocate a LUN based on the vmdk we want to put on it, leaving 1-2 GB free (or, for an OS volume with memory, 10-20 GB free depending on the memory allocated; after temp/swap files etc. are written we try to maintain at least 1 GB free). This lets us reclaim that LUN if we decommission the VM that uses it later. We did this because previously we just made large VM LUNs, which tended to stick all the VMs on 2 or 3 RAID groups; when half the systems were replaced we had lots of free space in VMware, but when our Oracle DBA requested another 300 GB for the HP-UX box and we didn't have that available in the SAN, we had to buy more drives.
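The per-vmdk allocation rule above can be sketched as a quick calculator. This is a hypothetical helper codifying our own rules of thumb (2 GB slack for data disks, memory plus roughly 15 GB for OS disks), not a VMware sizing recommendation:

```shell
# Rule-of-thumb LUN sizing from our allocation scheme (hypothetical helper):
#   data vmdk -> vmdk size + 2 GB slack
#   OS vmdk   -> vmdk size + guest memory (for the swap file) + 15 GB slack
lun_size() {
  vmdk=$1 mem=$2 kind=$3   # sizes in GB; kind is "os" or "data"
  if [ "$kind" = os ]; then
    echo $(( vmdk + mem + 15 ))
  else
    echo $(( vmdk + 2 ))
  fi
}
lun_size 40 4 os     # 40 GB OS disk, 4 GB RAM -> 59
lun_size 300 0 data  # 300 GB data disk        -> 302
```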

The errors I posted were from the vmkwarning log; here are the errors from the vmkernel log. I'm not too experienced with that log, so I'm not sure what these mean...

Jan 20 00:41:38 esxc31 vmkernel: gen 231162, mode 1, owner 4afedb1f-c4920afa-ed62-0014c23ddb4c mtime 5752789482156]on volume 'RG2L21 - Real

Jan 20 00:41:41 esxc31 vmkernel: 1:08:56:18.922 cpu5:4110)FS3: 2854: Lock is not free on volume 'RG4L46 - colo-hrapps d01'

Jan 20 00:41:41 esxc31 vmkernel: 1:08:56:18.991 cpu5:4110)FS3: 2762: Checking liveness of lock holders [type 10c00002 offset 6456320 v 88212, hb offset 3985408

Jan 20 00:41:41 esxc31 vmkernel: gen 3447, mode 1, owner 4b4b8746-23fb686d-584c-0017084f9008 mtime 714352]on volume 'RG4L46 - colo-hrapps d0

Jan 20 00:41:42 esxc31 vmkernel: 1:08:56:19.529 cpu7:4111)FS3: 2854: Lock [type 10c00002 offset 6451200 v 164522, hb offset 3803648

Jan 20 00:41:42 esxc31 vmkernel: gen 231162, mode 1, owner 4afedb1f-c4920afa-ed62-0014c23ddb4c mtime 5752789482156] is not free on volume 'RG2L21 - RealEstate E'

Jan 20 00:41:45 esxc31 vmkernel: 1:08:56:22.844 cpu7:4111)FS3: 2762: Checking liveness of lock holders on volume 'RG1L101 - dbtest2 d02'.

Jan 20 00:41:45 esxc31 vmkernel: 1:08:56:22.993 cpu5:4110)FS3: 2854: Lock [type 10c00002 offset 6456320 v 88212, hb offset 3985408

Jan 20 00:41:45 esxc31 vmkernel: gen 3447, mode 1, owner 4b4b8746-23fb686d-584c-0017084f9008 mtime 714352] is not free on volume 'RG4L46 - colo-hrapps d01'

Jan 20 00:41:46 esxc31 vmkernel: 1:08:56:23.618 cpu5:4110)FS3: 2762: Checking liveness of lock holders [type 10c00002 offset 6456320 v 88212, hb offset 3985408

Jan 20 00:41:46 esxc31 vmkernel: gen 3447, mode 1, owner 4b4b8746-23fb686d-584c-0017084f9008 mtime 714352]on volume 'RG4L46 - colo-hrapps d0

Jan 20 00:41:48 esxc31 vmkernel: 1:08:56:26.204 cpu4:4163)FS3: 2762: Checking liveness of lock holders [type 10c00002 offset 6460416 v 483354, hb offset 3813888

Jan 20 00:41:48 esxc31 vmkernel: gen 59546, mode 1, owner 4b517908-51845b73-9494-0013211d4fef mtime 339206581262]on volume 'RG3L31 - HarvCD

Jan 20 00:41:49 esxc31 vmkernel: 1:08:56:26.847 cpu7:4111)FS3: 2854: Lock is not free on volume 'RG1L101 - dbtest2 d02'

Jan 20 00:41:49 esxc31 vmkernel: 1:08:56:26.868 cpu7:4111)FS3: 2762: Checking liveness of lock holders [type 10c00002 offset 7099392 v 257364, hb offset 4169728

Jan 20 00:41:49 esxc31 vmkernel: gen 191, mode 1, owner 4b54e080-7caa0f96-7b49-001e0b6ec0e4 mtime 41087]on volume 'RG1L101 - dbtest2 d02'.

Jan 20 00:41:50 esxc31 vmkernel: 1:08:56:27.621 cpu5:4110)FS3: 2854: Lock [type 10c00002 offset 6456320 v 88212, hb offset 3985408

Jan 20 00:41:50 esxc31 vmkernel: gen 3447, mode 1, owner 4b4b8746-23fb686d-584c-0017084f9008 mtime 714352] is not free on volume 'RG4L46 - colo-hrapps d01'

Jan 20 00:41:50 esxc31 vmkernel: 1:08:56:27.903 cpu5:4110)FS3: 2762: Checking liveness of lock holders on volume 'RG4L46 - colo-hrapps d0

Jan 20 00:41:52 esxc31 vmkernel: 1:08:56:30.281 cpu4:4163)FS3: 2762: Checking liveness of lock holders [type 10c00002 offset 6383616 v 102790, hb offset 3813888

Jan 20 00:41:52 esxc31 vmkernel: gen 59556, mode 1, owner 4b517908-51845b73-9494-0013211d4fef mtime 339211685395]on volume 'RG3L302 - Accpr

Jan 20 00:41:53 esxc31 vmkernel: 1:08:56:30.871 cpu7:4111)FS3: 2854: Lock [type 10c00002 offset 7099392 v 257364, hb offset 4169728

Jan 20 00:41:53 esxc31 vmkernel: gen 191, mode 1, owner 4b54e080-7caa0f96-7b49-001e0b6ec0e4 mtime 41087] is not free on volume 'RG1L101 - dbtest2 d02'

Jan 20 00:41:53 esxc31 vmkernel: 1:08:56:30.892 cpu7:4111)FS3: 2762: Checking liveness of lock holders [type 10c00002 offset 7099392 v 257364, hb offset 4169728

Jan 20 00:41:53 esxc31 vmkernel: gen 191, mode 1, owner 4b54e080-7caa0f96-7b49-001e0b6ec0e4 mtime 41087]on volume 'RG1L101 - dbtest2 d02'.

Jan 20 00:41:54 esxc31 vmkernel: 1:08:56:31.905 cpu5:4110)FS3: 2854: Lock is not free on volume 'RG4L46 - colo-hrapps d01'

Jan 20 00:41:54 esxc31 vmkernel: 1:08:56:31.905 cpu5:4110)J3: 1275: Couldn't allocate journal. Garbage collect on volume 4b192cf9-d96303d9-8c4b-0015600a874a

Jan 20 00:41:56 esxc31 vmkernel: 1:08:56:34.283 cpu4:4163)FS3: 2854: Lock [type 10c00002 offset 6383616 v 102790, hb offset 3813888

Jan 20 00:41:56 esxc31 vmkernel: gen 59556, mode 1, owner 4b517908-51845b73-9494-0013211d4fef mtime 339211685395] is not free on volume 'RG3L302 - Accprodapp E'

Jan 20 00:41:57 esxc31 vmkernel: 1:08:56:34.895 cpu7:4111)FS3: 2854: Lock [type 10c00002 offset 7099392 v 257364, hb offset 4169728

Jan 20 00:41:57 esxc31 vmkernel: gen 191, mode 1, owner 4b54e080-7caa0f96-7b49-001e0b6ec0e4 mtime 41087] is not free on volume 'RG1L101 - dbtest2 d02'

Jan 20 00:41:57 esxc31 vmkernel: 1:08:56:34.910 cpu7:4111)FS3: 2762: Checking liveness of lock holders [type 10c00002 offset 7099392 v 257364, hb offset 4169728

Below that there's a string of these, which I think is normal, but I'm not sure...

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.487 cpu3:4109)ScsiScan: 839: Path 'vmhba1:C0:T0:L20': Vendor: 'DGC ' Model: 'RAID 5 ' Rev: '0224'

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.487 cpu3:4109)ScsiScan: 842: Path 'vmhba1:C0:T0:L20': Type: 0x0, ANSI rev: 4, TPGS: 0 (none)

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.488 cpu3:4109)ScsiScan: 839: Path 'vmhba1:C0:T0:L21': Vendor: 'DGC ' Model: 'RAID 5 ' Rev: '0224'

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.488 cpu3:4109)ScsiScan: 842: Path 'vmhba1:C0:T0:L21': Type: 0x0, ANSI rev: 4, TPGS: 0 (none)

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.488 cpu3:4109)ScsiScan: 839: Path 'vmhba1:C0:T0:L22': Vendor: 'DGC ' Model: 'RAID 5 ' Rev: '0224'

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.488 cpu3:4109)ScsiScan: 842: Path 'vmhba1:C0:T0:L22': Type: 0x0, ANSI rev: 4, TPGS: 0 (none)

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.489 cpu3:4109)ScsiScan: 839: Path 'vmhba1:C0:T0:L23': Vendor: 'DGC ' Model: 'RAID 5 ' Rev: '0224'

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.489 cpu3:4109)ScsiScan: 842: Path 'vmhba1:C0:T0:L23': Type: 0x0, ANSI rev: 4, TPGS: 0 (none)

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.489 cpu3:4109)ScsiScan: 839: Path 'vmhba1:C0:T0:L24': Vendor: 'DGC ' Model: 'RAID 5 ' Rev: '0224'

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.489 cpu3:4109)ScsiScan: 842: Path 'vmhba1:C0:T0:L24': Type: 0x0, ANSI rev: 4, TPGS: 0 (none)

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.489 cpu3:4109)ScsiScan: 839: Path 'vmhba1:C0:T0:L25': Vendor: 'DGC ' Model: 'RAID 5 ' Rev: '0224'

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.489 cpu3:4109)ScsiScan: 842: Path 'vmhba1:C0:T0:L25': Type: 0x0, ANSI rev: 4, TPGS: 0 (none)

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.490 cpu3:4109)ScsiScan: 839: Path 'vmhba1:C0:T0:L26': Vendor: 'DGC ' Model: 'RAID 5 ' Rev: '0224'

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.490 cpu3:4109)ScsiScan: 842: Path 'vmhba1:C0:T0:L26': Type: 0x0, ANSI rev: 4, TPGS: 0 (none)

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.490 cpu3:4109)ScsiScan: 839: Path 'vmhba1:C0:T0:L27': Vendor: 'DGC ' Model: 'RAID 5 ' Rev: '0224'

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.490 cpu3:4109)ScsiScan: 842: Path 'vmhba1:C0:T0:L27': Type: 0x0, ANSI rev: 4, TPGS: 0 (none)

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.491 cpu3:4109)ScsiScan: 839: Path 'vmhba1:C0:T0:L28': Vendor: 'DGC ' Model: 'RAID 5 ' Rev: '0224'

Jan 20 00:48:33 esxc31 vmkernel: 1:09:03:10.491 cpu3:4109)ScsiScan: 842: Path 'vmhba1:C0:T0:L28': Type: 0x0, ANSI rev: 4, TPGS: 0 (none)

Jasemccarty
Immortal

I definitely see some space issues in the first log.

I don't have any FC connected datastores anymore, but when I did, I always made sure to have at least 10%-15% free space.

We upgraded from 2.5 to 3.5 (unsupported) and now have a mixed 3.5/4.0/4.0U1 environment.

Going from 2.5 to 3.5, our datastores were FC-attached only, and we had no issues other than the VMFS2-to-VMFS3 upgrade.

Going from 3.5 to 4.0 (slowly), we also have NFS storage.

With approximately 400 guests, LUNs were sized to accommodate 25-30 guests per LUN, as we have a highly random workload.

Remember that the number of hosts connected, as well as the number of presented LUNs, can have an impact on things like queue depth, which can lead to problems if you have more requests being sent to your storage processor than it can accommodate.

NFS is a little different, so the configuration is somewhat different.

How many guests are you running?

maxx22
Contributor

I think we run light compared to other setups I've seen: 3-4 guests per blade, 16 blades in the datacenter with the issues. We have a similar setup in another DC running on a CX500, with the same very tight LUN-to-vmdk fits, but haven't run into this issue there. No NFS storage. Maybe this is just a space issue: some of our LUNs are 99% full with only a couple hundred MB free. When I set these up, I figured planning for just the vmdk wouldn't be a problem; I didn't realize at the time that VMFS writes large journals to the LUNs for metadata updates. I've been going through and tacking extents onto the ones that are full and throwing this error. Maybe the not-full ones that pop up are just side effects of the full ones?
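For sizing those extents, a back-of-the-envelope works: to bring a volume from its current free space up to a target free fraction t, you need to add x = (t * capacity - free) / (1 - t). A throwaway sketch (hypothetical helper name; the capacity/free numbers would come from something like `vmkfstools -P` on the volume):

```shell
# Given a LUN's capacity and free space (GB), how much extent space
# must be added so the grown volume has the target fraction free?
# Derivation: (free + x) / (cap + x) = target  =>  x = (target*cap - free) / (1 - target)
headroom() {
  cap=$1 free=$2 target=$3   # capacity GB, free GB, target free fraction
  awk -v c="$cap" -v f="$free" -v t="$target" \
      'BEGIN { x = (t * c - f) / (1 - t); if (x < 0) x = 0
               printf "add %.1f GB\n", x }'
}
headroom 100 0.5 0.10   # 100 GB LUN with 0.5 GB free, want 10% free
# -> add 10.6 GB
```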

Here's a question, then. Until I get this worked out, any VMFS change I make on one host triggers an automatic rescan of VMFS storage on all ESX hosts, which causes the VMs to lock up for a few minutes. To keep that from happening, is there a way to disable the automatic VMFS rescans in vCenter?

Jasemccarty
Immortal

That sounds like space is your issue.
