jrr001
Enthusiast

Hitachi HDS USPV SAN errors and queue depth

Almost finished with a migration from IBM SVC storage to Hitachi HDS USPV storage. Using ESX 3.5 Update 3 and VC Update 3, with 1 cluster of 10 servers seeing 20 HDS LUNs (and other LUNs still from the IBM SVC).

Errors in the VMkernel logs:

Jan 11 18:16:54 esx3srv0104ph vmkernel: 29:06:37:01.612 cpu11:1051)WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts

Jan 11 18:16:54 esx3srv0104ph vmkernel: 29:06:37:01.612 cpu11:1051)WARNING: FS3: 4785: Reservation error: SCSI reservation conflict

Jan 18 19:30:32 esx3srv0101ph vmkernel: 43:07:20:18.293 cpu13:3468)StorageMonitor: 196: vmhba1:12:14:0 status = 24/0 0x0 0x0 0x0

Jan 18 19:30:32 esx3srv0101ph vmkernel: 43:07:20:18.293 cpu3:1051)SCSI: vm 1051: 109: Sync CR at 64

Jan 18 19:30:35 esx3srv0101ph vmkernel: 43:07:20:22.039 cpu1:1051)SCSI: vm 1051: 109: Sync CR at 48

Jan 18 19:36:27 esx3srv0101ph vmkernel: 43:07:26:13.235 cpu5:1051)SCSI: vm 1051: 109: Sync CR at 32

Jan 18 19:36:28 esx3srv0101ph vmkernel: 43:07:26:14.300 cpu4:1051)SCSI: vm 1051: 109: Sync CR at 16

Jan 18 19:36:29 esx3srv0101ph vmkernel: 43:07:26:15.322 cpu6:1051)SCSI: vm 1051: 109: Sync CR at 0

Jan 18 19:36:29 esx3srv0101ph vmkernel: 43:07:26:15.323 cpu6:1051)WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts

Jan 17 22:34:27 esx3srv0101ph vmkernel: 42:10:24:14.560 cpu1:3354)StorageMonitor: 196: vmhba1:12:14:0 status = 0/2 0x0 0x0 0x0

Jan 17 22:34:28 esx3srv0101ph vmkernel: 42:10:24:15.559 cpu1:2478)StorageMonitor: 196: vmhba1:12:14:0 status = 0/2 0x0 0x0 0x0

Jan 15 21:49:50 esxdr0103acc vmkernel: 23:08:00:03.852 cpu10:1055)WARNING: SCSI: 2934: CheckUnitReady on vmhba2:12:11 returned Storage initiator error 0x7/0x0 sk 0x0 asc 0x0 ascq 0x0

Jan 15 21:49:50 esxdr0103acc vmkernel: 23:08:00:03.852 cpu10:1055)WARNING: SCSI: 2934: CheckUnitReady on vmhba2:13:11 returned Storage initiator error 0x7/0x0 sk 0x0 asc 0x0 ascq 0x0

Queue depth for the QLogic 2340 cards is set to 64 now; the storage vendor is suggesting we back it down to 2 or 4 as a maximum, which really seems like it would keep us from getting the SAN throughput we should with those settings.

Questions:

What is the normal queue depth setting most people use in larger setups like the one ours is becoming? (32 is the QLogic default.)

I have read that the number of LUNs shared per VMware cluster is usually in the 20-25 range. How many do you use?

Disk.UseDeviceReset=0, per the SAN configuration guide.

Disk.SchedNumReqOutstanding matches the queue depth setting (commands sketched below).

We have up to the high 20s of VMs on some VMFS LUNs, and fewer on others. The maximum I hear we should stay below is 32.
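For reference, here is a rough sketch of how these settings get checked and applied from the ESX 3.5 service console. The qla2300_707_vmw module name is an assumption for the stock QLogic driver on our hosts; confirm yours with esxcfg-module -l before using it.

# list loaded modules to confirm the QLogic driver module name (assumed here to be qla2300_707_vmw)
esxcfg-module -l | grep qla

# set the HBA queue depth option, rebuild the boot image, then reboot the host
esxcfg-module -s "ql2xmaxqdepth=64" qla2300_707_vmw
esxcfg-boot -b

# keep the VMkernel's outstanding-request limit in line with the HBA queue depth
esxcfg-advcfg -s 64 /Disk/SchedNumReqOutstanding
esxcfg-advcfg -g /Disk/SchedNumReqOutstanding

# per the SAN configuration guide: prefer LUN reset over device reset
esxcfg-advcfg -s 0 /Disk/UseDeviceReset
esxcfg-advcfg -s 1 /Disk/UseLunReset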

Any other advice or input appreciated.

Points will be awarded!!!!!

6 Replies
ChadAEG
Contributor

First, I believe it is still the HDS recommendation to set the host mode on the USPV to 0A (NetWare) with option 19, so make sure your HDS engineer sets this correctly. Even with those settings in place, we had a similar issue on our HDS SANs due to the different levels of SCSI reservation the two systems use. The solution was to use smaller LUNs with fewer VMs. In my experience, 7-10 per LUN is about as high as I can go before I start seeing these errors in the log. It's not the number of LUNs, but the number of VMs on the same LUN, that is causing your issue.

Supposedly this will improve with ESX 4, when VMware implements SCSI-3 reservations as an option.

williambishop
Expert

I would take the queue depth back to 32 (certainly not 4, though) and test that. 64 is not often necessary. And we have anywhere from 1 to 200 VMs on a LUN, so there's no hard and fast rule.

--"Non Temetis Messor."
kjb007
Immortal

I would lower this to 32 as well. I also use HDS and have not needed to change this value, and I have VMs per LUN in the same range as you. Have you configured the storage to match the queue depth that you set on the HBA? Since you are going through two layers, the SVC and then the array, have you made sure the values are the same everywhere?

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise
jrr001
Enthusiast

We have returned to queue depth = 32 and the other best practices:

Disk.SchedNumReqOutstanding=32


VirtualCenter -> Configuration Tab -> Advanced Settings -> Disk -> Disk.UseLunReset=1 , Disk.UseDeviceReset=0

The issue appears to be with our SATA LUNs in particular.

We load-balanced the LUNs to have fewer than 15 VMs per LUN.

esxtop shows definite signs of storage latency.
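For anyone following along, a rough sketch of what we are watching in esxtop; the key presses and column names below are from the ESX 3.5 storage screens and may vary slightly by build.

esxtop
# press 'd' for the disk adapter view or 'u' for the per-LUN device view
# columns we watch:
#   DAVG/cmd - average device (array/fabric) latency per command, in ms
#   KAVG/cmd - average time a command waits in the VMkernel, which climbs when queues fill
#   GAVG/cmd - DAVG + KAVG, roughly the latency the guest sees
#   QUED     - commands held because the LUN/HBA queue is full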

Hitachi is now very involved in looking for where the problem is.

Will update with solution found....


bobross
Hot Shot
Accepted Solution

There is a very good paper describing how SCSI reservations come about on ESX. One thing that jumped out at me is your use of SATA LUNs for even 15 VMs; that is a mismatch of drive type with workload. The one sure way to reduce SCSI reservation conflicts is to reduce the number of VMs per LUN. This is especially true during operations that modify metadata and force a reservation to occur, e.g. backup, replication, etc.

klich
Enthusiast

We are working through a similar issue.

On the USPV, each LUN has a queue size of 32 and each port has a total queue size of 2048 (this was 32 and 1024 on the USP).

To determine the HBA queue depth, divide 32 by the number of hosts with active I/O to a LUN.

For example, if you limit each LUN to 8 virtual machines, you will never have more than 8 ESX hosts with active I/O to that LUN at a given time:

32 / 8 = HBA queue depth of 4

What we are still investigating is whether, once you reach 64 LUNs, we would need to utilize another port pair (32 * 64 = 2048, the total queue size for the port).
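A quick sketch of that arithmetic, using the USPV per-LUN and per-port queue sizes noted above; the 8 hosts is just the example number.

# USPV per-LUN and per-port queue sizes (from above)
LUN_QUEUE=32
PORT_QUEUE=2048

# example: at most 8 ESX hosts with active I/O to any single LUN
HOSTS_PER_LUN=8
echo $(( LUN_QUEUE / HOSTS_PER_LUN ))   # -> 4, the suggested HBA queue depth

# LUNs per port before the port queue is theoretically saturated
echo $(( PORT_QUEUE / LUN_QUEUE ))      # -> 64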

Besides the HBA queue depth, also limit your fan-in ratio to the USPV ports. We use a 6:1 ratio, with only ESX servers on the ports (do not share the ports with hosts from other platforms). If you are virtualizing external arrays behind the USPV, be sure to carry that same ratio from the USPV to the external array as well: host-to-USPV 6:1, USPV-to-external-array 6:1.

Hope this helps.