Reservation error: SCSI reservation conflict

miyako · ‎10-16-2009

Dear VMware support

My customer keeps running into LUN lock-ups: the VMs stop responding, and the vmkernel log (vmkernel.34) reports the following:

Sep 30 05:00:36 dtesx01 vmkernel: 5:04:13:22.215 cpu2:4116)WARNING: ScsiDeviceIO: 1374: I/O failed due to too many reservation conflicts. naa.600d02310006257f0000000442248d24 (920 0 3)

Sep 30 05:00:36 dtesx01 vmkernel: 5:04:13:22.215 cpu2:4116)WARNING: FS3: 6509: Reservation error: SCSI reservation conflict
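
To see which devices are affected and how often, the warnings above can be counted per device. This is only a minimal sketch: the regular expression is based solely on the two log lines quoted here, and the function name `count_conflicts` is made up for illustration. Other ESX builds may format the message differently.

```python
import re
from collections import Counter

# Pattern derived from the warning quoted above; the exact message format
# on other ESX versions is an assumption.
CONFLICT_RE = re.compile(
    r"I/O failed due to too many reservation conflicts\.\s+"
    r"(?P<device>naa\.[0-9a-fA-F]+)"
)


def count_conflicts(lines):
    """Return a Counter mapping device ID -> number of conflict warnings."""
    counts = Counter()
    for line in lines:
        match = CONFLICT_RE.search(line)
        if match:
            counts[match.group("device")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Sep 30 05:00:36 dtesx01 vmkernel: 5:04:13:22.215 cpu2:4116)WARNING: "
        "ScsiDeviceIO: 1374: I/O failed due to too many reservation conflicts. "
        "naa.600d02310006257f0000000442248d24 (920 0 3)",
        "Sep 30 05:00:36 dtesx01 vmkernel: 5:04:13:22.215 cpu2:4116)WARNING: "
        "FS3: 6509: Reservation error: SCSI reservation conflict",
    ]
    for device, n in count_conflicts(sample).items():
        print(device, n)
```

The same function can be fed an open log file, since iterating over a file yields its lines.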

The customer says they have 2 ESX servers in a cluster. A LUN on our storage is mapped to both ESX servers as shared storage. They installed 10 VMs and assigned 5 VMs to each ESX server.

After all VMs have booted and have been running IOMeter for a while, "I/O failed due to too many reservation conflicts" and "Reservation error: SCSI reservation conflict" are reported, and the LUN appears to lock up.

Customer's environment:

2 or more ESX servers, with each set of 2 ESX servers created as a cluster (both ESX 3.5 and 4.0 show the same issue)

4G HBA: Emulex (HP FC2142SR 4G PCIe, which corresponds to the LPe1150). Our qualified 4G HBA is the LP1150.

FC 8G storage: S16F-R1840/3.73K08

FC switch: Qlogic 5800

Log files are attached.

Searching the KB, I found that the VMware ESX 4 release notes mention this kind of issue at http://www.vmware.com/support/vsphere4/doc/vsp_esx40_vc40_rel_notes.html

They say:

On rare occasions, after repeated path failovers to a particular SAN LUN, attempts to perform such operations as VMFS datastore creation, VMotion, and so on might fail on all ESX/ESXi hosts accessing this LUN. The following warnings appear in the log files of all hosts:

I/O failed due to too many reservation conflicts.

Reservation error: SCSI reservation conflict

If you see the reservation conflict messages on all hosts accessing the LUN, this indicates that the problem is caused by the SCSI reservations for the LUN that are not completely cleaned up.

Workaround: Run the following LUN reset command from any system in the cluster to remove the SCSI reservation:

vmkfstools -L lunreset /vmfs/devices/disks/

Unfortunately, the user has tried the LUN reset command above, but it did not solve the problem. Besides, we cannot reproduce it in our lab.

I reviewed the customer's logs (vmkernel.36 and vmkernel.35). Before "I/O failed due to too many reservation conflicts" was reported, the following messages kept appearing for about 5 hours:

Sep 30 01:02:16 dtesx01 vmkernel: 5:00:15:02.682 cpu1:7446)ScsiScan: 839: Path 'vmhba1:C0:T2:L0': Vendor: 'IFT ' Model: 'S16F-R1840 ' Rev: '373K'

Sep 30 01:02:16 dtesx01 vmkernel: 5:00:15:02.682 cpu1:7446)ScsiScan: 842: Path 'vmhba1:C0:T2:L0': Type: 0x1f, ANSI rev: 4, TPGS: 0 (none)

Sep 30 01:02:16 dtesx01 vmkernel: 5:00:15:02.682 cpu1:7446)ScsiScan: 105: Path 'vmhba1:C0:T2:L0': Peripheral qualifier 0x3 not supported

Sep 30 01:02:16 dtesx01 vmkernel: 5:00:15:02.682 cpu1:7446)ScsiNpiv: 1304: GetInfo for adapter vmhba1, , max_vports=0, vports_inuse=0, linktype=0, state=1, failreason=0, rv=0, sts=0
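
For reference when reading the ScsiScan lines: in the standard SCSI INQUIRY response, byte 0 packs a 3-bit peripheral qualifier and a 5-bit peripheral device type. As far as I understand the SPC specification, qualifier 0x3 together with type 0x1f means the target cannot support a device on that LUN, so these lines may only reflect a scan probing an unmapped LUN 0. That reading is my assumption, not something confirmed by VMware. A minimal sketch of the decoding (the function names are hypothetical, not part of any VMware tool):

```python
def decode_inquiry_byte0(byte0):
    """Split INQUIRY byte 0 into (peripheral_qualifier, peripheral_device_type)."""
    qualifier = (byte0 >> 5) & 0x07  # upper 3 bits
    device_type = byte0 & 0x1F       # lower 5 bits
    return qualifier, device_type


def lun_not_supported(byte0):
    """True when the byte carries qualifier 0x3 and type 0x1f, as in the log above."""
    qualifier, device_type = decode_inquiry_byte0(byte0)
    return qualifier == 0x3 and device_type == 0x1F


if __name__ == "__main__":
    # 0x7f is the byte value that combines qualifier 0x3 with type 0x1f.
    print(decode_inquiry_byte0(0x7F))
    print(lun_not_supported(0x7F))
```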

I'm wondering whether these events are related to the "SCSI reservation conflict" issue. If so, do you have any suggestions for our customer?

Is this a known VMware issue? How can we determine whether it is a storage issue, or an HBA and storage compatibility issue?

We have been stuck on this for a long time. Please kindly comment.

Thank you very much

Peggy

Peggy.Wu

TobiasKracht · ‎10-16-2009

Resolving SCSI reservation conflicts in ESX 4

SCSI Reservation Issue with Fibre Channel HBAs

StarWind Software R&D http://www.starwindsoftware.com

erickmiller · ‎10-17-2009

Hi miyako,

I'm assuming you are referring to our environment. The FC Switches are actually HP StorageWorks 4/16 switches (rebranded Silkworm 200E switches). Also, the diagram isn't right. All 4 clusters have Disk.UseLUNReset = 1, and as of a day or so ago, all nodes have Disk.UseDeviceReset = 0.

After finding that some of the hosts had Disk.UseDeviceReset = 1 (it should be 0 for a shared storage environment) and fixing this, we have not seen any SCSI I/O Reservation Conflicts that cause LUN locks on our ESX 3.5 clusters (yet... it hasn't been long enough to feel 100% confident that it won't happen again). We had a "lot" of activity on all nodes last night running a "huge" number of parallel backups, each of which creates a snapshot, and thus quite a few SCSI I/O Reservations.
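
Since a single host with the wrong value was enough to cause trouble here, it may be worth comparing every node against the values mentioned in this post (Disk.UseLUNReset = 1, Disk.UseDeviceReset = 0). Below is a rough sketch of such a check. The host names and the `inventory` dictionary are made up for illustration, and collecting the real values from each host is left out.

```python
# Values described in this post for hosts sharing storage.
EXPECTED = {"Disk.UseLUNReset": 1, "Disk.UseDeviceReset": 0}


def find_mismatches(host_settings, expected=EXPECTED):
    """Return {host: {setting: (actual, wanted)}} for hosts that deviate."""
    mismatches = {}
    for host, settings in host_settings.items():
        diffs = {
            name: (settings.get(name), wanted)
            for name, wanted in expected.items()
            if settings.get(name) != wanted
        }
        if diffs:
            mismatches[host] = diffs
    return mismatches


if __name__ == "__main__":
    # Hypothetical host names and values, for illustration only.
    inventory = {
        "esx-a": {"Disk.UseLUNReset": 1, "Disk.UseDeviceReset": 0},
        "esx-b": {"Disk.UseLUNReset": 1, "Disk.UseDeviceReset": 1},
    }
    print(find_mismatches(inventory))
```

A setting that is missing for a host is reported as `None`, so an incomplete inventory shows up as a mismatch rather than passing silently.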

On the 2-node vSphere cluster, everything was working all night with no problems. Backups finished at 6am on one node and 8am on the other node. Strangely enough, at 11:45am this morning, when there was little or no I/O, one of the LUNs locked up on that cluster, and a LUN reset was necessary to bring it back online for one of the hosts.

Eric K. Miller, Genesis Hosting Solutions, LLC http://www.genesishosting.com/ - Lease part of our ESX cluster!

erickmiller · ‎10-17-2009

Hi Tobias,

Thanks for the links!

I hadn't seen the second link before. I'm going to investigate further. Since we have multiple SANs, we have to be very cautious in changing this, but the vSphere cluster is only presented to the Infortrend, so it would be an easy test to see if this solves the problem.

I'll let you know if I find any more information about this setting change and ultimately whether it works.

Thanks!

Eric K. Miller, Genesis Hosting Solutions, LLC http://www.genesishosting.com/ - Lease part of our ESX cluster!

miyako · ‎10-18-2009

Hi Eric,

I have had difficulty reproducing this issue in our lab, but we would truly like to do something to help. Therefore, I posted this to ask for help from VMware.

We also wonder whether the second link (SCSI Reservation Issue with Fibre Channel HBAs) can fix the "SCSI reservation conflict".

I will also keep tracking this issue through your replies here.

Thanks for your patience.

Peggy.Wu