VMware Cloud Community
peetz
Leadership

SCSI reservation conflicts with ESX 3.0.1 accessing an HP XP12000

Hello all,

We are experiencing serious SCSI reservation issues in our ESX 3.0.1 / VC 2.0.1 environment.

This is our setup and the whole story:

Host hardware:

- 2 IBM xSeries 445 (each with 8 single-core CPUs and 32 GB RAM)

- 3 HP ProLiant DL585 (each with 4 dual-core CPUs and 32 GB RAM)

- 2 HP ProLiant DL580 (each with 4 single-core CPUs and 16 GB RAM)

We started with all servers running ESX 2.5.x attached to an EMC Symmetrix 8530. All servers used three 600 GB LUNs on this box. All have two QLogic HBAs in them. No issues.

Then we started our migration to ESX3. At the same time we also needed to migrate to new SAN storage: six 400 GB LUNs on an HP XP12000. We used the brand-new "VMotion with storage relocation" feature to do both migrations at once. At the beginning this worked really well.

So we re-installed all hosts one after the other with ESX3, attached the new storage LUNs to them (in addition to the old ones) and migrated the VMs from the not-yet-upgraded hosts to the already-upgraded hosts and the new storage.

We started with the three DL585s and were very pleased with the speed and the reliability of the process.

However, when we re-installed the first IBM host the trouble began. All sorts of VM-related procedures (e.g. storage relocation, hot and cold, powering on VMs, VMotion, creating new VMs) failed with all sorts of error messages in VirtualCenter. Looking at the vmkernel logs of the hosts we discovered the reason: excessive SCSI reservation conflicts. The messages look like this, for example:

Nov 14 13:29:43 frasvmhst06 vmkernel: 0:00:03:34.249 cpu4:1045)WARNING: SCSI: 5519: Failing I/O due to too many reservation conflicts

Nov 14 13:29:43 frasvmhst06 vmkernel: 0:00:03:34.249 cpu4:1045)WARNING: SCSI: 5615: status SCSI reservation conflict, rstatus 0xc0de01 for vmhba2:0:0. residual R 919, CR 0, ER 3

Nov 14 13:29:43 frasvmhst06 vmkernel: 0:00:03:39.086 cpu4:1045)FSS: 343: Failed with status 0xbad0022 for f530 28 2 453782fc 6b8bc9e9 1700770d 1d624ca 4 4 1 0 0 0 0 0
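A quick way to see how often this happens is to count the conflicts in the vmkernel log from the service console (standard ESX 3 log location assumed; adjust if you log elsewhere):

# count the reservation conflicts logged since boot
grep -c "reservation conflict" /var/log/vmkernel
# show the most recent occurrences with their timestamps
grep "reservation conflict" /var/log/vmkernel | tail -20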

Things we have tried so far to make it better:

- filed an SR with VMware. No helpful answers yet.

- checked the firmware code of the XP12000. It is the latest: 50.07.64.

- distributed the SAN load across the two HBAs in each host (three LUNs fixed on the first path, the other three fixed on the second). This helped a lot(!), but we still had frequent reservation conflicts.

- updated all HBAs to the latest EMC-supported BIOS (version 1.47). Did not change anything.

- doubled the HBAs' queue depth to 64. Doesn't seem to help (a sketch of the commands is below this list).
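For reference, the queue depth change on QLogic HBAs is documented in the SAN configuration guide and boils down to something like this (the module name qla2300_707 is what our QLogic 23xx HBAs use; double-check yours with vmkload_mod -l before running anything):

# set the QLogic queue depth (here 64), rebuild the boot configuration, then reboot the host
esxcfg-module -s ql2xmaxqdepth=64 qla2300_707
esxcfg-boot -b
# afterwards, verify the paths and the active policy per LUN
esxcfg-mpath -l

The fixed/preferred path per LUN can also be set in the VI Client's Manage Paths dialog instead of on the command line.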

In the meantime we have updated all seven hosts and migrated all 124 VMs to the new storage. The old EMC storage is still connected to all hosts but is unused. We even unloaded the VMFS2 driver, as advised somewhere in the SAN configuration guide (the command is shown after this paragraph). So everything should be quiet now. However, we still see sporadic SCSI reservation conflicts, although there is no storage relocation or VMotion etc. in progress! Even if we just reboot a host, it will generate these errors when initializing its SAN storage access.
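Unloading the VMFS2 driver itself is a one-liner in the service console (from memory, so verify the module name on your hosts first):

# check whether the VMFS2 module is loaded at all
vmkload_mod -l | grep -i vmfs2
# unload it (only sensible once no VMFS2 volumes are in use anymore)
vmkload_mod -u vmfs2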

What's wrong here? Are we already driving VMware to its limits by having 7 hosts accessing 6 LUNs concurrently?

Is it the IBM hardware? Is it ESX3 not properly releasing SCSI locks?

I'd love to read comments from people who have similar problems with maybe even similar hardware configurations, or better: no issues with a similar hardware configuration (especially IBM hosts accessing an XP12000).

- Andreas

Twitter: @VFrontDe, @ESXiPatches | https://esxi-patches.v-front.de | https://vibsdepot.v-front.de
Accepted Solutions
fabian_bader
Enthusiast

There is a new KB article out (http://kb.vmware.com/KanisaPlatform/Publishing/725/8411304_f.SAL_Public.html) with the title "VMotion Failure of Virtual Machines Located on LUSE LUNs on HP XP 10000 and 12000".

Have the HP technician change the Host Mode Option to 19. The Host Mode for the LUN should be 0C (Windows).
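After HP has changed the option it is probably worth rescanning from each ESX host (or rebooting it) so that the change is picked up; the rescan from the service console looks like this (adapter names will differ per host):

# rescan both fibre channel adapters on every host after the array-side change
esxcfg-rescan vmhba1
esxcfg-rescan vmhba2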

Greets, Fabian

51 Replies
fcarvalho
Contributor

Have you already migrated from VMFS2 to VMFS3?

Do you still have ESX 2 installed in your infrastructure?

peetz
Leadership

No, it's all on ESX3 and VMFS3 now.

fcarvalho
Contributor

Make sure you don't have any other operating system connected to the VMFS LUN.

How many hosts are connected to that VMFS volume?

fcarvalho
Contributor

Are the HBAs 2 Gb or 4 Gb?

Anders_Gregerse
Hot Shot

That's right, there is an issue with 2 Gb vs. 4 Gb HBAs. If you have 2 Gb HBAs you need to install the 2 Gb driver for certain models (the 4 Gb driver is installed by default). I think it is mentioned in the release notes.

fcarvalho
Contributor

Try changing to 2 Gb, because ESX 3 only supports this SAN with 2 Gb HBAs.

Check this document: http://www.vmware.com/pdf/vi3_san_guide.pdf

peetz
Leadership

Thanks for your comments.

- All HBAs are 2 Gb.

- The VMFS3 LUNs are shared by 7 ESX3 hosts and no other machines.

Any other suggestions?

jhanekom
Virtuoso

Some SCSI reservation conflict timeouts during normal operation seem to be fine. We get them during periods of high load with no apparent ill-effects.

We had an issue on 2.5.x on EMC Clariions that seemingly was caused by the way cabling was done. The Clariions are active/passive units, which was a large factor in the problem, but it could still be worth investigating on your side.

Essentially, we were told that ESX Servers that share a LUN want to be able to access it through similar paths. That is, each ESX server must see the SAN SP targets in the same order. (In the ESX 2.5 world, you could run "wwpn.pl -v" to get this. I forget what it is in 3.x.)

Take a careful look at the following KB article: http://kb.vmware.com/kb/1301

It's deceptively simple at first... take a careful look at your fibre patching, however, and see if it matches that config. In a nutshell, the order in which the ports of each storage processor are connected to the switch fabric determines the order in which ESX enumerates them as targets, which could affect stability.

Also ensure that HBA1 on each of your servers is connected to the same fabric, with HBA2 on each server connected to the other fabric. (Assuming a split-fabric layout.)
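In 3.x I believe the same information is available through esxcfg-mpath; comparing its output across all hosts should quickly show whether they enumerate the targets in the same order (I'm quoting from memory, check the exact options with esxcfg-mpath --help):

# list every path, target and the current policy per LUN on this host
esxcfg-mpath -l
# collect the output from each ESX server and diff it to spot ordering differences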

fcarvalho
Contributor

You have a Fixed path policy; have you already tried changing it to MRU and checking the logs?

Have you installed IBM Director?

Randy_Evans
Contributor

We are having the same problem with ESX 3.0.1 and HP XP 12000 storage. While the virtual machines power up and run without problems, we get failures of several operations such as VirtualCenter cold migrations and cloning, and vmkfstools cloning. In all cases, the vmkernel log has messages about reservation conflicts. For example:

Nov 14 17:07:22 ht03b01a01 vmkernel: 15:08:10:09.810 cpu3:1033)SCSI: vm 1033: 5509: Sync CR at 64

Nov 14 17:07:23 ht03b01a01 vmkernel: 15:08:10:10.726 cpu3:1033)SCSI: vm 1033: 5509: Sync CR at 48

Nov 14 17:07:24 ht03b01a01 vmkernel: 15:08:10:11.828 cpu3:1033)SCSI: vm 1033: 5509: Sync CR at 32

Nov 14 17:07:25 ht03b01a01 vmkernel: 15:08:10:12.885 cpu0:1033)SCSI: vm 1033: 5509: Sync CR at 16

Nov 14 17:07:26 ht03b01a01 vmkernel: 15:08:10:13.933 cpu0:1033)SCSI: vm 1033: 5509: Sync CR at 0

Nov 14 17:07:26 ht03b01a01 vmkernel: 15:08:10:13.933 cpu0:1033)WARNING: SCSI: 5519: Failing I/O due to too many reservation conflicts

Nov 14 17:07:26 ht03b01a01 vmkernel: 15:08:10:13.933 cpu0:1033)WARNING: SCSI: 5615: status SCSI reservation conflict, rstatus 0xc0de01 for vmhba0:0:15. residual R 919, CR 0, ER 3

Nov 14 17:07:26 ht03b01a01 vmkernel: 15:08:10:13.933 cpu0:1033)FSS: 343: Failed with status 0xbad0022 for f530 28 2 4559f9fb dfc9448 13001a37 e5b1e21 4 1 0 0 0 0 0

It is very interesting that we can access EVA-hosted LUNs through the XP front end without any failures at all. Everything is the same except the SAN hosting the LUN. The failures only occur with XP LUNs. EVA LUNs never fail.

All our software, hardware, and firmware are at supported versions and configurations.

We have calls open with both VMware and HP.

peetz
Leadership

Many thanks for your info.

We too have calls open with both VMware and HP.

Please let me know (here) if you get any good answers.

\- Andreas

peetz
Leadership

What HBAs are you using? QLogic or Emulex?

Have you tried changing the Queue depth of the HBAs?

garybrown
Enthusiast

Just curious - did you make any progress on this issue? I have the same problems and wondered whether you had found a resolution.

peetz
Leadership

Unfortunately, no.

My issue is at VMware engineering right now, and they just told me that it will most probably require a patch to ESX to resolve this kind of problem.

However, this means that it will not be fixed until the "next maintenance release" in "some months or so".

Hopefully I can get some kind of private fix before this official release. Otherwise I'm doomed.

If you have not yet filed an SR with VMware regarding this problem, then please do so now. They told me that they currently have one more case open that is similar to mine. There should be more such cases, many more, to raise the pressure on VMware and its engineering team...

Regards

Andreas

garybrown
Enthusiast

Ah, in that case I will open a case again.

I did already open a case, but they said 'it's your SAN - get your SAN fixed and the problem will be resolved' and closed the call!!!!

If you do get anything, can you post back here? I will do likewise.

thanks

Gary

Maurice_Perreij
Contributor

I had the exact same problem. Everything worked fine until we created a DRS cluster and moved two servers into that cluster.

I resolved the problem by moving the hosts out of the DRS cluster.

peetz
Leadership

Hello Maurice,

this is interesting. There may be a connection with clustering and DRS in our case, too. We disabled DRS and HA very soon after the problems started. However, 5 of the 7 hosts are still in the cluster right now.

Anyway, this is not really a solution to the problem. Of course we want to use Clustering, DRS and HA ...

Andreas

Adrian_Saltmars
Contributor

Hi Maurice,

I am now working on creating the clones on the SAN, but it still fails with differing errors. This is after removing the hosts from the DRS cluster.

BUT - the problem got worse: I now can't create clones on the internal store either. It doesn't matter if I use a SAN-based source or a local source from another ESX host.

Still WIP.

peetz
Leadership

Hi Gary,

you may want to open a case with HP, too. We did this; they checked our SAN setup and found no errors. I even passed the HP call ID to VMware and they got in contact with HP to talk about the issue (with no apparent results, of course...). This way you can make sure that VMware won't send you away again with a "Go and fix your SAN".

If you drop me a private message I will send you my call IDs with HP and VMware so that you can reference my calls to help them compare our setups and find similarities or whatever.

By the way, our SAN setup is quite complex: we have three Brocade 4100 switches and an HP Multi Protocol Router (MPR) on the way between the VMware hosts and the XP12000. I wonder if this causes or adds to the errors we are seeing. I'd love to hear that your SAN setup is simpler so that we can rule this out.

Another point is that we use LUSE (LUN size expansion) on the XP12000 to construct the VMFS LUNs from multiple smaller LUNs. Are you using this too?

And what about multi-pathing? We have two HBAs in each host and use different paths to the XP12000 in an active-active configuration. This might play a role in our issue. At least we observed that switching the paths on the ESX hosts would reduce the SCSI reservation conflicts. On the other hand, the issue may originally be caused by a bug in ESX not being able to properly handle different paths to the same storage. Are you using multi-pathing?

Thanks and best regards

Andreas
