VMware Cloud Community
erickmiller
Enthusiast

Infortrend S16F-R1840 and excessive LUN locking

Has anyone used Infortrend's S16F-R1840 fiber channel SAN?

We've had various problems from day one. A recent firmware update solved many of them, but we still often get SCSI I/O Reservation Conflicts that result in a locked LUN (a LUN inaccessible to all but one host).

I'm quite familiar with the problems that result in re-tried I/O reservations due to conflicts during a reservation. We have some HP and IBM DS3400 SANs that have "no" problems whatsoever, so this is a relatively new problem with a new SAN.

I've been in almost constant communication with Infortrend's support regarding this issue, and have chased a number of possible causes, but none have panned out (firmware updates, ESX upgrades to the latest version, BIOS updates, fiber channel switch port error counter checks, etc.).

We have both ESX 3.5 and ESX 4.0 clusters attached. The problem appears to occur at random times; we've even had it happen on a 2-node cluster by itself. I thought it was related to high I/O, but I've seen a LUN get locked with little I/O. Recently, it occurred while backups (using esXpress) and some Storage VMotions were running at the same time, so I can understand that there will be some conflicts in this case, but nothing that should bring a whole LUN down. Since we push our HP and IBM SANs to their limits with the same workloads and have no problems at all, I'm confident in saying that this is an Infortrend problem.

All hardware is on the HCL (including the Infortrend), and includes mostly HP DL385 G5p's with FC2142SR (Emulex LPe1150) controllers or HP DL365 G5's with FC2242 (Emulex LPe11002) controllers (dual-port version of the LPe1150). Fiber channel switches are HP StorageWorks 4/16 switches (Brocade Silkworm 200E switches).

So, I'm looking for others that have had similar issues, since this is driving me nuts and we're about ready to return this unit (not going to be a great day if it comes to that).

Eric K. Miller, Genesis Hosting Solutions, LLC http://www.genesishosting.com/ - Lease part of our ESX cluster!
11 Replies
erickmiller
Enthusiast

A quick update... it appears as though we accidentally had a couple hosts on the fiber channel network that still had Disk.UseDeviceReset = 1 (the default during an ESX install for some strange reason, seeing as how it usually causes most storage problems). Apparently this really caused a lot of havoc with the Infortrend, so be sure to set this to 0 if you have one of these units.
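
For anyone auditing more than a couple of hosts, here is a minimal Python sketch of that check. It assumes you have already captured the output of `esxcfg-advcfg -g /Disk/UseDeviceReset` from each host, and that the output looks like "Value of UseDeviceReset is 1"; that format and the host names are assumptions, so verify them against your own systems.

```python
import re

# Assumed output format of `esxcfg-advcfg -g /Disk/UseDeviceReset`,
# e.g. "Value of UseDeviceReset is 1". Verify against your own hosts.
_VALUE_RE = re.compile(r"Value of (\w+) is (-?\d+)")


def parse_advcfg_output(text):
    """Return (option_name, integer_value) parsed from the command output."""
    match = _VALUE_RE.search(text)
    if match is None:
        raise ValueError("unrecognized esxcfg-advcfg output: %r" % text)
    return match.group(1), int(match.group(2))


def hosts_with_device_reset(outputs):
    """Given {hostname: command output}, list hosts where UseDeviceReset is not 0."""
    flagged = []
    for host, text in sorted(outputs.items()):
        name, value = parse_advcfg_output(text)
        if name == "UseDeviceReset" and value != 0:
            flagged.append(host)
    return flagged


if __name__ == "__main__":
    # Hypothetical host names and captured output, for illustration only.
    captured = {
        "esx01": "Value of UseDeviceReset is 0",
        "esx02": "Value of UseDeviceReset is 1",
    }
    for host in hosts_with_device_reset(captured):
        # The change itself is made on the host, for example with:
        #   esxcfg-advcfg -s 0 /Disk/UseDeviceReset
        print("%s still has Disk.UseDeviceReset enabled" % host)
```

The script only reports; it deliberately does not change any setting, so it is safe to run against saved output.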

We're still doing testing, but so far so good, with 40+ simultaneous backups from various clusters using esXpress against the Infortrend and it has been running smoothly without any locked LUNs.

erickmiller
Enthusiast

Another update... after adding quite a few more VMs to the Infortrend, the LUN locking has started again, and it becomes more frequent as more VMs are added.

However, the problem seems somewhat isolated to a couple LUNs. I'm still working with VMware and Infortrend on the issue and will report back once we find some answers.

We have no problems whatsoever with our IBM DS-series SANs so I'm confident in saying that this is Infortrend-related.

MStoeckle
Contributor

Hi Eric,

did the latest firmware upgrades from Infortrend resolve your issues (or did you return the unit before that)?

We are experiencing the same massive problems here with SCSI reservations and were told by Infortrend that the latest firmware addresses these issues.

Best regards, Martin

erickmiller
Enthusiast

Hi Martin,

Sorry, I didn't see an email indicating that a response to this thread was made.

The latest Infortrend firmware does not solve the problem. We have determined that the management NICs are the cause. Specifically, disconnecting the cables from the management NICs, or disabling the switch ports that they are connected to, solves the problem: no more LUN locking or stuck SCSI reservations.

We have not tested whether managing the unit through the fiber channel results in the same problem or not.

The problem occurs on both S16F-R1840 and S16F-R1840-4 units with their respective latest firmwares.

Controllers have been replaced without improvement.

MStoeckle
Contributor

Hi Eric,

this sounds like a catastrophic outcome and makes Infortrend virtually unusable for vSphere 4 ...

Do you still use these systems in a production environment?

erickmiller
Enthusiast

Hi Martin,

You are indeed correct that it is ridiculous. We only use them for backup targets, but have many hot-spares so we don't have to monitor them manually often.

It's a shame that Infortrend hasn't taken this problem seriously. I can only suggest two things:

a) Complain to Infortrend as much as you possibly can - unfortunately this takes up your time and money, especially if you plan to help troubleshoot

b) Avoid Infortrend completely and spread the word

Infortrend isn't the only SAN vendor with problems, though... there are numerous vendors that have major problems that I'm surprised you don't hear more about, including some of the main-brand vendors. Time to market has been much more important than testing and being sure products are reliable. Imagine running into a problem with a bug in a scale-out storage system where the bug exists on many nodes, affecting Petabytes of data! I'm not so sure I'm ready personally to take that risk in performing an upgrade to one of those systems and having the entire stack of storage nodes fail.

MStoeckle
Contributor

Hi Eric,

just to give a short follow-up on our SCSI reservation / Infortrend problem: after moving all VMs away to either local storage or other RAID systems, we upgraded the Infortrend boxes to the latest firmware level at the end of last year and put them back into operation.

In December, we experienced another crash with the same problems.

Luckily, we had ordered two NetApp boxes already 😉

We decided to move away from Infortrend and also from Fibre Channel and go with NFS, which seems a better solution for VMware (no SCSI reservation problems, bigger datastores possible, thin provisioning, etc.).

Do you have any updates from your side?

Best regards,

Martin

erickmiller
Enthusiast

Hi Martin,

Sounds familiar.  I'm assuming you still had the management NIC connected on the Infortrend controllers?  We've been running fine without the management NIC connected (instead using the serial port management).  No controller crashes, performance is stellar, etc.  Obviously a "major" problem though, not to have monitoring of the device.

It's obvious that Infortrend couldn't care less about this problem, which amazes me, since I suspect a large customer population uses ESX with their equipment.

Good idea to order the NetApps.  SCSI Reservations aren't the worst thing in the world, if implemented properly (most haven't been in the past).  NFS does make it easier to manage by using a lock file instead of some hidden reservation that is unmanageable, and has no statistical measure.

If you still have the Infortrends, I suspect you'll be fine with them without the management port connected.  Would be good for backup/archive storage.

We have had no issues with our NEC D4 either, which just sits there and runs flawlessly, as do our IBM DS3400 units (with the latest firmware, I should add...). Although, the DS3400 has a common problem where, if a management port is connected to a switch that fails or the port on the switch fails, the management port becomes non-functional on the DS3400. It's a known problem that hasn't been corrected by LSI Logic. Ugh.

We also had issues with our old MSA1500cs series SANs...  a common problem that HP claims they couldn't reproduce (surprise surprise) where the controllers would lock up after a period of time.  Complaints rolled in about the issue, and only last year did they release a firmware that, in theory, fixes the problem.  We have only been running on this latest firmware for a couple weeks, so we don't have enough time on them to prove it's fixed.

It just amazes me, though, that companies (large, popular ones) can produce junk equipment and fail to fix problems.

And it doesn't seem to matter how small or large the SAN company is.  Hitachi had numerous problems with their flagship products for a long time.  Thankfully most of those have been resolved, but still...  amazing that you can spend millions on equipment that doesn't work.

Now if only we can get past this lousy 2TB limit in products nowadays. :)  As crazy as it sounds that this is too small, we often run into this limit, and using extents or OS-level spanning gets to be more and more of a pain. Filesystems in general have become the biggest nuisance.

Eric

erickmiller
Enthusiast

Not that Infortrend really deserves this, but I thought I should add that a customer of ours, who has one of these units, says that with the latest firmware on the S16F-R1840, the issues that we experienced have been resolved.  They have had the firmware running for 2 months without issues.

Just wanted to let everyone know, in case someone else is going nuts with problems and needs a resolution. :)

Eric

Texiwill
Leadership

Hello,

Recently I helped a customer with something like this, and it ended up being a bad Emulex adapter. When you have SCSI Reservation Conflicts, start at the storage and work toward the hosts, verifying everything.

The key is to know how to read the logs and to look before the reservation conflict for other errors that could cause this issue. At the same time, inspect what everyone is doing and what automated tools are running within the virtual environment that could be touching the LUNs.
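
As a rough illustration of "looking before the reservation", the Python sketch below prints the few log lines that precede each line containing a marker string. The marker text and the sample lines are placeholders, not real ESX log output; substitute whatever string your own logs actually use for a reservation conflict.

```python
from collections import deque


def lines_before_matches(lines, marker, context=5):
    """Yield (matching_line, preceding_lines) for each line containing marker.

    The comparison is case-insensitive; preceding_lines holds at most
    `context` lines, oldest first.
    """
    recent = deque(maxlen=context)
    needle = marker.lower()
    for line in lines:
        line = line.rstrip("\n")
        if needle in line.lower():
            yield line, list(recent)
        recent.append(line)


if __name__ == "__main__":
    # Placeholder log lines for illustration; not real vmkernel output.
    sample = [
        "hba event A",
        "link state changed",
        "hba event B",
        "reservation conflict reported",
    ]
    for hit, before in lines_before_matches(sample, "reservation conflict", context=2):
        print("MATCH:", hit)
        for earlier in before:
            print("  before:", earlier)
```

In practice you would pass an open log file instead of the sample list, since the function accepts any iterable of lines.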

Also ensure that ONLY vSphere/ESX hosts are touching those LUNs; a common issue is SCSI-3 PGR locks thrown by clustered Windows Server 2008 nodes impacting vSphere LUNs, etc.
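
That check amounts to a set difference between the initiators allowed to see each LUN and the HBAs you know belong to ESX hosts. Below is a minimal Python sketch under the assumption that you have exported the LUN mapping from the array by hand; the data layout and the WWPNs are made up for illustration and do not come from any Infortrend tool.

```python
def unexpected_initiators(lun_access, esx_wwpns):
    """Return {lun: sorted WWPNs with access that are not known ESX HBAs}.

    lun_access maps a LUN name to the WWPNs allowed to see it;
    esx_wwpns lists the WWPNs of the ESX host HBAs. Case is ignored.
    """
    known = {w.lower() for w in esx_wwpns}
    report = {}
    for lun, wwpns in lun_access.items():
        extra = sorted(w.lower() for w in wwpns if w.lower() not in known)
        if extra:
            report[lun] = extra
    return report


if __name__ == "__main__":
    # Hypothetical WWPNs and LUN names, for illustration only.
    esx_hbas = ["10:00:00:00:c9:aa:bb:01", "10:00:00:00:c9:aa:bb:02"]
    mapping = {
        "LUN0": ["10:00:00:00:C9:AA:BB:01", "10:00:00:00:c9:aa:bb:02"],
        "LUN1": ["10:00:00:00:c9:aa:bb:01", "21:00:00:e0:8b:00:00:01"],
    }
    for lun, extras in sorted(unexpected_initiators(mapping, esx_hbas).items()):
        print("%s is visible to non-ESX initiators: %s" % (lun, ", ".join(extras)))
```

Any LUN that shows up in the report deserves a closer look at zoning and LUN masking before blaming the array.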

Best regards,

Edward L. Haletky

Communities Moderator, VMware vExpert,

Author: VMware vSphere and Virtual Infrastructure Security,VMware ESX and ESXi in the Enterprise 2nd Edition

Podcast: The Virtualization Security Podcast Resources: The Virtualization Bookshelf

erickmiller
Enthusiast

All of those are good things to check.  In our case, we had pure vSphere environments, and it was very obvious it was a bug in the Infortrend firmware.  By unplugging the "management" NIC, the problem went away.  So, something completely unrelated to SCSI I/O Reservations fixed the issue.  Imagine trying to diagnose this...  not fun.

Infortrend uses a single controller chip for "everything" (to keep costs low, I'm sure), and so any problem related to anything can cause havoc on everything else.

Eric
