VMware Cloud Community
kristinnE
Contributor

iSCSI blocks other computers on SAN

I've got a very funky problem.

For about a year now we've been running ESX 3.0.x (3.0.2 latest) on a single server to test how we'd like VMware before going fully virtual. We've been running this on an HP c-Class blade (BL460c) connected to an MSA1510i SAN without any major problems so far (except the usual lack of NICs and I/O performance).

We have now added another blade to the enclosure, and I installed ESX 3.5 on it and connected it to the SAN like the first one, except for one tiny little problem: I can only see some of the LUNs on the SAN. On my original ESX server I can see 14 LUNs; on the new one I can see anywhere from none to 9 LUNs, but I can't for the life of me get up to 14. Which LUNs I see is totally random and changes with each server reboot, so at one point I can have access to my datastore, but after a reboot I can't.

So you see this is causing me a few problems with VMware, but that's not all. When that buggy server is up and running and connected to the SAN and I have to reboot any of my other blades (running Win2k3), I get an error after the reboot saying there are insufficient resources to connect to the storage. Turning any other blade off makes no difference; turn the buggy VMware blade off and everything connects without a problem.

This is not a firmware issue: everything has been patched and is up to date. It is not a hardware issue, as my HP vendor lent me a blade to check that. And it does not seem to be an MSA issue, since I can turn off any other blade without it making a difference, but turning the 2nd VMware blade off does. Turning the 1st VMware blade off doesn't matter. I don't know why, and I'm not going to reinstall that blade, since it's working, just to find out if it will crash or not. This is also not tied to 3.5, since I installed 3.0.2 on the new blade to make it as identical as possible (recreating all partitions to avoid leftovers).

Setup:

In this blade enclosure we have 11 blades running, each with 2 NICs, one connected to the LAN and one connected to the SAN. The SAN is on a closed network, and the only way to reach it is through one of the blades. The storage is an MSA1510i connected to 2 cages of SCSI disks. The blade enclosure has Cisco 3020 switches.

This little problem is what's keeping me from going full-blown VMware, since I'm not going to buy expensive interconnects and NICs along with VMware licenses just to mess up my system and cause myself more work.

Please help me before I go insane ( if it isn't too late )

5 Replies
bobross
Hot Shot

You may want to look at the log to determine whether the LUNs are in order (consecutive numbers) on discovery. I have seen ESX behave very badly if your LUNs have a 'gap', in other words you discover LUN 0, 1, 2, ... 12, but 13 is not there, then you have 14, 15 ... 31 (or whatever). The problem is that LUN discovery can occur out of order, and if the LUNs are not discovered perfectly in order, you end up with missing LUNs. Again, this may or may not be your problem, but it is worth checking out. Good luck.
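
As a rough illustration of that check, here is a minimal Python sketch. It assumes you have already copied the LUN numbers out of the log by hand, in the order they were discovered; the function names and the sample list are made up for the example, and nothing here reads an actual ESX log.

```python
def find_lun_gaps(discovered):
    """Return LUN numbers missing between 0 and the highest LUN seen."""
    if not discovered:
        return []
    seen = set(discovered)
    return [n for n in range(max(seen) + 1) if n not in seen]


def is_in_order(discovered):
    """True if the LUNs were reported in strictly ascending order."""
    return all(a < b for a, b in zip(discovered, discovered[1:]))


# Hypothetical discovery order, typed in by hand from a log.
luns = [0, 1, 2, 4, 3, 6]
print(find_lun_gaps(luns))  # [5]
print(is_in_order(luns))    # False
```

An empty gap list together with `True` from `is_in_order` would suggest the numbering itself is not the issue.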

kastlr
Expert

Hi,

Because you're talking about NICs (and not HBAs), I assume that you are using the software iSCSI initiator.

If so, you should check the speed and duplex configuration of the NIC in use.

Usually, the NICs are set to Auto Negotiation, which should be changed.

In your scenario, both the NIC and the switch port it connects to should be set to 1000Mb, Full Duplex.

I've seen similar behavior in my lab when playing around with software iSCSI; I wasn't even able to scan for disks.

AFAIK, VMware recommends disabling Auto Negotiation for NICs by configuring them to use the highest available static speed and duplex mode.
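
To make the "both ends must match" point concrete, here is a small Python sketch that compares the settings of a NIC and its switch port. It does not query ESX or the switch; `PortConfig`, `link_problems`, and the sample values are hypothetical, and you would fill them in by hand from whatever each side reports.

```python
from typing import NamedTuple


class PortConfig(NamedTuple):
    autoneg: bool   # True if the port is left on auto-negotiation
    speed_mb: int   # configured speed in Mb/s
    duplex: str     # "full" or "half"


def link_problems(nic, switch_port):
    """List the ways the two ends of one link disagree."""
    problems = []
    if nic.autoneg != switch_port.autoneg:
        problems.append("one end is fixed, the other auto-negotiates")
    if nic.speed_mb != switch_port.speed_mb:
        problems.append("speed differs")
    if nic.duplex != switch_port.duplex:
        problems.append("duplex differs")
    return problems


# Hypothetical values: NIC forced to 1000/full, switch port left on auto.
nic = PortConfig(autoneg=False, speed_mb=1000, duplex="full")
port = PortConfig(autoneg=True, speed_mb=1000, duplex="full")
print(link_problems(nic, port))
```

An empty list means both ends are configured the same way, which is the state described above.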

Hope this helps a bit (and will reach you just in time). ;)


Greetings from Germany. (CEST)
christianZ
Champion

The MSA1510i is not supported with ESX 3.5, and I remember people here having big problems with it, so I guess that won't work.

kristinnE
Contributor

Right....

MSA1510i works on ESX 3.5, just not very well.

That thought about the LUN order got me looking more closely at the storage unit, and it turns out that I had LUN 1 configured for 2 targets, which was causing me some problems. I also just noticed that the LUNs don't come in the right order: there is no gap, but the order is wrong because they are ordered by target. Looking into some logs, I also noticed that 2 Windows servers were actually connected to the same target, although one of them wasn't allowed to be. Strange bug, but nothing too serious I hope, since I've fixed that.
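
For anyone wanting to run the same two checks on their own setup, here is a minimal Python sketch. It assumes the target-to-LUN mapping, the session list, and the access list have been written down by hand from the storage configuration and logs; all names and sample data are hypothetical, and nothing here talks to the MSA.

```python
from collections import defaultdict


def luns_on_multiple_targets(target_luns):
    """Map each LUN presented by more than one target to those targets."""
    owners = defaultdict(list)
    for target, luns in target_luns.items():
        for lun in luns:
            owners[lun].append(target)
    return {lun: sorted(ts) for lun, ts in owners.items() if len(ts) > 1}


def unauthorized_sessions(sessions, allowed):
    """Return (initiator, target) pairs that are not on the access list."""
    return [(i, t) for i, t in sessions if i not in allowed.get(t, set())]


# Hypothetical configuration: LUN 1 is mapped to two targets.
target_luns = {"target-a": [0, 1], "target-b": [1, 2]}
print(luns_on_multiple_targets(target_luns))  # {1: ['target-a', 'target-b']}

# Hypothetical sessions: win2 is connected to a target it is not allowed on.
sessions = [("win1", "target-a"), ("win2", "target-a")]
allowed = {"target-a": {"win1"}}
print(unauthorized_sessions(sessions, allowed))  # [('win2', 'target-a')]
```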

With those thoughts in mind I managed today to connect to 2 more targets than earlier, so I guess I'm on track now. I just have to try to rearrange my target order so that each target has the same number as its LUN, and then they'll come in order. Don't know if that's going to help me or not, but it's worth a try.

The thought about the NIC: You're right, it's the software iSCSI initiator and it was set to auto-negotiate. I've now changed that on my ESX server, and I have to connect via serial to the switch to configure the port, but I think it's set to 1000Mb full duplex and not auto-negotiate.

What bothers me most is that one ESX is actually up and running and not bothering anyone, the other one is up and running too and works just fine ( when it can connect to the right targets ), it just bothers all other servers that need to connect to the SAN.

kristinnE
Contributor

I'm going to wake this up again as my problem hasn't been solved yet, but I've got updates.

I've been monitoring my systems for a bit now to see what exactly is going on and I've noticed one very very very odd behavior from the ESX server. It does not matter what version I run, no matter if it's 3.0.2, 3.5.0 or 3.5.0 update 1 ( haven't tried 3.0.1 yet ) I always seem to get the same result.

Normal behavior for the MS iSCSI initiator is to look for targets and only try to log on to the targets that the user tells it to log on to. VMware, however, seems to scan for targets and try to log on to all of them, resulting in multiple sessions to targets that are supposed to have only one session. I've also noticed that even though I create a portal with a specific IP and assign that portal to a specific target, VMware doesn't really care about that; it just scans the network and displays all targets from all portals, even ones it's not supposed to scan.
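
To show why this matters when each target is meant to have a single session, here is a toy Python model of the two login policies. It is only a model of the behavior I'm describing, not of how either initiator is actually implemented; the function names, target names, and the one-session limit are assumptions for the example.

```python
from collections import Counter


def login_targets(discovered, configured, login_to_all):
    """Targets one host opens sessions to under a given login policy."""
    if login_to_all:
        return list(discovered)
    return [t for t in discovered if t in configured]


def oversubscribed(hosts, discovered, max_sessions=1):
    """hosts is a list of (configured_targets, login_to_all) tuples."""
    counts = Counter()
    for configured, login_to_all in hosts:
        counts.update(login_targets(discovered, configured, login_to_all))
    return sorted(t for t, n in counts.items() if n > max_sessions)


targets = ["t1", "t2", "t3"]
# Every host logs on only to its own target.
selective = [({"t1"}, False), ({"t2"}, False), ({"t3"}, False)]
# The third host logs on to everything it discovers.
greedy = [({"t1"}, False), ({"t2"}, False), ({"t3"}, True)]
print(oversubscribed(selective, targets))  # []
print(oversubscribed(greedy, targets))     # ['t1', 't2']
```

In the second case one host ends up holding extra sessions on targets that belong to other hosts, which is the situation I seem to be seeing.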

I have now set my ESX server to discover only one IP address. That IP address is connected to only one portal, and that portal is connected to only one target; however, I can still see all the other targets in my system, and that is behavior I'm not pleased with.

I don't know if this is something specific to the MSA1510i or if it's general behavior by VMware, but if it is general behavior, my guess is that if it were fixed, ESX 3.x would support all iSCSI solutions.

I am already in the process of buying a new storage solution that is not iSCSI, so the problem will go away. I just hate having problems I can't solve, so I'd like this fixed if possible. If anyone has a solution, please feel free to share.
