VMware Cloud Community
TravisT81
Contributor

iSCSI Problems in ESXi 5.1

I'm trying to get iSCSI working with my home lab setup.

I'm running FreeNAS 8.3.0-RELEASE with iSCSI served off ZFS.  The pool consists of two 1TB SATA drives striped together.  I've created 6 zvols on it and am sharing one as an iSCSI extent for each VM I plan to host on the ESXi box.  I had this working once, but in testing it was torn down and rebuilt several times, and now I cannot get the iSCSI link to work reliably.
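
(For reference, each of those is just a ZFS volume shared out as its own iSCSI extent.  From the FreeNAS shell, the GUI is doing roughly the equivalent of the commands below; the pool and volume names are made up.)

  # one ZFS volume (zvol) per VM; pool/volume names here are only examples
  zfs create -V 200G tank/esxi-vm1
  # confirm the volumes exist
  zfs list -t volume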

The network consists of a single Cisco 3750G switch.  It's configured for L2 only, and the VLAN used for iSCSI is configured on just two ports (one for ESXi, the other for FreeNAS).  Both servers use Intel e1000 NICs dedicated to iSCSI traffic, and FreeNAS, ESXi, and the switch are all configured for jumbo frames.

Currently, the switch port facing the ESXi host is dropping out several times a minute.  Because of this, the FreeNAS box is throwing "istgt_iscsi.c:5239:worker: ***ERROR*** iscsi_read_pdu() failed" errors.  It did the same thing before I moved iSCSI traffic onto its own NICs, before I set up jumbo frames, and before I upgraded the ESXi box to 5.1.

If I've left any important info out, please let me know.  If anyone has ideas of what to check, I'd love to hear them.

Thanks

17 Replies
sm00ter
Contributor

There are a multitude of FreeNAS/vSphere 5.0+ gotchas documented on the web.  I've read quite a few of them myself to get setups working.  I'd browse around and make sure all of the little required settings are actually in place.

Also, do you have your 3750G set for jumbo frames?  Just so you know, on that switch it's an entire-switch setting.  "Testing" jumbo frames so that you know what needs to be set to get them configured is fine as an exercise, but you are NEVER going to need/use jumbo frames in a home environment.
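
If it helps, enabling it on a 3750 goes roughly like this (from memory, so double-check the syntax), and it only takes effect after a reload:

  ! jumbo frames are a global setting on the 3750 and need a reload to apply
  Switch(config)# system mtu jumbo 9000
  Switch(config)# end
  Switch# show system mtu
  Switch# reload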

Heck, most production deployments don't see that great of an improvement (if any) by enabling Jumbo!

Just FYI,

sm00ter

TravisT81
Contributor

I've been searching quite a bit and have found several things along the way.  I would really love for this to work correctly; being that it's a home server, I've already thrown plenty of money at it.  Hopefully I can get things worked out soon.

As for jumbo frames, they're enabled on the entire iSCSI network (which means the entire 3750, since it's a global command on that switch).  They've been tested and are working like they should between FN and ESXi.  As for the benefits, I can't speak to the performance gains; I just figure that on a gigabit network with no other traffic, a larger frame size means fewer frames and less overhead.  Will it have an effect on the low-use VMs I'm running?  Probably not.
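
(The test was basically a don't-fragment ping at the full jumbo payload size from each end, something along these lines; the addresses are placeholders:)

  # from the ESXi shell: 8972 bytes of payload + 28 bytes of ICMP/IP headers = 9000, don't fragment
  vmkping -d -s 8972 192.168.10.20
  # from the FreeNAS shell in the other direction (-D sets don't-fragment on FreeBSD)
  ping -D -s 8972 192.168.10.10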

My problem seems to be isolated to the ESXi server: the e1000 NIC appears to drop out under any substantial traffic.  I'm getting errors on FN, but I'm also seeing the protocol/interface drop on the iSCSI link connected to the ESXi box.  I think the FN errors are just a consequence of those dropouts, not a problem with the link between FN and ESXi itself.  I can't find much on this - e1000 interfaces have been supported for a long time, but I don't know whether it's a hardware problem or a software/driver issue.

TravisT81
Contributor

Still searching for a solution to this.  With any traffic over my dedicated iSCSI network adapter (an Intel PRO/1000 GT), I get watchdog timeouts and lose connectivity.

I turned jumbo frames back off on both ESXi and FreeNAS to make sure that wasn't affecting anything.
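
(For anyone following along, that just means dropping the MTU back to 1500 on both ends.  Assuming a standard vSwitch named vSwitch1 with VMkernel port vmk1 on the ESXi side and em0 on FN, adjust the names to match your setup, it's roughly:)

  # ESXi side: vSwitch and iSCSI VMkernel port back to the default MTU
  esxcli network vswitch standard set -v vSwitch1 -m 1500
  esxcli network ip interface set -i vmk1 -m 1500
  # FreeNAS side
  ifconfig em0 mtu 1500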

spravtek
Expert

What does the switch config look like? Did you enable flow control, for example?
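
On a 3750 you can check and set it per port, something like this (the port number is just an example):

  ! see what was negotiated on the ESXi-facing port
  Switch# show flowcontrol interface gigabitEthernet 1/0/10
  ! the 3750 only acts on received pause frames; "desired" advertises that during autoneg
  Switch(config)# interface gigabitEthernet 1/0/10
  Switch(config-if)# flowcontrol receive desired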

markpattersones
Contributor

Stop waiting for the answer to be handed to you and begin a process of elimination; it doesn't look like this one is going to solve itself. Time to put on the gumboots and roll up your sleeves.

I agree the error messages are just telling you something is wrong; their specifics aren't informative. I think your initial diagnosis is good: it sounds like a problem between the ESXi host and the switch. Your configuration is probably okay, since you haven't changed much, and you have also eliminated jumbo frames as the cause. I am thinking it is possibly a cable or network interface problem. Except in exceptional cases, these problems are solved by a process of elimination, not by taking a list of symptoms and calculating the absolute answer: you start with the most likely culprit and then work through troubleshooting steps until you isolate the troublemaker.

Here are some troubleshooting steps to try:

1) Bypass the switch. Plug a cable directly between the host and FN. If this doesn't fix it, I would try a different cable. If you can't do this because the two boxes are in different wings of your mansion, use a different switch, even a FastEthernet one if you have it. IMPORTANT: if this doesn't eliminate the problem, plug them back into the switch so the next step works.

2) Have you manually forced gigabit on the interface? If your cabling isn't up to scratch, this will cause you problems. Keep replacing things until autonegotiation picks up the speed you want; forcing the speed creates exactly this kind of problem, where things work okay for a while but everything falls apart under load. (A quick way to check speed/duplex from the ESXi shell is in the sketch after this list.)

3) Configure iSCSI to use the ESXi host's default NIC. Off the top of my head, I think you can just rebind the iSCSI VMkernel port to your other interface (see the esxcli sketch below). You will also need to put your FN box back on the default VLAN. Then test whether you can get iSCSI working across that interface.
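
For 2) and 3), assuming the software iSCSI adapter shows up as vmhba33 and the ports/NICs as vmk0/vmk1/vmnic1 (run the list commands first and substitute whatever yours are actually called), the ESXi shell side goes roughly like this:

  # 2) check negotiated speed/duplex, and put the NIC back on autonegotiate if it was forced
  esxcli network nic list
  esxcli network nic set -n vmnic1 -a

  # 3) see which VMkernel ports are bound to the software iSCSI adapter, then rebind
  esxcli iscsi adapter list
  esxcli iscsi networkportal list -A vmhba33
  esxcli iscsi networkportal remove -A vmhba33 -n vmk1
  esxcli iscsi networkportal add -A vmhba33 -n vmk0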

ranjitcool
Hot Shot

Just a curious question: why are you using only iSCSI? I used FreeNAS before, and iSCSI seemed flaky, so I switched to NFS.

R

Please award points if you find my answers helpful. Thanks, RJ. Visit www.rjapproves.com

TravisT81
Contributor

By no means am I just waiting for an answer here.  My time to troubleshoot this is limited, but I've tried what I can think of up to this point.  Googling is only leading me in circles.

Tonight I bypassed the switch and connected both hosts with a crossover cable.  At first things seemed promising, but with a little more load things went south quickly.  I saw a max of about 400Mb/s reported on the interface in the FreeNAS GUI.

I will attempt to bind the iSCSI interface to the other NIC and see what happens.  IIRC, the last time I did that, I not only lost iSCSI, but my connection via vSphere Client was very unstable as well.

One notable difference when connecting both NICs via crossover cable is that my FN NIC is actually dropping out, where before I was just getting iSCSI errors.  Not sure if this is relevant or not.

markpattersones
Contributor

So we now know the switch wasn't causing the problems (unless it independently has another problem). My money is on your ESXi iSCSI NIC being faulty.

The way you've discussed all this, you keep sidestepping that diagnosis rather than eliminating it from the equation completely (which is why my first suggestion was the easy test of bypassing the switch).

By the way, with gigabit you don't need crossover cables; the ports sort out the wiring themselves (auto MDI-X), so it works with any cable. It's awesome when a new standard makes life easier. Seeing as you're still having weird network issues, I'd swap that cable for a known-good one; it really makes your life difficult when you replace a faulty cable with another faulty cable. On the flip side, any time you think a cable is dodgy, throw it out immediately.

There is no reason you should have trouble with iSCSI bound to the same NIC as your management and VM traffic; if that is problematic, there is a fault somewhere. Back when you were having trouble with all traffic going across one NIC, did it happen to be the iSCSI NIC?

markpattersones
Contributor

Your FN interface actually dropping out now is incredibly relevant. Before, the switch was stopping layer 2 errors from propagating. This is exactly the problem I would anticipate with a faulty NIC or cable. If everything had been fine with the switch bypassed, that would have pointed at the switch, specifically the switch port the ESXi box was using, or the cable.

TravisT81
Contributor

Hey mark,

By no means am I trying to sidestep the problem, but my time and resources at home are limited, so I'm trying to make do with what I have.  Not to mention I don't have a whole lot of *nix/BSD experience, so I'm learning as I go.

Anyway, I've made a little more progress and wanted to update anyone who is following along:

I removed the switch and it seemed to run great, but after a few minutes I saw the same errors again. I then set up iSCSI again with only the onboard NICs, one NIC per box, and iSCSI seemed to run great. I didn't fully tax the link, but I didn't get a single iSCSI error reported on FN or any dropouts on ESXi.

The last test I've performed to this point was to set up iSCSI on the on-board interface on FN (the only interface used on that side) and attach both ESXi interfaces (on-board and Intel NIC) to the vSwitch carrying my management and iSCSI networks. Although I only tested briefly, all iSCSI targets came up and worked. I then removed the on-board uplink from the vSwitch, so that the Intel NIC was serving all iSCSI and management traffic, and expected a failure almost immediately.
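
(For reference, adding/removing an uplink on a standard vSwitch can also be done from the ESXi shell; assuming the on-board NIC shows up as vmnic0 and the vSwitch is vSwitch0, it's along these lines:)

  # names are examples; check them with: esxcli network vswitch standard list
  esxcli network vswitch standard uplink remove -u vmnic0 -v vSwitch0
  # and to put it back for the next round of testing
  esxcli network vswitch standard uplink add -u vmnic0 -v vSwitch0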

It ran great for a few minutes with 5 VMs powered on (which would normally throw an error pretty quickly). I then downloaded HD Tune to three VMs and started 10GB file benchmarks on each of them simultaneously. I went to bed expecting it not to run long before throwing another error.

I was a little surprised when I checked it this morning. The aggregate read was 75MB/s and the write was 41MB/s. This is probably the most traffic the iSCSI network has seen, and there were no errors until about 4AM (hours after these tests completed). The VMs have been running all day, and there hasn't been another error on the FN box.

I'm not sure why I can't get consistent results on these boxes, but the Intel NIC currently installed in the ESXi box seems to be working "OK". There was still an issue this morning, so I can't say it's "good".

I think I'm going to put the other Intel NIC into the ESXi box and re-run the same test again and see what happens. I'm just trying to identify a bad NIC if I have one.

markpattersones
Contributor

TravisT81 wrote:

I think I'm going to put the other Intel NIC into the ESXi box and re-run the same test again and see what happens. I'm just trying to identify a bad NIC if I have one.

If I thought a NIC was faulty and had another one handy, I would swap them too.

Have you had that spare since your first post?

TravisT81
Contributor

markpattersonesq wrote:

TravisT81 wrote:

I think I'm going to put the other Intel NIC into the ESXi box and re-run the same test again and see what happens. I'm just trying to identify a bad NIC if I have one.

If I thought a NIC was faulty and had another one handy, I would swap them too.

Have you had that spare since your first post?

While I appreciate your help, I never said anything about a "spare" NIC.  Maybe re-reading my first post would be in order.

In case you don't bother doing that:

I have two Intel NICs: one in my ESXi box and one in my FreeNAS box.  The only time I have problems is when those two talk to one another (aside from a small hiccup last night while a very minimal amount of traffic was passing across the link).  When the two Intel NICs are talking to each other, the errors occur almost immediately.

Until I can recreate the problem and isolate it to one thing, I'm not interested in just throwing parts at it until the problem goes away.

markpattersones
Contributor

Easy, tiger. I've read all your posts ten times and spent my own time thinking and typing to help solve your problem. When I read "NIC" I don't assume that means an add-in card; a lot of people use the term interchangeably for both built-in and add-in interfaces. I assumed yours were onboard.

I would have suggested swapping the cards earlier if I had known. I suggested the more complicated steps because I thought that wasn't an option, so more indirect methods (changing bindings) seemed necessary.

zombiess
Contributor

I'm having the same issue with my Intel e1000 NICs.  I'm in the process of switching to the onboard NIC to see if that resolves the problem of my iSCSI datastore dropping out 2-3 times per minute.

TravisT81
Contributor

Really curious to hear whether you've made any progress with yours.  I had no issues with the onboard NICs, but the Intel cards are supposed to be compatible.  I'd love to run my iSCSI traffic on a dedicated NIC, so any input you can provide would be appreciated.  What model of Intel NIC are you running?

brodeep
Contributor

I know this is a bit of an old post, and I'm not sure whether it was ever resolved, but I was experiencing a similar issue and figured I would add what fixed it for me in case it helps anyone.

I was getting randomly disconnected under moderate-to-heavy load (e.g. svMotion or deploying from template) between my vSphere 5.1 host and my FreeNAS 8.3.1-RELEASE server. Every time it disconnected, iSCSI was involved: NFS -> NFS worked, but NFS -> iSCSI, iSCSI -> iSCSI, etc. would always break. The NAS box would lose network connectivity (causing vCenter/ESXi to choke and cancel the running task), and the only (temporary) fix was to restart the network interface on the NAS.

After much trial-and-error, and many hours of searching, I discovered that flow control on the NAS interface was the culprit. Once I disabled it, I was able to deploy templates or svMotion machines without fail and at peak sustained network speeds of ~122 MB/s over commodity NICs & switch.
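
(In case anyone needs it: on a FreeBSD-based box like FreeNAS, the usual knob is the interface's media options, assuming the NIC driver acts on the flowcontrol media flag; not all do, and some drivers use a loader tunable or sysctl instead. The interface name below is only an example, and the same options can be made persistent via the interface's "Options" field in the FreeNAS GUI.)

  # interface name is an example; check yours with: ifconfig
  # forcing the media WITHOUT the flowcontrol option turns pause frames off on drivers that support the flag
  ifconfig nfe0 media 1000baseT mediaopt full-duplex
  # the "media:" line should no longer list flowcontrol
  ifconfig nfe0 | grep media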

I should note that I don't have jumbo frames enabled.

Hope this helps.

Server NICs, for reference:

FreeNAS = nForce MCP55 integrated

ESXi = Realtek 8168 integrated

Switch = TP-Link TL-SG1008D

bradley4681
Expert

How did you disable it on the NAS?

Cheers! If you found this or other information useful, please consider awarding points for "Correct" or "Helpful".