VMware Cloud Community
rbutter
Contributor

ESXi 5.5 iSCSI issues

  I have two ESXi 5.5 U2 hosts. Both are connected to a Drobo B1200i via a virtual distributed switch. Each host has a direct Ethernet connection from a physical host NIC to a dedicated iSCSI port on the Drobo. Both hosts find the target and populate the LUNs view. I've created an iSCSI datastore extended across the four 4 TB LUNs on the Drobo. Both hosts can see, browse, and store/retrieve VMs from this shared storage. I can vmkping from host 1 to the target and get a response. If I vmkping from the 2nd host I get nothing, no response. Is this normal behavior? I am also seeing a lot of events saying the host lost access to the iSCSI datastore and then restored it seconds later. I also see latency jump to warning levels when taking any action on a single VM stored on this datastore. We are using jumbo frames set at 9000.

148 Replies
JPM300
Commander

Hey rbutter,

If your setup is like this:

[Attached diagram: Drobo.JPG]

What speed are your iSCSI NICs? Also, how do the Drobo's controllers/iSCSI connections work? Is there only one controller on it with iSCSI/NIC ports? Do the ports need to be changed from mgmt mode to iSCSI mode? Are all the ports ready to receive the target?

I would temporarily unplug host 1's connection on the Drobo, plug host 2 into that iSCSI port, and see if you can vmkping it and whether normal behavior returns. If you can, then it's a configuration issue with the iSCSI ports on the Drobo.

I tested a Drobo a number of years ago, back in version 4.0/4.1, and the speed results were not fantastic. It would be fine for a very small environment or a lab; however, once more stress was put on it, it started to get overwhelmed.

On another note, if your config is like the picture, you have a lot of single points of failure. If the NIC on either host fails, that host's VMs will go down. If the NIC on the Drobo fails, your VMs will go down. If there aren't 2 controllers on the Drobo, a controller failure will also take your VMs down. Just wanted to point these out.

I hope this has helped.

rbutter
Contributor

JPM300,

  The NICs are 1 Gb on both the Drobo and the hosts. I'm fairly sure there is a single controller in the Drobo with 3 iSCSI ports and a management port. The iSCSI ports are dedicated to iSCSI and don't need anything but an MTU and IP address set. My setup is fairly close to the pic, and failover/redundancy is not a big concern for this small environment, but I greatly appreciate any and all advice. I will change host 2 to the apparently working iSCSI port and attempt the vmkping again. Thanks, Ron B.

JPM300
Commander

It's possible that the Drobo requires an individual IP address per iSCSI port, which would mean you would have multiple iSCSI targets to set up on your ESXi hosts. Typically, when a SAN or NAS has 3 iSCSI ports, it has a group IP of some sort and the controller does the load balancing, much like Dell's EqualLogic arrays; however, this is not always the case. I have seen some where, if the iSCSI ports are not set up in a trunk, you have to assign 3 iSCSI IP addresses and configure 3 iSCSI targets on your ESXi hosts.
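
If it does turn out to be one target per port, adding the extra targets to the software iSCSI initiator from the CLI would look roughly like this (vmhba33 and the 10.0.0.x addresses are placeholders for your adapter name and the Drobo port IPs):

# Add one dynamic (send targets) entry per Drobo iSCSI port
esxcli iscsi adapter discovery sendtarget add -A vmhba33 -a 10.0.0.11:3260
esxcli iscsi adapter discovery sendtarget add -A vmhba33 -a 10.0.0.12:3260
esxcli iscsi adapter discovery sendtarget add -A vmhba33 -a 10.0.0.13:3260
# Rescan so the new targets are logged in and their LUNs appear
esxcli storage core adapter rescan -A vmhba33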

MightyQuinn4310
Enthusiast

First off, I don't have any answers! But I am experiencing very similar symptoms with a very similar setup to yours.

First, my environment:

I have a B1200i with 3 600 GB Intel enterprise SSDs and 9 WD RE 2 TB drives. I am running ESXi 5.5, and I have 3 Dell R510 hosts with 2 NICs (4 ports total) in each server. I have an unmanaged Netgear gigabit switch dedicated/isolated to the SAN, and each server has a NIC port dedicated to the storage network. I am using the ESXi software iSCSI initiator. We are not using jumbo frames. The LUNs are configured to use Round Robin and are VMFS5. I originally had all 8 guests on one 8 TB LUN, but Drobo support suggested I create a total of 4 LUNs and distribute the guests, because the Drobo dedicates a core to processing each LUN's traffic. So I created 3 additional 2 TB LUNs and moved my more performance-intensive guests to them.
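
For anyone who wants to double-check the pathing from the CLI, something like the following should show (and, if needed, set) the path selection policy per device; the naa.xxxx device ID is a placeholder for your own LUN:

# List devices with their current path selection policy
esxcli storage nmp device list
# Explicitly set Round Robin on one device if it isn't already
esxcli storage nmp device set -d naa.xxxx -P VMW_PSP_RR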

I have 8 guests in my environment. My typical I/O as reported by the Drobo Dashboard rarely exceeds 500.

Server 2012 running Exchange Server 2013, all roles, for 100 users

Server 2008 R2 running SQL Server 2008, for 5 users

Server 2008 running Terminal Server for 5 users

Server 2008 R2 file and print server for 25 users

Server 2008 R2 running Avaya voicemail software for 50 phones and 25 users; max 8 voicemail lines in use at a time

Server 2008 R2 AD domain controller for 100 users and Windows Update server

Server 2008 R2 backup AD domain controller for 100 users

Server 2003 file server for 5 users and one Access application

I have been getting Windows events in several guests indicating that I/O was taking longer than expected, referencing delays of 19-36 seconds. I am also receiving VMware events indicating the hosts have "lost access to LUN." The events are not at the same time on each host and appear to be occurring at random.

Due to the performance issues, I was finally forced to migrate my SQL and Exchange servers back to local storage, but I still left 5 servers on the Drobo. During this migration (last night), I powered down all of the guests to provide maximum throughput for the moves. During the moves, I maintained about 1300-1700 IOPS and 90 MB/s as reported by the Drobo. The moves completed successfully.

I started rebooting VMs one at a time and observed that with only one AD controller running on one host, I still received the LUN disconnected message in VMware. I have now powered up the rest of my guests; with 5 guests running on the Drobo, the ESXi hosts are all periodically reporting LUN disconnected and recovered messages.

I have a case going with Drobo Support.

I have considered bypassing the switch altogether and connecting each host directly to a port on the Drobo, but I will have to wait for a maintenance window when I can be onsite and take everything down.

Hopefully our issues are the same or at least related and we can get a resolution soon!

rbutter
Contributor

MightyQuinn4310, thanks for the info. Have you heard back from Drobo? I would be very interested in what they have to say.

Thanks, rbutter

MightyQuinn4310
Enthusiast

Drobo Support is the worst. They take 3 days minimum to respond, and when they do, they don't read the case notes or my reply; they just ask me if I followed their support documents. In past cases, they have told me that they are used to dealing with consumers rather than businesses, and that they might create a separate business support unit at some point, but for now they are not capable of enterprise support. It is really disappointing! I had high hopes for the B1200i, but now I would tell anyone who will listen to stay away from them. They really can't seem to support their product. I'll reply if they offer something useful, or if I can figure out something on my own.

rbutter
Contributor


MightyQuinn4310, thanks; anything at this point would be helpful. You would think that if Drobo advertises the B1200i for enterprises, it would support it, especially if you're paying for support. We are still seeing I/O latency warnings and an occasional "access lost" followed within seconds by "access re-established."

MightyQuinn4310
Enthusiast

Rbutter, I may have found something to try. So far in my search for answers, I have not been able to find a way to get rid of the "Connection lost" and "Connection restored" messages in ESXi. I removed the switch, set the path policy to Fixed, and connected directly; it still happened. Today, I removed the Drobo static and dynamic targets from all of my ESXi hosts except the one I was testing on. On my Drobo, I currently have 4 LUNs (Drobo support told me that each LUN uses a separate processor core, so there can be benefits to having separate LUNs for VMs, but 4 is the limit). On the test host, I went into static discovery and removed 3 of the 4 LUNs, the 3 LUNs that my test VM did NOT reside on. So to recap: I have 1 Drobo connected by 1 port to 1 ESXi server, and I am only connected to 1 LUN. I started the test that has reliably caused the lost and restored connection messages, and so far there have been no messages after about 30 minutes. I still don't know if this result is because I removed all of the extra paths, or because I only have 1 ESXi host connected. I will perform more testing to further narrow down the results.

* Do not "rescan host bus adapter" after removing the LUNs, or they will be added back automatically; just ignore the rescan prompt.
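
For reference, the same cleanup can also be done from the CLI; something along these lines should remove the discovery entries so a later rescan can't re-add them (vmhba33, the address, and the IQN are placeholders for your own adapter, Drobo port IP, and target name):

# Remove the dynamic (send targets) entry
esxcli iscsi adapter discovery sendtarget remove -A vmhba33 -a 192.168.50.10:3260
# Remove an individual static target
esxcli iscsi adapter discovery statictarget remove -A vmhba33 -a 192.168.50.10:3260 -n iqn.example:drobo-target
# Confirm what is left
esxcli iscsi adapter discovery statictarget list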

Chris

nettech1
Expert

Just to confirm, when you ping your Drobo from your hosts, are you using vmkping -d -s 8972 x.x.x.x? Can you show the output of esxcfg-vmknic -l from both hosts?
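
For anyone following along, this is the sort of thing I mean; vmk1 and the address are placeholders for your iSCSI vmkernel port and the Drobo port IP:

# Send a full-size jumbo frame with don't-fragment set out the iSCSI vmkernel port
vmkping -I vmk1 -d -s 8972 192.168.50.10
# List the vmkernel NICs with their IP addresses and MTU settings
esxcfg-vmknic -l

If the 8972-byte ping fails but a plain vmkping works, something in the path (vmk port, vSwitch, or Drobo port) is not actually set to MTU 9000.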

nettech1
Expert

Doing a quick search, I could not find any documentation on Drobo iSCSI configuration, but I came across this post:
http://www.q8i.org/imac-drobo-fs-nas-slow-connection/

I am wondering if disabling Delayed ACK on your host iSCSI ports is the answer:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100259...
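
If anyone would rather check this from the CLI than the vSphere Client, something like the following should at least show whether DelayedAck is currently on for the software iSCSI adapter (vmhba33 is a placeholder, and I'm assuming DelayedAck shows up as a settable key in the param output; if the set command rejects it, the GUI steps in the KB are the safer route):

# Show the software iSCSI adapter parameters, including DelayedAck
esxcli iscsi adapter param get -A vmhba33
# If DelayedAck is listed as settable, this should turn it off (rescan/reboot afterwards)
esxcli iscsi adapter param set -A vmhba33 -k DelayedAck -v false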

rbutter
Contributor

Chris, thanks for the update.

rbutter
Contributor

Nettech1,
I have corrected the connectivity issues between my hosts and the Drobo, so I can now vmkping between both hosts and the Drobo iSCSI ports. What I am left with are latency warnings and frequent "access lost" events followed within seconds by "access restored." Turning off DelayedAck and disabling LRO did not correct the issue. I'm wondering if it has something to do with the HA datastore heartbeat timing or the iSCSI login timing. My Drobos are on firmware version 1.2.0.
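
In case it helps anyone else, host-level LRO can be checked and disabled with something like the commands below (the option name comes from VMware's LRO documentation; verify it exists on your build with the list command before setting it, and note a reboot is needed for it to take effect):

# Check whether LRO is enabled for the default TCP/IP stack
esxcli system settings advanced list -o /Net/TcpipDefLROEnabled
# Disable it (value 0 = off), then reboot the host
esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0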

nettech1
Expert

You are using a distributed switch. Can you span an iSCSI port from host 1 or 2 to a Linux or Windows VM with Wireshark on it and capture iSCSI traffic only? Upload your pcap file somewhere so we can take a look. If you aren't allowed to post network captures, let us know if you see anything abnormal in the stream.
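
If setting up a span port is a hassle, ESXi 5.5 also includes pktcap-uw, which can capture directly on the vmkernel port; something like this should produce a pcap you can open in Wireshark (vmk1 and the datastore path are placeholders):

# Capture everything on the iSCSI vmkernel interface to a pcap file (Ctrl-C to stop)
pktcap-uw --vmk vmk1 -o /vmfs/volumes/datastore1/iscsi-capture.pcap
# Or restrict the capture to iSCSI traffic only
pktcap-uw --vmk vmk1 --tcpport 3260 -o /vmfs/volumes/datastore1/iscsi-capture.pcap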

MightyQuinn4310
Enthusiast

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100955...

VMware KB article about the connection lost message.

Looks like I may have celebrated too soon. My connection lost and restored messages have returned. Just for fun, I spun up a new ESXi 5.1 Update 3 server to see if it was related to some incompatibility with ESXi 5.5, and I appear to be getting the same symptoms.

Have you read the VMware KB article that accompanies the connection lost message? It indicates that the ESXi host hasn't heard from the Drobo in 16 seconds! That is a crazy amount of time! This really makes no sense!
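
For anyone wanting to see how often this is happening on their own hosts, the events end up in the host logs and can be pulled out with something like this (the exact message strings can vary between builds, so adjust the patterns as needed):

# The "lost access to volume" / "restored access" events are logged by vobd
grep -i "access to volume" /var/log/vobd.log
# The vmkernel log usually shows the underlying iSCSI timeouts/aborts around the same timestamps
grep -iE "iscsi|abort|timeout" /var/log/vmkernel.log | tail -n 50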

Drobo support hardly responds, and so far has had nothing helpful to provide.

nettech1
Expert

Anyone having a connectivity issue to Drobo storage: can you provide a Wireshark trace of the iSCSI traffic?

MightyQuinn4310,
If you have Dell OpenManage Server Administrator installed on your R510 hosts, can you check the Network Interface Information page for errors?

Maybe this? http://www.dell.com/support/article/us/en/04/SLN283398/EN
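
If OMSA isn't installed yet, the host can also report per-NIC counters directly; something along these lines should work on 5.5 (vmnic2 is a placeholder for the iSCSI uplink, and if the stats namespace isn't available on your build, ethtool -S vmnic2 shows similar counters):

# Driver, firmware, and link details for the uplink
esxcli network nic get -n vmnic2
# Packet and error counters for the same uplink
esxcli network nic stats get -n vmnic2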

MightyQuinn4310
Enthusiast

NetTech1,

  This is good info, thank you very much. I am working on applying this settings change on my testbed, and then I will report back. I am optimistic that this might actually be helpful for me. The majority of my servers do have the 5720 NICs that are referenced, but one server that is also showing symptoms has BCM5716 and BCM5709 NICs, which are covered by the note in the article about the tg3 driver. I will let you know once I have a chance to test. Thanks!

Just a general update: Drobo wanted to replace the controller card to make sure it was not a hardware problem on their side. So I did that yesterday, and the symptoms returned immediately. I am sure they are as frustrated as I am, and if it turns out to be a NIC driver issue, that would probably restore my faith in the product and their support a little bit.

Thanks NetTech1!

MightyQuinn4310
Enthusiast

Well, unfortunately, this does not appear to have fixed my problem, unless I did something wrong, which is entirely within the realm of possibility!

First of all, I followed the instructions in the article you posted for the workaround and disabled NetQueue. I queried first to see if it was on (it was), then ran the command and rebooted, then re-queried, and it was off. I started up a guest VM, and immediately the lost and restored messages returned.
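
For anyone else trying this, the generic way to toggle that kernel setting is roughly the following (check the Dell article for their exact steps; a reboot is required after the change):

# Check whether NetQueue is currently enabled
esxcli system settings kernel list -o netNetqueueEnabled
# Disable it, then reboot the host
esxcli system settings kernel set -s netNetqueueEnabled -v FALSE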

So, I thought, alrighty, let's go ahead and try the resolution and update the driver. I followed those instructions and learned that when you paste into an SSH session, the paste sometimes mangles the dashes (-) in your command, and it took me longer than I'd like to admit to figure out that it was best just to type the whole command in manually. When I did, I got a message that the install was skipped. So I looked at which driver I already had installed, and it turns out it was a newer driver (driver: tg3, version: 3.137d.v50.1-1OEM.500.0.0.472560). I am running the latest Dell-customized release of ESXi 5.1 U3 on my test server, which was just released in December. My production servers, which have also exhibited this issue, are running the Dell-customized ESXi 5.5 U2 build, and it appears to have a newer driver as well (3.136h.v55.1-1oem.550.0.0.1331820).
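
In case anyone wants to check their own tg3 version before going through the update, either of these should show it (the vmnic name is a placeholder for whichever uplink uses the Broadcom NIC):

# Show the driver and version bound to a given uplink
esxcli network nic get -n vmnic0
# Or list the installed tg3 VIB directly
esxcli software vib list | grep -i tg3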

So I am back to the drawing board. I can look at adding a span port and capturing some traffic with Wireshark if you still think that might be helpful. I am also including my vmkernel logs, which you can open in WordPad and view. The errors seem to mostly contain "b1200i", so you can search for that to see where they appear.

Many thanks, Nettech1, I was really hoping we had solved it! Let me know if you have any other ideas!

nettech1
Expert

Can we take a look at your iSCSI ports in Dell OpenManage?

You are going to need to download the Dell VIB file and place it on a datastore available to your host.


Dell OpenManage Server Administrator vSphere Installation Bundle (VIB) for ESXi 5.5, V7.4

OM-SrvAdmin-Dell-Web-7.4.0-876.VIB-ESX55i_A00.zip | Hard-Drive (7 MB)

http://www.dell.com/support/home/us/en/04/product-support/product/poweredge-r510/drivers


Follow this to install: http://en.community.dell.com/support-forums/servers/f/1466/t/19586952
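
The install step itself, once the zip is on a datastore, should look roughly like this (the datastore name/path is a placeholder; put the host in maintenance mode first and reboot afterwards):

# Install the OMSA VIB bundle from the uploaded offline bundle, then reboot the host
esxcli software vib install -d /vmfs/volumes/datastore1/OM-SrvAdmin-Dell-Web-7.4.0-876.VIB-ESX55i_A00.zip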

Install the OMSA web client 7.4 on one of your Windows machines and connect to the management IP of the ESXi host with the host's root credentials.

Please generate some iSCSI traffic and look at the iSCSI port statistics in OMSA; post a screenshot if you can.


Wireshark would be next.
