VMware Cloud Community
rbutter
Contributor

ESXi 5.5 iSCSI issues

  I have two ESXi 5.5 U2 hosts.  Both are connected to a Drobo B1200i via a distributed virtual switch.  Each host has a direct Ethernet connection from a physical host NIC to a dedicated iSCSI port on the Drobo.  Both hosts find the target and populate the LUNs view.  I've created an iSCSI datastore extended across the four 4 TB LUNs on the Drobo.  Both hosts can see, browse, and store/retrieve VMs from this shared storage.  I can vmkping from host 1 to the target and get a response.  If I vmkping from the second host I get nothing, no response.  Is this normal behavior?  I am also seeing a lot of events saying the host lost access to the iSCSI datastore and then restored it seconds later.  I also see latency jump to warning levels when taking any action on a single VM stored on this datastore.  We are using jumbo frames set at 9000.
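For reference, the vmkping tests look roughly like this (a minimal sketch; vmk1 and 192.168.10.20 are placeholders for the iSCSI vmkernel port and the Drobo port IP, and the -d -s 8972 form only succeeds if jumbo frames work end to end on that path):

    vmkping -I vmk1 192.168.10.20                # basic reachability over the iSCSI vmknic
    vmkping -I vmk1 -d -s 8972 192.168.10.20     # 8972 = 9000 minus 28 bytes of IP/ICMP headers; fails if any hop's MTU is below 9000

If the plain ping works from one host but not the other, the second host's vmkernel port binding and cabling to its dedicated Drobo port are worth checking before blaming the MTU.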

148 Replies
Sedude
Contributor

I feel you. I have changed out so many components and tried so many things, and nothing. I'm still working with Drobo, but not getting anywhere with them. They are reviewing some logs I sent (both Drobo and VMware logs), and I even pointed them to this post, and I still have not gotten any response.

I am ready to give up, choose another unit, and then see about either using this device in another capacity or maybe sending it to eBay.

I have long gone past my wits end.

Chris

0 Kudos
FrankTheTank196
Contributor

Agreed! From the moment I started with this organization, I knew the Drobo would be a thorn in my side. It was purchased by my predecessor. I hate that I can't update the firmware or replace hardware without taking the environment down, and that I can't automate a shutdown of the unit in the event the power goes out. It isn't an enterprise SAN at any level. I also pointed them at this thread. Just tell us it's a known issue. At least if they acknowledged the problem, it would be a step in the right direction. Before I found this post, I had hope of resolving this issue, but now I think I will be dealing with this until I replace the unit.


So have you tried directly connecting the host to the Drobo? What about disabling jumbo frames?
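If you want to test without jumbo frames, a rough sketch of the vmknic side is below (vmk1 is a placeholder for the iSCSI vmkernel port; if the iSCSI uplinks live on a distributed switch, its MTU also has to be dropped to 1500 in the Web Client):

    esxcli network ip interface set -i vmk1 -m 1500    # drop the vmkernel port back to standard MTU
    esxcli network ip interface list                   # confirm the new MTU took effect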

Frank

0 Kudos
j_rodstein
Enthusiast

Hi All,

My company was a reseller of Drobo units a couple of years ago and, as part of that deal, we purchased a deeply discounted Drobo B1200i loaded with SSD and SATA disks.  We were excited because a whitepaper on their site at the time stated it could run something like 200 VMs.  I worked with Support and their engineering team for over a month on all of the terrible performance and host disconnects.  Just wanted to add a few points:

1.  Jumbo frames are by far the best setting for using this device with ESXi.

2.  The device comes with a 4-core processor, and their support recommended dividing up all of my disk space into four equal parts.  Supposedly, that was going to dedicate a core per share.  Sounded like nonsense, but I did test it.  It performed no better.

3.  You will constantly get disconnect alerts from your hosts when the Drobo is attached.  This has been the case in my experience from versions 5.1, 5.5, and now 6.0 of ESXi.

4.  The Drobo is only valuable as a backup device.  I have mine carved up as one single 16TB LUN and created a new VMDK from my Veeam backup server to host the vCenter backups.

Hope this is somewhat helpful as a confirmation of what you can expect from the device 😕

0 Kudos
J45p3r
Contributor

I have given up on the issue. I got tired of jumping through the same hoops for Drobo support over and over and not getting any results from them. I'm done. I'm buying an HP MSA 2040 and I'm done with Drobo forever.

0 Kudos
Sedude
Contributor

I hear ya man.

To answer your question, yes. I have tried both and no luck.

I think I am with J45p3r on this one, and it's just going to be a replacement unit for me too.

I don't see Drobo doing anything to resolve this issue. I am trying to get a refund out of them now for my DroboCare, and maybe the unit, but I doubt I will get either...

I will report back on what I go with and advise.

Chris

0 Kudos
FrankTheTank196
Contributor

I agree that a new unit will most likely be our solution as well. Unfortunately, we won't be able to do that for some time, so at this point I have no alternative other than ignoring it or continuing to engage Drobo and VMware support. If VMware has certified the Drobo on 5.5 and 6.0, then maybe there is something that can be done about these issues. Here is my Drobo case #150717-000018. I would like to mention your cases as well in reference to this issue. Do any of you mind sharing yours? I'm hoping to avoid having to jump through all of the same unsuccessful troubleshooting steps. Thanks, everyone, for your efforts on this.

Frank

0 Kudos
Sedude
Contributor

I hear you. It might be a little time for me as well. Thank god for local storage and backups right now... 🙂

My case number with Drobo is 150427-000127.

I will reference your case as well just in case.

Thanks

Chris

0 Kudos
FrankTheTank196
Contributor

I found this Drobo article:

https://myproducts.drobo.com/article/AA-01836

It mentions changing an iSCSI value (MaxRecvDataSegLen) for Windows hosts to 262144. I checked the VMware iSCSI adapter value and it is 131072. I am only stating what I noticed, as I have no idea what effect this change would have on an ESXi host. Does anyone know what this would do, or whether it has been tried?
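For reference, the value can be read from the ESXi shell roughly like this (vmhba33 is a placeholder for the software iSCSI adapter name shown by the first command; this only reads the parameter, it doesn't change anything):

    esxcli iscsi adapter list                                                  # find the software iSCSI adapter, e.g. vmhba33
    esxcli iscsi adapter param get --adapter=vmhba33 | grep -i MaxRecvDataSegLen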

Thanks,

Frank

0 Kudos
Sedude
Contributor

I have not personally tried this, but I am willing to. I will make that change on my VMware hosts and let you know what I see.

Chris

0 Kudos
FrankTheTank196
Contributor

Chris,

Thanks for taking a look at that. I also found this article:

https://myproducts.drobo.com/article/AA-01284

Most of the things listed others have tried, but the two that caught my eye and apply to my environment are:

  • If there is an MS Exchange (2010 or newer) VM installed on the Drobo, ensure that automatic database maintenance is disabled within the Exchange VM
  • ESX 4.x: A 14-second Heartbeat Token Timeout must be set on each ESX/ESXi server, as in this example:

                      esxcfg-advcfg -s 14000 /VMFS3/HBTokenTimeout

Now, the second one doesn't seem to apply, since it is stated for ESX, and version 4 at that, but Drobo emailed me this morning telling me to make that change as well as the disable-delayed-ACK change.
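For anyone who wants to try the heartbeat change Drobo is suggesting, this is roughly how it looks from the ESXi shell (their article targets ESX/ESXi 4.x with VMFS3, so it's worth reading the current value first and confirming the option even exists on 5.5):

    esxcfg-advcfg -g /VMFS3/HBTokenTimeout          # read the current value first
    esxcfg-advcfg -s 14000 /VMFS3/HBTokenTimeout    # 14000 is what the Drobo article describes as a 14-second timeout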

Just updating the notes with where I'm at. I just finished fully patching my ESXi 5.5 hosts to Build 2718055. I am going to update vCenter now to 5.5 U2e; I am currently at U2b. I needed to do these anyway, so getting them out of the way makes for one less cause to chase. I do notice that one of my hosts has way more lost-access errors than the other, but this could easily be caused by the workload.

Frank

0 Kudos
FrankTheTank196
Contributor

Hello everyone,

I don't want to jinx myself but I made a change last night and since then I have not had a lost access error on either of my hosts for almost 12 hours. I was losing access every 10-20 minutes on both hosts. The change I made was this:

VMware KB: Adjusting Round Robin IOPS limit from default 1000 to 1

I've used this fix before with other SAN devices when using iSCSI, but only to increase performance, not to resolve issues. I wrote down all the steps that had been tried in this thread, as well as the ones VMware and Drobo were suggesting, and this was the only one that hadn't been tried. It also doesn't require a reboot of the Drobo or the hosts, so I started with it. I can't say for sure whether it is a permanent solution or will help any of you, but hopefully it will.
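For anyone wanting to try it, the change is roughly this from the ESXi shell (the naa ID below is a placeholder for the Drobo LUN's device identifier from the first command, and the device has to already be using the Round Robin PSP, VMW_PSP_RR):

    esxcli storage nmp device list                                                          # find the naa.* ID of each Drobo LUN
    esxcli storage nmp psp roundrobin deviceconfig set --type=iops --iops=1 --device=naa.xxxxxxxx
    esxcli storage nmp psp roundrobin deviceconfig get --device=naa.xxxxxxxx                # verify the new IOPS limit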

I am getting warnings now about I/O latency that I was not getting before, but that is better than losing access to a datastore. I'll start looking into those Monday. Also, my full backups run between Sat 12am-10am, so that could be the cause of the latency, as those jobs use heavy I/O. I am also still getting some dropped packets on my iSCSI switch, as my hosts are not directly connected to the Drobo; those dropped packets could also be causing the latency issue. The switch is a Cisco 2960, which is more of an access switch. I'll be able to comment more on this once I have more time to dig in.
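In case it helps anyone else chasing the latency side, esxtop is a low-risk way to watch it live from the ESXi shell while the backups run (interactive, makes no changes to the host):

    esxtop
    # press u for the disk device (per-LUN) view
    # DAVG/cmd is latency from the device/array side, KAVG/cmd is time spent in the VMkernel;
    # high DAVG points at the Drobo/switch path, high KAVG at queuing on the host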

As for changing MaxRecvDataSegLen to 262144 in ESXi, VMware hasn't commented on whether this should be adjusted, and Drobo said it ONLY applies to Windows hosts, so if you did change it and didn't notice a difference, you may want to change it back.

I've also asked VMware about the other Drobo-suggested changes: disabling 24x7 database scanning in Exchange, setting esxcfg-advcfg -s 14000 /VMFS3/HBTokenTimeout, and disabling delayed ACK, but I have not heard back yet. The Exchange one might help performance, but I can't see it causing the lost-access issues. The HBTokenTimeout setting seems to relate to VMFS3 and ESX 4, so I am not sure it would have any impact. The delayed ACK setting is the only one I am still considering, as it is a VMware recommendation, as seen here:

VMware KB: ESX/ESXi hosts might experience read or write performance issues with certain storage arr...
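If I do go ahead with it, the CLI form of the delayed ACK change is roughly the following (the KB above mostly documents the vSphere Client steps; vmhba33 is a placeholder, and it's worth confirming the exact key name with the get command first, since existing sessions may also need a rescan or reboot to pick it up):

    esxcli iscsi adapter param get --adapter=vmhba33 | grep -i DelayedAck
    esxcli iscsi adapter param set --adapter=vmhba33 --key=DelayedAck --value=false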

Anyways, I'll keep this post updated with my results, as I am still actively trying to resolve this. Sure, someday I'll get away from Drobo, but that day isn't today, so until then I've got to make do with what I have. Thanks to everyone who has helped in this thread.

Frank

0 Kudos
Sedude
Contributor

Thanks for the update Frank! I am curious to see what you see over the next few days.

Chris

0 Kudos
FrankTheTank196
Contributor

Good morning,

Just a quick update! I checked both hosts today and only one host lost access, and it only lost access once: Host A lost access to Volume 4. At the time Host A lost access, a full backup was running of a VM on Host B whose datastore resides on Volume 4. I think there are some other things happening in my environment that need to be addressed, but changing the PSP RR IOPS from 1000 to 1 has helped. I had been losing access to volumes on both hosts every 10-20 minutes.

I think another issue with my particular environment is the switch I am using for iSCSI. I have read several articles saying the Cisco 2960 does not have deep enough buffers for iSCSI traffic. I have checked the switch, and I do see dropped packets on the iSCSI ports. I think this is causing my latency warnings and possibly the one lost-access error. I will do more testing this week, but I just wanted to provide my results so far. I am still actively working with Drobo, VMware, and Cisco on this.

Frank

0 Kudos
MightyQuinn4310
Enthusiast

All,

  Sorry I have been so quiet lately. I have been very busy at work and then on vacation last week. It is somewhat encouraging to see several other users sharing the same problems.

With that being said, I am in the process of returning my Drobo unit. I have completely given up on the issue being solved correctly. Good Luck to everyone and thank you to all who have contributed their time, effort, and expertise in troubleshooting and working toward a solution. I hope you are successful or that Drobo and/or your vendor makes the product/situation right on their own!

Chris

0 Kudos
FrankTheTank196
Contributor

Chris,

How long have you had your unit? Are you returning it to Drobo or to the vendor you purchased it through? Ours was purchased in 2012, so I doubt I can return mine, but I'm still curious! Thanks,


Frank

0 Kudos
MightyQuinn4310
Enthusiast

Frank,

  About 8 months. I've been heavily engaged with the vendor since February, and with Drobo since I purchased in Oct/Nov. RMA is not finished yet, but has been promised and is in process with all parties. Don't want to jinx it!

Chris

0 Kudos
tknutson300
Contributor

We are having similar problems with our Drobo B1200i units, and we have narrowed the problem down to read operations.  After experiencing the problem, we began detailed disk testing by placing a single virtual guest (ESXi 5.5) onto the Drobo B1200i unit.  To perform the testing, we used the SQLIO Disk Subsystem Benchmark Tool that is freely provided by Microsoft at Download SQLIO Disk Subsystem Benchmark Tool from Official Microsoft Download Center.

During testing we did not have problems with write operations, and achieved acceptable IOPS and throughput, but we did have problems during all read operations: extremely low IOPS combined with extremely low throughput, extremely high latency, and storage disconnects on our vSphere host.  We have noticed that 100% of the time, storage disconnects are triggered on our vSphere host immediately after our read operations start.

Our testing environment consists of Cisco UCS blades using a Cisco Nexus 5k/2k stack; jumbo frames are enabled, and we have followed and implemented the best practices provided by Cisco, VMware, and Drobo.  We have had a support ticket open with Drobo since April 2015 and have provided detailed information to their engineers (including our disk testing results), but they have yet to provide a solution.
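For context, the SQLIO runs were along these lines; the exact parameters below are illustrative placeholders rather than our precise test matrix, and the test file should be pre-created large enough to exceed any cache on the Drobo:

    sqlio -kW -s120 -frandom -o8 -b64 -t4 -LS D:\testfile.dat
    sqlio -kR -s120 -frandom -o8 -b64 -t4 -LS D:\testfile.dat

In our testing the write passes completed with acceptable numbers, while the read passes are where IOPS and throughput collapsed and the host disconnects appeared.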

Testing Results (*NOTE: We have SSD Tiering Disabled as part of these tests)

sqlio_test_results.png

0 Kudos
tknutson300
Contributor

We wound up resolving the problem by making several changes, and testing has shown that it is resolved.  The first change we made was to manually disable delayed ACK on the Drobo targets inside of vSphere.  The second was to increase the login timeout from the default 5 seconds to 15 or 30 seconds (we went with 30).  After making these changes, the problem was improved but still existed; it appeared that the changes had only masked it.  We then also adjusted the default round robin IOPS limit from 1000 to 1 (this may require further tweaking to find the ideal sweet spot), as outlined at VMware KB: Adjusting Round Robin IOPS limit from default 1000 to 1.  After making this last change, our testing conclusively showed that the problem was resolved and we were getting performance within specifications.  Our testing results after the changes are shown in the image below.

upload.png
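For anyone who wants the CLI equivalents of those three changes, a rough sketch is below (vmhba33 and the naa ID are placeholders; the adapter-level parameters can also be set per target in the vSphere Client, and it's worth verifying the key names with esxcli iscsi adapter param get before setting anything):

    esxcli iscsi adapter param set --adapter=vmhba33 --key=DelayedAck --value=false     # disable delayed ACK
    esxcli iscsi adapter param set --adapter=vmhba33 --key=LoginTimeout --value=30      # raise login timeout from 5 to 30 seconds
    esxcli storage nmp psp roundrobin deviceconfig set --type=iops --iops=1 --device=naa.xxxxxxxx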

0 Kudos
FrankTheTank196
Contributor

tknutson300,

Thanks for the update.  I made all three of those same changes, but in a different order. I started with lowering the IOPS to 1, and that reduced my lost-access errors from every 10-20 minutes to one or two a week. Then I disabled delayed ACK on our ESXi hosts and changed the iSCSI login timeout from 5 to 15. Since then I only see warnings about I/O latency. As I mentioned earlier, I believe part of my problem was the switch I was using; I am working on replacing it, so I will report back once I do. Hopefully some others can try the changes we have made and report back on their experiences.

Frank

0 Kudos
FrankTheTank196
Contributor

tknutson300 - How did you disable the SSD tiering? Did you just remove the SSD drives? If so, can you tell me the procedure you used?  Thanks,

0 Kudos