VMware Cloud Community
cbrou
Contributor

High iSCSI disk latency after upgrade to vSphere

I recently did an in-place upgrade from ESX 3.5 to ESX 4 Update 1. After the upgrade I noticed that the disk latency increased substantially (see attached screenshot). Nothing else was changed in the environment during the upgrade. I did upgrade the VMware Tools and also upgraded all VMs to hardware version 7 as directed in VMware's documentation. Our SAN is an EqualLogic PS5000XV on supported firmware v4.2.1.

To troubleshoot so far I have done the following:

1. Engaged Dell and VMware. They checked the diagnostic logs for any obvious issues and couldn't find anything.

2. Reinstalled vSphere from scratch on one of my hosts. Also set up MPIO on this host according to VMware's recommended configuration (2 port groups in 1 vSwitch, vmk1 to vmnic1, vmk2 to vmnic2); see the rough command sketch after this list. Neither of these changes fixed the latency issue.

3. Attempted to use the paravirtual SCSI (PVSCSI) adapter on non-boot disks. This did not fix the latency issue.
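For reference, this is roughly what the MPIO port binding from step 2 looks like from the service console on ESX 4; the port group names, IP addresses and the vmhba33 adapter name are just placeholders for whatever your environment actually uses:

# one vmkernel port per physical NIC (override each port group to a single active uplink in the NIC teaming tab)
esxcfg-vswitch -A iSCSI1 vSwitch1
esxcfg-vswitch -A iSCSI2 vSwitch1
esxcfg-vmknic -a -i 10.0.0.11 -n 255.255.255.0 iSCSI1
esxcfg-vmknic -a -i 10.0.0.12 -n 255.255.255.0 iSCSI2

# bind both vmkernel ports to the software iSCSI adapter and verify
esxcli swiscsi nic add -n vmk1 -d vmhba33
esxcli swiscsi nic add -n vmk2 -d vmhba33
esxcli swiscsi nic list -d vmhba33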

Does anyone have any thoughts on this?

Thanks

Update: for some reason the forum thinks my 56 KB attachment "is too large". I will try again later. In a nutshell, I/O latency used to average less than 5 ms and now it is averaging around 10-20 ms and often jumping up to 60 ms.

25 Replies
AnatolyVilchins

What have you used to check performance?

Maybe these threads will be helpful for you:

http://communities.vmware.com/thread/216914?tstart=0

http://communities.vmware.com/thread/249982?tstart=0

iSCSI Software Support Department

http://www.starwindsoftware.com

Kind Regards, Anatoly Vilchinsky
ToreTrygg
Contributor

I think it might be an EqualLogic issue - it's a pretty old one, huh?

AndreTheGiant
Immortal

The PS5000XV is not that old and uses fast SAS disks.

So maybe the problem is somewhere else.

Have you tried opening a case with VMware support?

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
szelinsky
Enthusiast

Just a shot in the dark, but I would check that the proper drivers for your Dell HW (adapters etc.) are installed and configured correctly since the install.

cbrou
Contributor

I was able to attach the screenshot to the original post so you can see the dramatic change in latency after the vSphere upgrade.

cbrou
Contributor

One of my troubleshooting steps was to run the Dell Server Update Utility to get everything current on the PowerEdge 1950s. I then went ahead and reinstalled vSphere from scratch to make sure there were no issues with the in-place upgrade. I am still seeing the same latency though.

cbrou
Contributor

VMware support has not been very helpful. They suggested the changes that I mentioned in the original post and after that said that I should engage EqualLogic. EqualLogic just says that I should be using an iSCSI initiator inside my Windows OS for all of my high-I/O applications (which I understand would help, but I didn't need to do that with ESX 3.5, so why should I need to do it now?). Beyond that, EqualLogic said that the logs look clean and that they think it's a VMware issue. I then opened a ticket with Cisco (I am using 2960G switches for my iSCSI traffic) and they looked things over and also said everything looked fine on the networking side.

Besides the high latency in the EqualLogic monitoring software, one symptom that I am seeing is that my backups are taking much longer (the job rate used to be about 1,000 MB/min and now it is about 600 MB/min). This is using Symantec Backup Exec 12.5, fully patched. I am using agents installed in the guest OS, so the backup is utilizing both regular network and iSCSI traffic. I have seen some posts about backups that utilize the service console being much slower with vSphere, but I do not understand what "utilizing the service console" means...

AndreTheGiant
Immortal

Is the graph that you attached from SAN HQ 2.0?

Are all your volumes VMFS?

"EqualLogic just says that I should be using an iSCSI initiator inside my Windows OS for all of my high-I/O applications"

It could also be useful to use Auto-Snapshot Manager inside the VM.

But that is not an answer to the problem.

With vSphere there is another iSCSI configuration that uses more than one vmkernel port (you can find the document on the EqualLogic site). Are you using this configuration?

Have you enabled jumbo frames?

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
cbrou
Contributor

I am using an older version of SAN HQ (1.8.2). EQL says that if I upgrade I might lose all my history, and I don't want to do that until this problem is fixed.

As one of my troubleshooting steps I set up the two vmkernel ports, connected each of them to a separate physical NIC, bound them to the iSCSI initiator, and enabled round robin on the datastores (according to VMware's documentation; this was one of the options in EqualLogic's documentation, although they also say that you can set up more vmkernel ports and connect them to just these two pNICs). If anything this made the latency slightly worse.
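For reference, setting round robin per device from the ESX 4 service console looks roughly like this; the naa device ID below is a placeholder for the actual EqualLogic volume ID:

# list devices and their current path selection policy
esxcli nmp device list

# set round robin on one volume (device ID is a placeholder)
esxcli nmp device setpolicy --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR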

I have not enabled jumbo frames but I am hoping to try that out soon. I wonder if something has changed with ESX 4 that would require me to enable jumbo frames to see the same low latency that I saw with ESX 3.5.
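In case it helps, enabling jumbo frames on ESX 4 is roughly the following from the service console; the vSwitch and port group names and the IP are placeholders, and each vmkernel port has to be deleted and recreated to pick up the new MTU (the switches and the array need jumbo frames enabled end to end as well):

esxcfg-vswitch -m 9000 vSwitch1                                  # set MTU 9000 on the iSCSI vSwitch
esxcfg-vmknic -d iSCSI1                                          # remove the old vmkernel port
esxcfg-vmknic -a -i 10.0.0.11 -n 255.255.255.0 -m 9000 iSCSI1    # recreate it with MTU 9000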

I am also wondering if it is a physical NIC issue. My iSCSI traffic is going over 2 ports on an Intel PRO/1000 PT Quad Port 1GbE NIC (PCIe x4). What can I do to make sure that the driver for this is up to date? I just did a fresh install of vSphere so I would think it would be OK, but I can't be certain.

szelinsky
Enthusiast

Regarding the NIC drivers, it seems to me that if you ran the Dell utility to update the drivers after the clean ESX install, that should be fine. You can verify what is loaded from the console with the command lspci | grep Ethernet and check OpenManage. What are your speed and duplex settings?
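A couple of other console commands should show what driver and version the uplinks are actually using; the vmnic number and the e1000e module name below are assumptions and may differ on your hardware:

esxcfg-nics -l            # lists each vmnic with its driver, link speed and duplex
ethtool -i vmnic1         # driver name and version for a specific uplink
vmkload_mod -s e1000e     # module details (module name is an assumption for the Intel quad-port card)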

cbrou
Contributor

Here is what I get when I run the ethtool command on my iSCSI NICs:

Advertise auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000007
Link detected: yes

s1xth
VMware Employee

Since you can't/don't want to update SAN HQ yet (I have done many SAN HQ upgrades and never had a problem)... the next best thing to look at is resxtop. You can use the vMA management server if you are using ESXi, so you can get your esxtop output remotely. How do your disk times look? How is the latency on the NICs?
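For anyone following along, roughly what that looks like from the vMA (the host name is a placeholder):

resxtop --server esx01.example.com     # prompts for host credentials
# inside resxtop/esxtop: d = disk adapter view, u = disk device view, v = disk VM view, n = network view
# watch DAVG/cmd (array/fabric latency), KAVG/cmd (vmkernel latency) and GAVG/cmd (latency the guest sees)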

http://www.virtualizationimpact.com http://www.handsonvirtualization.com Twitter: @jfranconi
cbrou
Contributor

I just updated SAN HQ per your recommendation and the logs did transfer over fine... I am not sure why EQL told me that they would not. Thanks!

I have attached a screenshot of esxtop where you can see that the DAVG/cmd has jumped up to 88.69. Normally it is between 0 and 5 but every few minutes it will jump up to a value greater than 40.

s1xth
VMware Employee

Awesome... not sure why they said you would lose everything. Could you provide another screenshot from SAN HQ 2 like you did in the first post, but with 'combined graphs' being displayed?

I see a lot of reads/s on this volume - what is running on that volume? Edit: You are doing MPIO... what build of ESX/ESXi are you running now on this host? What kind of switches are you using for your iSCSI network? From what I am seeing in esxtop it looks like a ton of reads are causing some high disk latency, but I would love to see the SAN HQ 2 combined graphs to get a better understanding of the usage pattern.

http://www.virtualizationimpact.com http://www.handsonvirtualization.com Twitter: @jfranconi
depping
Leadership

DAVG is usually caused on the array side, not on the ESX side (that would be KAVG).
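For reference, the esxtop latency counters roughly add up as GAVG/cmd = DAVG/cmd + KAVG/cmd, so a DAVG reading of 88.69 ms like the one above means essentially all of the latency the guest sees is being spent outside the host (HBA, fabric or array), not in the vmkernel queue.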



Duncan

VMware Communities User Moderator | VCP | VCDX

-


Now available: Paper - vSphere 4.0 Quick Start Guide (via amazon.com): http://www.amazon.com/gp/product/1439263450?ie=UTF8&tag=yellowbricks-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1439263450 | PDF (via lulu.com): http://www.lulu.com/product/download/vsphere-40-quick-start-guide/6169778

Blogging: http://www.yellow-bricks.com | Twitter: http://www.twitter.com/DuncanYB

cbrou
Contributor

That specific instance of high latency was caused by SQL. I will attach the screenshot of the new SAN HQ. I didn't have MPIO/round robin set up at first, but I did set it up recently to see if it would help latency and it did not make much of a difference.

chimera
Contributor

Hi cbrou,

Did you manage to get this sorted?

I'm getting exactly the same issue, however it's with an EQL PS6010XV (latest firmware 5.0.5) connected via 2 x PowerConnect 8024F switches; the hosts are connected the same way via dual-port Broadcom 10GbE NetXtreme II 57711 NICs. If I look at SAN HQ, I can see write latency is perfectly fine - read latency is through the roof (50 ms+, it even hit 1500 ms at one stage!). I'm running ESX 4.1 and the software iSCSI initiator / vSwitch / vmnics etc. are set up as recommended (jumbo frames enabled all the way through: 9000 on the VMware side, 9216 on the PowerConnects and 9000 on the EqualLogic).

I have logged a call with EqualLogic support. So far they have said the firmware on the switches needs updating to at least 3.4.8 A3 (done this, no difference). They have also said the PowerConnect should have the iSCSI traffic in anything BUT the default VLAN, as VLAN 1 doesn't support jumbo frames. I have got an outage to do this on Monday night, so I will post results after that and whether it resolves the issue.
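One quick sanity check after the VLAN change, assuming the address below is a placeholder for your EqualLogic group/discovery IP: a don't-fragment vmkping at close to the full jumbo payload confirms whether 9000-byte frames actually make it end to end.

vmkping -d -s 8972 10.0.0.50    # 8972 = 9000 minus IP/ICMP headers; this fails if any hop drops jumbo frames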

Oh, and I don't think it's VMware specifically, because I also have a backup server with the same NIC running Windows 2k8 with Veeam and the iSCSI initiator to the EQL, and I get poor backup performance in Veeam when it is backing up directly off the SAN. So it's either the Dell NIC (firmware probably), a switch configuration issue (firmware already updated to the same version that Dell recommends/tests with), or potentially the latest firmware 5.0.5 on the EQL may be the cause.

Cheers

ngeron
Contributor

Hi cbrou and Chimera,

Chimera: I'm seeing almost exactly what you are with a PS6510 (v5.0.4). I also have 8024F switches with the 57711s and the same software iSCSI to vmkernel/NIC binding. We're using the EqualLogic PSP on all hosts as well. Have you updated your switches yet? Mine are running 3.1.4.5. I'm also waiting to hear back from Dell support on the diags. I'm very interested to hear if you've managed to find a solution or work-around.

cbrou:  I'm interested to hear your results too. Hopefully things have worked out for you.

ngeron
Contributor

For those who are interested, my issues have been solved and/or worked around with help from Dell EqualLogic support. We found that there was no true latency accessing disk on the array, but under sustained I/O load the network latency crept up into the high ranges reported on the VMware side. A suspicious number of delayed ACK counters suggested to the tech that I try disabling 'Delayed Ack' on the sessions. I did so on the group IP (discovery address). After a reboot, I cannot recreate the high latency issues.

I can't say that this is an ideal solution. I expect it increases chatter on the network, which may not scale.  However, it may be fine in some environments so I thought I'd pass it along.
