Marc_P
Enthusiast

Event: Device Performance has deteriorated. I/O Latency increased

Hi,

Since upgrading to vSphere 5 I have noticed the following error in our Events:

Device naa.60a980004335434f4334583057375634 performance has deteriorated. I/O latency increased from average value of 3824 microseconds to 253556 microseconds.

This is for different devices and not isolated to one.

I'm not really sure where to start looking, as the SAN is not being pushed; these messages even appear at 4am when nothing is happening.

We are using a NetApp 3020C SAN.

Any help or pointers appreciated.

64 Replies
CRad14
Hot Shot

Yeah, I have been trying everything I can think of to resolve this issue myself, but haven't been able to.

Again, I didn't notice this until I migrated to vSphere 5, and with the number of people reporting this issue, I definitely think VMware needs to take a look at it.

Conrad www.vnoob.com | @vNoob | If I or anyone else is helpful to you make sure you mark their posts as such! 🙂
irvingpop2
Enthusiast

I've noticed the same issue on my FAS2040, but my events happen only when running SMVI (NetApp VSC) backups and when snapshots are being removed (the two usually happen together).

Check whether you can correlate the events with high NetApp CPU usage. If so, check whether you have large numbers of snapshots, possibly in combination with dedupe. Also check your NetApp and SMVI snapshot schedules.
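If it helps, the quick checks I run on the filer console for this look roughly like the following (this assumes a 7-Mode controller such as a FAS2040/3020; the volume name is just a placeholder):

sysstat -x 1          (per-second CPU, disk utilization and protocol ops)
snap list <vol_name>  (snapshots currently held on the datastore volume)
sis status            (dedupe status and progress per volume)

If the latency events line up with a CPU or disk-utilization spike while snapshots are being deleted or dedupe is running, the snapshot and dedupe schedules are the first thing to stagger.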

boromicmfcu
Contributor

This is interesting.

I was on a call for a vCenter Server issue and the VMware tech, while looking at the logs on one of my hosts, noticed the I/O latency increased errors. The funny thing is they didn't start appearing until after I updated to 5 Update 1. I checked my other host and it had the same errors on LUNs that are presented to multiple ESXi hosts. Prior to the upgrade I was receiving iscsi_vmk: iscsivmk_ConnReceiveAtomic: Sess [ISID:  TARGET: (null) TPGT: 0 TSIH: 0] errors. Those have to do with EqualLogic LUNs that aren't configured for access from multiple ESXi hosts, so that part is okay. Oddly, I don't see them happening anymore after the 5 Update 1 upgrade.

There was a networking bug in 5 that could affect iSCSI connections, but I don't know if it is the same one being discussed in this thread. It has been fixed in 5 Update 1. See the links below.

http://vmtoday.com/2012/02/vsphere-5-networking-bug-affects-software-iscsi/

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=200814...

Can anyone who had the latency errors prior to upgrading confirm whether the issue persists in 5 Update 1?

vxaxv17
Contributor

I ended up putting a support call into VMware about this message. I wasn't actually noticing any problems with my setup; I was just worried about the messages. After the rep verified that all of my settings were optimal, they pulled a developer onto the call who basically said that these messages don't necessarily indicate a problem in my case. They appear when the latency changes by 20 or 30% (can't remember which). I'm seeing messages like "latency increased from 1286 microseconds to 24602 microseconds", which is still only about 24 ms. It happens for a second and then drops back down again, so it isn't even that high to begin with. They also confirmed that these messages are new to version 5, which is why people only started seeing them after upgrading from 4 to 5. Anyway, I wish they would change the logic on these messages so they would only appear if the percentage changed by a certain amount AND the overall latency was over a certain threshold as well.

TrevorW20111014
Contributor

After what seems like 1,000 tests and an eternity, Dell has finally decided to involve the VMware support team. There really does seem to be a problem with software iSCSI in ESXi 5. The problem has not been resolved in any updates. As my image above shows, this is a quick burst of high latency, and it always occurs on the hosts that are not on the active path, although the latency can leak over to the active path. I think there is either a bug in the latency monitoring (i.e. the monitoring functionality within VMware itself is the cause of the latency) or there is a flaw in software iSCSI, with something like locking or pathing causing the issue.

I would ignore this, except my tests show that even a small burst of latency can cause the ESXi host to disconnect from the datastore, although it takes a serious workload to get that result.

I will post back with the results of the VMware support testing (hopefully I have a repeatable environment where they can isolate the problem). I am pulling my hair out; over $100K in equipment has been on hold for months over this.

vxaxv17
Contributor

Thanks for the info.  Please keep us updated!

TrevorW20111014
Contributor

Another FYI: I have now proven that if all the software iSCSI sources have at least one active path, the latency disappears and everything performs well. I get the WORST latency (by far) when the paths are inactive and the hosts are doing essentially nothing. Look at the image below. The latency plunged the moment I started activity on all paths, and I can repeat this over and over. The latency to the left occurred with nothing going on with the hosts; they were idle. Then I started IOMeter with multiple threads on clients of each software iSCSI source (you would think it would get worse) and the latency returned to normal.

[Attached image: Latency Sample.png]

vxaxv17
Contributor

How are you determining if/when paths are active?  Do you have multiple hosts accessing the volume?  The graph you show is for a particular host, no?  Does it show the same on all hosts?

I have 4 hosts and 2 datastores that are presented to all 4 hosts. One datastore has VMs on it that are moderately busy; the other was just created a few weeks ago and only has a couple of VMs on it. Over the last hour (that's all the real-time graph shows) the latency numbers on the almost unused datastore look like:

host 1: max 5ms, average .1ms

host 2: max 5ms, average .033ms

host 3: max 0ms, average 0ms

host 4: max 3ms, average .022ms

This is obviously low usage for this volume, and I am not seeing the numbers that you are seeing. Also, your graph doesn't say: is it read or write latency that is spiking?

TrevorW20111014
Contributor

Yes, there are multiple (two) hosts, and yes, I see the same result on both hosts. I use esxtop to monitor the latency as well as the built-in performance charts. I know which path I will set active by which datastore contains the VM and by choosing the host that will run the VM. I only have this issue on ESXi 5, not 4.1. I have completely distinct software iSCSI sources (i.e. a Dell MD3200i and a FalconStor NSSVA). Both of these sources have multiple datastores. Both use different physical connections (i.e. different network cards, different switches, etc.). There are no common drivers or physical connections, yet both have the same problem.

Again, if I have all my iSCSI data sources active, then the latency goes back to normal. If any path is inactive (no VMs performing any read/write activity), I get bad latency.

vxaxv17
Contributor

So if you were to create a new volume and put nothing on it, would you see this behavior? I'm trying to replicate it here but so far have been unable to. I am only running 5.0 Update 1 here and never had 4.x installed.

TrevorW20111014
Contributor

It's not about the volume, it's about the iSCSI paths (not physical paths). Say I have a Dell MD3200i, two hosts, and two datastores. If I just power on VMs and let them sit idle, I get horrible latency. If one host accesses a datastore, I still have high latency. If both hosts access the datastore, the latency drops to normal.

I have had my connections, my drivers, my setup, etc. all reviewed by numerous engineers. VMware wants to blame the vendor of the iSCSI product (i.e. Dell and its MD3200i). Dell and FalconStor have performed hundreds of tests, and it just doesn't make sense that both would have the same issue at the same time. I have gone so far as to start over with completely fresh installs of ESXi 5 with the latest updates, reloading and configuring everything.

I don't think this is a bug that hits every ESXi 5 user; there is something about particular conditions for particular users with software iSCSI. I just think it's a VMware problem under certain conditions. It is driving me nuts.

TrevorW20111014
Contributor

This "may" be premature, but I found some references to DelayedAck - a setting that can be setup at mutiple levels for software ISCSI. I edited the advanced settings for the software ISCSI adapter, turned off DelayedAck (at the highest level - all software ISCSI sources would not use it) and rebooted each host. So far (knock on wood) the latency issue has vanished and I am getting normal (low latency) performance.

We will see what happens over the next few days.

vxaxv17
Contributor

Dell recommends that Delayed ACK be disabled for most, if not all, of their iSCSI devices.
Below is a message that was sent to me for a performance ticket I had open with Dell.

I've disabled Delayed ACK at the iSCSI initiator level, as I didn't want to have to do it for each connection.

Tcp Delayed Ack

We recommend disabling TCP Delayed ACK for most iSCSI SAN configurations.

It helps tremendously with read performance in most cases.

WINDOWS:

On Windows the setting is called TcpAckFrequency and it is a Windows registry value.

Use these steps to adjust Delayed Acknowledgements in Windows on an iSCSI interface:

1. Start Registry Editor.

2. Locate and then click the following registry subkey:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\<Interface GUID>

Verify you have the correct interface by matching the IP address in the interface table.

3. On the Edit menu, point to New, and then click DWORD Value.

4. Name the new value TcpAckFrequency, and assign it a value of 1.

5. Quit Registry Editor.

6. Restart Windows for this change to take effect.

http://support.microsoft.com/kb/328890

http://support.microsoft.com/kb/823764/EN-US  (Method 3)

http://support.microsoft.com/kb/2020559
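For anyone who prefers to script this, the same change can be made from an elevated command prompt with reg.exe. The interface GUID below is a placeholder; replace it with the GUID of the interface whose IP address matches your iSCSI NIC, and reboot afterwards as in step 6:

reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{your-iscsi-interface-guid}" /v TcpAckFrequency /t REG_DWORD /d 1 /f

reg query "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{your-iscsi-interface-guid}" /v TcpAckFrequency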

---------------------------------------------------------------------------------------------------------------------

ESX

For ESX the setting is actually called TCP Delayed ACK, and it can be set in three places:

1. On the discovery address for iSCSI (recommended)

2. On a specific target

3. Globally

Configuring Delayed Ack in ESX 4.0, 4.1, and 5.x

To implement this workaround in ESX 4.0, 4.1, and 5.x use the vSphere Client to disable delayed ACK.

Disabling Delayed Ack in ESX 4.0, 4.1, and 5.x
1. Log in to the vSphere Client and select the host.
2. Navigate to the Configuration tab.
3. Select Storage Adapters.
4. Select the iSCSI vmhba to be modified.
5. Click Properties.
6. Modify the delayed Ack setting using the option that best matches your site's needs, as follows:

Modify the delayed Ack setting on a discovery address (recommended).
A. On a discovery address, select the Dynamic Discovery tab.
B. Select the Server Address tab.
C. Click Settings.
D. Click Advanced.

Modify the delayed Ack setting on a specific target.
A. Select the Static Discovery tab.
B. Select the target.
C. Click Settings.
D. Click Advanced.

Modify the delayed Ack setting globally.
A. Select the General tab.
B. Click Advanced.

(Note: if setting globally you can also use vmkiscsi-tool
# vmkiscsi-tool vmhba41 -W -a delayed_ack=0)


7. In the Advanced Settings dialog box, scroll down to the delayed Ack setting.
8. Uncheck Inherit From parent. (Does not apply for Global modification of delayed Ack)
9. Uncheck DelayedAck.
10. Reboot the ESX host.

Re-enabling Delayed ACK in ESX 4.0, 4.1, and 5.x
1. Log in to the vSphere Client and select the host.
2. Navigate to the Advanced Settings page as described in the preceding task "Disabling Delayed Ack in ESX 4.0, 4.1, and 5.x"
3. Check Inherit From parent.
4. Check DelayedAck.
5. Reboot the ESX host.

Checking the Current Setting of Delayed ACK in ESX 4.0, 4.1, and 5.x
1. Log in to the vSphere Client and select the host.
2. Navigate to the Advanced Settings page as described in the preceding task "Disabling Delayed Ack in ESX 4.0, 4.1, and 5.x."
3. Observe the setting for DelayedAck.

If the DelayedAck setting is checked, this option is enabled.
If you perform this check after you change the delayed ACK setting but before you reboot the host, the result shows the new setting rather than the setting currently in effect.
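(If you would rather check from the ESXi shell than the vSphere Client, I believe the per-target values can also be seen by dumping the iSCSI database, e.g.:

# vmkiscsid --dump-db | grep Delayed

Treat that as an unverified note from memory; the vSphere Client steps above are the documented way to check.)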

Source Material:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100259...

http://www.vmware.com/support/vsphere4/doc/vsp_esx40_vc40_rel_notes.html

http://www.vmware.com/support/vsphere4/doc/vsp_esx40_u2_rel_notes.html

http://virtualgeek.typepad.com/virtual_geek/2009/09/a-multivendor-post-on-using-iscsi-with-vmware-vs...

ansond
Contributor

I tried setting DelayedAck per the previous post; however, my particular setup is not using iSCSI targets - all of my drives are simple SATA drives directly connected to the host. I continue to see the warning messages, and ESXi did complain about not finding the appropriate iSCSI configuration when I tried to force the setting...

On a whim, I upgraded my ESXi host to the latest patchset VMware has, 5.0 Update 1. It seems to work great as an update, but I still see the log entries even on this latest patchset.

Doug

TrevorW20111014
Contributor

I am continuing to update the saga here. With the DelayedAck change, I do get substantially better performance (better latency). However, I still have two weird issues.

1) If I have at least one virtual machine actively doing something on an iSCSI datastore, I get this kind of performance:

[Attached image: 2025118_1.png]

It is what I would expect with the hardware involved. However, I STILL get events in the event log saying that "performance has deteriorated". The event lists a datastore, a time, and the values that triggered the event. The problem is that I was watching during that time: I was monitoring with esxtop and with IOMeter. There WAS NO LATENCY ON THAT DATASTORE AT THAT TIME! It was not in the vCenter performance charts, it did not display in esxtop, and it did not show in IOMeter. Clearly, there is a major bug in the code that triggers this event.

2) Now my SECOND issue: if I do NOT have at least one active virtual machine (reading and/or writing data), i.e. the VMs are powered on but essentially sitting idle, then I get significantly worse latency and many, many more events in the event log reporting latency errors. Here is a sample:

[Attached image: 2025118_2.png]

tranp63
Contributor

I recommend setting the NMP path selection policy on all of your ESX hosts' datastores to "Round Robin" to maximize throughput: Round Robin uses automatic path selection that rotates through all available paths and distributes the load across them. The default is the Fixed setting. Hopefully this will resolve latency errors such as:

"Device naa.60a980004335434f4334583057375634 performance has deteriorated. I/O latency increased from average value of 3824 microseconds to 253556 microseconds."

[Attached image: Round Robin Policy.jpg]
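To see which devices are still on the default Fixed policy before changing anything, you can list the NMP devices from the ESXi shell; each device entry includes a "Path Selection Policy" line (drop the grep to see which naa device each line belongs to):

# esxcli storage nmp device list | grep -i "Path Selection Policy"

Anything still reporting VMW_PSP_FIXED is a candidate to move to Round Robin (VMW_PSP_RR).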

dwilliam62
Enthusiast

If you do use VMware Round Robin, you will need to change the IOPS-per-path value from 1000 to 3; otherwise you will not get the full benefit of multiple NICs.

For EqualLogic devices, you can use the script below to set all EQL volumes to Round Robin and set the IOPS value to 3. You can modify it for other vendors.


esxcli storage nmp satp set --default-psp=VMW_PSP_RR --satp=VMW_SATP_EQL
for i in `esxcli storage nmp device list | grep EQLOGIC | awk '{print $7}' | sed 's/(//g' | sed 's/)//g'`
do
    esxcli storage nmp device set -d $i --psp=VMW_PSP_RR
    esxcli storage nmp psp roundrobin deviceconfig set -d $i -I 3 -t iops
done

After you run the script you should verify that the changes took effect.
#esxcli storage nmp device list
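You can also confirm the IOPS limit on an individual volume by querying its Round Robin configuration (the naa ID here is a placeholder; use one of your own from the device list):

# esxcli storage nmp psp roundrobin deviceconfig get -d naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

The IO operation limit reported for the device should now be 3 rather than the default 1000.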

This post from VMware, EMC, Dell, and HP explains a little bit about why the value should be changed.

http://virtualgeek.typepad.com/virtual_geek/2009/01/a-multivendor-post-to-help-our-mutual-iscsi-cust...

Another cause of the latency alerts is having multiple VMDKs (or raw device mappings) on a single virtual SCSI controller. You can have up to four virtual SCSI controllers in each VM, and giving a disk its own controller greatly increases IO rates and concurrent IO flow. Like a real SCSI controller, a virtual one works with only one VMDK (or RDM) at a time before selecting the next, so when each disk has its own controller the OS can get more IOs in flight at once. This is especially critical for SQL and Exchange: the logs, the database, and the C: drive should each have their own virtual SCSI adapter.

This website has info on how to do that. It also covers the "Paravirtual" virtual SCSI adapter, which can further increase performance and reduce latency.

http://blog.petecheslock.com/2009/06/03/how-to-add-vmware-paravirtual-scsi-pvscsi-adapters/

Regards,

Don

Dave_McD
Contributor

I am having the same problem with FC storage, which has caused hosts to disconnect from the vCenter Server. That has only happened since I installed SRM 5.

I checked my path selection policies and several were set to Fixed instead of Round Robin, so I changed them. I am still getting the latency messages, although the hosts are no longer disconnecting.

The main culprit is an RDM attached to a Linux VM that actually has 7 RDMs attached.

All the RDMs are on the one datastore. What can I do to improve the performance? Should I consolidate the RDMs, or split them over different datastores?

dwilliam62
Enthusiast

Make sure that the IOPS value on all your volumes isn't left at the default. The default is 1000, which won't fully leverage all available paths. For iSCSI I use 3; a similarly low value should work well with Fibre Channel too. The script I posted would need slight modification to work with FC.
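If you only have a handful of FC volumes, you can skip the script and apply the same pair of commands from it to each device by hand (the naa ID below is a placeholder for one of your FC LUNs):

# esxcli storage nmp device set -d naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx --psp=VMW_PSP_RR
# esxcli storage nmp psp roundrobin deviceconfig set -d naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx -I 3 -t iops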

Also, on that Linux VM, how many virtual SCSI controllers are there? I suspect only one: "SCSI Controller 0", with the drives at SCSI(0:0), SCSI(0:1), etc. under the Virtual Device Node box on the right-hand side.

If so, you need to create additional SCSI controllers. You can have up to four virtual SCSI controllers per VM, so with seven RDMs you will need to double some up. Any VM that has multiple VMDKs or RDMs and does any significant IO should have this done.

Shut down the VM and edit its settings. Select the VMDK/RDM you want to move to another controller and, under "Virtual Device Node", change the ID from SCSI(0:2) (for example) to SCSI(1:0) using the drop-down, scrolling the list until you see SCSI(1:0). Repeat this for all of the busiest RDMs. You'll need to double up some: the boot drive at SCSI(0:0) should share a controller with the least busy RDM, placed at SCSI(0:1). The two remaining RDMs also need to go on different SCSI adapters, again pairing the next least busy RDMs, so they'd be at SCSI(1:1) and SCSI(2:1).

Then boot the VM.  You should notice a big difference.  
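For reference, after you add a second controller the VM's .vmx ends up with entries roughly like the following. This is only a sketch: the disk file name is made up, and virtualDev will be "lsilogic", "lsisas1068", or "pvscsi" depending on which adapter type you pick.

scsi1.present = "TRUE"
scsi1.virtualDev = "pvscsi"
scsi1:0.present = "TRUE"
scsi1:0.deviceType = "scsi-hardDisk"
scsi1:0.fileName = "example_data_disk.vmdk"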

If you have problems with this procedure, let me know. I have a draft of a document I put together on how to do this, including screenshots, etc.

Regards,

Don

stainboy
Contributor

Just remember, as with a Windows MSCS cluster using RDMs: if your Linux VM is placing SCSI reservations on those LUNs, you might end up with problems when using RR...
