      • 45. Re: Event: Device Performance has deteriorated. I/O Latency increased
        ModenaAU Novice

        OK, I changed the cache policy to always write back, and performance has gone through the roof. On a Linux guest I can now see consistent 450+ MB/sec writes and over 1,000 IOPS, and the DAVG values are not going over 2. The worst recorded latency was 30 ms.

         

        Stressing a Windows guest as hard as I can with multiple large file copies, the performance is less stellar, but still over 150 MB/sec, with DAVG up to 50 or so and latency maxing out at 80 ms.

         

        Now to get some batteries so I can leave it like this...

         

        Thank you Don for pointing out what I had overlooked!

        • 46. Re: Event: Device Performance has deteriorated. I/O Latency increased
          ModenaAU Novice

          The config is 6 drives, 300GB SAS, single RAID 5.

           

          Apparently the battery is an option... wtf? Who makes a RAID battery an option? Also, just for grins, they don't tell you about this "option" when you order the server. Silly me for assuming a RAID card would come with a battery...

          • 47. Re: Event: Device Performance has deteriorated. I/O Latency increased
            dwilliam62 Enthusiast

            You are VERY welcome!!  Glad I could help out.

             

            I don't recall the last RAID card that came without batteries. Until you get them I would not leave it in write-back (WB) mode. Very risky.

             

            Windows copy is not very efficient; each copy is single-threaded. Using Robocopy, or better yet RichCopy, yields better results.
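
            For example, a multi-threaded copy of a whole tree with Robocopy might look like the line below (a rough sketch; the paths and thread count are placeholders, and the /MT switch needs Windows 7 / Server 2008 R2 or later):

            robocopy C:\source D:\target /E /MT:16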

             

            Regards,

            • 48. Re: Event: Device Performance has deteriorated. I/O Latency increased
              Dave McD Enthusiast

              Thanks for the reply; this will really help. The only question is, how do I change the IOPS for FC? I can't see the option anywhere.

               

              As for changing the SCSI controllers, I will have to schedule an outage, etc., as these are production systems. However, you have shown me there is light at the end of the tunnel!

              • 49. Re: Event: Device Performance has deteriorated. I/O Latency increased
                dwilliam62 Enthusiast

                Earlier in this thread I posted a script to change the IOPS value.  There's no GUI option to do so.

                 

                #esxcli storage nmp device list

                When you run the above command you'll get a list of your current devices and their path policy; for volumes set to Round Robin (RR) it will show IOPS=1000.

                 

                I'm not sure what FC storage you are connecting to, but it will have a VENDOR ID.  On EQL volumes that ID is EQLOGIC.  If yours is EMC, then you need to change EQLOGIC to EMC in the script.
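
                If you need to find your array's vendor string first, something like this should work (just a quick sketch):

                #esxcli storage core device list | grep Vendor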

                 

                # Make Round Robin the default PSP for the EqualLogic SATP
                esxcli storage nmp satp set --default-psp=VMW_PSP_RR --satp=VMW_SATP_EQL
                # Switch every EQLOGIC device to Round Robin and set IOPS per path to 3
                for i in `esxcli storage nmp device list | grep EQLOGIC | awk '{print $7}' | sed 's/[()]//g'` ; do
                    esxcli storage nmp device set -d $i --psp=VMW_PSP_RR
                    esxcli storage nmp psp roundrobin deviceconfig set -d $i -I 3 -t iops
                done

                 

                After you run the script you should verify that the changes took effect.
                #esxcli storage nmp device list

                 

                Regards,

                 

                Don

                • 50. Re: Event: Device Performance has deteriorated. I/O Latency increased
                  jdiaz1302 Lurker

                  I see that most of you just want a way to deactivate the messages, but in my case I am having genuinely degraded performance in one of my VMs, along with packet loss in that VM. It is not just the message; I have other symptoms.

                  • 51. Re: Event: Device Performance has deteriorated. I/O Latency increased
                    dwilliam62 Enthusiast

                    Are you connecting to a Dell/EqualLogic array?    That's what I'm most familiar with.

                     

                    Common causes of performance issues that generate that alert are listed below; most will apply to all storage, and some example commands follow the list.

                     

                    1.)  Delayed ACK is enabled.

                    2.)  Large Receive Offload (LRO) is enabled.

                    3.)  MPIO pathing is set to FIXED.

                    4.)  MPIO is set to VMware Round Robin but the IOPS per path is left at the default of 1000.  It should be 3.

                    5.)  VMs with more than one VMDK (or RDM) are sharing one virtual SCSI adapter.  Each VM can have up to four virtual SCSI adapters.

                    6.)  The iSCSI switch is not configured correctly or not designed for iSCSI SAN use.
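
                    For items 2 and 4, here is a rough sketch of how you might check or change them from the ESXi console (assuming ESXi 5.x; a host reboot may be required for the LRO change):

                    # Disable Large Receive Offload for the host (item 2)
                    esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0
                    # Confirm each device's path policy and IOPS setting (items 3 and 4)
                    esxcli storage nmp device list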

                     

                    If this is a Dell array, please open a support case.   They can help you with this.

                     

                    Regards,

                    • 52. Re: Event: Device Performance has deteriorated. I/O Latency increased
                      irvingpop2 Novice

                      Lately we've seen huge increases in performance with a few simple iSCSI tuning methods (NetApp FAS2040 - 4x 1GbE).   We've gone from latency alarms several times per day to none at all.  

                       

                      I haven't seen these concisely documented anywhere, so here's what we did:

                       

                      1. Use bytes=8800 (with Jumbo frames) rather than an IOPS value (or the default).
                      2. Make sure the active path count matches the number of storage adapter NICs on your VM host or storage system (whichever is less).
                        • Previously we had iSCSI Dynamic Discovery, which added all 4 NetApp paths for each storage adapter vmk (resulting in 16 paths per LUN); this resulted in "Path Thrashing".   Changed to Static Discovery and manually mapped only 1 iSCSI target per vmk.
                      3. Don't use LACP on either side.   LACP completely ruins RR MPIO.
                      4. Fix VM alignment.   We had a handful of Windows 2003 and Linux guests with bad alignment.  They didn't do much I/O so we ignored them in the past; big mistake.  (NetApp's Performance Advisor really helped to nail this down.)
                      5. Stagger all scheduled tasks.   We found a number of I/O-intensive tasks (AV updates, certain backups) all running at the same times in our environment.

                       

                      From this article:  http://blog.dave.vc/2011/07/esx-iscsi-round-robin-mpio-multipath-io.html

                       

                      The command we used is:

                      # For each device whose naa ID starts with your SAN's prefix, switch the RR policy to 8800 bytes
                      esxcli storage nmp device list | grep ^naa.FIRST_8_OF_YOUR_SAN_HERE | while read device ; do
                          esxcli storage nmp psp roundrobin deviceconfig set -B 8800 --type=bytes --device=${device}
                      done
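
                      To spot-check a single device afterwards (the device ID below is a placeholder):

                      esxcli storage nmp psp roundrobin deviceconfig get -d naa.xxxxxxxx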

                       

                       

                      Throughput results:

                      • Original:  95 MB/s
                      • IOPS=1 or IOPS=3:   110-120 MB/s
                      • Bytes=8800:   191 MB/s   (hurray!)
                        • 4KB IOPS also saw a 3x improvement over the original configuration

                       

                       

                       

                      NOTE:   We also changed back from Software iSCSI to the Broadcom NetXtreme II "Hardware Dependent" driver now that the new June 2012 version supports Jumbo frames: https://my.vmware.com/group/vmware/details?downloadGroup=DT-ESXi50-Broadcom-bnx2x-17254v502&productId=229

                       

                      If I could do this all over again I would skip iSCSI altogether.   What a complete PITA it has been to get decent performance compared to spending a few grand more for FC.

                      • 53. Re: Event: Device Performance has deteriorated. I/O Latency increased
                        boromicmfcu Lurker

                        With ESXi 5, Delayed ACK keeps re-enabling itself on my hosts, resulting in high latency on my SAN. It is getting really annoying. Has anyone else experienced this problem? I am disabling it globally on the software iSCSI initiator. I believe a reboot is required when you disable it, so when it re-enables itself I am not sure if it takes effect until the next reboot or when it turns itself back on.

                        • 54. Re: Event: Device Performance has deteriorated. I/O Latency increased
                          dwilliam62 Enthusiast

                          Are you at the current build of ESXi v5?

                           

                          What I've been seeing is that just disabling Delayed ACK does not update the database that stores the settings for each existing LUN.   Only NEW LUNs will inherit the value.

                           

                          You can check by going on the ESXi console and entering:

                           

                          #vmkiscsid --dump-db | grep Delayed

                           

                          All the values should be ="0" (disabled).

                           

                          I find that removing the discovery address, and removing the discovered targets in the "Static Discovery" tab, cleans out the DB.   Then add the discovery address back in with Delayed ACK disabled, AND make sure the login_timeout value is set to 60 (the default is 5).   Then do a rescan.

                           

                          Go back to the CLI and re-run #vmkiscsid --dump-db | grep Delayed to verify.

                          You should also run #vmkiscsid --dump-db | grep login_timeout to check that setting.
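
                          On builds where the iSCSI adapter parameters are exposed through esxcli (5.1 and later, I believe), you can also inspect them per adapter; vmhba37 below is just a placeholder for your software iSCSI adapter:

                          #esxcli iscsi adapter param get -A vmhba37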

                          • 55. Re: Event: Device Performance has deteriorated. I/O Latency increased
                            boromicmfcu Lurker

                            I am at 5.0.0, build 721882.

                             

                            I got 17 node.conn[0].iscsi.DelayedAck='x' results back, with only 6 of them reporting 0 and all the rest 1.

                             

                            I have some scheduled maintenance this weekend, so I am going to install the latest ESXi patch and clean out the discovered addresses.

                             

                            I also found this article: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2007829 referencing recommendations for EqualLogic arrays and iSCSI logins. We use EqualLogic; the article recommends 15, and that is what it is currently set at.  I am not getting any initiator disconnect errors from the SAN.

                             

                            Is 15 too conservative, in your experience?

                             

                            Thanks

                            • 56. Re: Event: Device Performance has deteriorated. I/O Latency increased
                              dwilliam62 Enthusiast

                              The default in ESXi v5 is 5 seconds; in larger groups with many connections, that timeout will be too short.   Setting it to 60 covers all scenarios.

                               

                              Also, VMware will be releasing a patch for 4.1 that will allow the login timeout to be extended from the 15-second default to 60.

                               

                              Re: Delayed ACK.  I've seen that also.   Worst case, I've gone to the Static Discovery tab and manually modified each target, then repeated that on the other nodes.  :-(   No fun if you have a lot of volumes.

                              • 57. Re: Event: Device Performance has deteriorated. I/O Latency increased
                                chi201110141 Novice

                                I too am seeing this on 2 of my 3 hosts.  One host is hardly doing anything (at the moment); the other 2 are coming up with these messages, mainly out of hours.

                                 

                                All 3 are the same spec, using local storage: RAID 6, 16 drives.

                                ESXi v5.0.0, build 469512.

                                 

                                Is there a fix?

                                • 58. Re: Event: Device Performance has deteriorated. I/O Latency increased
                                  iBahnEST Lurker

                                  @IrvingPOP2

                                  I have been receiving these messages since we first built our solution, which consists of HP Blades with 2x 1Gb NICs per server, a Cisco 3120G switch, and a NetApp FAS2040. I've been researching this issue for a long time, and your post has given me hope that there might be a light at the end of the tunnel.  I'm planning on implementing some of your steps, but I'm curious about a few things from your post:

                                   

                                  Make sure the active Path count matches the number of storage adapter NICs on your VM host or Storage system (whichever is less)

                                  We only have 2 links per server to attach to the network, but the FAS2040 has 4 NICs.  The FAS2040's NICs are set up using LACP (Dynamic Multimode VIFs).  Are 2 links per server enough for this configuration, or would you recommend more?

                                   

                                  Previously we had iSCSI Dynamic Discovery which added all 4 NetApp paths for each storage adapter vmk (resulting in 16 paths per LUN); this resulted in "Path Thrashing". Changed to Static discovery and manually mapped only 1 iSCSI target per vmk.


                                  When I configured my ESX hosts for Static Discovery, the next time I rebooted those hosts the iSCSI paths were gone.  Have you run into this issue?

                                   

                                  Don't use LACP on either side.   LACP completely ruins RR MPIO.

                                   

                                  NetApp's documentation (TR-3802) discusses link aggregation, and LACP (Dynamic Multimode) looks like the best option on paper, as opposed to EtherChannel (Static Multimode), because EtherChannel is susceptible to a "black hole" condition.  I'm curious how you configured your storage and switch since you removed LACP.  Would you be so kind as to paste the configs from your NetApp and switch?

                                   

                                   

                                  Lastly, out of all the changes that you made, which would you say was the most helpful?

                                   

                                  Thanks!

                                  • 59. Re: Event: Device Performance has deteriorated. I/O Latency increased
                                    irvingpop2 Novice

                                    iBahnEST,

                                     

                                    It's been many months since we implemented these changes, and many more lessons have been learned.   Let me summarize them:

                                     

                                    Regarding the NetApp FAS2040:

                                    Lower-end NetApps give poor throughput (MB/s) compared to "dumber" arrays; however, they give much better IOPS, so the trade-off is yours to make.   Here is a summary of what I learned from my many conversations with NetApp:

                                      1. The FAS2040 has a really tiny NVMEM cache (512MB, but only 256MB usable at a time).    Your statit and sysstat output will show a huge amount of flushing to disk during writes because of "nvlog full".
                                      2. WAFL is spindle-greedy.   If your aggregate RAID groups are smaller than the recommended size (16-20 disks), your throughput will suffer badly (like 15 MB/s per disk).  A 2040 only has 12 disks (split among 2 controllers), so the RAID groups are super un-optimized no matter what kind of disk you use.
                                      3. ONTAP 8 is RAM-greedy, especially with fancy features like Dedupe.    FAS2040 controllers only have 4GB of RAM each, and NetApp will tell you that only 1.5GB is left to work with once the OS is booted.  See the NetApp communities; people with 4GB RAM filers (2040, 3100) are getting crushed by the upgrade to 8.1 when dedupe is involved.   Remove Dedupe and don't go higher than ONTAP 8.0.4.

                                     

                                    In our case, we shifted our backups (NetBackup direct-style off-host backup) from iSCSI to FC, thinking our iSCSI setup was still sub-optimal.   Sustained throughput (read-only) is still around 90-110 MB/s.

                                     

                                    For the math-challenged, that is still comparable to what a single gigabit iSCSI link can achieve with Jumbo frames enabled.

                                     

                                    Regarding iSCSI

                                    • In summary, I would never use iSCSI on another production system.  Ever again.   The amount of effort required to tune and monitor it is huge, and you STILL get sub-par performance.  Just not worth it.
                                      • For NetApps, use NFS.  Even NetApp will tell you that the performance will be much better.
                                    • The biggest performance improvements we got (in iSCSI) were:
                                      1. Reducing the number of iSCSI paths per LUN.   1-2 is enough, especially if you are storage-throughput limited.
                                      2. 2 physical paths between VM host and storage doesn't mean 2 paths per LUN.   Because an iSCSI session is mapped per "path" per LUN, multiple LUNs will still contend for the same physical paths.
                                      3. Definitely don't use LACP with iSCSI MPIO.    Remember that once a MAC address pair has been assigned to an LACP channel, it is stuck there until that channel goes down.  We found lots of link contention on both the NetApp and VM host sides because LACP is dumb in the way it assigns and then never re-balances.   NetApp recommends LACP for NFS only.
                                      4. We went back from bytes=8800 to iops=1, as we found fewer latency spikes during business hours.     Because of point #2 above, 2 iSCSI sessions will try to cram 8800 bytes down a single path (causing contention).  The command to revert is shown after this list.
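
                                    For reference, switching a device back from bytes to iops=1 looks like this (the device ID is a placeholder):

                                    esxcli storage nmp psp roundrobin deviceconfig set -d naa.xxxxxxxx -I 1 -t iops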

                                    Regarding your static discovery question: are you getting the paths via Dynamic Discovery and then removing the dynamic entries?    It's best to remove all the dynamic stuff, reboot, and then add the entries manually.

                                     

                                    I can share with you a NetApp rc section, which simply shows all 4 gigabit interfaces configured for iSCSI only, with ports going to 2 different switches:

                                     

                                     

                                    ifconfig e0a 192.168.15.11 netmask 255.255.255.0 partner e0a mtusize 9000 trusted -wins up
                                    ifconfig e0b 192.168.15.12 netmask 255.255.255.0 partner e0b mtusize 9000 trusted -wins up
                                    ifconfig e0c 192.168.15.13 netmask 255.255.255.0 partner e0c mtusize 9000 trusted -wins up
                                    ifconfig e0d 192.168.15.14 netmask 255.255.255.0 partner e0d mtusize 9000 trusted -wins up

                                     

                                     

                                     

                                    Sorry for the lengthy post; hope that's helpful.