VMware Cloud Community
Marc_P
Enthusiast

Event: Device Performance has deteriorated. I/O Latency increased

Hi,

Since upgrading to vSphere 5 I have noticed the following error in our Events:

Device naa.60a980004335434f4334583057375634 performance has deteriorated. I/O latency increased from average value of 3824 microseconds to 253556 microseconds.

This is for different devices and not isolated to one.
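To put the figures in that event into more familiar units, here is a minimal Python sketch that parses the message text and converts microseconds to milliseconds. The regular expression and the function name `parse_latency_event` are assumptions based only on the message quoted above; this is not an official VMware format specification.

```python
import re

# Pattern assumed from the event text quoted in this thread; not an official format.
EVENT_RE = re.compile(
    r"Device\s+(?P<device>\S+)\s+performance has deteriorated\.\s+"
    r"I/O latency increased from average value\s+of\s+(?P<avg>\d+)\s+microseconds"
    r"\s+to\s+(?P<now>\d+)\s+microseconds"
)


def parse_latency_event(message):
    """Return device id, latencies in milliseconds, and the increase factor."""
    match = EVENT_RE.search(message)
    if match is None:
        return None
    avg_us = int(match.group("avg"))
    now_us = int(match.group("now"))
    return {
        "device": match.group("device"),
        "average_ms": avg_us / 1000.0,
        "current_ms": now_us / 1000.0,
        # Guard against a zero average so the division cannot fail.
        "factor": now_us / avg_us if avg_us else float("inf"),
    }


event = (
    "Device naa.60a980004335434f4334583057375634 performance has deteriorated. "
    "I/O latency increased from average value of 3824 microseconds to 253556 microseconds."
)
info = parse_latency_event(event)
print(info)
```

For the message above this works out to an average of about 3.8 ms and a reported value of about 253.6 ms, roughly a 66-fold increase.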

I'm not really sure where to start looking, as the SAN is not being pushed: these messages even appear at 4 a.m., when nothing is happening.

We are using a NetApp 3020C SAN.

Any help or pointers appreciated.

64 Replies
idle-jam
Immortal

Can you log in to Data ONTAP and check whether there are any error messages on the filer (cache, etc.)?

Marc_P
Enthusiast

The filers show no errors and the status is normal.

One thing I forgot to mention is that we recently enabled ALUA on our iSCSI group and changed the path selection to Round Robin. The SATP shows correctly on all hosts as VMW_SATP_ALUA.

nitinp1
Contributor

I am having the same problem; however, I don't have any SAN attached to the host. The datastore is on local disks.

Marc_P
Enthusiast

I have a support case open with VMware at the moment, so hopefully they can shed some light on the issue.

I will report back if and when they find anything.

Ethanism
Contributor

I'm getting similar errors in the event list on my ESXi host, but it's referring to a SAS tape drive connected to a dedicated SAS controller.  I'd be interested to find out what you discover, since it might just be related to the errors I'm seeing.

Marc_P
Enthusiast

It turns out we can ignore these errors, as they are warnings that were introduced in vSphere 5.

The highest lag we had was equivalent to 10 milliseconds, and this was during our peak hours, when users were logging in, and during our backup window.

admin
Immortal

Device latency is the "Average amount of time, in milliseconds, to complete a SCSI command from the physical device". Here the term "physical device" covers not only the disk but also any hardware between ESXi and that disk. A storage network can include storage adapters, switches, and arrays (or their equivalents in Ethernet storage networks).

If you are investigating these messages, you may also want to broaden your investigation to the storage network adapters (ESXi and Array side if applicable) and the switch firmware/configuration.  You may also want to read up on the storage network best practices and compatibility from the vendor.

Here are some references that define "Device Latency":

http://communities.vmware.com/docs/DOC-11812

http://pubs.vmware.com/vsphere-50/topic/com.vmware.wssdk.apiref.doc_50/disk_counters.html

ansond
Contributor

Is there a simple way to just turn off the diagnostic messages?  I am seeing this message as well... I have a very simple environment setup and don't care as much about trying to maximize the I/O performance - my disks are just fine... just busy...

I am just wanting to get rid of the messages...

Thanks!

Doug

PUREJOY
Enthusiast

These messages are informational and not a source of error or any system malfunction.

This gives me a clear idea of how my storage array is behaving with the given workload and  is very useful from an administrative perspective.

If I keep seeing these messages, I can go and fix my backend storage and move things around, ...

I think it's not going to hurt you; rather, it will help you design a better storage layout and deliver better latency to the applications on your VMs. I wouldn't turn it off (and I don't think you can today).

Architect @ Pure Storage || www.purestorage.com || http://www.purestorage.com/blog/ || http://twitter.com/#!/purestorage ||@ravivenk || VCAP-DCA5, VCP 4, VCP 5
ansond
Contributor

Totally understand. As I stated, I am fully aware that my storage is getting really busy in its current design state... this is OK for my particular usage of ESXi.

What I'd like to be able to do is to filter out the messages - so that what messages do appear in my log are potentially more of a concern to me... think "filter warnings, show errors only" kind of feature...

Is there no way to suppress informational messages?
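In the meantime, one workaround is to filter the events after exporting them to text. The sketch below is only that: a client-side filter over already-exported lines, not an ESXi or vCenter setting. The marker string is copied from the event text quoted in this thread; the function name and the sample lines are hypothetical.

```python
# Substring taken from the event text quoted in this thread.
NOISE_MARKER = "performance has deteriorated. I/O latency increased"


def drop_latency_events(lines):
    """Yield every line except the device-latency events."""
    for line in lines:
        if NOISE_MARKER not in line:
            yield line


# Hypothetical exported lines, for illustration only.
sample = [
    "Device naa.5000c5000b36354b performance has deteriorated. "
    "I/O latency increased from average value of 1875 microseconds "
    "to 140800 microseconds.",
    "Some other event that should remain visible",
]
for line in drop_latency_events(sample):
    print(line)
```

The same generator can be pointed at an open file object instead of a list, since it only iterates over lines.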

Doug

PaCAAP
Contributor

Will these warnings keep appearing if the I/O demand is high and stays at a constant level?

I have a customer who only recently began putting real I/O demand on his VMware ESXi 5 host; until yesterday it had done nothing heavy for about a month.

I know the hardware is fine, but I just keep receiving the warning messages.

"Device naa.5000c5000b36354b performance has deteriorated. I/O latency increased from average value of 1875 microseconds to 140800 microseconds."

Any suggestion besides just leaving the messages there is greatly appreciated.

ansond
Contributor

I am 99.9999% sure that my HW is just fine as well... and, for grins, I turned off my hourly builds, so the HW is pretty much idle. I still see this message pop up (not as much, but still occasionally), seemingly for no real reason. Nothing seems to break from it (all my VMs hum right along without issue); it's just annoying to see and it pollutes my log files, IMO.

I am beginning to wonder if there is some sort of short-lived live-lock condition in ESXi 5.x... I never saw this message in 4.x. My HW is unchanged and has been working flawlessly as far as I can tell. I have a stock Dell PowerEdge T710.

Doug

vxaxv17
Contributor

I am seeing this as well on an array that is basically unused. The times at which the messages are reported seem to be random and fall in very low usage periods like 6am or 10pm. There are no errors on the array end, only on the ESXi side. This has to be something introduced in ESXi 5, as I am not seeing any actual performance issues, and my Windows servers that connect to the SAN are not reporting any problems either. I really wish VMware would provide a little more information about this beyond the "your storage device/network is overloaded" article they have posted. That clearly is not the case here.

ansond
Contributor

Ditto... I would simply like to filter it out. I really don't care if it's an overly sensitive counter/sensor within ESXi (I know my HW and VMs don't seem to be having any issues whatsoever); I just don't like it polluting my logs with extraneous information... I'd rather just see "the more serious stuff"... 🙂

Anyone from VMware care to provide an update on this? It would be great to know the skinny (or, better yet, when/how we might be able to filter it...)

Thanks,

Doug

admin
Immortal

I have discussed this event with others internally, and I have not been informed of a method of filtering or throttling these events. The request for this feature has been submitted. The feature request submission, review, approval, and development is not a public process. We cannot make any public-facing statements or share any details as to whether the feature will be included in a future version. If you feel strongly about this feature request, please reach out to your account management team to provide use cases and help prioritize it. Thank you.

ansond
Contributor

Hi Daniel - thanks so much for the response - appreciate VMware having a look at the thread.   A filter feature would be a great addition if time/resource permits for you guys! Thanks again. Doug

jcwuerfl
Hot Shot

I'm seeing these alerts as well, but there don't seem to be any alarms in vCenter for adjusting them. Is that correct? I wonder why that is. Shouldn't everything be available as a vCenter alarm? In other words, how would I set up email notifications for these alerts if I wanted them? I'm not seeing any vCenter alarms for devices.

Thanks

TrevorW20111014
Contributor

I do not believe these messages are false. These messages occur because there IS high latency occurring, although very briefly. I have confirmed that there is a problem with ESXi 5 and the software iSCSI initiator. I purchased VMware from Dell with Dell R710 servers and must get my support through Dell.

Look at this:

[Attached screenshot: ISCSI HIGH LATENCY.png]

What this shows is esxtop for two ESXi hosts accessing the same datastore. For some reason, the host that is 'inactive' seems to pause or lag, and its latency can spike anywhere from 50 to 2000 milliseconds. In this example, it's 561 ms on the inactive host and 1.8 on the active host. When I run IOMeter on VMs that run on the datastore, the average performance is normal, but IOMeter does show the Max I/O Response time jumping to the high latency numbers reported in the vCenter event log. These are also shown in the performance graphs for the hosts.
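To illustrate why an average can look normal while the maximum response time jumps, here is a small Python sketch. The sample values are hypothetical (one 561 ms spike among 999 samples of 2.0 ms, chosen to echo the figure above) and the 50 ms threshold is an assumption taken from the low end of the spike range I described; none of this is measured data.

```python
from statistics import mean

# Hypothetical latency samples in milliseconds; not measurements from this thread.
samples_ms = [2.0] * 999 + [561.0]


def summarize(samples, spike_threshold_ms=50.0):
    """Return average, maximum, and the samples at or above the spike threshold."""
    spikes = [s for s in samples if s >= spike_threshold_ms]
    return {"avg_ms": mean(samples), "max_ms": max(samples), "spikes": spikes}


summary = summarize(samples_ms)
print(summary)
```

With these made-up numbers the average stays near 2.6 ms even though one request took 561 ms, which is why it helps to look at the maximum or count spikes rather than rely on the average alone.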

The reason most people say to ignore this, I believe, is that when the high latency is on the inactive host rather than on the active applications or VMs, those will generally perform as expected. However, with a high workload, it can actually cause the inactive host to lose the connection altogether. Also, if both hosts try to access the same datastore at the same time, the actual VMs or applications CAN lag significantly because of this.

There is clearly a problem with software iSCSI. I have completely different datastores, different physical and virtual hardware, and completely separate drivers for different hardware. The only things in common are software iSCSI and ESXi 5 (it does not happen on 4.1).

This seems to be a weird locking issue or something like it between different hosts accessing the same iSCSI datastore. I can reproduce this on any iSCSI initiator and for completely different iSCSI datastores. This is a VMware bug that needs to be addressed, not ignored. I wish I could deal with VMware directly, but I have to go through Dell. For those who seemed to report a similar issue with non-software iSCSI, it looks like it really is your hardware/setup, and that just hides this particular problem.

VMware... this is reproducible. Something is wrong with software iSCSI in ESXi 5.


vxaxv17
Contributor

This is basically what I noticed as well. The times I see these messages in the event log are usually during hours when there is almost no traffic on the SAN and/or VM hosts, which is why I think most people are saying either to ignore the warnings or that they are not true. There clearly is a problem of some sort, though. I have been working with Dell to alleviate high latency times reported on the SAN group itself (we have EqualLogic arrays), and their suggestions have definitely helped in that regard, but these warnings in vSphere still remain. I think I might submit a ticket with VMware, just because I pay for support and I want to make sure they are seeing reports of this instead of hoping they read these forums.
