Monitoring for increases in I/O latency

vmproteau · ‎09-10-2015

We recently had an issue with I/O latencies significant enough to cause some guest issues. We were alerted to this by the end users and guest OS engineers; no alerts were generated from the VI. Increased latencies and, in this case, the temporary suspension of I/O to a datastore are things I'd like to be alerted to. I can't seem to find a native vCenter alarm configuration that triggers alerts for these events. Is anyone monitoring for these? Are you using the vCenter native alarms to generate alerts?

During the events we see these messages in the vCenter host event log: "I/O latency increased from average value of X microseconds to Y microseconds".
When they get severe enough, VMFS suspends I/O to the datastore and logs: "Lost access to volume 502daf5e-8efcea35-0dba-e0db550055a0 (Datastore Name) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly".
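For anyone who exports the host events as text and wants to flag these two messages outside vCenter, here is a minimal sketch. The patterns are built only from the two messages quoted above (the exact wording may differ between ESXi versions), and the function name `classify_event`, the `ratio_threshold` parameter and the severity labels are hypothetical, not anything vCenter provides.

```python
import re

# Patterns derived from the two event messages quoted in this post.
# Assumption: the wording is stable; adjust if your ESXi build differs.
LATENCY_RE = re.compile(
    r"I/O latency increased from average value of (\d+) microseconds "
    r"to (\d+) microseconds"
)
LOST_ACCESS_RE = re.compile(
    r"Lost access to volume (\S+) \((.+?)\) due to connectivity issues"
)


def classify_event(message, ratio_threshold=2.0):
    """Return a dict describing a storage event, or None if not recognised.

    ratio_threshold is a hypothetical knob: a latency increase of at least
    this factor is reported as a warning, anything smaller as info.
    """
    match = LATENCY_RE.search(message)
    if match:
        old_us, new_us = int(match.group(1)), int(match.group(2))
        ratio = new_us / old_us if old_us else float("inf")
        severity = "warning" if ratio >= ratio_threshold else "info"
        return {"kind": "latency", "old_us": old_us, "new_us": new_us,
                "severity": severity}

    match = LOST_ACCESS_RE.search(message)
    if match:
        # Suspended I/O to a datastore is always treated as critical here.
        return {"kind": "lost_access", "volume": match.group(1),
                "datastore": match.group(2), "severity": "critical"}

    return None
```

The returned dict could then be passed to whatever notification mechanism is already in place (email, ticketing, and so on).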

akblpatel · ‎09-12-2015

Hi,

If the latency stays high for a sustained period, it points to a storage performance problem, and you should check the logs on the storage array for any indication of a failure. If failures are logged on the array side, take corrective action. Contact your storage vendor for information on checking the array logs.

Also check whether these messages coincide with scheduled tasks such as backups or replications, as these can also cause intermittent performance hits.

If the message is generated because of an overload condition, reduce the load on the affected storage device.

If running a LUN replication tool, pause the task from the storage end and attempt a storage vMotion to a different datastore. This should help improve the I/O operations.

For more information on diagnosing overload and performance issues, see the KB article Using esxtop to identify storage performance issues (1008205).

ThompsG · ‎09-13-2015

Hi vmproteau,

That sounds nasty and very unpleasant!

For your information, we do use the native vCenter alarms to alert us to any potential storage issues with our VMs or hosts. We also have alarms set on our storage; however, the vCenter ones generally suffice.

Have you looked at the following two?

Host storage status
Virtual machine total disk latency

Both of these are default alarms; however, they may need further configuration to trigger when you want. For example, we have modified the Host storage status alarm so that an email is sent; otherwise, once the storage "comes back", we may never know it happened.

For the Virtual machine total disk latency alarm, the defaults are sufficient for sustained issues but not for quick load events. In this case you might want to decrease the time threshold from 5 minutes to something shorter (30 seconds is the minimum) in order to get the alarm to trigger for your situation.
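To illustrate why the time threshold matters, here is a rough sketch of a "latency above X for at least Y seconds" check. This is not how vCenter implements its alarms; the function name `sustained_breach`, the 50 ms threshold and the 20-second sample spacing are assumptions chosen only for the example.

```python
def sustained_breach(samples, threshold_ms, window_s):
    """Return the timestamp at which latency has stayed above threshold_ms
    for at least window_s seconds, or None if that never happens.

    samples: iterable of (timestamp_s, latency_ms), ascending by time.
    """
    breach_start = None
    for timestamp, latency in samples:
        if latency > threshold_ms:
            if breach_start is None:
                breach_start = timestamp  # first sample over the threshold
            if timestamp - breach_start >= window_s:
                return timestamp
        else:
            breach_start = None  # the breach ended; start over
    return None


# Hypothetical spike of under a minute, sampled every 20 seconds.
spike = [(0, 5), (20, 80), (40, 90), (60, 85), (80, 5)]

long_window = sustained_breach(spike, threshold_ms=50, window_s=300)
short_window = sustained_breach(spike, threshold_ms=50, window_s=30)
```

With the 5-minute window the spike is never reported (`long_window` is `None`), while the 30-second window catches it, which is the behaviour described above for quick load events.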

If you have already explored the above two, then apologies for taking up your time.

Kind regards.
