VPXA
Contributor

Disk latency alarms on multiple VMs

 

Multiple VMs running on HPE hosts (Gen8 / Gen9) on 6.5 U3 are raising disk latency alarms, and as a result a few production VMs have become unavailable.

Please find the alarm state below and suggest resolution steps.

 

Alarm Definition:
([Yellow Metric Is above 70ms; Red Metric Is above 130ms])
 
Current values for metric/state:
 Metric Disk Highest latency = 570ms
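
To put the reported value in context, here is a minimal sketch that checks a reading against the two thresholds from the alarm definition. The function name `alarm_state` and the "green" label for the no-alarm case are illustrative assumptions, not anything vCenter exposes.

```python
# Thresholds copied from the alarm definition quoted above (milliseconds).
YELLOW_MS = 70
RED_MS = 130


def alarm_state(latency_ms: float) -> str:
    """Return the alarm colour for a disk latency reading in milliseconds."""
    if latency_ms > RED_MS:
        return "red"
    if latency_ms > YELLOW_MS:
        return "yellow"
    return "green"  # hypothetical label for "no alarm"


# 570 ms is the "Disk Highest latency" value from the alarm.
print(alarm_state(570))
```

The reported 570 ms is more than four times the red threshold of 130 ms.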

13 Replies
harry89
Enthusiast

Hey,

You can check disk latency with esxtop, and with vROps if you have it configured.

SSH to the host where the VMs reside, run esxtop, and press d for the disk adapter (HBA) view or u for the disk device view.

DAVG, KAVG, and GAVG are the values to look for.

I have also seen some cases where a rogue process running within the guest OS caused all sorts of weird issues.
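
If watching esxtop interactively is not practical, it can also run in batch mode (for example `esxtop -b -d 2 -n 30 > out.csv`) and the CSV can be analyzed offline. The sketch below pulls the peak DAVG/KAVG/GAVG per object out of such a file. The counter names in `COUNTERS` are my assumption of how the batch headers are labelled, and the host name, adapter name, timestamps, and values in `SAMPLE` are made up for illustration, so check them against your own output.

```python
import csv
import io

# Assumed counter-name suffixes in esxtop batch-mode CSV headers.
COUNTERS = {
    "Average Driver MilliSec/Command": "DAVG",
    "Average Kernel MilliSec/Command": "KAVG",
    "Average Guest MilliSec/Command": "GAVG",
}


def peak_latencies(csv_text: str) -> dict:
    """Map (object, counter) to the highest value seen across all samples."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = {}
    for idx, name in enumerate(header):
        for suffix, short in COUNTERS.items():
            if name.endswith(suffix):
                # Header fields look like \\host\Group(object)\Counter
                obj = name.split("\\")[-2]
                wanted[idx] = (obj, short)
    peaks = {}
    for row in reader:
        for idx, key in wanted.items():
            try:
                value = float(row[idx])
            except (ValueError, IndexError):
                continue  # skip blank or malformed cells
            peaks[key] = max(peaks.get(key, 0.0), value)
    return peaks


# Hypothetical two-sample capture; names and numbers are illustrative only.
OBJ = r"\\esx01\Physical Disk Adapter(vmhba64)"
SAMPLE = "\n".join([
    ",".join(f'"{c}"' for c in [
        "(PDH-CSV 4.0) (UTC)(0)",
        OBJ + "\\Average Driver MilliSec/Command",
        OBJ + "\\Average Kernel MilliSec/Command",
    ]),
    '"03/01/2021 10:00:00","312.5","0.4"',
    '"03/01/2021 10:00:02","298.1","0.6"',
])

for key, value in sorted(peak_latencies(SAMPLE).items()):
    print(key, value)
```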

Harry
VCIX-DCV6.5 ,VCIX-NV6 , VCAP-CMA7
Mark answer as correct/helpful if it solves your query
Lalegre
Virtuoso

Hey @VPXA,

This is not a question we can simply answer, as many components are involved whenever a virtual machine reads or writes a file. If all of your virtual machines are affected, then we can probably rule out the guest OS layer.

First of all, which storage protocol are you using (iSCSI, NFS, FC, FCoE, etc.)?

VPXA
Contributor

 

It might be on the storage side too; we still need to verify that (we also suspect the network the storage is connected to), and this is where we need your suggestions. Moreover, we are getting these alerts only from the HPE BL Gen8 blades and a few Gen9s, not from UCS.

Storage protocols we are using: iSCSI and NFS.

Please provide your comments on this.

depping
Leadership

As someone already mentioned, start with esxtop when you encounter these issues and figure out at which latency the issue occurs. I have some hints/tips here: http://www.yellow-bricks.com/esxtop/

I would look at KAVG, DAVG, GAVG, and probably QUED for iSCSI. That should give you an idea what is going on, and where the latency is happening. Maybe the VMs are generating a lot of IO, maybe the hosts have too many VMs running, driving too much IO, maybe you are overloaded from a memory point of view and you are doing a lot of IO + Swapping. It could be anything without seeing / knowing more details. 
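
As a rough illustration of how those counters narrow things down: GAVG is approximately DAVG + KAVG, so comparing the two parts shows whether the time is spent at the device side or in the host kernel. This is only a sketch. The function name and the default limits (10 ms for DAVG, 2 ms for KAVG) are rule-of-thumb assumptions, not official thresholds.

```python
def locate_latency(davg_ms, kavg_ms, davg_limit_ms=10.0, kavg_limit_ms=2.0):
    """Return (approximate GAVG, list of layers that look suspicious)."""
    gavg_ms = davg_ms + kavg_ms  # GAVG is roughly DAVG + KAVG
    suspects = []
    if davg_ms > davg_limit_ms:
        # Time spent below the host: storage network or the array itself.
        suspects.append("device (storage network or array)")
    if kavg_ms > kavg_limit_ms:
        # Time spent inside the VMkernel, often a sign of queuing.
        suspects.append("kernel (host-side queuing)")
    return gavg_ms, suspects


# Illustrative numbers only.
print(locate_latency(davg_ms=25.0, kavg_ms=0.5))
```

If both lists come back empty while the guest still reports high latency, the guest OS itself is the next place to look.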

Lalegre
Virtuoso

Hey,

So basically you are using IP-based protocols to access the storage. Of course, you need to check esxtop for the values mentioned above by @depping.

Take into account that you are using blade servers, so the traffic flows over the blade switches, which are the equivalent of the UCS Fabric Interconnect. Check the iLO from the enclosure to see the utilization of the internal switches and HPE Flex adapters. I also recommend checking the utilization of the vmnics on the ESXi hosts where the VMs are running.

VPXA
Contributor

Hi Lalegre / Harry,

 

It seems we are good on disk performance. Please have a look at the screenshots below and share your suggestions; we also have a request open with VMware support.

 

VPXA_0-1614620927034.png

 

VPXA_1-1614620955861.png

 

depping
Leadership

Which storage system are you connecting to? I would not say you are good from a disk perspective, you have some crazy latency on one of those disks. Definitely would recommend asking GSS to check the environment, but it sounds to me that it is storage related.

I assume that the VMs which are giving those events are running on the datastore with the high latency?

VPXA
Contributor

We are using NetApp as the storage provider. Could you please explain how you read the latency values from my screenshots?

Sure, we already have a case open with VMware Support, and it is underway.

Could the blades or chassis (HPE or UCS) cause this too (I don't think so), or could it be the recent upgrade of our environment to 6.5 U3, which VMware recommended?

We can't say the upgrade is causing this problem, because all other data centers and clusters are running fine; only VMs in one or two clusters (and not all VMs there) are raising these alarms.

Kindly assist me on this.

 

 

 

depping
Leadership

Look at "DAVG". It is higher than 300. Note that this is 300 milliseconds. Typically, anything higher than 10 (in a spindle world) would be alarming to me, especially when it is consistently higher than 10. You have more than 300 on DAVG. DAVG is the device latency, so high values usually point to either the network in between or the storage system.
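
To make "consistently higher than 10" a bit more concrete, here is a small sketch that flags a series of DAVG samples only when several consecutive readings exceed the limit, so a single spike is ignored. The function name, the `min_run` parameter, and the sample values are illustrative assumptions.

```python
def consistently_above(samples_ms, limit_ms=10.0, min_run=3):
    """True if at least min_run consecutive samples exceed limit_ms."""
    run = 0
    for value in samples_ms:
        run = run + 1 if value > limit_ms else 0  # reset on any good sample
        if run >= min_run:
            return True
    return False


# Hypothetical DAVG readings in milliseconds.
print(consistently_above([4, 320, 305, 311, 8]))  # sustained high latency
print(consistently_above([4, 15, 6, 12, 5]))      # isolated spikes only
```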

VPXA
Contributor

Thank you so much for the inputs.

We also suspect the network in between: the HPE blade (chassis) switches as well as the storage network.

Can we suspect these too? Please advise.

Regards,

 

 

depping
Leadership

Yes, that could indeed be an issue.

VPXA
Contributor

Thanks for all your help in understanding this issue. The host latency now looks like this:

VPXA_1-1614760637634.png

 

VPXA_2-1614760654710.png

 

Please provide your assistance here.

depping
Leadership

Not sure what assistance you need, as latency is now relatively low.
