Multiple VMs running on HPE hosts (Gen8/Gen9) with ESXi 6.5 U3 are raising disk latency alarms, which is making a few production VMs unavailable.
Please find the alarm state below and suggest resolution steps.
(Yellow: metric above 70 ms; Red: metric above 130 ms)
Current value for the metric/state:
Disk Highest Latency = 570 ms
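For reference, a minimal sketch of how those alarm thresholds map to states (thresholds taken from the alarm definition above; the function name is just for illustration):

```python
def latency_state(latency_ms: float) -> str:
    """Classify a disk latency sample against the alarm thresholds quoted above."""
    if latency_ms > 130:  # Red threshold from the alarm definition
        return "red"
    if latency_ms > 70:   # Yellow threshold
        return "yellow"
    return "green"

print(latency_state(570))  # the reported 570 ms sample -> red
```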
For disk latency you can check esxtop, and vROps if you have it configured.
SSH to the host the VM resides on, run esxtop, and press d (disk adapter/HBA view) and u (disk device view) for the related latency.
DAVG, KAVG and GAVG are the values to look for.
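To make these numbers easier to track over time, you can capture esxtop in batch mode (for example `esxtop -b -d 5 -n 60 > stats.csv`) and scan the CSV offline. A rough sketch, assuming the latency counters carry "MilliSec" in their header names as in typical esxtop batch output (the embedded sample data is illustrative, not from this environment):

```python
import csv, io

# Illustrative sample of esxtop batch-mode output; exact counter names
# vary by build, so treat these headers as assumptions.
sample = '''"(PDH-CSV 4.0)","\\\\esx01\\Physical Disk(vmhba1)\\Average Device MilliSec/Command","\\\\esx01\\Physical Disk(vmhba1)\\Average Kernel MilliSec/Command"
"05/01 10:00:00","312.4","1.2"
"05/01 10:00:05","298.7","0.9"
'''

def high_latency_columns(text: str, threshold_ms: float = 70.0):
    """Return (header, peak) for latency columns whose peak exceeds threshold_ms."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    flagged = []
    for i, name in enumerate(header):
        if "MilliSec" not in name:  # only look at latency counters
            continue
        peak = max(float(r[i]) for r in data)
        if peak > threshold_ms:
            flagged.append((name, peak))
    return flagged

for name, peak in high_latency_columns(sample):
    print(f"{name}: peak {peak} ms")
```

This flags only the device-latency column in the sample, which is the pattern you would expect if the array or fabric, rather than the VMkernel, is the bottleneck.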
I have also seen cases where a rogue process running inside the guest OS caused all sorts of weird issues.
This is not a question we can simply answer, as there are a lot of points involved once a virtual machine tries to read or write a file. If all of your virtual machines are affected, then maybe we can start from the Guest OS layer.
First of all, which protocol are you using for storage? (iSCSI, NFS, FC, FCoE, etc.)
It might be on the storage side too; we need to verify (we also suspect the storage-connected network), and we need your suggestions there. Moreover, we are getting these alerts only from HPE BL Gen8 and a few Gen9 blades, not from UCS.
Storage protocols we are using: iSCSI and NFS.
Please provide your comments on this.
As someone already mentioned, start with esxtop when you encounter these issues and figure out at which layer the latency occurs. I have some hints/tips here: http://www.yellow-bricks.com/esxtop/
I would look at KAVG, DAVG, GAVG, and probably QUED for iSCSI. That should give you an idea of what is going on and where the latency is happening. Maybe the VMs are generating a lot of IO, maybe the hosts have too many VMs running and are driving too much IO, or maybe you are overloaded from a memory point of view and doing a lot of IO plus swapping. It could be anything without seeing/knowing more details.
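Since GAVG is approximately DAVG + KAVG, comparing the two tells you whether the time is being spent at the device (fabric/array) or inside the VMkernel (queuing). A hedged first-pass sketch; the ~25 ms DAVG and ~2 ms KAVG cutoffs are common rules of thumb, not hard limits:

```python
def interpret(davg_ms: float, kavg_ms: float) -> str:
    """Rough first-pass reading of esxtop disk latencies.
    DAVG = time at the device (fabric/array); KAVG = time in the VMkernel.
    GAVG, what the guest sees, is roughly their sum."""
    gavg = davg_ms + kavg_ms
    if davg_ms > 25:  # rule-of-thumb threshold, not a hard limit
        return f"GAVG ~{gavg:.0f} ms, dominated by DAVG: suspect storage/network path"
    if kavg_ms > 2:   # rule-of-thumb threshold
        return f"GAVG ~{gavg:.0f} ms, dominated by KAVG: suspect VMkernel queuing (check QUED/queue depth)"
    return f"GAVG ~{gavg:.0f} ms: within normal range"

print(interpret(300, 1))  # a DAVG like the one reported points at the storage path
```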
So basically you are using IP network protocols for storage access. Of course you need to check esxtop for the values mentioned above by @depping.
Take into account that you are using blade servers, so the traffic flows over the blade switches, which are the equivalent of the UCS Fabric Interconnects. Check the iLO of the enclosure to see the utilization of the internal switches and HPE Flex adapters. I also recommend checking the utilization of the vmnics on the ESXi hosts where the VMs are running.
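To put a number on vmnic utilization, you can sample the byte counters twice (for example with `esxcli network nic stats get -n vmnic0`) and convert the delta to a percentage of link speed. A minimal sketch; the figures in the example are made up:

```python
def nic_utilization_pct(bytes_delta: int, interval_s: float, link_speed_mbps: int) -> float:
    """Percent utilization of a NIC over a sampling interval.
    bytes_delta: increase in the byte counter between two samples
    (e.g. taken from `esxcli network nic stats get -n vmnic0`)."""
    bits_per_s = bytes_delta * 8 / interval_s
    return 100.0 * bits_per_s / (link_speed_mbps * 1_000_000)

# e.g. 6 GB transferred in 60 s on a 10 Gb uplink
print(round(nic_utilization_pct(6_000_000_000, 60, 10_000), 1))  # -> 8.0
```

Sustained utilization near line rate on the uplinks carrying iSCSI/NFS traffic would point at the network rather than the array.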
Which storage system are you connecting to? I would not say you are good from a disk perspective; you have some crazy latency on one of those disks. I would definitely recommend asking GSS to check the environment, but it sounds to me like it is storage related.
I assume that the VMs which are giving those events are running on the datastore with the high latency?
We are using NetApp as the storage provider. Could you please explain how you checked the latency values from my screen dumps?
Sure, we already have a case open with VMware Support; it is underway.
Can we assume the blades or chassis (HPE or UCS) could also be causing this? (I don't think so.) Also, we recently upgraded our environment to 6.5 U3 as recommended by VMware.
We can't say the upgrade is causing this problem, because all other data centers and clusters (and their VMs) are running fine; only the VMs in one or two clusters (and not all of them) are raising these alarms.
Kindly assist me on this.
Look at "DAVG". It is higher than 300, and note that this is 300 milliseconds. Typically anything higher than 10 (in a spindle world) would be alarming to me, especially when it is consistently higher than 10. You have more than 300 on DAVG. DAVG is the device latency, which usually points at either the network in between or the storage system.
Thank you so much for the inputs.
We also suspect the network on the HPE blade (chassis) switches, as well as the storage network.
Can we suspect these too? Please advise.