We have 15 LUNs, all shared to 7 hosts. We noticed that only one host has poor disk performance (high datastore write latency). That host is very lightly loaded (no VMs on it), while the others carry heavier loads. Yet only this host has the problem: it sees 30-40 ms write latency to those LUNs while, at the same time, the other hosts see no latency issue. I can't figure out why only this host has a disk performance issue on the same LUNs.
A couple of points to check to help isolate the source of the latency:
ESXi host to LUN paths - how many paths? Are they distributed equally?
ESXi host - which path selection policy is configured for this LUN? RR, MRU, or Fixed?
Which policy is configured for the other LUNs that are working well (on the same host)?
ESXi host to LUN - we have multiple points like VMkernel - HBA card - FC switch - Storage Processor - Disk
If you suspect the first half of the sequence, check the driver version and monitor esxtop output to see the actual delay.
Is it happening all day, or only during specific periods?
Who else can access this LUN? Backup software? If so, how are the other hosts configured to avoid such latency?
Hope this analysis helps....
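To make the "first half of the sequence" check a bit more concrete: esxtop reports per-device latency as DAVG (time spent below the VMkernel: driver, adapter, switch, storage), KAVG (time spent inside the VMkernel, including queuing) and GAVG (what the guest sees, roughly DAVG + KAVG). Below is a rough Python sketch of how two readings could point at one half of the chain. The function name and the default thresholds (2 ms for KAVG, 20 ms for DAVG) are assumptions based on commonly quoted rules of thumb, not official VMware limits.

```python
def suspect_segment(davg_ms: float, kavg_ms: float,
                    kavg_limit: float = 2.0, davg_limit: float = 20.0) -> str:
    """Guess which part of the I/O chain to look at first.

    Thresholds are illustrative rules of thumb (assumed), not vendor limits.
    """
    kernel_slow = kavg_ms > kavg_limit   # time spent inside the VMkernel
    device_slow = davg_ms > davg_limit   # time spent below the VMkernel
    if kernel_slow and device_slow:
        return "both"
    if kernel_slow:
        return "kernel (VMkernel, queuing)"
    if device_slow:
        return "device path (driver, adapter, switch, storage)"
    return "neither"


if __name__ == "__main__":
    # Example: high device latency, low kernel latency.
    print(suspect_segment(davg_ms=35.0, kavg_ms=0.4))
```

If DAVG is high only on the one host while the same LUNs look fine from the others, that would suggest looking at that host's adapter, cabling or switch port rather than the array itself.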
Thank you for the reply!
ESXi host to LUN paths - how many paths? Are they distributed equally?
>>> 8 paths in total (4 active and 4 for redundancy)
ESXi host - which path selection policy is configured for this LUN? RR, MRU, or Fixed?
Which policy is configured for the other LUNs that are working well (on the same host)?
>>> All the LUNs in our environment are configured for RR. (It is not one particular datastore that has the latency issue on this host; it happens randomly across many datastores.)
ESXi host to LUN - we have multiple points like VMkernel - HBA card - FC switch - Storage Processor - Disk
>>> VMkernel - 10Gb Network Card - Switch - SVC - Storage
If you suspect the first half of the sequence, check the driver version and monitor esxtop output to see the actual delay.
>>> I have opened an SR with VMware, and VMware recommends that I upgrade the NIC driver and firmware to the latest version.
However, I don't think this is the root cause, since we have 14 hosts with exactly the same configuration but only one host has this issue.
It is very hard to catch in esxtop. As the picture in the first post shows, the latency spikes at one point and then drops back to normal.
Whenever I watch esxtop, the issue just doesn't happen.
Is it happening all day, or only during specific periods?
>>> All day, but I don't see any pattern; it looks random.
Who else can access this LUN? Backup software? If so, how are the other hosts configured to avoid such latency?
>>> No, there is no backup software. All the hosts in the same cluster can access the LUNs, but only one host has this issue.
Is there any other clue?
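Since the spikes never show up while esxtop is being watched live, one option is to leave esxtop running in batch mode (for example `esxtop -b -d 5 -n 720 > capture.csv`) and scan the capture afterwards. Below is a minimal Python sketch of such a scan. The function name, the 30 ms threshold, and the column-name substring `MilliSec/Write` are assumptions; check the header row of your own capture and adjust the substring to match.

```python
import csv
import io


def find_spikes(csv_text, counter="MilliSec/Write", threshold_ms=30.0):
    """Return (timestamp, column_name, value) for each sample above threshold_ms.

    Assumes the first column holds the sample timestamp and that latency
    columns can be recognised by the `counter` substring in their header.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return []
    # Keep only the columns whose header mentions the counter of interest.
    cols = [i for i, name in enumerate(header) if i > 0 and counter in name]
    spikes = []
    for row in reader:
        for i in cols:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip blank or malformed cells
            if value > threshold_ms:
                spikes.append((row[0], header[i], value))
    return spikes


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], newline="") as f:
        for when, column, value in find_spikes(f.read()):
            print(when, column, value)
```

Run against captures from the affected host and from a healthy one, this would at least show whether the spikes line up in time across datastores and whether the other hosts really see nothing at those moments.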
What kind of storage environment do you use? I would check the HCL to see whether RR is supported with your storage.
To brunofernandez1: Thank you. I've checked with our storage vendor, and RR is supported on our storage.
I've also noticed that the host with the latency issue is the master host in the HA-enabled cluster.
Could the host's master role in the HA-enabled cluster be the cause?
Status update: This issue has been processed by the VMware development team.