Solved: Re: Read Cache Hit Rate below 90%

szafa · ‎10-19-2020

Hi All

I have 2 identical VxRail clusters and just one (with less workload) I observer low Read Cache Hit Rate one many disk groups (one DG looks worst that others).

CONFIGURATION:

=========================================================

4x VxRail nodes P570 Hybrid ver 4.5.400-14097254

External vCenter 6.5 build 14690228

vSAN config

Hybrid 4 disk groups (each 1x800GB cache + 5x1.2GB magnetic disks)

=========================================================

ISSUE: LOW Read Cache Hit Rate

The vSAN health all green expect vSAN disk balance (same other cluster) => ongoing issue VMware recommends to wait and it should balance itself but needs time

The other cluster has same configuration, type of workload (even more VM's) and is not experiencing this issue.

How I can identify what is causing this problem? If this is purely related to workload access patter (loads random reads) which I doubt (other cluster is OK) how I can identify those VM's and what I can do to improve reads from cache (example add more strips). VMware didn't give me any answers so far just keep recommending upgrade which is not an option at the moment at least until will get proper RCA.

Thanks

TheBobkin · ‎10-19-2020

Hello szafa

So, you use the word 'issue' and 'problem' multiple times here - do you actually have a perceivable performance issue in this cluster or are you just getting tired of the vROps alarm?

I ask this very basic question as I have seen this vROps alert triggered for dozens of clusters where the VMs are working away happily and where the cluster hardware is perfectly adequate for the workload - if the answer to the above is the former, then consider improving the Cache:Capacity ratio of the Disk-Groups, if the answer is just the latter then disable this alarm.

To explain the basics of how a read-cache works in a vSAN Hybrid cluster, here is an analogy:

You have 2 bookshelves - one is relatively small but nice and close to where you read every evening it can hold 10 books, the other can hold 100 books but is on the other side of your house.

You have a good book (book1) that you are reading bits of every evening - you place it in your small but close shelf - , you then get another book (book2) that you find yourself frequently referencing and store it in the small but close shelf also.

Time goes by, you have now amassed 12 books on this topic and thus cannot fit them all in the small but close bookshelf - the original books (book1 and book2) are no longer the books you find yourself reading every night, only referencing them infrequently (or at least less frequently then the others), so you store them in your larger but farther away bookshelf.

In case it was not abundantly obvious, the small bookshelf is the read-cache portion of the Cache-tier SSD of a Disk-Group, the large bookshelf is the cumulative storage of the Capacity-tier HDDs of the same Disk-Group and the 'books' are the data stored on this Disk-Group.

The read-cache of a Disk-Group (~70% of the SSD size) is only so large and thus at any one time it can only contain a subset of all of the data in the Disk-Group - sure you *could* size the Disk-Groups so that the read-cache was the same size as cumulative capacity of the Capacity-tier but for most purposes this would be poor design and wasteful.

Thus on a read IO if the data is not found in the read-cache, it has to access this from the slower HDD - this is a read-cache 'miss' - do not confuse or compare this with anything else such dropped IOs or packets, it's not like it didn't find the data or failed to do it's job, it just had to travel to the further away bookshelf and it had to do this because at some point it INTENTIONALLY moved it out of it's near bookshelf as some other data on that Disk-Group was being read and/or accessed more often (these are called 'evictions').

More information on this topic can be found here:

VMware Knowledge Base

https://blogs.vmware.com/virtualblocks/2019/04/18/vsan-disk-groups/

Bob

View solution in original post

TheBobkin · ‎10-19-2020