I'll provide a lot of pictures for this topic, but essentially we have one host that sees higher latency than the others, and I'm trying to figure out why. I think I've narrowed it down to one VM that is getting a lot of read cache misses.
In the above picture you can see that it isn't "bad," but it is definitely worse than the other hosts. (This is the VSAN Client view.)
This above view doesn't really show any issues at all. Now for the VM I "think" could be making the host look bad:
The above image is just one of the vmdks on the VM; this particular one is only taking Microsoft VSS snapshots of some CIFS shares. Below is the image for the data stored in the CIFS share:
I've circled the RC Hit Rate for this vmdk, which seems quite poor compared to all of the other VMs in our environment.
The one thing that sets this VM apart from the others is that I have specified a "Read Cache Reservation" of 2% = 6.3GB (315GB VMDK). So here are my main questions:
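As a quick sanity check, the reservation size mentioned above follows directly from the percentage. A minimal sketch using only the figures quoted in this post:

```python
# Sanity check of the read cache reservation quoted above:
# a 2% reservation on a 315GB VMDK.
vmdk_gb = 315
reservation_pct = 0.02

reservation_gb = vmdk_gb * reservation_pct
print(round(reservation_gb, 1))  # 6.3
```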
My main reason for asking #2 is that I'm seeing pretty high write latencies for this VM. It has 100% of its space reserved, and since all writes are supposed to hit an SSD first, it seems like write latency should be pretty low. Here is an image of the latencies:
It may be worth noting this VMDK is used by a 2012 server running deduplication.
A quick question, What SSD's are you using in each disk group?
Just some thoughts - they might help, they may not. I guess the bottom line is that you wish to reduce the latency for that VM, correct?
Do you know anything about the workload in that VM? Does it involve prolonged periods of sequential reads or writes?
Along with the lower RC hit rate, does this VM also show more evictions than other VMs?
The fact that this VM is having read cache misses implies that it has to go to the spinning/magnetic disk layer to retrieve the data blocks. This will of course increase latency.
If you are also seeing evictions on this VM more than others, it "could" mean that the write buffer is filling up and thus blocks need to be evicted to make room for new writes. This will also increase latency.
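To make the effect of the hit rate concrete, average read latency is roughly a hit-rate-weighted blend of flash and magnetic latency. A minimal sketch, assuming illustrative latencies of 0.2 ms for a flash read and 8 ms for a magnetic read (these are assumptions, not measurements from this environment):

```python
# Hedged sketch: how read cache hit rate drives average read latency.
# The latency figures are illustrative assumptions only.
ssd_read_ms = 0.2   # assumed flash read latency
hdd_read_ms = 8.0   # assumed magnetic disk read latency

def avg_read_latency_ms(hit_rate):
    """Expected read latency for a given read cache hit rate (0.0 to 1.0)."""
    return hit_rate * ssd_read_ms + (1.0 - hit_rate) * hdd_read_ms

# A VM with a 98% hit rate vs one with an 80% hit rate:
print(round(avg_read_latency_ms(0.98), 2))  # 0.36
print(round(avg_read_latency_ms(0.80), 2))  # 1.76
```

Even a modest drop in hit rate multiplies average latency, which is why a single VM with poor RC hit rate can make the whole host look bad.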
A couple of things:
- the read cache reservation could be relevant. Without a reservation, all VMs will share the cache. With a reservation, you are giving this VM a chunk of cache for its own use. Try removing that reservation and see if things improve...
- what is the magnetic disk configuration on the hosts? Are you using a stripe width, or is it a single spindle (no striping). I'm wondering if a stripe width would improve the latency value in the case of read cache misses and cache evictions. Can you try that too?
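On the stripe-width point, the intuition is that reads which miss cache fan out across the stripes, so each spindle sees only a fraction of the miss traffic. A rough sketch, where both IOPS figures are hypothetical numbers chosen for illustration:

```python
# Hedged sketch: why a stripe width > 1 can help on read cache misses.
# Both figures below are assumptions for illustration only.
miss_iops = 300          # assumed cache-miss read IOPS reaching magnetic disk
per_disk_iops_cap = 150  # assumed random-read IOPS one spindle can sustain

for stripe_width in (1, 2, 3):
    per_disk = miss_iops / stripe_width  # misses spread across the stripes
    saturated = per_disk > per_disk_iops_cap
    print(stripe_width, per_disk, saturated)
```

Under these assumed numbers, a single spindle would be saturated while a stripe width of 2 or more keeps each disk within its budget.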
HTH
Cormac
Thanks for all the replies, guys. Here are the answers I have for the moment:
SomethingStrange wrote:
A quick question, What SSD's are you using in each disk group?
We have Intel S3700 400GB SSDs in each disk group.
Cormac:
Thanks again!