Since upgrading to vSAN 6.2, I am seeing a major increase in read cache misses and write latency. Also, the physical disks view shows that the flash disks have 0 used capacity. This is a hybrid production environment. Does anyone have any thoughts on this?
The Monitor tab under the cluster's Virtual SAN view used to show used capacity for the SSDs. Now it sits at zero. This may be expected behavior in the new version, but I'm not sure. It's a 4-node cluster, and I have done the on-disk format upgrade on all disk groups. Whenever throughput goes up, the latency jumps up pretty high.
I noticed it too. I still use SexiGraf to look at the vSAN stats, and in 6.1 I used to have a very high cache-hit ratio during the day. But now, in 6.2 and after the on-disk format upgrade, it is much lower (on average) and "bounces" up and down a lot (spikey graphs, so to say). It is not as stable / flat-lining as it used to be.
It's the same six nodes and the same VMs. I'm not sure what to make of it.
Good afternoon, nice tip on SexiGraf, which led me to SexiLog. Getting a bit off topic, but do you know a good repository of documentation for them? I can't seem to find how to do something basic like change the web admin password. Thank you, Zach.
Yes, that is what I was looking for. Thank you!
Have you reported this issue?
Hi Duncan,
No, I haven't. In all honesty, I have not perceived it as an issue (yet) since performance is still good. It's just something I noticed.
If you look at the "General Overview" screenshot, that is what my environment looks like since the 6.2 / on-disk format v3 upgrade. In 6.0 and 6.1, all hosts looked much more like node ESX02 does in this screenshot: all nodes were more or less flat-lining between 70 and 100%.
The only time it looked as spikey as it does now was when the nightly backups ran. Then the read caches obviously get filled with data from the spindles, pushing the static "during the day" data out. Typically, a couple of hours after the backup, the caches stabilized and things returned to "just chugging along" for the rest of the day.
Since the upgrade, it looks like backups are running all day long... The thing is, the SSDs hardly see any evictions (the second screenshot shows the SSD stats), so why is the cache-miss ratio so bad? Am I missing something?
I have noticed this in our environment also.
I changed the storage policy:
* stripe width: 2
* flash read cache reservation: 10% (on heavily used machines)
And now the read cache hit ratio is always between 90% and 100%.
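For anyone curious what that 10% flash read cache reservation actually costs, here is a back-of-the-envelope sketch. The object sizes below are hypothetical examples; the point is that the reservation is taken per object against its logical size, so flash reserved this way is no longer available to the shared read cache pool:

```python
# Rough estimate of flash capacity reserved by a read cache reservation
# policy. Object sizes are hypothetical examples, not from the thread.

def reserved_cache_gb(object_size_gb: float, reservation_pct: float) -> float:
    """Flash capacity reserved for read caching of one vSAN object."""
    return object_size_gb * reservation_pct / 100.0

for size_gb in (100, 500, 1000):  # example VMDK sizes in GB
    gb = reserved_cache_gb(size_gb, 10)
    print(f"{size_gb} GB object -> {gb:.0f} GB of flash reserved")
```

Worth keeping in mind before applying it cluster-wide: a 10% reservation on many large VMDKs can pin a lot of flash to VMs that may not need it.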
Just one thing to point out: as of 6.2 there is also a small in-memory read cache on the host where the VM resides (the client cache), so it could be that this is why the "Read Cache" view is reporting a lower ratio while you are still seeing good performance. The current performance views do not show this cache layer yet; this is being worked on. The question here is whether you are seeing much lower IOPS and higher latency than normal.
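To give a feel for how big that client cache is: my understanding is that in 6.2 it is sized at 0.4% of host memory, capped at 1 GB per host — treat those numbers as an assumption and verify them against the official docs for your build. A quick sketch:

```python
# Estimate of the vSAN 6.2 in-memory client (read) cache per host.
# Assumed sizing rule: 0.4% of host RAM, capped at 1 GB per host --
# verify these figures against VMware's documentation.

def client_cache_gb(host_ram_gb: float, pct: float = 0.4, cap_gb: float = 1.0) -> float:
    """Approximate client cache size for a host with the given RAM."""
    return min(host_ram_gb * pct / 100.0, cap_gb)

for ram_gb in (128, 256, 512):  # example host memory sizes in GB
    print(f"{ram_gb} GB host -> ~{client_cache_gb(ram_gb):.2f} GB client cache")
```

Even at the cap it is small compared to the SSD read cache, but because it absorbs the hottest reads before they ever reach the disk group, it can noticeably lower the hit ratio the SSD-level stats report.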