Yves_
Contributor

Intel Optane 900p as Caching Tier

Reaching out to all the professionals, since I am lost 😞

For several weeks I have been trying to find out what's wrong with my vSAN homelab setup, but I can't seem to find the answer. And yes, I already know the Intel Optane 900p is not on the HCL.

Here is my vSAN homelab setup:

3x Intel R1208GZ4GC 1U Server

Each node has the following config:

- Dual Intel E5-2630L v2

- 128GB RAM

- 1x Intel Optane 900p 280GB PCIe with the Intel NVMe vib (intel-nvme-1.3.2.8-1OEM.650.0.0.4598673.x86_64.vib); see the quick driver checks after this list

- 3x Intel DC S4600 480GB SSDs

- JBOD mode on the RMS25PB080, so the disks are passed directly through to ESXi

- VMware ESXi 6.7 (Build 8169922) with vCenter 6.7 and vSAN 6.7

- Intel Mezzanine Card with Dual 10GBit
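In case anyone wants to double-check the same things, this is roughly how I verify the driver from the ESXi shell (the /tmp path is just an example; the vib name is the one above):

esxcli software vib list | grep -i nvme        # confirm the intel-nvme vib is actually installed
esxcli storage core adapter list               # shows which driver each adapter is bound to
esxcli software vib install -v /tmp/intel-nvme-1.3.2.8-1OEM.650.0.0.4598673.x86_64.vib    # (re)install, then reboot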

Every node is connected to a Cisco SG550XG-24F with two 10GbE SFP+ cables. vSAN traffic is in its own VLAN, and everything runs 9k jumbo frames.
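For what it's worth, I verified the 9k MTU end-to-end on the vSAN VLAN with vmkping from each host (vmk1 and the peer address are just examples from my setup):

vmkping -I vmk1 -d -s 8972 <peer-vsan-ip>    # -d = don't fragment; 8972 = 9000 minus 28 bytes of IP/ICMP headers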

Right now, if I run an HCIBench easy run on this setup (only a 2-node setup), I get a very, very sad:

VMs         = 4

IOPS        = 37292.28 IO/s

THROUGHPUT  = 145.68 MB/s

LATENCY     = 3.4093 ms

R_LATENCY   = 3.6230 ms

W_LATENCY   = 2.9117 ms

95%tile_LAT = 5.4063 ms

=============================

Resource Usage:

CPU USAGE  = 59.89%

RAM USAGE  = 24.36%

VSAN PCPU USAGE = 19.0127%

=============================

If I take the NVMe out of the vSAN setup and configure it like this: 1x Intel DC S4600 as caching tier, 2x Intel DC S4600 as capacity tier (only a 2-node setup), I get:

VMs         = 4

IOPS        = 63699.04 IO/s

THROUGHPUT  = 248.82 MB/s

LATENCY     = 1.9798 ms

R_LATENCY   = 1.7205 ms

W_LATENCY   = 2.5840 ms

95%tile_LAT = 5.4800 ms

=============================

Resource Usage:

CPU USAGE  = 80.68%

RAM USAGE  = 32.09%

VSAN PCPU USAGE = 32.4542%

=============================

How is this possible?

Can I debug where the problem is? How can normal DC SSDs be twice as fast as an NVMe that can reach over 250k IOPS random read/write?

Thanks so much for your input. Looking forward to finding a solution.

Cheers,

Yves

TheBobkin
Champion

Hello Yves,

Welcome to Communities.

"How can normal DC SSDs be twice as fast as a NVME"

Unsupported hardware has a huge YMMV-factor so I don't find it shocking that a 'theoretically' faster device appears to be lagging behind here.

For one, the cache:capacity ratio is far better when using the DC S4600 (bigger + less capacity per DG).

Drivers seem to be far more variable with NVMe than with other devices; obviously there is no vSAN-recommended driver for the 900p, but it may be worth testing different driver versions.

"Can I debug where the problem is?"

Start by getting a better, more detailed comparison of *why* each setup achieves the results it does (not just the results themselves) - identify where the bottlenecks reside.

vSAN Observer and the built-in performance charts in the Web Client, watched while the benchmark tests are running, would be a good place to start.
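If you have not run Observer before: it is started from RVC on the vCenter Server roughly like this (adjust the credentials and the inventory path to your datacenter/cluster; the web UI listens on port 8010 by default):

rvc administrator@vsphere.local@localhost
vsan.observer /localhost/<Datacenter>/computers/<Cluster> --run-webserver --force

Then browse to https://<vCenter>:8010 while the benchmark is running.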


Bob

Yves_
Contributor

Hi Bob,

First of all, thanks a lot for your reply. I was already afraid that no one would even try to help, since the matter involves non-HCL hardware.

Unsupported hardware has a huge YMMV-factor so I don't find it shocking that a 'theoretically' faster device appears to be lagging behind here.

Yeah, this is also what I was kind of afraid of, since I know it's unsupported hardware. But I saw several VMware bloggers (Tai Ratcliff, among others) using Samsung NVMe as caching tier, so I thought, hey, why not give it a try with an "even faster" NVMe. I ran HCIBench on the 900p directly as a datastore and got over 250k IOPS on an easy run. So the NVMe itself seems fine, but something in vSAN does not like this NVMe.

For one, the cache:capacity ratio is far better when using the DC S4600 (bigger + less capacity per DG).

From what I've read, the vSAN caching tier should be sized at about 10% of the capacity tier, which in my case is 1.4TB. So the 280GB should be OK, I guess?

Start by getting a better, more detailed comparison of *why* each setup achieves the results it does (not just the results themselves) - identify where the bottlenecks reside.

Okay, I have already worked with vSAN Observer, but I don't know what specifically I am looking for. In my newest tests I built a single-node vSAN cluster (1x 900p cache tier, 3x DC S4600 capacity tier), everything green, and ran the VMware I/O Analyzer against the vSAN at 100% read / 100% random 4k; the results were 98% similar to the plain nvmeDatastore. BUT!!! Running 0% read / 100% random 4k on the vSAN gives me about 10% of the same run on the plain nvmeDatastore.

Looking closer at the results:

iops_nvme.jpg

iops_vsan.jpg

You see that LAT/wr is almost 20 times higher than on the direct nvmeDatastore. Is there a way to find out why this write latency is so much higher?

Also, in the esxtop disk statistics I see that the direct nvmeDatastore runs with the number of active commands from the VMkernel at 6.67, and does no reads, or almost no reads at all, as you can see here:

esxtop_nvme.jpg

And on the vSAN NVMe it seems that the number of active commands from the VMkernel is always at 1.00, and for some reason reads are also happening:

esxtop_vsan.jpg
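(In case someone wants to reproduce these views: they are from esxtop, roughly as follows, watched during the I/O Analyzer runs.)

esxtop    # then press:
u         # switch to the disk-device view
f         # toggle field groups: QSTATS shows ACTV (active commands), the LATSTATS groups show DAVG/KAVG/GAVG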

Where else do I have to look to find out more? I want to find the reason why this is happening only with writes... reads are very good as far as I tested.

I attached the full stats with everything readable and opened.

Thanks again for all your help.

Cheers,

Yves

TheBobkin
Champion

Hello Yves,

"You see that LAT/wr is almost 20 times higher than on the direct nvmeDatastore. Is there a way to find out why this write latency is soooo much higher?"

This is akin to comparing apples to oranges - VMFS on an NVMe is purely local: no network, no controller, no redundancy and thus a single write per IO. A more like-for-like comparison (one that still wouldn't account for network latency) would be to create a VMFS datastore extended (or RAIDed) across the capacity-tier devices attached to the controller.

"so I thought hey what not give it also a try with a "even faster" NVME"

Faster doesn't always mean more reliable or better suited to a purpose as I am sure you are aware. It would be interesting to see if you get better performance with different drivers though.

"vSAN caching tier should be sized for using about 10% of the storage capacity tier which in my case is 1.4TB. So the 280GB should be OK, I guess?"

Cache-tier sizing in AF isn't dependent on the capacity-tier size but on the workload characteristics of the cluster (e.g. smaller caches won't perform as well for heavy 100% write operations):

https://blogs.vmware.com/virtualblocks/2017/01/18/designing-vsan-disk-groups-cache-ratio-revisited/

"I already worked with the vSAN Observer. But I don't know what especially I am looking for."

In general: red boxes and/or anything that looks like it could be a bottleneck. It's an old guide, but it should help in determining what Observer is saying:

https://blogs.vmware.com/vsphere/files/2014/08/Monitoring-with-VSAN-Observer-v1.2.pdf

*Unlikely* to be the main issue here, but the controller could also be problematic - these controllers were only ever supported in RAID0 (in vSAN 5.5 Hybrid 🙂). What mode/personality do you have this controller/disks configured as? Do you have the 1GB cache set to 100% read, or disabled? Do you see any evidence in the logs of the controller driver/firmware having issues?
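A few places to look for that (the megaraid driver name is my guess for this LSI-based Intel controller; yours may differ):

esxcli storage core adapter list          # which driver the RMS25PB080 is bound to
esxcli storage core device list           # check the disks show up as expected (local, pass-through)
grep -i megaraid /var/log/vmkernel.log    # any driver/firmware complaints during the test runs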

Bob

Yves_
Contributor

Hello again,

This is akin to comparing apples to oranges - VMFS on an NVMe is purely local: no network, no controller, no redundancy and thus a single write per IO. A more like-for-like comparison (one that still wouldn't account for network latency) would be to create a VMFS datastore extended (or RAIDed) across the capacity-tier devices attached to the controller.

True, true 😉 I sometimes forget how vSAN really works... Just to make sure my understanding is correct: shouldn't small 4k load tests still stay in the write buffer of the caching tier? I let these little tests run for only 120 seconds, not for hours... It somehow does not feel like this should hit the capacity tier directly, but like I said... I am not a vSAN engineer, so I don't really see behind the black box.

Faster doesn't always mean more reliable or better suited to a purpose as I am sure you are aware. It would be interesting to see if you get better performance with different drivers though.

Fair enough 🙂 but somehow I feel there is something very wrong... since reads seem very NVMe-accelerated and writes seem worse than the SSDs. I also have some other NVMe drives lying around: a 960 Pro, an SM951 and some Plextor M9p (or so I think). What did you mean about different drivers? I am open to everything; it's purely a lab, so I can't destroy anything 🙂

Cache-tier sizing in AF isn't dependent on the capacity-tier size but on the workload characteristics of the cluster (e.g. smaller caches won't perform as well for heavy 100% write operations):

Okay, that somehow passed me by 🙂 But never mind, it's a lab, so I guess I will never run more than 100-150 GB at a time. And if I did... I guess I would deserve somewhat lower performance, since the cache tier would be full... but the capacity tier is still OK-ish fast...

In general: red boxes and/or anything that looks like it could be a bottleneck. It's an old guide, but it should help in determining what Observer is saying

My reading for tonight 🙂 Thank you.

*Unlikely* to be the main issue here, but the controller could also be problematic - these controllers were only ever supported in RAID0 (in vSAN 5.5 Hybrid). What mode/personality do you have this controller/disks configured as? Do you have the 1GB cache set to 100% read, or disabled? Do you see any evidence in the logs of the controller driver/firmware having issues?

That actually was my first guess, since I knew the RMS controller I use is no longer on the HCL. I use it in JBOD mode only, so no cache at all; the disks are passed directly through to ESXi. Checking all the logs, I don't see anything bad about the controller. Also, if the controller were that big of an issue, why would it perform even better in controller-only mode (without an NVMe: 1x SSD as cache, 2x SSDs as capacity)!? If you think this really is the problem, I can grab some RMS controllers which are on the HCL, but that will run me about $500 for my 3 nodes...

Cheers,

Yves

TheBobkin
Champion

Hello Yves,

"What did you mean about different drivers?"

I mean different nvme/intel-nvme drivers for the NVMe devices - as none of these are supported or intended for vSAN with these devices, this will of course be YMMV, but potentially some will work better than others.
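As a sketch of what I mean (the vib name/path here is a placeholder for whatever version you test; a reboot is needed after each swap):

esxcli software vib list | grep -i nvme                              # what is installed now
esxcli software vib remove -n intel-nvme                             # fall back to the inbox nvme driver
esxcli software vib install -v /tmp/<other-intel-nvme-version>.vib   # or try another version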

I thought of one other thing that may *potentially* help here and is at least worth trying. I will PM it to you, as it is not something to be blanket-advised without much deeper analysis, and I do not want others to read it, blindly apply it, and then encounter issues.

Bob

Yves_
Contributor

Hi Bob,

Thanks again for the big help and the time you are putting in here. It's really appreciated! I was totally lost and had spent hours and hours trying to figure out where I went wrong.

I mean different nvme/intel-nvme drivers for the NVMe devices - as none of these are supported or intended for vSAN with these devices, this will of course be YMMV, but potentially some will work better than others.

Well, that would also be worth a shot; I will do that after the other thing you wrote me. I am just recreating the vSAN now and will give your idea a spin 🙂 I'll keep you posted.

And again BIG THANK YOU!

Cheers,

Yves

Yves_
Contributor

Hello everyone,

Trying to shed some light into the vSAN black box: let's say the 900p really is incompatible (which really looks that way right now), and I build a lab vSAN with just my Intel DC S4600s (which are not that bad either).

What are realistic numbers, so I can tell whether vSAN works properly or not?

Also, is it better to have one big disk group on each node or multiple disk groups per host? (For example, I could do 3 nodes with 1 cache and 4 capacity, or 3 nodes with 2 cache and 2 capacity.)

Thanks again for all the help

Cheers,

Yves
