VMware Cloud Community
joergriether
Hot Shot

Networking questions

Hi,

The VSAN documents state that you can use NIC teaming in vSphere for VSAN-enabled VMkernel ports, with a focus on high availability and redundancy but not on aggregation and bandwidth increase. Now my question would be: what about configuring two VMkernel ports per host (each connected to one 10 Gig NIC) and activating VSAN on both? Will VSAN distribute the traffic across both, or is just one used? And if the latter is the case, who decides which VMkernel port is used for the main traffic?

The second question is about high utilization in general. Do you implement any internal multi-channel algorithm per NIC? Do you implement, or plan to implement, support for remote direct memory access technologies (for example RoCE)? And if not, let's say you connect via 40 Gb/s Mellanox adapters: is VSAN able to utilize their full bandwidth, given that the hard disks, striping, and SAS controllers can deliver this speed?

Last question: What about DCB/PFC and flow control in general? Is there a recommendation covering all these technologies?

Best regards,

Joerg

cmiller78
Enthusiast

dedwardsmicron wrote:

The end result is a 2-disk-group chassis with 1 PCIe SSD to 5 SATA SSDs per disk group. Our PCIe SSDs are capable of much, much higher bandwidth than 10GbE: 400K IOPS @ 4K random read, 100K IOPS @ 4K random write. My SATA SSDs are capable of about 50K/15K IOPS @ 4K random read/write.

Are you talking about a Virtual SAN POC? Because you cannot have 100% SSD disk groups unless you want to tag your SATA SSDs as HDDs and I'm not sure that's even supported. Either way, let me use your math as an example.

First of all, Virtual SAN is going to write objects to your local disk and N remote nodes where N = your FTT level. Reads will be localized and not hit the network unless you plan on moving machines around frequently which I would hope is not the case. If you are moving for maintenance, you can perform a full data migration to keep data local.

With that understood, your reads should not hit the network at all, so factoring read IO into your network sizing doesn't make much sense; and even if you do, factoring read rates at your full drive capabilities doesn't make sense either, considering the vast majority of your data is local.

Writes, however, will factor in. Every write you perform will traverse the network based on your FTT policy. Let's assume you stick with the default FTT=1 and your math above. Under a 100% random 4K write workload your PCIe SSD can handle 100K IOPS, so one node could write 400,000 KB per second. 1 KB per second = 8 x 10^-6 Gbps, so 400,000 KB/s = 3,200,000 x 10^-6 Gbps = 3.2 Gbps.

FTT=1 - would consume 3.2Gbps and totally saturate half of both nodes (6.4Gbps to saturate two entire nodes)

FTT=2 - would consume 6.4Gbps and totally saturate half of three nodes (12.8Gbps to saturate three entire nodes)

FTT=3 - would consume 9.6Gbps and totally saturate half of four nodes (19.2Gbps to saturate four entire nodes)

Now if you dedicate 2 x 10GbE NICs to Virtual SAN and you configure LACP (and balance your VMK IPs properly), you could fully utilize 3 nodes with FTT=3, 4 nodes with FTT=2, and 5 nodes with FTT=1

You can support up to 8 x 10GbE NICs in vSphere 5.5, so you could conceivably quadruple your cluster size and, if you balance properly, run FTT=3 at hypothetical full speeds across 16 nodes. You'd kill your disks before you killed your network.
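If you want to replay that arithmetic with different drive specs or FTT levels, here's a minimal sketch (Python, invented helper names, decimal units). It follows this post's assumption that one copy of each write stays local and FTT copies traverse the network; treat it as an illustration, not an official sizing formula.

```python
# Minimal sketch of the write-replication arithmetic above.
# Assumptions (from this post, not an official VSAN sizing formula):
#   - a 100% random write workload described by IOPS and IO size
#   - one copy of each write stays local, FTT copies go over the network
#   - decimal units: 1 KB = 1,000 bytes, 1 Gbps = 10^9 bits per second

def write_stream_gbps(write_iops, io_size_kb):
    """Per-node write throughput converted from KB/s to Gbps."""
    return write_iops * io_size_kb * 8 / 1_000_000.0

def network_traffic_gbps(write_iops, io_size_kb, ftt):
    """Traffic leaving the writing node: FTT remote copies of the write stream."""
    return write_stream_gbps(write_iops, io_size_kb) * ftt

if __name__ == "__main__":
    # 100K IOPS @ 4K random write, as quoted for the PCIe SSD above
    for ftt in (1, 2, 3):
        gbps = network_traffic_gbps(write_iops=100_000, io_size_kb=4, ftt=ftt)
        print("FTT=%d consumes about %.1f Gbps of VSAN network bandwidth" % (ftt, gbps))
```

For the quoted drive this prints 3.2, 6.4, and 9.6 Gbps for FTT=1/2/3, matching the list above.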

However, in reality you likely wouldn't use anything over FTT=1 unless you have an N+2 or N+3 HA policy to begin with. You also will not hit the theoretical maximum of your SSDs, because most workloads are not 4K IOs and are a mix of sequential vs. random, read and write.

Instead of quoting the theoretical maximums of your drives, have you measured your actual workload requirements? What does your IO look like at peak? Read vs. write, avg IO size, sequential vs. random, etc.

What are the applications running on this?

joergriether
Hot Shot

I would agree for real-life workloads. But when it comes to backup windows, and to a single point-to-point connection that demands all the bandwidth it can get (assuming the storage and controller can deliver it), then speed matters and LAG/LACP can't help in most cases. So I understand the need to go beyond the 10G barrier (regarding point-to-point connections, especially for backup windows), and that's why I'd like to see RDMA with VSAN.

Now another word regarding the all-flash VSAN: there is a session scheduled for VMworld which at least gives hope that an all-flash VSAN may arrive in the near future.

dedwardsmicron
Enthusiast

I'm going to have to clarify and respond in a couple of segments, given that I'm still digesting your math along with things I've been reading.

Yes, we are doing a VSAN POC and we will be demonstrating it at VMworld this year. We are trying to understand the value of an all-SSD VSAN; we make the drives and we believe we should have an answer (good or bad). IMHO our TCO calculations are compelling. This means having a vehicle to experiment with all types of workloads, so we are running any application we can feasibly set up in our environment: currently synthetic workloads for characterization, VMmark, Login VSI, and OLTP as we get further down the rabbit hole.

Interestingly, I didn't consider the unsupported aspect of SSDs until recently, but the Dell H710 controller doesn't support pass-through and presents the drives as HDDs. Exactly what I needed. We would very much like to work with VMware and convince them that an all-SSD VSAN is not only cost effective but also works very well. Hence the POC at this stage.

You mention that there will be a local copy and N remote copies based on FTT. I've been watching IOs on my VSAN, and with an FTT of zero (no storage policy) I'm seeing IOs go remote to the other hosts and nothing local. Also, I've recently been reading that VSAN doesn't write anything locally, because if that host were to go down you would lose that data. Can you please elaborate on this a bit more?

joergriether
Hot Shot

What you just described with the H710 is called RAID0 mode, which is supported by VMware. What is not supported is letting the system "think" it has HDDs when the real physical devices are SSDs. This happens by nature of the controller, and VMware provides some "tricks" to set it right again.

You can easily override VSAN's disk type detection to force it to see a device as either SSD or HDD. In addition, it is possible to make VSAN see a non-local device as a local device (check the options enable_ssd and enable_local). Check out VMware KB: Enabling the SSD option on SSD based disks/LUNs that are not detected as SSD by default
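For completeness, here is a rough sketch of the kind of claim-rule tagging the KB describes, wrapped in Python so it can be run from the ESXi shell. Treat it as illustrative only: the device identifier is a placeholder, VMW_SATP_LOCAL assumes the device is claimed by the local SATP, and the exact steps in the KB article take precedence.

```python
# Illustrative sketch: tag a device so VSAN detects it as SSD (and/or local),
# following the approach in the VMware KB referenced above.
import subprocess

DEVICE = "naa.xxxxxxxxxxxxxxxx"  # placeholder: replace with the real device ID

def esxcli(*args):
    """Run an esxcli command, raising CalledProcessError on failure."""
    subprocess.check_call(["esxcli"] + list(args))

# Add a claim rule marking the device as SSD; use "enable_local" instead of (or
# in addition to) "enable_ssd" to present a remote device as local, as mentioned above.
esxcli("storage", "nmp", "satp", "rule", "add",
       "--satp", "VMW_SATP_LOCAL",
       "--device", DEVICE,
       "--option", "enable_ssd")

# Reclaim the device so the new rule takes effect, then list it to verify the flags.
esxcli("storage", "core", "claiming", "reclaim", "-d", DEVICE)
esxcli("storage", "core", "device", "list", "-d", DEVICE)
```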

You cannot control where VSAN puts your data. It distributes components autonomously, but respects your FTT policy, of course. Nevertheless, you cannot define affinity rules for, let's say, distributed datacenters, where you want to make sure copy A resides in datacenter A and copy B in datacenter B. But this may be on the roadmap. Check out vSphere Metro Storage Cluster using Virtual SAN, can I do it?

Best regards,

Joerg

cmiller78
Enthusiast

It's interesting that you are seeing IO leave the server with FTT=0. How did you see this? Watching network traffic or were you looking at IO in VSAN Observer? The latter would be a better way to determine where things go. An experiment would be to shut down all VMs but one, run a single VM, generate some IO, then deep dive into the disks in VSAN observer to see which host is taking the IO. From a network perspective you may be seeing control traffic not data.

Aside from the support aspect (which I can't comment on; maybe someone from VMware can chime in here), using SSDs for the data drives would provide limited benefit. Virtual SAN is going to optimize the IOs that happen on the back end. Every write your VM performs first hits the cache layer. All of the SSDs (PCIe in your case) pool together to form a cache layer, of which 70% is utilized for read cache and 30% for write buffer. Every write a VM performs lands in the pooled 30%, and over time that buffer flushes out to the HDDs (SATA SSDs in your scenario).

Typical workloads generally consist of more random IO than sequential. Spinning drives perform quite well with the latter, a characteristic that Virtual SAN exploits. When the write cache buffer flushes, it does so in 1MB stripes. This behavior, write coalescing, is what most storage arrays do as well in order to improve random write performance. In the case of Virtual SAN, your random writes hit the SSD cache buffer, which is really good at handling random writes; when that buffer flushes, it stripes 1MB chunks sequentially, which spinning disks are really good at handling.

On the read side, if the block is in cache you get great performance for random reads. If a block isn't in cache, Virtual SAN needs to go to the HDD tier and retrieve it. Its internal algorithms also cache hot data, so this dip into the HDD tier doesn't have to occur as often.

From a performance perspective, having SSDs act as HDDs would improve performance if your workloads do a lot of work with "cold" data that lives in the HDD tier. Typically, though, workloads have 10% hot data and 90% cold data for most volumes, so the benefit would be limited. Writing sequentially to SSDs isn't really going to give you much improvement either, because in the end your SAS/SATA bus will limit your bandwidth. The exception here would be loading up a server with all PCIe SSDs, tagging all but one as HDD, and then using the tagged devices for storage. In that case you would likely have greater bandwidth than SAS/SATA SSDs (PCIe 3.0 can do nearly 1 GB per second per lane - that's a big B, bytes). However, at some point this would become cost prohibitive and, depending on your workload, overkill.

Understanding your workload is key. If you have extremely large bursts of IO, more cache will help. If you have long periods of sustained high IO, having more HDDs and disk groups per server will help, as you will constantly fill your write buffer and need to flush it. The rule of thumb of 10% SSD for sizing is likely based on typical workloads; you would need to test your workload and scale accordingly.
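Just to illustrate that rule of thumb, here's a tiny sketch of how the cache tier would split out for a given consumed capacity; the 10% guideline and the 70/30 read/write split are the figures mentioned above, and the example capacity is made up.

```python
# Rough cache-tier sizing illustration using the figures discussed above:
# flash sized at ~10% of anticipated consumed capacity, and the cache device
# split 70% read cache / 30% write buffer. Example numbers are invented.

def cache_breakdown(consumed_capacity_gb, flash_ratio=0.10,
                    read_share=0.70, write_share=0.30):
    flash_gb = consumed_capacity_gb * flash_ratio
    return {
        "flash_gb": flash_gb,
        "read_cache_gb": flash_gb * read_share,
        "write_buffer_gb": flash_gb * write_share,
    }

if __name__ == "__main__":
    # e.g. 8 TB of anticipated consumed VM data per host (hypothetical)
    sizes = cache_breakdown(consumed_capacity_gb=8000)
    print("Flash tier:   {:.0f} GB".format(sizes["flash_gb"]))
    print("Read cache:   {:.0f} GB".format(sizes["read_cache_gb"]))
    print("Write buffer: {:.0f} GB".format(sizes["write_buffer_gb"]))
```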

The beauty of Virtual SAN, IMO, is the flexibility to right-size servers based on your workload and scale both vertically within a server by adding disks/disk groups and horizontally by adding nodes. In the end this is the most cost effective way to scale but it does require careful measurement and testing to create the most efficient node design.

P.S. in my testing with FTT=1 I watch VSAN Observer and I see writes hitting the local node where the VM lives and one remote. If you have multiple sources of IO on the datastore it is more difficult to see where each VM is writing which is why I suggest testing with only a single VM and using VSAN Observer. You should see IO hitting the SSD in the local node with FTT=0 unless that node is full or down.

depping
Leadership

It all depends on where the components of your objects are placed, right? But in environments with more than 4 hosts in the VSAN cluster it is more likely that components do not align with the VM from a compute perspective, which means that IO will hit the wire in most cases.

leguminous1
Contributor

cmiller78 wrote:

First of all, Virtual SAN is going to write objects to your local disk and N remote nodes where N = your FTT level. Reads will be localized and not hit the network unless you plan on moving machines around frequently which I would hope is not the case. If you are moving for maintenance, you can perform a full data migration to keep data local.

According to the article by Rawlinson Rivera -  https://blogs.vmware.com/vsphere/2014/07/understanding-data-locality-vmware-virtual-san.html - vSAN does not make any effort to keep either SSD cache or the vmdk for a VM on the host where a VM is running because doing a full migration and re-warming SSD cache would have a much more notable impact on performance than doing a remote operation over 10GbE.

Ethan

cmiller78
Enthusiast

Thanks, I had not read this yet. This is interesting and a bit different from what we learned initially during our partner enablement.

All that being said it wouldn't change my math much since I was referring to 100% random writes. If the workload was mixed, you'd still drive roughly the same bandwidth assuming IO sizes were similar.

Appreciate the link and comment from Duncan!

dedwardsmicron
Enthusiast

My testing is showing some truth in the math at 4K IOs, which is nice to see in some regards; I would have expected better results from our HW capabilities. That aside, I think where I was stuck is that while our writes are limited to a bandwidth of about 600 MB/s per PCIe SSD, our reads are capable of 3.3 GB/s per PCIe SSD. So when we step away from the 4K numbers and start talking about larger block sizes, the situation is a bit different. Performance testing on physical servers shows that this configuration should be capable of 6.6 GB/s read and 1,200 MB/s write per host, so 6 nodes should be able to push about 39 GB/s read and 7 GB/s write. I've been able to hit 5.4 GB/s read and 4.6 GB/s sequential at 128K over 2 x 10GbE. As near as I can tell I am not getting the benefit of the 2 ports; I'm pretty sure I set it up correctly, and network utilization shows reasonable sharing of both ports during heavy IO. It turns out we were not able to get any better performance with a single 40G IB link either, so I'm wondering if I'm hitting some other limitation?
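For what it's worth, a quick sanity check of the aggregate numbers quoted above (assuming two PCIe SSDs per host, one per disk group as described earlier in the thread; this only replays the multiplication and says nothing about where the bottleneck sits):

```python
# Replays the aggregate-throughput arithmetic quoted above.
# Assumes 2 PCIe SSDs per host (one per disk group, per the earlier description).
PCIE_READ_GBPS = 3.3   # GB/s read per PCIe SSD, as quoted
PCIE_WRITE_GBPS = 0.6  # GB/s write per PCIe SSD, as quoted
SSDS_PER_HOST = 2
HOSTS = 6

per_host_read = PCIE_READ_GBPS * SSDS_PER_HOST     # ~6.6 GB/s
per_host_write = PCIE_WRITE_GBPS * SSDS_PER_HOST   # ~1.2 GB/s
print("Per host: %.1f GB/s read, %.1f GB/s write" % (per_host_read, per_host_write))
print("Cluster:  %.1f GB/s read, %.1f GB/s write"
      % (per_host_read * HOSTS, per_host_write * HOSTS))  # ~39.6 / ~7.2 GB/s
```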

leguminous1
Contributor

So, I promised some results from my testing with an IB setup, and we've now been a few rounds of IB vs. 10GbE. To keep it concise: we observed roughly a 40% increase in max throughput just by switching from 10GbE to our IB setup. We didn't really see much improvement in small transactional IO, but as we crank the IO size up you can see 10GbE plateau at around 11 GB/s; switch VSAN to the IPoIB link and the plateau occurs closer to 15 GB/s. I'm not entirely sure whether that is the SAS bus getting saturated - we're running 6Gb SAS per node on 6 nodes - or something else in the chain.

However, the Mellanox IB driver for ESXi is old and has known glitches, so we're moving forward with the 10GbE solution. We looked into the 56GbE offerings from Mellanox, since our VPI cards support this, but the license to run 56GbE on our 6036 switch is more expensive than buying a new purpose-built 56GbE switch from them, and we haven't run into a real-world situation where we've saturated the 10GbE links... yet... We only have ~24 "real" systems on the vSAN at this point.

One word of warning: VMware support has been very particular about escalating tickets for our vSAN because they're not vSAN Ready Nodes and not all components are explicitly listed on both the vSAN HCL and the ESXi 5.5U2 HCL. So one component is supported on ESXi 5 but not on 5.5U2 - basically a paperwork exercise between the two vendors - and that has come back to bite us... I'm assuming this will get more relaxed as the product matures and more straightforward as the HCL grows.

Ethan
