VMware Cloud Community
joergriether
Hot Shot

Networking questions

Hi,

The VSAN documents state that you can use NIC teaming in vSphere for VSAN-enabled VMkernel ports, with a focus on high availability and redundancy but not on aggregation and bandwidth increase. Now my question would be: what happens if you configure two VMkernel ports per host (each connected to its own 10 Gig NIC) and activate VSAN on both? Will VSAN distribute the traffic across both, or is just one used? And if the latter is correct, what decides which VMkernel port is used for the main traffic?

The second question is about high utilization in general. Do you implement any internal multi-channel algorithm per NIC? Do you implement, or plan to implement, support for remote direct memory access technologies (for example RoCE)? And if not, let's say you connect via 40 Gb/s Mellanox adapters: is VSAN able to utilize their full bandwidth, given that the hard disks, striping and SAS controllers can deliver this speed?

Last question: What about DCB/PFC and flow control in general? Is there a recommendation covering all these technologies?

Best regards,

Joerg

29 Replies
depping
Leadership

As no one else has responded let me take a stab at it.

The VSAN documents state that you can use NIC teaming in vSphere for VSAN-enabled VMkernel ports, with a focus on high availability and redundancy but not on aggregation and bandwidth increase. Now my question would be: what happens if you configure two VMkernel ports per host (each connected to its own 10 Gig NIC) and activate VSAN on both? Will VSAN distribute the traffic across both, or is just one used? And if the latter is correct, what decides which VMkernel port is used for the main traffic?

Yes, multiple VSAN VMkernel interfaces are supported, as long as the interfaces are part of different subnets! VSAN will distribute traffic; however, as far as I am aware VSAN is not optimized for this use case, and as such results may vary.

The second question is about high utilization in general. Do you implement any internal multi-channel algorithm per NIC? Do you implement, or plan to implement, support for remote direct memory access technologies (for example RoCE)? And if not, let's say you connect via 40 Gb/s Mellanox adapters: is VSAN able to utilize their full bandwidth, given that the hard disks, striping and SAS controllers can deliver this speed?

Mellanox / RoCE is supported for vSphere (VMware KB: Configuring Mellanox RDMA I/O Drivers for ESXi 5.x (Partner Verified and Supported)). I have seen no explicit support statements around VSAN when it comes to this. Maybe kmadnani can comment on support. Generally VMware does not comment on roadmaps, so if it is not supported today, only time will tell.

Last question: What about DCB/PFC and flow control in general? Is there a recommendation covering all these technologies?

I have not seen any best practices regarding flow control to be honest. Maybe Rawlinson or someone else from the Tech Marketing team knows if a document is being developed.

Duncan

-----------

Book out soon: Essential Virtual SAN: Administrator's Guide to VMware VSAN (VMware Press Technology)

joergriether
Hot Shot

Hi Duncan,

thanks.

So would it be correct to say

a) VSAN is not routable

b) If you need multiple network segments, you have to use multiple VMkernel ports, one located in each particular network segment

Best regards,
Joerg

SimonTodd
Enthusiast

Hi Joerg

You are correct. Currently VSAN is not routable; all the hosts participating in VSAN have to be configured to use the same L2 subnet.

Duncan is correct: the use of multiple VMkernel interfaces is supported. However, we tested this in the past and it didn't yield much in the way of a performance gain, and the requirement for this is that the VMkernel interfaces reside on different L2 subnets.

PFC can be an issue. Because of the way VSAN writes to two or more locations (depending on the Failures to Tolerate setting), we confirm the write once all the hosts involved have acknowledged the write to their SSDs. If PFC kicked in and delayed the write somewhere in the network, this would have an effect on performance.
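To illustrate that point, here is a toy model I knocked together (an illustration only, nothing to do with the actual VSAN code): the write completes only when the slowest replica path has acknowledged it, so a PFC pause on a single link stretches every write that touches that link.

import random

# Toy model: commit latency of a mirrored write is the maximum of the
# per-replica acknowledgement latencies (FTT=2 means three replicas).
def commit_latency_ms(replica_latencies_ms):
    return max(replica_latencies_ms)

random.seed(1)
healthy = [random.uniform(0.3, 0.6) for _ in range(3)]  # three healthy paths, ms
paused = healthy[:-1] + [healthy[-1] + 5.0]             # one path hit by a 5 ms PFC pause
print(f"all paths healthy: {commit_latency_ms(healthy):.2f} ms")
print(f"one path paused:   {commit_latency_ms(paused):.2f} ms")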

VSAN uses multicast, so this is also an important consideration when implementing solutions such as Priority Flow Control, as I have seen some issues (outside of VSAN) with PFC affecting multicast traffic... just something else to bear in mind.

Simon

joergriether
Hot Shot

Hi Simon,

thanks!

You wrote about paying attention to PFC when it comes to multicast because of the speed. Are you sure about that? I am asking because Cormac wrote that multicast is only used for very few things in a VSAN environment, for example metadata and/or object creation.

Using tcpdump I discovered that the real raw file data, the place where the big data goes, is *always* transported plain and raw via standard TCP on port 2233. Can you comment on this?
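For what it's worth, this is roughly how I tallied the bytes per destination port (a quick-and-dirty helper; the interface name, packet count and the parsing of tcpdump's terse output are assumptions for my capture setup, so adjust as needed - it needs tcpdump and root):

import subprocess
from collections import Counter

def tcp_bytes_per_dst_port(interface="vmnic1", packets=5000):
    # -nn: no name resolution, -q: terse lines such as
    # "12:00:00.000000 IP 10.0.0.1.52311 > 10.0.0.2.2233: tcp 1448"
    out = subprocess.run(
        ["tcpdump", "-nn", "-q", "-i", interface, "-c", str(packets), "tcp"],
        capture_output=True, text=True, check=True).stdout
    counts = Counter()
    for line in out.splitlines():
        parts = line.split()
        if len(parts) >= 7 and parts[1] == "IP" and parts[5] == "tcp":
            dst_port = parts[4].rstrip(":").rsplit(".", 1)[-1]
            counts[dst_port] += int(parts[6])
    return counts

for port, nbytes in tcp_bytes_per_dst_port().most_common(5):
    print(port, nbytes)

Practically all of the payload ended up on destination port 2233.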

Best regards,
Joerg

SimonTodd
Enthusiast

Hi Joerg

Multicast by nature is very "bursty" and PFC does not like this type of I/O. If your network is congested to the point that you have to invoke PFC, then it would be wise to segregate the traffic onto another network. With VSAN you get access to the vDS, where you can utilise functionality such as Network I/O Control, so if you wish to share, for example, a 10Gbit NIC for multiple purposes including VSAN, you can prioritize VSAN traffic in times of congestion.

I'll look into the port 2233 part for you; I have not had this port mentioned to me before, so leave that one with me.

Regards

Simon

joergriether
Hot Shot

SimonTodd wrote:

I'll look into the port 2233 part for you; I have not had this port mentioned to me before, so leave that one with me.

That would be great, Simon! Thanks in advance.

Best regards,
Joerg

SimonTodd
Enthusiast

Joerg

Engineering just confirmed that port 2233 is used for RDT traffic, a peer-to-peer (unicast) communication between the VSAN nodes, so even I have learned something new today 🙂

Simon

joergriether
Hot Shot

Great, thanks Simon 😉

Now one question: What is "RDT"? Is this a special name for VSAN inter-node-raw-data-traffic? 😉

Or is it a specially crafted p2p protocol designed just for VSAN?

Best regards,
Joerg

depping
Leadership

joergriether wrote:

Great, thanks Simon 😉

Now one question: What is "RDT"? Is this a special name for VSAN inter-node-raw-data-traffic? 😉

Best regards,
Joerg

RDT is a VMware VSAN proprietary protocol of which no details have been shared or will be shared as far as I have been told.

joergriether
Hot Shot

At least we finally know the name of it... "RDT" 😉 One day at a time 😉

Maybe tomorrow you will tell us it is indeed a classic p2p design???? 😉

joergriether
Hot Shot

But all joking aside - people want to know and people need to know.

Me - I am a storage and HA guy - I like to understand in detail how the systems on which my company's data depends actually work. And I have got to know you all as very open-minded enthusiasts. Take the posts of Cormac or Duncan or Rawlinson - great and so very useful, because they dig very deep into the technical details.

At some point you will have to bring the details of the VSAN transport protocol to light, because people like me want to know how it works in order to have confidence and trust in a storage system. This goes particularly for storage systems, because they are so important for the whole ecosystem.

Best regards,

Joerg

leguminous1
Contributor

I'm interested in the RDMA / IB support aspect of this thread. My team has applied for some funding to set up a POC for vSAN, and we're seriously considering using Mellanox and IPoIB as the backbone for vSAN communications since it's relatively inexpensive, low latency and high bandwidth. I have a couple of concerns about this, though: IB uses four 14Gb channels to make a link and really benefits from multi-threaded processes, which it sounds like vSAN is not.

Infiniband IPoIB performance problems? | Page 2 | ServeTheHome and ServeThe.Biz Forums

There was a lot of chatter from the VMware CTO about IB / RDMA reducing latency, improving vMotion performance and reducing CPU overhead back in 2012, but I cannot find much since then. Are there still projects in the works to support RDMA/IB or RoCE for core VMware functions?

We've settled on an FTT setting of 2, since having only one copy of a critical system for over an hour would, you know, be bad. In this scenario every block would be written three times on three different hosts, and being able to use multiple threads would speed this up and allow for the use of aggregated links. I'm hoping that is on the list for a coming release!

I agree with Joerg that a bit more transparency on the internals of vSAN would be helpful! Much of the useful information I've found has come from blogs and forums, and it would help if VMware organized this a bit better.

Ethan

joergriether
Hot Shot

Very interesting questions. Maybe Cormac, Duncan or Simon can comment on these.

I think VSAN is not yet capable of using multi-channel technologies to bundle high-performance links like 10GbE / 56Gb (IB, 4x14) / 40GbE (for example Mellanox) together to simultaneously accelerate a single stream, like e.g. some of the switch-independent multichannel implementations other vendors have brought out over the last two years. So here is a possibility for improvement from my point of view.

Regarding RDMA - the last thing I read was from September 2012, especially focused on in-guest RDMA via SR-IOV VF DirectPath I/O - check these slides here: http://cto.vmware.com/wp-content/uploads/2012/09/RDMAonvSphere.pdf

Does anybody know of a new statement or a new development?

Best regards,

Joerg

cmiller78
Enthusiast

A couple of additional comments:

- PFC is meant for layer 2 storage technologies like FCoE, in order to create a congestion control mechanism similar to buffer credits in a Fibre Channel network. Given that Virtual SAN leverages TCP between nodes, TCP congestion avoidance (AIMD & slow start) on top of PFC would potentially create issues and add complexity you don't need. Instead you would provide bandwidth guarantees for traffic classes so you don't starve storage, or give the storage traffic a priority queue with a bandwidth cap so it doesn't starve other traffic.

- As noted above, routed traffic is not supported. This is generally true of all vmkernel traffic other than management. In addition, you'd create more complexity by having to enable L3 multicast if you aren't already using it.

- I'm curious what your use case is here for >10GbE. SAS would red-line at 6Gbps, so maximum throughput would ultimately be back-end limited with 40Gbps; and 10GbE configured with LACP (if you can), assuming you have spaced your IP addressing properly, would result in a stream of writes utilizing 2 x 10GbE ports, which is more than you could handle on the back end assuming you are not packing multiple controllers and disk groups. If you are planning multiple controllers and disk groups I could see where you might attempt to push more than 10Gbps, but given that Virtual SAN is designed to consume no more than 10% of the CPU on a host, I am not sure how far you could ultimately push it without hitting a CPU wall (not something I've personally tested).

What kind of workloads are you planning to run on this? Generally speaking, VM workloads are small, random I/O, which tends to consume little bandwidth and exhaust storage IOPS rather than bandwidth. I'm really curious about your use case here.

joergriether
Hot Shot

I can only speak for myself: my "personal" main use case with a need for extremely high bandwidth focuses mainly on the datacenters of folks where special policies are in place, like for example daily full backups of extremely busy and extremely large database server VMs.

Best regards,
Joerg

leguminous1
Contributor

I would run out of steam on SAS at 6Gbps, but as I understand it, if I have one guest with an FTT of 2 sitting on a host that doesn't contain its vmdk, any write would go across the wire to three remote SSDs (1x3x6Gbps = 18Gbps). If I make that 3 guests with an FTT of 2 on a single host, with their vmdks spread across 9 other hosts (3x3x6Gbps = 54Gbps), I could use all the bandwidth I can get! That covers running database backups and big linear I/O; for random small I/O from developer workstations or transactional systems, IB has a latency of 100-200ns vs 800-1000ns for high-performance Ethernet. I'm fairly certain that running IPoIB will affect this, but I don't know how much, and with the talk of RDMA support from the VMware CTO in the not too distant past - which seems perfect for vSAN in a lot of ways - I'm thinking this will come out as a win for us.
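Here is the back-of-the-envelope version of that arithmetic (my own illustration, not a VSAN sizing tool; it assumes every replica lands on a remote host and ignores protocol overhead):

def egress_gbps(guests, per_guest_write_gbps, ftt=2, local_replica=False):
    # FTT=2 means 3 copies of every write; subtract one if a copy stays local.
    replicas = ftt + 1
    remote_copies = replicas - (1 if local_replica else 0)
    return guests * per_guest_write_gbps * remote_copies

print(egress_gbps(1, 6.0))  # 18.0 Gbps - one guest writing at SAS line rate
print(egress_gbps(3, 6.0))  # 54.0 Gbps - three such guests on one host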

If vSAN consumes 10% of 20 x 3GHz cores I'd be impressed; it shouldn't have much to do besides shoveling data around, which with a TOE should be pretty easy. With RDMA (yes, I'm in an RDMA/IB championing phase) there is almost nothing for the CPU to do: just tell the RDMA driver where the data is in memory and it mirrors the memory to all the other hosts without involving the CPU.

Ethan

joergriether
Hot Shot

It depends (regarding CPU) on whether you use iWARP or RoCE. But I'd love to see any RDMA technology implemented.

Best regards,

Joerg

dedwardsmicron
Enthusiast

My comment is regarding "- I'm curious what your use case is here for >10GbE."

We are preparing a POC for a high-performance storage backend. The end result is a 2-disk-group chassis with 1 PCIe SSD and 5 SATA SSDs per disk group. Our PCIe SSDs are capable of much, much higher bandwidth than 10GbE: 400K IOPS @ 4K random read, 100K IOPS @ 4K random write. My SATA SSDs are capable of about 50K/15K IOPS @ 4K random read/write.
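Converting those IOPS figures into line-rate terms (a rough illustration only; it ignores protocol overhead and assumes every I/O has to leave the host):

def iops_to_gbps(iops, io_size_bytes=4096):
    return iops * io_size_bytes * 8 / 1e9  # bits per second, expressed in Gb/s

print(f"{iops_to_gbps(400_000):.1f} Gb/s")  # ~13.1 - one PCIe SSD, 4K random read
print(f"{iops_to_gbps(100_000):.1f} Gb/s")  # ~3.3  - 4K random write
print(f"{iops_to_gbps(50_000):.1f} Gb/s")   # ~1.6  - one SATA SSD, 4K random read

So on paper a single PCIe SSD at 4K random read already exceeds a 10GbE link, which is exactly why I am asking about 40G.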

Based on my very initial / very early testing, my fear is that there is no point in going beyond 10G (based on my data, and what I am reading).  Does anyone have any information on how to configure VSAN to perform better on a 40G?  Obviously, I'm very interested in the RDMA discussion, etc.

Thanks in advance!

D.

leguminous1
Contributor

I'll post any info I can about the setup I'm working on when I get anything meaningful out of it. I was initially looking at PCIe devices, but they were cost-prohibitive until I saw the Intel NVMe devices; an open standard for PCIe SSDs could break down the cost walls around them. Just waiting on the VMware driver...

My understanding is that SATA devices are treated as remote storage by vSAN and there are some issues using them. I don't have an article to back that up at the moment, though...

Ethan
