VMware Cloud Community
Wombat99
Contributor

How to force Distributed Switch traffic through a particular NIC between two cluster members?

We have a vSphere cluster consisting of two identical servers running ESXi 6.7U2.

Each server has three 1Gbps uplinks. The uplinks have public IP addresses and are connected to the ISP's top-of-rack switch.

Two of the uplinks are for traffic in and out of ESXi itself (largely for redundancy). The third uplink on each machine serves as the external interface of a VM-based router, to which a different /24 per server is routed. The VM-based routers on each server then make the two /24s available to both cluster members via two Distributed Switches, one Distributed Switch per /24.

That way, both servers can draw out of either /24 pool as desired.

The servers have a second set of NICs that offer 2x 40Gbps connectivity and are directly connected via DAC cables.

How do we need to configure those interconnect NICs and ESXi to ensure that traffic between the servers from VMs attached to the Distributed Switches will only run through the 2x 40Gbps NICs and never take a "detour" through the slower 1Gbps side (and its associated vSwitch)?

I read quite a few docs, watched a lot of how-to videos, and searched Google, but I can't figure out how to actually enforce such routing.

Finally, what IP addresses should be used for the interconnect NICs? I presume that if we use IP addresses out of the two /24s, traffic from one /24 to the other /24 would be routed out via one router VM, back in via the other router VM, and therefore again be bottlenecked.

Do we use private IP addresses on each of the 40Gbps NICs and then somehow force a route between the /24s via the interconnect NICs from the ESXi command line?

My apologies, I am neither an ESXi nor a networking expert.

Big Thanks!

--Marc

8 Replies
daphnissov
Immortal

This sounds like a really convoluted design and I'm not fully understanding how you have these things connected. Before digging in, can you explain why you have settled on this design pattern? What is running on these two ESXi hosts and what are you trying to accomplish? Because almost certainly there exist better ways to go about this that make more sense and make traffic flow distribution easier.

Wombat99
Contributor

daphnissov: there are a few reasons:

1) Redundancy: if one of the two servers (and its associated VM-based router for one of the /24s) goes down, we can trivially, if temporarily, run the VMs of both cluster members on the surviving server, using its /24 uplink space. I probably should mention that both VM-based routers will serve IP addresses to the VMs on a given Distributed Switch /24 via DHCP/DHCPv6, so if one server goes down we simply run all its VMs on the other server, switch those VMs' uplink to the remaining functioning Distributed Switch, they automatically get renumbered out of the other /24, and all we need to do is update DNS to be back up and running with the full set of VMs until the dead cluster member is repaired.

2) We need a fast interconnect between VMs running on two different cluster members to test the clustering performance of certain Open Source software. Some users of that software connect their systems via QSFP+ NICs and switches, but we've never had the hardware to test this ourselves. With two servers that each have a dual-port QSFP+ NIC and two DAC cables between them, we will finally have a setup that should, at least in theory, allow testing that approaches real-life usage of the software.

3) The ISP hosting the two cluster members doesn't offer top-of-rack switches with QSFP+ ports and likely never will. High-speed clustering via DAC is our only technically feasible way to achieve a high-speed interconnect between the two ESXi hosts in the cluster.

Does that help explain things?

Thanks,

--Marc

daphnissov
Immortal

I'm having a difficult time visualizing what you're describing with regard to this VM router and what VMs you have running that need those services, plus how these interconnected 40 GbE interfaces come into play. It'd be very helpful if you could paste in some screenshots of your vSphere networking for us to examine.

Wombat99
Contributor

I'll try to draw something up in PowerPoint. I just hope I can make it all fit on one page.

At the moment, the machines are neither racked nor configured, so unfortunately I can't just post a screenshot of the current configuration. Which, not coincidentally, is why I am asking this question: I am not sure how to configure it. I will draw up what I have in mind.

While I've been running a hobby ESXi server for years, I wouldn't say vSphere is my area of expertise. I've only recently gotten into clustering and Distributed Switches, and I have a simple cluster with Distributed Switches running at home. The configuration at the ISP will be more complex, so it may take me a while to get this all drawn up. Will do my best.

Thanks,

--Marc

Wombat99
Contributor

As requested, here is a diagram for what we have in mind. Please do feel free to ask me any questions about it.

Please ignore the "Phase 2" question on the same diagram for the purposes of this post; it is included on the diagram to be considered in the future. The question to which I am seeking an answer here is the one labeled "How to force Dswitch traffic through this NIC?"

Big Thanks!

--Marc

daphnissov
Immortal

Alright, where do I begin. What you're trying to do just doesn't make a whole lot of sense to me, so maybe I'm still not understanding the objectives.

  1. You say you have three 1 GbE vmnics (physical NICs) per host. Your diagram shows all three in some team as uplinks for vSS 0. Also connected to vSS 0 you appear to have a vmkernel port (I'm presuming the only kernel service this is offering is management, is that correct?) and a virtual machine port group to which a pfSense VM will be connected. You also strangely have IPs assigned (somehow) to each individual vmnic in that team. That just isn't possible to do, nor would you want to try it in this configuration anyhow. A vmkernel port is capable of having an IP assigned, so that part makes sense. What is that IP? On what network? And your pfSense VM has what IP on the interface that connects to its VM port group?
  2. Second is the matter of the DHCP server to begin with. This just confuses me. You show two VMs per host (not counting pfSense which is also a VM). Why do you need DHCP services at all anywhere here? Is DHCP the only reason you're using a pfSense box here?
  3. Cross-connecting these 40 GbE interfaces in this fashion isn't supported, do you know that? I get that your hosting provider doesn't have 40 GbE connectivity, but this would be one case where if you absolutely require it, you should provide your own switch which can be private to this environment.
  4. Alternatively to #3, why not cluster within an ESXi host so it doesn't have to egress? Each host can have both VMs that act in a given cluster. If you're concerned about failover, there are ways to replicate VMs or sets of VMs to another host in the case of a single host failure.

All of this aside, let me just try to answer your direct question about steering traffic out of a distributed switch:

If you've got your workload VMs (whatever you want to call them) that have a single interface connected to a distributed port group with a given network assigned, you must not assign that same network to another port group and connect it to both places; otherwise you've just created an L2 loop. As long as the uplink used for the vDS is in the same broadcast domain as the VMs on the other side, the ARP table will have those MACs populated and traffic will get switched out that uplink. And you don't assign IPs to uplinks. You assign IPs to either vmkernel ports or to vNICs (virtual machine NICs) connected to port groups. Those are your only two options for interface assignment.
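To make that last point concrete, here's a minimal pyVmomi sketch (nothing here comes from your environment; the vCenter/host names, port group name, and 192.168.x.x addressing are placeholders) showing the only way a host itself gets an address: you create a VMkernel adapter on a port group, a distributed port group in this case, and the IP lives on that vmkernel port, never on the uplink:

```python
# Hypothetical example: create a VMkernel adapter with a static IP on a
# distributed port group. All names and addresses below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()           # lab convenience only
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
content = si.RetrieveContent()

def find_obj(vimtype, name):
    """Return the first inventory object of the given type with the given name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next(o for o in view.view if o.name == name)
    finally:
        view.Destroy()

host = find_obj(vim.HostSystem, "esx01.example.com")
dpg = find_obj(vim.dvs.DistributedVirtualPortgroup, "DPG-Interconnect")

# Tie the new vmkernel port to the distributed port group...
conn = vim.dvs.PortConnection(
    switchUuid=dpg.config.distributedVirtualSwitch.uuid,
    portgroupKey=dpg.key)

# ...and give it a static IP; the address belongs to the vmkernel port,
# not to any physical uplink.
spec = vim.host.VirtualNic.Specification(
    ip=vim.host.IpConfig(dhcp=False, ipAddress="192.168.100.11",
                         subnetMask="255.255.255.0"),
    distributedVirtualPort=conn)

vmk = host.configManager.networkSystem.AddVirtualNic(portgroup="", nic=spec)
print("created", vmk)                            # e.g. "vmk1"
Disconnect(si)
```

The same rule applies to your VMs: their addresses live on the vNICs connected to a port group, and the port group's teaming policy decides which uplink that traffic leaves through.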

What I would recommend here (and which you may not be interested in) is to look at a redesign of this networking idea. Start at the applications. What are these? How are they used? What are the requirements based around them? Is it solely availability? Something else? How many VMs are involved? How many users are involved? Etc.

Wombat99
Contributor

Thanks again for the time and effort you are investing in this response. I realize that in all likelihood I am missing something obvious.

My apologies in advance for any bugs in my drawing of the right (green) side of the diagram. That part is working in my test setup, as is the pfSense router/DHCP VM. It has been a while since I looked at the green parts in detail, having focused instead on the other colors, which I can't model in my test setup.

I have pfSense connected to a regular vSwitch on its LAN side on a single machine, and I can connect VMs to that LAN vSwitch via DHCP just fine.

In practice, there will be more VMs connected to the Distributed Switches on both ESXi hosts than just the four shown. I drew four in total to show what the use cases are and because I don't have space on the page to draw 20-30 VMs.

You are correct that the VMkernel port on the green side currently only supports management. I suspect there will need to be VMkernel ports connected to Dswitch A and Dswitch B that support vMotion, etc. What I don't know, and have not been able to find on the Internet, is which of the five or six services one can enable on a VMkernel port is used to move data between Distributed Switches. Is it "Management"? Is it "vMotion"? Is it something else? Something has to move the Distributed Switch data between two machines in a cluster. What is that something? I suspect that Distributed Switches on two cluster members are somehow connected via VMkernel ports on each member using one of those services, and I would love to know which one, because I think it would make things much clearer in my mind.

One reason I went with egress via the 2x 40Gbps NICs is that it allows us, for certain load-testing scenarios, to run one workload VM on each ESXi host using the full set of computing resources each host provides, and to most closely model how two systems running the software might be connected in real life. Also, we managed to pick up the 2x 40Gbps NICs for a very reasonable price. Alas, a QSFP+ switch exceeds the budget for this project. Hence the DACs.

Reading your response carefully, I believe one area where I may have gone wrong is that I was under the impression that I had to assign IP addresses to the interconnect NICs.

Finally, to answer your question about why we went the DHCP route for the workload VMs in the first place: it was to allow easy migration of the VMs between the two /24s flowing into the two cluster members, and also to allow running some of the VMs at a DR site that has yet another IP address space, all without having to manually renumber each VM. If VMs are moved between hosts, sites, and address spaces, all we have to do is load the corresponding IP address for that VM and that site into DNS, and at most an hour later that VM will be up, running, and accessible, be it on the other cluster member or on some server half a world away.

(We plan to use Veeam for the failover, which was recommended to me in a previous inquiry on this forum and which, based on my limited testing so far, seems to do what we are looking for regarding DR, as long as the VMs don't require manual renumbering when run from another site.)

I will mark your previous answer as the best answer and again thank you for your time. If you can think of any other advice, or happen to know the answer to the question of how exactly Distributed Switches move data between cluster members, I would love to hear it.

I am deeply appreciative of your effort and your detailed responses!!!

--Marc

daphnissov
Immortal

To answer your question about the distributed switch and traffic flows: there is nothing special about how a distributed switch moves packets compared to a standard switch. The vDS exists to simplify management; it is a control-plane object, not a data-plane object. Just like with a vSS, you assign one or more uplinks to a vDS. Those uplinks are in a profile, and this profile is applied to any host belonging to the vDS. And just like with a vSS, the virtual machine port groups use those uplinks to ingress/egress their traffic. How that happens depends on the teaming policy applied to the vDS. With a vDS there are many more teaming policies available than with a vSS, but teaming policies don't determine how traffic reaches other hosts on the vDS; they determine how traffic leaves each host. Where that traffic gets switched or routed is irrelevant to both the ESXi host and the vDS object; that's a concern for the external networking infrastructure.
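If it helps to see what "how traffic leaves each host" means in configuration terms, here is a hedged pyVmomi sketch of the usual way to pin a distributed port group to one uplink: set its teaming policy to an explicit failover order with only that uplink active. The port group name "DPG-Interconnect" and uplink name "Uplink-40G" are placeholders, and the uplink strings must match the uplink port names defined on your own vDS:

```python
# Sketch: pin a distributed port group's traffic to specific uplinks by
# setting an explicit failover order. Placeholder names throughout.
from pyVmomi import vim

def pin_portgroup_to_uplinks(dpg, active, standby=None):
    """Reconfigure a vim.dvs.DistributedVirtualPortgroup so only the named
    uplinks are active; traffic for this port group then leaves each host
    through those uplinks only."""
    teaming = vim.dvs.VmwareDistributedVirtualSwitch.UplinkPortTeamingPolicy(
        inherited=False,
        policy=vim.StringPolicy(inherited=False, value="failover_explicit"),
        uplinkPortOrder=vim.dvs.VmwareDistributedVirtualSwitch.UplinkPortOrderPolicy(
            inherited=False,
            activeUplinkPort=active,
            standbyUplinkPort=standby or []))

    spec = vim.dvs.DistributedVirtualPortgroup.ConfigSpec(
        configVersion=dpg.config.configVersion,    # required for reconfigure
        defaultPortConfig=vim.dvs.VmwareDistributedVirtualSwitch.VmwarePortConfigPolicy(
            uplinkTeamingPolicy=teaming))
    return dpg.ReconfigureDVPortgroup_Task(spec)

# Usage (assuming a connection and a lookup helper like the earlier sketch):
# task = pin_portgroup_to_uplinks(
#     find_obj(vim.dvs.DistributedVirtualPortgroup, "DPG-Interconnect"),
#     active=["Uplink-40G"])
```

The same setting is exposed in the vSphere Client under the port group's Teaming and failover policy, so scripting it is optional.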

Lastly, regarding DHCP, I would just add that if you're using Veeam for failover, it has the ability to re-IP VMs as well as assign the replicas to a different destination network. These two things in tandem give you just about everything you need for a given VM to get the necessary network services on the destination side.