VMware Cloud Community
Samsonite801
Enthusiast

Physical interface bridging possible in ESXi?

Not sure if I should post this here or in the general vSphere forum, but my desired usage would be for VSAN...

In normal Linux, an admin can build a br0 logical container (bridge interface) and then associate 2 pNICs (say eth0 and eth1) with it so that they reside in the same L2 broadcast domain and act as switch ports. You then assign your local interface IP to the br0 construct, and the physical eth0 and eth1 ports on the box behave like switch ports and can actually pass traffic across them inside the bridge.
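
For reference, a minimal sketch of what I mean on the Linux side, using the iproute2 tools (interface names and the address are just examples):

```
# Create the bridge and enslave the two physical NICs (example names)
ip link add name br0 type bridge
ip link set eth0 master br0
ip link set eth1 master br0
ip link set eth0 up
ip link set eth1 up
ip link set br0 up

# The host IP lives on the bridge, not on the physical ports
ip addr add 192.168.1.10/24 dev br0
```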

Can this same thing be done on ESXi?

For example, if you have a dual-port physical NIC card, is it possible to build a logical bridge construct in ESXi and assign both pNICs of that dual-port card to it so they can pass broadcast traffic across? Or could one simply assign 2 uplinks to a vDS, make them promiscuous, and pass traffic between them? Or could a bridge somehow be built below the hypervisor layer so only one vmnic is presented to the hypervisor? Is there a standard way to accomplish this?

I want to configure a 3-node ESXi cluster for VSAN testing in a switchless daisy-chain configuration like this: ESXi-01 <--> ESXi-02 <--> ESXi-03. Requirements: no physical switch can exist, the network will be isolated (no gateways), and it will be used only for VSAN traffic in my example.

In my particular example above, only the ESXi-02 host would actually need the dual-port NIC installed to start with, but ideally this setup could be expanded to include additional nodes as well. Think of it as an isolated backplane for VSAN. Once the POC is established, a second set of PCIe dual-port network adapters can be installed to eliminate the single point of failure.

From what I understand, though, VSAN requires all nodes to be in the same broadcast domain (?), so I do not believe you could simply make this work using L3 networking. It seems to require that all hosts have their respective VSAN VMkernel interfaces on the same L2 broadcast domain. Or perhaps it is possible to accomplish this with L3 networking after all? Perhaps it could even be possible to daisy-chain all hosts into a complete ring using vDS and LBT?

The last thought that came to my mind is whether there is such a thing as a dual-port PCIe network adapter with hardware-based bridging built right into the controller. This would be the most desirable option. Part of this post is to get ideas on whether this bridging can be done easily at the software level, but also to provide food for thought on the development side in case it is not currently possible. Making this possible at the hardware level, directly on the card, would most likely only require a firmware flash on the card itself to enable the functionality.

The motivating factor for all of this is that 10 GbE switches are a very expensive investment for an edge location (especially if you need 2 for redundancy), yet 10 GbE dual-port PCIe cards are very cheap. Some way to eliminate the expensive switching would save a lot of money at remote sites where they cannot be justified.

Any thoughts on this?

15 Replies
zdickinson
Expert

Not sure about all that, but are you looking to do nested ESXi for testing purposes?  http://www.virtuallyghetto.com/2013/09/how-to-quickly-setup-and-test-vmware.html

Samsonite801
Enthusiast

I am familiar with the process you mention, as I used it before to nest an ESXi cluster / vCenter appliance on Workstation while prepping for my VCP5.

But for this project, I actually do have 3 physical 1U blades for ESXi and a vCenter Windows box with licenses to test on... If I can get it to work, then I might consider using 2 x 10 GbE interfaces for the VSAN instead of 2 x 1 GbE interfaces.

I have been seeing some congestion on the networking for the VSAN under decent loads, which is why I am considering moving to the recommended 10 GbE, but the expense of 10 GbE switches is not justifiable in this situation.

zdickinson
Expert

Gotcha!  Interesting idea.  I would be worried about redundancy and split-brain situations.  Would love to see you pull it off.  You mention that 10Gb switches are not viable because of cost.  Everyone's budget is different, but we just got 2 Dell N4032F switches with 5 years support for under $16k.  Going to use them with vSAN.

fattireaddict
Contributor

In the past I have tried a similar 3-host (full-mesh) scenario but bumped into a few issues due to a lack of feature support for this type of deployment. This scenario would depend on utilizing protocols like STP if you were connecting in a full mesh between the ESXi hosts. vDS and vSwitch do not support STP. Even in a non-looped topology, vDS and vSwitch instances do not forward ingress packets received on a physical uplink out another physical uplink. In your case I believe VM connectivity would be limited to:


ESXi-01 <--> ESXi-02 <--> ESXi-03  


ESXi-01 <--> ESXi-02  yes

ESXi-02 <--> ESXi-03  yes

ESXi-01 <--> ESXi-03  nope

I have not tested this recently but I have not seen any changes to virtual switching features in this regard.  
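
If anyone wants to see this for themselves, a rough way to check it (assuming ESXi 5.5 or later, which ships pktcap-uw; the vmnic names below are just examples) is to ping across the chain while capturing on the middle host's two uplinks:

```
# On ESXi-02, watch ICMP on the uplink facing ESXi-01...
pktcap-uw --uplink vmnic4 --proto 0x01

# ...and in a second SSH session, on the uplink facing ESXi-03
pktcap-uw --uplink vmnic5 --proto 0x01
```

With ESXi-01 pinging ESXi-03, you would expect the echo requests to show up on vmnic4 but never be forwarded back out vmnic5, because the vSwitch does not forward between its own uplinks.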

Samsonite801
Enthusiast

Well, as an alternative to port bridging and daisy-chaining, I have also been entertaining the idea of putting dual-port 10 GbE adapters in all 3 hosts and cabling them so each host's NIC ports connect to the other 2 hosts, then setting them all up as vDS uplinks. But this will keep the layer 2 traffic on each of the 3 links isolated from the others:

mesh.jpg

So the only problem there is, as per VSAN network requirements:

https://pubs.vmware.com/vsphere-55/index.jsp?topic=%2Fcom.vmware.vsphere.storage.doc%2FGUID-8408319D...    (the last 2 bullets from that URL I pasted here):

-Virtual SAN does not support multiple VMkernel adapters on the same subnet for load balancing. Multiple VMkernel adapters on different networks, such as VLAN or separate physical fabric, are supported.

-You should connect all hosts participating in Virtual SAN to a single L2 network, which has multicast (IGMP snooping) enabled. If the hosts participating in Virtual SAN span across multiple switches or even across L3 boundaries, you must ensure that your network is configured correctly to enable multicast connectivity. You can change multicast addresses from the defaults if your network environment requires, or if you are running multiple Virtual SAN clusters on the same L2 network.

More on that here: http://cormachogan.com/2014/01/21/vsan-part-15-multicast-requirement-for-networking-misconfiguration...

So that is the part I am not really sure about. How important is the multicast (IGMP) traffic? Can it still get across to each of the 3 hosts even though there are 3 separate links? The hosts will all be able to talk to each other, but the different networks they use will be isolated from one another.

STP is not an issue because vSwitches just don't cause loops (a vSwitch never forwards traffic received on one uplink back out another uplink). I currently use LBT (no EtherChannel, no LACP, no Spanning Tree, nothing) with both uplinks in a vDS (which is really just a hidden host-level switch on each of the 3 hosts underneath), with the physical switch running wide open with no configuration, and ESXi never causes loops; LBT just moves traffic to the uplink with the lowest utilization.

But my theory, per the first bullet point above, is that if I create SEPARATE VMkernel adapters per host, so they each connect to their respective uplink, it should work that way. But again, if VSAN is expecting to pass multicast across all links, I am not sure how this will behave, because I do not have a firm grasp on that protocol or what they are using it for.
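
For what it's worth, each host will at least tell you which multicast groups and ports VSAN wants to use, which is useful to keep in view while testing this (ESXi 5.5 command; I'm not claiming it answers the IGMP question):

```
# Lists the VSAN-tagged vmknic(s) plus the agent and master multicast group
# addresses/ports the host expects to use (defaults are 224.2.3.4 and 224.1.2.3)
esxcli vsan network list
```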

EDIT:

Again, if the Proof of Concept shows it works, then you can add a second 10 GbE card in another PCIe slot and set up a second set of paths to each host for network redundancy, using LBT if that would work.

zdickinson
Expert

If the POC works, are you going to put a production/tier 1 workload on it?

Samsonite801
Enthusiast

Well... That is a big TBD.

At this stage in the game, this idea is nothing more than a what-if project. It does have the potential to save a lot of money for a specific scenario of SMB or corporate edge (it obviously does not offer scalability for other types of deployments).

There is some indication in the bullet points I pasted above that VMware might actually support spanning across L3 boundaries, as long as the multicast communication can get through to and from each host in the cluster.

My theory is that if you set up multiple VMkernel interfaces per host (one per uplink group / physical network), this could work. Then use second uplinks with LBT enabled for the redundant paths. It would also have to pass vCenter's VSAN validation checks:

FROM THE DOCUMENTATION:

"After you make any changes to Virtual SAN configuration, vCenter Server performs validation checks for Virtual SAN configuration. Validation checks are also performed as a part of a host synchronization process. If vCenter Server detects any configuration problems, it displays error messages."

The only way I would consider putting production workloads on it is if, in the end, I would be in a quote-unquote "supported" configuration from VMware's perspective, so you have their blessing when you need support (so they don't simply say on a support call that it isn't compliant, fix it and then call us back).

It would also have to pass stress-testing and failure scenario testing to confirm the behavior when redundant paths go down, hosts or disks fail and such.
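
For that failure testing, each host can also report its own view of VSAN cluster membership from the shell, which should make any partition obvious (plain ESXi 5.5 command, nothing specific to this topology):

```
# Shows this host's role (master/backup/agent) and the sub-cluster member count
esxcli vsan cluster get
```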

It is way too early to think of this project as being anything other than 'bench racing' at this point. It is still going to be a while before I can even test this out, since the system I am planning to try this on is still running some production workloads. Soon it will be decomm'd from prod and downgraded for general testing use, at which point I would like to try this project out for fun and curiosity.. Unless somebody else wants to beat me to it and set this up first?

Samsonite801
Enthusiast

Well, I managed to test this concept (using 1 GbE adapters) on a system I was able to use for testing. I am going to get some 10 GbE adapters soon and will then move on to testing it with those.

So far in preliminary examination it seems to work fine.

The final configuration that wound up working was as follows:

Physical cabling:

mesh.jpg

This cabling means 3 separate networks are required, so I created 3 separate Distributed Switches (I couldn't get it to work on a single vDS because I needed a way to add only 2 hosts at a time to each switch container).

Then I was initially having some trouble getting the physical vmnic ports to auto-negotiate (no link lights), so I had to set the 'Configured Speed' statically to 1000 Mb on all 6 vmnic ports, which made that problem go away.

My blueprint was as follows:

Hosts:    .51        .61        .71
vmnics:    4   5      4   5      4   5
vDS #:     1   2      1   3      2   3     (1-to-1, 2-to-2, and 3-to-3)

vDS 1:  subnet 1.1.10.x  (hosts .51 & .61)
vDS 2:  subnet 1.1.20.x  (hosts .51 & .71)
vDS 3:  subnet 1.1.30.x  (hosts .61 & .71)

Here is the basic switch config on the 3 vDS:

vDS01.jpg

vDS02.jpg

vDS03.jpg

I only labeled them 10 GbE for later on, but the physical ports attached to the Uplinks are only 1 GbE for this POC.
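
In case anyone wants to reproduce the per-host part without clicking through the Web Client, the VMkernel/VSAN side boils down to a couple of esxcli commands. This is just a sketch against my blueprint above (I created the vmk adapters on the vDS port groups in the Web Client first; vmk2 and the addresses are examples from my layout):

```
# On host .51: assign the VSAN vmkernel port on vDS 1 its static address
esxcli network ip interface ipv4 set -i vmk2 -t static -I 1.1.10.51 -N 255.255.255.0

# Tag that interface for VSAN traffic, then confirm
esxcli vsan network ipv4 add -i vmk2
esxcli vsan network list
```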

Here are the health statuses reporting all fine:

VSAN01.jpg

VSAN02.jpg

VSAN03.jpg

I'm going to monitor it for a few days and make sure everything continues to report fine, and as soon as we get the 10 GbE adapters in, I'll install them and see how that goes. If it all works, I'll go ahead and set up the second set of adapters (by adding a second set of Uplinks to each vDS and using LBT) so there is no single point of failure, continue testing, and run some performance benchmarks and failure-scenario tests to see what comes of that.


Samsonite801
Enthusiast

I had a little extra time to play today, so I went to the next level in my POC using only the 1 GbE adapters and decided to set up some redundant NICs.

This process went just fine too..

This is the final mock-up, which is working well so far (Uplink 1 is one NIC adapter in each host, and Uplink 2 is the other NIC adapter in each host):

vDS04.jpg

vDS05.jpg

vDS06.jpg

As you can see in the pic below, I set each vDS DPortGroup to LBT (Route based on physical NIC load), and it seems to be working great so far. I have not placed any VMs on the VSAN datastore yet, but I have copied files like ISOs to and from it and the behavior is as expected.

LBT.jpg

So here's my final amended blueprint:

UPLINK #1:

Hosts:    .51        .61        .71
vmnics:    4   5      4   5      4   5
vDS #:     1   2      1   3      2   3     (1-to-1, 2-to-2, and 3-to-3)

UPLINK #2:

Hosts:    .51        .61        .71
vmnics:    2   3      2   3      2   3
vDS #:     1   2      1   3      2   3     (1-to-1, 2-to-2, and 3-to-3)

vDS 1:  subnet 1.1.10.x  (hosts .51 & .61)
vDS 2:  subnet 1.1.20.x  (hosts .51 & .71)
vDS 3:  subnet 1.1.30.x  (hosts .61 & .71)

So I'm sure this whole mock-up will work the same with 10 GbE, but it will still be nice to confirm there are no link issues or anything when I get those cards in.

Samsonite801
Enthusiast

Just as a side note, in my testing I am using 1U ESXi hosts which only have one PCIe slot per host. In order to have 10 GbE on both uplinks, you will most likely need a 2U host with 2 PCIe slots, or on a 1U host you would need onboard 10 GbE adapter(s), or a combination of onboard and PCIe. I am just using this setup to test the POC with 1 GbE for now.

You could also try 10 GbE on one uplink and 1 GbE on the other uplink for better protection (probably not desirable but may provide adequate protection in a bind). I know if we ever decide to use this setup in production we will use strictly 10 GbE on both uplinks.

Here is a picture of the cabling in my POC just to get a visual of how ridiculously simple this is:

mesh01.jpg

Samsonite801
Enthusiast

My ideas above about having 3 networks did not work. Once I started trying to move VMs onto the VSAN datastore, it just fell apart. Not sure if VSAN just can't figure out which path to use or what, but vMotion was failing, and there were other problems, like creating VMs on that datastore failing. And after enabling HA it started throwing other networking-related errors as well.

Back to the drawing board. I have to set this project down for a while, as I have some other things that are more important for the time being...

If anyone else wants to mess with this I wish you the best of luck. When I get more time I might work on it later or something..

5498375982eddhf
Contributor

Samsonite801,

Thank you for documenting your progress.

I was planning to do a similar setup to avoid having to purchase two 10 GbE switches. What did you end up doing to connect your servers?

Thanks.

BenzSL600
Contributor

I was wondering the same thing and quickly found out that ESXi has some limitations with regard to bridging and STP, as mentioned earlier. I do have 2 datacenter-grade switches with multiple 10 GbE ports, but they turn out to be very noisy and power-hungry for a home lab situation! Therefore, I am also looking at other ways to accomplish 10 GbE links without switches.

The new idea is to outfit one ESXi host with 3x dual-port 10 GbE NICs and put 2 of them in pass-through mode to a guest. That guest would run Linux/BSD and bridge the interfaces. The 3 ESXi hosts will each have a single 10 GbE link to one of the pass-through NIC ports. Yes, one ESXi host will have a 'loop' that links the physical world of the ESXi host to the virtualized world of the guest running the bridge 😉 If this proves to work, then VSAN can get its own 10 GbE lane while other traffic types remain on the 1 GbE lanes, and the mission will be accomplished.
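
If it helps, the bridge inside that guest is only a handful of commands. A minimal sketch, assuming a Linux guest with iproute2 and placeholder names (ens192/ens224) for the passed-through ports; the multicast_snooping tweak is something I would at least test, since with no IGMP querier on an isolated segment, snooping can starve VSAN's multicast:

```
# Bridge the two passed-through 10 GbE ports (interface names are placeholders)
ip link add name br0 type bridge
ip link set ens192 master br0 && ip link set ens192 up
ip link set ens224 master br0 && ip link set ens224 up
ip link set br0 up

# Optional: disable IGMP snooping so VSAN multicast simply floods to all ports
echo 0 > /sys/class/net/br0/bridge/multicast_snooping
```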

Samsonite801
Enthusiast

So for my setup here, I just wound up buying a 10G Brocade switch.

But more recently, I have been trying to help a friend set up a study lab where he has 5 HP 1U servers: 4 will be hosts in one cluster for cloud automation components, and the 5th would carry the Core Pod components for the same. We also had this idea to use PCI passthrough, so he got a couple of 10G dual-port NICs for the 5th host and we installed CentOS 7 with the NICs in passthrough, but for the life of me I cannot get Linux to load the mlx4_en / mlx4_core drivers.

You can see the PCI devices in lspci -v, but the driver module doesn't load. I have fussed with that for the past couple of weeks and nothing works. I wonder if Intel cards would have better compatibility?
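
For anyone who wants to compare notes, the checks I have been running inside that CentOS 7 install are just the standard driver diagnostics (nothing Mellanox-specific beyond the module names):

```
# Is the card visible on the PCI bus, and is any kernel driver bound to it?
lspci -nnk | grep -iA3 mellanox

# Try loading the modules by hand and read what the kernel says about it
modprobe mlx4_core && modprobe mlx4_en
dmesg | grep -i mlx4 | tail -n 40

# Is the module even present for the running kernel?
modinfo mlx4_en
```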

Another idea I have been toying with is to attempt a setup using SR-IOV and map the VFs over to the CentOS 7 machine. But I need to flash the firmware on the Mellanox card to enable SR-IOV, and I don't even know whether the host's system BIOS supports SR-IOV, so that is not yet known.

So the most recent experiment I tried the other day on his setup was to install the Cisco Nexus 1000V VSM/VEM. It lets me create port-profiles fine, and I can bind them to the physical interfaces and put them into the same VLAN, but it will not let me get any packets across from one physical NIC uplink to another. So far, I can only pass traffic from a vEthernet to an Ethernet port-profile (just as you can with a stock vSwitch).

The ultimate last resort I recommended to my colleague is to simply run CentOS on the bare metal, set up the br0 interface, be done with it, and run KVM virtual machines on that box, since VMware lacks the versatility to do bridging of any kind.

I've actually built a 10-port Linux switch this way before and even ran a couple of KVM VMs on it. Doing it this way is a bit trickier if you need to deploy OVA/OVF templates, because you must manually adapt them to deploy on KVM, but it does work.

spulver88
Contributor

Hi

I have built my lab similar to yours.

VSAN works great. But if I enable vSphere HA on the cluster, a warning appears on each host like "vSphere HA agent cannot reach some management network addresses".

This error is correct; for example, esx01 cannot reach the network between esx02 and esx03, and so on.

Is there any possibility to suppress this error?

I configured only the VSAN option on the 10G VSAN VMkernel ports, with no other management function...

I googled a lot but found nothing that would help me.

I installed the newest ESXi and vCenter releases.

Thanks a lot!

Sebastian
