VMware Cloud Community
oscarleopard
Contributor

Issues with Active/Standby uplinks in VDS. Communication to one node is not working correctly in Standby configuration

Although my problem is centred around vSAN, I am guessing it's a fundamental ESXi problem, so I am posting here as well; hopefully that is okay.

I am in the process of trialling vSAN to see whether it will work to replace/consolidate our existing infrastructure, and I have come across a strange problem that I am struggling to pinpoint.

Our network configuration for vSAN is as follows:

3 x nodes. Each node has 2 x 1GbE and 2 x 10GbE ports.

2 x 1GbE switches. Each node has a 1GbE connection to each switch.

1 x 10GbE switch. Each node has both 10GbE connections to this switch.

I have attempted to configure the VDS as follows, and for the most part it seems to work:

4 uplinks (Uplink 1 - 4)

Uplink 1 is vmnic0 on all nodes (1GbE)

Uplink 2 is vmnic1 on all nodes (1GbE)

Uplink 3 is vmnic2 on all nodes (10GbE)

Uplink 4 is vmnic3 on all nodes (10GbE)

For the time being, I have set the load balancing policy to "Use explicit failover order", and the uplink configuration is as follows (host-side verification sketch after this list):

Uplink 3 & 4 - Active

Uplink 1 & 2 - Standby
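As a host-side sanity check that the explicit failover order has actually been pushed down to each node, something along these lines should show it (a rough sketch; net-dvs is an unsupported, verbose command, but read-only when used like this):

# List the distributed switch and its uplinks as this host sees them
esxcli network vswitch dvs vmware list

# Dump the host-side copy of the VDS config, which includes the
# per-portgroup teaming / failover order pushed from vCenter
net-dvs -l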

When testing in this configuration, the results are as expected: using the Proactive Tests I get ~9 Gbit/s and full connectivity. When I do some local testing on ESXi using iperf3, I get the same results, and HCIBench also gives me a decent starting point.

The problem occurs when I test the failover. Essentially, what I am looking to do is ensure availability of the cluster should the single 10GbE switch go down; I don't really want to put another switch in if I can help it. To test, I turn off the switch ports that are connected to Uplinks 3 & 4, at which point I assume it should fail over to the 1GbE uplinks, and it does, of sorts.
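To confirm which physical uplink the vmkernel ports actually land on once the 10GbE ports are shut, something like this should do it on each host (assuming the vmkernel ports of interest are the vSAN/management vmk interfaces):

# Check reported link state and speed of every physical NIC
esxcli network nic list

# In esxtop's network view (press 'n'), the TEAM-PNIC column shows which
# physical uplink each port, including the vmk ports, is currently pinned to
esxtop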

When I do this and re-run the tests, I get an issue with one of the nodes not being able to talk to the others: nodes 2 & 3 are fine, but node 1 seems in some way unavailable.

I can ping each of the nodes from every other node, but when I try to run the same tests as over 10GbE using iperf3, I get the following response:

Node 2 -> Node 1

[  4] local 10.0.61.1 port 55276 connected to 10.0.60.1 port 5201

Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test

iperf3: getsockopt - Function not implemented

[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd

[  4]   0.00-1.00   sec  43.0 KBytes   351 Kbits/sec  8634728   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec  4286332568   0.00 Bytes     

On Node 1

Accepted connection from 10.0.61.1, port 20564

[  5] local 10.0.60.1 port 5201 connected to 10.0.61.1 port 25154

iperf3: getsockopt - Function not implemented

[ ID] Interval           Transfer     Bandwidth

[  5]   0.00-1.00   sec  0.00 Bytes  0.00 bits/sec                

iperf3: getsockopt - Function not implemented

[  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec                

iperf3: getsockopt - Function not implemented

[  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec                

iperf3: getsockopt - Function not implemented

[  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec                

iperf3: getsockopt - Function not implemented

[  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec                

iperf3: getsockopt - Function not implemented

[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec                

iperf3: getsockopt - Function not implemented

[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec                

iperf3: getsockopt - Function not implemented

[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec                

iperf3: getsockopt - Function not implemented

[  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec                

iperf3: getsockopt - Function not implemented

[  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec                

iperf3: getsockopt - Function not implemented

[  5]  10.00-10.11  sec  0.00 Bytes  0.00 bits/sec    

Node 2 -> Node 1 ping

vmkping  -I vmk0 10.0.60.1

PING 10.0.60.1 (10.0.60.1): 56 data bytes

64 bytes from 10.0.60.1: icmp_seq=0 ttl=64 time=0.270 ms

64 bytes from 10.0.60.1: icmp_seq=1 ttl=64 time=0.300 ms

64 bytes from 10.0.60.1: icmp_seq=2 ttl=64 time=0.277 ms

For reference, jumbo frames are not yet enabled on the 1GbE path, so there is no response:

vmkping  -I vmk0 10.0.60.1  -s 9000

PING 10.0.60.1 (10.0.60.1): 9000 data bytes
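Once jumbo frames are enabled end to end, the usual test is don't-fragment with an 8972-byte payload rather than 9000, since the 9000-byte MTU also has to carry the 20-byte IP and 8-byte ICMP headers; roughly (assuming vmk0 is the interface carrying this traffic):

# Confirm the MTU configured on the vmkernel interfaces
esxcli network ip interface list

# 8972 bytes of payload + 28 bytes of headers = 9000-byte IP packet, don't fragment
vmkping -I vmk0 -d -s 8972 10.0.60.1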

Node 2 -> Node 3

local 10.0.61.1 port 39862 connected to 10.0.62.1 port 5201

Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test

iperf3: getsockopt - Function not implemented

[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd

[  4]   0.00-1.00   sec   114 MBytes   953 Mbits/sec  8634728   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   1.00-2.00   sec  84.5 MBytes   709 Mbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   2.00-3.00   sec  84.1 MBytes   706 Mbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   3.00-4.00   sec   112 MBytes   940 Mbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   4.00-5.00   sec   112 MBytes   940 Mbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   5.00-6.00   sec   112 MBytes   940 Mbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   6.00-7.00   sec   112 MBytes   940 Mbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   7.00-8.00   sec   112 MBytes   940 Mbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   8.00-9.00   sec   112 MBytes   940 Mbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   9.00-10.00  sec   112 MBytes   940 Mbits/sec  4286332568   0.00 Bytes     

Node 3 -> Node 1

local 10.0.62.1 port 19339 connected to 10.0.60.1 port 5201

Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test

iperf3: getsockopt - Function not implemented

[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd

[  4]   0.00-1.00   sec   114 MBytes   953 Mbits/sec  8634728   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   1.00-2.00   sec   112 MBytes   942 Mbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   2.00-3.00   sec   112 MBytes   940 Mbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   3.00-4.00   sec   112 MBytes   940 Mbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   4.00-5.00   sec   112 MBytes   940 Mbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   5.00-6.00   sec   112 MBytes   940 Mbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   6.00-7.00   sec   112 MBytes   940 Mbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   7.00-8.00   sec   112 MBytes   940 Mbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   8.00-9.00   sec   112 MBytes   940 Mbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   9.00-10.00  sec   112 MBytes   940 Mbits/sec  4286332568   0.00 Bytes   

When I run these over 10GbE, they all work fine.

Node 2 -> Node 1

[  4] local 10.0.61.1 port 30126 connected to 10.0.60.1 port 5201

Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test

iperf3: getsockopt - Function not implemented

[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd

[  4]   0.00-1.00   sec  1.11 GBytes  9.57 Gbits/sec  8634728   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   1.00-2.00   sec  1.15 GBytes  9.88 Gbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   2.00-3.00   sec  1.15 GBytes  9.88 Gbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   3.00-4.00   sec   752 MBytes  6.30 Gbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   4.00-5.00   sec  1.15 GBytes  9.88 Gbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   5.00-6.00   sec  1.15 GBytes  9.88 Gbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   6.00-7.00   sec  1.15 GBytes  9.88 Gbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   7.00-8.00   sec  1.15 GBytes  9.88 Gbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   8.00-9.00   sec  1.15 GBytes  9.89 Gbits/sec    0   0.00 Bytes     

iperf3: getsockopt - Function not implemented

[  4]   9.00-10.00  sec  1.15 GBytes  9.88 Gbits/sec  4286332568   0.00 Bytes     

At present there is no specific VLAN configuration, all switches have a connection to each other, and vmkping works on both 1GbE and 10GbE, but I have no idea where to start looking for the issue between Node 2 and Node 1 when using 1GbE.
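If it helps narrow things down, the next thing I can try is capturing on node 1's vmkernel interface while node 2 runs the failing iperf3 test, to see whether the TCP traffic is arriving there at all; roughly (interface and IPs adjusted to match):

# On node 1: watch for iperf3 traffic from node 2 on the vmkernel interface
tcpdump-uw -i vmk0 host 10.0.61.1 and port 5201

# And the reverse on node 2, to see whether anything comes back
tcpdump-uw -i vmk0 host 10.0.60.1 and port 5201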

Thoughts are welcome.

2 Replies
scott28tt
VMware Employee

Moderator: Please do not post duplicate threads on the same topic.

Issue with using active/standby VDS and vSAN

Your other thread will be locked and archived.


-------------------------------------------------------------------------------------------------------------------------------------------------------------

Although I am a VMware employee, I contribute to VMware Communities voluntarily (i.e. not in any official capacity).
VMware Training & Certification blog
ZibiM
Enthusiast

Bit of an unusual address space.

Is this a single VLAN? It looks like at least a /22.

Is this simple L2, or do you have a gateway in this network?

What are the MTU settings on the VDS and the switches?

What is the VDS health check status? For example, are you sure all the VLANs are really available on all the uplinks?

Is this an all-flash vSAN?

It won't be happy with a 1Gb link.
