Issues with Active/Standby uplinks in VDS. Communication to one node is not working correctly in Standby configuration
Although my problem is centred around vSAN, I am guessing it's a fundamental ESXi problem and posting here as well, hopefully that is okay.
I am in the process of trialling vSAN to see whether it can replace/consolidate our existing infrastructure, and I have come across a strange problem that I am struggling to pinpoint.
Our network configuration for vSAN is as follows:
3 x nodes. Each node has 2 x 1GbE and 2 x 10GbE ports.
2 x 1GbE switches - each node has a 1GbE connection to each switch.
1 x 10GbE switch - each node has both 10GbE connections to this switch.
I have attempted to configure the VDS as follows, and for the most part it seems to work:
4 uplinks (1-4):
Uplink 1 is vmnic0 on all nodes (1GbE)
Uplink 2 is vmnic1 on all nodes (1GbE)
Uplink 3 is vmnic2 on all nodes (10GbE)
Uplink 4 is vmnic3 on all nodes (10GbE)
For the time being, I have set the load balancing policy to "Use explicit failover order", and the uplink config is:
Uplink 3 & 4 - Active
Uplink 1 & 2 - Standby
When testing in this configuration, the results are as expected: using the Proactive Tests I get ~9 Gbit/s and full connectivity. When I do some local testing on ESXi using iperf3, I get the same results. HCIBench also gives me a decent starting point.
The problem occurs when I test the failover. Essentially, I want to ensure availability to the cluster should the single 10GbE switch go down; I don't really want to put in another switch if I can help it. To test, I turn off the switch ports connected to Uplinks 3 & 4, at which point I assume it should fail over to the 1GbE links, and it does, of sorts.
When I do this and re-run the tests, one of the nodes cannot talk to the others: nodes 2 & 3 are fine, but node 1 seems in some way unavailable.
I can ping each of the nodes from every other node, but when I run the same tests as on 10GbE using iperf3, I get the following response:
Node 2 -> Node 1
[ 4] local 10.0.61.1 port 55276 connected to 10.0.60.1 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test
iperf3: getsockopt - Function not implemented
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 43.0 KBytes 351 Kbits/sec 8634728 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 9.00-10.00 sec 0.00 Bytes 0.00 bits/sec 4286332568 0.00 Bytes
On Node 1
Accepted connection from 10.0.61.1, port 20564
[ 5] local 10.0.60.1 port 5201 connected to 10.0.61.1 port 25154
iperf3: getsockopt - Function not implemented
[ ID] Interval Transfer Bandwidth
[ 5] 0.00-1.00 sec 0.00 Bytes 0.00 bits/sec
iperf3: getsockopt - Function not implemented
[ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec
iperf3: getsockopt - Function not implemented
[ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec
iperf3: getsockopt - Function not implemented
[ 5] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec
iperf3: getsockopt - Function not implemented
[ 5] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec
iperf3: getsockopt - Function not implemented
[ 5] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec
iperf3: getsockopt - Function not implemented
[ 5] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec
iperf3: getsockopt - Function not implemented
[ 5] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec
iperf3: getsockopt - Function not implemented
[ 5] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec
iperf3: getsockopt - Function not implemented
[ 5] 9.00-10.00 sec 0.00 Bytes 0.00 bits/sec
iperf3: getsockopt - Function not implemented
[ 5] 10.00-10.11 sec 0.00 Bytes 0.00 bits/sec
Node 2 -> Node 1 ping
vmkping -I vmk0 10.0.60.1
PING 10.0.60.1 (10.0.60.1): 56 data bytes
64 bytes from 10.0.60.1: icmp_seq=0 ttl=64 time=0.270 ms
64 bytes from 10.0.60.1: icmp_seq=1 ttl=64 time=0.300 ms
64 bytes from 10.0.60.1: icmp_seq=2 ttl=64 time=0.277 ms
For reference, jumbo frames are not yet enabled on the 1GbE links, so there is no response:
vmkping -I vmk0 10.0.60.1 -s 9000
PING 10.0.60.1 (10.0.60.1): 9000 data bytes
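(As an aside: a plain `vmkping -s 9000` can fragment by default, so it doesn't strictly prove an end-to-end jumbo path. The usual check sets don't-fragment and subtracts the 28 bytes of IP+ICMP overhead from the MTU; sketch below, assuming vmk0 is the relevant interface.)

```shell
# End-to-end jumbo MTU test from an ESXi shell:
# -d sets don't-fragment, -s 8972 = 9000 MTU minus 28 bytes of IP+ICMP headers
vmkping -I vmk0 -d -s 8972 10.0.60.1
```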
Node 2 -> Node 3
local 10.0.61.1 port 39862 connected to 10.0.62.1 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test
iperf3: getsockopt - Function not implemented
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 114 MBytes 953 Mbits/sec 8634728 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 1.00-2.00 sec 84.5 MBytes 709 Mbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 2.00-3.00 sec 84.1 MBytes 706 Mbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 3.00-4.00 sec 112 MBytes 940 Mbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 4.00-5.00 sec 112 MBytes 940 Mbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 5.00-6.00 sec 112 MBytes 940 Mbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 6.00-7.00 sec 112 MBytes 940 Mbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 7.00-8.00 sec 112 MBytes 940 Mbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 8.00-9.00 sec 112 MBytes 940 Mbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 9.00-10.00 sec 112 MBytes 940 Mbits/sec 4286332568 0.00 Bytes
Node 3 -> Node 1
local 10.0.62.1 port 19339 connected to 10.0.60.1 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test
iperf3: getsockopt - Function not implemented
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 114 MBytes 953 Mbits/sec 8634728 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 1.00-2.00 sec 112 MBytes 942 Mbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 2.00-3.00 sec 112 MBytes 940 Mbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 3.00-4.00 sec 112 MBytes 940 Mbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 4.00-5.00 sec 112 MBytes 940 Mbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 5.00-6.00 sec 112 MBytes 940 Mbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 6.00-7.00 sec 112 MBytes 940 Mbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 7.00-8.00 sec 112 MBytes 940 Mbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 8.00-9.00 sec 112 MBytes 940 Mbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 9.00-10.00 sec 112 MBytes 940 Mbits/sec 4286332568 0.00 Bytes
When I run these over 10GbE, they all work fine.
Node 2 -> Node 1
[ 4] local 10.0.61.1 port 30126 connected to 10.0.60.1 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test
iperf3: getsockopt - Function not implemented
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 1.11 GBytes 9.57 Gbits/sec 8634728 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 1.00-2.00 sec 1.15 GBytes 9.88 Gbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 2.00-3.00 sec 1.15 GBytes 9.88 Gbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 3.00-4.00 sec 752 MBytes 6.30 Gbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 4.00-5.00 sec 1.15 GBytes 9.88 Gbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 5.00-6.00 sec 1.15 GBytes 9.88 Gbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 6.00-7.00 sec 1.15 GBytes 9.88 Gbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 7.00-8.00 sec 1.15 GBytes 9.88 Gbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 8.00-9.00 sec 1.15 GBytes 9.89 Gbits/sec 0 0.00 Bytes
iperf3: getsockopt - Function not implemented
[ 4] 9.00-10.00 sec 1.15 GBytes 9.88 Gbits/sec 4286332568 0.00 Bytes
At present there is no specific VLAN configuration, all switches have a connection to each other, and vmkping works on both 1GbE and 10GbE, but I have no idea where to start looking for the Node 2 -> Node 1 issue when using 1GbE.
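For reference, the state of the NICs and the distributed switch after the 10GbE ports are shut down can be inspected from an ESXi shell; a diagnostic sketch using standard esxcli commands (interface names are from this setup):

```shell
# Confirm link state and what the VDS currently sees after failover
esxcli network nic list                   # vmnic0-3 link status and speed
esxcli network vswitch dvs vmware list    # DVS config, incl. uplink mapping
esxcli network ip interface list          # vmk interfaces and their switch/portset
```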
Thoughts are welcome.
Moderator: Please do not post duplicate threads on the same topic.
Issue with using active/standby VDS and vSAN
Your other thread will be locked and archived.
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Although I am a VMware employee, I contribute to VMware Communities voluntarily (i.e. not in any official capacity)
VMware Training & Certification blog
A bit of an unusual address space.
Is this a single VLAN? It looks like at least a /22.
Is this simple L2, or do you have a gateway in this network?
What are the MTU settings on the VDS and the switches?
What is the VDS health check status - i.e. are you sure all the VLANs are really available on all the uplinks?
Is this an all-flash vSAN?
It won't be happy with the 1GbE interface link.
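The MTU question above can be answered from an ESXi shell; a quick sketch using standard esxcli commands:

```shell
# Check configured MTU on the vmk interfaces and on the distributed switch
esxcli network ip interface list          # per-vmk MTU
esxcli network vswitch dvs vmware list    # DVS MTU and uplink configuration
```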