Hi,
I have a fully collapsed 3-host cluster (the 3 hosts carry the transport, management and edge roles). I noticed that the Geneve tunnels between the host transport nodes and the edge node don't come up from two of the transport nodes, but they do from one of them!
I use the same layer-3 segment and the same VLAN ID for the VTEPs of both the host transport nodes and the edge transport node. I have read a lot of documentation, community threads and blogs, and they all say you cannot share the same VLAN ID between host and edge transport nodes IF the host's own VTEPs and the Edge VM's VTEPs use the SAME virtual switch on the host. But that is not my case. My three hosts have 4 physical vmnics (0 to 3). I use an N-VDS with two physical vmnics (2 and 3) for the host transport node's VTEPs, and a separate vSphere DVS with the other two vmnics (0 and 1) where the Edge VM is connected. So I thought I could use the same VLAN ID for all the VTEPs. I'm not sure whether this is actually the reason the Geneve tunnels don't come up from two of the host transport nodes while they do from the third. Presumably, if the shared VLAN ID were the problem, the tunnels wouldn't come up from any of the hosts, not just two of them. Am I correct?
Here are a couple of images where you can see that from one host the tunnels come up, and from another they don't.
Does anyone know why this could be happening?
Thank you in advance,
Guido.
You can use the same VLAN if you are spreading the host nodes and the edge nodes across switches. More than likely you have an MTU or trunking problem, so I'd check end to end that the MTU is at least 1600 and that the proper VLAN IDs are trunked.
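For reference, the 1600 figure comes from the Geneve encapsulation overhead. A quick sketch of the arithmetic, assuming IPv4 outer headers and a base Geneve header with no options:

```python
# Geneve encapsulation overhead added on top of the inner (VM) frame.
OUTER_IP = 20      # outer IPv4 header
OUTER_UDP = 8      # outer UDP header (Geneve uses dst port 6081)
GENEVE_BASE = 8    # Geneve base header; options add more bytes
INNER_ETH = 14     # inner Ethernet header carried inside the payload

overhead = OUTER_IP + OUTER_UDP + GENEVE_BASE + INNER_ETH  # 50 bytes
required_mtu = 1500 + overhead  # 1550 to carry a standard 1500-byte inner frame
print(overhead, required_mtu)   # 1600 is recommended to leave room for options
```

So a 1500-byte inner frame needs at least a 1550-byte underlay MTU; the commonly recommended 1600 leaves headroom for Geneve options.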
I have run a lot of tests to rule out MTU and connectivity problems, and I got the following results:
From each ESXi CLI I executed the following ping tests:
vmkping -s 1572 -S vxlan -I vmk10 -d 192.168.12.x
"x" is each vxlan vmkernel (VTEP) of the other two ESXi and lastly the VTEP IP of the Edge Node.
From the two hosts whose Geneve tunnels to the Edge Node are down (hosts 5 and 6), I can reach all the VTEPs of the other host transport nodes (ESXi hosts) but not the edge one. That is, I cannot ping the VTEP of the edge node, which matches what I see in the GUI: the Geneve tunnel to it is down from those two hosts, while the tunnels to the other two transport nodes are up.
I ran another test, removing the size argument from the ping to the Edge Node's VTEP IP to rule out an MTU problem, and I still couldn't reach it. It is as if the Edge Node's VTEP didn't exist for those hosts: no ping response.
But from the one ESXi host of the three that has its Geneve tunnel up and running to the Edge Node, I can ping all the VTEPs (the ones that belong to the host transport nodes and the one that belongs to the edge node). This is also consistent with what I see in the GUI: that ESXi host (host 14) has its tunnel to the Edge Node up.
Strange thing:
Just to see if there was something wrong in the configuration of the VDS the Edge VM is connected to, I created a vmkernel port for hosts 5 and 6 in the same Distributed Port Group (with the default settings, not associated with any particular IP stack) and configured it with an IP address from the same subnet as the VTEPs. Remember, those two hosts have their tunnels to the Edge Node down.
I executed the same ping using that vmkernel interface as the source, and I couldn't reach any of the other hosts' VTEPs (the ones that were reachable before when sourcing from the VTEP interfaces), but I COULD reach the Edge Node's VTEP! That is, for some reason, from a vmkernel port in the same VLAN and subnet as the VTEPs, I cannot reach the other hosts' VTEPs, yet I can reach the very VTEP that is unreachable from the VTEPs created on each host's N-VDS.
I don't know if it has anything to do with the IP stack in use. (The VTEP vmkernels live on the separate "vxlan" TCP/IP stack, hence the -S vxlan in the vmkping, with its own routing table, while my test vmkernel uses the default stack.)
But from the point of view of the Edge Node VTEP, what is the difference between the ping request it receives from the VTEPs of hosts 5 or 6 and the same ping request it receives from the vmkernel port of host 5? Just the source IP address. The ping packet carries no information about the source ESXi interface that originated it, nor about the IP stack; it is a standard ICMP packet!
One thing I don't know how to test:
I don't know how to generate from the Edge Node the same ping requests that I do from the ESXi CLI. From the Edge Node CLI, the only interface I see (if I issue the "get interfaces" command) is the management one, not the VTEP it has configured. I can see the VTEP by issuing the command "get logical switches", but is there any way to generate a ping request using that IP address as the source?
Thank you,
Guido
You can do the pinging from the edge:
get tunnel-port
e.g.:
Tunnel : 49905f4e-a4ed-52bf-a596-70958395d223
IFUID  : 266
LOCAL  : 10.0.9.106
REMOTE : 10.0.9.104
ENCAP  : GENEVE
ping 10.0.9.104 source 10.0.9.106 vrfid 0 size 1572 dfbit enable
PING 10.0.9.104 (10.0.9.104) from 10.0.9.106: 1572 data bytes
1580 bytes from 10.0.9.104: icmp_seq=0 ttl=64 time=0.329 ms
1580 bytes from 10.0.9.104: icmp_seq=1 ttl=64 time=0.323 ms
1580 bytes from 10.0.9.104: icmp_seq=2 ttl=64 time=0.278 ms
If the MTU is incorrect, you see something like:
ping 10.0.9.104 source 10.0.9.106 vrfid 0 size 1800 dfbit enable
PING 10.0.9.104 (10.0.9.104) from 10.0.9.106: 1800 data bytes
36 bytes from 10.0.9.107: frag needed and DF set (MTU 1600)
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 0724 0000 0 0000 40 01 0d08 10.0.9.106 10.0.9.104
Or
vrf 0
get neighbor
Interface : f75fb918-a629-5f62-83d0-ff98e832d553
IP : 10.0.9.102
MAC : 00:50:56:6a:bf:cf
State : reach
Timeout : 615
And then ping
ping 10.0.9.102 size 1572 dfbit enable
PING 10.0.9.102 (10.0.9.102): 1572 data bytes
1580 bytes from 10.0.9.102: icmp_seq=0 ttl=64 time=0.454 ms
I get the following output from the commands you told me, Chris:
nsxtesg0(vrf)> get neighbor
Logical Router
UUID : 736a80e3-23f6-5a2d-81d6-bbefb2786666
VRF : 0
LR-ID : 0
Name :
Type : TUNNEL
Neighbor
Interface : fdb716f0-4a7a-50a4-9cff-6872f19c73de
IP : 192.168.12.12
MAC : 00:50:56:62:f7:92
State : reach
Timeout : 341
Interface : fdb716f0-4a7a-50a4-9cff-6872f19c73de
IP : 192.168.12.13
MAC : 00:50:56:68:8d:54
State : reach
Timeout : 921
Interface : fdb716f0-4a7a-50a4-9cff-6872f19c73de
IP : 192.168.12.15
MAC : 00:50:56:68:35:3c
State : reach
Timeout : 656
Interface : fdb716f0-4a7a-50a4-9cff-6872f19c73de
IP : 192.168.12.16
MAC : 00:50:56:60:d5:db
State : reach
Timeout : 1083
Interface : fdb716f0-4a7a-50a4-9cff-6872f19c73de
IP : 192.168.12.19
MAC : 00:50:56:94:1f:42
State : reach
Timeout : 563
Interface : fdb716f0-4a7a-50a4-9cff-6872f19c73de
IP : 192.168.12.14
MAC : 00:50:56:6e:9e:ae
State : reach
Timeout : 296
Interface : fdb716f0-4a7a-50a4-9cff-6872f19c73de
IP : 192.168.12.11
MAC : 00:50:56:66:fc:79
State : reach
Timeout : 321
As you can see, all the neighbors show "reach" (I don't know whether that means the edge node can reach that neighbor), but the only tunnels I have up are to 192.168.12.11 and 192.168.12.12, which are also the only ones that respond to a ping:
nsxtesg0(vrf)> ping 192.168.12.13
PING 192.168.12.13 (192.168.12.13): 56 data bytes
--- 192.168.12.13 ping statistics ---
3 packets transmitted, 0 packets received, 100.0% packet loss
nsxtesg0(vrf)> ping 192.168.12.11
PING 192.168.12.11 (192.168.12.11): 56 data bytes
64 bytes from 192.168.12.11: icmp_seq=0 ttl=64 time=1.156 ms
64 bytes from 192.168.12.11: icmp_seq=1 ttl=64 time=1.707 ms
64 bytes from 192.168.12.11: icmp_seq=2 ttl=64 time=2.186 ms
--- 192.168.12.11 ping statistics ---
4 packets transmitted, 3 packets received, 25.0% packet loss
round-trip min/avg/max/stddev = 1.156/1.683/2.186/0.421 ms
nsxtesg0(vrf)> ping 192.168.12.11 size 1572 dfbit enable   // forcing the packet size and the don't-fragment bit
PING 192.168.12.11 (192.168.12.11): 1572 data bytes
1580 bytes from 192.168.12.11: icmp_seq=0 ttl=64 time=1.525 ms
1580 bytes from 192.168.12.11: icmp_seq=1 ttl=64 time=1.693 ms
--- 192.168.12.11 ping statistics ---
3 packets transmitted, 2 packets received, 33.3% packet loss
round-trip min/avg/max/stddev = 1.525/1.609/1.693/0.084 ms
Very strange. I don't know what else to test. Tomorrow perhaps I'll change the IP subnet and VLAN ID for the Edge Node's VTEP, although I don't think this is the problem: if it were, I couldn't have the tunnel to one host up. There would be no tunnels up at all if the problem were the VTEP VLAN ID.
Thanks.
Guido.
Not helping, but just saying that I am really looking forward to ideas on this issue. I am having the same problem.
It's a lab; it's for learning. To try to fix the issue I reinstalled everything: fresh ESXi, vCenter, NSX-T, NSX Edge, all latest releases.
Both TEPs are configured on the same VLAN 0. I tried having my vDS in trunk mode or on a single VLAN, with no impact (it's all on the same host here anyway).
nsx-edge-1a> get bfd-sessions
BFD Session
Dest_port : 3784
Diag : No Diagnostic
Encap : geneve
Forwarding : last false (current false)
Interface : 3a89989f-22a8-5673-8d52-12a1e0a91925
Keep-down : false
Last_cp_diag : No Diagnostic
Last_cp_rmt_diag : No Diagnostic
Last_cp_rmt_state : down
Last_cp_state : down
Last_fwd_state : NONE
Last_local_down_diag : No Diagnostic
Last_remote_down_diag : No Diagnostic
Local_address : 10.129.255.11
Local_discr : 2377231423
Min_rx_ttl : 255
Multiplier : 3
Received_remote_diag : No Diagnostic
Received_remote_state : down
Remote_address : 10.129.255.10
Remote_admin_down : false
Remote_diag : No Diagnostic
Remote_discr : 0
Remote_min_rx_interval : 0
Remote_min_tx_interval : 0
Remote_multiplier : 0
Remote_state : down
Router_down : false
Rx_cfg_min : 1000
Rx_interval : 1000
Session_type : TUNNEL
State : down
Tx_cfg_min : 100
Tx_interval : 1000
nsxedge1(vrf)> get neighbor
Logical Router
UUID : 736a80e3-23f6-5a2d-81d6-bbefb2786666
VRF : 0
LR-ID : 0
Name :
Type : TUNNEL
Neighbor
Interface : d843afab-ea93-540b-a8a4-766dc9c89e9f
IP : 10.129.255.10
MAC : 00:50:56:63:fb:f4
State : reach
Timeout : 208
nsxedge1(vrf)> ping 10.129.255.10 source 10.129.255.11 size 1572
PING 10.129.255.10 (10.129.255.10) from 10.129.255.11: 1572 data bytes
1580 bytes from 10.129.255.10: icmp_seq=0 ttl=64 time=0.594 ms
1580 bytes from 10.129.255.10: icmp_seq=1 ttl=64 time=0.440 ms
1580 bytes from 10.129.255.10: icmp_seq=2 ttl=64 time=0.596 ms
OK, I think someone documented our problem and its solution here: https://www.spillthensxt.com/nsx-t-tep-ip-addressing/
It looks like it is mandatory to have the traffic go out of the dvSwitch. I was hoping a 2-node cluster could have an edge built in without involving external L3 routing... I don't see how now.
I will try moving the edge outside the cluster and report back here soon.
FYI: moving the edge outside the cluster worked. I vMotioned my edge to a nearby cluster and connected its second NIC (the one for overlay) to a dedicated standard switch, using a physical NIC directly connected to the network of that remote distributed switch.
I find this to be a big limitation: we want a full software stack, but for this we need traffic to go out on a physical NIC and then back in on that same NIC.
Hopefully someone can provide a better solution, but the previously shared URL is very good at explaining the 3 workarounds.
Hello,
First of all, it looks like a challenging issue.
As I understand it (and please correct me if I am wrong), you have 4 pNICs on each server, used as follows:
2 pNICs for the vDS (vSphere management, vMotion, Edge TEP, Edge uplinks, etc.)
2 pNICs for the N-VDS (host TEP)
If that is the scenario, it is mandatory to use two different VLANs for TEP (one for hosts and one for edges). I know, and I can see, that one server can communicate with the edges within the same subnet.
But think about it from the networking and tagging side: with the same VLAN, the traffic sometimes has to leave through the first two uplinks for host-to-host overlay communication, and sometimes through the other two uplinks for host-to-edge communication.
Therefore, I recommend using two different VLANs for TEP, and be careful with the tagging, especially when using a single-N-VDS deployment for the edges (rather than three N-VDSs: TEP, Uplink1 and Uplink2).
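To illustrate the separation being recommended, here is a minimal sketch of such a TEP addressing plan. The VLAN IDs, subnets and uplink assignments are hypothetical, not taken from this thread:

```python
# Hypothetical TEP plan keeping host and edge TEPs in separate VLANs/subnets,
# since host TEPs use the N-VDS uplinks while the Edge VM's TEP traffic
# enters through the vDS uplinks.
tep_plan = {
    "host-tep": {"vlan": 120, "subnet": "172.16.120.0/24", "uplinks": ["vmnic2", "vmnic3"]},
    "edge-tep": {"vlan": 121, "subnet": "172.16.121.0/24", "uplinks": ["vmnic0", "vmnic1"]},
}

# Sanity check: distinct VLANs and subnets, so host<->edge TEP traffic is
# routed through the physical network rather than bridged in one broadcast domain.
assert tep_plan["host-tep"]["vlan"] != tep_plan["edge-tep"]["vlan"]
assert tep_plan["host-tep"]["subnet"] != tep_plan["edge-tep"]["subnet"]
print("TEP VLAN separation OK")
```

With this layout the physical switches route between the two TEP subnets, which avoids the ambiguity of the same VLAN appearing on two different sets of uplinks.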
If you need more information, I am glad to be of service.
I finally solved the problem. I had to open a support ticket with VMware.
The problem was that the tunnels between the Host Transport Nodes and the Edge Node were not up and running. However, in the NSX Manager GUI they appeared as up, in green! There is a bug where the GUI shows tunnels as UP and running when they are actually down, with no connectivity between the VTEPs. (You can see in the image that the tunnel to IP 192.168.12.17 shows as up when it is actually down.)
Those tunnels were down because there was no layer-3 connectivity between the Host Transport Node VTEPs and the Edge Node VTEP. This infrastructure was running on an HP C7000 chassis with two Virtual Connect switches. There was a problem in the configuration of the core switch LAG ports: the core ports connected to the HP Virtual Connect switches shouldn't have been configured as a LAG (port-channel), but as single ports.
We unconfigured the LAG and everything started to work.
Thanks for your help.
Guido.