These days more and more customers are using a hardware VTEP (HW VTEP) instead of the VMkernel (VMK) VTEP on ESXi. The reasons customers give for choosing a HW VTEP vary: performance, capacity, or hardware that is already in place. To be honest, in my experience I haven't seen much benefit in choosing a HW VTEP over the ESXi VTEP. What it definitely does is introduce complexity into the network and into troubleshooting. This post collects useful steps for troubleshooting this kind of environment.
First, before we start with any troubleshooting examples, let's follow the normal path a packet takes between a physical PC and a VM in a vSphere/NSX environment.
As we can see from the example, the flow is:
1 PC sends an ARP request, which is a broadcast, to find VM 1
2 HW VTEP reports the PC MAC it learned on the port to the NSX Controller
3 HW VTEP learns the VM MAC from the Controller (the Controller should know the MAC address of the VM via a <JOIN> message sent to the Controller when the VM is attached to the VNI)
4 The vDS prepared for NSX shares the VM MAC with the Controller
5 The NSX vDS requests the remote MAC of the PC from the Controllers, which the HW VTEP should already have shared with the Controller
6 VM sends the ARP reply back to the PC via the tunnel between ESXi and the HW VTEP
7 HW VTEP forwards the ARP response to the PC
With these steps, communication should take place.
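The MAC exchange in steps 2 to 5 can be sketched as a toy model. This is only an illustration of who reports what to whom, not how NSX implements it: the Controller class, the method names, the VNI, and all MAC and IP values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Toy stand-in for the NSX Controller MAC table, keyed by (VNI, MAC)."""
    mac_table: dict = field(default_factory=dict)

    def report_mac(self, vni: int, mac: str, vtep_ip: str) -> None:
        # A VTEP (hardware or ESXi) tells the controller where a MAC lives.
        self.mac_table[(vni, mac)] = vtep_ip

    def lookup_mac(self, vni: int, mac: str):
        # A VTEP asks which remote VTEP sits in front of a MAC.
        # Returns None when nobody has reported that MAC yet.
        return self.mac_table.get((vni, mac))


def simulate_flow():
    # All values below are made up for the example.
    vni = 5000
    pc_mac = "00:00:00:aa:aa:01"
    vm_mac = "00:50:56:bb:bb:01"
    hw_vtep_ip = "192.0.2.10"
    esxi_vtep_ip = "192.0.2.20"

    ctrl = Controller()
    ctrl.report_mac(vni, pc_mac, hw_vtep_ip)    # step 2: HW VTEP reports the PC MAC
    ctrl.report_mac(vni, vm_mac, esxi_vtep_ip)  # step 4: vDS shares the VM MAC
    to_vm = ctrl.lookup_mac(vni, vm_mac)        # step 3: HW VTEP learns the VM MAC
    to_pc = ctrl.lookup_mac(vni, pc_mac)        # step 5: vDS asks for the PC MAC
    return to_vm, to_pc


if __name__ == "__main__":
    print(simulate_flow())
```

If either report_mac call is left out, the matching lookup returns None. In the toy model that corresponds to one side never learning where the other side's MAC lives.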
As can be seen, there are a lot of handshakes between the HW VTEP and the NSX components. This is where most of the issues happen, and this is what we need to focus on.
The main troubleshooting area for a HW VTEP is the HW VTEP itself.
From the NSX side we just need to ensure that the environment is ready, for example that host preparation is complete.
The most common misconfigurations are:
- Binding to non-existent physical ports
- The ToR certificate is not configured properly
- Not connecting the ToR to any of the controller nodes. The HW VTEP needs to be connected to one of the controllers, and that controller then pushes an OVSDB transaction to make the ToR connect to all controllers.
- Forgetting to prepare the hosts, which is necessary, meaning installing the VIBs.
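The list above can be turned into a simple pre-check. The sketch below is only a way to organise the checks: it assumes you have already collected the facts by hand from NSX and the ToR, and the function name and every dictionary key are hypothetical, not an NSX or vendor API.

```python
def find_misconfigurations(state: dict) -> list:
    """Return a list of human-readable problems found in the collected state."""
    problems = []

    # Bindings that point at ports the ToR does not actually have.
    missing = sorted(set(state["bound_ports"]) - set(state["physical_ports"]))
    if missing:
        problems.append("bindings reference non-existent ports: " + ", ".join(missing))

    if not state["tor_certificate_ok"]:
        problems.append("ToR certificate is not configured properly")

    # The ToR must reach at least one controller; the rest is pushed via OVSDB.
    if state["connected_controllers"] == 0:
        problems.append("ToR is not connected to any controller node")

    if not state["hosts_prepared"]:
        problems.append("hosts are not prepared (VIBs not installed)")

    return problems


if __name__ == "__main__":
    # Hypothetical values gathered manually for one ToR.
    example = {
        "bound_ports": ["ge-0/0/1", "ge-0/0/9"],
        "physical_ports": ["ge-0/0/1", "ge-0/0/2"],
        "tor_certificate_ok": True,
        "connected_controllers": 0,
        "hosts_prepared": True,
    }
    for problem in find_misconfigurations(example):
        print(problem)
```

An empty result only means none of these four items was flagged; it says nothing about the state of the ToR itself.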
If we are sure that all of this looks OK and there are no alerts or strange messages in NSX, then we can go to the ToR and check <show bfd neighbors>
When we add replication hosts and bindings to the Logical Switch, we need to check whether BFD is up. BFD will not come up until we specify a replication host and a binding for the Logical Switch.
Please refer to the documentation for your specific vendor; Juniper is one example.
Summary: most of the troubleshooting for a HW VTEP needs to be done on the ToR, not in the NSX environment. To confirm what, where and where is the packet (as my colleague Jose says), we need to capture packets.
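Once a capture is taken, the first thing to confirm is usually whether the encapsulated packet carries the expected VNI. The sketch below reads the 8-byte VXLAN header as defined in RFC 7348. It assumes the tunnel uses standard VXLAN encapsulation and that the UDP payload has already been extracted from the capture; the function name is made up for the example.

```python
import struct


def parse_vxlan_header(udp_payload: bytes) -> dict:
    """Decode the 8-byte VXLAN header at the start of a UDP payload."""
    if len(udp_payload) < 8:
        raise ValueError("VXLAN header needs at least 8 bytes")

    # Layout: 1 byte flags, 3 reserved bytes, 3 bytes VNI, 1 reserved byte.
    flags, vni_word = struct.unpack("!B3xI", udp_payload[:8])

    return {
        "vni_valid": bool(flags & 0x08),  # the I flag must be set for a valid VNI
        "vni": vni_word >> 8,             # drop the trailing reserved byte
        "inner_frame": udp_payload[8:],   # the original Ethernet frame
    }


if __name__ == "__main__":
    # Hand-built header for VNI 5000, followed by a dummy inner frame.
    sample = bytes([0x08, 0, 0, 0]) + (5000 << 8).to_bytes(4, "big") + b"\xff" * 14
    print(parse_vxlan_header(sample)["vni"])
```

Comparing the decoded VNI with the one configured on the Logical Switch shows whether the packet is on the tunnel you expect.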