VMware Networking Community
nneogi
Contributor

Question about vMAC on the DLR LIFs

Hello,

Based on the NSX Design Guide: “Each LIF has assigned an IP address representing the default IP gateway for the logical L2 segment it connects to and a vMAC address. The IP address obviously is unique per LIF, whereas the same vMAC is assigned to all the defined LIFs”

Testing this in the HOL labs, I started a ping from 172.16.10.11 -> 172.16.20.11. I configured a mirror port (ERSPAN) with the source as app-sv-01a (172.16.20.11) (no tcpdump available on the HOL web/app servers!). Looking at the MAC addresses on the packet "inside" the GRE-encapsulated packet (see the snapshot below), I have a few questions:

1. Where is the source MAC on the ICMP request packet, 00:50:56:f1:f7:a2, coming from? What CLI command would show this MAC address?

2. If each LIF has the same vMAC, why does the "arp -an" output on the corresponding servers show different MACs for the 10.1 and 20.1 LIFs? Is this expected, a nested-lab thing, or am I missing something?

vmac-question.png

6 Replies
steveplz
Enthusiast

I just completed my VCP-NV, so I'll have a crack at explaining it, and any more experienced onlookers may care to check my logic and weed out any typos 🙂

My first point is that the encapsulation would be VXLAN and not GRE if you're on NSX-v. NSX-mh can use GRE, STT and also VXLAN. Just saying, in case you're answering any test questions in the future. I'm not breaking any non-disclosure here as I honestly don't recall if that question came up, and in any case it's a pretty obvious fact you should learn if you follow the exam blueprint and study the differences between NSX-v and NSX-mh.

The first error in your thinking is that you are applying the theory as if the VMs were on the same subnet and L2 domain, but they are not.


In the example you quoted, the two hosts are on different subnets:

web-sv-01a 172.16.10.11 is on the 172.16.10.0/24 network

whereas

app-sv-01a 172.16.20.11 is on the 172.16.20.0/24 network

NOTE: This will mean that there are two logical switches, since a DLR may not connect more than one LIF to the same logical switch. Since each logical switch is uniquely identified by a VNI, there will therefore be two VNIs. Let's allocate 5001 and 5002 in case we need them further on down.

Whether the network is physical or virtual, it still holds true that these VMs must communicate with one another via a router, albeit in this case a virtual router, or DLR. So when encapsulating packets at L2, each VM will use its default gateway (an interface on the DLR instance) as the destination MAC address, and any frames it receives from the other subnet will be sourced from that same DLR interface MAC address. It's just Routing 101 as you already know it. It's virtually the same (oops, couldn't help the pun, sorry).

Your second misunderstanding is a knock-on effect of the first. Since the VMs are on different subnets, they will be communicating with a different DLR interface, and so a different LIF and vMAC.

The MAC address on a DLR Interface connecting to a Logical Switch is called a vMAC.

This vMAC is the same for that DLR and LIF pair across all hypervisors in the transport zone.

So let's say we have one hypervisor with these two VMs on it, and let's call the router instance DLR1, the logical interfaces LIF1 and LIF2, and .99 the host IP address of the DLR LIFs on their respective subnets.

Using fictitious MAC addresses for easier illustration -

web-sv-01a has DLR1/LIF1 IP address 172.16.10.99/24 MAC 00:00:00:00:10:99 as its default gateway

app-sv-01a has DLR1/LIF2 IP address 172.16.20.99/24 MAC 00:00:00:00:20:99 as its default gateway

When web-sv-01a attempts for the first time to send an IP packet to app-sv-01a

web-sv-01a will ARP for the MAC address of its DFGW, which is DLR1/LIF1

so on this network segment 172.16.10.0/24 for all traffic between the VMs

IP packets will have the web-sv-01a's and app-sv-01a's IP addresses in the source and destination fields

but at L2 web-sv-01a's and DLR1/LIF1 vMAC address in the source and destination fields

     NOTE: Which way around the source and destination addresses go depends on the direction of the traffic, i.e. to or from the host via the router, but you can work that out.

Once the ARP process is done -

web-sv-01a's ARP table

IP Address            MAC Address           IF

172.16.10.99         00:00:00:00:10:99     1

DLR1's ARP table

IP Address            MAC Address           IF

172.16.10.11         00:00:00:00:10:11     1

So as per usual web-sv-01a has no clue of the actual MAC address of app-sv-01a.

Similarly, when DLR1 attempts for the first time to route a packet originating from web-sv-01a to app-sv-01a, it will first ARP for the MAC address of app-sv-01a via DLR1/LIF2

so on this network 172.16.20.0/24

IP packets will have the web-sv-01a's and app-sv-01a's IP addresses (NO CHANGE AT THE IP LAYER)

but at L2 app-sv-01a's and DLR1/LIF2 vMAC address


Once the ARP process is done -

app-sv-01a's ARP table

IP Address            MAC Address           IF

172.16.20.99         00:00:00:00:20:99     1

DLR1's ARP table

IP Address            MAC Address           IF

172.16.10.11         00:00:00:00:10:11     1

172.16.20.11         00:00:00:00:20:11     2     THE DIFFERENCE NOW IS THAT THE DLR HAS THE APP SERVER MAC TOO

NOTE: No VXLAN encapsulation occurred since the VMs were on the same host and the Controllers were not involved in the ARP traffic (although the security module in each hypervisor would have updated the VTEP and ARP tables on both the hosts and the Controller cluster)

Now let's consider multiple ESXi hosts (hypervisors).

In this second example, let's say we still do the same and send traffic from the web server to the app server as above, but the VMs are on different ESXi hosts.

e.g. let's say web-sv-01a was on Host1 and app-sv-01a was on Host2.

In this case the rule applies that routing is done by the DLR kernel instance on the host closest to the source of the traffic, so traffic originating from web-sv-01a would be routed by the DLR kernel module on Host1, and traffic originating from app-sv-01a would be routed by the DLR kernel module on Host2. It may be a bit hard to follow, but there are some good blogs and explanations out there, so take a look around if you want pictures, e.g. http://networkinferno.net/nsx-compendium#Logical_Distributed_Routing

NOTE: This principle of routing closest to the source is an important one to remember for your packet walk/trace theory and is also part of the exam blueprint.

The next step is for the logical switch kernel instance on Host1 to encapsulate the traffic in VXLAN, with VNI 5001 in the VXLAN header (remember, VNIs for the two logical switches were allocated earlier for these examples), with Host1's VTEP IP address as the source and Host2's VTEP as the destination in the outer IP header of the UDP packet, and send it out over the physical network to Host2's VTEP IP. At Host2 it is decapsulated by the logical switch VNI 5001 kernel instance and forwarded to the app-sv-01a VM (notice there is no routing at this end, as it was already done on Host1).
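To make the encapsulation step concrete, here is a rough sketch of what the packet on the physical wire between the two hosts would look like in this example (the VTEP IPs are made up purely for illustration, and the UDP port depends on the NSX-v version):

Outer Ethernet   src = Host1 uplink MAC, dst = next hop (or Host2 uplink MAC if L2-adjacent)
Outer IP         src = 192.168.250.51 (Host1 VTEP), dst = 192.168.250.52 (Host2 VTEP)
Outer UDP        dst port 4789 (or 8472 on older NSX-v builds)
VXLAN header     VNI 5001
Inner frame      the original routed Ethernet frame, with the IP addresses unchanged end to end: 172.16.10.11 -> 172.16.20.11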


I'm assuming that return traffic in the reverse direction, from app-sv-01a to web-sv-01a, behaves in much the same way. This time it is routed by Host2's DLR kernel instance onto the router LIF associated with the logical switch having VNI 5002, encapsulated by Host2's VXLAN kernel module with VNI 5002 in the VXLAN header and Host1's VTEP IP address as the destination in the outer IP header, then sent to the physical network to be forwarded to Host1's VTEP interface, decapsulated by the VXLAN kernel module on Host1, and switched to web-sv-01a by the logical switch instance to which web-sv-01a is connected.

Phew, that took a long time to write, but I think I have it. I hope you can follow it.

As for your questions on command line and troubleshooting, have a look at Rich Dowling's blog (VCP-NV | YAVB - Rich Dowling) and the last section, section 9, of the VCP-NV blueprint, where there's a goodly bunch of troubleshooting CLI commands, etc.

Have fun and hope your question was answered. This was my first reply to a VMware technical post ever so please be kind and I welcome your constructive feedback both positive and negative. It's all a learning process.

nneogi
Contributor

Hello,

Thank you for your response. With regards to your first point, the reason the attached capture is GRE-encapsulated is that I set up a mirror port (ERSPAN) to see exactly what the packets looked like from app-sv-01a's (172.16.20.11) perspective. You'll see in the snapshot I attached that the packet is GRE-encapsulated, but take a closer look at the inner packet. The question really was that the source MAC on that inner packet does not match what I see in the ARP table on app-sv-01a for the corresponding DLR LIF on that segment. I need to run through the lab again and do some more captures.

steveplz
Enthusiast

Aaaargh. My bad. In my defence I was sleep-deprived, but that's still no excuse. I fell into the same trap I've harangued Cisco TAC engineers for in the past when they were answering my cases. Like a deer in the virtual headlights I was dazzled by the shiny GRE and didn't pay attention to the rest of the text.

You're welcome for the response, despite the glaring FAIL that I didn't understand your question properly, but I'm not giving up. It was a good exercise to go through writing my spiel; I find explaining something to a peer is a good test of one's understanding and uncovers any uncertainties you might have. I am intrigued by your question, and since I'm going to sit the VCIX-NV in the next few weeks it can't hurt me to look into it. All the command line stuff is where I'm currently weakest, so I'm looking forward to playing and learning.

I've been looking at the question again since you answered so when I have something to say I'll post again.

Edit - forgot to add this question for you regarding capturing during the lab.

Is there a reason you didn't SSH to the hypervisor and use pktcap-uw on ESXi? If you do that, there's no need for your port mirroring, GRE, etc. As far as I can tell, pktcap-uw is greatly expanded in features over the VMware version of tcpdump on the hypervisor, tcpdump-uw, and you can save to a file with the right options. I would like to know a way to download files such as these captures from the HOL, but I take it you were running Wireshark on one of the machines within the lab somewhere?
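If you want to try it, something along these lines from an ESXi shell should capture at the VM's vSwitch port; treat it as a sketch (the port ID below is made up, and the option names are from memory, so check pktcap-uw -h on your build):

net-stats -l
    (lists the vSwitch port IDs and which VM, vmk or vmnic is attached to each)

pktcap-uw --switchport 33554443 -o /tmp/app-sv-01a.pcap
    (captures the traffic on that port to a .pcap you can open in Wireshark)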

steveplz
Enthusiast

I ran through everything and highlighted the unknown MAC address you are asking about.

My suspicion is, as per your earlier suggestion, that "a Nested lab thing" is the most likely cause. There are a couple of good articles from William Lam you might like to look at if you're interested in knowing more about nested environments and why promiscuous mode and forged transmits are required. It's all related to the fact that vSwitches don't do MAC learning like a normal switch.

http://www.virtuallyghetto.com/2013/11/why-is-promiscuous-mode-forged.html

http://www.virtuallyghetto.com/2014/08/new-vmware-fling-to-improve-networkcpu-performance-when-using...

Without knowing the physical and virtual topology it's hard to say conclusively, but I'm 99% sure that's the case. The only other thing you should check is that there's no load balancing configured that might throw some confusion into the mix. I doubt that's the case here, but it's worth asking.

I documented how I looked at all your addressing, etc., and all I can say is that you were right to be confused, but again it's just an unfortunate side effect of nesting hypervisors and the way VMware virtual switches differ in their operation from classic MAC-learning behaviour.

A couple of final points. If you had gone further in your packet captures you might have noticed frames coming from multiple source MAC addresses, including the odd one you saw and the correct one. Have you ever done a ping in the lab and seen DUP in the output? That's because promiscuous mode accepts all traffic on the network, so some traffic that isn't meant for a host gets processed when it should have been ignored. The end result is that two hosts might reply to the same echo request, and bang, you have the DUP output. That's a very crude explanation, but you get the basic picture.

If you want to poke around at it more, I suggest you use pktcap-uw to do captures on the LIFs at the hypervisor and see if you can find the owner of that MAC address 00:50:56:f1:f7:a2.
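One way to hunt for the owner, again sketched from memory (double-check the exact esxcli namespaces on your build), is to list the vNIC MACs each host actually knows about and compare them against the one in your capture:

esxcli network vm list
    (world IDs of the running VMs on the host)

esxcli network vm port list -w <worldID>
    (per-vNIC port details for one VM, including its MAC address)

If no vNIC owns it, it's likely a MAC belonging to the DLR instance on one of the hosts rather than to a VM.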

I hope that is finally of some help. If you're still curious, try sending an email to the HOL guys at hands-on-lab-beta@vmware.com, as they're very helpful. Post an update to this thread if you dig up anything interesting.

Addresses resolved from host command outputs


web-sv-01a: 172.16.10.11 00:50:56:a6:7a:a2 eth0

dfgw 172.16.10.1 00:50:56:8e:4d:21 LIF1 vMAC

app-sv-01a 172.16.20.11 00:50:56:a6:84:52 eth0

dfgw 172.16.20.1 02:50:56:56:44:52 LIF2 vMAC

Scenario

from web-sv-01a ping started to 172.16.20.11

ICMP packets are exchanged between web-sv-01a 172.16.10.11 and app-sv-01a 172.16.20.11

Capture is done at the eth0 nic of the destination server app-sv-01a 172.16.20.11

1. Theoretically from the perspective of eth0 on the target of pings 172.16.20.11

Since IP addresses don’t change

Echo packets source always 172.16.10.11 and destination always 172.16.20.11

Echo reply packets source always 172.16.20.11 and destination always 172.16.10.11

MAC address of the remote host (web-sv-01a 172.16.10.11) replaced by that of the DFGW LIF2

Echo packets source always 02:50:56:56:44:52 and destination 00:50:56:a6:84:52

Echo reply packets source always 00:50:56:a6:84:52 and destination 02:50:56:56:44:52


2. Actually Observed from Packet Capture

IP LAYER ALL GOOD AS PER RFC THEORY

Echo packets source always 172.16.10.11 and destination always 172.16.20.11

Echo reply packets source always 172.16.20.11 and destination always 172.16.10.11

The MAC layer has an unaccounted-for source address on the echo packets, where that of LIF2 should appear

Echo packets source should be 02:50:56:56:44:52 and destination 00:50:56:a6:84:52

Echo SMAC in the capture is 00:50:56:f1:f7:a2 but should be 02:50:56:56:44:52 LIF2 vMAC

OUTGOING ICMP ECHO REPLY ALL GOOD AS PER RFC THEORY

SA 172.16.20.11 SMAC 00:50:56:a6:84:52   (app-sv-01a eth0 MAC address)

DA 172.16.10.11 DMAC 02:50:56:56:44:52  (default GW LIF2 vMAC)


admin
Immortal

1. The SRC MAC in the Request is almost certainly a pMAC of the ESXi server where web-sv-01a resides. You should be able to see it by running "net-vdr -C -l" on that host.

2. "arp -an" output for your web-sv-01a does not look right, you should see vMAC (02:50:56:56:44:52) for all DLR LIFs. What version of NSX are you using?

nneogi
Contributor

DmitriK,

You are absolutely right about both your answers. Thank you. With regards to the "arp -an" output, I realized that I still had the web tier interface connected to the perimeter gateway instead of the DLR. After I moved the web tier interface to the DLR, I see the right output for "arp -an". Also, the "net-vdr -C -l" command gave me the source MAC on the request packet as well.
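For reference, once the web tier is attached to the DLR, the gateway entries in "arp -an" on both servers should resolve to the shared vMAC, roughly like this (illustrative Linux arp output, using the addresses from this thread):

? (172.16.10.1) at 02:50:56:56:44:52 [ether] on eth0     <- on web-sv-01a
? (172.16.20.1) at 02:50:56:56:44:52 [ether] on eth0     <- on app-sv-01a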

To summarize, in case anyone else runs into this, here are the final screen captures.

nsx-vmac-dlr.png
