Solved: VM loses network when passing through the NSX-T distributed firewall

aaronkiki · ‎04-29-2021

I have an environment with four hosts named host1, host2, host3, and host4.

vCenter 7.0.2

ESXi 7.0.1, 17551050

NSX-T 3.1.2

I just enabled the distributed firewall on VLAN-backed networking and created some segments. All VMs work well except one VM named test-213, which is located on host2. test-213 loses network connectivity and cannot reach the gateway. I captured its packets with nsxcli:

nsxcli -c start capture dvfilter nic-8898398-eth0-vmware-sfw.2 stage pre expression ipproto 0x01

The packets look fine before the distributed firewall, but I can't capture any packets after the distributed firewall with nsxcli:

nsxcli -c start capture dvfilter nic-8898398-eth0-vmware-sfw.2 stage post expression ipproto 0x01

If I migrate test-213 to another host or reboot host2, the VM works well. But after a while, test-213 loses network again.

The difference between test-213 and the other VMs is that it is transferring large files. The distributed firewall policy for test-213 is permit any.

Is there any way to find out why the VM loses network when its traffic passes through the distributed firewall? And could it be that the NSX-T distributed firewall and distributed IDS/IPS do not handle VMs with high throughput well?

shank89 · ‎04-29-2021

Try using the Traceflow tool in the UI. It injects packets into the data plane, and if they get caught by any DFW rule, it will tell you.

Shashank Mohan

VCIX-NV 2022 | VCP-DCV2019 | CCNP Specialist

https://lab2prod.com.au
LinkedIn https://www.linkedin.com/in/shankmohan/
Twitter @ShankMohan
Author of NSX-T Logical Routing: https://link.springer.com/book/10.1007/978-1-4842-7458-3

aaronkiki · ‎04-30-2021

I cannot use the Traceflow tool in the UI. There is an error:

Traceflow request failed. The request might be cancelled because it took more time than normal. Please retry.

Error Message: Traceflow intent /infra/traceflows/772dfaf0-a997-11eb-b1fe-effdc2a85c0f realized on enforcement point /infra/sites/default/enforcement-points/default with error Traceflow does not support vlan switch for port: LogicalPort/f4923903-da25-4af0-8db5-5b2bd82f9f8e /infra/segments/OfficeVMs/ports/default:f4923903-da25-4af0-8db5-5b2bd82f9f8e.

The segment uses a vSphere 7 VDS, not an N-VDS.

Sreec · ‎04-30-2021

Looks like you are using a VLAN-backed network, not overlay networks. Keeping that aside, can you confirm the points below?

1. Where is the VM's gateway configured?

2. Have you tried excluding the VM from the DFW?

3. When the VM is not able to reach the L3 address, is L2 learning working fine? Is the issue specific to one VM?

Cheers,
Sree | VCIX-5X| VCAP-5X| VExpert 7x|Cisco Certified Specialist
Please KUDO helpful posts and mark the thread as solved if answered

aaronkiki · ‎05-05-2021

Thanks Sreec.

Yes, I'm using a VLAN-backed network, not overlay networks. The gateway is configured on a physical switch.

The VM can learn the MAC address of the gateway and of other VMs in the same subnet, but it cannot reach the other VMs or the gateway.

I tried excluding the VM from the DFW, and then the VM works fine.

With the DFW enabled for the VM and no restrictions at all (the DFW policy is any-to-any, permit any), I captured its packets before and after the DFW. The DFW logs look fine; they show match PASS.

The problem is clear: the DFW drops or does not forward the VM's packets.

Most of the VMs work fine, except a few. The problem VMs are in different networks. So far, three VMs have the problem. If I move a VM's network to a normal DVS port group, it works fine. If I move it back to the segment port group, the problem appears again.

Sreec · ‎05-05-2021

Thanks for sharing it. Can you also perform the below check?

1. Connect the VM to the logical switch and run the commands below on the host where the VM resides:

summarize-dvfilter | grep -A4 VM_NAME   (to get the slot name, e.g. name: nic-4790914-eth0-vmware-sfw.2)

vsipioctl getfwconfig -f SLOT_NAME

Does it show any drop rule?

2. Connect the VM to a DV port group and repeat the same step.

3. Remove the VM from the vCenter inventory, add it back, connect it to the logical switch, and test connectivity once again.
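
Steps 1 and 2 above boil down to dumping the filter's rule set and looking for a blocking action. A minimal shell sketch, assuming the rule dump prints one rule per line with the action as a separate word. find_blocking_rules is a hypothetical helper name, and the VM and filter names are only the examples used in this thread:

```shell
# Sketch only: intended for the ESXi host where the VM resides.
# Reads a rule dump on stdin and prints every line that contains
# "drop" or "reject" as a whole word (case-insensitive).
# Prints a short notice instead when no such line is found.
find_blocking_rules() {
  grep -Eiw 'drop|reject' || echo "no drop/reject rules found"
}

# Example usage on the host (names are placeholders):
#   summarize-dvfilter | grep -A4 test-213
#   vsipioctl getfwconfig -f nic-4790914-eth0-vmware-sfw.2 | find_blocking_rules
```

If the output only says that no drop/reject rules were found, the rule set itself is probably not what is blocking the traffic, and the pre/post packet captures are the more useful evidence.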

Cheers,
Sree | VCIX-5X| VCAP-5X| VExpert 7x|Cisco Certified Specialist
Please KUDO helpful posts and mark the thread as solved if answered

aaronkiki · ‎05-05-2021

When the VM is connected to the NSX-T logical switch, the action of every rule is accept. Even the IDS action is detect only.

When the VM is connected to the DV port group, it shows no rules and works fine.

I removed the VM from vCenter and added it back. The problem is the same when the VM is connected to the NSX-T segment port group.

aaronkiki · ‎05-06-2021

I noticed an alarm in the NSX-T UI:

The disk usage for the Manager node disk partition /image has reached.

On all three manager nodes, the /image partition usage has reached 100%.

There are many files named java_pidxxx that take up all the space in /image.

Can I delete these java_pid files? And could the 100% usage of /image have caused these strange problems?
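
Before deleting anything, the partition can be inspected first. A minimal sketch, assuming a root shell on a manager node with standard df, ls, and head available. check_partition is a hypothetical helper name, and it only reports usage; it does not remove any files:

```shell
# Sketch only: show how full a partition is and list its largest
# java_pid* files (biggest first, at most ten). Deletes nothing.
check_partition() {
  dir="${1:-/image}"   # defaults to /image, the partition from the alarm
  df -h "$dir"
  ls -lhS "$dir"/java_pid* 2>/dev/null | head -n 10
}

# Example usage:
#   check_partition /image
```

Whether the files are safe to remove is a separate question that the listing alone does not answer.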

aaronkiki · ‎05-07-2021

Over the past few days I tried a lot of things. The DFW dropped or did not forward the packets of those specific VMs.

In the end, I updated vCenter and ESXi to the newest release, 7.0U2a.

Now all the VMs work fine. So I guess these specific VMs triggered some bug.
