reub
Enthusiast

dvSwitch - IPv6 address limits on hosts attached?

I've been experiencing a rather odd problem for the last few days which has me baffled, and I'm reaching out for advice and suggestions. I am running ESXi 6.5 with the latest patch bundle as of the end of March.

I have 12 VMs, of which 9 or so are on a common "server" VLAN, VLAN 10. There is a 10G uplink from this virtual switch into a Cisco Catalyst switch with multiple VLANs mapped through to other hosts. Management is on another NIC, and there is a spare copper port on the server.

In my environment I had a Nexus 1000v switch installed and running successfully for some time. I had been planning to migrate off it, so I took the plunge a week or two ago: I replicated the topology, created a dvSwitch within vCenter for this VLAN, and used the same trunked uplink as before. I then migrated all of the servers off the N1kv onto the new dvSwitch, with a port group for each VLAN (same uplink).

Things were a bit unstable initially, but everything came right after a full host restart. An hour or so later I started experiencing problems with one VM. The VM in question has one virtual NIC with 4x static IPv4 addresses and 4x static IPv6 addresses (plus whatever the host gets from SLAAC, I suppose). This is the only VM with multiple IP addresses on it, which may be related to the problem.

What happens is that some of the IPv6 addresses stop working from outside the port group on the dvSwitch. Everything within the port group on the same VLAN keeps working and the addresses remain reachable inside the group, but from outside the dvSwitch, through the physical uplink, we lose access to and from some of the additional IPv6 addresses.

I've restarted the switch and changed the 10G NIC (for a different Intel one with a different driver, even). Neither of these has helped.

However, what I did find helps is moving the VM onto a second dvSwitch, which uses one of the otherwise unused copper uplink ports. Same configuration again, just a different uplink. Once the problem occurs, moving the VM there restores connectivity immediately, and it stays stable for long periods of time.

Given this all worked with the N1kv, I'm inclined to believe this is not a hardware problem but a software problem with the dvSwitch, or else some limitation of the dvSwitch that does not exist with the N1kv.

The only change I had to make when creating the port groups was to enable promiscuous mode on my Palo Alto firewall interfaces, but aside from that the port groups are pretty standard.

Has anyone got any suggestions for what else I can try?  Are there any known issues/caveats with multiple IPv6 addresses bound to a host on a dvSwitch?

6 Replies
TechMassey
Hot Shot

This one is interesting, challenging but interesting. I'm not certain of the answer, but let's give it a shot.

Thanks for all the details, it really helps. I too am often in the situation of actually trying to make the environment better rather than keeping things as they always are. The only issue is that you run into fun problems like this.

Based on the details, one item stands out: the switch to the copper interface and a new vDS. Just to verify, is the common server VLAN the one with promiscuous mode turned on, and as far as you can tell, is the temporary vDS on copper using the same port group settings? The second question is whether there is ANY difference in the physical switch port config between the 10G and the copper, such as the access VLAN or anything else on the physical switch fabric that stands out?
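
If it helps while checking that, the host's own view of the vDS and of the policies actually pushed down to it can be dumped from the ESXi shell. A rough sketch below; note that net-dvs is an unsupported diagnostic command, so treat its output as read-only reference:

# Uplinks, VLAN and MTU details the host has for each distributed switch
esxcli network vswitch dvs vmware list
# Low-level dvPort properties pushed to the host, including the per-port
# security policy (promiscuous mode / MAC changes / forged transmits).
# net-dvs is unsupported - use it for viewing only, don't change anything with it
net-dvs -l | less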


reub
Enthusiast

Thanks for the response and the interest, Warren. Yes, this is one of those tricky but possibly interesting problems.

Here are the switch port configs. Both ports are on the same switch:

interface GigabitEthernet1/0/21
 description Cisco UCS C220 M3 Server Test Port
 switchport trunk allowed vlan 10,50
 switchport mode trunk
 switchport nonegotiate
 power efficient-ethernet auto
 storm-control broadcast level pps 2k 250
 storm-control action shutdown
 spanning-tree portfast trunk
end
!
interface TenGigabitEthernet1/1/3
 description Cisco UCS C220 M3 Server Inside (Fibre)
 switchport trunk allowed vlan 2,5,8,10-13,16,17,50,60,80
 switchport mode trunk
 switchport nonegotiate
 energywise keywords UCS,SERVER
 storm-control broadcast level pps 2k 250
 storm-control action shutdown
 bfd interval 600 min_rx 600 multiplier 3
 spanning-tree portfast trunk
end

So nothing obviously different there.

VLAN 10 is the VLAN where the server loses connectivity.

Here's the esxcfg-nics output also. I had a -lot- of connectivity problems with the i40en driver, but the latest Intel i40e driver seems to work well:

[root@vmware-1:~] esxcfg-nics -l
Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description
vmnic0  0000:01:00.0 igb         Up   1000Mbps   Full   24:e9:b3:16:3c:c6 1500   Intel Corporation I350 Gigabit Network Connection
vmnic1  0000:01:00.1 igb         Up   1000Mbps   Full   24:e9:b3:16:3c:c7 1500   Intel Corporation I350 Gigabit Network Connection
vmnic2  0000:03:00.0 i40e        Up   10000Mbps  Full   3c:fd:fe:a4:51:b4 9000   Intel Corporation Ethernet Controller X710 for 10GbE SFP+
vmnic3  0000:03:00.1 i40e        Down 0Mbps      Half   3c:fd:fe:a4:51:b5 9000   Intel Corporation Ethernet Controller X710 for 10GbE SFP+
[root@vmware-1:~]

vmnic0 = management, vmnic1 = testing port (where connectivity is stable), vmnic2 = main data uplink (where the host is unstable).
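
For reference, a capture of IPv6 frames at the uplink itself should show whether traffic for the failing addresses ever makes it to or from vmnic2. A rough sketch from the ESXi shell (file names and packet counts are arbitrary, and the --dir flag is relative to the vSwitch: 0 = in from the wire, 1 = out to the wire, so adjust as needed):

# IPv6 (ethertype 0x86dd) leaving the vDS towards the Catalyst
pktcap-uw --uplink vmnic2 --dir 1 --ethtype 0x86dd -o /tmp/vmnic2-tx.pcap -c 500
# ...and coming back in from the Catalyst
pktcap-uw --uplink vmnic2 --dir 0 --ethtype 0x86dd -o /tmp/vmnic2-rx.pcap -c 500
# Quick look at the neighbour solicitations/advertisements that were captured
tcpdump-uw -nr /tmp/vmnic2-tx.pcap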

Over the weekend I flattened and re-installed the host with 6.5a and then the latest update, and I've now moved the VM back onto the dvSwitch that has the 10G port as its uplink. All security options such as promiscuous mode are set to their default of disabled. Let's see how this goes...

reub
Enthusiast

A rebuild of the host did -not- resolve the problem.  I am still experiencing the same issue even after the rebuild.

FWIW, the VM hardware version is 13 and the VM is running Linux kernel 4.10.

TechMassey
Hot Shot

That is too bad. I did some additional checking, but I can't find any precedent for this behavior. I agree that it is likely software, related to a difference between the Cisco Nexus and the dvSwitch. Best to open a support ticket with VMware at this point if you haven't already.


reub
Enthusiast

Thanks anyway, Warren.  I can't log a support ticket as I only have the entitlement to use the software as a VMUG member.  So for now I guess I'll have to just suck it up and hope the problem fixes itself in due course.

reub
Enthusiast

Bumping this up again: I've done further testing and determined that this problem -only- occurs with dvSwitches. The issue does not occur with standard VMware switches or the Cisco N1kv. So there's something amiss with the dvSwitch code.

The problem only seems to occur with IPv6 traffic across the uplink ports.  IPv6 access within the dvSwitch continues to work.
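
For anyone trying to reproduce the isolation test: capturing at the VM's dvPort and at the uplink at the same time should show whether the frames die inside the vDS or somewhere on the physical side. A rough sketch only; the world ID and port ID placeholders need to be filled in from the first two commands:

# Find the VM's world ID, then the dvPort ID it is attached to on this host
esxcli network vm list
esxcli network vm port list -w <worldID>
# Capture IPv6 entering the vDS from the VM and leaving the vDS via the uplink
pktcap-uw --switchport <portID> --dir 0 --ethtype 0x86dd -o /tmp/vmport.pcap &
pktcap-uw --uplink vmnic2 --dir 1 --ethtype 0x86dd -o /tmp/uplink.pcap &
# A neighbour solicitation or reply that shows up in the dvPort capture but never
# in the uplink capture would point at the vDS dropping it rather than the Catalyst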

The problem does not depend on the number of hosts or IPv6 addresses on the switch.

Hopefully someone else has seen this problem and/or come up with a solution by now...
