Hello!
I have a test lab with NSX 6.2 and recently, while playing around with the local egress feature, I've stumbled upon a very strange and frustrating behavior.
I run a simulation of a 2-host (esx-12, esx-22) stretched cluster spanning two datacenters: DC-A (esx-12) and DC-B (esx-22).
DC-A is managed by vCenter vca1 and NSX Manager nsxmgra1; DC-B is managed by vCenter vcb1 and NSX Manager nsxmgrb1. esx-22 is the only DC-B host managed by vca1.
nsxmgra1 UUID is 42243c55-196b-520f-d733-e99899274998
nsxmgrb1 UUID is 420299c3-a98d-5ef5-6704-8c77475a4c37
The NSX Managers use a Cross-vCenter topology (nsxmgra1 is Primary), with one UDLR with local egress enabled and HA configured (so 2 Control VMs per site, 4 VMs total).
I set the esx-22 locale-ID via the GUI to the nsxmgrb1 UUID, and for a day or two local egress worked fine.
Then something happens, and the following line appears in vmkernel.log on host esx-22:
2016-05-27T05:54:39.914Z cpu15:566227)vdrb: VdrCpProcessLocaleIdMsg:2329: CP:INFO: VDR Control Plane : Changing localeId from 420299c3-a98d-5ef5-6704-8c77475a4c37 to 42243c55-196b-520f-d733-e99899274998
For some mysterious reason, the locale-ID on this host is changed back to the nsxmgra1 UUID.
Not surprisingly, 566227 is the PID of a netcpa-worker process, but this particular netcpa-worker has no open connections to controllers:
[root@esx-22:~] esxcli network ip connection list | grep 566227
[root@esx-22:~]
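In case it helps anyone chasing the same thing: the revert can be caught by grepping vmkernel.log for that exact message. A minimal sketch follows; the embedded sample line is copied from the log output above just so the pattern can be demonstrated anywhere, while on the host itself you would point grep at /var/log/vmkernel.log.

```shell
# Extract locale-ID change events from vmkernel.log-style output.
# On the host: grep -o 'Changing localeId from [0-9a-f-]* to [0-9a-f-]*' /var/log/vmkernel.log
sample='2016-05-27T05:54:39.914Z cpu15:566227)vdrb: VdrCpProcessLocaleIdMsg:2329: CP:INFO: VDR Control Plane : Changing localeId from 420299c3-a98d-5ef5-6704-8c77475a4c37 to 42243c55-196b-520f-d733-e99899274998'
echo "$sample" | grep -o 'Changing localeId from [0-9a-f-]* to [0-9a-f-]*'
```

Running that in a loop (or via tail -f piped into grep) gives a timestamped trail of every revert as it happens.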
I've checked vmkernel.log around the offending line and found something suspicious about LIFs that I can't decipher; the LIFs look fine both in the GUI and via CLI commands:
[root@esx-22:~] grep 566227 /var/log/vmkernel.log | head -210 | tail -20
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:07:50.867Z cpu22:566227)WARNING: vdrb: VdrProcessLifUpdateMessageAddLifs:1000: CP:Lif Update: Unable to Add Lif entry, skipping to next entry
2016-05-26T16:07:50.867Z cpu22:566227)WARNING: vdrb: VdrProcessLifUpdateMessageAddLifs:1000: CP:Lif Update: Unable to Add Lif entry, skipping to next entry
2016-05-26T16:07:50.867Z cpu22:566227)WARNING: vdrb: VdrProcessLifUpdateMessageAddLifs:1000: CP:Lif Update: Unable to Add Lif entry, skipping to next entry
2016-05-26T16:07:50.867Z cpu22:566227)WARNING: vdrb: VdrProcessLifUpdateMessageAddLifs:1000: CP:Lif Update: Unable to Add Lif entry, skipping to next entry
2016-05-27T05:54:39.914Z cpu15:566227)vdrb: VdrCpProcessLocaleIdMsg:2329: CP:INFO: VDR Control Plane : Changing localeId from 420299c3-a98d-5ef5-6704-8c77475a4c37 to 42243c55-196b-520f-d733-e99899274998
2016-05-27T05:54:39.914Z cpu15:566227)vdrb: VdrSendInstanceReport:3713: CP:Simulating Link UP for instance [I:0x1770], controller IP = 0x7a80050a
2016-05-27T05:54:39.914Z cpu15:566227)vdrb: VdrSendInstanceReport:3713: CP:Simulating Link UP for instance [I:0x1388], controller IP = 0x7a80050a
2016-05-27T05:54:39.946Z cpu15:566227)vdrb: VdrCpProcessConnectionUpdateMessage:838: CP:Received Connection Update: numAdd=1, numDel= 0
2016-05-27T05:54:39.946Z cpu15:566227)vdrb: VdrInitConnectionTeamInfo:1927: CNXN:[C:vDS02-DC-1-Stretched,P:50331675] Default teaming
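A side note on those VdrSendInstanceReport lines: the controller IP is printed as a raw hex word, and on x86 hosts the bytes appear to come out in reversed (little-endian) order, so 0x7a80050a should correspond to 10.5.128.122. The byte order is my assumption, so verify the decoded address against esxcli network ip connection list. A quick sketch to decode it:

```shell
# Decode a vmkernel-style hex IP (assumed little-endian on x86) into dotted-quad form.
ip=0x7a80050a
printf '%d.%d.%d.%d\n' $((ip & 0xff)) $(((ip >> 8) & 0xff)) $(((ip >> 16) & 0xff)) $(((ip >> 24) & 0xff))
# prints 10.5.128.122
```

That makes it easy to tell which controller the host was re-syncing with at the moment the locale-ID flipped.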
I've also downloaded tech-support bundles from both nsxmgra1 and the controller in charge of the UDLR, but those bundles are huge and I have yet to find anything useful in them.
Any ideas about this behavior?
Upgrading to the latest versions at the time (ESXi 6.0U2 and NSX 6.2.2) fixed the issue.
Nevertheless, there is another glitch: controller state sync changes the locale-ID back to the default value (the UUID of the NSX Manager in charge). This can be a very nasty issue in the following scenario:
The "best" part of it is that VC is still dead at that point, so you can't correct the locale-ID via the GUI.
So, I see two paths to overcome this situation (apart from the possibility that I'm doing something terribly wrong and everyone else has it working fine. Anyone?):
Are you guys interested? I can keep you informed of my findings.
I would like to understand more about this problem, sir, if you don't mind.
Regards.
I think the message bus is in a mess on that particular host. Since the locale-ID was OK before and then changed automatically after this failure on the bus, let me see if I can dig deeper into in which cases the locale-ID changes automatically. But I'm assuming, and just assuming, that if the locale-ID changes on its own, something is wrong regarding communication with the bus.
If you haven't already, try restarting netcpa on that host.
Please keep me posted on your findings.
Regards.
Hello, Raymundo.
I did the test again and the problem persists.
The host in question, the one playing the remote side of the stretched cluster, is esx-22.
First, I enforced the locale-ID on it one more time to ensure it was correct.
vmkernel.log
12:2016-06-06T05:10:06.664Z cpu0:35166)vdrb: VdrCpProcessLocaleIdMsg:2329: CP:INFO: VDR Control Plane : Changing localeId from 42243c55-196b-520f-d733-e99899274998 to 420299c3-a98d-5ef5-6704-8c77475a4c37
netcpa.log
2016-06-06T05:10:06.664Z info netcpa[FF9BBB70] [Originator@6876 sub=Default] Got Message in Function Callback Type 3
2016-06-06T05:10:06.664Z info netcpa[FF9BBB70] [Originator@6876 sub=Default] Got Message in App Id host-43 Type VNET
After that, I rebooted esx-22 to reset all services running on it, and during boot-up the locale-ID changed again!
I changed it back and continued with the test.
Stretched cluster in question:
esx-12:
esx-22:
All those VNIs are behind UDLR.
I also ran pings from my laptop to all these VMs to monitor connectivity in real time.
The management cluster is separate, two-node as well, and all management and control VMs run on the "DC1" part of the management cluster.
So, I powered off the "DC1" part of my lab and, obviously, vm1-vm4 went down, but connectivity to vm8 was lost too; it resumed 10 seconds later, but this was still unexpected behavior.
An even weirder thing happened later: as soon as vm1-vm4 booted up on esx-22, pings to all VMs on VNI6000 vanished. I connected to the consoles of the VMs in question and found that they were able to ping each other and VMs on other VNIs connected to the UDLR. They could also ping the default gateway, but VNI6000 traffic refused to leave the UDLR.
This was extremely frustrating. All firewalls were either disabled (ESG) or configured to allow all traffic (DFW), so that was not the cause. Also, I have never experienced anything like this with a "simple" DLR.
Anyway, by that time the NSX Controllers had booted, esx-22 regained connectivity with them, and the VNI6000 VMs became reachable again.
For a short period everything was working fine; then the NSX Management Service came up and BOOM, the locale-ID changed again:
vmkernel.log
499:2016-06-06T05:19:55.444Z cpu0:35166)vdrb: VdrCpProcessLocaleIdMsg:2329: CP:INFO: VDR Control Plane : Changing localeId from 420299c3-a98d-5ef5-6704-8c77475a4c37 to 42243c55-196b-520f-d733-e99899274998
netcpa.log
2016-06-06T05:19:55.508Z info netcpa[FF9BBB70] [Originator@6876 sub=Default] Got Message in Function Callback Type 3
2016-06-06T05:19:55.508Z info netcpa[FF9BBB70] [Originator@6876 sub=Default] Got Message in App Id host-43 Type VNET
2016-06-06T05:19:55.508Z info netcpa[FF97AB70] [Originator@6876 sub=Default] Received vdr instance message numVdrId 2 ...
2016-06-06T05:19:55.508Z info netcpa[FF97AB70] [Originator@6876 sub=Default] Updated vdr instance vdr name = default+edge-5, vdr id = 5000, auth token = becbcd28-fd1a-432c-b4dc-de12ea067510, universal = false, localEgress = false
2016-06-06T05:19:55.509Z info netcpa[FF97AB70] [Originator@6876 sub=Default] Updated vdr instance vdr name = default+edge-47748c53-6f4b-47cc-a55e-858691cbad47, vdr id = 6000, auth token = 3e209177-983c-4ea8-8bc2-af2fe9b707f2, universal = true, localEgress = true
2016-06-06T05:19:55.509Z info netcpa[FF97AB70] [Originator@6876 sub=Default] No flap edge CP link for vdr id 5000
2016-06-06T05:19:55.509Z info netcpa[FF97AB70] [Originator@6876 sub=Default] No flap edge CP link for vdr id 6000
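Those "Updated vdr instance" lines are handy for confirming that the host still sees the UDLR as universal with local egress enabled. A small sketch to pull just those fields out of netcpa.log; the embedded sample line is copied from the output above so the extraction can be demonstrated anywhere, while on the host you would run the sed expression against /var/log/netcpa.log.

```shell
# Summarize a "vdr instance" line from netcpa.log: name, universal and localEgress flags.
line='2016-06-06T05:19:55.509Z info netcpa[FF97AB70] [Originator@6876 sub=Default] Updated vdr instance vdr name = default+edge-47748c53-6f4b-47cc-a55e-858691cbad47, vdr id = 6000, auth token = 3e209177-983c-4ea8-8bc2-af2fe9b707f2, universal = true, localEgress = true'
echo "$line" | sed -n 's/.*vdr name = \([^,]*\),.*universal = \([^,]*\), localEgress = \(.*\)$/\1 universal=\2 localEgress=\3/p'
```

If localEgress ever shows up as false for the universal instance after a controller or manager event, that would line up with the locale-ID revert seen above.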
So either local egress is broken somehow, or my environment is totally broken. But my environment was a clean install of vSphere 6.0 + NSX 6.2, later updated to vSphere 6.0U2 + NSX 6.2.2. It is a lab, so it barely sees any use, and I'm the only one who messes with it.
Anyway, I was extremely disturbed by the VNI6000 behavior. I guess I'll run another test without local egress and watch the UDLR in that situation.
Am I mistaken, or do you work for VMware? If you do, I can provide you with all the logs you want to go through and/or show to someone inside the shop.
I would be very interested in checking this problem, so yeah, sure, share the logs and I'll see what I can do. Just be patient with this, since I have some work to do. On the other hand, could you please provide a logical sketch of your deployment?
Regards!