Hello!
I have a test lab with NSX 6.2 and recently, while playing around with the local egress feature, I've stumbled upon a very strange and frustrating behavior.
I run a simulation of a 2-host (esx-12, esx-22) stretched cluster spanning two datacenters: DC-A (esx-12) and DC-B (esx-22).
DC-A is managed by vCenter vca1 and NSX Manager nsxmgra1; DC-B is managed by vCenter vcb1 and NSX Manager nsxmgrb1. esx-22 is the only DC-B host managed by vca1.
nsxmgra1 UUID is 42243c55-196b-520f-d733-e99899274998
nsxmgrb1 UUID is 420299c3-a98d-5ef5-6704-8c77475a4c37
The NSX Managers use a Cross-vCenter topology (nsxmgra1 is Primary), with one UDLR with local egress enabled and HA configured (so 2 Control VMs per site, 4 VMs total).
I set the esx-22 locale-ID via the GUI to the nsxmgrb1 UUID, and for a day or two local egress worked fine.
Then something happens, and the following line appears in vmkernel.log on host esx-22:
2016-05-27T05:54:39.914Z cpu15:566227)vdrb: VdrCpProcessLocaleIdMsg:2329: CP:INFO: VDR Control Plane : Changing localeId from 420299c3-a98d-5ef5-6704-8c77475a4c37 to 42243c55-196b-520f-d733-e99899274998
For some mysterious reason, the locale-ID on this host is changed back to the nsxmgra1 UUID.
Not surprisingly, 566227 is the PID of a netcpa-worker process, but this particular netcpa-worker has no open connections to controllers:
[root@esx-22:~] esxcli network ip connection list | grep 566227
[root@esx-22:~]
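In case it helps anyone chasing the same thing: the revert can be caught by grepping vmkernel.log for that exact message. A minimal sketch follows; the embedded sample line is copied from the log output above just so the pattern can be demonstrated anywhere, while on the host itself you would point grep at /var/log/vmkernel.log.

```shell
# Extract locale-ID change events from vmkernel.log-style output.
# On the host: grep -o 'Changing localeId from [0-9a-f-]* to [0-9a-f-]*' /var/log/vmkernel.log
sample='2016-05-27T05:54:39.914Z cpu15:566227)vdrb: VdrCpProcessLocaleIdMsg:2329: CP:INFO: VDR Control Plane : Changing localeId from 420299c3-a98d-5ef5-6704-8c77475a4c37 to 42243c55-196b-520f-d733-e99899274998'
echo "$sample" | grep -o 'Changing localeId from [0-9a-f-]* to [0-9a-f-]*'
```

Running that in a loop (or via tail -f piped into grep) gives a timestamped trail of every revert as it happens.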
I've checked vmkernel.log around the offending line and found something suspicious about LIFs that I can't decipher; the LIFs look fine both in the GUI and via CLI commands:
[root@esx-22:~] grep 566227 /var/log/vmkernel.log | head -210 | tail -20
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:06:54.416Z cpu22:566227)vdrb: VdrProcessRouteUpdateMessageAddRoutes:281: CP:Route Add: Entry 0 Unsupported Route Attribute type 1 skipping Attribute
2016-05-26T16:07:50.867Z cpu22:566227)WARNING: vdrb: VdrProcessLifUpdateMessageAddLifs:1000: CP:Lif Update: Unable to Add Lif entry, skipping to next entry
2016-05-26T16:07:50.867Z cpu22:566227)WARNING: vdrb: VdrProcessLifUpdateMessageAddLifs:1000: CP:Lif Update: Unable to Add Lif entry, skipping to next entry
2016-05-26T16:07:50.867Z cpu22:566227)WARNING: vdrb: VdrProcessLifUpdateMessageAddLifs:1000: CP:Lif Update: Unable to Add Lif entry, skipping to next entry
2016-05-26T16:07:50.867Z cpu22:566227)WARNING: vdrb: VdrProcessLifUpdateMessageAddLifs:1000: CP:Lif Update: Unable to Add Lif entry, skipping to next entry
2016-05-27T05:54:39.914Z cpu15:566227)vdrb: VdrCpProcessLocaleIdMsg:2329: CP:INFO: VDR Control Plane : Changing localeId from 420299c3-a98d-5ef5-6704-8c77475a4c37 to 42243c55-196b-520f-d733-e99899274998
2016-05-27T05:54:39.914Z cpu15:566227)vdrb: VdrSendInstanceReport:3713: CP:Simulating Link UP for instance [I:0x1770], controller IP = 0x7a80050a
2016-05-27T05:54:39.914Z cpu15:566227)vdrb: VdrSendInstanceReport:3713: CP:Simulating Link UP for instance [I:0x1388], controller IP = 0x7a80050a
2016-05-27T05:54:39.946Z cpu15:566227)vdrb: VdrCpProcessConnectionUpdateMessage:838: CP:Received Connection Update: numAdd=1, numDel= 0
2016-05-27T05:54:39.946Z cpu15:566227)vdrb: VdrInitConnectionTeamInfo:1927: CNXN:[C:vDS02-DC-1-Stretched,P:50331675] Default teaming
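A side note on those VdrSendInstanceReport lines: the controller IP is printed as a raw hex word, and on x86 hosts the bytes appear to come out in reversed (little-endian) order, so 0x7a80050a should correspond to 10.5.128.122. The byte order is my assumption, so verify the decoded address against esxcli network ip connection list. A quick sketch to decode it:

```shell
# Decode a vmkernel-style hex IP (assumed little-endian on x86) into dotted-quad form.
ip=0x7a80050a
printf '%d.%d.%d.%d\n' $((ip & 0xff)) $(((ip >> 8) & 0xff)) $(((ip >> 16) & 0xff)) $(((ip >> 24) & 0xff))
# prints 10.5.128.122
```

That makes it easy to tell which controller the host was re-syncing with at the moment the locale-ID flipped.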
I've also downloaded tech-support bundles from both nsxmgra1 and the controller in charge of the UDLR, but those bundles are huge and I have yet to find anything useful in them.
Any ideas about this behavior?
Upgrading to the latest versions at the time (ESXi 6.0U2 and NSX 6.2.2) fixed the issue.
Nevertheless, there is another glitch: controller state sync changes the locale-ID back to the default value (the UUID of the NSX Manager in charge). This can be a very nasty issue in the following scenario:
The "best" part of it is that VC is still dead at that point, so you can't correct the locale-ID via the GUI.
So, I see two paths to overcome this situation (apart from the possibility that I'm doing something terribly wrong and everyone else has it working fine. Anyone?):
Are you guys interested? I can keep you informed of my findings.
I would like to understand more about this problem, sir, if you don't mind.
Regards.
I think the message bus is in a mess on that particular host. Since the locale-ID was OK before and then changed automatically after this failure on the bus, let me see if I can dig deeper into in which cases the locale-ID changes automatically. But I'm assuming, and just assuming, that if the locale-ID changes on its own, something is wrong regarding communication with the bus.
If you haven't already, try restarting netcpa on that host.
Please keep me posted on your findings.
Regards.
Hello, Raymundo.
I did the test again and the problem persists.
The host in question, the one playing the remote side of the stretched cluster, is esx-22.
First, I enforced the locale-ID on it one more time to ensure it was correct.
vmkernel.log
12:2016-06-06T05:10:06.664Z cpu0:35166)vdrb: VdrCpProcessLocaleIdMsg:2329: CP:INFO: VDR Control Plane : Changing localeId from 42243c55-196b-520f-d733-e99899274998 to 420299c3-a98d-5ef5-6704-8c77475a4c37
netcpa.log
2016-06-06T05:10:06.664Z info netcpa[FF9BBB70] [Originator@6876 sub=Default] Got Message in Function Callback Type 3
2016-06-06T05:10:06.664Z info netcpa[FF9BBB70] [Originator@6876 sub=Default] Got Message in App Id host-43 Type VNET
After that, I rebooted esx-22 to reset all services running on it, and during boot-up the locale-ID changed again!
I changed it back and continued with the test.
Stretched cluster in question:
esx-12:
esx-22:
All those VNIs are behind UDLR.
I also ran pings from my laptop to all these VMs to monitor connectivity in real time.
The management cluster is separate, two-node as well, and all management and control VMs run on the "DC1" part of the management cluster.
So, I powered off the "DC1" part of my lab and, obviously, vm1-vm4 went down, but connectivity to vm8 was lost too; it resumed 10 seconds later, but this was still unexpected behavior.
An even weirder thing happened later: as soon as vm1-vm4 booted up on esx-22, pings to all VMs on VNI6000 vanished. I connected to the consoles of the VMs in question and found that they were able to ping each other and VMs on other VNIs connected to the UDLR. They could also ping the default gateway, but VNI6000 traffic refused to leave the UDLR.
This was extremely frustrating. All firewalls were either disabled (ESG) or configured to allow all traffic (DFW), so that was not the cause. Also, I have never experienced anything like this with a "simple" DLR.
Anyway, by that time the NSX Controllers had booted, esx-22 regained connectivity with them, and the VNI6000 VMs became reachable again.
For a short period everything was working fine; then the NSX Management Service came up and BOOM, the locale-ID changed again:
vmkernel.log
499:2016-06-06T05:19:55.444Z cpu0:35166)vdrb: VdrCpProcessLocaleIdMsg:2329: CP:INFO: VDR Control Plane : Changing localeId from 420299c3-a98d-5ef5-6704-8c77475a4c37 to 42243c55-196b-520f-d733-e99899274998
netcpa.log
2016-06-06T05:19:55.508Z info netcpa[FF9BBB70] [Originator@6876 sub=Default] Got Message in Function Callback Type 3
2016-06-06T05:19:55.508Z info netcpa[FF9BBB70] [Originator@6876 sub=Default] Got Message in App Id host-43 Type VNET
2016-06-06T05:19:55.508Z info netcpa[FF97AB70] [Originator@6876 sub=Default] Received vdr instance message numVdrId 2 ...
2016-06-06T05:19:55.508Z info netcpa[FF97AB70] [Originator@6876 sub=Default] Updated vdr instance vdr name = default+edge-5, vdr id = 5000, auth token = becbcd28-fd1a-432c-b4dc-de12ea067510, universal = false, localEgress = false
2016-06-06T05:19:55.509Z info netcpa[FF97AB70] [Originator@6876 sub=Default] Updated vdr instance vdr name = default+edge-47748c53-6f4b-47cc-a55e-858691cbad47, vdr id = 6000, auth token = 3e209177-983c-4ea8-8bc2-af2fe9b707f2, universal = true, localEgress = true
2016-06-06T05:19:55.509Z info netcpa[FF97AB70] [Originator@6876 sub=Default] No flap edge CP link for vdr id 5000
2016-06-06T05:19:55.509Z info netcpa[FF97AB70] [Originator@6876 sub=Default] No flap edge CP link for vdr id 6000
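Those "Updated vdr instance" lines are handy for confirming that the host still sees the UDLR as universal with local egress enabled. A small sketch to pull just those fields out of netcpa.log; the embedded sample line is copied from the output above so the extraction can be demonstrated anywhere, while on the host you would run the sed expression against /var/log/netcpa.log.

```shell
# Summarize a "vdr instance" line from netcpa.log: name, universal and localEgress flags.
line='2016-06-06T05:19:55.509Z info netcpa[FF97AB70] [Originator@6876 sub=Default] Updated vdr instance vdr name = default+edge-47748c53-6f4b-47cc-a55e-858691cbad47, vdr id = 6000, auth token = 3e209177-983c-4ea8-8bc2-af2fe9b707f2, universal = true, localEgress = true'
echo "$line" | sed -n 's/.*vdr name = \([^,]*\),.*universal = \([^,]*\), localEgress = \(.*\)$/\1 universal=\2 localEgress=\3/p'
```

If localEgress ever shows up as false for the universal instance after a controller or manager event, that would line up with the locale-ID revert seen above.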
So either local egress is broken somehow, or my environment is totally broken. But my environment was a clean install of vSphere 6.0 + NSX 6.2, later updated to vSphere 6.0U2 + NSX 6.2.2. It is a lab, so it barely sees any use, and I'm the only one who messes with it.
Anyway, I was extremely disturbed by the VNI6000 behavior. I guess I'll run another test without local egress and watch the UDLR in that situation.
Am I mistaken, or do you work for VMware? If you do, I can provide you with all the logs you want to go through and/or show to someone inside the shop.
I would be very interested in checking this problem, so yeah, sure, share the logs and I'll see what I can do. Just be patient with this, since I have some work to do. On the other hand, could you please provide a logical sketch of your deployment?
Regards!