VMware Networking Community
jedijeff
Enthusiast

Bad controller maybe. Rebuild controller cluster?

Hi. We have one controller VM that has disconnected from the controller cluster twice in the last two months. We see vCenter alarms pointing to high memory usage roughly a week before it happens.

I opened a ticket and was told to rebuild the entire controller cluster, i.e. delete and reinstall all 3 NSX controllers. The ticket was closed, and I forgot to ask whether I can just delete this one node. I ask because several months ago, during a failover test (we have a cross-site setup), I inadvertently deleted a primary controller instead of one at our secondary site while failing back. It got into a weird state, so I think I actually deleted the VM. So I am not sure whether I screwed up the NSX database or not.

Anyway I may open the ticket back up to give that information.

I am just hesitant to rebuild the entire cluster, and yes, I know that with DRS turned off it should be no issue. I just wanted to try cleanly deleting this one controller and redeploying it first. Looking for any comments. Thanks.


Accepted Solutions
vvermani2
Contributor

Hi,

First, to answer your question: yes, you can delete and re-deploy a single controller when it encounters catastrophic, unrecoverable errors or when one of the controller VMs becomes inaccessible and cannot be fixed. You must delete the broken controller before deploying a new one.

Second, the controllers sit in the control plane, and even if all three controller instances are down, it does not impact data plane network communication (i.e. the traffic of the VMs running on your ESXi hosts). The impact comes when new VMs are provisioned, new logical switches are created, or existing objects are amended, because those changes must be reported to the controllers, which then propagate the newly learnt information (e.g. routes, MAC addresses, new VTEPs when ESXi hosts are added) down to the VTEPs.

If you would like to learn more about the controller nodes and their architecture, please see the link below:

Understanding the Controller Cluster Architecture

The control plane is completely segregated from the data plane so that the data plane can withstand control plane failures, with the caveat I mentioned above, so don't be reluctant to delete and re-deploy all three controllers.

I don't know exactly what happened in your environment or why a single controller node got disconnected. Troubleshooting and resolving a controller issue can sometimes be very quick, but at times it can be very cumbersome. The issue may or may not be related to the vCenter memory alarms you mentioned, but I expect VMware support spent time on the issue before concluding that you should re-deploy all three instances. It is also the easiest and quickest way to get to a clean and healthy state, as the re-deployment and cluster formation take minutes compared to the hours that can be spent on troubleshooting. Having said that, it would be ideal to troubleshoot and find the root cause if the issue keeps recurring.
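If you do go the single-controller route first, a rough Python sketch of how you might pick out the unhealthy node and build the delete request is below. Treat all of it as an assumption to check against the NSX for vSphere API guide for your version: I am assuming the controller list comes from `GET /api/2.0/vdn/controller`, that each entry carries `id` and `status` elements with `RUNNING` meaning healthy, and that removal is a `DELETE` on the same path with a `forceRemoval` flag. The sample XML, the host name `nsxmgr.example.local` and the function names are made up for illustration, and nothing here sends a request.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the assumed response of
# GET /api/2.0/vdn/controller on NSX Manager (trimmed to two fields).
SAMPLE_RESPONSE = """
<controllers>
  <controller><id>controller-1</id><status>RUNNING</status></controller>
  <controller><id>controller-2</id><status>RUNNING</status></controller>
  <controller><id>controller-3</id><status>UNKNOWN</status></controller>
</controllers>
"""


def unhealthy_controllers(xml_text, healthy_status="RUNNING"):
    """Return the ids of controllers whose status is not the healthy one."""
    root = ET.fromstring(xml_text)
    return [
        node.findtext("id")
        for node in root.iter("controller")
        if node.findtext("status") != healthy_status
    ]


def delete_url(manager_host, controller_id, force=False):
    """Build (but do not send) the assumed DELETE URL for one controller."""
    flag = "true" if force else "false"
    return (
        f"https://{manager_host}/api/2.0/vdn/controller/"
        f"{controller_id}?forceRemoval={flag}"
    )


if __name__ == "__main__":
    for cid in unhealthy_controllers(SAMPLE_RESPONSE):
        # Print only; review before issuing anything against a real manager.
        print("DELETE", delete_url("nsxmgr.example.local", cid))
```

The sketch only prints the request so it can be reviewed first. Once the broken node is gone, the replacement would be deployed in the usual way.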

Hope this helps.

Regards,

Varun

http://shuttleTITAN.com

Knowledge increases by sharing it...

2 Replies
jedijeff
Enthusiast

Hi. Thank you very much. I will rebuild. I am not sure what happened to this controller; it reaches high memory utilization and IOPS within hours of powering up, where it used to take days. Latency on its datastore is about 3 ms. I will keep it powered off for several days until I can get change approval to rebuild. We have 2 more controllers online and are in CDO mode as well. Luckily this is a DEV/QA system only. Thanks.
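For anyone wondering why I am comfortable leaving one node off in the meantime: two of three controllers is still a majority. A quick worked check in Python, assuming a simple majority rule for the cluster (the function names are just for illustration):

```python
def majority(cluster_size):
    """Smallest number of nodes that is more than half the cluster."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size, nodes_online):
    """True when the online nodes still form a majority."""
    return nodes_online >= majority(cluster_size)


if __name__ == "__main__":
    print(majority(3))       # majority needed in a 3-node cluster
    print(has_quorum(3, 2))  # one controller powered off
    print(has_quorum(3, 1))  # two controllers down
```

So losing a second controller before the rebuild would cost the majority. That is the case CDO mode is there to cover.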
