Re: Permissions problems with linked VCs after VCSA migration

JulietDeltaGolf · ‎09-13-2020

Hi,

We have just migrated/upgraded two Windows vCenter 6.0 servers, running in linked mode, to 6.7 VCSAs. Everything seemed to go OK during the migrations, but now that they are complete there seems to be a problem with the trust or sharing of credentials between the two VCSAs. They still see each other as linked: VCSA 'A' is visible at the top level from VCSA 'B' and vice versa. However, attempting to expand the tree of the opposite site from either A or B gives the 'You have no privileges to view this object or it does not exist' error message. This occurs whichever site we are logged in to, and we get the same problem whether we are logged in as 'administrator@vsphere.local' or as any AD account that should have the permissions to do this.

I.e. AD account 'X' is able to log in to both VCSA 'A' and VCSA 'B' and can explore the 'local' VCSA in full but can't view 'A' from 'B' and can't view 'B' from 'A'.

We can, however, see recent tasks initiated at either site, e.g. if we delete a VM snapshot at site 'A' (using VCSA 'A') and another at site 'B' (using VCSA 'B'), the recent tasks pane is populated with both tasks in both VCSAs.

scott28tt · ‎09-13-2020

Moderator: Thread moved to the vCenter Server area.

Lalegre · ‎09-13-2020

Hey JulietDeltaGolf,

Check that the time is in sync across all PSCs and vCenters (which nodes you need to check depends on whether you have an embedded or external PSC deployment).

Also check if the replication between PSCs is working:

vdcrepadmin -f showservers -h PSC_FQDN -u administrator -w Administrator_Password
vdcrepadmin -f showpartners -h PSC_FQDN -u administrator -w Administrator_Password
vdcrepadmin -f showpartnerstatus -h localhost -u administrator -w Administrator_Password

You should see the remote PSC as a connected partner, and you should also see that it is in sync. I will also paste the KB that lists all the checks and the expected outputs, in case you are not familiar with the commands: VMware Knowledge Base

JulietDeltaGolf · ‎09-14-2020

Thanks, but things have changed (seemingly for the worse, but perhaps it actually helps point us in the right direction). Site A still seems to be fine, but attempting to log in to the site B VC, using either a domain user or the administrator@vsphere.local account, now produces the same error: "Empty SSO response string."

We have just restarted the site B VCSA, in case this allows us to re-connect.

JulietDeltaGolf · ‎09-14-2020

The VCSA reboot has resolved the 500 problem; we're back to the same behaviour as in the original post.

Both VCSAs are set to sync their time with the host, and their clocks appear to match.

Lalegre · ‎09-14-2020

However, you should still run the checks I mentioned, because they verify the communication and syncing of the SSO domain between the PSCs, and that could be the reason you are not able to connect.

You should also look at the logs of the SSO service.

JulietDeltaGolf · ‎09-14-2020

For site A we see this:

./vdcrepadmin -f showpartnerstatus -h localhost -u administrator

Partner: site_A
Host available: Yes
Status available: Yes
My last change number: 105292
Partner has seen my change number: 88906
Partner is 16386 changes behind.

But for site B we see this:

./vdcrepadmin -f showpartnerstatus -h localhost -u administrator

Partner: site_A
Host available: Yes
Status available: No

Other than that, the 'showservers' and 'showpartners' commands appear to give correct results. It looks like site B is the problem: it is not seeing the changes from site A or sending its changes to site A.
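The two outputs above can also be compared mechanically. The following is a minimal Python sketch, assuming the `showpartnerstatus` output has exactly the line format quoted in this post. The function names `parse_partner_status` and `replication_problems` are made up for illustration and are not part of any VMware tooling; the script only parses text and does not call `vdcrepadmin` itself.

```python
# Sketch only: parses text shaped like the showpartnerstatus output quoted above.
# All names here are hypothetical helpers, not VMware APIs.

SITE_A_OUTPUT = """\
Partner: site_A
Host available: Yes
Status available: Yes
My last change number: 105292
Partner has seen my change number: 88906
Partner is 16386 changes behind.
"""

SITE_B_OUTPUT = """\
Partner: site_A
Host available: Yes
Status available: No
"""

# Map the labels seen in the output to (dict key, converter).
FIELDS = {
    "Host available": ("host_available", lambda v: v == "Yes"),
    "Status available": ("status_available", lambda v: v == "Yes"),
    "My last change number": ("my_last_change", int),
    "Partner has seen my change number": ("partner_seen", int),
}


def parse_partner_status(text):
    """Return one dict per 'Partner:' block found in the text."""
    partners = []
    current = None
    for raw in text.splitlines():
        label, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # lines without a colon (e.g. the summary line) are skipped
        value = value.strip()
        if label == "Partner":
            current = {"partner": value}
            partners.append(current)
        elif current is not None and label in FIELDS:
            key, convert = FIELDS[label]
            current[key] = convert(value)
    return partners


def replication_problems(partners):
    """Describe each partner that is unreachable, has no status, or lags behind."""
    problems = []
    for p in partners:
        if not p.get("host_available", False):
            problems.append(f"{p['partner']}: host not available")
        elif not p.get("status_available", False):
            problems.append(f"{p['partner']}: status not available")
        elif "my_last_change" in p and "partner_seen" in p:
            lag = p["my_last_change"] - p["partner_seen"]
            if lag > 0:
                problems.append(f"{p['partner']}: {lag} changes behind")
    return problems


if __name__ == "__main__":
    for name, output in (("site A", SITE_A_OUTPUT), ("site B", SITE_B_OUTPUT)):
        print(name, replication_problems(parse_partner_status(output)))
```

The lag is recomputed from the two change numbers rather than read from the summary line, so it can be cross-checked against what the tool reports.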

Lalegre · ‎09-14-2020

So basically that is it: the agreement between the two PSCs is broken. I would recommend that you first go over the "createagreement" section of the KB I sent you, to recreate it. Do it from the PSC on Site B.

JulietDeltaGolf · ‎09-14-2020

Not sure if it changes anything or just confirms but in the logs we see:

Site A:

2020-09-14T10:37:08.530672+00:00 err vmdird t@140406020888320: SASLSessionStep: sasl error (-13)(SASL(-13): authentication failure: client evidence does not match what we calculated. Probably a password error)

2020-09-14T10:37:08.531184+00:00 err vmdird t@140406020888320: VmDirSendLdapResult: Request (Bind), Error (49), Message ((49)(SASL step failed.)), (0) socket (128.33.45.35)

2020-09-14T10:37:08.531493+00:00 err vmdird t@140406020888320: Bind Request Failed (128.33.45.35) error 49: Protocol version: 3, Bind DN: "cn=SITE_B_FQDN,ou=Domain Controllers,dc=vsphere,dc=local", Method: SASL

2020-09-14T10:37:33.966230+00:00 err vmdird t@140406750693120: VmDirSafeLDAPBind to (ldap://SITE_B.FQDN:389) failed. SRP(9234)

and on Site B:

2020-09-14T10:38:33.973211+00:00 err vmdird t@140126311143168: SASLSessionStep: sasl error (-13)(SASL(-13): authentication failure: client evidence does not match what we calculated. Probably a password error)

2020-09-14T10:38:33.973810+00:00 err vmdird t@140126311143168: VmDirSendLdapResult: Request (Bind), Error (49), Message ((49)(SASL step failed.)), (0) socket (10.172.252.16)

2020-09-14T10:38:33.974271+00:00 err vmdird t@140126311143168: Bind Request Failed (10.172.252.16) error 49: Protocol version: 3, Bind DN: "cn=SITE_B_FQDN,ou=Domain Controllers,dc=vsphere,dc=local", Method: SASL

2020-09-14T10:38:38.593819+00:00 err vmdird t@140127267452672: VmDirSafeLDAPBind to (ldap://SITE_A.FQDN:389) failed. SRP(9234)

That sounded a bit like this: VMware Knowledge Base (not quite but similar), so we followed those steps but it doesn't seem to have made any difference.

Do you think we should just try 'createagreement', or should we 'removeagreement' and then 'createagreement'?
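To see at a glance which peer and which account the failed binds involve, the "Bind Request Failed" lines can be pulled out of the log. The following is a rough Python sketch, assuming the vmdird log lines look exactly like the ones quoted above; `bind_failures` and `summarize` are hypothetical helper names. LDAP result code 49 is the standard "invalid credentials" code, which fits the "Probably a password error" text in the SASL line.

```python
import re
from collections import Counter

# Assumption: the line layout matches the vmdird excerpts quoted in this thread.
BIND_FAILURE = re.compile(
    r'^(?P<timestamp>\S+) err vmdird t@\d+: Bind Request Failed '
    r'\((?P<client>[^)]+)\) error (?P<error>\d+): .*?'
    r'Bind DN: "(?P<bind_dn>[^"]+)", Method: (?P<method>\w+)'
)

# Three lines copied from the Site A excerpt above.
SAMPLE_LOG = """\
2020-09-14T10:37:08.530672+00:00 err vmdird t@140406020888320: SASLSessionStep: sasl error (-13)(SASL(-13): authentication failure: client evidence does not match what we calculated. Probably a password error)
2020-09-14T10:37:08.531493+00:00 err vmdird t@140406020888320: Bind Request Failed (128.33.45.35) error 49: Protocol version: 3, Bind DN: "cn=SITE_B_FQDN,ou=Domain Controllers,dc=vsphere,dc=local", Method: SASL
2020-09-14T10:37:33.966230+00:00 err vmdird t@140406750693120: VmDirSafeLDAPBind to (ldap://SITE_B.FQDN:389) failed. SRP(9234)
"""


def bind_failures(log_text):
    """Return a dict for every 'Bind Request Failed' line in the log text."""
    failures = []
    for line in log_text.splitlines():
        match = BIND_FAILURE.match(line.strip())
        if match:
            record = match.groupdict()
            record["error"] = int(record["error"])
            failures.append(record)
    return failures


def summarize(failures):
    """Count failures per (client address, bind DN, LDAP error code)."""
    return Counter((f["client"], f["bind_dn"], f["error"]) for f in failures)


if __name__ == "__main__":
    for key, count in summarize(bind_failures(SAMPLE_LOG)).items():
        print(count, key)
```

Running this against the logs from both sites would show whether each side is rejecting the other's account under "ou=Domain Controllers", or whether only one direction is failing.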

Lalegre · ‎09-14-2020

I would go with that and give it a try, because clearly your replication is not working and maybe the agreement got lost during the upgrade. However, take a snapshot of all the PSCs and vCenters before going ahead with the procedure.

JulietDeltaGolf · ‎09-14-2020

We tried this, all the steps seemed to work (removing and then creating) but the end result is the same:

Site A still says:

2020-09-14T13:11:36.780423+00:00 err vmdird t@140484043319040: VmDirSafeLDAPBind to (ldap://site_b_fqdn:389) failed. SRP(9234)

2020-09-14T13:12:04.211459+00:00 err vmdird t@140483967784704: SASLSessionStep: sasl error (-13)(SASL(-13): authentication failure: client evidence does not match what we calculated. Probably a password error)

2020-09-14T13:12:04.327158+00:00 err vmdird t@140483967784704: VmDirSendLdapResult: Request (Bind), Error (49), Message ((49)(SASL step failed.)), (0) socket (128.33.45.35)

2020-09-14T13:12:04.327454+00:00 err vmdird t@140483967784704: Bind Request Failed (128.33.45.35) error 49: Protocol version: 3, Bind DN: "cn=site_b_fqdn,ou=Domain Controllers,dc=vsphere,dc=local", Method: SASL

Site B still says:

2020-09-14T13:07:03.814559+00:00 err vmdird t@139719966971648: VmDirSafeLDAPBind to (ldap://site_a_fqdn:389) failed. SRP(9234)

2020-09-14T13:07:05.284057+00:00 err vmdird t@139719111341824: SASLSessionStep: sasl error (-13)(SASL(-13): authentication failure: client evidence does not match what we calculated. Probably a password error)

2020-09-14T13:07:05.284427+00:00 err vmdird t@139719111341824: VmDirSendLdapResult: Request (Bind), Error (49), Message ((49)(SASL step failed.)), (0) socket (10.172.252.16)

2020-09-14T13:07:05.284618+00:00 err vmdird t@139719111341824: Bind Request Failed (10.172.252.16) error 49: Protocol version: 3, Bind DN: "cn=site_a_fqdn,ou=Domain Controllers,dc=vsphere,dc=local", Method: SASL

Lalegre · ‎09-14-2020

Well, it was good to give it a try. Now try following this procedure: https://vstack.it/2020/03/10/400-an-error-occurred-while-processing-the-authentication-response-from...

It deals with exactly the same errors as yours, so I think it can help.

JulietDeltaGolf · ‎09-15-2020

We have moved things forward (using various machine account reset procedures, including this one: VMware Knowledge Base) and now Site B is able to browse Site A. However, it seems the root problem is with Site B, as Site A still cannot browse it.

Site A now reports:

2020-09-14T16:03:32.473567+00:00 err vmdird t@139750677653248: _VmDirFetchReplicationPage: error: 53 filter: 'uSNChanged>=90807' requested: 1000 received: 0 usn: 90806 utd: '765c1341-c05f-11e5-ae51-000c29865313:95793,'

Which isn't very interesting, however site B now says this:

2020-09-14T15:59:31.962756+00:00 err vmdird t@140568298514176: VmDirSendLdapResult: Request (Search), Error (53), Message (Server in not in normal mode, not allowing outward replication.), (0) socket (ip.ip.ip.ip)

A bit more googling around led us to this command and output:

# /usr/lib/vmware-vmafd/bin/dir-cli state get

Enter password for administrator@vsphere.local:

Directory Server State: Failure (5)

There are a couple of articles about this that suggest simply changing the state, but this CLI output implies that option is no longer available with the 6.7 VCSA:

# /usr/lib/vmware-vmafd/bin/dir-cli state set --state NORMAL
Enter password for administrator@vsphere.local:
dir-cli failed. Error 9001: Possible errors:
LDAP error: Operations error
Win Error: Operation failed with error ERROR_INVALID_FUNCTION (1)

and trying to follow other options to reset the password doesn't seem to work either:

# vcenter-restore -u administrator
Please enter SSO Admin Password:
Restore of embedded node is not supported via this script. Exiting.

This seemed to work as a script but didn't resolve the problem:

/usr/lib/vmware-vmafd/bin/dir-cli computer password-reset --login administrator --live-dc-hostname fbsshefvc.fletchers.corp --password XXXXXXX

It feels like we are at the root cause now though, there is a problem with the PSC or 'Directory Server' on Site B.
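For anyone who wants to check this state across several nodes, the `dir-cli state get` output shown earlier in this post is easy to parse. This is a small Python sketch, assuming the output contains a line of the form "Directory Server State: <name> (<code>)" as quoted above; `parse_directory_state` and `is_normal` are hypothetical names, and treating the name "Normal" as the healthy state is an assumption based only on the `--state NORMAL` argument used above.

```python
import re

# Assumption: the state line looks like "Directory Server State: Failure (5)".
STATE_LINE = re.compile(
    r"Directory Server State:\s*(?P<name>.+?)\s*\((?P<code>-?\d+)\)"
)

# Output copied from the dir-cli run quoted earlier in this post.
SAMPLE_OUTPUT = """\
Enter password for administrator@vsphere.local:
Directory Server State: Failure (5)
"""


def parse_directory_state(output):
    """Return (state name, numeric code), or None if no state line is present."""
    for line in output.splitlines():
        match = STATE_LINE.search(line)
        if match:
            return match.group("name"), int(match.group("code"))
    return None


def is_normal(state):
    """Assumption: a healthy node reports the state name 'Normal'."""
    return state is not None and state[0].lower() == "normal"


if __name__ == "__main__":
    state = parse_directory_state(SAMPLE_OUTPUT)
    print(state, "normal" if is_normal(state) else "NOT normal")
```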

Lalegre · ‎09-15-2020

If you check the vmdird logs, do you see any issues? Have you also tried restarting the PSCs? Replication issues can be really tough, so at this stage I recommend you open a ticket with VMware GSS.

Also, if you cannot find a fix for the directory service, something you could do is deploy a new PSC, join it to the same SSO domain pointing to Site A, and then repoint the vCenter Server to it.

JulietDeltaGolf · ‎09-16-2020

This is the embedded PSC in the VCSA.

We raised it with support, they tried some magic scripts that were supposed to:

1: Change domain state of broken PSC to 0
2: Decommission broken PSC from the healthy PSC
3: Re-join the domain using the data.mdb of a healthy PSC (and therefore fix replication)

But they failed pretty hard and we had to revert everything to snapshots again. We're worried the next suggestion is just going to be 'deploy a new VCSA'.

Vijay2027 · ‎09-17-2020

This could be a data inconsistency between the VMDIR instances.

Run the command below on both VC instances and make sure the version is 6.7:

/usr/lib/vmidentity/tools/scripts/lstool.py list --url https://localhost/lookupservice/sdk --type vcenterserver --no-check-cert | grep -i -A 10 "Service Type:"

Sample Output:

Service Type: vcenterserver
Service ID: 7444d753-3128-41b4-8f47-211728c725d7
Site ID: default-site
Node ID: 115400a2-2ea6-47cf-9e3d-1834d55e7e9d
Owner ID: vpxd-a8508516-7192-49be-b4df-ee60a25af700@vsphere.local
Version: 6.7
Endpoints:
Type: com.vmware.vim.extension
Protocol: vmomi
URL: https://vcsa.test.org:443/sdkTunnel
SSL trust:

In your case you will see 2 "Service Type: vcenterserver" on each node.
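If the output is long, the version check can be scripted. Below is a minimal Python sketch, assuming the `lstool.py list` output uses the "Key: value" layout of the sample above. `parse_lstool` and `version_mismatches` are hypothetical names, and the second registration in `SAMPLE` (node ID and version 6.0) is invented purely to show what a mismatch would look like.

```python
# Sketch only: parses text shaped like the lstool.py sample output above.

SAMPLE = """\
Service Type: vcenterserver
Service ID: 7444d753-3128-41b4-8f47-211728c725d7
Site ID: default-site
Node ID: 115400a2-2ea6-47cf-9e3d-1834d55e7e9d
Version: 6.7
URL: https://vcsa.test.org:443/sdkTunnel
--
Service Type: vcenterserver
Node ID: hypothetical-second-node
Version: 6.0
"""


def parse_lstool(text):
    """Return one dict per 'Service Type:' entry, keeping a few fields of interest."""
    services = []
    current = None
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # separators such as "--" from grep -A carry no data
        key, value = key.strip(), value.strip()
        if key == "Service Type":
            current = {"type": value, "urls": []}
            services.append(current)
        elif current is None:
            continue
        elif key == "Version":
            current["version"] = value
        elif key == "Node ID":
            current["node_id"] = value
        elif key == "URL":
            current["urls"].append(value)
    return services


def version_mismatches(services, expected="6.7"):
    """Return vcenterserver entries whose version differs from the expected one."""
    return [
        s for s in services
        if s["type"] == "vcenterserver" and s.get("version") != expected
    ]


if __name__ == "__main__":
    services = parse_lstool(SAMPLE)
    print(len(services), "vcenterserver registrations found")
    for s in version_mismatches(services):
        print("unexpected version:", s.get("node_id"), s.get("version"))
```

In a two-node linked setup, two registrations with no mismatches on both nodes would match the expectation described above.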

JulietDeltaGolf · ‎09-29-2020

Thanks but they were both 6.7.

We didn't have any more time to troubleshoot so we ended up forcefully decommissioning the failed PSC/VCSA and deploying a new one.