srodenburg
Expert

Error "network configuration is out of sync" after upgrading 2-node to 7 U3c

Hello,

I upgraded a perfectly healthy 2-node ROBO cluster from 7.0 U2b to 7.0 U3c, including the witness. vCenter is also on U3c.

Directly afterwards, Skyline Health had exactly 1 error: "'vSAN Cluster Configuration Consistency' -> network configuration is out of sync" between the two vSAN nodes (the witness is not on the list).

Using the "remediate inconsistent configuration" button does nothing to fix the issue. I rebooted both nodes and the witness, but no dice: the error is still there.
Running "localcli vsan cluster unicastagent list" on both nodes and the witness shows a perfectly healthy cluster with all the UUIDs and so on in the right places. So that's not it.

Any ideas on how to find out why vCenter thinks the network config is out of sync between the two data nodes? I cannot find any discrepancies between them. We don't use DVS, by the way; with 2 VLANs it's overkill. This cluster was fine before the upgrade and none of this makes any sense.



TheBobkin
VMware Employee

@srodenburg Are the nodes perchance set to ignore vC membership updates?

Checkable via (should be '0'):

# esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates

If the health check itself doesn't say why it detected an inconsistency, the returned data that triggered it is logged in /var/log/vmware/vsan-health/vmware-vsan-health-service.log, and checking that may give some further insight.

srodenburg
Expert

Thanks. /VSAN/IgnoreClusterMemberListUpdates is "0" on all 3 hosts.

But I found this in the log. I anonymized the server names:

server1.somecompany.org is vSAN data node #1
server2.somecompany.org is vSAN data node #2
server3 is the witness.

2022-04-08T10:32:09.424+02:00 ERROR vsan-mgmt[14611] [VsanVcClusterHealthSystemImpl::_checkUnicastConfigIssues opID=074ffc34] Unicast info 169.254.253.183 was not found on host server1.somecompany.org
2022-04-08T10:32:09.424+02:00 ERROR vsan-mgmt[14611] [VsanVcClusterHealthSystemImpl::_checkUnicastConfigIssues opID=074ffc34] Unicast info 169.254.253.181 was not found on host server2.somecompany.org

Nothing about server3 (the witness), but then again, the health check only complains that the two data nodes have an out-of-sync network config.

These two cluster nodes use dual vSAN vmks, with the second one in network 169.254.253.0/24, so that is what the error is about.

server1's second vSAN vmk has IP address 169.254.253.181
server2's second vSAN vmk has IP address 169.254.253.183
So the error seems to be that each server has no unicast info for the other node's second vSAN vmk.
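
For anyone who wants to pull these host/IP pairs out of a longer vmware-vsan-health-service.log, here is a minimal Python sketch. It assumes the lines look exactly like the two `_checkUnicastConfigIssues` lines quoted above; the regex and the function name `missing_unicast_entries` are my own and not part of any VMware tooling.

```python
import re

# Pattern inferred only from the two log lines quoted in this thread;
# other builds may word the message differently (assumption).
UNICAST_MISSING = re.compile(
    r"Unicast info (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"was not found on host (?P<host>\S+)"
)


def missing_unicast_entries(log_lines):
    """Return de-duplicated (host, missing_ip) pairs in first-seen order."""
    pairs = []
    for line in log_lines:
        match = UNICAST_MISSING.search(line)
        if match:
            pair = (match.group("host"), match.group("ip"))
            if pair not in pairs:
                pairs.append(pair)
    return pairs


sample = [
    "2022-04-08T10:32:09.424+02:00 ERROR vsan-mgmt[14611] "
    "[VsanVcClusterHealthSystemImpl::_checkUnicastConfigIssues opID=074ffc34] "
    "Unicast info 169.254.253.183 was not found on host server1.somecompany.org",
    "2022-04-08T10:32:09.424+02:00 ERROR vsan-mgmt[14611] "
    "[VsanVcClusterHealthSystemImpl::_checkUnicastConfigIssues opID=074ffc34] "
    "Unicast info 169.254.253.181 was not found on host server2.somecompany.org",
]

for host, ip in missing_unicast_entries(sample):
    print(f"{host} has no unicast entry for {ip}")
```

The health check repeats these lines on every run, which is why the sketch de-duplicates the pairs instead of printing every hit.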

The first and original vSAN vmk is in a 10.223 network and in a different VLAN. So somehow, since the upgrade, it's not happy about the second vSAN vmk anymore (granted, it is on an unusual network).
I simply disabled the vSAN service on both servers' second vSAN vmk and voilà, the error is gone. When I re-enable vSAN on them, the error comes back.

So it seems 7.0 U3c has an issue with that second vSAN vmk now. Or maybe it started to dislike people using 169.254.x.x networks (which, again, is an unusual thing to do in this context)?
(The customer told me they chose that IP range because the network team refused to give them a new VLAN and normal IP addresses for the second vSAN vmks. Dual vSAN vmks worked fine all the way up to 7.0 U2b, which was the version this cluster was upgraded from.)

Note: I'm not seeking a discussion about the pros and cons of having dual vSAN vmks. I know of many clusters that have this, including U3c versions, but they all use normal networks for them, not 169.254 APIPA ranges.

srodenburg
Expert

Update: out of sheer curiosity, I changed the second vSAN vmk IP addresses from APIPA to normal IPs, re-enabled the vSAN service on both second vSAN vmks and voilà, the error is gone.

So it really seems that U3c does not like APIPA addresses (169.254.x.x) anymore. At least not for vSAN.

Problem solved. Their network team must give them proper IPs and then it will be fine.

TheBobkin
VMware Employee

Yes, vSAN can't use, and hence won't use, APIPA IP addresses for vSAN traffic, and that has been the case for a long time. It has been properly blocked since some time around 6.0 U2 or so; before that it worked for some but not all addresses in that range, but it was never supposed to be used or usable. Good to see it is relatively easy to find the reason from that log. The reason you don't see other failures in health (e.g. the unicast ping test) is likely that their pass criteria were still met: vmkping probably worked fine between the IPs, but the addresses still aren't usable. I guess that is a gap in the tests. It is pretty corner-case, but maybe there should be a health test or sub-definition that states not to use APIPA.

Either way, you are better off adding redundancy at the vmnic level on a single vmk than using dual networks (IMO).
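
Until a health test like that exists, the APIPA check is easy to do by hand before an upgrade. A minimal Python sketch, assuming you have already collected the vSAN vmk IPs per host into a dict yourself (the function name `apipa_vsan_addresses` and the inventory layout are hypothetical, and the 10.223.x.x addresses are placeholders since the thread only mentions "a 10.223 network"):

```python
import ipaddress


def apipa_vsan_addresses(vsan_vmk_ips):
    """Return {host: [ip, ...]} for vSAN vmk IPs inside 169.254.0.0/16."""
    flagged = {}
    for host, ips in vsan_vmk_ips.items():
        # is_link_local is True for IPv4 addresses in 169.254.0.0/16 (APIPA).
        bad = [ip for ip in ips if ipaddress.ip_address(ip).is_link_local]
        if bad:
            flagged[host] = bad
    return flagged


# Hand-collected inventory; the 10.223.0.x values are placeholders.
inventory = {
    "server1.somecompany.org": ["10.223.0.1", "169.254.253.181"],
    "server2.somecompany.org": ["10.223.0.2", "169.254.253.183"],
}

for host, ips in apipa_vsan_addresses(inventory).items():
    print(f"{host}: vSAN vmk on APIPA address(es) {', '.join(ips)}")
```

This only flags the address range; it says nothing about whether vmkping between the addresses works, which is exactly the case that slipped past the existing health tests here.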
