Solved: vSAN Stretched Cluster Networking

MJMSRI · ‎08-27-2020

Hi All,

We have 3 sites with the setup below:

DC1 = 3 x ESXi vSAN Hosts. All hosts vSAN VMKernel on 172.0.16.x / 255.255.255.0 / Gateway 172.0.16.254 VLAN 220
- Layer 2 network
DC2 = 3 x ESXi vSAN Hosts 172.0.17.x / 255.255.255.0 / Gateway 172.0.17.254 VLAN 220
- Layer 2 network
Office = 1 x vSAN Virtual Appliance Witness. VMK1 WitnessPg setup as 192.168.200.20 VLAN 200
- Layer 3 network with static route to both DC1 and DC2 VLAN220

Questions are:

Can you see any reasons why the two DCs have different ip range and different gateways considering they are both on /16 so same subnet, same VLAN and have L2 connection between the DC's?
Can you see any problem with changing the vSAN VMK in DC2 to be on same IP Range and same Gateway as DC1?

Thanks,

TheBobkin · ‎08-27-2020

Hello MJMSRI,

"1. Can you see any reasons why the two DCs have different ip range and different gateways considering they are both on /16 so same subnet, same VLAN and have L2 connection between the DC's?"

- Per your notes these have subnet mask 255.255.255.0 - these are different /24 subnets not a /16 which would be 255.255.0.0.
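To make the /24 vs /16 point concrete, here is a minimal sketch using Python's standard `ipaddress` module. The host addresses 172.0.16.11 and 172.0.17.11 are made-up examples (the post only gives 172.0.16.x and 172.0.17.x), and `same_subnet` is just an illustrative helper name.

```python
import ipaddress


def same_subnet(ip_a: str, ip_b: str, netmask: str) -> bool:
    """Return True if both addresses fall in the same network for this mask."""
    net_a = ipaddress.ip_interface(f"{ip_a}/{netmask}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{netmask}").network
    return net_a == net_b


# Hypothetical vSAN vmk addresses, one per DC (not taken from the thread).
dc1_host = "172.0.16.11"
dc2_host = "172.0.17.11"

# Mask as posted (255.255.255.0 = /24): two separate networks.
print(same_subnet(dc1_host, dc2_host, "255.255.255.0"))  # False

# A /16 mask (255.255.0.0) would put both in 172.0.0.0/16.
print(same_subnet(dc1_host, dc2_host, "255.255.0.0"))    # True
```

So whether the two DCs are "the same subnet" depends entirely on the mask actually configured on the vmk ports, which is worth confirming on the hosts rather than from notes.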

"2. Can you see any problem with changing the vSAN VMK in DC2 to be on same IP Range and same Gateway as DC1?"

- I think you need to validate whether L2 connectivity between the sites is possible here before considering this.

Bob

MJMSRI · ‎08-27-2020

Hi TheBobkin thanks for the reply.

The /16 was a typo and meant to read /24.

As detailed in the notes, the current networking between the DC's is L2 and the connection from Witness site to DC's is L3.

TheBobkin · ‎08-27-2020

Hello MJMSRI,

"The /16 was a typo and meant to read /24"

Sorry to reiterate but they are separate /24 (subnet mask 255.255.255.0) networks - if you are 100% positive that they can all be moved to the 172.0.16.x or 172.0.17.x network then you can of course do this. However, before doing this I would advise engaging whoever initially configured it like this, as they potentially had a valid reason for doing so (and as an aside, I see similar configurations from time to time in normal functional environments).

Bob

MJMSRI · ‎08-27-2020

Hi TheBobkin, apologies, I have mixed up the info on this one. The vSAN networks in each DC are both /16, so 172.0.16.x and 172.0.17.x are both on the same subnet.

regarding the different IP Range at each DC, i am trying to see why that is in place and the incumbent is not around anymore to discuss. Looking at this link there is no mention to do this: https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vsan-planning.doc/GUID-39E2C6A6-0D7...

Network Design for Stretched Clusters

However this document does allude to this detailing 'IP address on vSAN network on site 1' and 'IP address on vSAN network on site 2'

Cluster Settings – vSphere HA | vSAN Stretched Cluster Guide | VMware

TheBobkin · ‎08-27-2020

Hello MJMSRI

"regarding the different IP Range at each DC, i am trying to see why that is in place and the incumbent is not around anymore to discuss."

Is there any documentation or other resources (e.g. email threads) that might elaborate on why this design was chosen?

Can you check what the das.isolationAddressX are set to for HA in this cluster? This may be able to (to an extent) confirm if each gateway was configured for this on each site and thus why they configured it like this (as opposed to just configuring this for a virtual IP in this subnet on this site instead of the default gateway).

"However this document does allude to this detailing 'IP address on vSAN network on site 1' and 'IP address on vSAN network on site 2'"

My understanding of this has always been that it doesn't need to be (and maybe shouldn't be?) the DG IP and should be an addressable IP in the same subnet as the vSAN hosts on that site. depping or GreatWhiteTec might be able to elaborate on whether this is the case (and/or whether the DG vs an IP in range is beneficial/detrimental) as they tend to eat such queries for breakfast.

But going back to your original question, if they are all in the same /16, 172.0.x.x network then there should be no problem putting them all in the same 172.0.[N].x - however that being said, please please please (on behalf of GSS and anyone else that tends to fix things when they go sideways!), please validate with a single node in Maintenance Mode that switching the network causes no partition before carefully proceeding one node at a time to do the rest (e.g. I wouldn't advise scripting it to do all at once).

You can even have a plan B when doing this: add a new vmk in the desired IP range, enable vSAN traffic on it, then disable vSAN traffic on the original. Then validate that the host stays clustered (you can also check the vmnic + vmk traffic via esxtop 'n' to see it switch over). If it doesn't work as expected (and you partitioned the host from the cluster), you can simply re-enable vSAN traffic on the original vmk. (It is advisable to do this with the node in MM; you may not see any/much traffic in esxtop in this state, but you should still see the cluster membership change from 'esxcli vsan cluster get' or by monitoring clomd.log for changes to add/remove CdbObjectNode.)
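As a rough illustration of the "validate that it stays clustered" step, the sketch below parses saved output of 'esxcli vsan cluster get' and compares the member count with what you expect. It relies only on the "Sub-Cluster Member Count" field; the sample text is abbreviated and illustrative rather than captured from a real host, the function names are hypothetical, and the expected count of 7 assumes 3 + 3 data hosts plus the witness all being counted as members.

```python
def parse_cluster_get(text: str) -> dict:
    """Turn 'Key: Value' lines from 'esxcli vsan cluster get' into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, e.g. the heading line
            fields[key.strip()] = value.strip()
    return fields


def is_fully_clustered(text: str, expected_members: int) -> bool:
    """True if the reported member count matches the expected cluster size."""
    fields = parse_cluster_get(text)
    count = int(fields.get("Sub-Cluster Member Count", "0"))
    return count == expected_members


# Illustrative, abbreviated sample (placeholder values, not real host output).
SAMPLE = """\
Cluster Information
   Enabled: true
   Local Node State: AGENT
   Sub-Cluster Member Count: 7
   Maintenance Mode State: ON
"""

if is_fully_clustered(SAMPLE, expected_members=7):
    print("Host still sees the full cluster")
else:
    print("Possible partition: re-enable vSAN traffic on the original vmk")
```

Running the check before and after switching the vSAN-tagged vmk on the single test node gives a simple pass/fail signal before moving on to the next host.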

Bob

depping · ‎08-28-2020

Yeah good point, they could have indeed created 2 different networks to ensure they will have an "isolation address" per location. These isolation addresses should be specified in the advanced settings, you should see two mentioned in there. And there are 2 because you have 2 locations, and both locations will need a "site local address" which is reliable and pingable by the hosts in that location even when the link between locations is down. This could be why they designed it the way it is today.

Also, why would you change the design if it is working as designed and there are no problems? Are those IP Ranges needed for something? Or do you just prefer to see them in the same network?

MJMSRI · ‎09-04-2020

Hi depping, I did check the cluster settings and the advanced setting for isolation is not set, so this is not why the cluster was set up with this networking, although as you say it would be a good way to set this advanced parameter.

As TheBobkin suggested, I did contact the incumbent to ask why this was set like this, and it seems they set the 2 DCs with different vSAN network IP ranges so the Witness could distinguish between the sites. However, of course, with a stretched cluster that is why Fault Domains are configured. The request to change this is coming from the networking team, so at this stage I am looking to see if this is possible before changing anything. However, it seems it may be best to leave it as it is and then set the isolation addresses.

depping · ‎09-08-2020

I would highly recommend then to set two isolation addresses using the advanced settings as discussed in our documentation.
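For anyone following along, here is a small sketch of what the two isolation address entries could look like, with a sanity check that each address sits inside its own site's vSAN network. It assumes the per-site gateways from the original post (172.0.16.254 and 172.0.17.254) are used as the isolation addresses and that the networks are the /24s as first posted; as discussed above, another reliable site-local IP may be preferable to the gateway. `isolation_settings` is a hypothetical helper that only builds the key/value pairs, it does not apply anything to vCenter.

```python
import ipaddress


def isolation_settings(site_addresses):
    """Build vSphere HA advanced options from (isolation_ip, site_network) pairs.

    Raises ValueError if an isolation address is outside its site's network.
    """
    # Assumption: the default isolation address (management gateway) is disabled
    # so that only the site-local addresses below are used.
    settings = {"das.usedefaultisolationaddress": "false"}
    for index, (ip, network) in enumerate(site_addresses):
        if ipaddress.ip_address(ip) not in ipaddress.ip_network(network):
            raise ValueError(f"{ip} is not inside {network}")
        settings[f"das.isolationaddress{index}"] = ip
    return settings


# Assumed values: gateways and /24 networks taken from the first post.
settings = isolation_settings([
    ("172.0.16.254", "172.0.16.0/24"),  # DC1 site-local address
    ("172.0.17.254", "172.0.17.0/24"),  # DC2 site-local address
])

for key, value in settings.items():
    print(f"{key} = {value}")
```

The resulting options would then be entered under the cluster's vSphere HA advanced options, one isolation address per site.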