stinkyptoz
Contributor

vSAN 2-node directly connected cluster - management switch ports shutting down.

Hi all,

I've recently adopted a 2-node vSAN cluster that has an issue whereby, when a vMotion is initiated from one host to the other, the host seems to send a burst of traffic via the management NIC, which forces the switch to shut the port down... and I then obviously lose the host in vCenter.

I've created a standard switch for the management traffic and a distributed switch for the vMotion/vSAN traffic. I've also configured the traffic types/services for each VMkernel adapter correctly.

My DVUplinks are configured correctly for my distributed switch (as per the physical ports on my hosts), and the vmnics for my standard switch are likewise configured correctly as per the physical setup/NICs on the hosts.

I'm thinking that the gateway address for my vSAN/vMotion VMkernel adapters is possibly the issue. I've tried to set them to have no gateway (as these are directly connected hosts and only need connectivity to each other for vSAN/vMotion traffic), but it won't allow me to do that.

Does anyone have any ideas, or can you at least point me in a likely direction if it's not what I'm thinking above? I'm only using the default TCP/IP stack as it stands. I'm guessing best practice is to separate out the vMotion and vSAN traffic using separate TCP/IP stacks - I guess this is one way to specify the gateway addresses for the vMotion and vSAN traffic, but would I just leave them blank?
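For reference, this is roughly how I've been checking the stack/gateway settings from the ESXi shell (the vmk numbers below are just examples from my setup):

    # List the TCP/IP stacks present on the host (defaultTcpipStack, vmotion, etc.)
    esxcli network ip netstack list

    # Show every VMkernel adapter, its portgroup and which netstack it belongs to
    esxcli network ip interface list

    # IPv4 address/netmask of a specific vmk (vmk1 here is an example)
    esxcli network ip interface ipv4 get -i vmk1

    # Routes per netstack - for directly connected vSAN/vMotion I'd expect only the
    # local subnet route, with no default gateway needed
    esxcli network ip route ipv4 list -N defaultTcpipStack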

I was wondering if this article perhaps describes my issue? Any thoughts? ( https://www.yellow-bricks.com/2017/11/22/isolation-address-2-node-direct-connect-vsan-environment/ )

 

Cheers in advance,

Pete

 

 

 

5 Replies
depping
Leadership

You shouldn't need a separate TCP/IP stack; this should just work, with or without an isolation address specified. (That article has to do with HA, not vMotion.) A gateway shouldn't even be used - this is just L2 traffic. You would configure one host with, say, 10.0.0.1 and the other with 10.0.0.2 and a standard subnet mask of 255.255.255.0, as the traffic won't go outside that network anyway.
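As a quick check of the back-to-back links, you can do a forced-interface ping from one host to the other (the IPs and vmk name here are just examples):

    # From host 1, ping host 2's vSAN/vMotion IP out of the dedicated vmk
    # -d = don't fragment, -s 1472 tests a standard 1500 MTU (use -s 8972 for jumbo frames)
    vmkping -I vmk1 -d -s 1472 10.0.0.2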

I am assuming that when you say vMotion, this is a live migration, aka powered-on VM?

I am assuming you double checked that vMotion isn't accidentally enabled on the Management VMkernel port?

I am also assuming you double checked that the correct adapters have been assigned to the vMotion and vSAN VMkernel ports?
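Both of those can be checked quickly from the ESXi shell with something like the below (vmk0 as the usual Management vmk - adjust to your own numbering):

    # Services tagged on the Management vmk - should list Management only, not VMotion
    esxcli network ip interface tag get -i vmk0

    # Which vmk(s) vSAN is actually using for its traffic
    esxcli vsan network list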

I personally have never seen this to be honest...

 

TheBobkin
Champion

@stinkyptoz Are you by any chance multihoming the Management and vMotion networks (e.g. both in the same subnet) and unintentionally sending the vMotion traffic out of the Management vmk?

 

If not that, then I would advise confirming that the Management and vMotion vmks are using different uplinks (which uplink is actively in use can be confirmed from the esxtop 'n' network view).
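For example (esxtop is interactive, so run it on the host itself):

    # Press 'n' in esxtop for the network view; the TEAM-PNIC column shows which
    # physical vmnic each vmk/port is actually using right now
    esxtop

    # And to confirm the expected physical NICs are up at all
    esxcli network nic list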

 

If it's neither of these, then are you sure there isn't something else occurring during the vMotion that is causing the host to become non-responsive?

 

Are you by any chance just noting this while vMotioning the vCenter VM itself?

stinkyptoz
Contributor

Hi depping,

Thanks for the reply first off! 🙂

So this is/was a live migration... vMotion appears to be configured correctly - as in, the vMotion service is specified only on the dedicated vMotion VMkernel port (with an IP on a different address space/subnet to the Management VMkernel adapter).

Yes, I have ensured that the correct physical NICs in the server are associated with the correct vmnics on the host. I have vmnic0 and vmnic1 used for my Management VMkernel adapter (10.111.60.10) on my standard switch (these are obviously patched into the physical switch in the datacentre). Then vmnic4 and vmnic5 are used as the DVUplinks for my distributed switch, which has two VMkernel adapters - one for vSAN and one for vMotion (each with its own IP address on a different network, e.g. vSAN is 10.10.60.20 and vMotion is 10.101.60.30).

I'm now wondering if it could be something to do with the way the vSAN has been set up, or possibly some sort of issue/misconfiguration with the directly connected SFPs/cables/dual-port 10GbE card that the two hosts are connected by (vmnic4 and vmnic5 in vSphere).
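For what it's worth, the direct-connect ports can be checked from the ESXi shell along these lines (vmnic4/5 as per my layout above):

    # Link state, speed and duplex of all physical NICs
    esxcli network nic list

    # Driver/firmware details for one of the 10GbE ports
    esxcli network nic get -n vmnic4

    # Packet and error counters - receive/CRC errors here would point at the SFPs/cables
    esxcli network nic stats get -n vmnic4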

Bit of a head scratcher this one!! 😞

 

 

 

stinkyptoz
Contributor

Hi Bobkin,

Thanks for the reply 🙂

Definitely not multihoming the Management and vMotion networks - they are using separate physical connections into the physical server, separate vSphere switches with separate vmnics, and obviously separate VMkernel adapters with different IPs (Management = 10.111.60.10 and vMotion = 10.101.60.30). I've set the services, i.e. Management/vMotion/vSAN/witness traffic, on the correct VMkernel adapters.

I've never tried to vMotion the vCenter VM - in fact, that resides on a separate cluster in the datacentre. The vSAN cluster in question, with this issue, is just a 2-node directly connected setup in a remote office, using a witness appliance located in the main datacentre.

 

 

 

stinkyptoz
Contributor

Hi,

Update for anyone experiencing a similar issue:

The problem turned out to be the physical switch port configuration. The number of MAC addresses allowed per switch port was set to the default (1), so when a VM's MAC address moved from one ESXi host to the other during a vMotion, port security disabled the switch port and obviously disconnected the host from vCenter. 😞

https://www.geeksforgeeks.org/configuring-port-security-on-cisco-ios-switch/
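For anyone hitting the same thing, the switch-side fix looked roughly like the Cisco IOS example below (the interface name and MAC count are placeholders for your own environment - raising the maximum or relaxing the violation mode may suit your security policy better than disabling port security outright):

    ! See why the port went err-disabled and what port security is currently set to
    show port-security interface GigabitEthernet0/1

    ! Allow enough MAC addresses for the host plus any VMs that can land on it,
    ! instead of the default of 1
    configure terminal
    interface GigabitEthernet0/1
     switchport port-security maximum 10
     switchport port-security violation restrict
    end

    ! Bounce the port to bring it back up if it is currently err-disabled
    configure terminal
    interface GigabitEthernet0/1
     shutdown
     no shutdown
    end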

Hope this helps someone in the future....

Best Wishes,

Pete