VMware Cloud Community
dunxd
Contributor

Network Misconfiguration when adding first host to new cluster

I am building a new vSphere cluster from scratch.  I have installed ESXi on the first host and built a vCenter Server on a VM residing on that host (storage is on the local hard drive, although we have iSCSI targets which I can reach from the host).  The cluster is configured for HA.  When I try to add the host to the cluster, I get an error at the point where HA is configured - Cannot complete the .

I have stripped the host's network configuration down to the most basic setup: a single NIC attached to a single vSwitch, running the VMkernel port on VLAN 8, which is our management VLAN.  The vCenter server will have a network address on this VLAN, so I also set the initial Virtual Machine port group to this VLAN and connected the vCenter server's NIC to that port group.  I understand I can't connect the vCenter server to the VMkernel port group itself, but shouldn't I be able to connect it to a port group on the same VLAN?  If not, do I need to create a VLAN specifically for the VMkernel port group?  I plan to set up another port group for vMotion on a dedicated and isolated VLAN (i.e. the VLAN isn't routed), so that one wouldn't allow vCenter to communicate.

Does anyone have any suggestions, or other ideas for what might be causing the problem?  I've read through the documentation, but it isn't giving me any pointers, and the error message isn't telling me anything beyond that something is wrong with my network config.

1 Reply
dunxd
Contributor

OK - I tracked it down.  It wasn't a misconfiguration of my network.  It was a misconfiguration of the HA settings for the cluster.

The VMware hosts are connected to two Dell 6248 core switches with default routing pointing at a VRRP address.  That address isn't pingable, and by default HA uses the gateway as the isolation address.  I discovered this through the error messages on the host itself - the specific error (can't contact isolation address: ip address) isn't displayed at the cluster level.
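Before pointing HA at a candidate isolation address, it's worth confirming the address actually answers a ping, since that's essentially all the HA agent does.  A minimal sketch (the helper name and addresses are mine, not VMware's; assumes a Unix-style `ping` binary is on the path):

```python
import subprocess

def check_reachable(address, timeout_s=1):
    """Return True if `address` answers a single ICMP echo within timeout_s.

    Mirrors what the HA agent does with its isolation address: one ping
    with a short timeout.  Assumes a Unix-style `ping` is available.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

# Candidate isolation addresses: the switches' own IPs, not the shared
# VRRP address.  (The addresses below are placeholders.)
for addr in ["10.0.8.2", "10.0.8.3"]:
    print(addr, "reachable" if check_reachable(addr) else "NOT reachable")
```

Running this from a machine on the management VLAN before enabling HA would have flagged the unreachable VRRP gateway straight away.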

I fixed this by adding the following values to the advanced options for HA:
  • das.isolationaddress1 = core switch 1's own IP address (not the shared VRRP address)
  • das.isolationaddress2 = core switch 2's own IP address (not the shared VRRP address)
  • das.usedefaultisolationaddress = false
  • das.failuredetectiontime = 20000
I added the last one after reading somewhere that the failure detection time should be increased from 15 to 20 seconds when using more than one isolation address.  Perhaps this isn't necessary.
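One thing that's easy to trip over: das.failuredetectiontime is given in milliseconds, which is why 20 seconds appears as 20000 above.  A trivial sketch of the conversion (the function name is mine, not VMware's):

```python
def das_failure_detection_ms(seconds):
    """Convert a failure-detection window in seconds to the millisecond
    value expected by the das.failuredetectiontime advanced option."""
    return int(seconds * 1000)

# The default window is 15 s; bumped to 20 s here for the extra
# isolation addresses.
print(das_failure_detection_ms(20))  # -> 20000
```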
After adding these settings I disabled and re-enabled HA, and the error is no longer displayed for my host.