VMware Cloud Community
VTG38
Contributor

Issue with vDS (distributed switch) and VM network failover

Dear all,

We are having trouble properly configuring a vDS in vSphere / vCenter Server 6.5 so that a VM keeps its network connectivity when a network link is lost.

Our environment:

  • Two ESXi 6.5 hosts and a shared NAS.
  • Hosts configured in a cluster with HA enabled.
  • There is a vDS that uses two physical NICs (uplinks), one from each host. Both physical NICs are connected to the same physical switch. Teaming is configured on this vDS and on its port group.
  • The VM uses this port group.
  • The VM has a static IPv4 address.
  • The management network, vMotion network etc. use other vSwitches and NICs.

The HA feature works properly:

  • If one host is lost, the VM restarts on the other host (and is fully reachable through the network);
  • If the network path to the NAS is down on one host, the VM restarts on the other host (and is fully reachable through the network);
  • If we manually vMotion the VM from one host to the other, it also works fine.

What does not work is the following: the VM is running on one host, and we disconnect the network cable of that host's NIC used in the vDS. We expect inbound and outbound network traffic to keep flowing transparently (through the NIC of the other host, which is part of the same vDS). Instead, we can no longer ping the VM, and the VM can no longer ping the outside world.

Do you have any idea on how to fix this?

Is a vDS the appropriate feature to handle this type of failure? We assumed that the purpose of teaming on a vDS is to have traffic routed through the other NICs in the vDS that are still up.

Thank you very much.

Regards.

daphnissov
Immortal

What does not work is the following: the VM is running on one host, and we disconnect the network cable of that host's NIC used in the vDS. We expect inbound and outbound network traffic to keep flowing transparently (through the NIC of the other host, which is part of the same vDS). Instead, we can no longer ping the VM, and the VM can no longer ping the outside world.

Yes, and that's correct behavior. If you have only a single physical NIC from each ESXi host functioning as the uplink to the vDS, you don't have two uplinks per host, you have just one. When you disconnect that vmnic on host A, there is no way for the VM's traffic on host A to be routed over to host B's uplink to compensate. This isn't how a vDS is supposed to work. If you wish to have this type of protection against single vmnic failures, you must add a second vmnic per host to the vDS and team them.
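
To illustrate the point, here is a toy Python model. It is not VMware code, and the host and vmnic names are made up for the example. It only encodes the rule described above: teaming can fail over to another uplink on the same host, never to an uplink on a different host.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Host:
    name: str
    # vmnic name -> link state (True = link up); names are hypothetical
    uplinks: Dict[str, bool] = field(default_factory=dict)

    def vm_has_network(self) -> bool:
        # A VM on this host has connectivity only if at least one of
        # THIS host's uplinks to the vDS is still up.
        return any(self.uplinks.values())


# Setup from the original post: one vmnic per host on the vDS.
host_a = Host("esxi-a", {"vmnic2": True})
host_b = Host("esxi-b", {"vmnic2": True})

host_a.uplinks["vmnic2"] = False                 # pull the cable on host A
single_uplink_result = host_a.vm_has_network()   # host B's NIC cannot help

# Suggested fix: two teamed vmnics per host.
host_a = Host("esxi-a", {"vmnic2": True, "vmnic3": True})
host_a.uplinks["vmnic2"] = False                 # same failure as before
teamed_result = host_a.vm_has_network()          # vmnic3 takes over
```

With a single uplink per host the VM loses its network; with two teamed uplinks it survives the same cable pull.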

VTG38
Contributor

Thank you for this quick reply.

We had assumed that the VM might eventually be moved to the second host to recover its network connectivity (using the vMotion or management network).

So apparently we need to add more uplinks per host to the vDS, as you mentioned. The issue is that if all uplinks on the host are down, we end up with the same result (although the probability is much lower): the VM remains active, is not moved to the other host, and the service provided by that VM is unreachable.

My question then is: how can we automatically move a VM to another host if we lose all VM network links on its current host (apart from the vMotion and management networks)?

I was thinking about a component or script inside the VM that monitors some external IP address and uses the VMware automation toolkit to report the problem and trigger Proactive HA. Is that a possible solution to this problem (though it would require some development, I guess)?
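
As a rough idea of the monitoring part only, here is a minimal Python sketch, assuming a Linux guest with the standard `ping` command. The function names and the `on_network_lost` hook are hypothetical; the hook is where a call to whatever automation is chosen would go, and that part is not implemented here.

```python
import subprocess
import time
from typing import Callable, Optional


def ping_once(target: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo with the Linux `ping` command; True if answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), target],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watch(
    probe: Callable[[], bool],
    on_network_lost: Callable[[], None],
    threshold: int = 3,
    interval_s: float = 5.0,
    max_checks: Optional[int] = None,
) -> bool:
    """Call on_network_lost after `threshold` consecutive failed probes.

    Returns True if the hook fired, False if max_checks ran out first.
    """
    failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if probe():
            failures = 0          # any success resets the counter
        else:
            failures += 1
            if failures >= threshold:
                on_network_lost()  # hypothetical hook: report / trigger action
                return True
        time.sleep(interval_s)
    return False
```

It could be started with something like `watch(lambda: ping_once("192.0.2.1"), report_problem)`, where the address and `report_problem` are placeholders. One caveat with this design: if the VM's only network is the one that failed, the script inside the VM would need some other path to report the problem.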

Are there other, more elegant solutions?

It is nice to have an HA feature that restarts the VM when the server is down, but if nothing happens when all VM NICs are down, it only partially solves the problem.

Best regards

daphnissov
Immortal

My question then is: how can we automatically move a VM to another host if we lose all VM network links on its current host (apart from the vMotion and management networks)?

You don't. There isn't a scenario for this as an HA response because the VM itself is accessible. If you're really trying to guard against this, you must provide resiliency for the networks the VM communicates on. Plus, what's the likelihood that all links would be down for all VM traffic-related switches but they would be up for all kernel services? That's a pretty unlikely scenario.
