VMware Cloud Community
chojineke
Contributor

ESX HA restarts servers after VMotion network disconnection

We have two ESX hosts, each with 6 NICs. Each host has 3 NICs connected to one switch for the normal network and 3 NICs connected to another switch for the VMotion and backup network.

For maintenance reasons I had to disconnect the VMotion/backup network switch for about 15 minutes on one host. Since 3 NICs were still connected to the normal network (the same network VCMS is running on), I didn't think it would cause any problems.

Shutting down the switch was indeed no problem: after a small hiccup, VCMS still saw both ESX hosts and everything was working correctly.

But when I turned the switch on again, VCMS for some reason suddenly lost its connection to one ESX host for a short time, and HA kicked in to restart all VMs that were running on that host on the other ESX host. Meanwhile the lost ESX host became visible to VCMS again, but the deed was done: all VMs originally running on that host were killed and restarted on the other host, so users lost their connections to those servers.

Does anybody know why VCMS lost that one ESX host for some time, while 3 of the 6 NICs never lost their network connection?

11 Replies
AndreTheGiant
Immortal

Do you have a Service Console port on the VMotion vSwitches?

Is each network unique? Could there be a wrong netmask?

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
chojineke
Contributor

A Service Console is defined on both vSwitches.

The normal network on vSwitch0 is in the 172.20.0.0/255.255.0.0 range and the backup network on vSwitch1 is in the 172.21.0.0/255.255.0.0 range.
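
To make the "is each network unique, or is a netmask wrong?" question concrete, here is a minimal Python sketch that checks whether the two ranges quoted above overlap. The helper name `networks_are_distinct` and the mistyped mask `255.254.0.0` are illustrative assumptions, not values taken from the hosts.

```python
import ipaddress

# Ranges exactly as quoted in the post (ip_network accepts netmask notation).
normal_net = ipaddress.ip_network("172.20.0.0/255.255.0.0")
backup_net = ipaddress.ip_network("172.21.0.0/255.255.0.0")

# Hypothetical example of a mistyped netmask, for comparison only.
wrong_net = ipaddress.ip_network("172.20.0.0/255.254.0.0")


def networks_are_distinct(a, b):
    """Return True when the two networks share no addresses."""
    return not a.overlaps(b)


print(networks_are_distinct(normal_net, backup_net))  # True
print(networks_are_distinct(wrong_net, backup_net))   # False
```

With the masks as posted, the two /16 networks do not overlap, so an overlap would only arise if a mask on one of the Service Console ports differed from what is quoted here.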

nanair01
Enthusiast

I guess it could be caused by a misconfigured failover policy. The "normal network" switch became the active one when the VMotion switch went into standby mode.

But when the VMotion switch became ready to serve again, the failover policy made it the active one. Users then lost their connections because it is on a different network, and the VMs started to VMotion because VC presumed the host was in maintenance mode after it didn't receive heartbeats for 15 seconds.

I hope this helps!!!!

If you find this post helpful/rectify your problem do not forget to award points
chojineke
Contributor

I'm afraid I don't really understand your answer. I don't know where to configure the failover policy (apart from network failure detection at link level or by beacon probing), so I don't know what could be wrong with it.

The VMotion switch indeed went down, and with it the whole "backup network" (with the Service Console in the backup-network IP range) went down for one host, with no failover possible.

But why would the VMotion switch become the "active" one when it came back up? And active for what? The "normal network" (with the Service Console in the normal-network IP range) was still up and never went down.

nanair01
Enthusiast

Hey,

Open the vSwitch properties and click on the NIC Teaming tab.

What is Failback set to: Yes or No?

Also check the active adapters. Please post the answers here. I am sure it is a NIC teaming problem.

chojineke
Contributor

Failback is set to "Yes" and all NICs are set to "active" on both vSwitches.

There are no standby NICs defined, and none of the defined networks (normal, backup, vmotion, service console) override the vSwitch failover settings.

For some reason all NICs in the NIC Teaming tab of both vSwitches display "172.20.88.1-172.20.95.254" as their networks, while the "backup" vSwitch only contains IPs within the 172.21.0.0/16 range. I don't know if that has anything to do with it, but I can't seem to change it.
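
As a quick sanity check on that displayed range, the following Python sketch tests which of the two configured /16 networks the range 172.20.88.1-172.20.95.254 falls into. The observation that the range matches the usable hosts of 172.20.88.0/21 is my own arithmetic, not something stated in the client, and the helper name `range_within` is hypothetical.

```python
import ipaddress

# Range shown in the NIC Teaming tab, as quoted in the post.
observed_first = ipaddress.ip_address("172.20.88.1")
observed_last = ipaddress.ip_address("172.20.95.254")

# Configured ranges, as quoted earlier in the thread.
normal_net = ipaddress.ip_network("172.20.0.0/16")
backup_net = ipaddress.ip_network("172.21.0.0/16")


def range_within(first, last, net):
    """A contiguous range lies inside a network if both endpoints do."""
    return first in net and last in net


in_normal = range_within(observed_first, observed_last, normal_net)
in_backup = range_within(observed_first, observed_last, backup_net)
print(in_normal, in_backup)  # True False

# Assumption for illustration: the range looks like the hosts of a /21.
candidate = ipaddress.ip_network("172.20.88.0/21")
print(candidate[1] == observed_first, candidate[-2] == observed_last)  # True True
```

So the displayed range sits entirely inside the normal network and not at all inside the backup network, which matches the mismatch described above for the NICs on the backup vSwitch.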

nanair01
Enthusiast

Failback set to "Yes" means that when the VMotion switch comes back up it becomes the active one, and all the VMs try to use that network.

Moreover, from your description, my understanding is that all the vSwitches are misconfigured. The networks are misconfigured. Don't you think so?

chojineke
Contributor

But Failback is set to "Yes" on all vSwitches, and each of them should be the active one for its own networks (normal for vSwitch0, backup and vmotion for vSwitch1).

When the link on all NICs of vSwitch1 fails, as was the case, there is no alternative route for vmotion, the backup network, or the Service Console in the 172.21.0.0/16 range. But communication between the ESX hosts should then go over the "normal" network vSwitch in the 172.20.0.0/16 range.

The network ranges displayed on the NICs are indeed not correct, but I don't know how to change them. And what consequences do they have?

nanair01
Enthusiast

In this case I assume the VMotion happens only when the failed vSwitch returns to its active state, right?

chojineke
Contributor

When the physical switch (and thus vSwitch1) became functional again, no actual VMotion took place. The VMs were simply killed on one host and restarted on the other host (apparently because that one host was not reachable for a short time).
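
To illustrate how a short unreachable period could be enough to trigger restarts, here is a minimal Python sketch of timeout-based heartbeat detection. It is not the actual HA implementation: the 15-second timeout is the figure quoted earlier in this thread and is treated as an assumption, and the function name and timestamps are hypothetical.

```python
# Assumed timeout in seconds, taken from the figure quoted earlier in the thread.
HEARTBEAT_TIMEOUT_S = 15


def first_missed_window(heartbeat_times, timeout=HEARTBEAT_TIMEOUT_S):
    """Return (start, end) of the first gap between consecutive heartbeats
    that is longer than the timeout, or None if there is no such gap."""
    for prev, cur in zip(heartbeat_times, heartbeat_times[1:]):
        if cur - prev > timeout:
            return (prev, cur)
    return None


# Hypothetical heartbeat arrival times, in seconds.
steady = [0, 5, 10, 15, 20]
interrupted = [0, 5, 10, 30, 35]

print(first_missed_window(steady))       # None
print(first_missed_window(interrupted))  # (10, 30)
```

Under this model, a single gap of more than the timeout is enough for the host to be declared unreachable, even if heartbeats resume right afterwards, which would be consistent with the host reappearing after the restarts had already started.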

nanair01
Enthusiast

I am not sure what the problem is, then. I thought it could be a failover problem.
