JasonVmware
Enthusiast

Help with failover

Hello all

I have a lab setup with the network settings shown in the pictures provided below. When I pull one of the network cables from my physical NICs to simulate a NIC failure, not all of my VMs stay online; only half of them do. Is this a setup issue, or is this by design with Originating Port ID?

Any help would be greatly appreciated.

Accepted Solutions
SkyC
Enthusiast

Try changing the Network Failover Detection to beacon probing. Often the blade chassis will still present a link to the blade server even if the uplink is unplugged, so ESX doesn't detect the failure.

12 Replies
Mark_Bradley
Enthusiast

Hi,

When you say that half of the VMs stay online when you pull the cable to simulate a NIC failure, how long did you wait to see if the other VMs came back online?

It sounds like comms to some of the VMs are affected. If you open a console session to one of those VMs, can it communicate with the outside world?

Do all the VMs have the same OS/configuration?

__________________________________________________________________________________________ Check out my blog at http://www.ridethevirt.blogspot.com
JasonVmware
Enthusiast

Yes, all the VMs run the same OS (Windows Server 2008) except the VC server, which is 2003 R2 SP2. I waited about 15 minutes, which I figured was more than enough time for it to fail over. I did not try to ping out from the console; I will give that a try and let you know. The ESX host was still reporting in VC and could be controlled through VC, so I wouldn't imagine it was having communication issues, although it is possible. The Console port group is on a different set of NICs, though. I pulled the cable on vmnic1 to test a NIC failure on the VM Network, and half the VMs went offline. When I plugged the cable back in, the VMs came back online. When I pulled the cable plugged into vmnic3, the other half of the VMs went offline.

M__Y_
Enthusiast

Hi,

If you have a vSwitch with 2 NICs, the first guest will use the first NIC, the second guest will use the second NIC, the third VM will use the first NIC, the fourth VM will use the second NIC, etc.

When you disconnect one NIC, the VMs using that NIC will move to the other NIC after 5 seconds (in theory; if I remember correctly, the advanced parameter is Net.PortDisableTimeout = 5000). Given your configuration, they will not go back to their original NIC once they have moved to the other one (the failback parameter), unless the second NIC fails.

Regards.
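
The distribution and failover described above can be sketched as a toy model. This is only an illustration, not how ESX is implemented: the modulo mapping from port ID to uplink and the function names `assign_uplinks` and `fail_over` are assumptions made for the example, and the uplink names are the ones mentioned in this thread.

```python
def assign_uplinks(port_ids, uplinks):
    """Map each virtual port to an uplink by port ID.

    Assumption: a plain modulo stands in for the real selection logic.
    """
    return {port: uplinks[port % len(uplinks)] for port in port_ids}


def fail_over(assignment, detected_down, uplinks):
    """Move ports off any uplink the host has *detected* as down."""
    healthy = [u for u in uplinks if u not in detected_down]
    result = {}
    for port, uplink in assignment.items():
        if uplink in detected_down and healthy:
            # Re-pin the port to one of the remaining healthy uplinks.
            result[port] = healthy[port % len(healthy)]
        else:
            # Nothing detected (or nowhere to go): the port stays put.
            result[port] = uplink
    return result


uplinks = ["vmnic1", "vmnic3"]
before = assign_uplinks(range(4), uplinks)

# Failure detected: every port ends up on the surviving uplink.
detected = fail_over(before, {"vmnic1"}, uplinks)

# Failure NOT detected: nothing moves, so the ports pinned to the
# dead uplink (half of them in this model) lose connectivity.
undetected = fail_over(before, set(), uplinks)

print(before)
print(detected)
print(undetected)
```

In this model, half the ports stay pinned to the dead uplink whenever the failure is never detected, which matches the symptom of half the VMs going offline.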

depping
Leadership

They should fail over, though. My guess is that failback will occur as well, because it has been set at the vSwitch level instead of the portgroup level.

Duncan

VMware Communities User Moderator

If you find this information useful, please award points for "correct" or "helpful".

JasonVmware
Enthusiast

Thank you for your response. This is how I figured it would work; however, it would not switch over to the other NIC. 5 seconds is more than good enough as far as failover time is concerned, but I left it for a good 10-15 minutes without any failover occurring. Any thoughts on why it would not decide to fail over? I will also do some further testing in the next few days, so I will keep you all posted.

SkyC
Enthusiast

Are your hosts Blade servers?

depping
Leadership

That's weird. Are you sure your network is set up correctly? Can you show the vSwitch details instead of the portgroup details?

Duncan

M__Y_
Enthusiast

Have you tried to enable "Notify switch"?

Have you changed the configuration of the vSwitch? Can you post its configuration, as you did for the portgroup, please?

Have you tried applying the portgroup's configuration to the vSwitch and then disabling the portgroup-specific overrides (this may help to identify a bug)?

Which version and build of ESX are you running?

JasonVmware
Enthusiast

Yes, these are blades, and I'm starting to think the problem may be with the blade setup. However, it is currently a very simple switching setup, as there is no trunking or LACP configured on the EXT ports as of yet. I will get a screenshot of the vSwitch settings on Monday, as I don't have access to the lab at the moment. I did turn on the Notify Switches setting for one of the tests, but I will retry it and report my findings on Monday.

SkyC
Enthusiast

Try changing the Network Failover Detection to beacon probing. Often the blade chassis will still present a link to the blade server even if the uplink is unplugged, so ESX doesn't detect the failure.
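
To make the blade behaviour concrete, here is a minimal sketch of why link-status detection misses this kind of failure. The `Path` class and both function names are hypothetical, introduced only for illustration; the model assumes the blade NIC connects to a chassis switch module, which in turn connects to the outside network.

```python
from dataclasses import dataclass


@dataclass
class Path:
    internal_link_up: bool    # blade NIC <-> chassis switch module
    external_uplink_up: bool  # chassis switch module <-> outside network


def link_status_sees_failure(path):
    """Link-status detection only looks at the NIC's own link."""
    return not path.internal_link_up


def traffic_can_flow(path):
    """Traffic needs both hops to be up."""
    return path.internal_link_up and path.external_uplink_up


# Pulling the external cable leaves the internal link up.
pulled_external_cable = Path(internal_link_up=True, external_uplink_up=False)

print(link_status_sees_failure(pulled_external_cable))  # False
print(traffic_can_flow(pulled_external_cable))          # False
```

The host sees a healthy link and keeps sending VM traffic down a path that cannot reach the outside network, so no failover is triggered.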

depping
Leadership

If they are blades, they will indeed not detect a link down, because the NIC is still connected internally. Beacon probing might be a solution, but it would be advisable to use at least 3 NICs. (With 2 NICs and beacon probing it's kinda hard to detect which NIC is the bad one...)

Duncan
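
The point about 2 versus 3 NICs can be illustrated with a small model of beacon probing. This is a simplified sketch, not the actual ESX algorithm: it assumes a beacon is received only when both the sender's and the receiver's paths are intact, the function names are hypothetical, and `vmnic5` is a made-up third uplink added for the example.

```python
def beacons_heard(nics, broken):
    """For each NIC, the set of other NICs whose beacons it receives."""
    return {
        nic: {
            other
            for other in nics
            if other != nic and nic not in broken and other not in broken
        }
        for nic in nics
    }


def identify_failed(nics, broken):
    """Return the NICs that can be singled out as failed.

    Returns [] when every NIC hears at least one beacon, and None when
    no NIC hears any beacon, so the bad one cannot be told apart.
    """
    heard = beacons_heard(nics, broken)
    silent = [nic for nic in nics if not heard[nic]]
    if not silent:
        return []
    if len(silent) == len(nics):
        return None  # ambiguous: every NIC looks equally bad
    return silent


two = identify_failed(["vmnic1", "vmnic3"], {"vmnic1"})
three = identify_failed(["vmnic1", "vmnic3", "vmnic5"], {"vmnic1"})

print(two)    # None
print(three)  # ['vmnic1']
```

With two uplinks, neither NIC hears the other, so both look the same. With three, the two healthy NICs still hear each other and only the broken one goes silent.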

JasonVmware
Enthusiast

Ahhh, that makes sense with the beacon probing. I will make this change and give it a test. Thanks again! I have additional points to award to those who helped, as I had an old post that went unanswered.

If anyone who helped would like the additional points, please post the same answer here so I can close that thread:

http://communities.vmware.com/thread/204368

Thanks again. I'll post back my findings on Monday.