VMware Cloud Community
kingcap3
Contributor

vSphere ESXi 4.1 HA Network Configuration

Hello,

I am posting this question because of a failure in our production ESXi 4.1 HA environment. A host became isolated when a network port was accidentally shut down. We thought we had configured enough resilience that this host shouldn't have become isolated, because it has a second NIC in the Management Network VMkernel port group.

This is our configuration: two ESXi hosts, each with two physical NICs, vmnic0 and vmnic1. We use a Cisco Nexus 1000V for our guest application data traffic and our vMotion traffic, with separate NICs for that traffic. The two onboard NICs on our Dell R810s are therefore used only for the Management Network and the HA heartbeat.

On each ESXi host we have created vSwitch0 with a VMkernel port labelled Management Network, and assigned the two uplinks vmnic0 and vmnic1 to it. It uses VLAN 670. In the Management Network properties, on the NIC Teaming tab, we have placed vmnic0 and vmnic1 in the Active Adapters section, and both ESXi hosts are configured the same way. We do not have any adapters in the Standby Adapters section.

What happened was that our network engineer accidentally blocked the port on the Cisco switch that vmnic0 was connected to on one of our ESXi hosts. That ESXi host then went into isolation, stopped all the VMs that were running on it, and moved them over to our first ESXi host. My question is: should this have happened? We had a second NIC connected, vmnic1, which I would have thought should have taken over the heartbeat traffic and prevented an HA isolation. However, I have been reading that we should have placed the second adapter, vmnic1, in the Standby Adapters section rather than adding it as an active adapter.
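To make the question concrete, here is a minimal Python sketch of the two teaming layouts being compared. It assumes the vSwitch uses its default load-balancing policy ("Route based on originating virtual port ID"), which pins each port, including the Management Network VMkernel port, to a single active uplink. The `TeamingPolicy` class, the `pinned_uplink` helper and the modulo-based uplink choice are hypothetical simplifications, not VMware's implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TeamingPolicy:
    """Hypothetical model of a port group's NIC Teaming tab."""
    active: list[str]
    standby: list[str] = field(default_factory=list)


def pinned_uplink(policy: TeamingPolicy, port_id: int, failed: set[str]) -> str | None:
    """Pick the uplink a port uses (simplified 'originating virtual port ID').

    The port is pinned to one healthy active uplink, chosen here by port_id
    modulo the number of healthy active uplinks. Standby uplinks are used only
    once no active uplink is left. Returns None if nothing is usable.
    """
    healthy_active = [nic for nic in policy.active if nic not in failed]
    if healthy_active:
        return healthy_active[port_id % len(healthy_active)]
    healthy_standby = [nic for nic in policy.standby if nic not in failed]
    return healthy_standby[0] if healthy_standby else None


# Configuration described in the post: both uplinks active, none on standby.
both_active = TeamingPolicy(active=["vmnic0", "vmnic1"])
# Alternative the poster read about: vmnic1 moved to standby.
active_standby = TeamingPolicy(active=["vmnic0"], standby=["vmnic1"])

for policy in (both_active, active_standby):
    # Uplink before and after vmnic0 is marked as failed.
    print(pinned_uplink(policy, port_id=0, failed=set()),
          pinned_uplink(policy, port_id=0, failed={"vmnic0"}))
```

In this model both layouts behave the same way: the VMkernel port uses a single uplink (vmnic0) and moves to vmnic1 only after vmnic0 has been marked as failed. Whether vmnic0 gets marked as failed at all is a separate question of failure detection.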

I have attached some screenshots to assist.

If anybody could provide some clarity on what we should configure, that would be great. I have read a number of HA whitepapers for ESXi 4.1, but I may be missing something, and I can't understand why we had this issue.

vmroyale
Immortal

Hello.

Note: This discussion was moved from the VMware ESXi 4 community to the Availability: HA & FT community.

Good Luck!

Brian Atkinson | vExpert | VMTN Moderator | Author of "VCP5-DCV VMware Certified Professional-Data Center Virtualization on vSphere 5.5 Study Guide: VCP-550" | @vmroyale | http://vmroyale.com
kingcap3
Contributor

Thanks Brian

mittim12
Immortal

Have you made any changes to das.failuredetectiontime? I guess it's possible that the host went into isolation mode before the NIC had failed over properly.
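This theory hinges on a race between NIC failover and the HA failure detection timer. The Python sketch below models that race under stated assumptions: das.failuredetectiontime defaults to 15000 ms in vSphere 4.x, and a host treats itself as isolated only when heartbeats stay silent for that long and it also cannot reach its isolation address (the default gateway unless das.isolationaddress is set). The function and parameter names are hypothetical, and real HA goes through more steps than this.

```python
from __future__ import annotations

# Default for the HA advanced option das.failuredetectiontime in vSphere 4.x (15 s).
DEFAULT_FAILURE_DETECTION_MS = 15_000


def declares_isolation(last_heartbeat_ms: int,
                       resumed_ms: int | None,
                       isolation_address_reachable: bool,
                       failure_detection_ms: int = DEFAULT_FAILURE_DETECTION_MS) -> bool:
    """Simplified (hypothetical) model of a host deciding it is isolated.

    last_heartbeat_ms: time the last heartbeat arrived before the outage.
    resumed_ms: time heartbeats arrive again (e.g. after NIC failover),
        or None if they never come back.
    The host considers itself isolated only if heartbeats stay silent for at
    least failure_detection_ms AND its isolation address is unreachable.
    """
    if resumed_ms is not None and resumed_ms - last_heartbeat_ms < failure_detection_ms:
        return False  # heartbeats came back before the timer ran out
    return not isolation_address_reachable


# Failover after 2 s: no isolation.
print(declares_isolation(0, 2_000, isolation_address_reachable=False))
# Traffic never fails over and the gateway is unreachable: isolation.
print(declares_isolation(0, None, isolation_address_reachable=False))
# Heartbeats back after 18 s: isolated with the default, not with 20000 ms.
print(declares_isolation(0, 18_000, False), declares_isolation(0, 18_000, False, 20_000))
```

The last case shows why the timer value matters: a failover that is slower than das.failuredetectiontime still ends in isolation, while the same failover under a longer timer does not.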

I would also check out the http://www.yellow-bricks.com/ blog for any HA best-practice questions. Duncan has done an outstanding job of detailing the HA process and best configurations.

kingcap3
Contributor

Thanks for this. We haven't made any changes, so it is at the default. This may be a possible answer.

peetz
Leadership

Hi,

By default, ESX uses link status for network failover detection (the "Link Status only" setting) to determine whether a link has gone down.

The link goes down if the network cable breaks or is pulled out at either end, or if the physical switch that the vmnic connects to fails.

You wrote that someone accidentally "blocked the port" on the switch. I'm not a network guy, but I guess this means that the link stayed up while all traffic was blocked. So ESX did not mark this uplink as failed and continued to use it.
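This point can be illustrated with a minimal Python sketch. `Uplink`, `failed_uplinks` and the two mode strings are hypothetical names that loosely mirror the vSwitch "Network Failover Detection" choices (Link Status only vs. Beacon Probing). Real beacon probing sends probe frames between the team's uplinks and is more involved than a single boolean per uplink.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Uplink:
    name: str
    link_up: bool         # physical link state reported by the NIC
    passes_traffic: bool  # whether frames actually make it through the switch port


def failed_uplinks(uplinks: list[Uplink], detection: str) -> set[str]:
    """Uplinks the vSwitch would mark as failed, in a simplified model.

    'link_status': only a lost link counts as a failure.
    'beacon_probing': an uplink that cannot pass traffic counts as failed,
    even if its link is still up.
    """
    if detection == "link_status":
        return {u.name for u in uplinks if not u.link_up}
    if detection == "beacon_probing":
        return {u.name for u in uplinks if not u.passes_traffic}
    raise ValueError(f"unknown detection mode: {detection!r}")


# The incident as described: vmnic0's switch port is blocked, but its link stays up.
uplinks = [
    Uplink("vmnic0", link_up=True, passes_traffic=False),
    Uplink("vmnic1", link_up=True, passes_traffic=True),
]
print(failed_uplinks(uplinks, "link_status"))     # set()
print(failed_uplinks(uplinks, "beacon_probing"))  # {'vmnic0'}
```

Under link status detection nothing is marked as failed, so the Management Network port stays on vmnic0 and its heartbeats are dropped, whether vmnic1 sits in Active or Standby. With only two uplinks, real beacon probing generally cannot tell which side is at fault, which is why it is usually recommended only with three or more uplinks.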

Making the management network resilient against all possible failures can be a real challenge. A good blog post describing this is this one.

Andreas

- Check out my VMware Front Experience Blog

Twitter: @VFrontDe, @ESXiPatches | https://esxi-patches.v-front.de | https://vibsdepot.v-front.de