Solved: Re: Problems with NIC failover in virtual switches

jwnchoate · ‎02-12-2009

We have been setting up some VLAN trunking, and during the setup I discovered that some, but not all, of my NICs are not detecting a 'down' state in VMware. I tried restarting all the equipment and it hasn't helped. I double-checked the configurations and they are identical as far as I can see. Just wondering if anyone has had issues with this.

In the image below I should normally see a red X in the NIC's link state after a few moments, and the other NIC should pass traffic. Of the 5 hosts configured, this appeared to be the only NIC that did not go down in VirtualCenter when I shut down the switchport. Then when I rebooted the switch, I checked all 5 hosts for correct path failover. Strangely, 2 of my other hosts did actually detect that the NIC went down, but path failover did not occur and the VMs went unresponsive to the outside world, even though their link state did show an X! Failover detection is set to link state.

All but one host is on 3.5.64607; the other is on 82663. My next step is to update and bring everything to the latest versions. Hopefully this works.

Texiwill · ‎02-13-2009

Hello,

ESX is looking for connection issues, i.e. there is no link. Are you sure shutdown is removing the link? Check the switch for lights, etc. In some cases shutdown may not remove the actual link and may just disable traffic to the port.

Best regards,
Edward L. Haletky
VMware Communities User Moderator
====
Author of the book 'VMWare ESX Server in the Enterprise: Planning and Securing Virtualization Servers', Copyright 2008 Pearson Education.
Blue Gears and SearchVMware Pro Blogs -- Top Virtualization Security Links -- Virtualization Security Round Table Podcast

jwnchoate · ‎02-13-2009

Between the 5 ESX hosts, I have 20 NICs. I found this issue because I was testing failover on them and documenting which vmnics mapped to which switchport.

On 19 of the NICs, when I shut down the switchport, VMware detects the shutdown state within a few seconds, and the other vmnic takes over (if the one I shut down was the active one). I started troubleshooting what might be just a bad NIC, but then...

When I turned one of the switches off and physically removed it from the blade chassis to see if that would clear the connectivity issue, I watched VMware. I saw 9 vmnics show a link down state. The one NIC in question still showed up. The 10 vmnics connected to the other switch stayed up, and each vSwitch did show at least one good connection. So far so good; it still looks like I have one bad NIC. However, something occurred that I did not expect. Failover still did not occur on 2 hosts, even though it should have. 2 hosts full of VMs lost connectivity even though they detected the link down state on one of the NICs. I could still get to the console on all of the hosts, and their vSwitches are all configured like the vSwitches for the VMs.

I will continue to isolate and test.
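
For anyone repeating this kind of test, here is a minimal Python sketch of the bookkeeping described above: record which switchports were shut down and what link state the client showed afterwards, then list the NICs that should have gone down but didn't. The host names, the `NicCheck` record, and the sample rows are all hypothetical and only illustrate the idea; nothing here talks to ESX.

```python
from dataclasses import dataclass


@dataclass
class NicCheck:
    host: str          # hypothetical host name
    vmnic: str         # vmnic label as shown in the client
    port_shut: bool    # was the switchport behind this vmnic shut down?
    reported_up: bool  # link state reported after the test


def stuck_nics(checks):
    """Return (host, vmnic) pairs whose port was shut but still report link up."""
    return [(c.host, c.vmnic) for c in checks if c.port_shut and c.reported_up]


# Made-up sample observations, not my real inventory.
checks = [
    NicCheck("esx01", "vmnic0", port_shut=True, reported_up=False),
    NicCheck("esx01", "vmnic1", port_shut=False, reported_up=True),
    NicCheck("esx02", "vmnic2", port_shut=True, reported_up=True),
]

print(stuck_nics(checks))
```

With the sample rows above, only the esx02/vmnic2 entry is flagged, since it is the one whose port was shut while the link still showed up.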

jwnchoate · ‎02-13-2009

FUEL TO FIRE!

I went into the vSwitch config and removed one vmnic; it worked.

I removed that one and added back the other; it worked.

BOTH vmnics worked by themselves.

I added both back and noticed that the vmnic in question now showed a 'down' state!

After adding both, I popped in and shut down each switchport in turn; failover works within a few seconds and now shows the red 'X'.

Seems to work now. I'll continue to test. It doesn't leave me feeling confident long term if the system appears to be configured correctly but isn't. Hm.

Plan for now: remove and re-add all the vmnics in turn. Test by shutting down each port, then do a final test by removing the switches one at a time. I need this to work in a predictable fashion.

jwnchoate · ‎02-13-2009

Now the issue has gone back to the way it was. I'm going to update to the latest software; if that doesn't work, it's got to be a NIC issue. 😐

The NIC will pass traffic if it's the only active NIC.

jwnchoate · ‎02-13-2009

Found this document; I'm going to give it a try. It concerns ESX and my Cisco switch for link state setup!

Hope this works.

I read through this one. It's going to be helpful in figuring out how to determine if the uplink to the core switches is down, but it isn't related to this. I can 'force' the NIC to see a down link if I change something on it, like from 1000 Full to Auto, and save it. This causes it to recognize the link down, and teaming works.

jwnchoate · ‎02-13-2009

Narrowing a bit.

It appears that on more than one host the Broadcom BCM5715S GigE NIC is not detecting a down link until I make a setting change for it in VMware. For example, if I edit the vmnic and change from Auto <--> 1000 Full (either way), the NIC will detect the down condition and fail over. My internal Broadcom NetXtreme II BCM5708 NIC properly detects the link state. The BCM5708 uses the bnx2 driver and the BCM5715 uses the tg3 driver.

This has to be some kind of driver issue... now I'm going back and looking at the hardware lists. Is there any way to try the bnx2 driver with this? They are both BCM57xx series NICs.
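
The reasoning above (group the misbehaving NICs by driver and see whether one driver stands out) can be sketched in a few lines of Python. The model-to-driver pairs come from this post; the sample results are made up for illustration and are not my actual test data.

```python
from collections import defaultdict

# Model-to-driver pairs as stated in the post above.
DRIVER_BY_MODEL = {"BCM5715S": "tg3", "BCM5708": "bnx2"}


def failures_by_driver(results):
    """results: iterable of (model, detected_link_down) tuples.

    Returns {driver: (nics_that_missed_link_down, nics_tested)}.
    """
    tally = defaultdict(lambda: [0, 0])
    for model, detected in results:
        driver = DRIVER_BY_MODEL[model]
        tally[driver][1] += 1
        if not detected:
            tally[driver][0] += 1
    return {driver: tuple(counts) for driver, counts in tally.items()}


# Hypothetical observations: True means the NIC noticed the link going down.
sample = [
    ("BCM5715S", False),
    ("BCM5715S", False),
    ("BCM5708", True),
    ("BCM5708", True),
]

print(failures_by_driver(sample))
```

If every miss lands under one driver, as in the made-up sample, that points at the driver rather than at a single bad card.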

kcollo · ‎02-13-2009

Not sure if you really need to go through that process if you are using Ciscos. VMware is CDP aware, and the information can be found via the command line or using the GUI. The blog link below goes through the command line. In the GUI, click on the ESX host -> Configuration tab -> Networking and click on the bubble dialog box. That should show you port and VLAN information obtained via CDP from the switch.

http://blog.colovirt.com/2008/10/21/vmware-esx-network-troubleshooting-with-cisco/

Kevin Goodman

Linux / SAN / Virtualization

kevin@colovirt.com

http://blog.colovirt.com

tfskelly · ‎02-13-2009

Hello

I had this same problem with my HP blades and their Broadcom mezzanine cards. After many hours with tech support we figured out that this patch fixed it: ESX350-200802401-BG.

Hope that helps!

Kelly Burton

IT Systems Engineer

Banner Health

jwnchoate · ‎02-13-2009

tfskelly wins the cookie!

I had not run into a situation that forced me to update our environment all year, so I had to force myself to sit down and get VUM working, and running the updates resolved the problem. The update he mentioned looks to have been superseded by other patches now, but that one was the Broadcom driver update patch. Works perfectly now. Thanks.

FYI, the doc hyperlink above is a useful doc on getting VMware to detect upstream switch uplink failure too; that will come in handy if the uplink to the core switch goes down.

admin · ‎02-14-2009

I had this once when the switch/portgroup was set to link status only, so for that customer I changed it to beacon probing.
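
To show why the detection setting matters, here is a toy Python model. It is not VMware's teaming code, just a simplified sketch under my own assumptions: the `Uplink` fields and the `pick_uplink` function are hypothetical, and real probing is more involved than a single boolean. The point is only that a NIC whose driver keeps reporting "link up" is never skipped when link status is the sole test.

```python
from dataclasses import dataclass


@dataclass
class Uplink:
    name: str
    reported_link_up: bool  # what the NIC driver reports
    probes_returned: bool   # toy stand-in for "probe frames make it back"


def pick_uplink(uplinks, detection):
    """Return the first uplink (in failover order) judged healthy, else None."""
    for u in uplinks:
        if detection == "link_status":
            healthy = u.reported_link_up
        elif detection == "probing":
            healthy = u.reported_link_up and u.probes_returned
        else:
            raise ValueError(f"unknown detection mode: {detection}")
        if healthy:
            return u.name
    return None


# vmnic0's port is really dead, but its driver still claims the link is up.
team = [
    Uplink("vmnic0", reported_link_up=True, probes_returned=False),
    Uplink("vmnic1", reported_link_up=True, probes_returned=True),
]

print(pick_uplink(team, "link_status"))
print(pick_uplink(team, "probing"))
```

In this model, link status keeps choosing the dead vmnic0, while probing moves on to vmnic1.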