ESX is looking for connection issues, i.e. there is no link. Are you sure shutdown is actually removing the link? Check the switch for link lights, etc. In some cases shutdown may not remove the actual link, just disable traffic to the port.
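For example, on a Cisco IOS switch you can verify what 'shutdown' actually did and then confirm what ESX sees (a sketch; GigabitEthernet0/1 stands in for whatever port the vmnic is cabled to):

    ! On the switch: shut the port, then confirm it reports as disabled
    configure terminal
     interface GigabitEthernet0/1
      shutdown
     end
    show interfaces GigabitEthernet0/1 status

    # On the ESX service console: the vmnic should now report link down
    esxcfg-nics -l

If the port still shows up/connected on the switch side, ESX has no link-down event to react to.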
Edward L. Haletky
VMware Communities User Moderator
Author of the book 'VMware ESX Server in the Enterprise: Planning and Securing Virtualization Servers', Copyright 2008 Pearson Education.
Blue Gears and SearchVMware Pro Blogs -- Top Virtualization Security Links -- Virtualization Security Round Table Podcast
Between the 5 ESX hosts, I have 20 NICs. I found this issue because I was testing failover on them and documenting which vmnics mapped to which switchports.
On 19 of the NICs, when I shut down the switchport, VMware detects the shutdown state within a few seconds and the other vmnic takes over (if the one I shut down was the active one). I started troubleshooting what might just be a bad NIC, but then...
When I turned one of the switches off and physically removed it from the blade chassis to see if it would clear the connectivity issue, I watched VMware. I saw 9 vmnics show a link-down state. The one NIC in question still showed up. The 10 vmnics connected to the other switch stayed up, and each vSwitch did show at least one good connection. So far so good; it still looks like I have one bad NIC.

However, something occurred that I did not expect. Failover still did not happen on 2 of the hosts, even though it should have. 2 hosts full of VMs lost connectivity even though they detected the link-down state on one of their NICs. I could still get to the console on all of the hosts, and their vSwitches are all configured like the vSwitches for the VMs.
I will continue to isolate and test.
FUEL TO THE FIRE!
I went into the vSwitch config and removed one vmnic; it worked.
I removed that one and added back the other; it worked.
BOTH vmnics worked by themselves.
I added both back and noticed that the vmnic in question now showed a 'down' state!
After adding both, I popped in and shut down each switchport in turn; failover works within a few seconds, and the down port now shows the red 'x'.
It seems to work now. I will continue to test. It doesn't leave me feeling confident long term when the system appears to be configured correctly but isn't. Hm.
Plan for now: remove and re-add all the vmnics in turn. Test by shutting down each port, then a final test by removing the switches one at a time. I need this to work in a predictable fashion.
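For anyone wanting to script the remove/re-add step instead of clicking through the GUI, a minimal sketch from the ESX service console (vSwitch0 and vmnic2 are placeholders for my actual names):

    # Unlink the suspect vmnic from the vSwitch, then link it back
    esxcfg-vswitch -U vmnic2 vSwitch0
    esxcfg-vswitch -L vmnic2 vSwitch0
    # Confirm which uplinks the vSwitch now has
    esxcfg-vswitch -l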
Now the issue has gone back to the way it was. I'm going to update to the latest and greatest software; if that doesn't work, it has to be a NIC issue. :|
The NIC will pass traffic if it's the only active NIC.
Found this document; I'm going to give it a try. It concerns ESX and my Cisco switch link-state setup!
Hope this works.
I read through this one. It is going to be helpful in figuring out how to determine if the uplink to the core switches is down, but it isn't related to this problem. I can 'force' the NIC to see a down link if I change something on it, like from 1000/Full to Auto, and save it. This causes it to recognize a link down, and teaming works.
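For the record, that speed/duplex toggle can also be done from the service console rather than the GUI (a sketch; vmnic2 is a placeholder for the Broadcom port):

    # Force 1000/Full, then back to auto-negotiate; either change makes the port re-evaluate link
    esxcfg-nics -s 1000 -d full vmnic2
    esxcfg-nics -a vmnic2
    # Check the reported link state afterwards
    esxcfg-nics -l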
Narrowing it down a bit.
It appears that on more than one host the Broadcom BCM5715S GigE NIC is not detecting a down link until I make a setting change for it in VMware. For example, if I edit the vmnic and change it from Auto to 1000/Full (either direction), the NIC will detect the down condition and fail over. My internal Broadcom NetXtreme II BCM5708 NIC properly detects the link state. The BCM5708 uses the bnx2 driver and the BCM5715 uses the tg3 driver.
This has to be some kind of driver issue... Now I'm going back and looking at the hardware lists. Is there any way to try the bnx2 driver with this? They are both BCM57xx series NICs.
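For reference, confirming which driver each port is bound to is quick from the service console (a sketch; vmnic2 is a placeholder, and ethtool availability is assumed from the ESX 3.x console):

    # List physical NICs with their driver, link state, and speed/duplex
    esxcfg-nics -l
    # Driver name, version, and firmware for a specific vmnic
    ethtool -i vmnic2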
Not sure if you really need to go through that process if you are using Ciscos. VMware is CDP-aware, and the information can be found via the command line or the GUI. The blog link below goes through the command line. In the GUI, click on the ESX host -> Configuration tab -> Networking, and click on the bubble dialog box. That should show you port and VLAN information obtained via CDP from the switch.
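From the service console, CDP on a standard vSwitch can also be checked and adjusted (a sketch, assuming a vSwitch named vSwitch0):

    # Show the current CDP mode for the vSwitch (down, listen, advertise, or both)
    esxcfg-vswitch -b vSwitch0
    # Set it to both listen and advertise so ESX and the switch can see each other
    esxcfg-vswitch -B both vSwitch0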
Linux / SAN / Virtualization
I had this same problem with my HP blades and their Broadcom mezzanine cards. After many hours with tech support, we figured out that this patch fixed it: ESX350-200802401-BG.
Hope that helps!
IT Systems Engineer
tfskelley wins the cookie!
I had not run into a situation that forced me to update our environment all year, so I had to force myself to sit down and get VUM working, and running updates resolved the problem. The patch he mentioned looks superseded by other patches now, but that one was the Broadcom driver update patch. Works perfectly now. Thanks.
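If anyone else wants to check whether that bundle (or a successor) is already on a host, the service console can list installed patches (a sketch for ESX 3.5; the grep pattern just matches the bundle ID above):

    # List the patch bundles installed on this host
    esxupdate query
    # Look specifically for the Broadcom driver update mentioned above
    esxupdate query | grep -i 200802401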
FYI, the doc hyperlink above is also useful for getting VMware to detect an upstream switch uplink failure; that will come in handy if the uplink to the core switch goes down.
I had this once when the vSwitch/port group failover detection was set to 'Link status only', so for that customer I changed it to beacon probing.