I'm in the process of upgrading our hosts from 3.5 to 4.0U1 with clean installs, but I'm running into a networking problem. My configuration has worked without issue from the original setup three years ago on 3.0.2 through 3.5. We have a standalone NIC with two ports that connect directly to a dedicated 10/100 Cisco switch in the DMZ. Those two pNICs are assigned to a DMZ portgroup and are hard-set to 100 Mbps/Full, since the physical switch ports are configured at 100 Mbps/Full. When I put a web server on one of the newly built 4.0 hosts, I start getting the following alarms and warning messages:
Lost uplink redundancy on virtual switch "vSwitch1". Physical NIC vmnic5 is down. Affected portgroups:"DMZ".
warning - 4/15/2010 3:21:12 PM
Alarm 'Network uplink redundancy lost' on esx03 changed from Gray to Red
info - 4/15/2010 3:20:17 PM
Lost uplink redundancy on virtual switch "vSwitch1". Physical NIC vmnic4 is down. Affected portgroups:"DMZ".
warning - 4/15/2010 3:20:09 PM
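For anyone who wants to check whether both uplinks are flapping equally or one is worse than the other, here is a rough sketch that tallies these events per vmnic from exported event text. It assumes the messages look exactly like the ones pasted above. The function name `count_link_down` and the regular expression are just illustrative, not anything VMware provides.

```python
import re
from collections import Counter

# Pattern for the "Lost uplink redundancy" event text as pasted above.
# Assumption: exported events keep this exact wording.
EVENT_RE = re.compile(
    r'Lost uplink redundancy on virtual switch "(?P<vswitch>[^"]+)"\. '
    r'Physical NIC (?P<nic>vmnic\d+) is down\.'
)


def count_link_down(lines):
    """Count link-down events per (vSwitch, vmnic) pair."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[(match.group("vswitch"), match.group("nic"))] += 1
    return counts


if __name__ == "__main__":
    # The two events from the log above; lines that do not match are ignored.
    sample = [
        'Lost uplink redundancy on virtual switch "vSwitch1". '
        'Physical NIC vmnic5 is down. Affected portgroups:"DMZ".',
        "warning - 4/15/2010 3:21:12 PM",
        'Lost uplink redundancy on virtual switch "vSwitch1". '
        'Physical NIC vmnic4 is down. Affected portgroups:"DMZ".',
    ]
    for (vswitch, nic), n in sorted(count_link_down(sample).items()):
        print(f"{vswitch} {nic}: {n}")
```

If one vmnic dominates the counts, that would point more toward a single port or cable. Roughly equal counts on both would fit the theory that it is something about how 4.0 handles the 100 Mbps/Full links.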
I talked with VMware support, but they weren't much help. They said it could be a configuration problem, "not being best practices," even though this setup worked for three years until the upgrade to 4.0. I hoped it might just be a bad NIC on one of the hosts, so I moved the web server to another host I had just updated, but the same thing happened there a couple of minutes later. Currently there are no servers on the DMZ portgroup on either of the 4.0 hosts, and I haven't had an alarm or warning message since, so I'm assuming it's something in 4.0 related either to hard-set 100 Mbps/Full or to 100 Mbps connections in general. Our production network, which is on gig switches with auto-negotiation, has no issues at all.
Anyone have any ideas or insight into what might be going on?
There are no errors on the physical switch these NICs connect to. I can't really narrow it down beyond the difference between 3.5U3 and 4.0U1: the 3.5 hosts ran fine with this setup, but once I upgraded to 4.0U1 I started getting these errors. The switch is a c3500xl running IOS 12.0(5).
Even though there are no errors on the switch, here is the info I forgot to attach yesterday:
Link-Flap error: Errdisable Port State Recovery on the Cisco IOS Platforms
The recommendation I found is to configure:
errdisable flap-setting cause link-flap max-flaps 10 time 20
By the way, are you using new host hardware? If so, the only thing I can think of is some kind of port security setting on the switch...
Sorry, I can't help you much more with this issue
Thanks for the help. I also just got off the phone with VMware support, who might have actually done something useful: they provided this KB - http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100954... which looks like it might have solved the issue. It kind of points to a VIC / GUI configuration issue if the "fix" is to assign the vmnics via the console. *shrug* We're waiting to see if it triggers another alarm in the next couple of hours.
After 2+ weeks of going back and forth with VMware support, with them basically trying to blame a hardware issue, they now think it has something to do with ESX4 and how it interacts with the 100/Full switch. So instead of continuing to muck around with support to get the alarms to go away, we just bought a cheap-ish gig switch to handle that vSwitch's connections. That should fix it. Not what we had in mind, but it'll work.