Jacob0B
Contributor

nic link down but nic shows status lights (esxi)

Hello all, I have a bit of a weird problem and I'm not having much luck with the search option.

Let me describe how I got to the point I am at with this:

We have about 29 VMs running on our VMware ESXi server, all of them more or less for internal testing purposes. All of these VMs share 2 network ports on the ESXi host, but are separated using virtual switches and VLANs. One of the network ports connects to a dumb switch and the other connects to a Cisco switch with trunking enabled on the interface.

I created a brand new VM and a new vSwitch to test bridging mode in a product we are using. I modified an older VM to connect to this new vSwitch, and connected the new bridging VM to this new vSwitch, our public VLAN, and our internal network. The idea was that the bridge would allow the older VM to continue talking to the public VLAN as normal, with the bridged device working transparently in the middle.

Once I finished this configuration and started the bridging device, the network for our entire building ceased to function, including my connection to the VMware server. Once I realized what had happened, I walked over to the physical hardware and unplugged it entirely from the network. The rest of the building then went back to its happy network-purring self.

Unfortunately, I have USB passthrough turned on and was unable to connect to the console on the ESXi server. Instead, I simply pushed the power button and let the server shut down.

Once it came back up, the NIC that was previously attached to the trunking port of the Cisco switch (vmnic0) had stopped working. Once I managed to get into the host on the other interface, I was able to issue an "esxcfg-nics -l", which shows the device as down.

On the back of the server, I can see that vmnic0 has a yellow light and a green light. The functioning vmnic1 has two green lights. I switched the network cables on the two interfaces once to see if anything would change; it did not.

Does anyone have any ideas?

Thanks,

-Jacob

rickardnobel
Champion

The network outage you had after creating a bridge inside a VM is most likely because a Layer 2 loop was formed in some way and broadcasts quickly consumed all of your bandwidth. Bridging inside VMs can be dangerous for this very reason. If possible, it is much better to do IP routing inside a VM that needs to be attached to several networks/VLANs.

The issue with the NIC that has link but does not work could be related to this. Do you know if Spanning Tree was changed or configured on the physical switches after this event? The link to your host might be "up", but logically blocked by Spanning Tree on the physical switch.
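
If you are able to log in to the Cisco switch, something along these lines should show whether Spanning Tree is blocking the port (substitute your real interface name, GigabitEthernet0/1 is only a placeholder):

show spanning-tree interface GigabitEthernet0/1 detail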

My VMware blog: www.rickardnobel.se
Rubeck
Virtuoso

Totally agree with Rickard..  Can you access this switch?

If so, try to see if the port is in an err-disabled state..
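
On most Cisco IOS switches something like this should list any ports the switch has shut down on its own:

show interfaces status err-disabled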

/Rubeck

Jacob0B
Contributor

Thank you guys for the responses.

I do have access to the switch, but I'm not very familiar with the concept of spanning tree or how to modify it within Cisco's operating system.

However, I don't think the problem is in the switch. I've tried plugging that ethernet port, vmnic0, into a dumb switch just to test if the port would come up - it did not. Also, I have that same port on the Cisco now plugged into the other port on our VMware server (vmnic1), and it's working fine. To me this points to either a hardware problem or something within ESXi itself. Given that the circumstances around the event were software related, I think the most likely explanation is that something in ESXi has disabled the NIC - and it's certainly not obvious how or why. Dmesg does not show any errors, and neither does syslog. Not sure where else to look.
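
For reference, the log checks I did were roughly along these lines (the exact log locations may differ depending on the ESXi build, so this is just what I tried):

dmesg | grep -i vmnic0
grep -i vmnic0 /var/log/messages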

I realize that bridging inside the VMware server is not a best-case scenario - in our company, we actually consider bridging not to be a best-case scenario at all. Unfortunately, we are short of hardware to do this test on a physical device, or I would not have tried it within a VM in the first place. I'm also not sure how I could have created a broadcast loop, but that theory does seem plausible since the entire building's network died. There must be something connected between the networks I was bridging which I did not realize.

Any other ideas on the source of the dead NIC? Is it physically possible the NIC itself was overloaded in some way and was actually "fried" by this experiment? I've never heard of it happening that way, but I don't have the physical hardware knowledge to know if it's even possible, though intuition tells me it's not.

Thanks again,

Jacob

rickardnobel
Champion

I would say that it is very unlikely that the network card was damaged by the high traffic load. It is interesting that you have tested the card on another physical switch and it does not work there either.

When connected to another physical switch, does the switch port seem to get link? (Light up)

Could you post the output from esxcfg-nics -l and also esxcfg-vswitch -l?

Do you know if anything has been set on the VMNIC, like special speed or duplex settings?

My VMware blog: www.rickardnobel.se
Jacob0B
Contributor

The switch port lights up the same no matter where it is plugged in. I have tested three different places - the Cisco switch, a dumb switch, and directly to my laptop. In all three cases there is one green light and one orange light. Also, the green light (I assume the activity light) blinks very quickly and constantly. For reference, the working NIC (vmnic1) has two green lights.

There are no special speed or duplex settings that have been set, except for one or two commands I tried after the failure, which did not appear to make any difference. I don't personally know of any other parameters that could have been set.

~ # esxcfg-nics -l
Name    PCI           Driver      Link Speed     Duplex MAC Address       MTU    Description                  
vmnic0  0000:02:00.00 e1000e      Down 0Mbps     Half   00:24:8c:57:aa:56 1500   Intel Corporation 82574L Gigabit Network Connection
vmnic1  0000:03:00.00 e1000e      Up   1000Mbps  Full   00:24:8c:57:aa:87 1500   Intel Corporation 82574L Gigabit Network Connection

Please note that I have changed the config displayed by esxcfg-vswitch since the problem occurred. I had to change things around to get the server to work again so that we could continue using the VMs.

~ # esxcfg-vswitch -l
Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks  
vSwitch0         64          22          64                1500    vmnic1   

  PortGroup Name        VLAN ID  Used Ports  Uplinks  
  192.168.0.0/16        7        1           vmnic1   
  Public Vlan           3        3           vmnic1   
  Nacs Testing          6        2           vmnic1   
  172.17.0.0/16 Network  2        12          vmnic1   
  Management Vlan2      2        1           vmnic1   
  Management Network    2        1           vmnic1   

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks  
vSwitch1         64          1           64                1500             

  PortGroup Name        VLAN ID  Used Ports  Uplinks  
  OLD 192.168.0.0/22 Network OLD  0        0                    

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks  
vSwitch2         128         2           128               1500             

  PortGroup Name        VLAN ID  Used Ports  Uplinks  
  Bridge Testing Network  0        1                 

Changes: the 192.168.0.0/26 network was previously NOT a VLAN, and was connected to a dumb switch via vSwitch2 on vmnic1. vSwitch0 was connected to vmnic0.

Thank you once again for taking the time to examine my/our issue. It really does have me stumped.

-Jacob

rickardnobel
Champion

Jacob0B wrote:

There are no special speed or duplex settings that have been set, except for one or two commands I tried after the failure, which did not appear to make any difference.

Just to be sure that there is no misconfiguration in the speed/duplex settings, could you make sure that vmnic0 is set to AUTO in the vSphere Client?
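
If the vSphere Client does not expose that setting, it should also be possible to force auto-negotiation from the console with something like:

esxcfg-nics -a vmnic0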

Do you have access to the Cisco switch? With the vmnic0 cable attached, could you log in to the switch and run:

show interface (name of the interface, like GigabitEthernet0/1)

It would be interesting to see what the physical switch reports from its point of view.

My VMware blog: www.rickardnobel.se
Jacob0B
Contributor

Opening the vSphere Client, I couldn't see any way to assign speed settings. So I got on the box and executed:

# esxcfg-nics -a vmnic0

No errors were reported, but also nothing changed.

From the Cisco switch (I borrowed a connection from an old testbox of mine, so you can ignore the description):

DevSwitch#show interfaces gi0/15
GigabitEthernet0/15 is up, line protocol is up (connected)
  Hardware is Gigabit Ethernet, address is 64ae.0c6c.080f (bia 64ae.0c6c.080f)
  Description: Jacob's testbox (PatchPort 6)
  MTU 1500 bytes, BW 100000 Kbit, DLY 100 usec,
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 100Mb/s, media type is 10/100/1000BaseTX
  input flow-control is off, output flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input never, output 00:00:00, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/0 (size/max)
  5 minute input rate 0 bits/sec, 0 packets/sec
  5 minute output rate 10000 bits/sec, 12 packets/sec
     0 packets input, 0 bytes, 0 no buffer
     Received 0 broadcasts (0 multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 0 multicast, 0 pause input
     0 input packets with dribble condition detected
     1706 packets output, 146545 bytes, 0 underruns
     0 output errors, 0 collisions, 1 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out

The line:

0 input packets with dribble condition detected

is interesting, but I'm not sure what it means or how to correct it.

-Jacob

EDIT: After doing my own research on the dribble condition, it turns out that is a counter, not a status. Therefore, having it at zero is completely normal.

Rubeck
Virtuoso

"Full-duplex, 100Mb/s, media type is 10/100/1000BaseTX"

Could you try to hardcode Gbit speed on the pSwitch port as a test..? The speed negotiation looks screwed, IMO..

/Rubeck

Jacob0B
Contributor

Kim Rubeck wrote:

Could you try to hardcode Gbit speed on the pSwitch port as a test..? The speed negotiation looks screwed, IMO..

How would I go about doing this and then undoing this?

I unfortunately don't have as much Cisco experience as I would like, and what you're asking is currently outside of my skill level on these devices.

-Jacob

Rubeck
Virtuoso

conf t
interface gi0/15
speed 1000
duplex full

This sets the switch port to 1 Gbit using full duplex....

You might have to do the same thing on the ESX side for the vnic connected to this switch port..
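
On the ESXi side that should be something like this (just a sketch, adjust the NIC name if yours differs):

esxcfg-nics -s 1000 -d full vmnic0

and to put it back to auto-negotiation afterwards:

esxcfg-nics -a vmnic0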

/Rubeck

Jacob0B
Contributor

Forcing the speed/duplex in that way caused the switch to stop detecting a link at all. I also checked the hardware status lights on the ESXi server and they had shut off (where previously there was one green and one yellow/orange light).

You inspired me to try some things though, so I tried forcing the speed on the switch to 10 Mb, which also did not work. Setting the switch back to auto with

no speed 10

no duplex full

brought the status back to up, though vmnic0 still displays link down. I also tried forcing vmnic0 to match the 100 Mb by issuing

esxcfg-nics -s 100 -d full vmnic0

No errors returned, but neither did it work.

I put all settings back to automatic once I was done.

Any more ideas? I'm certainly willing to try.

-Jacob

rickardnobel
Champion

Here are some comments on the information from your physical Cisco switch (my comments are inline below each quoted line):

Jacob0B wrote:

GigabitEthernet0/15 is up, line protocol is up (connected)

This means that from the switch side the link is completely up and no problems have been detected that would disable the port.

Full-duplex, 100Mb/s, media type is 10/100/1000BaseTX

As Rubeck pointed out, it is interesting that the switch has selected 100/full. The full duplex would mean that the other side is participating in auto-negotiation; if it were not, the switch would have fallen back to half duplex. This probably means that your adapter, when negotiating, reports that it is only capable of 100 Mbit/full duplex.

0 packets input, 0 bytes, 0 no buffer

This means that not a single frame has been received by the switch port from the ESXi host.

1706 packets output, 146545 bytes, 0 underruns

But from the switch side the port is up and 1706 packets have been sent into the ESXi host, most likely broadcast frames.

This seems to mean that the network card is logically "down" from the ESXi host, even though it is physically "up". It is still a mystery why this state has been set and how to revert it.

My VMware blog: www.rickardnobel.se
Jacob0B
Contributor

Rickard Nobel wrote:

This seems to mean that the network card is logically "down" from the ESXi host, even though it is physically "up". It is still a mystery why this state has been set and how to revert it.

That was my hypothesis, yes. I'm glad we've managed to show more evidence than just my gut feeling, however.

I feel like there must be something in VMware which is disabling the port. Maybe some sort of safety override I triggered in the driver for the port itself? The thing is, I would expect any sort of override or auto shutoff to show in the logs somewhere, or to display a big red flashing warning in vSphere or something similar.

If you have any ideas as to commands I could use to try to enable the card, I'm all for it. I've already found ethtool and given it a shot, with no success.

Interestingly, ethtool vmnic0 shows:

~ # ethtool vmnic0
Settings for vmnic0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  Not reported
        Advertised auto-negotiation: No
        Speed: Unknown! (65535)
        Duplex: Unknown! (255)
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: off
        Supports Wake-on: pumbag
        Wake-on: g
        Current message level: 0x00000001 (1)
        Link detected: no

This would make me think it is a driver issue of some sort?
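
One other thing I was considering is checking whether the e1000e module itself loaded cleanly, maybe with something like the following (I'm not sure the module listing would actually reveal anything useful):

vmkload_mod -l | grep e1000e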

-Jacob

rickardnobel
Champion

If you use the vSphere Client and check the VMNIC0 settings for speed and duplex, what do they look like?

Also, since you have made different tests and re-configurations, there is some vSwitch attached to VMNIC0, right? And some VM or something attached to that vSwitch? Just so there is something internal that could actually send frames out to the physical switch.
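
If there is no uplink attached at the moment, adding it from the console should be possible with something like this (vSwitch2 is only an example name, use whichever vSwitch you are testing with):

esxcfg-vswitch -L vmnic0 vSwitch2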

My VMware blog: www.rickardnobel.se
Jacob0B
Contributor

Using the vSphere Client, the NIC reports that it is down, but configured for "1000 Full". I don't see any way to change this from vSphere. I've tried changing it from the SSH session, but it does not update in vSphere.

I reattached a vSwitch while I was testing things. I didn't have anything running on this vSwitch all the time; everything that used to use it has been moved to the other NIC. You make a good point, however, so I've created a VM with the express purpose of sitting on that interface, in the hope that it forces it to activate.

One of my coworkers was wondering if there is a way that this interface could somehow be stuck in a bridging-type mode, refusing to send packets to avoid the broadcast loop discussed earlier. Is there any interface bridging built into VMware?

-Jacob

rickardnobel
Champion

After you have connected a vSwitch to the vmnic, you can go through the vSwitch settings and edit the vmnic duplex and speed. Make sure to set it to Auto. Then start any VM connected to the vSwitch and see if anything happens.

One of my coworkers was wondering if there is a way that this interface could somehow be stuck in a bridging-type mode, refusing to send packets to avoid the broadcast loop discussed earlier. Is there any interface bridging built into VMware?

I would say that it is unlikely. The vSwitches have no Spanning Tree or any other loop detection/prevention, so they would not really understand that a loop was taking place.

My VMware blog: www.rickardnobel.se
Jacob0B
Contributor

I now have a VM set up on a vSwitch connected to vmnic0, and vmnic0 is set to auto-negotiate. Unfortunately, vSphere still shows that the interface is down.

-Jacob

rickardnobel
Champion

It certainly seems stuck in a shutdown state for no obvious reason. Have you had the opportunity to reboot the host since the incident?

My VMware blog: www.rickardnobel.se
Jacob0B
Contributor

Not since I switched all the VMs to the other NIC.

However, in the course of figuring out that the NIC would no longer work, the server was rebooted several times. I'm skeptical that rebooting it one more time will work, but at this point I guess I'm willing to try just about anything.

I'll power down the VMs and the server after hours tomorrow to give rebooting one last hurrah.

-Jacob
