I just upgraded the hardware in my home/lab box to a new motherboard and an Intel i5-9600K. No issues during boot, and all VMs came up fine. The only hardware difference is that I'm now using two PCIe NICs instead of one PCI and one PCIe (the new motherboard has no PCI slots). Previously, I could disconnect a cable from either NIC without any interruption. Now everything seems to be stuck to the original PCIe card: if I pull that cable, everything freezes (though I can still get a ping to the VMkernel NIC). The other cable can be pulled without interruption. Both NICs show as active, and both are members of vSwitch0 (the only vSwitch). I don't know whether it makes a difference, but I run ESXi off a USB thumb drive, and all my storage is iSCSI from a NAS. ESXi 6.5.0 Update 2 (Build 8294253). Also, the USB port is v3.1. I seem to remember ESXi having issues running off a USB 3.0 port, but that was back in 2014 when I first built this host.
Hi there,
Could you provide the output of your vSwitch configuration, including any overrides at the portgroup level? Also, what configuration do you have on the physical switch (if any; I'm not implying you need anything custom here)?
Kind regards.
Is there a command to provide an output, or do you just want a description?
I think I found what you are looking for.
[root@hawk:~] esxcfg-vswitch -l
Switch Name   Num Ports  Used Ports  Configured Ports  MTU   Uplinks
vSwitch0      2212       16          128               1500  vmnic1,vmnic0,vmnic2

PortGroup Name        VLAN ID  Used Ports  Uplinks
DMZ                   4        3           vmnic1,vmnic0,vmnic2
Management Network    0        1           vmnic1,vmnic0,vmnic2
VM Network            3        4           vmnic1,vmnic0,vmnic2
ESXi Backend Network  3        1           vmnic1,vmnic0,vmnic2
[root@hawk:~] esxcli network vswitch standard list
vSwitch0
Name: vSwitch0
Class: etherswitch
Num Ports: 2212
Used Ports: 16
Configured Ports: 128
MTU: 1500
CDP Status: listen
Beacon Enabled: false
Beacon Interval: 1
Beacon Threshold: 3
Beacon Required By:
Uplinks: vmnic2, vmnic0, vmnic1
Portgroups: DMZ, Management Network, VM Network, ESXi Backend Network
vmnic0: onboard Ethernet (oddly enough it was detected, and when connected it shows as active)
vmnic1: old PCIe NIC
vmnic2: new PCIe NIC
[root@hawk:~] esxcli network vswitch standard policy security get -v vSwitch0
Allow Promiscuous: false
Allow MAC Address Change: true
Allow Forged Transmits: true
[root@hawk:~] esxcli network vswitch standard policy failover get -v vSwitch0
Load Balancing: srcport
Network Failure Detection: link
Notify Switches: true
Failback: true
Active Adapters: vmnic1, vmnic0, vmnic2
Standby Adapters:
Unused Adapters:
[root@hawk:~] esxcli network vswitch standard policy shaping get -v vSwitch0
Enabled: false
Average Bandwidth: -1 Kbps
Peak Bandwidth: -1 Kbps
Burst Size: -1 Kib
More data:
[root@hawk:~] esxcli network nic list
Name PCI Device Driver Admin Status Link Status Speed Duplex MAC Address MTU Description
------ ------------ ------ ------------ ----------- ----- ------ ----------------- ---- ------------------------------------------------
vmnic0 0000:00:1f.6 ne1000 Up Down 0 Half 0c:9d:92:c1:36:a9 1500 Intel Corporation Ethernet Connection (7) I219-V
vmnic1 0000:03:00.0 ne1000 Up Up 1000 Full 68:05:ca:27:7d:2a 1500 Intel Corporation Gigabit CT Desktop Adapter
vmnic2 0000:04:00.0 ne1000 Up Up 1000 Full 68:05:ca:90:20:3a 1500 Intel Corporation Gigabit CT Desktop Adapter
[root@hawk:~] esxcli network nic get -n vmnic0
Advertised Auto Negotiation: true
Advertised Link Modes: Auto, 10BaseT/Half, 100BaseT/Half, 10BaseT/Full, 100BaseT/Full, 1000BaseT/Full
Auto Negotiation: true
Cable Type: Twisted Pair
Current Message Level: -1
Driver Info:
Bus Info: 0000:00:1f:6
Driver: ne1000
Firmware Version: 0.5-4
Version: 0.8.3
Link Detected: false
Link Status: Down
Name: vmnic0
PHYAddress: 0
Pause Autonegotiate: false
Pause RX: false
Pause TX: false
Supported Ports: TP
Supports Auto Negotiation: true
Supports Pause: false
Supports Wakeon: true
Transceiver:
Virtual Address: 00:50:56:5f:69:c9
Wakeon: MagicPacket(tm)
[root@hawk:~] esxcli network nic get -n vmnic1
Advertised Auto Negotiation: true
Advertised Link Modes: Auto, 10BaseT/Half, 100BaseT/Half, 10BaseT/Full, 100BaseT/Full, 1000BaseT/Full
Auto Negotiation: true
Cable Type: Twisted Pair
Current Message Level: -1
Driver Info:
Bus Info: 0000:03:00:0
Driver: ne1000
Firmware Version: 1.8-0
Version: 0.8.3
Link Detected: true
Link Status: Up
Name: vmnic1
PHYAddress: 0
Pause Autonegotiate: false
Pause RX: false
Pause TX: false
Supported Ports: TP
Supports Auto Negotiation: true
Supports Pause: false
Supports Wakeon: true
Transceiver:
Virtual Address: 00:50:56:57:7d:2a
Wakeon: MagicPacket(tm)
[root@hawk:~] esxcli network nic get -n vmnic2
Advertised Auto Negotiation: true
Advertised Link Modes: Auto, 10BaseT/Half, 100BaseT/Half, 10BaseT/Full, 100BaseT/Full, 1000BaseT/Full
Auto Negotiation: false
Cable Type: Twisted Pair
Current Message Level: -1
Driver Info:
Bus Info: 0000:04:00:0
Driver: ne1000
Firmware Version: 1.8-0
Version: 0.8.3
Link Detected: true
Link Status: Up
Name: vmnic2
PHYAddress: 0
Pause Autonegotiate: false
Pause RX: false
Pause TX: false
Supported Ports: TP
Supports Auto Negotiation: true
Supports Pause: false
Supports Wakeon: true
Transceiver:
Virtual Address: 00:50:56:5b:f4:92
Wakeon: MagicPacket(tm)
Hi athompson88,
Thanks for that. Reading your post, you mention that after disconnecting vmnic1 (the old PCIe card), the VMkernel port (ESXi management) still responds to pings - is that correct? That would mean ESXi management is still available, but VMs running on the portgroup "VM Network" become unavailable?
If so then could you run the following and post the output:
esxcli network vswitch standard portgroup policy failover get -p "VM Network"
This is just to make sure the load balancing algorithm hasn't been overridden at the portgroup level.
Also, on the "VM Network" portgroup, try changing the list of network adapters: make vmnic2 active and the others unused. Do you still lose communication with the VMs?
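For reference, the same change can be made from the CLI. This is a sketch based on the portgroup and uplink names in your output; note that uplinks left out of the active list become unused:

```shell
# Override the failover order on the "VM Network" portgroup so only
# vmnic2 is active; adapters not listed drop to unused.
esxcli network vswitch standard portgroup policy failover set \
    -p "VM Network" --active-uplinks=vmnic2

# Verify the override took effect.
esxcli network vswitch standard portgroup policy failover get -p "VM Network"
```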
Can you also confirm that on the pSwitch ports you are trunking VLAN IDs 3 and 4 through? You must also have a native VLAN ID, as the management portgroup is untagged, so make sure the configuration is the same on both physical ports of the pSwitch.
Kind regards.
If I unplug the cable, the web console becomes unresponsive. I can still ping the ESXi host, and even SSH into it. Doing so allowed me to get the following output for you, which was taken while vmnic1 was unplugged:
[root@hawk:~] esxcli network vswitch standard portgroup policy failover get -p "VM Network"
Load Balancing: srcport
Network Failure Detection: link
Notify Switches: true
Failback: true
Active Adapters: vmnic1, vmnic2, vmnic0
Standby Adapters:
Unused Adapters:
Override Vswitch Load Balancing: false
Override Vswitch Network Failure Detection: false
Override Vswitch Notify Switches: false
Override Vswitch Failback: false
Override Vswitch Uplinks: false
To your second point, I did try disabling vmnic1 on "VM Network" and was still able to communicate with the VMs. If I then disabled vmnic2 as well, it cut all communication (as I'd expect). Also, let me clarify what my networks are:
Management Network: VLAN 0 (which is really 192.168.1.x) - This is used for network hardware management only, not the ESXi host (switches, router, AP).
DMZ: VLAN 4 - What it sounds like
VM Network: VLAN 3 - private server subnet
ESXi Backend Network: VLAN 3 - VMKernel port (vmk0)
The ports on the physical switch (if that's what you meant by pSwitch) are configured as untagged for VLAN 1, and tagged for VLANs 3 and 4. To rule out the physical switch I swapped the cables at the ends attached to vmnic1 and vmnic2. The same outcomes occurred.
Thanks.
Hi there and thanks for the info!
So to confirm:
I'll continue to review your vSwitch config as well. I cannot currently see anything wrong, though I am a little surprised that you have two portgroups with the same VLAN ID connected to the same network adapters. No big deal, just surprising.
Also, as vmnic0 is unused and disconnected, I would remove it from the vSwitch, or at least set it to unused. I doubt this will resolve anything, but it will be tidier that way.
Kind regards.
Disconnecting vmnic2 doesn't affect connectivity to the VM Network or the Management Network. Disconnecting vmnic1 affects connectivity to the Management Network as described (ping, SSH available, web console unresponsive) and seems to kill all access to everything on the VM Network.
I did a few more tests just now. Strangely enough, I can still ping hosts on the VM Network, but I can't SSH into them. If I SSH into the ESXi host and then try to ping hosts on the VM Network from there, that terminal session freezes and I can't even Ctrl-C out of it. I can open another session to the ESXi host, though. Terminal sessions also freeze if I run any command that touches the datastores (for example "df" or "ls /vmfs/volumes").
Press Alt+F12 on the ESXi console while the issue is happening and you will see real-time logging; it could help you spot any storage errors.
I'll look into the logging.
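For what it's worth, the same messages that appear on the Alt+F12 console are also written to the VMkernel log, which can be followed from an SSH session:

```shell
# Follow the VMkernel log in real time; iSCSI and path errors show up here.
tail -f /var/log/vmkernel.log

# Or search recent entries for storage path events after a cable pull.
grep -i "path" /var/log/vmkernel.log | tail -20
```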
More strange behavior: it seems that right after I disconnect vmnic1, everything continues to work for about 5 seconds, and then the issues begin.
On a side note, I'm not particularly concerned with NIC failure. I mainly want the two cards in place to allow greater I/O throughput to the NAS, which also has two gigabit NICs. Since we've already determined that both cards are being actively used, my primary needs are met. I am still interested in determining the cause from a purely academic standpoint.
Firstly, good thought on the storage errors because that seems to be exactly what's going on.
A quick update. I've decided to return the new PCIe card and just use the onboard NIC alongside the original PCIe card. This will save me $40. Since the onboard NIC behaved the same as the new PCIe card in successful testing (maintaining connectivity when vmnic1 is removed from the portgroups, leaving only vmnic0 to service them), and is also Intel, I feel confident it will do what I need. For reference, the motherboard is an ASUS TUF Z390-PRO Gaming.
With that out of the way, I have attached 5 photos taken of the output while testing with physical connections.
01 - vmnic0 unplugged and plugged back in
02-04 - vmnic1 unplugged
05 - vmnic1 plugged back in
Thanks again for the help.
I went through the web console event log and noticed the following were generated.
"Lost path redundancy to storage device naa.6e843b67af803dad0e48d466fda02ad0. Path vmhba64:C0:T0:L0 is down. Affected datastores: iscsi-data1."
"Lost path redundancy to storage device naa.6e843b6fb6aea05de4a7d4157d8ae2d9. Path vmhba64:C1:T1:L0 is down. Affected datastores: iscsi-data3."
It seems like a path goes down with the loss of just one NIC. Does the software iSCSI adapter bind to a specific NIC? If so, then I wouldn't actually be using both NICs for iSCSI traffic after all.
Attached is a screenshot of the full event log.
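For reference, the per-device path state can also be checked from the CLI (device ID taken from the event above):

```shell
# List all paths for the affected device; with port binding in place you
# should see more than one path, and a single NIC failure should only
# take down one of them.
esxcli storage core path list -d naa.6e843b67af803dad0e48d466fda02ad0

# List current iSCSI sessions to see which vmknic each one uses.
esxcli iscsi session list
```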
Hi athompson88,
Snap! I was about to suggest it might be a storage issue. Can you confirm that you have configured port binding as per this article: https://kb.vmware.com/s/article/2045040
This will give you multiple paths to storage, potentially increasing performance as well as giving you redundancy.
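The CLI steps from that KB look roughly like this. The portgroup names, vmk numbers, example IPs, and the vmhba64 adapter name are assumptions based on this thread; adjust them to your setup:

```shell
# Create one VMkernel port per uplink. The portgroup names iSCSI-1 and
# iSCSI-2 are hypothetical; each must already exist with its failover
# order overridden to a single active uplink (vmnic1 and vmnic2).
esxcli network ip interface add -i vmk1 -p iSCSI-1
esxcli network ip interface add -i vmk2 -p iSCSI-2

# Give each vmk an address on the iSCSI subnet (example addresses only).
esxcli network ip interface ipv4 set -i vmk1 -t static -I 192.168.1.21 -N 255.255.255.0
esxcli network ip interface ipv4 set -i vmk2 -t static -I 192.168.1.22 -N 255.255.255.0

# Bind both vmknics to the software iSCSI adapter (vmhba64 per the logs).
esxcli iscsi networkportal add -A vmhba64 -n vmk1
esxcli iscsi networkportal add -A vmhba64 -n vmk2

# Rescan so the new paths are discovered.
esxcli storage core adapter rescan -A vmhba64
```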
Kind regards.
Well, it's better, but still not working quite right. I created two new VMkernel ports (vmks) specifically for iSCSI use and set them up as per the document. Here is what is happening now:
1) Both NICs connected --> No issues
2) vmnic0 disconnected --> Connections lost
3) vmnic0 connected --> No issues
4) vmnic1 disconnected --> Connections lost
5) vmnic1 connected --> No issues
6) vmnic1 disconnected --> No issues <--- here's where it gets weird
7) vmnic1 connected --> No issues
8) vmnic0 disconnected --> Connections lost
9) vmnic0 connected --> No issues
10) vmnic0 disconnected --> No issues <--- and again
11) vmnic0 connected --> No issues
12) vmnic1 disconnected --> Connections lost
13) vmnic1 connected --> No issues
14) vmnic1 disconnected --> No issues
15) vmnic1 connected --> No issues
To summarize, it seems things fail on the first disconnect of a given link, but remain intact on further disconnects of that same link. If the other link is then disconnected, connections fail the first time, but similarly succeed on additional disconnects. And if you alternate back and forth between links, failures occur every time. It's as if it only knows how to recover from the failure of one given link at a time: when a different link fails, it has to relearn how to recover from that one, and then forgets how to recover from the original.
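One way to check each bound path independently during these tests is vmkping with an explicit source interface (the NAS address below is a placeholder):

```shell
# Ping the NAS from each iSCSI vmknic in turn; if one stops answering
# while its cable is pulled but the other still responds, the per-vmk
# binding is working and multipathing should handle the failover.
vmkping -I vmk1 192.168.1.50
vmkping -I vmk2 192.168.1.50
```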