VMware Cloud Community
AVinovarov
Contributor

VMkernel interface connectivity on 10G pNICs

Hi all,

I've got:

Hardware: HP ProLiant DL380 G7 + HP NC523SFP 10GbE (by QLogic), all firmware is up-to-date.

Software: ESXi 6.0 U2 from latest available HP custom ISO.

The server now has six physical NICs:

4 * 1G Ethernet uplinks connected to vSwitch0 (management VMkernel vmk0 here, multiple VLANs in trunk & port-channel, everything's fine)

2 * 10G Ethernet uplinks connected to vSwitch1 (these NICs are for NFS and vMotion), let them be vmnic4 and vmnic5

The issue I'm running into happens on vSwitch1 and 10G NICs (and yes, I have tried to remove-everything-and-recreate-it-all-from-scratch).

The vSwitch was supposed to be used for connecting NFS storage via 10G to ESXi using an isolated non-routed VLAN.

10G uplinks vmnic4 and vmnic5 are connected to a VSS stack of Cisco 4500-X (ports are in trunk mode), standard MTU of 1500 everywhere, no jumbo frames yet.

So, when I add a VM port group and set a VLAN ID (say 3, with subnet 10.10.3.0/24) for VM traffic, the VM is reachable via vSwitch1 and the physical 10G NICs, and it pings the virtual Cisco switch interface fine.

But when I add a VMkernel port group with the same VLAN ID and add a vmk1 interface, it reaches the VM on the same virtual switch just fine (which means that connectivity is OK inside the vSwitch), but it cannot ping, ARP, or reach anything via the physical 10G NICs.

I tried using the default TCP/IP stack as well as creating a new one, with no effect either.


So far I have tried every means I can think of to diagnose or fix it, short of black magic: I removed the second physical NIC from the vSwitch, then deleted everything and re-created the vSwitch, port groups, and VMkernel interface from scratch, to no effect.

When I try to ping vmk1 from the physical switch and vice versa, the ARP entries stay incomplete, as if there were no L1 connectivity, but the fact that the VM can ping anything on the /24 subnet makes me believe that the cabling is OK.

I also tried setting the ports on the Cisco switch to switchport mode access in VLAN 3 and removing VLAN tagging from the VM and VMkernel port groups, with the same result: the VM pings without an issue and the VMkernel interface does not.
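
For readers following along: with a VLAN ID set on the port group, the vSwitch tags frames as they leave the uplink, so a trunk port sees 802.1Q-tagged frames, while the access-port variant sends them untagged. The Python sketch below only illustrates what that tag looks like on the wire. The example frame and its source MAC are made up for illustration, and nothing here is taken from the ESXi host.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q tag


def add_vlan_tag(frame, vlan_id, priority=0):
    """Insert a 4-byte 802.1Q tag after the destination and source MACs."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be in the range 1-4094")
    tci = (priority << 13) | vlan_id  # 3-bit priority, 1-bit DEI (0), 12-bit VID
    return frame[:12] + struct.pack("!HH", TPID_8021Q, tci) + frame[12:]


def read_vlan_id(frame):
    """Return the VLAN ID of a tagged frame, or None if the frame is untagged."""
    tpid, tci = struct.unpack("!HH", frame[12:16])
    return tci & 0x0FFF if tpid == TPID_8021Q else None


# Hypothetical untagged frame: broadcast destination, made-up source MAC,
# EtherType 0x0806 (ARP), and a zeroed 28-byte payload.
untagged = bytes.fromhex("ffffffffffff" "020000000001" "0806") + bytes(28)
tagged = add_vlan_tag(untagged, 3)

print(read_vlan_id(untagged), read_vlan_id(tagged), tagged[12:16].hex())
```

The tagged frame is four bytes longer than the untagged one, and the bytes right after the source MAC become 81 00 00 03 for VLAN 3.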

After a week of reading, checking everything I can think of and testing, all to no avail, I'd be very grateful for any ideas on this.

Thanks a lot!


Accepted Solutions
AVinovarov
Contributor

To whom it may concern: re-installing the ESXi from scratch solved the issue automagically, phew.

Or maybe it was an HP firmware update somewhere along the way. Or using only the new ESXi 6.0 web client and completely avoiding the old installable...

AVinovarov
Contributor

Some diagnostics I've run so far:

The routing table on ESXi looks fine to me; subnet 10.10.3.0/24 is directly connected to the vmk1 interface:

[root@ESXi:~] esxcfg-route -l
VMkernel Routes:
Network          Netmask          Gateway          Interface
10.10.x.0        255.255.255.0    Local Subnet     vmk0
10.10.3.0        255.255.255.0    Local Subnet     vmk1
default          0.0.0.0          10.10.x.1        vmk0
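
As a sanity check on the reading of the routing table above, here is a minimal Python sketch of a longest-prefix lookup. It only models the two routes that are fully visible in the output: the management subnet is redacted in the post (10.10.x.0), so it is left out. The function name `pick_interface` is made up for this example.

```python
import ipaddress

# Routes copied from the esxcfg-route output. The redacted management
# subnet (10.10.x.0) is deliberately omitted from this sketch.
ROUTES = [
    (ipaddress.ip_network("10.10.3.0/24"), "vmk1"),
    (ipaddress.ip_network("0.0.0.0/0"), "vmk0"),  # default route
]


def pick_interface(destination):
    """Return the interface of the most specific route matching destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, ifname) for net, ifname in ROUTES if addr in net]
    # Longest prefix wins, as in ordinary IP routing.
    _, ifname = max(matches, key=lambda match: match[0].prefixlen)
    return ifname


# 10.10.3.222 is the VM address mentioned in this thread.
print(pick_interface("10.10.3.222"))
```

Any address in 10.10.3.0/24 resolves to vmk1 here, which agrees with the table: traffic for that subnet should leave through vmk1 without involving the gateway.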

ARP broadcasts seem to work fine when I ping the VM on the same vSwitch (its IP is 10.10.3.222):

[root@ESXi:~] esxcli network ip neighbor list
Neighbor     Mac Address        Vmknic    Expiry  State  Type
-----------  -----------------  ------  --------  -----  -------
10.10.x.1    xx:xx:xx:xx:xx:xx  vmk0    1164 sec         Unknown
10.10.3.222  xx:xx:xx:xx:xx:xx  vmk1    1196 sec         Unknown

I also tried packet captures in both directions, both on the vmk1 logical interface and on the vSwitch port this vmk is connected to. When I try to ping something outside the ESXi box (like the Cisco switch logical interface), the packets get segmented for some reason; I'm not sure whether that is how it's supposed to work:

07:55:09.207828[5] Captured at PortInput point, TSO not enabled, Checksum not offloaded and not verified, length 60.
         Segment[0] ---- 42 bytes:
         0x0000:  ffff ffff ffff 0050 566f c40e 0806 0001
         0x0010:  0800 0604 0001 0050 566f c40e 0a0a 0304
         0x0020:  0000 0000 0000 0a0a 036f
         Segment[1] ---- 18 bytes:
         0x0020:  0000 0000 0000
         0x0030:  0000 0000 0000 0000 0000 0000

07:55:10.209978[6] Captured at PortInput point, TSO not enabled, Checksum not offloaded and not verified, length 60.
         Segment[0] ---- 42 bytes:
         0x0000:  ffff ffff ffff 0050 566f c40e 0806 0001
         0x0010:  0800 0604 0001 0050 566f c40e 0a0a 0304
         0x0020:  0000 0000 0000 0a0a 036f
         Segment[1] ---- 18 bytes:
         0x0020:  0000 0000 0000
         0x0030:  0000 0000 0000 0000 0000 0000
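
For what it's worth, the bytes in the capture above can be decoded by hand. The Python sketch below parses Segment[0] as an Ethernet header followed by an ARP payload. Read this way, the 42 bytes are a complete broadcast ARP request, and the 18 zero bytes in Segment[1] look like padding up to the 60-byte minimum Ethernet frame size (without the FCS). That the "segments" are just how the capture tool displays the packet's buffers, rather than fragmentation on the wire, is an assumption on my part. The helper names are made up for this example.

```python
import struct

# Hex words copied from Segment[0] and Segment[1] of the capture.
SEGMENT0 = bytes.fromhex(
    "ffffffffffff0050566fc40e08060001"
    "0800060400010050566fc40e0a0a0304"
    "0000000000000a0a036f"
)
SEGMENT1 = bytes(18)  # 18 zero bytes


def fmt_mac(raw):
    return ":".join(f"{b:02x}" for b in raw)


def fmt_ip(raw):
    return ".".join(str(b) for b in raw)


def decode_arp(frame):
    """Decode an untagged Ethernet frame that carries an IPv4-over-Ethernet ARP packet."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    (htype, ptype, hlen, plen, oper,
     sha, spa, tha, tpa) = struct.unpack("!HHBBH6s4s6s4s", frame[14:42])
    return {
        "dst_mac": fmt_mac(dst),
        "src_mac": fmt_mac(src),
        "ethertype": hex(ethertype),  # 0x806 means ARP
        "operation": oper,            # 1 = request, 2 = reply
        "sender_ip": fmt_ip(spa),
        "target_ip": fmt_ip(tpa),
    }


info = decode_arp(SEGMENT0)
total_length = len(SEGMENT0) + len(SEGMENT1)
print(info, total_length)
```

Decoded, the frame is an ARP request (operation 1) sent to the broadcast address from 00:50:56:6f:c4:0e, with sender IP 10.10.3.4 asking for 10.10.3.111, and the two segments add up to the 60 bytes reported in the capture header.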

Networking is definitely not the brightest of my talents, but from what I see I'd suggest that L1 connectivity is fine (so replacing cables and/or the physical NIC does not make sense) and that this is an L2 issue affecting the VMkernel interface only.

As I mentioned above, removing vSwitch1 with its port groups and logical interfaces does not change anything; I just re-create the issue. Reinstalling ESXi does not look like a rational idea from the start, but I don't mind doing it if it sounds reasonable.
