VMware Cloud Community
ArrowSIVAC
Enthusiast

Windows Server 2016 / Windows 10 Workstation - Packet Loss Issues

I've been seeing this issue in our environment for a while and am trying to find the root cause. I'm hoping to get some direction or ideas, or to hear whether others have the same issue. I have a large cluster but tried to isolate things as much as possible to create a baseline. I have moved VMs and tried different vSphere hosts, switches, NICs, server models, dvSwitch vs. standard switch, etc. The pattern is still not repeatable, but it does not disappear with any combination, so I now have my test set of VMs pinned to a single server, on the assumption that it is a VM, NIC driver, OS, or switch issue, possibly in combination.

Environment:

4-socket Intel server with 1 x 10Gb Emulex NIC. Single NIC in a standard switch, 10Gb, MTU 9k. Three VMs: one Windows Server 2008 as a baseline, one Windows Server 2016, and one Windows 10 workstation.

Switch: Brocade VDX 6740

Switch settings and MAC listing captured during a ping baseline that showed failures:

sw12# show mac-address-table interface ten 12/0/14
VlanId   Mac-address       Type     State        Ports
12       0050.568b.a321    Dynamic  Active       Te 12/0/14
12       0050.569e.006e    Dynamic  Active       Te 12/0/14
12       0050.56af.272c    Dynamic  Active       Te 12/0/14
19       0050.5667.a790    Dynamic  Active       Te 12/0/14
Total MAC addresses    :  4

sw12# sh run int ten 12/0/14
interface TenGigabitEthernet 12/0/14
 mtu 9216
 description x385001-C2_1
 switchport
 switchport mode trunk
 switchport trunk allowed vlan all
 no switchport trunk tag native-vlan
 switchport trunk native-vlan 11
 no spanning-tree shutdown
 fabric isl enable
 fabric trunk enable
 fcoeport 027ATLIBMVSAN
 no shutdown
!
sw12#
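
One way to use the MAC listing above during an outage is to capture it twice (once while traffic is fine, once while a VM is unreachable) and check whether any MAC has dropped off the port. The following is a minimal sketch in Python, assuming the output keeps the column layout shown above. The function names (`parse_mac_table`, `macs_by_vlan`, `missing_macs`) are hypothetical helpers, and the "after" snapshot in the usage lines is made up for illustration, not captured from the switch.

```python
import re
from collections import defaultdict

# Matches rows such as:
# "12       0050.568b.a321    Dynamic  Active       Te 12/0/14"
ROW = re.compile(
    r"^\s*(?P<vlan>\d+)\s+"
    r"(?P<mac>[0-9a-fA-F]{4}\.[0-9a-fA-F]{4}\.[0-9a-fA-F]{4})\s+"
    r"(?P<type>\S+)\s+(?P<state>\S+)\s+(?P<port>.+?)\s*$"
)


def parse_mac_table(text):
    """Return one dict per MAC table row; header and summary lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append(match.groupdict())
    return rows


def macs_by_vlan(rows):
    """Group MAC addresses by VLAN id."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[int(row["vlan"])].append(row["mac"].lower())
    return dict(grouped)


def missing_macs(before_rows, after_rows):
    """MACs present in the first snapshot but absent from the second."""
    before = {row["mac"].lower() for row in before_rows}
    after = {row["mac"].lower() for row in after_rows}
    return before - after


# Snapshot taken from the listing in this post.
SAMPLE = """\
VlanId   Mac-address       Type     State        Ports
12       0050.568b.a321    Dynamic  Active       Te 12/0/14
12       0050.569e.006e    Dynamic  Active       Te 12/0/14
12       0050.56af.272c    Dynamic  Active       Te 12/0/14
19       0050.5667.a790    Dynamic  Active       Te 12/0/14
Total MAC addresses    :  4
"""

if __name__ == "__main__":
    good = parse_mac_table(SAMPLE)
    # Hypothetical second snapshot: the same table with one row removed.
    bad = [row for row in good if row["mac"] != "0050.569e.006e"]
    print(macs_by_vlan(good))
    print(missing_macs(good, bad))
```

If a VM's MAC is still listed on the port while the VM is unreachable, that points away from the switch having aged out or moved the address, which narrows where to look next.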

Symptom:

Packets drop for 3-6 seconds. Sometimes the VM drops offline until the NIC is disabled and re-enabled (either by editing the VM or from inside the OS), or until the VM is rebooted.

My baseline includes both E1000 and VMXNET3 adapters, and I have tried each.

I have tried setting the MTU and enabling receive side scaling. Making these changes when the NIC has stopped communicating does get it communicating again, but does not seem to affect the long-term issue (i.e. it comes back minutes, hours, or sometimes days later).
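
When MTU is one of the variables, it can help to confirm that a full-size frame actually makes it end to end with the don't-fragment bit set. The arithmetic is sketched below in Python, assuming "MTU 9k" means an IP MTU of 9000 bytes and that the IPv4 header carries no options. The helper names and the target address `192.0.2.10` are placeholders, not values from this environment.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_icmp_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError(f"MTU {mtu} is too small to carry an ICMP echo")
    return mtu - overhead


def windows_df_ping(host, mtu):
    """Build a Windows ping command: -f sets don't-fragment, -l sets payload size."""
    return f"ping -f -l {max_icmp_payload(mtu)} {host}"


if __name__ == "__main__":
    # 192.0.2.10 is a documentation address used as a placeholder target.
    for mtu in (1500, 9000):
        print(mtu, max_icmp_payload(mtu), windows_df_ping("192.0.2.10", mtu))
```

If the jumbo-size ping fails while the 1500-byte one succeeds, some hop in the path is not passing jumbo frames. That would be a separate problem from the intermittent loss described here, but it is worth ruling out.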

What I can say as a baseline is that it always comes back: sometimes just a session loss of a few seconds, sometimes 10-15 seconds, and sometimes the NIC just stops communicating. The packet trace is not showing much, but I can supply it if that will help. The NIC is always "online" and the OS can ping itself, so I believe the IP stack is still bound. My gut says this is some kind of OS driver / VMware driver issue, but I don't know how to debug it further. Nothing in the system event logs.
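
Since the outages vary from a few seconds to a full stop, it may help to turn a continuous ping log into a list of outage windows that can be lined up across VMs and physical hosts. Below is a minimal sketch in Python, assuming each probe has already been reduced to a `(timestamp_seconds, ok)` pair. How the pairs are collected is left open, and `find_outages` is a hypothetical helper name.

```python
def find_outages(samples, min_seconds=0.0):
    """Collapse runs of failed probes into outage windows.

    samples: iterable of (timestamp_seconds, ok) pairs, sorted by time.
    Returns a list of (start, end, lost) tuples, where start is the time of
    the first failed probe, end is the time of the first successful probe
    after the run (or the last probe seen if the log ends mid-outage), and
    lost is the number of failed probes in the run.
    """
    outages = []
    start = None
    lost = 0
    last_ts = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
                lost = 0
            lost += 1
        elif start is not None:
            if ts - start >= min_seconds:
                outages.append((start, ts, lost))
            start = None
        last_ts = ts
    # Log ended while probes were still failing.
    if start is not None and last_ts - start >= min_seconds:
        outages.append((start, last_ts, lost))
    return outages


if __name__ == "__main__":
    # Made-up one-second probes: a 3-probe outage and a single lost probe.
    probes = [(0, True), (1, False), (2, False), (3, False),
              (4, True), (5, True), (6, False), (7, True)]
    for start, end, lost in find_outages(probes):
        print(f"outage from t={start} to t={end}, {lost} probes lost")
```

Running the same probe against a VM and a physical host on the same switch, then comparing the outage windows, is one way to tell whether the loss is specific to the virtual NIC or shared by everything behind the same network path.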

Ideas?


Accepted Solutions
ArrowSIVAC
Enthusiast

Just to post on this.

Long story short: this was not limited to just the VMware / VM environment. Because it was intermittent it was hard to baseline, but I eventually found that the packet loss was also seen on physical hosts.

The switches showed NO events, which is not helpful. What is odd is that it was primarily Windows 2016 systems that were seeing this, and VMs 90% more often than physical hosts. I did not document it under Hyper-V, but again, it was intermittent.

The fix <insert gag reflex here>: I had a patch pending for the router that has NO noted update related to this issue, and before opening a support ticket I was going to upgrade to the latest code. After the reboot (long story short, I reverted back to the old code) the environment is no longer seeing the error.

I appreciate the response. It was just an odd thing. When systems do not output details, Wireshark traces just show packets lost, and it only happens intermittently without a defined pattern, I can't say much more to help the community.

2 Replies
daphnissov
Immortal

What version of vSphere (both vCenter and ESXi), what server hardware, and what version of VMware Tools? Also, what ESXi driver is in use for the vmnics that serve as uplinks for VM traffic, and what firmware are those vmnics running? You can use the following KB article to get the last two pieces.
