Still having the issue.
Sometimes it is useful to verify that every VMNIC uplink really works for every VLAN. One way to test this is to create a new portgroup on the vSwitch used by your VMs on the first host, put one test VM on this portgroup, then go into the NIC teaming policy of the new portgroup and select "Override switch failover order".
Then move all VMNICs except one down to Unused, so that only one VMNIC is active. Set the portgroup VLAN to one of the production VLANs and check whether the test VM can ping several addresses that should be reachable on that VLAN. If this works, move the working VMNIC down to Unused and move another one up to Active. Test again, and repeat for all VMNICs. If every test passes, you have verified that the VLAN configuration and other settings are correct on the physical switch ports facing this host.
If you have multiple VLANs, repeat the process for each of the other production VLANs. Then repeat the process on the other hosts.
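To keep track of the combinations, the procedure above can be written down as a checklist. The following is only a rough sketch: the host names, VLAN IDs, and VMNIC names are made-up placeholders, and the script just lists the tests to perform by hand; it does not touch any vSphere configuration.

```python
from itertools import product


def build_test_plan(hosts, vlans, vmnics):
    """Return one manual test step per (host, VLAN, active VMNIC) combination."""
    plan = []
    for host, vlan, active in product(hosts, vlans, vmnics):
        # Every VMNIC other than the one under test goes to "Unused".
        unused = [nic for nic in vmnics if nic != active]
        plan.append({"host": host, "vlan": vlan, "active": active, "unused": unused})
    return plan


# Placeholder inventory (assumed values, replace with your own).
hosts = ["esx01", "esx02", "esx03"]
vlans = [10, 20]
vmnics = ["vmnic0", "vmnic1", "vmnic2", "vmnic3"]

plan = build_test_plan(hosts, vlans, vmnics)
for step in plan:
    print(
        f"{step['host']}: VLAN {step['vlan']}, active={step['active']}, "
        f"unused={','.join(step['unused'])} -> ping test"
    )
print(f"{len(plan)} tests in total")
```

The number of tests is hosts x VLANs x VMNICs, which is why the full run can take some time.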
While this might take a while, it will verify whether everything is correctly configured on the physical switches. When a vMotion takes place, the VM gets a new "Port ID" and is assigned a new outgoing VMNIC. If there is a configuration error on one or several physical switch ports, the failures can seem random, yet perhaps always happen on VLAN x on VMNIC y. Since the Port ID policy you are using will, in effect, spread the VMs randomly over the VMNICs, these problems can be hard to diagnose. (Disconnecting the virtual machine's vNIC gives the VM a new Port ID, which will move it to a new outgoing VMNIC, which might seem to solve the problem.)
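To illustrate why this looks random, here is a small toy simulation. It is a sketch under stated assumptions, not how the vSwitch is actually implemented: the modulo mapping in `assign_uplink` is just a stand-in for the Port ID based selection, and the VLAN number, VMNIC names, and the single bad port are invented for the example.

```python
import random


def assign_uplink(port_id, vmnics):
    """Pick an uplink from the Port ID (assumed modulo mapping, illustration only)."""
    return vmnics[port_id % len(vmnics)]


def has_connectivity(vlan, vmnic, bad_ports):
    """A VM loses network only if its VLAN is misconfigured on its current uplink."""
    return (vlan, vmnic) not in bad_ports


vmnics = ["vmnic0", "vmnic1", "vmnic2", "vmnic3"]
# Hypothetical fault: VLAN 20 is missing on the switch port facing vmnic2.
bad_ports = {(20, "vmnic2")}

rng = random.Random()
vm_vlan = 20
for event in range(8):
    # Each vMotion or vNIC reconnect hands the VM a new Port ID.
    port_id = rng.randrange(1000)
    uplink = assign_uplink(port_id, vmnics)
    status = "OK" if has_connectivity(vm_vlan, uplink, bad_ports) else "NO NETWORK"
    print(f"event {event}: port_id={port_id} uplink={uplink} -> {status}")
```

In this toy model a VM on VLAN 20 works most of the time and only drops off the network when it happens to land on the one bad uplink, and the next Port ID change usually "fixes" it. VMs on other VLANs are never affected, which matches the "always VLAN x on VMNIC y" pattern described above.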
That is an obvious suggestion and I can't believe we didn't try it months ago. We were able to identify one bad port doing what you said on one of our 3 hosts and have fixed that port config. We will continue testing. We also have the issue in a separate DMZ cluster but we haven't been able to troubleshoot there yet. Once we are done with testing I will report back.
Thanks for the help.
Glad to hear that you found one configuration error already! Report back when you have completed the testing.
I found a similar misconfiguration in our DMZ and made the correction there. We are going to simulate failures in our cluster tonight to see if this is truly resolved. Typically we will see this problem if we have a host go down.
Rickard's suggestion was very helpful. We were able to isolate more than one port in each of our trouble environments and correct them using his suggestion. I believe our "dropped network" issues have been resolved after testing. If not, I'll be back.
Very nice to hear that you finally got your old problem solved. Good luck!