VMware server in NLB cluster intermittently drops ping packets until NIC reset

CirrusJustin · 06-24-2021

I have two Windows Server 2019 VMware virtual machines configured as NLB cluster hosts in IGMP multicast mode. This worked well for months. Then updates were turned on for both servers, and for the past few months the servers have been rebooted monthly.

Now, usually a few days after a reboot, users complain that the web app is slow. Sure enough, when I ping the two servers, one of them is dropping a large share of the requests. If I stop that host in the cluster, disable and re-enable its dedicated virtual NLB NIC, and then start the host in the cluster again, the pings stop dropping and the app returns to normal speed. This lasts until the next month's reboots.
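For reference, the manual fix above could be scripted roughly as follows. This is only a sketch: it builds the PowerShell commands in the same order I run the steps by hand and, by default, just prints them. The host name and adapter name are placeholders rather than my real values, and it assumes the NLB and NetAdapter PowerShell cmdlets are available on the server.

```python
import subprocess


def build_reset_commands(host_name, adapter_name):
    """Return the PowerShell commands for the manual workaround, in order.

    host_name and adapter_name are placeholders supplied by the caller.
    """
    return [
        # 1. Stop the host in the NLB cluster.
        f'Stop-NlbClusterNode -HostName "{host_name}"',
        # 2. Disable the dedicated NLB NIC without prompting.
        f'Disable-NetAdapter -Name "{adapter_name}" -Confirm:$false',
        # 3. Re-enable the NIC.
        f'Enable-NetAdapter -Name "{adapter_name}"',
        # 4. Start the host in the cluster again.
        f'Start-NlbClusterNode -HostName "{host_name}"',
    ]


def run_reset(host_name, adapter_name, dry_run=True):
    """Print the commands (dry run) or execute them one by one."""
    for command in build_reset_commands(host_name, adapter_name):
        if dry_run:
            print(command)
        else:
            # check=True stops the sequence if any step fails.
            subprocess.run(
                ["powershell.exe", "-NoProfile", "-Command", command],
                check=True,
            )
```

Calling `run_reset("WEB01", "NLB")` with those made-up names only prints the four commands; passing `dry_run=False` would actually run them, so that should be done from an elevated session and not over the NIC being reset.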

The problem is not confined to one of the two servers and appears to be random. It can affect either server, or even both.

Starting right around the time of the reboot, the logs alternate between the two messages below every 1-15 minutes. The messages themselves are normal; the problem is how often they appear. They stop once I disable and re-enable the NIC:

NLB is initiating convergence on host 0x1 because host 0x2 is leaving the cluster. Event ID: 69
Host 0x1 converged with host(s): 1,2. It is now an active member of the NLB cluster and will start load balancing traffic as the default host. The default host is the host with the lowest host priority. It handles all traffic that isn't covered by any of the defined port rules. Event ID: 29
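To put a number on how often these appear, something like the sketch below could measure the gap between consecutive Event ID 69 entries. It assumes the events have already been exported from the event log into (timestamp, event ID) pairs; the function name and input shape are my own choices for illustration.

```python
from datetime import datetime  # timestamps are expected to be datetime objects

CONVERGENCE_START_ID = 69  # "NLB is initiating convergence ..."


def convergence_gaps_minutes(events):
    """Return the gaps, in minutes, between consecutive Event ID 69 entries.

    events: iterable of (datetime, event_id) pairs, in any order.
    """
    starts = sorted(ts for ts, event_id in events if event_id == CONVERGENCE_START_ID)
    return [
        (later - earlier).total_seconds() / 60
        for earlier, later in zip(starts, starts[1:])
    ]
```

If the gaps cluster in the 1-15 minute range after a reboot and disappear after the NIC reset, that would match what I am seeing in the logs.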

I'm no good with Wireshark, so I don't know whether other packets are being dropped too. Any idea what's going on here, or how I can troubleshoot it?
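Short of Wireshark, one thing I could do is log ping loss per server over time, so the moment the drops start can be lined up with the reboot and the convergence events. A minimal sketch, assuming the Windows `ping` command with English-language output (the summary line containing "(N% loss)"); the function names are made up for this example.

```python
import re
import subprocess

# Matches the summary line of Windows ping, e.g. "Lost = 5 (25% loss)".
LOSS_PATTERN = re.compile(r"\((\d+)% loss\)")


def parse_loss_percent(ping_output):
    """Extract the loss percentage from ping output, or None if not found."""
    match = LOSS_PATTERN.search(ping_output)
    return int(match.group(1)) if match else None


def ping_loss_percent(address, count=20):
    """Ping an address `count` times (Windows `-n` flag) and return % loss."""
    result = subprocess.run(
        ["ping", "-n", str(count), address],
        capture_output=True,
        text=True,
    )
    return parse_loss_percent(result.stdout)
```

Running `ping_loss_percent` against each server's address on a schedule and writing the result with a timestamp would show whether the loss begins right at the reboot or only days later.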
