VMware Cloud Community
miszcz
Contributor

Weird network issue after vCenter upgrade - output drops on switchport (leading to network issues for VMs)

Disclaimer: I have not yet opened a case with VMware support regarding this issue, mostly because I don't think they will be able to help with it. But on the very slim chance that other people have experienced a similar issue, I'm trying the community. Maybe I'll get lucky ... thanks for listening.

After updating our VCSA to 6.7 U1b we experienced some network issues that we initially did not connect to the vCenter upgrade. There are a number of other VMs (mostly Linux VMs) on the ESXi host running the VCSA. After the upgrade we noticed (for lack of a better description) "an unstable network", resulting in recurring sudden failovers of Linux Pacemaker / DRBD clusters due to network reachability issues. We were finally able to confirm that the VCSA was at least part of the problem, if not the cause.

After moving the vCenter to its own ESXi host, we could at least isolate the network issues so that none of our other important VMs is impacted anymore.

The only observable symptom of this problem is an increasing number of output drops on the switchport that connects to the ESXi host with the VCSA (which by now is the only VM on the host). The underlying switch is a Catalyst 6500. The ESXi host is connected to the switch at 1 Gbps. Since we never experienced similar network issues in the past, I don't know whether these output drops were there before or whether they occurred only after the upgrade.

So far, we have been able to rule out defective cables, a defective switchport and a defective switch module. The average throughput on that interface is < 2 Mbps combined TX/RX (which is 0.2% of the line speed). Still, we are seeing those output drops which, based on Cisco documentation, are caused by a congested interface: at times the output queue must be full and the switch discards some packets. To the best of my understanding, this must be due to short-lived, bursty traffic (which we haven't been able to capture yet). What might be causing this bursty traffic is still completely unclear to us.
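To illustrate how a link that averages under 2 Mbps can still drop packets, here is a minimal sketch of a tail-drop output queue hit by a single burst. It is a simplified fluid model, not a description of how the Catalyst 6500 actually buffers traffic. The burst size, the 10 Gbps ingress rate and the 500 KB queue depth are all assumptions chosen for illustration; none of them were measured on this setup.

```python
def burst_drop_bytes(burst_bytes, ingress_bps, egress_bps, queue_bytes):
    """Bytes tail-dropped when one burst hits an initially empty output queue."""
    if ingress_bps <= egress_bps:
        return 0.0  # the port drains at least as fast as traffic arrives
    # Bytes that arrive during the burst but cannot be sent in the same time.
    backlog = burst_bytes * (ingress_bps - egress_bps) / ingress_bps
    # Whatever does not fit into the queue is discarded.
    return max(0.0, backlog - queue_bytes)


def average_mbps(burst_bytes, interval_s):
    """Long-term average rate if one such burst arrives every interval_s seconds."""
    return burst_bytes * 8 / interval_s / 1e6


# Hypothetical numbers, NOT measurements from this environment:
BURST = 2_000_000       # one 2 MB burst
INGRESS = 10 * 10**9    # arriving over an assumed 10 Gbps path
EGRESS = 1 * 10**9      # leaving through the 1 Gbps port to the ESXi host
QUEUE = 500_000         # assumed 500 KB of output buffer on that port

print("average Mbps:", average_mbps(BURST, 10))  # one burst every 10 s
print("bytes dropped per burst:", burst_drop_bytes(BURST, INGRESS, EGRESS, QUEUE))
```

With these made-up inputs the average stays below 2 Mbps while each burst still overflows the queue, which would be consistent with low utilization graphs alongside rising output drops.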

The load on the vCenter is rather small - no discernible CPU or memory issues, a rather low network load (see switch throughput) and no dropped packets on the interface level of the VCSA.

The actual percentage of output drops on the interface is ~ 0.25%. It doesn't seem that high to me, but it does seem to have quite an impact on the stability of Linux DRBD clusters. The "funny" thing is that there is no observable impact on the functionality or performance of the vCenter itself. I would have expected that if other VMs are behaving badly due to those dropped packets that the vCenter might complain as well - but that doesn't seem to be the case.
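For anyone who wants to compare numbers, this is roughly how such a percentage can be derived from two readings of the interface counters. It is only a sketch: it assumes the "packets output" counter does not include the dropped packets and that the counters were neither cleared nor wrapped between readings. The sample values are invented to land on 0.25% and are not our real counters.

```python
def drop_percent(sample_a, sample_b):
    """Percent of packets dropped between two (packets_output, output_drops) readings."""
    d_out = sample_b[0] - sample_a[0]      # packets actually transmitted
    d_drops = sample_b[1] - sample_a[1]    # packets discarded at the output queue
    offered = d_out + d_drops              # everything the port was asked to send
    if offered <= 0:
        return 0.0                         # no traffic (or counters were reset)
    return 100 * d_drops / offered


# Hypothetical readings taken some time apart:
earlier = (1_000_000, 2_000)
later = (1_399_000, 3_000)
print(drop_percent(earlier, later))
```

Working with deltas rather than the absolute counters avoids mixing in drops that may have accumulated before the upgrade.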

So here is the question: has anyone else seen this issue with the latest version of vCenter? Is anyone aware of changes made to the vCenter in the latest update that lead to more bursty traffic? And does anyone have an idea on how to further analyze the root cause of this issue?

Thanks a bunch,

-michael

P.S. I can provide more technical information if anyone's interested.

3 Replies
MikeStoica
Expert

You can also do some ESXi network troubleshooting; see "ESXi Network Troubleshooting Tools" on the VMware vSphere Blog.

miszcz
Contributor

Thanks for the info. I already did some analysis on the ESXi host with esxtop and multiple esxcli commands, and checked the logs. But no luck so far.

I might give iperf a try to see if the output drops are affected.

But since the problem already manifests at the physical switch level, where the packets are discarded, I doubt that I will be able to find out more about it on the ESXi host, which never sees those packets anyway. But I'll have a look.

MikeStoica
Expert

If the switch drops the packets, then you have to check there.
