Tinker with vMotion TCP settings and/or force vMotion rate limit?
First off, wow, this new forum is horrible when it comes to searching. You have to drill down so far before you even find a search box, and if someone posted relevant content in the wrong very specific category, good luck finding it. The old forums were so much better.
In any case, here's my predicament. I've got a host whose vMotion vmkernel port sits on a 10 Gb NIC. The entire network is 10/25 Gb, so there's no option I can find to connect at gigE or artificially turn the rate down. We have two data centers connected by a 10 Gb circuit that enforces a 2 Gbps leaky-bucket rate limit. I need to be able to live-vMotion VMs across this link.
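For anyone unfamiliar with how that kind of limiter interacts with bursts, here's a toy Python sketch of what I assume the carrier's 2 Gbps leaky-bucket policer is doing (the drain rate and burst depth are my guesses, not their actual config):

```python
# Toy model of the circuit's leaky-bucket policer (assumed behavior, not
# the carrier's actual config). Traffic drains at RATE_BPS; anything that
# would overflow DEPTH_BITS of burst room gets dropped.
RATE_BPS = 2e9      # 2 Gbps drain rate
DEPTH_BITS = 1e8    # assumed ~50 ms of burst depth

def police(packets, rate_bps=RATE_BPS, depth_bits=DEPTH_BITS):
    """packets: iterable of (arrival_time_s, size_bits) in time order.
    Returns (passed, dropped) lists of the same tuples."""
    level, last_t = 0.0, 0.0
    passed, dropped = [], []
    for t, bits in packets:
        level = max(0.0, level - (t - last_t) * rate_bps)  # drain since last arrival
        last_t = t
        if level + bits <= depth_bits:
            level += bits
            passed.append((t, bits))
        else:
            dropped.append((t, bits))  # bucket full: policer drops the packet
    return passed, dropped

# A sustained 10 Gbps burst (1 Mbit chunks every 100 us for 100 ms) sails
# through until the bucket fills, then gets trimmed to the 2 Gbps drain rate:
burst = [(i * 1e-4, 1e6) for i in range(1000)]
passed, dropped = police(burst)
```

The point is that the policer is invisible to the sender until the bucket fills, and then it drops hard, which is exactly the kind of feedback an aggressive TCP window ramp handles badly.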
What ends up happening is vSphere begins sending data, things are good, the TCP window opens up, throughput increases, the window keeps growing aggressively, the rate-limit bucket fills, packets start getting dropped, and vSphere dramatically clamps down the window, tanking overall throughput. This cycle repeats over and over until the migration ultimately times out. The cause is probably that vSphere's TCP stack is configured with decades-old Linux defaults for buffers, congestion control, and queuing, which behave incredibly poorly on lossy, high-latency circuits. We see the same behavior when testing across this link with ancient Linux distributions; more modern ones are fine and will average out to the 2 Gbps rate limit after just a minute or two of bouncing around to find the appropriate window size.
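To put rough numbers on why the clamp-down hurts so much, here's a toy AIMD model in Python. Everything in it is an assumption for illustration (20 ms RTT, 1500-byte MSS, loss whenever the window exceeds the path's ~2 Gbps bandwidth-delay product); it's not how vSphere's stack actually works, just the textbook dynamics:

```python
# Toy AIMD model of the sawtooth described above. All parameters are
# illustrative assumptions, not measurements of vSphere's stack.
RTT_S = 0.020
MSS_BITS = 12_000                 # ~1500-byte segment
CAP_BITS = 2e9 * RTT_S            # bits the 2 Gbps path absorbs per RTT

def average_goodput(backoff, rtts=20_000):
    """Simulate one flow; `backoff` maps the window at loss to the new window."""
    cwnd, total = MSS_BITS, 0.0
    for _ in range(rtts):
        if cwnd > CAP_BITS:        # policer bucket overflows -> loss event
            total += CAP_BITS
            cwnd = backoff(cwnd)   # sender reacts to the drop
        else:
            total += cwnd
            cwnd += MSS_BITS       # additive increase each RTT
    return total / rtts / RTT_S    # average bits per second

modern = average_goodput(lambda w: w / 2)       # halve on loss (Reno-style)
ancient = average_goodput(lambda w: MSS_BITS)   # clamp back to one segment
```

In this toy, the halve-on-loss sender settles around 70-75% of the 2 Gbps cap, while the clamp-to-one-segment sender spends almost every cycle re-ramping from nothing and averages closer to 50%, and that's before modeling timeouts, which is what actually kills the migration.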
The problem is I can find no way to force vSphere to cap its rate at 2 Gbps, and no way to do any kind of TCP tuning on vSphere either. Support told me to contact sales to see if VMware Professional Services could offer some paid solution.
Has anyone found a way to force a rate limit on vMotion, cap the TCP window size, tune any other TCP parameters of the vmkernel interfaces, etc.? It would be really dumb to have to install a parallel gigabit network just so vMotion doesn't fall apart because of its ancient default TCP config, while also limiting us to half the available bandwidth.
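For what it's worth, if a window cap ever does become tunable, the target value is just the bandwidth-delay product. Quick back-of-the-envelope (the 10 ms RTT is an assumed figure, plug in your measured inter-DC latency):

```python
# Receive-window cap that would pace a single TCP stream to ~2 Gbps.
# The 10 ms RTT is an assumption; measure your actual inter-DC latency.
rate_bps = 2e9
rtt_s = 0.010
window_bytes = rate_bps * rtt_s / 8
print(f"cap window at ~{window_bytes / 1e6:.1f} MB")  # 2.5 MB at 10 ms
```

A cap around that size would keep one stream from ever overrunning the policer's bucket in the first place, which is why I'd love to be able to set it.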