PonF
Contributor

vMotion slow on 10Gbit network

We have two vSAN clusters. When we migrate a powered-on VM between the clusters we only get about 50+50 MB/s. It feels like we are hitting a 1 Gbit limit, but in reality we have a 10+10 Gbit/s network between the clusters. What could be the problem?

I have done an iPerf test between two VMs, one on each cluster, and there we got 5 Gbit/s with an E1000 network adapter on a single port. The VMs were on the same network as vMotion.
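For reference, a plain iperf run between two test VMs, roughly along the lines of the test described above, could look like the sketch below (the IP address is only a placeholder, not from this environment):

# on the receiving VM
iperf -s

# on the sending VM (10-second TCP test, report in Mbit/s)
iperf -c 192.168.10.20 -t 10 -f m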

We have two VMkernel ports and two physical ports dedicated to vMotion only, configured as active/standby. The physical switches also carry vSAN traffic.
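For anyone checking a similar setup, the VMkernel service tagging and the active/standby uplink order can be verified from the ESXi shell with something like this (vmk1 and the "vMotion" port group name are just placeholders; on a distributed switch the failover part looks different):

# list VMkernel interfaces, their MTU and port groups
esxcli network ip interface list

# show which services (Management, vMotion, vSAN, ...) a vmk is tagged for
esxcli network ip interface tag get -i vmk1

# on a standard vSwitch, show the active/standby uplink order of a port group
esxcli network vswitch standard portgroup policy failover get -p "vMotion"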

I have read about similar problems that often point to the migration being done on the management network, but that is not the case here. There is no extra traffic on the management port, only on vMotion (and vSAN). We do not use jumbo frames, but I don't think that is the whole problem here?

See the attached picture for how the network traffic goes. This is only observed by looking at the Performance tabs during migration. The disks for vSAN are 10K and the specs say 12 Gbit/s, so the limit should not be there? But I have not done any performance test on the vSAN network (it is in production, so I don't know how).

Edit:
In the example in the picture the migration is done between Host3 and Host6. vSAN first stores the mirrored disk on Host1 and Host2, and then on Host4 and Host5 after the migration completes.
The migration takes 10 minutes and the disk is set up as 80 GB (thin provisioned). The summary page for the VM says "Storage usage 94.17GB".

PonF
Contributor

Is there no one who has any idea why the migration is slow? Any settings or anything? I did not find a vMotion sub-forum, but I hope I posted in the correct place.

Tolga_y1980
Contributor

Hello, did you solve this problem?

PonF
Contributor

No, I never solved it. As you can see, I did not get any more advice here either...

We don't use vMotion very often, so I have not prioritized this. Next year we will rebuild/upgrade the environment, so I guess we have to live with this until then and verify that the speed in the new environment works before taking it into production.

Tolga_y1980
Contributor

Okay, by the way, I solved it. I can explain shortly: when you are doing a vMotion between clusters it is not using the vMotion network, because it is a Storage vMotion. It is using the management network for this transfer. So first of all you may check whether your management network is 1 Gbit or 10 Gbit. If it is a 1 Gbit network, then you need to move your management VMkernels to the 10 Gbit port groups on the other switch; this can be done online.

After that, if you power off the VM and send it to the other cluster, it will use more than 1 Gbit. But don't expect to see 10 Gbit of traffic, since the VMDK files are encapsulated while they are sent; the traffic will be about 2 Gbit or something like that.


Tolga_y1980
Contributor

And when you try to do this vMotion while the VM is powered on, it is trying to do it over the vMotion VMkernels and limiting it to 1 Gbit. So move your management VMkernels to a 10 Gbit network; if not, then power off the VM and start the vMotion, and you will see that it goes over the management VMkernel at a higher speed than 1 Gbit.

Arvind_Kumar11
Enthusiast

Assuming you are already using the 10G network interface.

If yes, then you can ask your network guys to configure jumbo frames on the physical adapters at the switch end. Then you can raise the MTU from 1500 to 9000 on the vMotion VMkernel adapter.
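A rough sketch of what that looks like from the ESXi shell, assuming a standard vSwitch; vSwitch1, vmk1 and the peer address 192.168.20.11 are only placeholders:

# raise the MTU on the vSwitch carrying vMotion
esxcli network vswitch standard set -v vSwitch1 -m 9000

# raise the MTU on the vMotion VMkernel interface
esxcli network ip interface set -i vmk1 -m 9000

# verify the path really carries 9000-byte frames without fragmentation
# (8972 = 9000 minus the IP and ICMP headers)
vmkping -I vmk1 -d -s 8972 192.168.20.11

Note that the MTU has to match end to end (VMkernel, vSwitch and every physical switch port in the path), otherwise the large frames get dropped.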

PonF
Contributor

@Tolga_y1980 

Okay, but I don't think this is correct. The Storage vMotion should not go over the management network. From what I have read this happens if you have a wrong configuration, snapshots or other faults. Sure, if you speed up the management network it will probably go faster, but then you are using the wrong network?

As I said in my first post there is no extra traffic on our management port, only on vMotion (but slow). So at least that part looks correct to me.

PonF
Contributor

@Arvind_Kumar11 

From my earlier investigation I would say we are using the 10 Gbit interface. Won't jumbo frames only speed up the traffic a bit? In my case it feels almost like a 1 Gbit limit somewhere (but not at the physical level, since the iPerf transfer gave higher speeds).

Tibmeister
Expert

So a few things with all this. First, jumbo frames do more than simply speed up the traffic: they allow packets of up to 9000 bytes to go through the interface per transaction, instead of the standard 1500-byte packets. That's a six-fold REDUCTION in packet processing. The backplane of the switch and of the NIC has two critical measurements in the realm of performance: bandwidth and packets-per-second throughput. You may not be coming close to 10 Gb of traffic, but if you are hitting the packets-per-second limit of what the NIC or switch can handle, then you have a bottleneck. It can also get more complicated if you have a single ASIC per interface or multiple interfaces sharing an ASIC, because the ASIC is usually where the bottlenecks in either packets per second or bandwidth show up.
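As a back-of-the-envelope illustration of the packets-per-second point (line rate only, framing overhead ignored):

# packets per second needed to fill a 10 Gbit/s link
echo "1500-byte frames: $(( 10000000000 / (1500 * 8) )) pkt/s"   # ~833,000
echo "9000-byte frames: $(( 10000000000 / (9000 * 8) )) pkt/s"   # ~139,000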

On to the storage: yes, your controller is capable of 12 Gbit/s of bandwidth, but that's SHARED across all your disks, and 10K disks are not that fast at all. It is also sharing all IO, both read and write. So while you have 12 Gbit/s of bandwidth (assuming SAS here), the number of disks on the bus in a shared config will impact this, along with the sheer 10K speed.
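For a rough sense of scale (typical figures, not measurements from this hardware): a single 10K RPM SAS disk usually sustains somewhere around 150-250 MB/s of sequential throughput, which is only on the order of 1-2 Gbit/s:

# approximate per-disk throughput at ~200 MB/s sequential
echo "$(( 200 * 8 )) Mbit/s"   # ~1600 Mbit/s, well below the 12 Gbit/s SAS link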

Also on the storage side, you need to make sure that the storage itself supports the VAAI primitives. What this does is offload the data traffic during a vMotion (on the same storage) so it does not have to go over the wire. If you are going between storage arrays, well, then VAAI doesn't help all that much. If you can't take advantage of VAAI for any reason, then the Storage vMotion should use the vMotion interface. This is where you can also get into trouble.
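Whether VAAI offloads are reported for a (non-vSAN) device can be checked from the ESXi shell along these lines; the naa.xxxx identifier is a placeholder:

# list VAAI primitive support for all devices, or for one specific device
esxcli storage core device vaai status get
esxcli storage core device vaai status get -d naa.xxxx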

If your vMotion interface is on the same subnet as your management interface, the lower-numbered interface will always get the vMotion traffic. vmk0 is your management interface, so guess what gets the transfer. The same is true if vMotion is on the same subnet as the storage VMkernel adapter: you don't get predictable results.
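A quick way to compare the subnets of the different VMkernel interfaces is something like:

# show IPv4 address and netmask for every VMkernel interface
esxcli network ip interface ipv4 get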

The other factor that hasn't been brought up is the NIC configuration.  If you have a single 10Gb NIC handling all the traffic, then yeah, it's going to be a bottleneck.  

I hope this helps to provide some direction for your investigation, but from the information provided not a ton of help can be given outside of the general advice above.

Tolga_y1980
Contributor

Just shut down the VM and try again, and watch which network is used. When the VM is powered on it tries to use the vMotion network, but if you shut down the VM it will use the management VMkernel. You can watch this in esxtop with the 'n' command.
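For example (vmk0/vmk1 are just the usual defaults and may differ in this environment):

esxtop          # start esxtop on the host doing the migration
# press 'n' to switch to the network view
# compare the MbTX/s and MbRX/s columns for vmk0 (management) and vmk1 (vMotion)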
