Hello,
Greetings. I have a standalone ESXi host running ESXi 5.5, with two Intel 10GbE network cards installed. I tried the following:
1. Created a virtual switch with one of the 10G NICs connected to it, and a VM with a vSwitch NIC - setup A.
2. Configured the other 10G NIC as a passthrough NIC, and created a VM with a PCI device (the passthrough NIC) - setup B.
In setup B I can see that the ixgbe NIC driver is loaded in the VM. I am sending data to the application running in both setups, and I expected higher performance in passthrough mode than in the vSwitch setup. But unfortunately I see fewer connections/sec in passthrough mode than in the vSwitch setup. While the application was under load I monitored esxtop on the ESXi host, and I could see that one of the PCPUs was at 100% (the PCPU varies, but only one at a time), with the corresponding VM at the top with a high %USED.
Could anyone shed some light on why my PCI passthrough is not scaling well? Is there any tuning I should do?
What kind of OS, Application and data?
Are you sure that the NIC is the bottleneck and not CPU for example?
What is the overall goal you are trying to achieve?
// Linjo
Sorry about not giving clear information. The guest OS is Ubuntu 14.04 with a 64-bit 3.13 kernel.
Basically what I want to achieve is better performance with the PCI passthrough NIC (higher CPS than the vSwitch NIC setup gives).
I suspect the NIC because when I use the same application and data on the vSwitch NIC setup I get 40K connections/sec, but with the passthrough NIC I only get 25K (same number of CPUs and memory in both setups). How could it be a CPU or other problem? (To be honest, this was a big surprise for me.)
First off, can you give some more details on the type of traffic you're testing with? Is it TCP or UDP connections? What application layer protocol? Have you tested other network benchmarking tools such as iperf as well?
Does the VM only have a single vCPU, or did you try to increase it? When you pass a NIC through to a guest VM, all the work related to handling network packets that is not directly offloaded to the NIC hardware needs to be done inside the guest OS with its limited computing resources.
If you use a vmxnet3 vNIC however, most of this is offloaded from the VM to the host, independently of the computing resources assigned to the VM.
Therefore, make sure the VM has enough CPU resources and assign additional vCPUs. Also make sure the VM's virtual hardware version is up-to-date.
I assume you may also need to tune the ixgbe driver settings to make sure it is using the hardware offloading capabilities, and that multiple CPU interrupt queues / receive-side scaling are enabled, etc.
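As a rough sketch of what I mean (standard ethtool commands; the supported queue count depends on your NIC and driver version, and on older ixgbe releases queues may only be configurable via module parameters):

```shell
# Show how many RX/TX queues the NIC supports and currently uses
ethtool --show-channels eth0

# If supported, match the combined queue count to the number of vCPUs (here 4)
ethtool --set-channels eth0 combined 4
```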
For example, on this 4 vCPU VM with a vmxnet3 vNIC you can see that all CPUs handle interrupts for this single network interface with multiple receive and transmit queues:
# cat /proc/interrupts | egrep -i 'eth|cpu'
CPU0 CPU1 CPU2 CPU3
57: 76566551 70446837 68928406 61311955 PCI-MSI-edge eth0-rxtx-0
58: 80703330 66993836 64197758 56587761 PCI-MSI-edge eth0-rxtx-1
59: 52852203 67779552 74430390 81477134 PCI-MSI-edge eth0-rxtx-2
60: 85093782 65899469 55469451 61844401 PCI-MSI-edge eth0-rxtx-3
61: 0 0 0 0 PCI-MSI-edge eth0-event-4
Also check your top CPU stats for a high number of hard (%hi) or soft (%si) interrupts when you run the test.
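If you prefer a quick snapshot over watching top, you can also read the per-CPU NET_RX softirq counters directly (assuming a Linux guest); whichever columns grow fastest between the two snapshots are the CPUs doing the receive work:

```shell
# Two snapshots of the per-CPU network-receive softirq counters, one second apart
grep NET_RX /proc/softirqs
sleep 1
grep NET_RX /proc/softirqs
```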
Next you should examine the NIC offloading settings and make sure at least checksumming and LRO are enabled, or enable others if needed:
# ethtool --show-offload eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: on
large-receive-offload: on
# ethtool --show-coalesce eth0
# ethtool --show-pause eth0
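If any of the offloads show as off, you can usually switch them on individually (assuming the ixgbe hardware and driver support the feature; the command will report an error for unsupported ones):

```shell
# Enable TCP segmentation, generic segmentation/receive, and large receive offload
ethtool --offload eth0 tso on gso on gro on lro on
```

Note that LRO should stay off if the VM forwards or bridges traffic, since it can break forwarding.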
From my experience tuning heavy-traffic firewalls, increasing the NIC ring buffer sizes also helps reduce CPU interrupts and gain higher throughput. Increase the values if needed (personally I found 1024 to be a good value, but your mileage may vary):
# ethtool --show-ring eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 256
RX Mini: 0
RX Jumbo: 0
TX: 512
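To raise the current settings toward the pre-set maximums shown above (values here are the ones that worked for me, not a universal recommendation):

```shell
# Raise RX/TX ring sizes from the defaults (256/512) to 1024
ethtool --set-ring eth0 rx 1024 tx 1024
```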
Thanks for the reply, it's really useful.
The VM does not have a single vCPU; I tried increasing the number of CPUs. Yes, I agree with what you are saying:
"
When you pass a NIC through to a guest VM, all the work related to handling network packets that is not directly offloaded to the NIC hardware needs to be done inside the guest OS with its limited computing resources.
If you use a vmxnet3 vNIC however, most of this is being offloaded from the VM to host independently of the computing resources assigned to the VM.
"
So I tried with 4 CPUs (not hyperthreaded), 16 GB of memory, and one 10G Ethernet card. Checksum offloading is on, LRO is on, the ring buffer size is set to 1024, and RX and TX pause are on.
The weird thing is that when I send requests to the application, I get 95K connections per second when all the software interrupts are handled by one CPU (CPU 0). When I pin the interrupts to spread them across all the CPUs, I can see that all the CPUs are nicely handling software interrupts, but I get only 83K connections per second. Definitely something strange is happening.
If you get better performance with everything pinned to a single CPU, then I suppose in your case CPU cache locality could be more crucial than raw processing power.
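A rough sketch of how you could test that theory by pinning all the NIC's interrupts back to a single CPU (this assumes the eth0 IRQ numbers appear in /proc/interrupts as in my earlier example; the value written is a hex CPU bitmask, so 1 means CPU0 only, and writing it requires root):

```shell
# Pin every eth0 queue interrupt to CPU0 (bitmask 1) to test cache locality
for irq in $(awk -F: '/eth0/ {gsub(/ /,"",$1); print $1}' /proc/interrupts); do
    echo 1 > /proc/irq/$irq/smp_affinity
done
```

You may also need to stop the irqbalance daemon first, or it will redistribute the interrupts behind your back.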
This is also mentioned in the first article here, among other tuning points you can try:
http://timetobleed.com/useful-kernel-and-driver-performance-tweaks-for-your-linux-server/
http://dak1n1.com/blog/7-performance-tuning-intel-10gbe/
There are other general performance recommendations, like disabling the physical host's power-saving options in the BIOS, enabling the latency-sensitive VM option, and a lot more that you can find in this guide:
https://www.vmware.com/files/pdf/techpaper/VMware-PerfBest-Practices-vSphere6-0.pdf