patpat
Contributor

UEFI PXE boot; Extremely low TFTP performance

Test scenario:

Host Hardware:HP Elitebook 8470p
Intel(R) Core(TM) i7-3630QM CPU @ 2.40 GHz, 16GB RAM
Host OS:Windows 8.1 Pro
VMware:Workstation 9, 10, 11, 12
PXE server:Serva
Boot Manager:Syslinux 6.03
Client Preboot Environment:EFI 64 (VMware FW)

When PXE booting EFI guests under Workstation with an OS such as ubuntu-15.04-desktop-amd64.iso,
syslinux.efi first transfers vmlinuz (6.5 MB) and initrd.lz (22.5 MB) over TFTP.
These transfers complete without errors, but they are extremely slow:

                    v9     v10    v11    v12    HP 2570p (EFI hardware)
vmlinuz 6.5 MB      8s     16s    9s     9s     2s
initrd.lz 22.5 MB   62s    121s   115s   116s   27s

The last column represents a hardware client booting exactly the same OS files from the same PXE server.
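To put the table above in terms of throughput, here is a minimal Python sketch. The sizes and times are copied from the table; the function names are only illustrative and are not part of the original test setup.

```python
# File sizes (MB) and transfer times (seconds) copied from the table above.
SIZES_MB = {"vmlinuz": 6.5, "initrd.lz": 22.5}
TIMES_S = {
    "v9": {"vmlinuz": 8, "initrd.lz": 62},
    "v10": {"vmlinuz": 16, "initrd.lz": 121},
    "v11": {"vmlinuz": 9, "initrd.lz": 115},
    "v12": {"vmlinuz": 9, "initrd.lz": 116},
    "HP 2570p": {"vmlinuz": 2, "initrd.lz": 27},
}


def throughput_mb_s(client, name):
    """Average TFTP throughput in MB/s for one file on one client."""
    return SIZES_MB[name] / TIMES_S[client][name]


def slowdown_vs_hardware(client, name):
    """How many times slower a client is than the HP 2570p for one file."""
    return TIMES_S[client][name] / TIMES_S["HP 2570p"][name]


if __name__ == "__main__":
    for client in TIMES_S:
        for name in SIZES_MB:
            print(
                f"{client:9s} {name:10s} "
                f"{throughput_mb_s(client, name):5.2f} MB/s  "
                f"x{slowdown_vs_hardware(client, name):.1f} vs hardware"
            )
```

Even the fastest guest (v9) takes more than twice as long as the hardware client for initrd.lz, and v10 to v12 take more than four times as long.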

I can see that under VMware EFI FW all the TFTP transfers (when handled by syslinux.efi) are very slow.

Of course syslinux.efi uses EFI FW resources for these transfers, such as instances of
EFI_UDP4_SERVICE_BINDING_PROTOCOL, EFI_UDP4_PROTOCOL, etc.

Paradoxically, the Workstation version with the best UEFI PXE performance is the old v9, but it is
still far slower than, e.g., an EFI notebook like the HP Elitebook 2570p booting the same test OS
files against the same PXE server.

I have run Wireshark traffic captures (e.g. on Workstation 11):

43128    235.451388000    192.168.64.1    192.168.64.128    TFTP    1454    Data Packet, Block: 16388             Delta 0.015367

43129    235.466755000    192.168.64.128    192.168.64.1    TFTP    46    Acknowledgement, Block: 16388        Delta 0.000128

43130    235.466883000    192.168.64.1    192.168.64.128    TFTP    1454    Data Packet, Block: 16389             Delta 0.019316

43131    235.486199000    192.168.64.128    192.168.64.1    TFTP    46    Acknowledgement, Block: 16389

where the pattern shows a long delay between the arrival of each data packet and the sending of its ACK.

From an EFI FW point of view, one possible reason for this behavior is incorrect handling of
event priorities; syslinux.efi relies on these events to know when a data packet has arrived so that
it can then send the corresponding ACK.
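The ACK delay in the capture above can be quantified with a short Python sketch. The four lines are copied from the capture (whitespace normalized, Delta column dropped); the parsing assumes Wireshark's default column order (number, time, source, destination, protocol, length, info), and the helper names are hypothetical.

```python
# Capture lines from the Workstation 11 trace above, Delta column omitted.
CAPTURE = """\
43128 235.451388000 192.168.64.1 192.168.64.128 TFTP 1454 Data Packet, Block: 16388
43129 235.466755000 192.168.64.128 192.168.64.1 TFTP 46 Acknowledgement, Block: 16388
43130 235.466883000 192.168.64.1 192.168.64.128 TFTP 1454 Data Packet, Block: 16389
43131 235.486199000 192.168.64.128 192.168.64.1 TFTP 46 Acknowledgement, Block: 16389
"""


def parse(capture):
    """Return a list of (timestamp, kind) with kind 'data' or 'ack'."""
    rows = []
    for line in capture.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        kind = "data" if fields[6] == "Data" else "ack"
        rows.append((float(fields[1]), kind))
    return rows


def delays(rows):
    """Split inter-packet gaps into client ACK delays and server turnarounds."""
    client_ack, server_turnaround = [], []
    for (t_prev, k_prev), (t_cur, k_cur) in zip(rows, rows[1:]):
        if k_prev == "data" and k_cur == "ack":
            client_ack.append(t_cur - t_prev)  # guest took this long to ACK
        elif k_prev == "ack" and k_cur == "data":
            server_turnaround.append(t_cur - t_prev)  # server sent next block
    return client_ack, server_turnaround


if __name__ == "__main__":
    client_ack, server_turnaround = delays(parse(CAPTURE))
    print("client ACK delays (ms):  ", [round(d * 1e3, 3) for d in client_ack])
    print("server turnarounds (ms): ", [round(d * 1e3, 3) for d in server_turnaround])
```

In this sample the guest needs roughly 15 to 19 ms to acknowledge each block, while the server sends the next block about 0.1 ms after receiving the ACK, so nearly all of the per-block time is spent on the client side.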

Please let me know if you guys need more info

Best,

Patrick

dariusd
Leadership

Hi again patpat!

Which virtual NIC were you using for the tests?  Also, due to a quirk in the way we need to share resources with the host, some vNICs perform better in UEFI when the virtual machine has only one virtual CPU, so it might be informative to try with one vCPU and with more than one vCPU and compare the results.

Cheers,

--

Darius

patpat
Contributor

Hi Darius !

  1. The behavior appears to be independent of the virtual NIC.
    The test was performed using the default ethernet0.virtualDev ("vlance"), but I have also tested clients with ethernet0.virtualDev = "e1000e", with similar results.
  2. The test was always performed on single-CPU VMware clients. I have now tested v12 with 2 CPUs and it gets better figures on the big transfer, when it should have been the other way around, right?

                    v12 (1 vCPU)   v12 (2 vCPU)
vmlinuz 6.5 MB      9s             10s
initrd.lz 22.5 MB   116s           71s

The results are repeatable.

Let me know.

Best,

Patrick

dariusd
Leadership

Hmmm... I did get that the wrong way around: 1 vCPU VMs can perform worse than 2 vCPU VMs during UEFI PXE boot.

If you add to the VM's configuration file:

    vnet.recvClusterSize = "1"


does the 1 vCPU VM's performance improve?


It probably still won't approach the performance of a physical machine, though.  The architecture of UEFI does not permit us to use interrupts for NIC events -- we must use polling -- and polling is very bad when we need to steal CPU time from the host OS in a shared environment.  We've made a few compromises to achieve a balance between TFTP throughput and not causing too much host CPU usage while the firmware is running (i.e. limiting the polling rate), and those compromises do make it somewhat unlikely that we will be able to match the performance of a native firmware implementation (which realistically can use just as much CPU as it wants... no one will care).


There's probably more that we can do here to improve our virtual UEFI firmware's TFTP performance if we have the time and opportunity to dig deeper...


Cheers,

--

Darius

patpat
Contributor

Hi there

                             v12 (1 vCPU)   v12 (2 vCPU)
vmlinuz 6.5 MB               9s             10s
initrd.lz 22.5 MB            116s           71s

vnet.recvClusterSize = "1"   v12 (1 vCPU)   v12 (2 vCPU)
vmlinuz 6.5 MB               11s            13s
initrd.lz 22.5 MB            129s           74s

Unfortunately vnet.recvClusterSize = "1" does not help.

I have also seen other vnet variables, such as:

vnet.bwInterval =
vnet.dontClusterSize =
vnet.maxRecvClusterSize =
vnet.maxSendClusterSize =
vnet.noPollCallback =
vnet.pollInterval =
vnet.recalcInterval =
vnet.recvClusterSize =
vnet.recvThreshold =
vnet.sendClusterSize =
vnet.sendThreshold =
vnet.useSelect =

Some of them (e.g. vnet.pollInterval) sound interesting, but I wonder whether a description of each and its default value could be made available.

I understand what you say about the need to compromise between EFI, NIC polling, CPU time, etc.,
but the approach can probably still be improved; today EFI PXE is just too slow.

For example, I think those FW compromises should probably not be the same before and after calling ExitBootServices().
I do not know whether you guys already take that difference into account.

Best,

Patrick

dariusd
Leadership

I don't have a ready reference with a description of all the options and their defaults, sorry.

After ExitBootServices, the OS "owns" the platform and can do as it chooses, and it is not constrained by the UEFI restrictions -- it is free to reinitialize the PIC/APIC/LAPIC and MSI/MSI-X and use hardware interrupts as it chooses.  The compromises are made solely in the firmware (in the DXE environment, where the firmware's own NIC drivers are in use) and are only relevant prior to ExitBootServices.

I've been contemplating setting the firmware up to use interrupts in a way that is "concealed" from the rest of the DXE environment, so that we can achieve better performance while not visibly breaking the UEFI specification...  :smileydevil:  Haven't actually implemented anything along those lines yet.

Cheers,

--

Darius

patpat
Contributor

Well, w/o more variables to tweak I can only hope you guys can somehow do something about this issue.

Something I have noticed: when bootmgfw.efi transfers Boot.wim (270 MB) over TFTP with windowsize=8:

                    v12    HP 2570p (EFI hardware)
Boot.wim (270 MB)   47s    36s

OK, windowsize=8 means 8 times fewer ACKs, but in this case the difference in performance between the VM and the hardware client is not so big.
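As a rough illustration of the "8 times fewer ACKs" point, here is a minimal Python sketch that counts the ACKs a client sends for a given file size, block size and window size. It assumes one ACK per full window plus one for the final partial window (as in the TFTP windowsize option, RFC 7440). The 1456-byte block size used for Boot.wim is only a placeholder, since the thread does not state the negotiated blksize, and 270 MB is treated as 270 * 10**6 bytes.

```python
import math


def block_count(file_bytes, block_size):
    """Number of TFTP data blocks; the last block is always shorter than
    block_size (possibly empty), which is how the transfer is terminated."""
    return file_bytes // block_size + 1


def ack_count(file_bytes, block_size, windowsize=1):
    """ACKs sent by the client, assuming one ACK per window of blocks."""
    return math.ceil(block_count(file_bytes, block_size) / windowsize)


if __name__ == "__main__":
    boot_wim = 270 * 10**6  # approximate size from the post
    blksize = 1456          # ASSUMPTION: not stated in the thread
    lock_step = ack_count(boot_wim, blksize, windowsize=1)
    windowed = ack_count(boot_wim, blksize, windowsize=8)
    print(f"ACKs with windowsize=1: {lock_step}")
    print(f"ACKs with windowsize=8: {windowed}")
    print(f"reduction factor:       {lock_step / windowed:.2f}")
```

If the per-ACK delay seen in the syslinux.efi capture were the dominant cost, cutting the number of ACKs by this factor would be expected to shrink the gap to hardware, which is consistent with the Boot.wim figures above; the sketch does not by itself identify the cause.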

I wonder whether it could be something else besides driver performance, e.g. the use of
EFI_UDP4_SERVICE_BINDING_PROTOCOL and EFI_UDP4_PROTOCOL
(bootmgfw.efi does not use these protocols).

Best,

Patrick
