XDP performance

Dear SolidRun community,

Has anybody done any experiments with XDP on the LX2160A board?
As far as I can see, XDP zero-copy support was added to the vanilla Linux kernel a few months ago, and since August it has been in the 5.15 branch here: qoriq-components/linux - Linux Tree for QorIQ support.

But, unfortunately, the performance is mediocre.
We expected numbers comparable with VPP backed by DPDK (which we still could not get working stably), but in reality the forwarding rate is comparable with plain Linux kernel forwarding:
~500 kpps for xdp_router from the tutorial
~400 kpps for VPP with the AF_XDP driver
~370 kpps for Linux kernel forwarding
>4.5 Mpps for VPP/DPDK.

Maybe we forgot something or are doing something wrong? If someone has achieved satisfying performance, please advise.
Below are our setup/test details.

We use the binary built by SolidRun's lx2160a_build, with simple changes to use the 5.15 kernel mentioned above.
$ uname -a
Linux localhost 5.15.71-rt51-07203-g00e98e11cb01 #1 SMP PREEMPT_RT Fri Jan 27 22:25:43 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
$
The BPF/XDP-related kernel options are:

# grep -E "(BPF)|(XDP)" ./build/linux/.config
CONFIG_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_ARCH_WANT_DEFAULT_BPF_JIT=y
# BPF subsystem
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
# CONFIG_BPF_UNPRIV_DEFAULT_OFF is not set
# CONFIG_BPF_PRELOAD is not set
# CONFIG_BPF_LSM is not set
# end of BPF subsystem
CONFIG_CGROUP_BPF=y
CONFIG_XDP_SOCKETS=y
# CONFIG_XDP_SOCKETS_DIAG is not set
# CONFIG_NETFILTER_XT_MATCH_BPF is not set
# CONFIG_BPFILTER is not set
# CONFIG_NET_CLS_BPF is not set
# CONFIG_NET_ACT_BPF is not set
# CONFIG_BPF_STREAM_PARSER is not set
# CONFIG_SENSORS_XDPE122 is not set
CONFIG_BPF_EVENTS=y
# CONFIG_TEST_BPF is not set
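
As a runtime sanity check, the JIT state can also be read back from procfs; with CONFIG_BPF_JIT_ALWAYS_ON=y it should report 1:

# cat /proc/sys/net/core/bpf_jit_enable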

The eth1 and eth2 interfaces are created with ls-addni, roughly as sketched below.
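
(A minimal sketch; the dpmac indices here are examples only, the actual objects can be listed with restool first:)

# restool dprc show dprc.1          # list dpmac/dpni objects in the container
# ls-addni dpmac.17                 # dpmac index is an example, adjust to your setup
# ls-addni dpmac.18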

xdp-tutorial forwarding:
GitHub - xdp-project/xdp-tutorial: XDP tutorial, built natively on the board (rough build sketch below).
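
(Something along these lines; clang/llvm and libelf development headers are assumed to be installed on the board:)

$ git clone --recurse-submodules https://github.com/xdp-project/xdp-tutorial.git
$ cd xdp-tutorial
$ make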
The xdp_router program is then attached:

../xdp-tutorial/packet-solutions# ./xdp_loader -d eth1 -N -F --progsec xdp_router
Success: Loaded BPF-object(xdp_prog_kern.o) and used section(xdp_router)
 - XDP prog attached on device:eth1(ifindex:5)
 - Pinning maps in /sys/fs/bpf/eth1/
../xdp-tutorial/packet-solutions# ./xdp_loader -d eth2 -N -F --progsec xdp_router
Success: Loaded BPF-object(xdp_prog_kern.o) and used section(xdp_router)
 - XDP prog attached on device:eth2(ifindex:6)
 - Pinning maps in /sys/fs/bpf/eth2/
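
(A quick sanity check that the maps were actually pinned:)

# ls /sys/fs/bpf/eth1/ /sys/fs/bpf/eth2/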

As one can see below, the program is attached in native mode; at least, that is what ip link show indicates.

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 06:1b:6f:31:a9:6c brd ff:ff:ff:ff:ff:ff
3: sit0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/sit 0.0.0.0 brd 0.0.0.0
4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether d0:63:b4:03:1c:ce brd ff:ff:ff:ff:ff:ff
5: eth1: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether d0:63:b4:03:1c:cb brd ff:ff:ff:ff:ff:ff
    prog/xdp id 36 tag 551558afe8187df7 jited
6: eth2: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether d0:63:b4:03:1c:cd brd ff:ff:ff:ff:ff:ff
    prog/xdp id 44 tag 551558afe8187df7 jited
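
(bpftool, if installed, can be used as a cross-check that the program is attached in driver/native mode:)

# bpftool net show dev eth1
# bpftool net show dev eth2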

Then IP addresses are configured, IP forwarding is enabled via sysctl (it affects BPF forwarding as well as kernel forwarding), and the test is performed with Cisco T-Rex.
To be sure that forwarding goes through XDP, we check that tcpdump does not see the forwarded packets; we can also enable the maps and watch the XDP_REDIRECT counters increasing (e.g. as sketched below).
Still, this is just ~30% faster than the kernel, while in the examples published on the internet for PC boxes XDP gets much closer to DPDK.
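
(For example, with the maps pinned by the loader, the counters can be dumped with bpftool; the map name xdp_stats_map is taken from the tutorial sources and may differ:)

# bpftool map dump pinned /sys/fs/bpf/eth1/xdp_stats_map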

VPP:
Same basic setup; then, following the VPP manual:

vpp# create interface af_xdp host-if eth1 num-rx-queues all
vpp# create interface af_xdp host-if eth2 num-rx-queues all

Then MAC addresses are set, IP addresses are assigned (roughly as sketched below), and a simple test is performed with T-Rex, showing about 400 kpps.
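
(Roughly, the VPP interface configuration looks like the following; <af_xdp-if> is a placeholder for the names reported by “show interface”, and the addresses are examples:)

vpp# set interface mac address <af_xdp-if> d0:63:b4:03:1c:cb
vpp# set interface ip address <af_xdp-if> 192.0.2.1/24
vpp# set interface state <af_xdp-if> up
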
Note that we were not able to “tell” VPP to use zero-copy mode:

vpp# create interface af_xdp host-if eth2 num-rx-queues all zero-copy
af_xdp             [error ]: af_xdp_create_queue: xsk_socket__create() failed (is linux netdev eth2 up?): Operation not supported
create interface af_xdp: xsk_socket__create() failed (is linux netdev eth2 up?): Operation not supported
vpp#
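
(The “Operation not supported” presumably means the dpaa2-eth driver in this tree does not implement the XSK zero-copy hooks; a crude way to check is to grep the driver sources in the build tree, e.g.:)

$ grep -rln xsk build/linux/drivers/net/ethernet/freescale/dpaa2/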

During the first “normal” (non-zero-copy) interface creation, we also see some alarms in the VPP logs:

libbpf: elf: skipping unrecognized data section(7) .xdp_run_config
libbpf: elf: skipping unrecognized data section(8) xdp_metadata
libbpf: elf: skipping unrecognized data section(7) xdp_metadata
libbpf: prog 'xdp_dispatcher': BPF program load failed: Invalid argument
libbpf: prog 'xdp_dispatcher': -- BEGIN PROG LOAD LOG --
Func#11 is safe for any args that match its prototype
btf_vmlinux is malformed
R1 type=ctx expected=fp
; int xdp_dispatcher(struct xdp_md *ctx)
0: (bf) r6 = r1
1: (b7) r0 = 2
; __u8 num_progs_enabled = conf.num_progs_enabled;
2: (18) r8 = 0xffff800008448000
4: (71) r7 = *(u8 *)(r8 +0)
 R0_w=invP2 R1=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R8_w=map_value(id=0,off=0,ks=4,vs=84,imm=0) R10=fp0
; if (num_progs_enabled < 1)
5: (15) if r7 == 0x0 goto pc+141
; ret = prog0(ctx);
6: (bf) r1 = r6
7: (85) call pc+140
btf_vmlinux is malformed
R1 type=ctx expected=fp
Caller passes invalid args into func#1
processed 84 insns (limit 1000000) max_states_per_insn 0 total_states 9 peak_states 9 mark_read 1
-- END PROG LOAD LOG --
libbpf: failed to load program 'xdp_dispatcher'
libbpf: failed to load object 'xdp-dispatcher.o'
libxdp: Failed to load dispatcher: Invalid argument
libxdp: Falling back to loading single prog without dispatcher
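
(For reference, given the “btf_vmlinux is malformed” messages above, the kernel’s BTF status can be checked like this:)

$ grep BTF build/linux/.config
$ ls -l /sys/kernel/btf/vmlinux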

So, any ideas what could be wrong?