I’ve been running some throughput tests on the SFP+ interfaces using iperf3 and I’m not seeing the throughput that I would expect.
Using the standard build from the SolidRun github repository, I find that the recommended 10GTek SFP+ optical transceiver module (10Gtek AXS85-192-M3) achieves about 4.8 Gbits/sec. I see similar performance from an FS SFP-10GSR-85 HP.
On dpmac.9 the 10GTek is seriously throttled to 357 Mbits/sec, while the FS transceiver still achieves 4.51 Gbits/sec. The retry count for the FS test is high, and for the 10GTek it is extremely high.
As a control, I ran a similar test from an Ubuntu workstation with an Intel PCIe NIC and an Intel transceiver running through the same network switch to the same iperf3 server. I get 9.23 Gb/sec. So, the network and server running the test will handle almost the full 10Gbit/sec bandwidth.
Is there anything that I can do to get faster speeds?
On another note, the recommended Finisar SFP+ (FTLX8571D3BCL) never works in any of the cages. The link stays up momentarily and then the switch (Aruba) suspends the port due to “link flapping”. I’m working with my IT department to disable the link flapping protection to see if it fares better. Maybe it’s an issue with my switch?
We have been having issues with packet loss when using the SFP interfaces too. We’ve tried both FS 10GSR-85 and the supported 10GTek devices. I’ve found the 10GTek SFPs to be better than the FS ones but see packet loss with both (we’ve been using DPDK + pktgen to generate test traffic flows).
By default (unless you pass in UDP arguments) iperf will use TCP, which backs off aggressively when packet loss occurs, so even a low rate of loss can reduce the achievable throughput a lot. I expect packet loss might be the cause of the apparent low performance of the SFP interfaces you are seeing.
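To put a rough number on how sensitive TCP is to loss, here is a minimal sketch using the Mathis et al. approximation for Reno-style congestion control, throughput ≈ (MSS / RTT) · C / sqrt(p). The MSS and RTT values are assumed figures for a typical LAN, not measurements from this thread, and current Linux kernels default to CUBIC rather than Reno, so treat the output as an order-of-magnitude guide only.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(3 / 2)):
    """Approximate ceiling for one TCP stream, in bits per second.

    Mathis et al. model: rate ~= (MSS / RTT) * C / sqrt(p), where p is the
    packet loss probability and C is a constant close to 1.22.
    """
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    MSS = 1448      # assumed: 1500-byte MTU with TCP timestamps enabled
    RTT = 0.0005    # assumed: 0.5 ms round trip through one switch
    for p in (1e-6, 1e-5, 1e-4, 1e-3):
        gbps = mathis_throughput_bps(MSS, RTT, p) / 1e9
        print(f"loss {p:.0e}: about {gbps:.2f} Gbit/s per stream")
```

Under those assumed values, a loss rate of 1 packet in 10,000 already limits a single stream to under 3 Gbit/s, so a small amount of loss on the SFP+ link would be enough to explain a large throughput drop.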
I think the cause might be in the patches applied to the lx2160a_build project. This checks out the current LSDK release and applies a few cherry-picked patches from NXP, including some relating to SERDES configuration for the 10G interfaces. Since those patches, NXP have done a lot more work on SERDES driver support and SFP integration and most of these changes are not included in lx2160a_build:
That patch adds support for TX equalization configuration to the SERDES (this is not included in lx2160a_build but builds upon an earlier patch, “0009-phy-add-support-for-the-Layerscape-SerDes-28G.patch”, which is applied to the lx2160a_build project).
We’re still working through this, but at the moment it appears the interfaces work better when patches 9, 10 and 11 are not applied to the kernel. It seems there are a few options available:
1. Wait until the next Layerscape SDK release, which will presumably include NXP’s changes.
2. Attempt to add more of NXP’s patches to the current release, bringing it more up-to-date. (Maybe SolidRun could look into this?)
3. Move backwards, closer to the last official LSDK release, by not applying 9, 10 and 11.
Thanks for the reply. I was starting to look into the kernel patches yesterday to understand how the SERDES is configured and what device drivers work. I haven’t gotten into the NXP repository yet but did clone it already. I’m hoping to look into your suggested option 2 but it might be tricky to get it to work just right. You’ve obviously been down this road already. The hints should help to speed up my research. I’ll keep posting here, if I have successes or abject failures.
I noticed your earlier post about dpmac.8 and dpmac.10 having better performance. My experience is that dpmac.9 has poor performance and dpmac.7, 8 & 10 have much better throughput. That is, with the right SFP+ transceivers. The Finisar transceivers always have trouble. I turned off “port flapping” protection at the switch and the link stays up, but the throughput is on the order of 100Mb/sec with lots of retries, on the order of 10,000 over the course of a one minute test.
In general, the number of retries when running iperf is better on the LX2160 than from my control workstation. The throughput is about half, however. I’ll try the DPDK + pktgen tools to see how they compare. I’ll have to learn how to use them and see if I can get them up and running on my workstation to compare.
I also tried the eq tuning parameters mentioned in your earlier post. I modified the in-memory device tree but I did not notice any change in test results. I need to inspect the processor registers to verify the changes. I haven’t hooked up test equipment to look at the signals yet. We don’t have pods/probes for inspecting the optical link but do have SFP+ pods to look at the signals at the cage. I’m working on getting that equipment and a HW engineer to assist.
Not sure if this will help you or not. But I am running the cex7 build with these modifications
HEAD detached at ddab3ad
Changes not staged for commit:
(use "git add/rm <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
I am using MC v10.33, DPDK v22.03 and PKTGEN v22.04.1. I am running PKTGEN with 32,768 descriptors per direction per port which required a tweak (2 → 4) to the number of huge pages.
Using DAC cables between dpmac7 <-> dpmac8 and dpmac9 <-> dpmac10 (I have also tested all the other connection combinations with the same results) I can get zero packet loss down to 128 byte packet sizes. This is with all 4 ports doing full rate (100%) tx & rx and PKTGEN reports 30Gbps combined throughput.
Older versions of PKTGEN result in packet loss that is not reflected as rx frame or discard errors when using DAC cables.
If I swap to SFPs (I’ve tried 10Gtek and FS) then I see rx frame and discard errors.
I am finally able to see close to 10Gbit throughput on dpmac.10. I haven’t tested any of the other ports. In the end, the device under test was cpu bound. The SFP+ port seems to be working well when it is supplied with enough data.
I was initially testing with iperf3 but iperf3 is single threaded and maxes out one core on the LX2160. The solution was to use iperf2 with the -P option. The -P option starts multiple client threads and this helps to fully load the Ethernet port. Since iperf2 is multi-threaded, it is able to use multiple cores effectively. On my control workstation, one core is enough, since it’s a fairly high performance Xeon machine. On the other hand, the Layerscape ARM cores are a bit more humble and one core can’t shove enough bytes into the pipe fast enough.
Running iperf -P 4, which opens 4 parallel client connections, gives high throughput, and -P 2 is almost as good.
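For anyone wanting to confirm the same CPU-bound condition on their own board, here is a rough sketch that samples `/proc/stat` twice and prints how busy each core was in between, along with the share of time spent in softirq (where the kernel accounts most network receive processing). The one-second interval and the function names are illustrative choices, not anything from the SolidRun or NXP tooling.

```python
import os
import time


def parse_proc_stat(text):
    """Return {"cpu0": [counters...], ...} for the per-core lines of /proc/stat."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the aggregate "cpu" line and non-CPU lines such as "intr".
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            cores[fields[0]] = [int(value) for value in fields[1:]]
    return cores


def busy_percent(before, after):
    """Return {cpu: (busy %, softirq %)} between two parsed snapshots."""
    result = {}
    for cpu, new in after.items():
        delta = [n - o for n, o in zip(new, before[cpu])]
        # Columns: user nice system idle iowait irq softirq steal (guest time
        # is already counted inside user, so only the first eight are summed).
        total = sum(delta[:8])
        if total == 0:
            continue
        idle = delta[3] + delta[4]
        result[cpu] = (100.0 * (total - idle) / total, 100.0 * delta[6] / total)
    return result


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    with open("/proc/stat") as f:
        first = parse_proc_stat(f.read())
    time.sleep(1.0)  # run the iperf test during this window
    with open("/proc/stat") as f:
        second = parse_proc_stat(f.read())
    for cpu, (busy, softirq) in sorted(busy_percent(first, second).items()):
        print(f"{cpu}: {busy:5.1f}% busy, {softirq:5.1f}% softirq")
```

If one core sits near 100% while the others are mostly idle during a single-stream test, the test is probably limited by that core rather than by the SFP+ link.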
I came across the same problem on a Honeycomb board with 10Gb DAC cables. Initially, for RX (iperf3) I was able to get 2-3 Gbps (single stream); TX was around 6-7 Gbps. I identified the bottleneck as ksoftirqd hitting 100% on a single core. This was tested with MTU 1500. Basically, the CPU core isn’t fast enough to drain the ring buffers quickly enough. I changed the MTU to 9000 on the SolidRun machine and the external system. This time the SolidRun is able to achieve 9 Gbps for a single stream (both TX and RX). Faster Intel/AMD CPUs would be able to achieve full wire speed (~10 Gbps) with the standard MTU of 1500. That’s a bit disappointing.
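A quick back-of-the-envelope calculation shows why the larger MTU helps a core that is limited by per-packet work. The sketch below only counts full-size frames on the wire at 10 Gbit/s; the 38 bytes of per-frame overhead are the standard Ethernet preamble, header, FCS and inter-frame gap. It ignores offloads such as GRO and TSO, so it is an upper bound on the frame rate rather than a model of the actual softirq load.

```python
# Per-frame bytes on the wire beyond the MTU-sized payload:
# preamble + SFD (8) + Ethernet header (14) + FCS (4) + inter-frame gap (12).
WIRE_OVERHEAD_BYTES = 38


def frames_per_second(link_bps, mtu_bytes):
    """Full-size frames per second needed to fill the link at a given MTU."""
    return link_bps / ((mtu_bytes + WIRE_OVERHEAD_BYTES) * 8)


if __name__ == "__main__":
    LINK_BPS = 10e9  # 10 Gbit/s SFP+ link
    for mtu in (1500, 9000):
        rate = frames_per_second(LINK_BPS, mtu)
        print(f"MTU {mtu}: about {rate:,.0f} frames/s at line rate")
```

At MTU 1500 that is roughly 813,000 frames per second against roughly 138,000 at MTU 9000, so the receiving core has almost six times fewer packets to handle for the same bit rate.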
These are down to inefficiencies within the kernel memory mapping subsystem and the network stack. There were optimizations added in recent kernels for this as well as a partial speedup for the DPAA2 driver, however due to the nature of the hardware it can not take advantage of the full speedup. The reuse of buffers relies on a circular ring buffer topology and the dpaa2 can only use SG DMA.
If you run DPDK you will find that the hardware can easily sustain 2x25Gbps at wire speed throughput.