Data transfer occasionally gets stuck for dozens of seconds at a time when sending large amounts of data. Kernel issue?

Hi,
I’m running into an issue when sending large amounts of data through a ClearFog’s SFP+ ports. The transfer appears to stall for dozens of seconds, and occasionally minutes, at a time. During the stall, the ClearFog’s Send-Q fields for the affected connections, as seen through ss/netstat, are stuck at a large value.

I don’t believe it’s an issue with my user-space programs, as Send-Q is stuck at a high value on the sender end (the ClearFog) and Recv-Q is stuck at 0 on the receiver end (miscellaneous Linux devices).
If the sender’s user-space application somehow stopped calling sendmsg (or, in my case, the higher-level Boost ASIO abstraction), then Send-Q would drain to 0.
And if the receiver’s user-space application stopped calling recvmsg or equivalent, then Recv-Q would grow to a large value rather than staying at 0 as it does now.
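
For anyone who wants to watch the same thing, a one-liner along these lines shows both queues on either end (port 21300 is just one of the ports from the 21n00/21n10 scheme described further down; the filter may need tweaking for older ss versions):

    # Send-Q = bytes sent but not yet ACKed (run on the ClearFog)
    # Recv-Q = bytes received by the kernel but not yet read by the application (run on the receiver)
    watch -n 1 "ss -mito state established '( sport = :21300 or dport = :21300 )'"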

I also don’t believe it’s a hardware issue, as the sender ClearFogs are directly connected to the receiver devices (no switches in between, for example). The receivers are a mix of ARM boards, some using built-in Ethernet and others using Mellanox PCIe NICs. We’ve seen the issue across at least 3 or 4 different ClearFogs, and with different cables.

Also, I was performing data transfers on two of the ClearFog’s SFP+ ports, and both encountered the issue at the same time. So the issue is probably related to the ClearFog itself, rather than the receivers, cables, user-space programs, etc. During the issue I’m SSH’d in via the QSFP+ port and see no problems there, so it’s not as if all networking breaks.

At this stage, I’m beginning to think it’s some kind of kernel issue. The issue described in “Uncovering a 24-year-old bug in the Linux Kernel” (Skroutz Engineering) appears similar, although the timer field of ss is not in the “persist” state.
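
For comparison with that article: a zero receive-window advertisement from the receiver should show up in the captures with a filter along these lines (a sketch; it assumes the truncated packets still contain the full TCP header, whose window field sits at bytes 14-15):

    # List packets advertising a zero receive window (filename is just an example)
    tcpdump -nn -r sender-a.pcap 'tcp[14:2] == 0'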

Here’s a diagram of how the network is set up:
[network diagram attached]

I’ve collected some output from ss -mito, as well as tcpdump:

Since the raw output of ss -mito was messy, I reorganized it into CSV files. Here are the line-number ranges where the issue occurs in each file:
sender-a.csv: L10808-11041 (send_q is stuck at 4132375)
sender-b.csv: L43113-44059 (send_q is stuck at 4133976)
receiver-a.csv: L33890-34292 (recv_q is stuck at 0)
receiver-b.csv: L32413-32902 (recv_q is stuck at 0)

One silly thing about the CSV files’ contents is that I replaced commas with @ to avoid breaking the CSV format, which is why the timer and wscale columns look the way they do.

The .log files are the raw output of ss -mito for the different interfaces. They take a bit more work to understand, as each file contains two data transfers over the one interface; the CSV files should hopefully contain all the useful information. For the data transfers, ports 21n00 and 21n10 are used, where n is 3 or 4 depending on whether the receiver is Linux Device A or Linux Device B respectively. The 21n10 transfer is intentionally more intermittent; it didn’t appear to be underway when the issue occurred, so it can probably be ignored (I left it out of the CSV files). Here are the line numbers where the issue started (see the snippet after this list for one way to pull a single connection’s lines out of the raw logs):
ss-sender-a.log: L54034
ss-sender-b.log: L215557
ss-receiver-a.log: L169655
ss-receiver-b.log: L162279
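
As a rough way to extract one connection’s entries from the raw logs (this assumes ss -mito prints the extended info on a single indented continuation line per socket, so -A1 picks it up; port and filename are just examples):

    # Lines mentioning port 21300, plus the following info line for each match
    grep -n -A1 ':21300 ' ss-sender-a.log | less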

The .pcap files all have the issue occur at approximately frame 1000 (note how the time column jumps ahead by ~44 seconds). The packets are truncated to keep the files from getting too large. Unfortunately, I didn’t manage to capture receiver-b.pcap, so it is missing. I’m not sure whether anything useful is in these .pcaps.
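
If it helps anyone reading the captures, the gap should be easy to locate from per-frame time deltas with something like this (filename is just an example; the fields are standard tshark field names):

    # Frame number, inter-arrival delta (seconds), and Wireshark's retransmission flag;
    # the largest gaps sort to the top
    tshark -r sender-a.pcap -T fields -e frame.number -e frame.time_delta \
        -e tcp.analysis.retransmission | sort -k2 -rn | head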

Is there anything suspicious in any of these logs, particularly the ss CSV files? Are some of my assumptions about Send-Q and Recv-Q incorrect, meaning that user-space application or hardware issues can’t be ruled out? Are there any additional logs I can produce to narrow down the issue?

Thank you

Interesting problem, and this may very well be a kernel issue. I am wondering if your issue is not in the SYN queue itself, but in the accept queue overflowing and, because of the ordering, dropping a packet that then causes the SYN queue to stall. A bit of googling with that thought in mind brought me to this article, which I think may help you debug your issue a bit further: https://www.alibabacloud.com/blog/599203
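
A quick way to check for that on the receiver side would be something like this (a sketch; the listening socket details depend on your application):

    # For listening sockets, Recv-Q is the current accept-queue length and Send-Q its limit
    ss -lnt

    # Kernel counters that increment when the accept (or SYN) queue overflows
    nstat -az TcpExtListenOverflows TcpExtListenDrops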

I’m not sure the issue would be related to the SYN or accept queues, as both the sender and receiver sockets are in the ESTABLISHED state, where Send-Q and Recv-Q instead indicate “sent but not acknowledged” and “received but not read”.

Just had a random thought: could you run your test with a different TCP congestion control algorithm, to see if that fixes the issue?
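
For reference, the system-wide switch is normally just a sysctl (available algorithms depend on the kernel modules present, and the change only applies to newly created connections):

    # See what's available and what's currently in use
    sysctl net.ipv4.tcp_available_congestion_control
    sysctl net.ipv4.tcp_congestion_control

    # Switch to e.g. reno for new connections
    sysctl -w net.ipv4.tcp_congestion_control=reno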

Thanks for the idea. cubic and reno were available, with cubic being the default. I changed the congestion control algorithm (on the receiver devices too, for good measure), but the issue still remained.