Hi,
I’m running into an issue when sending large amounts of data through a ClearFog’s SFP+ ports. The transfer appears to be stuck for dozens of seconds, and occasionally minutes, at a time. During this time, the Send-Q field for the affected connections on the ClearFog, as seen through ss/netstat, is stuck at a large number.
I don’t believe it’s an issue with my user-space programs, as Send-Q is stuck at a high number on the sender end (ClearFog), and Recv-Q is stuck at 0 on the receiver end (Misc Linux devices).
If the user-space application somehow didn’t call sendmsg (or, in my case, the higher-level Boost ASIO abstraction) on the sender, then Send-Q would be 0.
And if the receiver’s user-space application didn’t call recvmsg or equivalent, then Recv-Q would be a large number rather than 0, as it is currently.
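To make that reasoning explicit, here is a minimal Python sketch of how I’m interpreting the two counters. The function name and the returned strings are my own, and it only makes sense for values that stay constant for the whole stall; it also assumes the usual meaning of the fields for established sockets (Send-Q is data not yet acknowledged by the peer, Recv-Q is data not yet read by the application).

```python
def locate_stall(sender_send_q, receiver_recv_q):
    """Guess which side a stalled TCP transfer is waiting on.

    sender_send_q:   Send-Q of the connection on the sending host
    receiver_recv_q: Recv-Q of the same connection on the receiving host
    Both values are assumed to have been constant for the whole stall.
    """
    if receiver_recv_q > 0:
        # Data has arrived but the application has not picked it up.
        return "receiver application is not reading"
    if sender_send_q == 0:
        # Nothing is waiting in the sender's socket buffer.
        return "sender application is not writing"
    # Data is waiting on the sender, yet nothing is pending on the receiver.
    return "data is queued in the sender's kernel but is not arriving at the receiver"


# Values from one of my stalls: Send-Q 4132375 on the ClearFog, Recv-Q 0 on the receiver.
print(locate_stall(4132375, 0))
```

With the numbers I’m seeing, this lands in the last case, which is why I suspect something below the applications.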
I also don’t believe it’s a hardware issue, as the sender ClearFogs are directly connected to the receiver devices (No switches in the way, for example). Some receivers are different ARM boards, some using built-in ethernet, and others using Mellanox PCIe NICs. We’ve seen the issue across at least 3 or 4 different ClearFogs. Different cables are also used.
Also, I was performing data transfers on 2 of the ClearFog’s SFP+ ports, and both encountered the issue at the same time. So the issue is probably related to the ClearFog, rather than the receivers, cables, user-space programs, etc. During the issue I’m SSH’d in via the QSFP+ port and do not encounter any problems there, so it’s not as if all networking breaks.
At this stage, I’m beginning to think it’s some kind of kernel issue. The issue described in the Skroutz Engineering post “Uncovering a 24-year-old bug in the Linux Kernel” appears similar, although in my case the timer field of ss is not in the “persist” state.
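For anyone who wants to check the timer state the same way, this is roughly how I’m reading it. It is a small Python sketch that assumes the timer:(name,expires,retrans) layout that ss -o prints; the function names are my own and the sample strings are made up for illustration, not copied from my logs.

```python
import re

# Assumed layout of the field printed by `ss -o`: timer:(<name>,<expires>,<retrans>)
TIMER_RE = re.compile(
    r"timer:\((?P<name>[^,)]+),(?P<expires>[^,)]+),(?P<retrans>\d+)\)"
)


def parse_timer(ss_line):
    """Return (name, expires, retrans) from one ss line, or None if no timer is shown."""
    match = TIMER_RE.search(ss_line)
    if match is None:
        return None
    return match.group("name"), match.group("expires"), int(match.group("retrans"))


def is_zero_window_probe(ss_line):
    """True when the socket's timer is the persist (zero-window probe) timer."""
    timer = parse_timer(ss_line)
    return timer is not None and timer[0] == "persist"


# Made-up sample fields, for illustration only.
print(parse_timer("timer:(on,1.6sec,5)"))
print(is_zero_window_probe("timer:(persist,204ms,0)"))
```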
Here’s a diagram of how the network is set up:
I’ve collected some output from ss -mito, as well as tcpdump:
Since the raw output of ss -mito was messy, I reorganized it into CSV files. Here’s the range of line numbers where the issue occurs in each file:
sender-a.csv: L10808-11041 (send_q is stuck at 4132375)
sender-b.csv: L43113-44059 (send_q is stuck at 4133976)
receiver-a.csv: L33890-34292 (recv_q is stuck at 0)
receiver-b.csv: L32413-32902 (recv_q is stuck at 0)
One silly thing about the contents of the CSV files is that I replaced commas with @ to avoid breaking the CSV format, which is why the contents of the timer and wscale columns look the way they do.
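In case it helps with reading the CSV files, here is a rough Python sketch of how the stuck ranges can be located and how the @ substitution can be undone. The send_q and timer column names match my files, but the time column, the sample rows, and the function names are made up for illustration.

```python
import csv
import io


def restore_commas(value):
    """Undo the comma -> @ substitution used in the timer and wscale columns."""
    return value.replace("@", ",")


def find_stall(csv_text, column="send_q"):
    """Return (first_line, last_line, value) of the longest run of one non-zero value.

    Line numbers are 1-based file lines, so the header is line 1.
    Returns None if no value repeats on consecutive rows.
    """
    best = None
    run_start, run_value = None, None
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):
        value = int(row[column])
        if value != 0 and value == run_value:
            # Same non-zero value as the previous row: the run grows.
            if best is None or line_no - run_start > best[1] - best[0]:
                best = (run_start, line_no, value)
        else:
            # Value changed (or is zero): start a new candidate run here.
            run_start, run_value = line_no, value
    return best


# Made-up sample with the same layout idea as my CSV files.
sample = (
    "time,send_q,timer\n"
    "0.0,100,on@200ms@0\n"
    "0.1,4132375,on@1.5sec@3\n"
    "0.2,4132375,on@3sec@4\n"
    "0.3,4132375,on@6sec@5\n"
    "0.4,0,on@200ms@0\n"
)
print(find_stall(sample))
print(restore_commas("on@1.5sec@3"))
```

For the receiver files the same idea would need adapting, since recv_q is stuck at 0 there and this sketch deliberately skips zero values.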
The .log files are the raw output of ss -mito for different interfaces. They take a bit more work to understand, as each file contains two data transfers down the one interface. The CSV files should hopefully contain all the useful information. Anyway, for the data transfers, ports 21n00 and 21n10 are used, where n is 3 if the receiver is Linux Device A and 4 if it is Linux Device B. The 21n10 transfer is intentionally more intermittent. It didn’t appear to be underway when the issue occurred, so it can probably be ignored (I left it out of the CSV files). Here are the line numbers where the issue started:
ss-sender-a.log: L54034
ss-sender-b.log: L215557
ss-receiver-a.log: L169655
ss-receiver-b.log: L162279
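To spell out the 21n00/21n10 port scheme mentioned above, here is a tiny Python sketch. The helper names are hypothetical, and the filter string is just the standard tcpdump/BPF "tcp port" expression, which may help when isolating the main transfer in the captures.

```python
# Digit n in the port number, per receiver (from the scheme described above).
RECEIVER_DIGIT = {"A": 3, "B": 4}


def transfer_ports(receiver):
    """Return (main_port, intermittent_port) for Linux Device 'A' or 'B'."""
    n = RECEIVER_DIGIT[receiver]
    main_port = 21000 + n * 100          # 21n00
    intermittent_port = main_port + 10   # 21n10
    return main_port, intermittent_port


def capture_filter(receiver):
    """Build a tcpdump filter expression that keeps only the main transfer."""
    main_port, _ = transfer_ports(receiver)
    return f"tcp port {main_port}"


print(transfer_ports("A"))
print(capture_filter("B"))
```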
The .pcap files all have the issue occur at approximately frame 1000 (note how the time column jumps ahead by ~44 seconds). The packets are truncated to prevent the files from being too large. Unfortunately, I didn’t manage to capture receiver-b.pcap, so it is missing. I’m not sure whether anything useful is in these .pcaps.
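If it is easier than scrolling to the jump, the gap can also be located with a short script. This is a sketch in Python using only the standard library; it assumes the captures are in the classic pcap format (not pcapng), and the function names are my own.

```python
import struct

# Magic bytes as they appear on disk -> (struct byte order, timestamp fraction unit).
PCAP_MAGICS = {
    b"\xd4\xc3\xb2\xa1": ("<", 1e-6),  # little-endian, microseconds
    b"\xa1\xb2\xc3\xd4": (">", 1e-6),  # big-endian, microseconds
    b"\x4d\x3c\xb2\xa1": ("<", 1e-9),  # little-endian, nanoseconds
    b"\xa1\xb2\x3c\x4d": (">", 1e-9),  # big-endian, nanoseconds
}


def read_timestamps(data):
    """Return the capture timestamp (in seconds) of every frame in a classic pcap."""
    try:
        endian, unit = PCAP_MAGICS[data[:4]]
    except KeyError:
        raise ValueError("not a classic pcap file (pcapng is not handled here)")
    offset = 24  # size of the global file header
    stamps = []
    while offset + 16 <= len(data):
        # Per-record header: ts_sec, ts_frac, captured length, original length.
        sec, frac, incl_len, _orig_len = struct.unpack_from(endian + "IIII", data, offset)
        stamps.append(sec + frac * unit)
        offset += 16 + incl_len  # skip the (possibly truncated) packet bytes
    return stamps


def largest_gap(stamps):
    """Return (frame_number, gap_seconds) for the frame that follows the longest pause.

    Frame numbers are 1-based, as in Wireshark.
    """
    best_frame, best_gap = None, 0.0
    for i in range(1, len(stamps)):
        gap = stamps[i] - stamps[i - 1]
        if gap > best_gap:
            best_frame, best_gap = i + 1, gap
    return best_frame, best_gap
```

Usage would be something like largest_gap(read_timestamps(open("sender-a.pcap", "rb").read())), where the file name is a placeholder for whichever capture is being checked.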
Is there anything suspicious in any of these logs, particularly the ss CSV files? Are some of my assumptions about Send-Q and Recv-Q incorrect, meaning that user-space application or hardware issues can’t be ruled out? Are there any additional logs I can produce to figure out the issue?
Thank you