Yes, it looks like all controllers need the same fix. The 64-bit prefetchable address spaces all have the same copy-paste error. Only pcie5 should be using the 0xa4 0x00000000 PCIe address space. All others should be mapped 1:1 to the second address.
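To make the 1:1 check concrete, here is a minimal Python sketch that flags controllers whose PCIe address does not equal the second (CPU) address. It is only an illustration: the function names are made up for this post, the tuples keep just the four address cells (flags and size cells are left out), and the pcie3 values are hypothetical, not copied from the real dtsi.

```python
def is_identity_mapped(pci_hi, pci_lo, cpu_hi, cpu_lo):
    """True when the 64-bit PCIe address equals the 64-bit CPU address."""
    pci_addr = (pci_hi << 32) | pci_lo
    cpu_addr = (cpu_hi << 32) | cpu_lo
    return pci_addr == cpu_addr


def find_unmapped(ranges):
    """Return the controller names whose entry is not mapped 1:1."""
    return [name for name, cells in ranges.items()
            if not is_identity_mapped(*cells)]


# Hypothetical entries: (pci_hi, pci_lo, cpu_hi, cpu_lo).
# Assumption: pcie3 shows the copy-pasted 0xa4 PCIe address against a
# different CPU address; pcie5 legitimately uses 0xa4 0x00000000.
example_ranges = {
    "pcie3": (0xa4, 0x00000000, 0x94, 0x00000000),
    "pcie5": (0xa4, 0x00000000, 0xa4, 0x00000000),
}

suspects = find_unmapped(example_ranges)
print(suspects)
```

With these made-up values only pcie3 is reported, since its PCIe address differs from its CPU address.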
Applied the fixes, but regrettably, it doesn't seem to change anything wrt the issue - still getting nvme timeouts on warm boot + non-ECC memory + when loading Image+dtb from 2nd eMMC partition.
Does it work properly if you boot the system with just a single DIMM of the non-ECC memory? I am starting to think that the delay from loading off the 2nd eMMC partition is long enough that the NVMe is going into a sleep state and hanging (unfortunately this is a problem even on non-lx2160a systems), or there is some erratum we are missing with the older BSP.
That is a possibility. On warm reset, if you probe the nvme before booting linux, does that make a difference? So break into the u-boot command line, run nvme scan, then boot.
Sorry, please ignore those erroneous lines: I removed the SODIMM and inserted it again and they were gone (that's where you have the bootloader logs starting again after the 'ÿ' character). I meant to remove them from the log but forgot. We don't have those errors.
We haven't done any changes to rcw. We didn't even change u-boot or kernel - we are basically using components from tag lsdk-21.08-sr-1.1 of the lx2160a_build git as-is.
14 for 100GE.1 and PCIe.2 x4, for the SSD.
2 for PCIe.3 x8, used by FPGA via xdma driver.
2 for PCIe.5 x8, same as SolidRun. Also used by FPGA.
We have been running this way for years with no issues (but we never tried non-ECC memory before. Even this time was accidental; we do plan to keep using ECC memory only).
Specifically, I used lx2160acex7_2000_700_3200_8_5_2-bae3e6e.img.xz.
Exactly the same issue: cold boot works, but warm boot with non-ECC has nvme timeouts. Interestingly, in this case, the filesystem is in eMMC, not SSD, and still we get nvme timeouts.
Using a ClearFog board - so everything sw and hw is SolidRun.
Attaching logs, and again, also the trimmed version for easy 'meld'-ing.
It seems the issue doesn't reproduce using the old lx2160acex7_2000_700_3200_8_5_2-6a1498d.img.xz image (from Dec 2020), so this might be a regression introduced sometime recently.