NVMe timeout on warm boot when switching SODIMM

Yes, it looks like all controllers need the same fix. The 64-bit prefetchable address spaces all have the same copy-paste error. Only pcie5 should be using the 0xa4 0x00000000 PCI address space; all the others should be mapped 1:1 to the second address:

0x84 0x00000000
0x8c 0x00000000
0x9c 0x00000000
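
For reference, a 1:1 window in a controller's `ranges` would look something like this (a sketch only - the flags cell and window size here are assumptions, check them against your actual dtb; the point is that for a 1:1 mapping the PCI address cells mirror the CPU address cells):

```dts
/*          <flags      pci-hi pci-lo     cpu-hi cpu-lo     size-hi size-lo>
 * 0xc3000000 = 64-bit prefetchable memory space; the PCI address
 * (cells 2-3) must equal the CPU address (cells 4-5) for 1:1:          */
ranges = <0xc3000000 0x84 0x00000000 0x84 0x00000000 0x04 0x00000000>;
```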

Applied the fixes, but unfortunately it doesn’t seem to change anything with respect to the issue - still getting NVMe timeouts on warm boot + non-ECC memory + loading Image+dtb from the 2nd eMMC partition.

Attaching logs

emmc-2nd-emmc-nvme-fail-pcie-fixes-trimmed-upload.txt (39.4 KB)

Does it work properly if you boot the system with just a single DIMM of the non-ECC memory? I am starting to think that the delay from loading off the 2nd eMMC partition is enough that the NVMe is going into a sleep state and hanging (unfortunately this is a problem even on non-lx2160a systems), or there is some erratum we are missing with the older BSP.

I’m using just a single DIMM of the non-ECC memory - this is how I am reproducing everything.

Maybe U-Boot leaves the SSD in some automatic-suspend mode, and unless the kernel boots fast enough, the SSD drops into a low-power mode?

That is a possibility. On warm reset, if you probe the NVMe before booting Linux, does that make a difference? That is, break into the U-Boot command line, run nvme scan, then boot.
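
i.e. interrupt autoboot and, at the U-Boot prompt (assuming your U-Boot build has the `nvme` command enabled):

```
=> nvme scan
=> boot
```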

Still failing (logs attached).
emmc-emmc-nvme-fail-after-uboot-scan.txt (48.9 KB)

I also can’t think of any logical explanation for how non-ECC memory could be a trigger for this.

Could you put the ECC memory back in and provide me with the cold and warm boot logs, just so I can do a full comparison?

Please find attached.

Please note that with ECC memory, both cold and warm boot work (and the kernel is loaded from the 2nd eMMC partition).

I’ve also attached a trimmed version of the logs, without the kernel timestamps, to make it easier to compare with tools like meld.

ecc-cold-boot.txt (75.7 KB)
ecc-cold-boot-trimmed.txt (62.1 KB)
ecc-warm-boot.txt (75.0 KB)
ecc-warm-boot-trimmed.txt (61.5 KB)
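
(The trimming is just stripping the leading kernel timestamps, e.g. with a small sed helper - a sketch assuming dmesg-style `[ seconds.micros] ` prefixes; adjust the pattern if your console wraps lines differently:)

```shell
#!/bin/sh
# Strip leading kernel timestamps like "[   12.345678] " so cold/warm
# logs line up cleanly in meld or diff.
strip_ts() {
    sed -E 's/^\[ *[0-9]+\.[0-9]+\] //'
}

printf '%s\n' '[   12.345678] nvme nvme0: I/O 1 QID 0 timeout' | strip_ts
# -> nvme nvme0: I/O 1 QID 0 timeout
```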

This is very suspicious. Even with ECC you are getting a DDR training error on cold boot:

NOTICE: BL2: v2.4(release):opx1000-v1.0-rc2-3-g7422fe804-dirty
NOTICE: BL2: Built : 15:51:00, Nov 12 2023
NOTICE: UDIMM J722GU44J2320N7
NOTICE: DDR PMU Hardware version-0x1210
NOTICE: DDR PMU Firmware vision-0x1001 (vA-2019.04)
ERROR: Execution FW failed (error code -5)
ERROR: Calculating DDR PHY registers failed.
PHY handshake timeout, ddr_dsr2 = 0
ERROR: Found training error(s): 0x100
ERROR: Error: Waiting for D_INIT timeout.
ERROR: Writing DDR register(s) failed
ERROR: Programing DDRC error
ERROR: DDR init failed.
NOTICE: Incorrect DRAM0 size is defined in platform_def.h
ÿNOTICE: BL2: v2.4(release):opx1000-v1.0-rc2-3-g7422fe804-dirty
NOTICE: BL2: Built : 15:51:00, Nov 12 2023
NOTICE: UDIMM J722GU44J2320N7
NOTICE: DDR PMU Hardware version-0x1210
NOTICE: DDR PMU Firmware vision-0x1001 (vA-2019.04)
NOTICE: DDR4 UDIMM with 1-rank 64-bit bus (x8)

What RCW changes have you made?

Sorry, please ignore those erroneous lines: I removed the SODIMM and inserted it again, and they were gone (that’s where the bootloader log starts again, after the “ÿ” character). I meant to remove them from the log but forgot. We don’t have those errors.

We haven’t made any changes to the RCW. We didn’t even change U-Boot or the kernel - we are basically using the components from tag lsdk-21.08-sr-1.1 of the lx2160a_build git repository as-is.

But you are choosing a different SerDes protocol, or at least that is what the logs are reporting:

Using SERDES1 Protocol: 14 (0xe)
Using SERDES2 Protocol: 2 (0x2)
Using SERDES3 Protocol: 2 (0x2)

Sorry, you are right - we are doing that change.

Our serdes configuration is:

14 for 100GE.1 and PCIe.2 x4, for the SSD.
2 for PCIe.3 x8, used by FPGA via xdma driver.
2 for PCIe.5 x8, same as SolidRun. Also used by FPGA.

We have been running this way for years with no issues (but we had never tried non-ECC memory before; even this time it was accidental - we plan to keep using ECC memory only).

diff --git a/rcw/lx2160acex7/RCW/template.rcw b/rcw/lx2160acex7/RCW/template.rcw
index d0c86f370..2112f3bba 100644
--- a/rcw/lx2160acex7/RCW/template.rcw
+++ b/rcw/lx2160acex7/RCW/template.rcw
@@ -1,5 +1,5 @@
 #include <configs/lx2160a_defaults.rcwi>
 #include <configs/lx2160a_2000_700_3200.rcwi>
-#include <configs/lx2160a_SD1_8.rcwi>
-#include <configs/lx2160a_SD2_5.rcwi>
+#include <configs/lx2160a_SD1_14.rcwi>
+#include <configs/lx2160a_SD2_2.rcwi>
 #include <configs/lx2160a_SD3_2.rcwi>

Hi, I reproduced the same issue with vanilla SolidRun images.

I took the latest binaries from:
https://images.solid-run.com/LX2k/lx2160a_build/20240104-bae3e6e

Specifically, I used lx2160acex7_2000_700_3200_8_5_2-bae3e6e.img.xz.

Exactly the same issue: cold boot works, but warm boot with non-ECC memory has NVMe timeouts. Interestingly, in this case the filesystem is on eMMC, not the SSD, and we still get NVMe timeouts.

Using a ClearFog board - so everything, software and hardware, is SolidRun.

Attaching logs, and again also the trimmed versions for easy “meld”-ing.

vanilla-cold-boot-trimmed.txt (45.8 KB)
vanilla-warm-reset-no-ecc-trimmed.txt (37.2 KB)
vanilla-warm-reset-no-ecc.txt (46.5 KB)
vanilla-cold-boot.txt (55.8 KB)

It seems the issue doesn’t reproduce using the old lx2160acex7_2000_700_3200_8_5_2-6a1498d.img.xz image (from Dec 2020), so this might be a regression introduced sometime since then.

Attaching logs.
old-vanilla-warm-after-warm-works.txt (174.6 KB)