NVMe timeout on warm boot when switching SODIMM

Yes, it looks like all controllers need the same fix. The 64-bit prefetchable address spaces all have the same copy-paste error. Only pcie5 should be using the 0xa4 0x00000000 PCI address space; all the others should be mapped 1:1 to the second address:

0x84 0x00000000
0x8c 0x00000000
0x9c 0x00000000
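
For reference, a 1:1 window in a controller's `ranges` would look something like this (a sketch only - the flags cell and window size here are assumptions, check them against your actual dtb; the point is that for a 1:1 mapping the PCI address cells mirror the CPU address cells):

```dts
/*          <flags      pci-hi pci-lo     cpu-hi cpu-lo     size-hi size-lo>
 * 0xc3000000 = 64-bit prefetchable memory space; the PCI address
 * (cells 2-3) must equal the CPU address (cells 4-5) for 1:1:          */
ranges = <0xc3000000 0x84 0x00000000 0x84 0x00000000 0x04 0x00000000>;
```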

Applied the fixes, but unfortunately it doesn’t seem to change anything with respect to the issue - still getting NVMe timeouts on warm boot + non-ECC memory + loading Image+dtb from the 2nd eMMC partition.

Attaching logs

emmc-2nd-emmc-nvme-fail-pcie-fixes-trimmed-upload.txt (39.4 KB)

Does it work properly if you boot the system with just a single DIMM of the non-ECC memory? I am starting to think that the delay from loading off the 2nd eMMC partition is enough that the NVMe is going into a sleep state and hanging (unfortunately this is a problem even on non-lx2160a systems), or there is some erratum we are missing with the older BSP.

I’m using just a single DIMM of the non-ECC memory - this is how I am reproducing everything.

Maybe U-Boot leaves the SSD in some automatic-suspend mode, and unless the kernel boots fast enough, the SSD drops into a low-power mode?

That is a possibility. On warm reset, if you probe the NVMe before booting Linux, does that make a difference? That is, break into the U-Boot command line, run nvme scan, then boot.
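
i.e. interrupt autoboot and, at the U-Boot prompt (assuming your U-Boot build has the `nvme` command enabled):

```
=> nvme scan
=> boot
```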

Still failing (logs attached).
emmc-emmc-nvme-fail-after-uboot-scan.txt (48.9 KB)

I also can’t think of any logical explanation for how non-ECC memory could be a trigger for this.

Could you put the ECC memory back in and provide me with the cold and warm boot logs, just so I can do a full comparison?

Please find attached.

Please note that with ECC memory, both cold and warm boot work (and the kernel is loaded from the 2nd eMMC partition).

I’ve also attached a trimmed version of the logs, without the kernel timestamps, to make it easier to compare with tools like meld.

ecc-cold-boot.txt (75.7 KB)
ecc-cold-boot-trimmed.txt (62.1 KB)
ecc-warm-boot.txt (75.0 KB)
ecc-warm-boot-trimmed.txt (61.5 KB)
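
(The trimming is just stripping the leading kernel timestamps, e.g. with a small sed helper - a sketch assuming dmesg-style `[ seconds.micros] ` prefixes; adjust the pattern if your console wraps lines differently:)

```shell
#!/bin/sh
# Strip leading kernel timestamps like "[   12.345678] " so cold/warm
# logs line up cleanly in meld or diff.
strip_ts() {
    sed -E 's/^\[ *[0-9]+\.[0-9]+\] //'
}

printf '%s\n' '[   12.345678] nvme nvme0: I/O 1 QID 0 timeout' | strip_ts
# -> nvme nvme0: I/O 1 QID 0 timeout
```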

This is very suspicious. Even with ECC you are getting a DDR training error on cold boot:

NOTICE: BL2: v2.4(release):opx1000-v1.0-rc2-3-g7422fe804-dirty
NOTICE: BL2: Built : 15:51:00, Nov 12 2023
NOTICE: UDIMM J722GU44J2320N7
NOTICE: DDR PMU Hardware version-0x1210
NOTICE: DDR PMU Firmware vision-0x1001 (vA-2019.04)
ERROR: Execution FW failed (error code -5)
ERROR: Calculating DDR PHY registers failed.
PHY handshake timeout, ddr_dsr2 = 0
ERROR: Found training error(s): 0x100
ERROR: Error: Waiting for D_INIT timeout.
ERROR: Writing DDR register(s) failed
ERROR: Programing DDRC error
ERROR: DDR init failed.
NOTICE: Incorrect DRAM0 size is defined in platform_def.h
ÿNOTICE: BL2: v2.4(release):opx1000-v1.0-rc2-3-g7422fe804-dirty
NOTICE: BL2: Built : 15:51:00, Nov 12 2023
NOTICE: UDIMM J722GU44J2320N7
NOTICE: DDR PMU Hardware version-0x1210
NOTICE: DDR PMU Firmware vision-0x1001 (vA-2019.04)
NOTICE: DDR4 UDIMM with 1-rank 64-bit bus (x8)

What RCW changes have you made?

Sorry, please ignore those erroneous lines: I removed the SODIMM and inserted it again, and they were gone (that’s where the bootloader log starts again, after the “ÿ” character). I meant to remove them from the log but forgot. We don’t have those errors.

We haven’t made any changes to the RCW. We didn’t even change U-Boot or the kernel - we are basically using the components from tag lsdk-21.08-sr-1.1 of the lx2160a_build git repository as-is.

But you are choosing a different SerDes protocol, or at least that is what the logs are reporting:

Using SERDES1 Protocol: 14 (0xe)
Using SERDES2 Protocol: 2 (0x2)
Using SERDES3 Protocol: 2 (0x2)

Sorry, you are right - we are doing that change.

Our serdes configuration is:

14 for 100GE.1 and PCIe.2 x4, for the SSD.
2 for PCIe.3 x8, used by FPGA via xdma driver.
2 for PCIe.5 x8, same as SolidRun. Also used by FPGA.

We have been running this way for years with no issues (but we had never tried non-ECC memory before; even this time it was accidental - we plan to keep using ECC memory only).

diff --git a/rcw/lx2160acex7/RCW/template.rcw b/rcw/lx2160acex7/RCW/template.rcw
index d0c86f370..2112f3bba 100644
--- a/rcw/lx2160acex7/RCW/template.rcw
+++ b/rcw/lx2160acex7/RCW/template.rcw
@@ -1,5 +1,5 @@
 #include <configs/lx2160a_defaults.rcwi>
 #include <configs/lx2160a_2000_700_3200.rcwi>
-#include <configs/lx2160a_SD1_8.rcwi>
-#include <configs/lx2160a_SD2_5.rcwi>
+#include <configs/lx2160a_SD1_14.rcwi>
+#include <configs/lx2160a_SD2_2.rcwi>
 #include <configs/lx2160a_SD3_2.rcwi>

Hi, I reproduced the same issue with vanilla SolidRun images.

I took the latest binaries from:
https://images.solid-run.com/LX2k/lx2160a_build/20240104-bae3e6e

Specifically, I used lx2160acex7_2000_700_3200_8_5_2-bae3e6e.img.xz.

Exactly the same issue: cold boot works, but warm boot with non-ECC memory has NVMe timeouts. Interestingly, in this case the filesystem is on eMMC, not the SSD, and we still get NVMe timeouts.

Using a ClearFog board - so everything, software and hardware, is SolidRun.

Attaching logs, and again also the trimmed versions for easy “meld”-ing.

vanilla-cold-boot-trimmed.txt (45.8 KB)
vanilla-warm-reset-no-ecc-trimmed.txt (37.2 KB)
vanilla-warm-reset-no-ecc.txt (46.5 KB)
vanilla-cold-boot.txt (55.8 KB)

It seems the issue doesn’t reproduce using the old lx2160acex7_2000_700_3200_8_5_2-6a1498d.img.xz image (from Dec 2020), so this might be a regression introduced sometime since then.

Attaching logs.
old-vanilla-warm-after-warm-works.txt (174.6 KB)