Nvme timeout on warm boot when switching sodimm

we switched from:

NOTICE: UDIMM J722GU44J2320N7
NOTICE: DDR4 UDIMM with 1-rank 64-bit bus (x8)
NOTICE: 16 GB DDR4, 64-bit, CL=22, ECC on
DRAM: 15.9 GiB
DDR 15.9 GiB (DDR4, 64-bit, CL=22, ECC on)
[ 0.000000] Memory: 12080748K/14612480K available (23104K kernel code, 4448K rwdata, 11656K rodata, 8512K init, 1056K bss, 2498964K reserved, 32768K cma-reserved)

To:

NOTICE: UDIMM CT32G4SFD832A.C16F
NOTICE: DDR4 UDIMM with 2-rank 64-bit bus (x8)
NOTICE: 32 GB DDR4, 64-bit, CL=22, ECC off, CS0+CS1
DRAM: 31.9 GiB
DDR 31.9 GiB (DDR4, 64-bit, CL=22, ECC off)
DDR Chip-Select Interleaving Mode: CS0+CS1
[ 0.000000] Memory: 28538404K/31389696K available (23104K kernel code, 4448K rwdata, 11656K rodata, 8512K init, 1056K bss, 2818524K reserved, 32768K cma-reserved)

When we power up the board, it all seems to work.

But when we do a warm boot (just using the reboot command) we get nvme timeouts on the end of the linux boot:

[ 5.971617] ALSA device list:
[ 5.974582] No soundcards found.
[ 64.743091] nvme nvme0: I/O 20 QID 0 timeout, completion polled
[ 126.183097] nvme nvme0: I/O 24 QID 0 timeout, completion polled
[ 126.189135] nvme nvme0: Shutdown timeout set to 8 seconds
[ 167.527139] random: crng init done
[ 187.623101] nvme nvme0: I/O 21 QID 0 timeout, completion polled
[ 249.063092] nvme nvme0: I/O 25 QID 0 timeout, completion polled
[ 249.081843] nvme nvme0: allocated 64 MiB host memory buffer.
[ 310.503086] nvme nvme0: I/O 22 QID 0 timeout, completion polled
[ 371.943103] nvme nvme0: I/O 26 QID 0 timeout, completion polled

Any idea how to debug this? what/where to look?

We switch the SODIMM on a working setup, not changing anything else, and we get this error. We are using a standard ClearFrog eval board with the CEX7 som.

This is not an error I have experienced but I have not tested any NVMe devices that are using a host memory buffer. This basically means that the NVMe does not have any ddr on the device for caching and instead uses system memory.

Because the nvme is using system memory this leads me to the next point. I see that you have switched from ECC memory which was working to non-ECC memory. It is very possible that the NVMe transactions are causing memory corruption of some sort that the ECC memory is correcting but is failing with the non-ECC memory. This happening after a warm reboot could very much be caused by something as simple as the memory temperature causing slightly different memory calibrations on a reboot. NXP’s BSP also does have a fastboot option for storing and recovering training data from the previous boot…I will need to double check if that is enabled in our builds.

What version of the BSP are you using?

1 Like

We’re using LSDK-21.08 (we’re based on tag lsdk-21.08-sr-1.1 of the lx2160a_build git)

Small update:

We have tried the Pro version of the Samsung SSD980 we were using, and the problem remained.

Then we moved to Micron SO-DIMMs with ECC and the problem went away.

Thanks for you help!

I am glad we have narrowed down the issue. While I am glad you have a workaround for your issues I definitely feel there is an issue that needs to be looked at more in depth. Are you willing to test if I can provide a patched firmware to test with?

yes, sure. we’re also not sure why would ECC affect nvme here. Moreover, we have also tried using the PRO version of our Samsung SSD980, which has an internal cache, and we still had those nvme timeouts when using the no ECC SO-DIMMs.

Note: those nvme timeouts happen on warm boots only - either if we use the ClearFrog reset button, or by using the Linux ‘reboot’ command.

Thanks for the additional information. I will take that into consideration while reviewing the code.

Is there a way for us to debug those nvme timeout? or is it firmware?

Not easily unfortunately. This is not just an Aarch64 or HoneyComb issue either. Some of it is vendor firmware on the NVMe, some of it is NVMe vendors only validating their firmware on Windows and the Linux driver stack is causing issues. The amount of per NVMe quirks in the Linux driver stack is increasing with every release, some having to do with firmware issues and APST, or TRIM, others just random idiosyncrasies.

With your specific case, since it occurred after a memory change and only on warm reboots I just want to make sure NXPs DDR initialization isn’t exposing a bug that is being masked by ECC.

Thank you!

I’d like also to point out that the issue is deterministic and not stochastic: using the no-ECC memory, we always get nvme timeout on warm boots. so it’s probably not because of data integrity problems - this is probably because of some logic somewhere.

More on this:
warm boot nvme timeouts only happen when booting from eMMC (/boot/Image on a partition in /dev/mmcblk1, and DIP pin 4 set to off.

Same kernel Image booting from sdcard (/boot/Image on a partition in /dev/mmcblk0, and dip pin 4 set to on): no nvme timeouts.

Seems that u-boot can successfully access the nvme and it’s only the kernel that reaches those nvme timeouts.

We let u-boot boot from eMMC, and find /extlinux and /boot only in an nvme partition. it actually worked. It loaded the kernel and the dtb and booted them. the kernel then had those familiar nvme timeouts (the scenario still only happen with the non-ECC memory + after warm boot).

Here’s u-boot successfully loading the kernel+dtb from the SSD:

Device 0: Vendor: 0x144d Rev: 5B2QGXA7 Prod: S5GYNX0W811155N
Type: Hard Disk
Capacity: 476940.0 MB = 465.7 GB (976773168 x 512)
… is now current device
Scanning nvme 0:1…
Re-Distributor 0 LPI is already enabled
Releasing fan controller full speed gpio
BootOrder not defined
EFI boot manager: Cannot load any image
Scanning nvme 0:2…
Found /extlinux/extlinux.conf
Retrieving file: /extlinux/extlinux.conf
354 bytes read in 1 ms (345.7 KiB/s)
1: primary kernel
Retrieving file: /boot/Image
49054208 bytes read in 30 ms (1.5 GiB/s)
append: console=ttyAMA0,115200 earlycon=pl011,mmio32,0x21c0000 default_hugepagesz=1024m hugepagesz=1024m hugepages=2 pci=pcie_bus_perf iommu.passthrough=1 root=/dev/nvme0n1p2 rw rootwait
Retrieving file: /boot/fsl-lx2160a-clearfog-cx.dtb
45027 bytes read in 1 ms (42.9 MiB/s)
Moving Image from 0x81100000 to 0x81200000, end=841e0000
Flattened Device Tree blob at 81000000
Booting using the fdt blob at 0x81000000
Loading Device Tree to 000000009fff2000, end 000000009fffffe2 … OK
Re-Distributor 0 LPI is already enabled
Releasing fan controller full speed gpio
fsl-mc: Deploying data path layout … SUCCESS

Starting kernel …

And then the kernel seems to probe the nvme device correctly:

[ 3.181977] nvme 0001:01:00.0: Adding to iommu group 1
[ 3.187238] nvme nvme0: pci function 0001:01:00.0

And then it hangs, after nvme_identify_ctrl() is doing nvme_submit_sync_cmd(). Specifically, the blk_execute_rq() hangs for 60 seconds (which the configured timeout) and we get:

[ 64.746901] nvme nvme0: I/O 12 QID 0 timeout, completion polled

could you please post a full dmesg output on the failure? Comparing u-boot to Linux is not really valid as the u-boot subsystem for accessing pcie and nvme is a lot more simple than Linux. It is also very single threaded so there are no other competing subsystems.

I’m trying to upload the entire boot log, but I get the following error:

“Multiple errors occurred: 1) Body is limited to 32000 characters; you entered 50391. 2) Sorry, you can only mention 10 users in a post.”

Any way to bypass these restrictions?

You can save it as a .txt file and then upload it as an attachment.

Ok, thanks, please find attached two logs - one bad one good.

I have managed to narrow this to the following difference:

Both scenarios are warm boot and no-ECC (otherwise no issue at all).
U-boot is loaded from eMMC (DIP switch 4 is off).
kernel Image and DTB loaded from eMMC too.
root filesystem is on nvme.

Only difference:
Good case: /boot and /extlinux are placed on /dev/mmcblk1p1.
Bad case: /boot and /extlinux are placed on /dev/mmcblk1p2.

In the good case, u-boot jumps over empty mmc0, reads mmc1, immediately finds /boot and /extlinux in the first partition, and all seem good.

In the bad case, u-boot jumps over empty mmc0, reads mmc1, but only finds /boot and /extlinux in the 2nd partition, and then we have nvme timeout in the kernel. This doesn’t happen if we’re doing cold boot or if we’re using ECC memory.

Can this be some timing thing? that one more second it takes to find the kernel in the bad case somehow breaking things in the warm boot + no-ECC case.

Btw, I’ve added some prints to the kernel, so we understand exactly where it hangs. that pr_info(“__nvme_submit_sync_cmd2\n”) is placed right before blk_execute_rq(req->q, NULL, req, at_head) in the function __nvme_submit_sync_cmd. It blocks for 60 seconds in the bad case, and then there are additional IO attempts, which also block at the same place.

emmc-2nd-emmc-nvme-fail-trimmed-upload.txt (41.8 KB)
emmc-1st-emmc-nvme-good-trimmed-upload.txt (61.3 KB)

It appears that you are using a custom device-tree and enabling another PCIe controller. This controller is being configured incorrectly.


 layerscape-pcie 3500000.pcie: host bridge /soc/pcie@3500000 ranges:
 layerscape-pcie 3500000.pcie:      MEM 0x8c00000000..0x8fffffffff -> 0xa400000000
 layerscape-pcie 3500000.pcie:      MEM 0x8840000000..0x88ffffffff -> 0x0040000000
 layerscape-pcie 3500000.pcie:       IO 0x8810000000..0x881000ffff -> 0x0000000000
 layerscape-pcie 3500000.pcie: PCI host bridge to bus 0000:00

That should be a one to one mapping in the device-tree with both bar spaces starting at 0x8c00000000

We only changed fsl-lx2160a.dtsi once following our previous discussion here:

You helped us there do this single fix:

diff --git a/linux/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi b/linux/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi
index c83ff74ec..66d01f086 100644
--- a/linux/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi
+++ b/linux/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi
@@ -1176,7 +1176,7 @@ pcie3: pcie@3600000 {
                        apio-wins = <256>;
                        ppio-wins = <24>;
                        bus-range = <0x0 0xff>;
-                       ranges = <0x43000000 0xa4 0x00000000 0x94 0x00000000 0x04 0x00000000>, /* 64-Bit Window - prefetchable */
+                       ranges = <0x43000000 0x94 0x00000000 0x94 0x00000000 0x04 0x00000000>, /* 64-Bit Window - prefetchable */
                                         <0x02000000 0x00 0x40000000 0x90 0x40000000 0x00 0xc0000000>, /* 32-Bit Window - non-prefetchable */
                                         <0x01000000 0x00 0x00000000 0x90 0x10000000 0x00 0x00010000>; /* 16-Bit IO Window */
                        msi-parent = <&its>;

Do we need to do something similar in the other controllers?

attaching our .dtsi

Thanks!

fsl-lx2160a.dtsi.txt (47.5 KB)