I must be missing something, but I don’t see an LX2160A-based board in SolidRun’s u-boot git repository. Are there instructions somewhere on how to build u-boot for Honeycomb?
Background: We have noticed hangups in specific scenarios involving load-exclusive/store-exclusive instructions. These appear similar to issues observed on a different A72-based board, for which the solution is to set bit 31 in CPUACTLR_EL1 (see Documentation – Arm Developer).
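For reference, here is a minimal C sketch of the register update described above. It assumes the Cortex-A72 encoding of CPUACTLR_EL1 (`S3_1_C15_C2_0`) and that the code runs at an exception level allowed to write the register (typically firmware at EL3). The helper names are hypothetical and not taken from any existing u-boot or firmware tree.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 31 of CPUACTLR_EL1, the bit discussed in this thread. */
#define CPUACTLR_EL1_BIT31 (UINT64_C(1) << 31)

static_assert(CPUACTLR_EL1_BIT31 == UINT64_C(0x80000000),
              "bit 31 mask must be 0x80000000");

/* Pure helper (hypothetical name): returns the register value with bit 31 set. */
static inline uint64_t cpuactlr_set_bit31(uint64_t value)
{
    return value | CPUACTLR_EL1_BIT31;
}

#if defined(__aarch64__)
/*
 * Read-modify-write of CPUACTLR_EL1 (hypothetical name).
 * Assumption: the caller runs at a privilege level that is permitted to
 * access this implementation-defined register; otherwise the access traps.
 */
static inline void cpuactlr_enable_bit31(void)
{
    uint64_t value;

    __asm__ volatile("mrs %0, S3_1_C15_C2_0" : "=r"(value));
    value = cpuactlr_set_bit31(value);
    __asm__ volatile("msr S3_1_C15_C2_0, %0\n\tisb" : : "r"(value) : "memory");
}
#endif
```

The update has to be done on every core, since the register is per-core.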
Looks like I was misled by the presence of a u-boot tree in SolidRun’s GitHub repository. As far as I can tell, the way to build u-boot for Honeycomb is to clone SolidRun/lx2160a_build and build a full image, which seems a bit heavy.
I finally figured out how to change u-boot and update it on the SD card. Unfortunately, enabling snoop-delayed exclusive handling prevents u-boot from starting. I added the code in the same place as the other errata workarounds that update this register. Not sure what is going on.
I ran a stress test for 24 hours on two Honeycomb boards (yes, we pretty much bought all of the Canadian stock…). The board that has bit 31 of CPUACTLR_EL1 cleared (the default) froze multiple times and had to be rebooted after each freeze. The board that has bit 31 set with the modified firmware is alive and kicking.
I think this is an issue with A72-based SoCs that should probably be reported to NXP and ARM. TI is already aware, as we have seen the same problem on one of their boards.
The test case won’t help you much unless you have access to QNX 7.1. The hang itself is in the kernel, and requires running in EL1.
What we suspect is happening is that a core fails to acquire a spin lock, getting stuck in WFE forever:
1. Core A reads the value of a spin lock with ldaxr and observes that it is busy
2. Core A issues wfe
3. Core B releases the spin lock by writing 0 using stlr
Step 3 should emit an event, per section B2.9.2 of the ARMv8 manual, as the global monitor transitions to an open state, but that doesn’t seem to happen in some cases. Replacing wfe with yield avoids the hang, as does adding an explicit sev after stlr for releasing the lock.
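To make the two workarounds concrete, here is a rough C11 sketch of a spin lock that applies both of them. It is not the QNX kernel code: the type and function names are hypothetical, the waiter polls with `yield` instead of sleeping in `wfe`, and the release path follows the store with an explicit `sev`. The `dsb ishst` before `sev` is my own assumption, added so that the store is visible before the event is sent. On non-AArch64 targets the hints compile to nothing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: 0 = free, 1 = held. */
typedef struct {
    atomic_uint locked;
} spinlock_t;

/* Workaround 1: spin with yield rather than waiting in wfe. */
static inline void cpu_relax(void)
{
#if defined(__aarch64__)
    __asm__ volatile("yield" ::: "memory");
#endif
}

/* Workaround 2: send an explicit event after releasing the lock. */
static inline void wake_waiters(void)
{
#if defined(__aarch64__)
    /* Assumption: dsb makes the releasing store visible before sev. */
    __asm__ volatile("dsb ishst\n\tsev" ::: "memory");
#endif
}

static inline bool spin_trylock(spinlock_t *lock)
{
    unsigned expected = 0;

    return atomic_compare_exchange_strong_explicit(
        &lock->locked, &expected, 1u,
        memory_order_acquire, memory_order_relaxed);
}

static inline void spin_lock(spinlock_t *lock)
{
    while (!spin_trylock(lock)) {
        /* Poll until the lock looks free, then retry the acquire. */
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed) != 0)
            cpu_relax();
    }
}

static inline void spin_unlock(spinlock_t *lock)
{
    /* Releasing a lock that is not held is a caller bug. */
    assert(atomic_load_explicit(&lock->locked, memory_order_relaxed) != 0);

    /* Release store; on AArch64 compilers typically emit stlr for this. */
    atomic_store_explicit(&lock->locked, 0u, memory_order_release);
    wake_waiters();
}
```

Either change on its own was enough to avoid the hang in our testing; the sketch combines them only to show where each one goes.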
Note that 99.9% of the time the system behaves just fine and it takes a kernel-heavy stress test on all cores running for about half an hour before the hang happens.
When we saw a similar issue on another board, it turned out that the hardware provider uses its own cache coherency implementation rather than the ARM CCI. I don’t know if that is the case here.
Thanks for the update. The LX2160A uses ARM’s CCN-508 CHI implementation for coherency. I doubt there are any modifications. There is an erratum regarding entering and exiting wfi/wfe and atomic operations. I will review that again and look at NXP’s assembly to verify it meets their recommended workaround.