Bedrock R7000 throttling?

Hello,
I’m using my Bedrock R7000 mostly as a 24/7/365 server+firewall+gateway, and it rocks.
I’m also using it occasionally as a workstation with GUI (xfce4).
Recently I’ve been getting very long stalls in the GUI: it becomes unresponsive for anywhere from a few seconds to more than 60 seconds. The enclosure is very hot but still OK, I guess (somewhere between 50 and 60°C).
One NVMe reaches 62°C and CPU is between 53 and 64°C.

If I cool the unit down (with some kind of plastic ice pack for example) the GUI no longer stalls.
Summer is coming and my house is really not that hot yet (22-23°C).
Still, the GUI is unusable under very light GPU load (a few static web pages in Firefox, 1-2 terminal windows).
I’m a bit worried for the months to come…

Those temperatures should not be anywhere close to causing the CPU or GPU to throttle. The Ryzen APUs are happy running up to a junction temperature of 90°C before they throttle severely. Does FreeBSD have the ability to monitor the different core voltages and temperatures? On Linux I use a utility called RyzenAdj, but you can also monitor some of the temps with the standard lm-sensors package. I would also check whether FreeBSD has PPD (power-profiles-daemon) support for choosing between power profiles. Some of this can also be tweaked in the UEFI firmware configuration.

Also, have you opened the case at all? Bedrock is very particular about how it is assembled, because of the advanced internal cooling system needed for effective passive cooling.

I have opened the case to add the second NVMe drive. That requires removing only one panel, and I was very careful removing it and putting it back. By touch I can tell that the temperature is pretty much the same on both panels, so I guess the passive cooling was not compromised by this operation.

I’m monitoring temps with munin / sysctl so it’s pretty basic:

(12:27)# sysctl -a | grep temp
…/…
dev.cpu.15.temperature: 38.6C
dev.cpu.14.temperature: 38.6C
dev.cpu.13.temperature: 38.6C
dev.cpu.12.temperature: 38.6C
dev.cpu.11.temperature: 38.6C
dev.cpu.10.temperature: 38.6C
dev.cpu.9.temperature: 38.6C
dev.cpu.8.temperature: 38.6C
dev.cpu.7.temperature: 38.6C
dev.cpu.6.temperature: 38.6C
dev.cpu.5.temperature: 38.6C
dev.cpu.4.temperature: 38.6C
dev.cpu.3.temperature: 38.6C
dev.cpu.2.temperature: 38.6C
dev.cpu.1.temperature: 38.6C
dev.cpu.0.temperature: 38.6C
dev.amdtemp.0.core0.sensor0: 38.6C
dev.amdtemp.0.sensor_offset: 0
dev.amdtemp.0.%iommu: 
dev.amdtemp.0.%parent: hostb0
dev.amdtemp.0.%pnpinfo: 
dev.amdtemp.0.%location: 
dev.amdtemp.0.%driver: amdtemp
dev.amdtemp.0.%desc: AMD CPU On-Die Thermal Sensors

But I noticed a very odd thermal profile: the temperature readings plateau around 50°C for a full day, starting at the beginning of the day. After that, readings fall to around 40°C for 2 to 3 days, then go back up to ~50°C for another full day.

The last rise is due to my GUI session yesterday evening. The other rises are completely unrelated to my activity.
I’ve set up a script to log temperature and CPU frequency every 5 seconds to try to map this precisely.
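
A logging loop of this kind could look like the sketch below. It is not the script actually used here: it assumes the `dev.cpu.0.temperature` sysctl shown above and the `dev.cpu.0.freq` sysctl exposed by FreeBSD's cpufreq driver, and the function names and the log path are made up for illustration.

```sh
#!/bin/sh
# Hypothetical sketch: log CPU temperature and frequency as CSV.

# format_sample TEMP FREQ
# Prints "epoch,temp_C,freq_MHz"; strips the trailing "C" that
# sysctl appends to temperatures (e.g. "38.6C" -> "38.6").
format_sample() {
    printf '%s,%s,%s\n' "$(date +%s)" "${1%C}" "$2"
}

# log_loop [INTERVAL]
# Samples forever, every INTERVAL seconds (default 5).
# Assumes dev.cpu.0.freq is available on this machine.
log_loop() {
    while :; do
        format_sample "$(sysctl -n dev.cpu.0.temperature)" \
                      "$(sysctl -n dev.cpu.0.freq)"
        sleep "${1:-5}"
    done
}

# Example (the path is an assumption):
#   log_loop 5 >> /var/log/cpu-temp-freq.csv
```

Keeping the formatting in its own function means the same CSV layout can be reused if more sysctl values are added later.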

In the meantime I’ll investigate voltage.

A FreeBSD pkg exists for RyzenAdj so I’ve used it to retrieve metrics:

# ryzenadj -i
CPU Family: Phoenix Point
SMU BIOS Interface Version: 14
Version: v0.16.0 
PM Table Version: 4c0008
|        Name         |   Value   |     Parameter      |
|---------------------|-----------|--------------------|
| STAPM LIMIT         |    28.000 | stapm-limit        |
| STAPM VALUE         |     7.014 |                    |
| PPT LIMIT FAST      |    51.000 | fast-limit         |
| PPT VALUE FAST      |     6.012 |                    |
| PPT LIMIT SLOW      |    41.000 | slow-limit         |
| PPT VALUE SLOW      |     6.026 |                    |
| StapmTimeConst      |    99.694 | stapm-time         |
| SlowPPTTimeConst    |    96.929 | slow-time          |
| PPT LIMIT APU       |    41.000 | apu-slow-limit     |
| PPT VALUE APU       |       nan |                    |
| TDC LIMIT VDD       |    54.000 | vrm-current        |
| TDC VALUE VDD       |     2.425 |                    |
| TDC LIMIT SOC       |    16.000 | vrmsoc-current     |
| TDC VALUE SOC       |     1.046 |                    |
| EDC LIMIT VDD       |   105.000 | vrmmax-current     |
| EDC VALUE VDD       |    16.607 |                    |
| EDC LIMIT SOC       |    23.000 | vrmsocmax-current  |
| EDC VALUE SOC       |     1.900 |                    |
| THM LIMIT CORE      |   100.000 | tctl-temp          |
| THM VALUE CORE      |    37.056 |                    |
| STT LIMIT APU       |     0.000 | apu-skin-temp      |
| STT VALUE APU       |     0.000 |                    |
| STT LIMIT dGPU      |     0.000 | dgpu-skin-temp     |
| STT VALUE dGPU      |     0.000 |                    |
| CCLK Boost SETPOINT |       nan | power-saving /     |
| CCLK BUSY VALUE     |       nan | max-performance    |

I’m not sure what to expect as “normal” values. Maybe I’ll need to dump that into Splunk too.
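
To feed a table like the one above into a log collector, one option is to flatten it into `NAME=value` pairs first. The sketch below is only an illustration: `ryzenadj_kv` is a made-up helper name, and it assumes the table layout stays exactly as printed above (pipe-separated columns, with `nan` for unavailable values, which are skipped).

```sh
#!/bin/sh
# ryzenadj_kv (hypothetical helper): read `ryzenadj -i` output on
# stdin and print "NAME=value" for each table row that has a
# numeric value. Header, separator, and "nan" rows are skipped.
ryzenadj_kv() {
    awk -F'|' '
        NF >= 4 {
            name = $2; value = $3
            gsub(/^ +| +$/, "", name)    # trim column padding
            gsub(/^ +| +$/, "", value)
            gsub(/ /, "_", name)         # "STAPM LIMIT" -> "STAPM_LIMIT"
            if (value ~ /^[0-9]+(\.[0-9]+)?$/) print name "=" value
        }'
}

# Example:
#   ryzenadj -i | ryzenadj_kv
```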

I would definitely log all these parameters as well. My guess is you are hitting the STAPM limit and that is causing your issue.

STAPM LIMIT | 28.000 | stapm-limit

I would also recommend seeing whether amdgpu_top runs on FreeBSD; it will give you metrics like this.

Unfortunately there is no amdgpu_top on FreeBSD, just radeontop (which shows nothing interesting).

So far, STAPM VALUE stays well below STAPM LIMIT, but Xorg/Xfce4 is not started, so the machine is only acting as a headless server. I’m logging the output of ryzenadj -i every 5 seconds, and if possible I’ll put some load on the GUI tonight to see how the metrics behave.
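
The "well below the limit" check can be made explicit by expressing STAPM VALUE as a percentage of STAPM LIMIT for each sample. This is a rough sketch, not part of the actual logging setup: `stapm_usage` is a hypothetical helper that assumes the `ryzenadj -i` table format shown earlier.

```sh
#!/bin/sh
# stapm_usage (hypothetical helper): read `ryzenadj -i` output on
# stdin and print STAPM VALUE as a percentage of STAPM LIMIT,
# with one decimal place. Exits non-zero if no usable limit is found.
stapm_usage() {
    awk -F'|' '
        $2 ~ /STAPM LIMIT/ { limit = $3 + 0 }
        $2 ~ /STAPM VALUE/ { value = $3 + 0 }
        END {
            if (limit > 0) printf "%.1f\n", 100 * value / limit
            else exit 1    # limit row missing or zero
        }'
}

# Example:
#   ryzenadj -i | stapm_usage
```

Values that stay close to 100 during a freeze would support the STAPM hypothesis; low values would point elsewhere.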

Some evening news:

  • I can’t reproduce the unresponsiveness I had the other day, despite putting way more load on the GUI than before.
  • I have some nice graphs now :slight_smile:

sample:

All of them: https://www.patpro.net/cafesale/all-RyzenAdj.png

Before «Speedometer Firefox»: the PC is idle, no GUI started.
I start the GUI, start Firefox, and start the Speedometer web benchmark.
Speedometer stops; I start the MotionMark benchmark in Firefox.
MotionMark stops; I start the MotionMark benchmark in Chromium (CPU temp reached 73°C).
MotionMark stops; I start using the GUI normally (music player, browser, terminal).

I had zero slowdowns and zero freezes during the whole experiment. Perfect GUI responsiveness.

I’ll keep it logging, and I’ll screenshot the graphs as soon as I get a freeze again.

Tonight I played a game for 2 hours: the temperature skyrocketed and I almost burnt myself on the PC case (epic!), but there was no noticeable throttling. Back in the window manager, everything is smooth and super-responsive.
CPU at ~80°C, one NVMe above 80°C.

all graphs from ryzenadj:


I can conclude that there is no link between temperature and the GUI freezes. But I still have no idea what is causing the problem :frowning: