Is HWiNFO causing the dreaded WHEA-Logger Event ID XX Cache Hierarchy Errors and sudden reboots on AMD Ryzen systems?

Martin · May 29, 2021

Try my previous suggestion and disable monitoring of the NVIDIA GPU sensor: hit the Del key over its heading.

craxton · May 31, 2021

Jackalito said:
Hello everyone:

A couple of users and myself have been suffering sudden reboots with our computers composed of Ryzen CPU systems (Ryzen 3000, but especially 5000) under different load conditions. The quickest way for us to trigger it, however, has been by using software designed to test RAM stability such as TM5 or Karhu RAM Test.

We have recently discovered that this problem only occurs if we have HWiNFO loaded in the background on Windows 10. Most of us also have AMD Radeon graphics cards, but we have yet to determine if that is a contributing factor. We don't know exactly where the conflict is, but the pattern is clear: we see the dreaded WHEA-Logger Event ID XX Cache Hierarchy error in the Event Viewer of Windows after those sudden reboots.

This has been tested by multiple users across different setups: motherboard manufacturers, AGESA/BIOS revisions, RAM brands and configurations, settings, and even after a fresh install of Windows (including different versions of the operating system). The only common denominator we have been able to find thus far is the use of HWiNFO (we've only tested this using the latest versions; we still don't know if it can be solved by rolling back to a specific previous version).

I'm sharing this information here with the hope that this problem can be reproduced and fixed accordingly. Perhaps, it will also require collaboration with AMD.

Thank you for your time.

Seen you on OCN a few times consulting with Veii, and perhaps manna and mongol.
Been a while... however,

HWiNFO runs daily on my PC (right now, for instance), running 4x8 3200c14 (4000MHz c16 flat, tuned), WHEA-free except for
ID 18, which was my curve being too aggressive. y-cruncher, Prime, OCCT, TM5, and HCI all pass (whether HWiNFO is running or not
doesn't really matter). MSI B550 Gaming Edge WiFi.

Are you sure your setup is stable? You should check in on OCN if you haven't in some time;
a lot has been discovered lately.
More people can run 4000MHz, but most get 3800... Anyhow, CPU_VDDP was a major issue, with most BIOS revisions setting it beyond 900mV.
Now (with most new BIOS revisions) it's fixed at 900mV. Some boards allow changing it; MSI doesn't.
Per an email reply, an APU is needed; without one, VDDP (CPU, not CLDO) is not unlockable, and even a modded BIOS
changes nothing. Even using MSI Dragon Power didn't change it; again, it's already fixed.

If you see WHEA 18, it's 100% core voltage. If it's WHEA 19, then your RAM isn't 100% stable.
Use y-cruncher, at least 4 passes, to test it... Limit your PBO some (I'd suggest stock settings, which would be off, to test),
then run the test again with PBO on, etc. (WHEA 20, btw, happens as well when 18 happens.)

Neither I nor any other user I've seen has had HWiNFO cause an issue... now MSI Dragon Center... different story.

(EDIT) I don't have an AMD GPU, but I do have, or had, issues with WHEA 18... use core offset and bump it
to .3xxmv (it should be stepping). Hadn't read the other responses... before going through hoops, give a mV bump a shot.
DF C-states are still broken in 90% of all AGESA to date, which can cause this sudden reboot.

(Perhaps I should have looked at the post instead of the first response.)
Didn't know this was old... but nonetheless, WHEA 18 is CPU voltage related.
That much is certain, with many people being able to confirm.
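
For anyone who wants to check which of the WHEA-Logger event IDs mentioned above (18, 19, 20) their system has actually logged, here is a minimal Python sketch that builds a query for the Windows `wevtutil` tool. The helper names (`whea_xpath`, `wevtutil_command`, `query_whea`) are hypothetical and not from this thread, and it assumes the events sit in the System log under the provider name `Microsoft-Windows-WHEA-Logger`. It only lists events; it says nothing about what caused them.

```python
import subprocess

# Assumption: WHEA events are logged to the System log under this provider name.
WHEA_PROVIDER = "Microsoft-Windows-WHEA-Logger"


def whea_xpath(event_ids):
    """Build an XPath filter matching WHEA-Logger events with the given IDs."""
    id_clause = " or ".join(f"EventID={int(i)}" for i in event_ids)
    return f"*[System[Provider[@Name='{WHEA_PROVIDER}'] and ({id_clause})]]"


def wevtutil_command(event_ids, max_events=20):
    """Return the argument list for a newest-first query of the System log."""
    return [
        "wevtutil", "qe", "System",
        f"/q:{whea_xpath(event_ids)}",
        "/f:text",           # human-readable output
        f"/c:{max_events}",  # cap the number of events returned
        "/rd:true",          # newest events first
    ]


def query_whea(event_ids=(18, 19, 20), max_events=20):
    """Run the query (Windows only) and return the raw text output."""
    result = subprocess.run(
        wevtutil_command(event_ids, max_events),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On Windows, `print(query_whea())` would dump the most recent matching events; on other systems only the two builder functions are useful.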

Jo3yization · May 31, 2021

Jo3yization said:
Hey everyone,

System Info:
Ryzen 5 3600 4.2ghz @ 1.125v Allcore
B450M Mortar Max Bios: 7B89v2D Agesa 1.2.0.2, previously 1.1.0.0 with no improvement.
Chipset Driver: 2.13.27.501 - 2/4/2021
Gigabyte RX 6700 XT - 21.5.2 Driver
Sfc /scannow - chkdsk - DISM, no problems.
PSU - Seasonic Core GM-650w(Gold)

30 mins Asus Realbench, ~200% MemTest Pro stable, all kinds of real-world load use stable prior to MAY 2021: 5-hour gaming sessions, 3-hour video rendering etc. General-use temps under 72C at max load, <60C gaming. Often left on 24/7.

Then randomly, over the past month, at idle, things like YouTube or launching a game started triggering a black screen/crash with a Cache Hierarchy Error. This has happened mostly over the past 30 days, after setting HWiNFO to start with the system and simply leaving it on (I was using it less before and not running it 24/7). Originally, at the start of the month, I was getting this error on AGESA 1.1.0.0, and it continued after I updated to 1.2.0.2 on May 10th in an attempt to resolve the problem, with no improvement. Over the month of May I had the Cache Hierarchy Error ~11 times at seemingly random times on random Processor APIC IDs, sometimes with two different Processor APIC IDs reporting a cache error at exactly the same time.

After the crash and attempting to reproduce the issue, I could not force it to crash again by performing the same actions via applications, but sometimes it would crash when left idle again without me noticing, for only a few hours of idle. (HWinfo open), so I did not even recognize some of the entries until taking a closer look.

At first I tried increasing Vcore by +0.05v and SoC (1.050 > 1.100v), thinking it was my undervolt idle voltages or some kind of degradation, with NO improvement. Then, a few days back, I stumbled across this thread mentioning HWiNFO. The problem should have been fixed in the current version, 7.04, that I was running; nevertheless I decided to avoid using HWiNFO to rule it out and see what happens. I also disabled GPU ULPS via Afterburner, as it was the other prime suspect in other idle-crashing threads, since the CPU voltage increases did not seem to help. But I had been running ULPS at default for months prior without issue. (Maybe something to do with Resizable BAR, CPU & GPU cache, and something HWiNFO is doing between core parking and idle; just a theory.)

So anyway, I'm onto the 3rd day now and the issue has not returned, but it's still too early to know for sure. Anecdotal, I know, but I just wanted to share. At the same time I even reverted the previously suspected CPU undervolt values to what I originally tested as load stable, and it's simply been rock solid (so far). I've left the system to idle for a few hours multiple times, between 3-hour gaming sessions with ReLive recording etc., followed by long idles or minimal but normal desktop use.

Here's the log.
View attachment 6509
Processor APIC ID's
01/05 - 11
03/05 - 8, 11
04/05 - 5 & 10
16/05 - 8, 0 & 12.
24/05 - 5 & 12.
25/05 - 8
I tried to find some pattern of the same cores throwing cache errors; nothing conclusive imo. Core 2 is the fastest in CCX 0 and tends to sleep before cores 1 & 3, but usually only after CCX 1 is completely asleep. Core 5 is fastest in CCX 1, and core 4 is second fastest. Cores 2, 4, 5 & 6 sleep the most when watching Ryzen Master at desktop, but cores 4 & 8 usually sleep before 5, with core 2 sleeping last and 1 & 3 the last ones left awake... (feels like a riddle: which core is most likely to error?)
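
To make the "is there a pattern?" question concrete, the log above can be tallied with a few lines of Python. This is only a sketch: the dates and APIC IDs are copied from the list in this post, while the function names are made up for illustration. Note that it counts APIC IDs as logged and does not try to map them to physical cores, since that mapping is not given in the thread.

```python
from collections import Counter

# Dates (DD/MM) and Processor APIC IDs copied from the log in this post.
CRASH_LOG = {
    "01/05": [11],
    "03/05": [8, 11],
    "04/05": [5, 10],
    "16/05": [8, 0, 12],
    "24/05": [5, 12],
    "25/05": [8],
}


def tally_apic_ids(log):
    """Count how often each APIC ID appears across all crash dates."""
    counts = Counter()
    for ids in log.values():
        counts.update(ids)
    return counts


def multi_core_dates(log):
    """Return the dates on which more than one APIC ID reported an error."""
    return [date for date, ids in log.items() if len(ids) > 1]


counts = tally_apic_ids(CRASH_LOG)
# Most frequent first; ties broken by APIC ID.
for apic_id, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"APIC ID {apic_id:>2}: {n} error(s)")
print("Dates with multiple IDs:", ", ".join(multi_core_dates(CRASH_LOG)))
```

The tally gives 11 entries in total, with APIC ID 8 appearing three times and no other ID more than twice. With a sample this small that is consistent with "nothing conclusive".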

TLDR: I started using HWiNFO daily in May and started getting crashes (not knowing it might have anything to do with HWiNFO). I was on 6.42 and updated to 7.04 on the 18th; the version change didn't seem to help, and I still had multiple Cache Hierarchy crashes. In the months before the crashes I would only run HWiNFO when benchmarking or stress testing instead of 24/7. Now it's been ~3 days since keeping HWiNFO closed and not a single error. Will keep you posted, as nothing is conclusive yet; give it another 2 weeks before I'll know for certain.

Just an update: Day 6, and the WHEA Cache Hierarchy Event 18 has not returned. In light of the most recent comment: upping core voltage on my undervolt was one of my first reactions after getting this error during mostly idle/light usage. It did not help, and I'm still running my original stress-test-stable undervolt (4.2GHz @ 1.125v). The system is being used heavily lately, mainly for casual use, browsing and 5hr+ gaming sessions with ReLive recording on top and plenty of idle time in between, with zero abnormal behavior.

Since I want to make my testing as thorough as possible, I'll give it until the 8th of June(so a full 2 weeks) before re-enabling HWinfo 'launch at startup' again which I believe was the main trigger for these errors appearing but I do believe it's a combination of other applications WITH HWinfo rather than HWinfo by itself. Using HWinfo for short periods has been perfectly fine. I have also postponed Windows, Chipset & GPU driver updates until the 22nd of June to minimize variables as much as possible unless they specifically mention a Cache Hierarchy issue.
I'll report back if the error re-appears anytime before the 8th along with any specific activity around that time.
--------------------------------------------
@Martin Is it possible that one of the AMD processes added to the 21.4.1 drivers for Ryzen Master // CPU monitoring integration into the GPU drivers could be conflicting with HWinfo 'launch at startup'? I have performance metrics & overlays ALL disabled within the radeon software, but happened to notice more AMD processes than what was standard ~6 months back.

You can see the metrics added into the GPU software here: https://www.amd.com/en/support/kb/release-notes/rn-rad-win-21-4-1
- Also worth noting, if I enable 'CPU' metrics, all the CPU sensors appear with no visible loading delay, so I assume some kind of driver-level integration with the Radeon Software..., could that conflict with HWinfo at startup in any meaningful way? Here's a screenshot of the metrics available for monitoring: https://postimg.cc/PPqDFJWm I've kept all Radeon monitoring disabled to avoid conflicts since they first added it, but the fact they appear almost instantly when re-enabling makes me wonder.

Cheers!

Martin · May 31, 2021

I cannot exclude such possibility, but I think it would be very unlikely. The way those metrics are pulled by AMD would be similar to what Ryzen Master does for a couple of years and there are no known conflicts with HWiNFO. Also, if this would be the case, there would be dozens (or hundreds) of similar reports from many other users, but there aren't such.
I'd rather search for a conflict with some other application/tool.

Jo3yization · May 31, 2021

Thanks for the reply, and no worries, I was just curious. When I resume testing with HWiNFO, the prime suspects will be MSI Afterburner, NiceHash (which I'm leaving off for now), and ULPS. It's just odd that the issue only seemed to appear when using HWiNFO at startup and never via manual launch for stress testing before, when I had all the previous settings/apps enabled.

Also worth noting: while I was using HWiNFO to add OSD to RivaTuner, there was still some duplicate monitoring going on in the background for some of the sensors, as it would be a pain to disable all the Afterburner OSD readings in HWiNFO when I use both applications separately from time to time. There doesn't seem to be an issue using both together after the system has booted.

But anywho, I'll get into disabling sensor groups once I start booting /w HWinfo again next week, I wish I could speed things up, but given how irregular this issue can be, I figure 2 weeks off and on with a much more controlled set of applications would be best.

craxton · Jun 1, 2021

For "heavy use" or "stress test", run y-cruncher 1, 7, 0, "enter". If you can run this,
then I'll agree it's not your undervolt so much. (Then again, Prime95 with my undervolt passes:
no curve settings, just pure undervolt with 4850 boost. But I get way better performance (in benchmarks)
with curve settings applied.)

I can state that, with or without HWiNFO running, with nothing but PBO in Prime (limiting PPT to reduce heat),
I get no crash while not using curve. I ran OCCT core cycler all day while HWiNFO was running and hadn't had any crash or errors happen.
But core-cycler as found on OCN, made by... sp00n82, well, it does indeed yell that Prime has encountered an error,
with (auto) time as my setting, as it'll run all Prime95 has to offer "with what's set default inside the config.ini for core-cycler".

What program are you using to confirm your stability? I personally went from never having any WHEA to all of a sudden KABLAM!
WHEA 18 (no 19s; even this 2000 FCLK has no issue with the 4x8 config): WHEA Cache Hierarchy Event 18 on APIC ID 6 five times in a week and one on APIC ID 11,
which is what core cycler was complaining about... (again, this only happens with curve offset).
Remove the curve and I'm good... I'll 100% confirm this on my system. (No settings inside my BIOS that would normally be hidden are altered;
they're all unlocked/unhidden, but I have no need for most of these settings.) DF C-states are on (spread spectrum is off).

Do let me know, though, what program(s) you're using, as one program isn't enough anymore.
TM5 passes (for RAM) but HCI might not; y-cruncher might pass, but TM5 at 25 cycles
might get error 3 on the 25th cycle, etc. You get my point... If you would like to test out
this core-cycler I speak of: https://github.com/sp00n/corecycler/releases
Hopefully this is allowed to be linked, as it's pretty useful when used the correct way (it'll take weeks to set up a curve)
unless your chip is a beast at OC. Mine's a dud at core overclocking, just a diamond in the IMC department.

Jo3yization · Jun 1, 2021

By 'being used heavily' I meant constantly/a lot: multitasking, gaming, video renders etc. I don't mean 'heavily' as in pegging CPU usage at 99% with synthetic mathematical/compute workloads for hours on end (I don't do scientific work on the PC or Folding@home etc.).

The WHEA issue for me only happened under idle/very light use, so while a fail in y-cruncher or Prime95 might be meaningful for load instability, it would not be helpful in diagnosing an idle issue like this imo. There are also plenty of accounts online of users 'stress-testing' for hours in Prime95/IBT/OCCT etc. and being 'stable', only to have X game or other application crash or encounter an issue like the one being discussed in this thread, which I'm sure you've encountered yourself given you know one program isn't enough xD. I shifted over to minimal synthetic stress-testing over a decade ago and favor use-case stability testing these days, rather than long hours of stress-testing workloads the system would never realistically see.

And I agree, you could use 5 different stability test programs for hours each, and the 6th stresstest you try, or that one app that loads the system differently could still cause a crash, or you could still have idle instability if you OC with power saving features enabled. I used to spend days stability testing /w a suite of applications back in the sandy bridge 2600k era, but these days it just isn't practical imo.

The Ryzen 5 3600 & B450M Mortar Max don't have curve optimization features (I believe it's 5000 only?). At least on the latest MSI B450M BIOS, PBO doesn't work well with undervolts either, so it's a locked all-core undervolt, rock stable in terms of real-world use for months, up until the WHEA Cache Hierarchy error appeared within days of setting HWiNFO to launch at startup. An interaction with something else running on the system, rather than actual CPU instability, is my main suspicion.

I do run basic stress-testing, which consists of HCI MemTest Pro to ~200% and 30 mins of Asus Realbench, followed by running all the applications I plan to use on the PC for a few hours at a time. For the systems I have used this method with, it's proven reliable long term. My i7-6700K @ 4.5GHz has been stable for ~2 years and is still going strong with this same method. If the system can do everything you need it to do without any performance issues or random crashing/BSODs, that's my definition of stable, except when it comes to weird situations like this, where the issue is not easily reproducible under light load and only a handful of programs running at the time are involved. If it was genuine Vcore/idle instability, I would expect the problem to continue after disabling the suspect applications.

CommanderShepard · Jun 4, 2021

@Martin

I run 2 rigs full time on Folding@home for the GamersNexus Covid-19 research team, and recently noticed this too after monitoring my systems with HWINFO64. I was trying to log a temperature baseline for the hardware to assess thermals, since the systems are under constant near-100% load, either folding or gaming.

I'm wondering if there is a resolution to this and, if not, whether there is anything I can provide to help troubleshoot the issue, as I really enjoy HWINFO64 and would prefer to keep it running. Unfortunately, hardware monitoring software causing BSODs is less than ideal.

Main Gaming Rig:
CPU: Ryzen 9 5950x, Stock, no PBO, or Manual OC (Arctic Liquid Freezer 280)
RAM: 64GB Trident Z Royal 3600 CL16 (Samsung B-Die)
MOBO: Gigabyte X570 AORUS MASTER (rev. 1.2) (BIOS Rev F32 AGESA ComboV2 1.1.0.0 D)
PSU: EVGA SuperNova 1200W P2
Video Card: Asus RTX 3090 (TUF-RTX3090-O24G-GAMING)
Boot Drive: WD SN850 1TB NVMe
Game Drive: Samsung 980Pro 1TB NVMe
Storage: Intel 660p 2TB NVMe

Home Theater PC:
CPU: Ryzen 9 3950x, Stock
RAM: 32GB Trident Z Royal 3600 CL16 (Samsung B-Die)
MOBO: Gigabyte B550I AORUS Pro AX (BIOS Rev F1, Didn't upgrade due to Ryzen 3000 CPU)
PSU: EVGA SuperNOVA 650 GT
Video Card: EVGA GTX 1080 Ti SC2
Boot Drive: Samsung 970 EVO Plus 500GB NVMe
Game/Storage Drive: Samsung 970 EVO Plus 1TB NVMe

Folding Stats:

CommanderShepard User Summary - Folding@Home Stats

F@H stats user summary for CommanderShepard. Help Folding at Home fight Coronavirus, further medical research, and prevent diseases with distributed computing!

folding.extremeoverclocking.com

If requested I can provide more information, this is all I know off the top of my head as I'm currently "at work" and clearly not researching something unrelated to work

Anyways, "I'm CommanderShepard and HWINFO64 is my favorite software on the Citadel"
I should go...

Martin · Jun 4, 2021

Are you also running some other monitoring or tuning tools along with HWiNFO? If yes, try closing them and check whether the issue still happens.

CommanderShepard · Jun 5, 2021

Hi @Martin,

I have run GPU-Z at the same time before; however, I typically just run HWINFO64 alone. I'll make a note to run only HWINFO64 for future reference.
On a side note, the only monitoring software I run or install is GPU-Z, CPU-Z, HWINFO64, and MSI Afterburner. Anything else I consider bloatware, i.e. Gigabyte software, AMD Ryzen Master software, etc. Call me old school if you like, but that's what a BIOS/UEFI is for in regards to setting OC/XMP or anything related to that.

Reading this forum, I'm getting the indication that HWINFO64's issue stems from GPU monitoring, and I've seen recommendations to disable the GPU monitoring. If that's true, I'm OK with that, as I personally feel GPU-Z is easier for monitoring GPU health, at least in its current state.

Jo3yization · Jun 6, 2021

Update June 7th: another Cache Hierarchy crash yesterday morning (June 6th), though it was in a rather unusual scenario and two of the suspect programs were involved.

Firstly, I had tried to run NiceHash the night before but changed my mind; it ran only briefly before the computer was shut down. Nevertheless, I feel it's still important to mention because of the way RDNA 2 gets stuck in 'compute mode' after any mining activity. There is no graphics/compute switch within the drivers anymore, so NiceHash itself switching the GPU to compute mode seems to be enough to get it 'stuck' in compute. This causes Radeon ReLive to record at ~48fps, even with the software set to 60fps, when trying to record gameplay after mining without a reboot. I'm not sure if simply shutting down or resetting the system 'resets' the GPU back to normal graphics mode completely, but a reboot is needed to fix the ReLive framerate issue and get 60fps recording again. NiceHash was also run more than once within ~24hrs of some of the WHEA Cache Hierarchy crashes in early May, when they were much more frequent and close together, sometimes shortly after a reboot too. NiceHash had not been run for almost a full 2 weeks until the night before this happened.

The other half of the unusual circumstance is that it happened when loading a quicksave in a game (Cyberpunk) for a benchmark test rather than at idle, but 'idling' was involved due to the way the quickload system in the game works. Basically, I loaded into Cyberpunk and it was working fine: CPU & GPU temps good, FPS normal. Then I went to use the Afterburner benchmark tool and forgot that the quickload button had been set to F7 a long time ago, the same button I've been using for MSI Afterburner to 'reset' the benchmark tool. So when I hit F7 to reset the benchmark tool I also triggered the quickload, which looked like it froze the game (it was actually just loading). I hit F7 a few more times trying to reset the tool, which I think triggered a second quickload before the first one had fully activated. This is when mouse input began to lag, followed by the Cache Hierarchy crash in the middle of the quickload. So it's hard to pinpoint whether this was the game itself, drivers, NiceHash, or a combination of activating the Afterburner benchmark at the same time as a game-load trigger, but VRAM was involved, and the GPU being in 'compute' from the night before could be a key factor.

After the crash I immediately reloaded Cyberpunk, fixed the quickload shortcut and tried to reproduce the issue via normal quickloading again a few times & resetting the benchmark tool separately multiple times along with playing the game, it worked 100% normally, but I did not try to repeat the same exact steps that caused this in the first place due to work I have to do on the PC.

Later this week I do plan to see if I can force the crash by doing the exact same thing, with NiceHash plus the quickload and benchmark reset both set to F7, to see if this compute mode 'bug' may be the cause. I will do all this before I start testing with HWiNFO launching on startup again, to completely vindicate it. Regardless, NiceHash will definitely be a program to keep disabled during the next 2-week period with HWiNFO back in my daily-use programs, as the occurrence of these crashes 'around' the same use periods as NiceHash before feels more than coincidental. Interaction with MSI Afterburner will definitely be under scrutiny as well.

Anywho, this is the first Hierarchy crash I've had without HWiNFO running. Given the very niche scenario it could be drivers, GPU sensors, or maybe both, but the GPU compute mode getting stuck seems like a major factor in triggering the Cache Hierarchy error. I'd need help from other RX 5000/6000 series owners to confirm. There used to be a manual option to switch between Compute & Graphics, but I believe it's only available externally now, and rebooting may not be enough to completely reset to 3D graphics mode.

After the Cache Hierarchy crash, the GPU driver is always 'reset' which might also explain why the system works fine afterwards & the issue is so intermittent.

jackmeat · Jul 2, 2021

Ok, I just finished a semi new build. More or less pass down parts but anyway it is a Ryzen 3600, rx 580, asrock b550 phantom. Ok, the point is I am glad I stumbled across this forum. I can tell you with confirmation, if HWInfo ver 6.40 or 6.43beta (I read this whole post) are running and I switch monitors via my KVM switch, it causes this exact issue (the one from February, WHEA-logger A fatal hardware error has occurred. A record describing the condition is contained in the data section of this event.) Being very easy to trigger the crash, all I have done was stop HWinfo.exe from running, and bingo, no crash. So it is for sure HWinfo in this instance, and I am quite annoyed only because I have been using this program for as long as I can remember. Running on my gaming pc and laptop is no issue since neither contain a ryzen cpu or amd gpu. Sorry to say I am going to have to be looking elsewhere for something to monitor temps on this PC. I figured I would at least give you the info, Martin, since you were so helpful in this post in fixing this. Maybe knowing this exact specific can help. EDIT: I just tried it as well with CoreTemp, and it is the exact same problem. AMD issue of some sort.
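
When comparing reports like this one, it can help to pull the same fields out of each WHEA-Logger message. Below is a minimal Python sketch of a parser for the "key: value" lines in the event text. The sample message is an assumption modeled on the typical layout of these events; only the opening sentence, the "Cache Hierarchy Error" type and the "Processor APIC ID" field are mentioned in this thread, and the helper names are hypothetical.

```python
import re

# Illustrative sample only: the layout is an assumption, not copied from the thread.
SAMPLE_EVENT = """\
A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 11
"""

# One "key: value" pair per line; lines without a colon are ignored.
FIELD_RE = re.compile(r"^[ \t]*([^:\n]+?)[ \t]*:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def parse_whea_message(text):
    """Return the 'key: value' lines of a WHEA-Logger message as a dict."""
    return dict(FIELD_RE.findall(text))


def is_cache_hierarchy_error(fields):
    """True if the parsed message reports a Cache Hierarchy Error."""
    return fields.get("Error Type", "").lower() == "cache hierarchy error"


fields = parse_whea_message(SAMPLE_EVENT)
print(fields.get("Error Type"), "on APIC ID", fields.get("Processor APIC ID"))
```

Collecting the parsed fields with and without the monitoring tool running would make it easier to see whether the same error type and APIC IDs keep showing up.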

Martin · Jul 2, 2021

jackmeat said:
OK, I just finished a semi-new build. More or less passed-down parts, but anyway it is a Ryzen 3600, RX 580, ASRock B550 Phantom. The point is, I am glad I stumbled across this forum. I can tell you with confirmation: if HWiNFO ver 6.40 or 6.43 beta (I read this whole post) is running and I switch monitors via my KVM switch, it causes this exact issue (the one from February, WHEA-Logger "A fatal hardware error has occurred. A record describing the condition is contained in the data section of this event."). Since the crash is very easy to trigger, all I did was stop HWiNFO.exe from running, and bingo, no crash. So it is for sure HWiNFO in this instance, and I am quite annoyed, only because I have been using this program for as long as I can remember. Running it on my gaming PC and laptop is no issue, since neither contains a Ryzen CPU or AMD GPU. Sorry to say I am going to have to look elsewhere for something to monitor temps on this PC. I figured I would at least give you the info, Martin, since you were so helpful in this post in fixing this. Maybe knowing this exact specific can help. EDIT: I just tried it with Core Temp as well, and it is the exact same problem. AMD issue of some sort.

The HWiNFO 6.4x versions that you used are quite outdated, and this specific problem was resolved afterwards. Have you tried the latest versions, HWiNFO v7.04 or v7.05 Beta?

jackmeat · Jul 3, 2021

Martin said:
The HWiNFO 6.4x versions that you used are quite outdated, and this specific problem was resolved afterwards. Have you tried the latest versions, HWiNFO v7.04 or v7.05 Beta?

No, unfortunately as of ver. 7 it is not free anymore. And like I said, same issue with Core Temp as well, so I am barking up AMD's tree now.

Martin · Jul 3, 2021

jackmeat said:
No, unfortunately as of ver. 7 it is not free anymore. And like I said, same issue with Core Temp as well, so I am barking up AMD's tree now.

Version 7.0+ is still free for non-commercial use.

jackmeat · Jul 3, 2021

Martin said:
Version 7.0+ is still free for non-commercial use.

From what I remember, something with SHM and pulling the data into the Windows gadget dies after a certain period of time if you are NOT using the Pro version. I don't need a commercial license to monitor a home PC, but because I leave it on 24/7 for Plex, HWiNFO is always sending data to the gadget. I recall rolling back to 6.4 because of that (found it on Reddit somewhere earlier this year). Anyway, I could still use it, I just can't use the gadget. Right now, having no monitor at all is better than a BSOD every time I touch the KVM.

EDIT: Removing the monitor from the KVM, using the switch only to share the KB/mouse, and plugging the PC directly into the monitor and switching there proves to work just fine. HWiNFO v6.43 is not affected by this, or Core Temp for that matter. So the issue is an AMD driver(?) issue, I would guess. They haven't gotten back to me. Sorry to blame HWiNFO, at least for my issue. Now if only that gadget thing LOL.

Shakj · May 15, 2022

Digging this up since on AMD GPU-Driver 22.5.1 the WHEA-Reboots returned for me. In Forza Horizon after an hour or so, in Far Cry 6 after 20-30 Minutes. Without HW-Info 12 Hours of Far Cry 6 without a Crash or a WHEA-Error before the Crash to Black & Reboot. Board is a B450 MSI with latest BIOS and a 3700X (Stock). GPU is a 6900XT (Downclocked, without UV). Tested with HW-Info 7.22 and 7.24 on Windows 11.

Edit: Found out I had a VRAM OC in the Wattman profile. Disabled it. Could not reproduce the crash for 2 hours in Far Cry 6.

Martin · May 15, 2022

Shakj said:
Digging this up since on AMD GPU-Driver 22.5.1 the WHEA-Reboots returned for me. In Forza Horizon after an hour or so, in Far Cry 6 after 20-30 Minutes. Without HW-Info 12 Hours of Far Cry 6 without a Crash or a WHEA-Error before the Crash to Black & Reboot. Board is a B450 MSI with latest BIOS and a 3700x (Stock). GPU is a 6900XT (Downclocked, without UV). Tested with HW-Info 7.22 and 7.24 on Windows 11.

Try to disable monitoring of the GPU sensor to see if it will still crash.

Shakj · May 16, 2022

Martin said:
Try to disable monitoring of the GPU sensor to see if it will still crash.

It was my fault. I checked the profile again in Wattman and it had a probably unstable VRAM OC saved. I disabled it and could not reproduce the error for 2 hours. I will play Far Cry 6 for some hours with HWiNFO in the background, but I think that was it. Sorry for the false alarm. I should have double-checked my settings.

brandorf · Sep 8, 2022

I'm currently on 2.27-4800. Is it possible we have a regression here? I use a laptop at work, and I'm seeing that my computer hangs with a WHEA 18 error when I'm using the laptop on a KVM (so the PC is idle). I'm seeing an uptick in these in the events log since about 8/17/22, and it's happened twice today. This is the exact same system as when we were diagnosing this issue last year.
