seen you in OCN a few times consulting with Veii, and perhaps manna and mongol
Hello everyone:
A few other users and I have been getting sudden reboots on our Ryzen systems (Ryzen 3000, but especially 5000) under different load conditions. The quickest way for us to trigger them, however, has been to run software designed to test RAM stability, such as TM5 or Karhu RAM Test.
We have recently discovered that this problem only occurs if HWiNFO is loaded in the background on Windows 10. Most of us also have AMD Radeon graphics cards, but we have yet to determine whether that is a contributing factor. We don't know exactly where the conflict is, but the pattern is clear: we see the dreaded WHEA-Logger Event ID XX Cache Hierarchy error in the Windows Event Viewer after those sudden reboots.
This has been tested by multiple users across different setups: motherboard manufacturers, AGESA/BIOS revisions, RAM brands and configurations, settings, and even fresh installs of Windows (including different versions of the operating system). The only common denominator we have been able to find thus far is the use of HWiNFO (we've only tested with the latest versions, so we still don't know whether rolling back to a specific earlier version solves it).
I'm sharing this information here in the hope that the problem can be reproduced and fixed. Perhaps it will also require collaboration with AMD.
Thank you for your time.
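For anyone who wants to check whether their reboots leave the same WHEA-Logger entries described above, here is a minimal sketch that pulls the newest ones from the System log without clicking through Event Viewer. It assumes Windows and the built-in wevtutil tool; the function names and the default of 20 events are just illustrative choices, not anything from this thread.

```python
import subprocess
import sys

# XPath filter that selects events written by the WHEA-Logger provider.
WHEA_QUERY = "*[System[Provider[@Name='Microsoft-Windows-WHEA-Logger']]]"


def build_wevtutil_command(max_events=20):
    """Return the wevtutil argument list for the newest WHEA-Logger events."""
    return [
        "wevtutil", "qe", "System",   # query events from the System log
        f"/q:{WHEA_QUERY}",           # only WHEA-Logger entries
        f"/c:{max_events}",           # limit the number of events returned
        "/rd:true",                   # newest first
        "/f:text",                    # human-readable output
    ]


def fetch_whea_events(max_events=20):
    """Run the query and return the raw text output (Windows only)."""
    if sys.platform != "win32":
        raise RuntimeError("wevtutil is only available on Windows")
    result = subprocess.run(
        build_wevtutil_command(max_events),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `print(fetch_whea_events())` from a Windows command prompt should list the most recent entries, including the event ID and the processor APIC ID reported for each one.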
Hey everyone,
System Info:
Ryzen 5 3600, 4.2 GHz @ 1.125 V all-core
B450M Mortar Max, BIOS: 7B89v2D, AGESA 1.2.0.2 (previously 1.1.0.0, with no improvement)
Chipset Driver: 2.13.27.501 - 2/4/2021
Gigabyte RX 6700 XT - 21.5.2 Driver
Sfc /scannow - chkdsk - DISM, no problems.
PSU - Seasonic Core GM-650w(Gold)
Stability: 30 mins of Asus RealBench and ~200% MemTest Pro stable, plus all kinds of real-world load stable prior to May 2021 (5-hour gaming sessions, 3-hour video rendering, etc.). Temps under 72C at max load, <60C gaming. Often left on 24/7.
Then randomly, over the past month, sitting at idle or doing things like YouTube or launching a game started triggering a black screen/crash with a Cache Hierarchy error. This has happened mostly over the past 30 days, after I set HWiNFO to start with the system and simply left it running (I was using it less before and not running it 24/7). At the start of the month I was getting this error on AGESA 1.1.0.0, and it continued after I updated to 1.2.0.2 on May 10th in an attempt to resolve the problem, with no improvement. Over the month of May I had the Cache Hierarchy error ~11 times at seemingly random times on random processor APIC IDs, sometimes with two different processor APIC IDs reporting a cache error at exactly the same time.
After a crash, I could not force it to crash again by repeating the same actions in the same applications, but sometimes it would crash again when left idle for only a few hours (with HWiNFO open) without me noticing, so I did not even recognize some of the entries until taking a closer look.
At first I tried increasing Vcore by +0.05 V and SoC (1.050 > 1.100 V), thinking it was my undervolt's idle voltages or some kind of degradation, with NO improvement. Then, a few days back, I stumbled across this thread mentioning HWiNFO. The issue should have been fixed in the current version, 7.04, that I was running; nevertheless, I decided to avoid using HWiNFO to rule it out and see what happens. I also disabled GPU ULPS via Afterburner, as it was the other prime suspect in other idle-crash threads, after the CPU voltage increases did not seem to help. But I had been running ULPS at default for months without issue. (Maybe it's something to do with Resizable BAR, CPU & GPU cache, and something HWiNFO is doing between core parking and idle; just a theory.)
So anyway, I'm onto the 3rd day now and the issue has not returned. It's still too early to know for sure, and anecdotal, I know, but I just wanted to share. At the same time I even reverted the previously suspected CPU undervolt values to what I originally tested as load stable, and it's simply been rock solid (so far). I've left the system to idle for a few hours multiple times, between 3-hour gaming sessions with ReLive recording etc., followed by long idles or minimal but normal desktop use.
Here's the log.
View attachment 6509
Processor APIC IDs (DD/MM):
01/05 - 11
03/05 - 8, 11
04/05 - 5, 10
16/05 - 8, 0, 12
24/05 - 5, 12
25/05 - 8
I tried to find some pattern of the same cores throwing cache errors; nothing conclusive imo. Core 2 is the fastest in CCX 0 and tends to sleep before cores 1 & 3, but usually only after CCX 1 is completely asleep. Core 5 is the fastest in CCX 1, and core 4 is second fastest. Cores 2, 4, 5 & 6 sleep the most when watching Ryzen Master at the desktop, but cores 4 & 8 usually sleep before 5, with core 2 sleeping last and 1 & 3 the last ones left awake. (Feels like a riddle: which core is most likely to error?)
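To make the pattern hunt a bit less of a riddle, here is a minimal sketch that tallies how often each processor APIC ID shows up in the log above. The entries are copied from that list; the function names are made up for illustration, and the script deliberately does not try to map APIC IDs to physical cores, since that mapping is not given here.

```python
import re
from collections import Counter

# Dates (DD/MM) and processor APIC IDs copied from the log above.
LOG = """\
01/05 - 11
03/05 - 8, 11
04/05 - 5 & 10
16/05 - 8, 0 & 12.
24/05 - 5 & 12.
25/05 - 8
"""


def parse_log(text):
    """Map each date to the list of APIC IDs logged on that day."""
    entries = {}
    for line in text.splitlines():
        date, sep, ids = line.partition(" - ")
        if not sep:
            continue  # skip lines that are not 'date - ids'
        entries[date.strip()] = [int(n) for n in re.findall(r"\d+", ids)]
    return entries


def tally_apic_ids(entries):
    """Count how many errors were logged against each APIC ID."""
    return Counter(apic_id for ids in entries.values() for apic_id in ids)


entries = parse_log(LOG)
counts = tally_apic_ids(entries)
for apic_id, n in sorted(counts.items()):
    print(f"APIC ID {apic_id:>2}: {n} error(s)")
print("Total errors:", sum(counts.values()))
```

On this log, ID 8 comes up three times, IDs 5, 11 and 12 twice each, and IDs 0 and 10 once, for 11 errors in total. With so few samples spread over six different IDs, that fits the "nothing conclusive" reading rather than pointing at one bad core.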
TLDR: I started using HWiNFO daily in May and started getting crashes (not knowing it might have anything to do with HWiNFO). I was on 6.42 and updated to 7.04 on the 18th; the version change didn't seem to help, and I still had multiple Cache Hierarchy crashes. In the months before the crashes I would only run HWiNFO when benchmarking or stress testing, instead of 24/7. It's now ~3 days since I started keeping HWiNFO closed, and not a single error. I'll keep you posted, as nothing is conclusive yet; I'll give it another 2 weeks before I know for certain.
@Martin Is it possible that one of the AMD processes added in the 21.4.1 drivers for Ryzen Master / CPU monitoring integration into the GPU drivers could be conflicting with HWiNFO's 'launch at startup'? I have performance metrics & overlays ALL disabled within the Radeon Software, but I happened to notice more AMD processes than what was standard ~6 months back.
You can see the metrics added into the GPU software here: https://www.amd.com/en/support/kb/release-notes/rn-rad-win-21-4-1
- Also worth noting: if I enable 'CPU' metrics, all the CPU sensors appear with no visible loading delay, so I assume there is some kind of driver-level integration with the Radeon Software. Could that conflict with HWiNFO at startup in any meaningful way? Here's a screenshot of the metrics available for monitoring: https://postimg.cc/PPqDFJWm I've kept all Radeon monitoring disabled to avoid conflicts since they first added it, but the fact that the sensors appear almost instantly when re-enabled makes me wonder.
Cheers!
I cannot exclude such a possibility, but I think it would be very unlikely. The way those metrics are pulled by AMD would be similar to what Ryzen Master has been doing for a couple of years, and there are no known conflicts with HWiNFO. Also, if this were the case, there would be dozens (or hundreds) of similar reports from many other users, but there aren't any.
I'd rather search for a conflict with some other application/tool.
Thanks for the reply & no worries, I was just curious. When I resume testing with HWiNFO, the prime suspects will be MSI Afterburner, NiceHash (which I'm leaving off for now), and ULPS. It's just odd that the issue only seemed to appear when using HWiNFO at startup and never via manual launch for stress testing before, when I had all the previous settings/apps enabled.
Just an update: Day 6 and the WHEA Cache Hierarchy Event 18 has not returned. In light of the most recent comment: upping core voltage on my undervolt was one of my first reactions after getting this error during mostly idle/light usage. It did not help, and I'm still running my original stress-test-stable undervolt (4.2 GHz @ 1.125 V). The system has been used heavily lately, mainly for casual use, browsing and 5hr+ gaming sessions with ReLive recording on top, plus plenty of idle time in between, with zero abnormal behavior.
Since I want to make my testing as thorough as possible, I'll give it until the 8th of June (so a full 2 weeks) before re-enabling HWiNFO's 'launch at startup', which I believe was the main trigger for these errors appearing, though I do believe it's a combination of other applications WITH HWiNFO rather than HWiNFO by itself. Using HWiNFO for short periods has been perfectly fine. I have also postponed Windows, chipset & GPU driver updates until the 22nd of June to minimize variables as much as possible, unless they specifically mention a Cache Hierarchy issue.
I'll report back if the error re-appears anytime before the 8th along with any specific activity around that time.
By "heavy use" or "stress test": run y-cruncher (1, 7, 0, then Enter). If you can run this, then I'll agree it's not so much your undervolt. (Then again, Prime95 passes with my undervolt.)
No curve settings, just a pure undervolt with a 4850 boost. But I get way better performance (in benchmarks) with curve settings applied.
I can state that, with or without HWiNFO running, with nothing but PBO in Prime95 (limiting PPT to reduce heat), I get no crash while not using the curve. I ran OCCT core cycler all day while HWiNFO was running and didn't have any crashes or errors.
But core-cycler, as found on OCN and made by sp00n82, does indeed yell that Prime95 has encountered an error, with (auto) time as my setting, as it'll run all Prime95 has to offer (with what's set by default inside the config.ini for core-cycler).
What program are you using to confirm your stability? I personally went from never having any WHEA errors to, all of a sudden, KABLAM! WHEA 18 (no 19s; even this 2000 FCLK has no issue with a 4x8 config): WHEA Cache Hierarchy Event 18 on APIC ID 6 five times in a week, and once on APIC ID 11, which is what core-cycler was complaining about. (Again, this only happens with a curve offset.)
Remove the curve and I'm good. I'll 100% confirm this on my system (no settings in my BIOS that would normally be hidden are altered; although they're all unlocked/unhidden, I have no need for most of these settings). DF C-states are on (spread spectrum is off).
Do let me know, though, what program(s) you're using, as one program isn't enough anymore. TM5 passes (for RAM) but HCI might not; y-cruncher might pass, but TM5 at 25 cycles might get error 3 on the 25th cycle, etc. You get my point.
If you would like to test out this core-cycler I speak of: https://github.com/sp00n/corecycler/releases
Hopefully this is allowed to be linked, as it's pretty useful when used in the correct way (it'll take weeks to set up the curve, unless your chip is a beast at OC; mine's a dud at core overclocking, just a diamond in the IMC department).
@Martin
I run 2 rigs full-time on Folding@home for the GamersNexus COVID-19 research team, and I recently noticed this too after monitoring my systems with HWiNFO64. I was monitoring to get a temperature baseline logged for the hardware to assess thermals, since the systems are under constant near-100% load, either folding or gaming.
I'm wondering if there is a resolution to this and, if not, whether there is anything I can provide to help troubleshoot the issue, as I really enjoy HWiNFO64 and would prefer to keep it running. Unfortunately, hardware monitoring software causing BSODs is less than ideal.
Main Gaming Rig:
CPU: Ryzen 9 5950x, Stock, no PBO, or Manual OC (Arctic Liquid Freezer 280)
RAM: 64GB Trident Z Royal 3600 CL16 (Samsung B-Die)
MOBO: Gigabyte X570 AORUS MASTER (rev. 1.2) (BIOS Rev F32 AGESA ComboV2 1.1.0.0 D)
PSU: EVGA SuperNova 1200W P2
Video Card: Asus RTX 3090 (TUF-RTX3090-O24G-GAMING)
Boot Drive: WD SN850 1TB NVMe
Game Drive: Samsung 980Pro 1TB NVMe
Storage: Intel 660p 2TB NVMe
Home Theater PC:
CPU: Ryzen 9 3950x, Stock
RAM: 32GB Trident Z Royal 3600 CL16 (Samsung B-Die)
MOBO: Gigabyte B550I AORUS Pro AX (BIOS Rev F1, Didn't upgrade due to Ryzen 3000 CPU)
PSU: EVGA SuperNOVA 650 GT
Video Card: EVGA GTX 1080 Ti SC2
Boot Drive: Samsung 970 EVO Plus 500GB NVMe
Game/Storage Drive: Samsung 970 EVO Plus 1TB NVMe
Folding Stats:
CommanderShepard User Summary - Folding@Home Stats
folding.extremeoverclocking.com
If requested, I can provide more information; this is all I know off the top of my head, as I'm currently "at work" and clearly not researching something unrelated to work.
Anyways, "I'm CommanderShepard and HWINFO64 is my favorite software on the Citadel"
I should go...
Ok, I just finished a semi-new build. It's more or less passed-down parts, but anyway, it is a Ryzen 3600, RX 580, ASRock B550 Phantom. The point is, I am glad I stumbled across this forum. I can confirm that if HWiNFO ver. 6.40 or 6.43 beta (I read this whole thread) is running and I switch monitors via my KVM switch, it causes this exact issue (the one from February: WHEA-Logger, "A fatal hardware error has occurred. A record describing the condition is contained in the data section of this event."). With the crash being very easy to trigger, all I did was stop HWinfo.exe from running, and bingo, no crash. So it is for sure HWiNFO in this instance, and I am quite annoyed, only because I have been using this program for as long as I can remember. Running it on my gaming PC and laptop is no issue, since neither contains a Ryzen CPU or AMD GPU. Sorry to say, I am going to have to look elsewhere for something to monitor temps on this PC. I figured I would at least give you the info, Martin, since you were so helpful in this thread in fixing this. Maybe knowing this exact specific can help.
EDIT: I just tried it with CoreTemp as well, and it is the exact same problem. An AMD issue of some sort.
HWiNFO versions 6.4x that you used are quite outdated, and this specific problem was resolved afterwards. Have you tried the latest versions of HWiNFO, v7.04 or v7.05 Beta?
No, unfortunately after ver. 7 it became not free anymore. And like I said, same issue with Core Temp as well, so I am barking up AMD's tree now.
Version 7.0+ is still free for non-commercial use.
Something with SHM and pulling the data into the Windows gadget dies after a certain period of time if you are NOT using the Pro version, from what I remember. I don't need a commercial license to monitor a home PC, but because I leave it on 24/7 for Plex, HWiNFO is always sending data to the gadget. I recall rolling back to 6.4 due to that (found it on Reddit somewhere earlier this year). Anyway, I could still use it, I just can't use the gadget. Right now, having no monitoring at all is better than a BSOD every time I touch the KVM.
EDIT: Removing the monitor from the KVM, using the switch only to share the KB/mouse, and plugging the monitor in directly and switching inputs proves to work just fine. HWiNFO v6.43 is not affected by this, nor is Core Temp for that matter. The issue is an AMD driver(?) issue then, I would guess. They haven't gotten back to me. Sorry to blame HWiNFO, at least for my issue. Now if only that gadget thing LOL.
Digging this up since on AMD GPU driver 22.5.1 the WHEA reboots returned for me: in Forza Horizon after an hour or so, in Far Cry 6 after 20-30 minutes. Without HWiNFO: 12 hours of Far Cry 6 without a crash or the WHEA error that precedes the crash to black & reboot. The board is a B450 MSI with the latest BIOS and a 3700X (stock). The GPU is a 6900 XT (downclocked, without UV). Tested with HWiNFO 7.22 and 7.24 on Windows 11.
Try to disable monitoring of the GPU sensor to see if it will still crash.
It was my fault. I checked the profile again in WattMan and it had a probably unstable VRAM OC saved. I disabled it and could not reproduce the error for 2 hours. I will play Far Cry 6 for some hours with HWiNFO in the background, but I think that was it. Sorry for the false alarm. I should have double-checked my settings.