The SW should look at THERMAL_MONITOR_STATUS [0] instead of THERMAL_MONITOR_LOG [1] of MSR 0x19C:
a. STATUS tells you whether the event is currently active or not.
b. LOG tells you that the event happened at least once since the last time LOG was cleared.
With a polling interval of 2000 ms (or even 100 ms), what are the chances that software would catch the non-sticky flag while it is actually set? It would miss 99% of short spikes.
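
To put a rough number on that, here is a minimal sketch in Python. It only decodes the two bits discussed above from a raw register value and estimates the chance that a periodic poll lands inside a spike. The function names, the 1 ms spike length, and the assumption that a spike starts at a random moment relative to the polling loop are mine, purely for illustration; nothing here reads the real MSR.

```python
# Bit positions as described above for MSR 0x19C.
STATUS_BIT = 0  # non-sticky: set only while the event is active
LOG_BIT = 1     # sticky: stays set until software clears it


def decode_thermal_bits(raw: int) -> dict:
    """Split a raw register value into the two flags (hypothetical helper)."""
    return {
        "active_now": bool((raw >> STATUS_BIT) & 1),
        "logged": bool((raw >> LOG_BIT) & 1),
    }


def catch_probability(spike_ms: float, poll_interval_ms: float) -> float:
    """Chance that a periodic poll samples the non-sticky bit during one spike.

    Assumes the spike starts at a uniformly random phase relative to the
    polling loop, so the chance is spike length divided by poll interval.
    """
    return min(1.0, spike_ms / poll_interval_ms)


if __name__ == "__main__":
    # Hypothetical 1 ms spike against the two polling intervals mentioned above.
    for interval_ms in (100, 2000):
        p = catch_probability(1.0, interval_ms)
        print(f"poll every {interval_ms} ms: catch chance {p:.2%}")
```

Under those assumptions a 1 ms spike is seen by a 100 ms poll about 1% of the time, and far less often at 2000 ms. The sticky LOG bit does not have this problem, since it stays set until it is cleared.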
Intel is aware of this issue on the latest generations, yet hasn't provided any clue as to why it happens.