How to decode MCE errors?

Debian_SuperUser

The journal logs the Machine Check Exception (MCE) errors I'm encountering, like:

I want to decode them to understand the problem better. Even AI can't do it, because the encoding is machine-specific and very low level. It told me to use either mcelog or rasdaemon. mcelog seems to be old, and I would have to compile my own kernel with CONFIG_X86_MCELOG_LEGACY for it to work. Rasdaemon, on the other hand, is newer, and I had it running in the background when an MCE was raised, but it didn't output anything. I did have the mce=off kernel parameter set, because otherwise the system would halt and reset. AI wanted me to check whether my kernel was compiled with CONFIG_RAS, and it was; CONFIG_RAS_CEC was also set, but CONFIG_RAS_CEC_DEBUG wasn't. I don't know if rasdaemon even works on my system. I have an Intel Coffee Lake processor.
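To see at a glance how the kernel options mentioned above are set, a small bash sketch like this should do. It assumes the running kernel's config is installed as /boot/config-$(uname -r), which is where Debian kernels put it; the function name show_mce_config is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: print how the MCE/RAS-related options are set in a kernel config file.
# Assumption: the running kernel's config is at /boot/config-$(uname -r) (Debian);
# pass another file as $1 if yours lives elsewhere.
show_mce_config() {
    local cfg=${1:-/boot/config-$(uname -r)}
    # Set options look like "CONFIG_RAS=y"; unset ones like "# CONFIG_RAS_CEC_DEBUG is not set".
    grep -E '^(# )?CONFIG_(X86_MCE|X86_MCE_INTEL|X86_MCELOG_LEGACY|RAS|RAS_CEC|RAS_CEC_DEBUG)(=| is not set)' "$cfg"
}
```

Call it as `show_mce_config` for the running kernel. If the /boot file is missing and the kernel was built with CONFIG_IKCONFIG_PROC, the same pattern can be run with zgrep against /proc/config.gz.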

I tried to make sense of that freaking 5000+ page Intel 64 architecture manual, and I even found the MCE chapter, but obviously I couldn't pick anything up from it. If somebody has any idea, please respond.
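The architectural part of the MCE chapter can at least be decoded mechanically. Below is a minimal bash sketch, assuming the kernel message gives the raw IA32_MCi_STATUS value as a hex number (as far as I know, x86 Linux prints it after "Bank N:" in its "Machine Check" lines). It decodes the flag bits 63:55 and classifies the 16-bit MCA error code in bits 15:0 using the simple and compound error-code patterns from the SDM's machine-check chapter as I understand them, so check them against the tables in your copy. Bits 31:16 are model-specific and are only printed, not interpreted. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: decode the architectural fields of an IA32_MCi_STATUS value.
# Hypothetical helper names; model-specific fields are printed raw, not interpreted.

# Classify the MCA error code (bits 15:0). Bit 12 (F, correction report
# filtering) is masked out of the compound-code checks.
mca_code_class() {
    local c=$1
    if   (( c == 0x0000 )); then echo "no error"
    elif (( c == 0x0001 )); then echo "unclassified"
    elif (( c == 0x0002 )); then echo "microcode ROM parity error"
    elif (( c == 0x0003 )); then echo "external error"
    elif (( c == 0x0004 )); then echo "FRC error"
    elif (( c == 0x0005 )); then echo "internal parity error"
    elif (( c == 0x0006 )); then echo "SMM handler code access violation"
    elif (( c == 0x0400 )); then echo "internal timer error"
    elif (( (c & 0xFC00) == 0x0400 )); then echo "internal unclassified"   # 0000 01xx xxxx xxxx
    elif (( (c & 0xE800) == 0x0800 )); then echo "bus/interconnect error"  # 000F 1PPT RRRR IILL
    elif (( (c & 0xEF00) == 0x0100 )); then echo "cache hierarchy error"   # 000F 0001 RRRR TTLL
    elif (( (c & 0xEF80) == 0x0080 )); then echo "memory controller error" # 000F 0000 1MMM CCCC
    elif (( (c & 0xEFF0) == 0x0010 )); then echo "TLB error"               # 000F 0000 0001 TTLL
    elif (( (c & 0xEFFC) == 0x000C )); then echo "generic cache hierarchy error" # 000F 0000 0000 11LL
    else echo "not recognized here, look it up in the SDM"
    fi
}

decode_mci_status() {
    local hex=${1#0[xX]}
    if [[ ! $hex =~ ^[0-9a-fA-F]{1,16}$ ]]; then
        echo "not a 64-bit hex value: $1" >&2
        return 1
    fi
    # Left-pad to 16 digits and split into two 32-bit halves, which avoids
    # bash's signed 64-bit arithmetic overflowing when bit 63 (VAL) is set.
    hex=$(printf '%16s' "$hex"); hex=${hex// /0}
    local hi=$(( 16#${hex:0:8} )) lo=$(( 16#${hex:8:8} ))
    # Bits 63..55: valid, overflow, uncorrected, enabled, MISC valid, ADDR valid,
    # processor context corrupt, signaled, action required (S/AR only on CPUs with SER_P).
    local -a names=(VAL OVER UC EN MISCV ADDRV PCC S AR)
    local flags="" i
    for i in "${!names[@]}"; do
        if (( (hi >> (31 - i)) & 1 )); then
            flags+="${names[i]} "
        fi
    done
    local code=$(( lo & 0xFFFF )) mscod=$(( (lo >> 16) & 0xFFFF ))
    printf 'flags: %s\n' "${flags% }"
    printf 'MCA error code: 0x%04x (%s)\n' "$code" "$(mca_code_class "$code")"
    printf 'model-specific code: 0x%04x\n' "$mscod"
}
```

For example, `decode_mci_status 0x9c00000000010090` (a made-up value, not from the log above) reports the flags VAL EN MISCV ADDRV and a memory controller error. UC together with PCC is the combination that usually explains a halt and reset.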
 


Rasdaemon Compatibility: Rasdaemon should work with your Intel Coffee Lake processor, as long as the kernel side is set up for it. Ensure that your kernel has the necessary RAS (Reliability, Availability, and Serviceability) features enabled. Specifically, the CONFIG_RAS flag is crucial; CONFIG_RAS_CEC is also enabled on your kernel, and CONFIG_RAS_CEC_DEBUG is not mandatory for basic functionality.
Kernel Parameters: The mce=off parameter disables MCE handling in the kernel, which is most likely why Rasdaemon isn't logging anything: it can only report events the kernel passes on to it. Try removing this parameter so the kernel handles MCEs again. However, be cautious: on your system this means an MCE may halt and reset the machine.
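A quick way to confirm what the running kernel was actually booted with is to look for mce=off in /proc/cmdline. A minimal bash sketch (the function name mce_disabled is made up for this example; it takes an optional file path so it can be tried on a copy):

```shell
#!/usr/bin/env bash
# Sketch: check whether MCE handling was switched off on the kernel command line.
# Reads /proc/cmdline by default; a file path can be passed instead.
mce_disabled() {
    local cmdline=${1:-/proc/cmdline}
    # -w: match "mce=off" only as a whole word, not inside a longer option
    grep -qw -- 'mce=off' "$cmdline"
}

if mce_disabled; then
    echo "mce=off is set: the kernel is not handling machine checks"
else
    echo "mce=off is not set"
fi
```

On Debian with GRUB, kernel parameters usually live in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub; after removing mce=off there, run update-grub and reboot.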
Rasdaemon Configuration: Verify that Rasdaemon is correctly configured and running. You can check its status with:
systemctl status rasdaemon
Ensure it's enabled and actively monitoring.
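Log Files: Rasdaemon reports errors to the system log (the journal) and, if it was built with SQLite support, also stores them in a database, by default /var/lib/rasdaemon/ras-mc_event.db. Check there for any recorded MCEs.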
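The checks above can be bundled into one bash sketch. It uses systemctl, journalctl and ras-mc-ctl (the latter ships with rasdaemon; its --summary and --errors options read rasdaemon's database, so they only show events if rasdaemon was built with SQLite support). The function name rasdaemon_report is made up for this example, and it should be run as root.

```shell
#!/usr/bin/env bash
# Sketch: check that rasdaemon is running and dump what it and the kernel have logged.
# Run as root. Hypothetical function name.
rasdaemon_report() {
    systemctl is-enabled --quiet rasdaemon ||
        echo "rasdaemon is not enabled (systemctl enable --now rasdaemon)"
    systemctl is-active --quiet rasdaemon ||
        echo "rasdaemon is not running"
    # Summary and full list of errors stored in rasdaemon's database
    ras-mc-ctl --summary
    ras-mc-ctl --errors
    # Kernel machine-check messages from the current boot
    journalctl -k -b --no-pager | grep -iE 'mce|machine check'
    # rasdaemon's own log messages from the current boot
    journalctl -b -u rasdaemon --no-pager
}
```

Save it to a file, source it in a root shell and call `rasdaemon_report`. If everything stays empty while the kernel is still booted with mce=off, that matches the explanation above.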
 
If you don't know assembly and how to debug, you won't get any further than Intel's reference manual.

I suggest running the Intel Processor Diagnostic Tool first to check whether everything is OK with your CPU.
But now, I'm pretty sure you're going to dismiss this.
 
