I need help with mcelog - Machine Check Exception
Thursday, 13. August 2009, 11:30:00
My system is an Asus M51Sn notebook, which has an Intel Core 2 Duo T5550 CPU, 3GB of RAM and a GeForce 9500M GS video card. It runs Gentoo Linux amd64 (x86_64).
Over time, I've noticed that the system has recorded a few Machine Check Exceptions. Unfortunately, I have no idea what they mean or what caused them. What's more, they happen about once a month, but I never notice any side effects. This worries me even more: if I see something broken, I can fix it as soon as possible, but if everything seems to be working, I can never be sure whether something is broken under the hood and will get worse as time goes on.
Here is the full /var/log/mcelog (you need to install app-admin/mcelog in order to log MCE to that file):
2009-06-05 03:10:15 BRT
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC 2d7ca16e5a7a
STATUS 880901c0 MCGSTATUS 0

2009-06-22 03:10:10 BRT
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC 1d61709dba6e
STATUS 880901c0 MCGSTATUS 0

2009-07-20 22:40:09 BRT
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC 1e03f8a51387
STATUS 880b0100 MCGSTATUS 0

2009-08-13 03:10:18 BRT
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC 2cc70c476bc5
STATUS 880c0100 MCGSTATUS 0
Note, however, that MCEs are logged by a cron job that runs daily (I've just changed it to run hourly), so there might have been MCEs that weren't logged. Also, the date/time written to the log is the time when the cron job ran, not the time when the MCE actually happened.
The last MCE logged in /var/log/mcelog is from last night. I can't know exactly when it happened, but I know that I updated my Gentoo that night. I went to bed and left the notebook running python-updater, which in turn re-emerged (and thus recompiled) lots of packages. I know the CPU usage went to maximum and the machine got very hot, because the fan noise was pretty loud. Then, today, I found this in dmesg:
[114637.131021] CPU1: Temperature/speed normal
[114900.326042] Machine check events logged
So, my guess is that the Machine Check Exception I got was caused by the CPU overheating. I have no idea whether this is true; it is just a guess.
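One way to test this guess is to decode the STATUS values from the log. The sketch below rests on two assumptions that are not confirmed anywhere in this post: that BANK 128 is the Linux kernel's software-defined pseudo-bank for thermal events rather than a real CPU bank, and that STATUS then holds a raw copy of the Intel IA32_THERM_STATUS register, with the bit layout given in the comments. The function names are made up for illustration. If those assumptions hold, the entries would describe thermal events, not data corruption.

```python
# Assumption: BANK 128 is the kernel's software-defined thermal pseudo-bank,
# and STATUS is then a raw copy of the IA32_THERM_STATUS register.
THERMAL_BANK = 128

# (date, bank, status) copied from the /var/log/mcelog entries in this post.
LOGGED = [
    ("2009-06-05", 128, 0x880901C0),
    ("2009-06-22", 128, 0x880901C0),
    ("2009-07-20", 128, 0x880B0100),
    ("2009-08-13", 128, 0x880C0100),
]


def decode_therm_status(status):
    """Split a 32-bit value into fields, using the assumed register layout."""
    return {
        "throttling_now": bool(status & (1 << 0)),    # bit 0: thermal status
        "throttling_log": bool(status & (1 << 1)),    # bit 1: sticky log of bit 0
        "threshold1_now": bool(status & (1 << 6)),    # bit 6: threshold #1 status
        "threshold2_now": bool(status & (1 << 8)),    # bit 8: threshold #2 status
        "degrees_below_tjmax": (status >> 16) & 0x7F,  # bits 22:16: digital readout
        "resolution_c": (status >> 27) & 0xF,          # bits 30:27: resolution
        "reading_valid": bool(status & (1 << 31)),    # bit 31: readout is valid
    }


def describe(bank, status):
    """Return a one-line, human-readable summary of a logged entry."""
    if bank != THERMAL_BANK:
        return f"bank {bank}: not the thermal pseudo-bank, needs normal MCE decoding"
    fields = decode_therm_status(status)
    state = "throttled" if fields["throttling_now"] else "not throttled"
    if not fields["reading_valid"]:
        return f"{state}, no valid temperature reading"
    return f"{state}, about {fields['degrees_below_tjmax']} C below TjMax"


for date, bank, status in LOGGED:
    print(date, describe(bank, status))
```

Under these assumptions, every logged value has the throttling bit clear and a readout only 9 to 12 degrees below the CPU's maximum junction temperature. That would fit the "Temperature/speed normal" line in dmesg, i.e. a record made when the CPU returned from a hot state. It is still only an interpretation of the numbers, not a confirmed diagnosis.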
By the way, my desktop machine (which has an AMD 64 Sempron LE-1200) has no entries in /var/log/mcelog, so I assume there have been no MCEs on my desktop (at least not yet).
If you have relevant info about MCE, please post below in the comments! I would like to know what causes MCEs on this machine, what effects MCE has on the whole system, and if possible how to avoid them.

anzah # 17. August 2009, 01:01
CrazyTerabyte # 17. August 2009, 01:23
anzah # 17. August 2009, 01:38
A Mersenne prime search could be good for testing the CPU, though it wasn't the simplest thing to get running.
It should be possible to compute a list of known primes and check whether the results keep coming out correct. That should also heat up the CPU.
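A minimal sketch of that idea, assuming Python is available: count the primes below one million over and over and compare each result with the known count, 78498. The function names and the number of rounds are invented for illustration. A mismatch would point to a computation error, but a clean run does not prove the hardware is healthy. A single Python process also loads only one core, so start one copy per core to heat up the whole CPU.

```python
import time

# Known value: there are 78498 primes below 1,000,000.
LIMIT = 1_000_000
EXPECTED_COUNT = 78498


def count_primes_below(limit):
    """Count primes smaller than limit with a sieve of Eratosthenes (limit >= 2)."""
    sieve = bytearray([1]) * limit
    sieve[0:2] = b"\x00\x00"  # 0 and 1 are not prime
    for n in range(2, int(limit ** 0.5) + 1):
        if sieve[n]:
            # Clear every multiple of n, starting at n * n.
            sieve[n * n::n] = bytearray(len(range(n * n, limit, n)))
    return sum(sieve)


def run_check(rounds, limit=LIMIT, expected=EXPECTED_COUNT):
    """Repeat the computation and return how many rounds gave a wrong count."""
    mismatches = 0
    for i in range(rounds):
        got = count_primes_below(limit)
        if got != expected:
            mismatches += 1
            print(f"round {i}: expected {expected}, got {got}")
    return mismatches


if __name__ == "__main__":
    start = time.time()
    bad = run_check(rounds=5)  # raise the number of rounds for a longer burn-in
    print(f"{bad} mismatches in {time.time() - start:.1f} s")
```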
If it's caused by overheating, some motherboards have an option to raise an alarm when the temperature goes above a certain level. Most likely it's on by default. In that case you would have noticed if there was an overheating problem.
I noticed that such a feature existed on a computer that had all of its fans removed except the one in the power supply.