I got another MCE
Friday, 21. August 2009, 14:49:53
One week ago, I blogged about a machine check exception. Today (or rather yesterday), another MCE showed up on my machine. Damn, this is driving me crazy...
2009-08-20 12:00:01 BRT MCE 0 HARDWARE ERROR. This is *NOT* a software problem! Please contact your hardware vendor CPU 1 BANK 128 TSC 372f23a16b9 STATUS 880b0100 MCGSTATUS 0
At the time this MCE happened, I think I was updating my Gentoo system, and thus was compiling a few packages. The CPU usage was very high (of course!), and thus the CPU was hot. So, MAYBE overheating was the cause, but MAYBE not.
Last night, I also noticed some weird behavior. The ssh client was refusing to connect to a server, complaining about an incorrect hash. It made no sense, because nothing had been changed on that server. Then I tried "ssh -v -v" to make it a lot more verbose and... it stopped complaining and connected!
Damn! I felt the system was completely unreliable at that point. Maybe some part of RAM got corrupted, maybe something else. I should also note that I configured hibernate/suspend yesterday, so if something in memory was corrupt, that data was probably saved to disk and propagated to the following resumes.
And, what's more, almost every time an MCE happens, file system corruption also happens. Today I ran fsck and it found many errors on my root partition. MANY errors. Fortunately, I was lucky enough to lose only unimportant files (like old kernel sources that were going to be deleted anyway). A few months ago, however, I lost some of my stored mail due to corruption on my /home partition (probably also caused by an MCE).
For comparison, my Sempron 64 desktop machine (with a Western Digital hard disk after the Maxtor one failed) had an uptime of 40 days with no MCEs.
I know this is a hardware problem. I just wanna know what piece of hardware causes this, so I can go on and fix that!
Update, a few hours later: By the way, I left Memtest86+ running overnight. No error was detected after about 12 hours.
Update, later that night: According to Wikipedia, we should consult the Intel 64 and IA-32 Architectures Software Developer's Manual. I looked at it (very quickly) and couldn't understand most of it, and thus couldn't find anything that would help me decipher the exception.
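As a first step towards deciphering the MCE line logged on 2009-08-20, here is a minimal Python sketch that splits it into its fields and lists which bits are set in STATUS. The names parse_mce_line and set_bits, and the regular expression itself, are my own assumptions based only on that single logged line, not on any documented log format. It does not say what each bit means; that would still have to be looked up in the manual.

```python
import re

# Pattern guessed from the one logged line in this post (an assumption,
# not an official format). The hex fields have no "0x" prefix.
MCE_PATTERN = re.compile(
    r"CPU (?P<cpu>\d+) BANK (?P<bank>\d+) TSC (?P<tsc>[0-9a-f]+) "
    r"STATUS (?P<status>[0-9a-f]+) MCGSTATUS (?P<mcgstatus>[0-9a-f]+)"
)


def parse_mce_line(line):
    """Return the numeric fields of an MCE log line, or None if it does not match."""
    match = MCE_PATTERN.search(line)
    if match is None:
        return None
    return {
        "cpu": int(match.group("cpu")),
        "bank": int(match.group("bank")),
        "tsc": int(match.group("tsc"), 16),
        "status": int(match.group("status"), 16),
        "mcgstatus": int(match.group("mcgstatus"), 16),
    }


def set_bits(value):
    """Return the positions (0 = least significant) of the bits set in value."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


line = (
    "2009-08-20 12:00:01 BRT MCE 0 HARDWARE ERROR. This is *NOT* a software "
    "problem! Please contact your hardware vendor CPU 1 BANK 128 "
    "TSC 372f23a16b9 STATUS 880b0100 MCGSTATUS 0"
)

fields = parse_mce_line(line)
print(fields)
print("STATUS bits set:", set_bits(fields["status"]))
```

With the bit positions in hand, it becomes a matter of matching them against the register layout in the manual, instead of staring at a raw hex number.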
Update on 2009-08-29: Today I got another MCE, but now I'm pretty sure about what causes them: overheating. I was running a CPU- and GPU-intensive application when I noticed the CPU went above 70 °C, the GPU was probably above 90 °C, the fan noise was very loud, and then an MCE was logged.
Update on 2009-09-19: Yes, the cause was overheating. After cleaning the notebook, I haven't noticed any other MCE.








