CrazyTB's blog

I got another MCE


One week ago, I blogged about a machine check exception. Today (or yesterday), another MCE showed up on my machine. Damn, this is driving me crazy...


2009-08-20 12:00:01 BRT
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC 372f23a16b9 
STATUS 880b0100 MCGSTATUS 0

When this MCE happened, I think I was updating my Gentoo system, and thus compiling a few packages. The CPU usage was very high (of course!), so the CPU was hot. So, MAYBE overheating was the cause, but MAYBE not.

Last night, I also noticed some weird behavior. The ssh client was refusing to connect to a server, complaining about an incorrect hash. It made no sense, because nothing had changed on that server. Then I tried "ssh -v -v" to make it a lot more verbose and... it stopped complaining and connected!

Damn! I felt the system was completely unreliable at that point. Maybe some part of RAM got corrupted, maybe something else. I should also note that I configured hibernate/suspend yesterday, so if something was corrupt in memory, that data was probably saved to disk and propagated to the following resumes.

And, what's more, almost every time an MCE happens, file system corruption also happens. Today I ran fsck and it found many errors on my root partition. MANY errors. Fortunately, I was lucky enough to lose only unimportant files (like old kernel sources that were going to be deleted anyway). A few months ago, however, I lost some of my stored mail due to corruption on my /home partition (probably also caused by an MCE).

For comparison, my Sempron 64 desktop machine (with a Western Digital hard disk after the Maxtor one failed) had an uptime of 40 days with no MCEs.

I know this is a hardware problem, I just wanna know what piece of hardware causes this, so I can go on and fix that!

Update, a few hours later: By the way, I left Memtest86+ running overnight. No error was detected after about 12 hours.

Update, later that night: According to Wikipedia, we should consult the Intel 64 and IA-32 Architectures Software Developer's Manual. I looked at it (very quickly) and couldn't understand most of it, so I couldn't find anything that would help me decipher the exception.
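As a first step toward deciphering a record like the one above, here is a minimal Python sketch that just pulls the numeric fields out of the text. The function names and the regular expression are made up for illustration. It assumes that CPU and BANK are printed in decimal and the other fields in hexadecimal. It also assumes that bank 128 is the pseudo-bank the Linux kernel uses for thermal events, and that for such events bit 0 of STATUS means the CPU is currently above its trip temperature. Treat those last two points as assumptions to check against the kernel and mcelog sources, not as a definitive decoding.

```python
import re

# Assumption: the Linux kernel logs thermal events under pseudo-bank 128.
THERMAL_BANK = 128

# Field names as they appear in the record, each followed by a number.
# \b keeps "STATUS" from matching inside "MCGSTATUS".
FIELD_RE = re.compile(r"\b(CPU|BANK|TSC|STATUS|MCGSTATUS)\s+([0-9a-fA-F]+)")


def parse_mce_record(text):
    """Return the numeric fields of one MCE record as a dict of ints."""
    fields = {}
    for name, value in FIELD_RE.findall(text):
        # Assumption: CPU and BANK are decimal, everything else is hex.
        base = 10 if name in ("CPU", "BANK") else 16
        fields[name] = int(value, base)
    return fields


def looks_thermal(fields):
    """True if the record was logged under the assumed thermal pseudo-bank."""
    return fields.get("BANK") == THERMAL_BANK


def above_trip_temperature(fields):
    """Assumption: for thermal records, bit 0 of STATUS is the 'too hot' flag."""
    return bool(fields.get("STATUS", 0) & 1)


if __name__ == "__main__":
    record = """MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 128 TSC 372f23a16b9
STATUS 880b0100 MCGSTATUS 0
"""
    fields = parse_mce_record(record)
    for name, value in fields.items():
        print(f"{name:10s} {value:#x}")
    print("thermal bank?", looks_thermal(fields))
    print("above trip temperature?", above_trip_temperature(fields))
```

If the bank assumption holds, a record like mine would be a thermal event and not a cache or bus error, which at least narrows down where to look.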

Update on 2009-08-29: Today I got another MCE, but now I'm pretty sure about what causes them: overheating. I was running a CPU- and GPU-intensive application when I noticed the CPU had gone above 70 °C and the GPU was probably above 90 °C. The fan noise was very loud, and then an MCE was logged.

Update on 2009-09-19: Yes, the cause was overheating. After cleaning the notebook, I haven't noticed any other MCE.
