Linux x86_64: Detecting Hardware Errors

Posted on June 2, 2009 · 9 comments · Last updated June 18, 2009


Microsoft Windows shows the Blue Screen of Death (BSoD) after encountering a critical system error. Linux / UNIX like operating systems may get a kernel panic, which is the equivalent of a BSoD. Both a BSoD and a kernel panic can be triggered by a Machine Check Exception (MCE). MCE is a feature of AMD / Intel 64 bit systems that is used to detect hardware problems, including unrecoverable ones. MCE can detect:

  • Communication error between CPU and motherboard.
  • Memory error - ECC problems.
  • CPU cache errors and so on.


A program such as mcelog decodes machine check events (hardware errors) on x86-64 machines running a 64-bit Linux kernel. It should be run regularly as a cron job on any x86-64 Linux system. This is useful for predicting server hardware failure before an actual server crash.

Install mcelog

Type the following command under RHEL / CentOS / Fedora Linux, 64 bit kernel:
# yum install mcelog
Type the following command under Debian / Ubuntu Linux, 64 bit kernel:
# apt-get update && apt-get install mcelog
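Since mcelog needs a 64 bit kernel, it can help to confirm that before relying on the cron job. The following is a minimal sketch, not part of the mcelog package: the function name is_64bit_kernel is hypothetical, and the check of /dev/mcelog assumes your kernel exposes that device (the --ignorenodev option used in the cron jobs below exists for systems where it is missing).

```shell
#!/bin/sh
# Hypothetical helper: succeed when the given machine string
# (as printed by "uname -m") names a 64 bit x86 kernel.
is_64bit_kernel() {
    case "$1" in
        x86_64|amd64) return 0 ;;
        *)            return 1 ;;
    esac
}

# Hypothetical helper: report whether mcelog has a chance of working here.
check_mcelog_prereqs() {
    if ! is_64bit_kernel "$(uname -m)"; then
        echo "kernel is not 64 bit x86: $(uname -m)"
        return 1
    fi
    # Assumption: the kernel provides /dev/mcelog as a character device.
    if [ ! -c /dev/mcelog ]; then
        echo "/dev/mcelog not found (mcelog --ignorenodev will exit quietly)"
        return 1
    fi
    echo "64 bit kernel and /dev/mcelog present"
}
```

Calling check_mcelog_prereqs as root prints a one-line verdict and returns non-zero when either check fails.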

Default Cronjob

mcelog should be run regularly as a cron job on any x86-64 Linux system. By default, the following cron settings are used on Debian / Ubuntu Linux in /etc/cron.d/mcelog:

# /etc/cron.d/mcelog: crontab entry for the mcelog package
SHELL=/bin/sh
PATH=/sbin:/bin:/usr/sbin:/usr/bin
 
*/5 * * * *	root	test -x /usr/sbin/mcelog -a ! -e /etc/mcelog-disabled && /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
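The entry above only runs mcelog when /usr/sbin/mcelog is executable and the file /etc/mcelog-disabled does not exist, so creating that file switches the job off without editing the crontab. As a rough sketch, the same guard can be written as a shell function (should_run_mcelog is a hypothetical name used only for illustration):

```shell
#!/bin/sh
# Hypothetical helper mirroring the guard in the Debian cron entry:
#   test -x BINARY -a ! -e DISABLE_FLAG
# $1 = path to the mcelog binary, $2 = path to the "disabled" flag file.
should_run_mcelog() {
    [ -x "$1" ] && [ ! -e "$2" ]
}
```

For example, `should_run_mcelog /usr/sbin/mcelog /etc/mcelog-disabled && echo "cron job is active"` reports whether the five-minute job would actually do anything.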

CentOS / RHEL / Fedora Linux run an hourly cron job via /etc/cron.hourly/mcelog.cron:

#!/bin/bash
/usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog

How do I view error logs?

Use the tail or grep command:
# tail -f /var/log/mcelog
OR
# grep -i "hardware error" /var/log/mcelog
OR
# grep -ci "hardware error" /var/log/mcelog
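To see which CPUs the logged events point at, the CPU lines can be tallied. This is only a sketch: errors_per_cpu is a hypothetical helper, and it assumes each decoded event contains one line that starts with the word CPU followed by the CPU number, which may differ between mcelog versions.

```shell
#!/bin/sh
# Hypothetical helper: print "CPU-number event-count" pairs, one per line,
# sorted by CPU number. $1 = path to the mcelog log file.
errors_per_cpu() {
    awk '$1 == "CPU" && $2 ~ /^[0-9]+$/ { n[$2]++ }
         END { for (c in n) print c, n[c] }' "$1" | sort -n
}
```

Run it as `errors_per_cpu /var/log/mcelog`. Many events on a single CPU may point at one core or its cache, while events spread over all CPUs may suggest memory or board problems.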
Alternatively, you can send an email alert when a hardware error is found on the system (write a shell script and call it via a cron job):
# [ $(grep -ci "hardware error" /var/log/mcelog) -gt 0 ] && echo "Hardware Error Found $(hostname) @ $(date)" | mail -s 'H/w Error' pager@example.com
With this tool I was able to pick up a couple of hardware problems before a kernel panic, i.e. a server crash.
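The email alert can be wrapped in a small script for cron. The following is a minimal sketch under a few assumptions: count_hw_errors and alert_if_errors are hypothetical function names, the match is case-insensitive because mcelog prints the marker in upper case, and a working mail command is assumed to be installed.

```shell
#!/bin/sh
# Hypothetical helper: print the number of "hardware error" lines in a log.
# A missing or unreadable file counts as 0.
count_hw_errors() {
    n=$(grep -ci "hardware error" "$1" 2>/dev/null)
    echo "${n:-0}"
}

# Hypothetical helper: mail an alert when the log contains any errors.
# $1 = log file, $2 = recipient address.
alert_if_errors() {
    count=$(count_hw_errors "$1")
    if [ "$count" -gt 0 ]; then
        echo "Hardware Error Found $(hostname) @ $(date): $count event(s) in $1" |
            mail -s 'H/w Error' "$2"
    fi
}

# Example call from a cron script (uncomment to use):
# alert_if_errors /var/log/mcelog pager@example.com
```

Note that the log is only ever appended to, so this alert repeats on every run once an error has been logged. Rotating the log after you have looked at it is one simple way to quiet it.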

A Note About mcelog

  • You need to use a 64 bit Linux kernel and operating system to run mcelog. Machine checks can indicate failing hardware, system overheating, bad DIMMs or other problems. Some MCEs are fatal and generally cannot be survived without a reboot and h/w replacement, but I was able to catch lots of bad h/w before a crash with this tool.
  • mcat - A Windows command-line program from AMD to decode MCEs from AMD K8, Family 0x10 and 0x11 processors.
  • mcelog project home page.
  • mcedaemon - a daemon that can get MCE notifications as soon as the kernel finds them. It does not try to interpret the MCE data; it just alerts other apps.
  • Linux Kernel panic source code.
  • man mcelog
  • Machine check exception support information for MS-Windows server 2003 and XP operating systems.

{ 9 comments… read them below or add one }

1 Damian Myerscough June 3, 2009 at 10:24 am

Hello,

Are there any similar tools for 32-bit operating systems? You mention mcelog only works
with 64-bit operating systems.


2 nixCraft June 3, 2009 at 10:31 am

Nope, AFAIK.


3 david July 21, 2009 at 7:22 am

There are some other tools for other CPUs as well: Wikipedia


4 hi group October 23, 2014 at 12:36 am

i can update tools linux to backtrack


5 Lars Michelsen March 15, 2010 at 4:13 pm

Does anyone know about a working solution for 32bit operating systems on x86_64 hardware?


6 nawab April 28, 2010 at 8:28 pm

If I run your script I get this error:
/etc/cron.hourly/mcelog.cron
Usage:
mcelog [--k8|--p4|--generic] [--syslog] [mcelogdevice]
mcelog [--k8|--p4|--generic] --ascii
Decode machine check error records


7 Robby Nazareth September 4, 2011 at 7:34 am

Hi Vivek! I get a lot of information through your website. Thanks very much. Please help me decode these mcelog errors. I forwarded this case to HP, but according to HP it is a firmware issue. What do you have to say?
Node : BL280c-G6
1)plcg298: MCE 0
plcg298: HARDWARE ERROR. This is *NOT* a software problem!
plcg298: Please contact your hardware vendor
plcg298: CPU 11 BANK 5 TSC 7d0a8fb75c06bd [at 2934 Mhz 138 days 20:43:18 uptime (unreliable)]
plcg298: MISC 1091 ADDR 61797b458
plcg298: MCG status:
plcg298: MCi status:
plcg298: MCi_MISC register valid
plcg298: MCi_ADDR register valid
plcg298: MCA: corrected filtering (some unreported errors in same region)
plcg298: Data CACHE Level-1 Data-Read Error
plcg298: STATUS 8c20004000101135 MCGSTATUS 0
plcg371:

2) plcg423: MCE 0
plcg423: HARDWARE ERROR. This is *NOT* a software problem!
plcg423: Please contact your hardware vendor
plcg423: CPU 6 BANK 8 TSC 7ca01c751f525e [at 2934 Mhz 138 days 9:38:40 uptime (unreliable)]
plcg423: MISC 1008040200081588 ADDR 3f2c58200
plcg423: MCG status:
plcg423: MCi status:
plcg423: MCi_MISC register valid
plcg423: MCi_ADDR register valid
plcg423: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
plcg423: Transaction: Memory read error
plcg423: STATUS 8c0000400001009f MCGSTATUS 0
plcg423: MCE 1
plcg423: HARDWARE ERROR. This is *NOT* a software problem!
plcg423: Please contact your hardware vendor
plcg423: CPU 2 BANK 8 TSC 7ca01c751f5057 [at 2934 Mhz 138 days 9:38:40 uptime (unreliable)]
plcg423: MISC 1008040200081588 ADDR 3f2c58200
plcg423: MCG status:
plcg423: MCi status:
plcg423: MCi_MISC register valid
plcg423: MCi_ADDR register valid
plcg423: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
plcg423: Transaction: Memory read error
plcg423: STATUS 8c0000400001009f MCGSTATUS 0


8 rasoul April 20, 2012 at 9:36 pm

hi
i have problem to install any os on laptop and test the dvd & usb i dont know how install os


9 Rinshad August 20, 2013 at 5:06 am

Hi Vivek,
I am getting this error in mcelog:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC aeffd2efa9f1db
ADDR 65bc76a0
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 84ac
bit32 = err cpu0
bit46 = corrected ecc error
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 9456400184080813 MCGSTATUS 0

Please help ..

