
Linux x86_64: Detecting Hardware Errors

Microsoft Windows shows the Blue Screen of Death (BSoD) after it encounters a critical system error. Linux and UNIX-like operating systems may instead get a kernel panic, which is the equivalent of a BSoD. Either one can be triggered by a Machine Check Exception (MCE). MCE is a feature of AMD and Intel 64-bit systems by which the CPU reports hardware problems, including unrecoverable ones. MCE can detect:

  • Communication error between CPU and motherboard.
  • Memory error - ECC problems.
  • CPU cache errors and so on.
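
As a quick check before relying on MCE reporting, you can look for the mce and mca CPU flags. The following is only a minimal sketch: it assumes an x86 Linux system where /proc/cpuinfo has a "flags" line, and the function name has_cpu_flag is made up for this example.

```shell
#!/bin/sh
# has_cpu_flag FLAG [FILE]
# Exit status 0 if FLAG appears on the first "flags" line of FILE.
# FILE defaults to /proc/cpuinfo (assumption: x86 Linux layout).
has_cpu_flag() {
    flag=$1
    file=${2:-/proc/cpuinfo}
    # -w matches whole words only, so "mce" does not match a longer name
    grep -m1 '^flags' "$file" 2>/dev/null | grep -qw "$flag"
}

for f in mce mca; do
    if has_cpu_flag "$f"; then
        echo "$f: supported"
    else
        echo "$f: not reported"
    fi
done
```

If either flag is not reported, mcelog is unlikely to have anything to decode on that machine.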


A program such as mcelog decodes machine check events (hardware errors) on x86-64 machines running a 64-bit Linux kernel. It should be run regularly as a cron job on any x86-64 Linux system. This helps predict server hardware failure before the server actually crashes.

Install mcelog

Type the following command under RHEL / CentOS / Fedora Linux with a 64-bit kernel:
# yum install mcelog
Type the following command under Debian / Ubuntu Linux with a 64-bit kernel:
# apt-get update && apt-get install mcelog

Default Cronjob

By default, Debian / Ubuntu Linux installs the following cron settings in /etc/cron.d/mcelog, which run mcelog every five minutes:

# /etc/cron.d/mcelog: crontab entry for the mcelog package
SHELL=/bin/sh
PATH=/sbin:/bin:/usr/sbin:/usr/bin
 
*/5 * * * *	root	test -x /usr/sbin/mcelog -a ! -e /etc/mcelog-disabled && /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog

CentOS / RHEL / Fedora Linux runs an hourly cron job via /etc/cron.hourly/mcelog.cron:

#!/bin/bash
/usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
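
The Debian entry above only runs mcelog when the binary is executable and /etc/mcelog-disabled does not exist. To see whether that guard would currently pass on a host, something like the following sketch can be used. The function name should_run_mcelog is hypothetical; the two default paths are taken from the Debian cron entry.

```shell
#!/bin/sh
# should_run_mcelog [BINARY] [DISABLE_FLAG]
# Succeeds only when BINARY is executable and DISABLE_FLAG does not exist,
# mirroring the "test -x ... -a ! -e ..." guard in the Debian cron entry.
should_run_mcelog() {
    bin=${1:-/usr/sbin/mcelog}
    flag=${2:-/etc/mcelog-disabled}
    [ -x "$bin" ] && [ ! -e "$flag" ]
}

if should_run_mcelog; then
    echo "mcelog would run on the next cron tick"
else
    echo "mcelog is missing or disabled via /etc/mcelog-disabled"
fi
```

Creating /etc/mcelog-disabled is therefore a simple way to pause the Debian job without editing the crontab file.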

How do I view error logs?

Use the tail or grep command:
# tail -f /var/log/mcelog
OR
# grep -i "hardware error" /var/log/mcelog
OR
# grep -ci "hardware error" /var/log/mcelog
Alternatively, you can send an email alert when a hardware error is found on the system (write a shell script and call it via a cron job). Keep the -i option: mcelog writes the phrase as HARDWARE ERROR in upper case, so a case-sensitive search would miss it:
# [ $(grep -ci "hardware error" /var/log/mcelog) -gt 0 ] && echo "Hardware Error Found $(hostname) @ $(date)" | mail -s 'H/w Error' pager@example.com
With this tool I was able to pick up a couple of hardware problems before they caused a kernel panic, i.e. a server crash.
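
Because /var/log/mcelog only grows, the one-line check sends the same alert on every cron run once a single error has been logged. One possible way around that is to remember the previous count and alert only on new matches. This is a sketch, not a finished monitoring script: the state file path, the function names, and the LOG, STATE, and MAILTO variables are all assumptions made for this example.

```shell
#!/bin/sh
# count_hw_errors LOG: print the number of lines mentioning "hardware error"
count_hw_errors() {
    if [ -r "$1" ]; then
        # -i because mcelog prints the phrase in upper case;
        # "|| true" because grep exits 1 when the count is 0
        grep -ci "hardware error" "$1" || true
    else
        echo 0
    fi
}

# new_hw_errors LOG STATE: print how many matches appeared since the last call
new_hw_errors() {
    log=$1
    state=$2
    now=$(count_hw_errors "$log")
    last=0
    [ -r "$state" ] && last=$(cat "$state")
    echo "$now" > "$state"
    # A smaller count than last time means the log was rotated or truncated
    if [ "$now" -lt "$last" ]; then last=0; fi
    echo $((now - last))
}

LOG=${LOG:-/var/log/mcelog}
STATE=${STATE:-/var/tmp/mcelog-alert.count}   # hypothetical state file
MAILTO=${MAILTO:-pager@example.com}

if [ -r "$LOG" ]; then
    new=$(new_hw_errors "$LOG" "$STATE")
    if [ "$new" -gt 0 ]; then
        echo "$new new hardware error(s) on $(hostname) @ $(date)" |
            mail -s 'H/w Error' "$MAILTO"
    fi
fi
```

Saved as an executable script, it can be called from cron at the same interval as mcelog itself.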

A Note About mcelog

  • You need a 64-bit Linux kernel and operating system to run mcelog. Machine checks can indicate failing hardware, system overheating, bad DIMMs, or other problems. Some MCEs are fatal and generally cannot be survived without a reboot and hardware replacement, but I was able to catch lots of bad hardware before a crash with this tool.
  • mcat - A Windows command-line program from AMD to decode MCEs from AMD K8, Family 0x10 and 0x11 processors.
  • mcelog project home page.
  • mcedaemon - a daemon that can get MCE notifications as soon as the kernel finds them. It does not try to interpret the MCE data; it just alerts other apps.
  • Linux Kernel panic source code.
  • man mcelog
  • Machine check exception support information for MS-Windows server 2003 and XP operating systems.
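
When the log does contain machine checks, it can help to see whether they cluster on one CPU or bank, since repeated errors from the same place point at a specific part. The sketch below tallies records per CPU and bank. It assumes each record has a line containing "CPU <n> BANK <m>" (other decoders may print a different layout), and the function name summarize_mce is invented for this example.

```shell
#!/bin/sh
# summarize_mce FILE...
# Count machine check records per CPU/BANK pair.
# Assumption: each record contains the words "CPU <n> BANK <m>" on one line,
# optionally preceded by a host prefix such as "host1:".
summarize_mce() {
    awk '{
        for (i = 1; i + 3 <= NF; i++)
            if ($i == "CPU" && $(i + 2) == "BANK")
                n[$(i + 1) " " $(i + 3)]++
    }
    END {
        for (k in n) {
            split(k, p, " ")
            printf "CPU %s BANK %s: %d\n", p[1], p[2], n[k]
        }
    }' "$@" | sort
}

LOG=${LOG:-/var/log/mcelog}
if [ -r "$LOG" ]; then
    summarize_mce "$LOG"
fi
```

Each output line has the form "CPU <n> BANK <m>: <count>", sorted as text.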

Comments on this entry are closed.

  • Damian Myerscough June 3, 2009, 10:24 am

    Hello,

    Are there any similar tools for 32-bit operating systems? You mention mcelog only works
    with 64-bit operating systems.

  • nixCraft June 3, 2009, 10:31 am

    Noop. AFAIK.

  • david July 21, 2009, 7:22 am

    There are some other tools for other CPUs as well: Wikipedia

    • hi group October 23, 2014, 12:36 am

      i can update tools linux to backtrack

  • Lars Michelsen March 15, 2010, 4:13 pm

    Does anyone know of a working solution for 32-bit operating systems on x86_64 hardware?

  • nawab April 28, 2010, 8:28 pm

    if i run your script i am getting this error..
    /etc/cron.hourly/mcelog.cron
    Usage:
    mcelog [--k8|--p4|--generic] [--syslog] [mcelogdevice]
    mcelog [--k8|--p4|--generic] --ascii
    Decode machine check error records

  • Robby Nazareth September 4, 2011, 7:34 am

    Hi Vivek! I get a lot of information from your website, thanks very much. Please help me decode these mcelog errors. I forwarded this case to HP, but according to HP it is a firmware issue. What do you have to say?
    Node : BL280c-G6
    1)plcg298: MCE 0
    plcg298: HARDWARE ERROR. This is *NOT* a software problem!
    plcg298: Please contact your hardware vendor
    plcg298: CPU 11 BANK 5 TSC 7d0a8fb75c06bd [at 2934 Mhz 138 days 20:43:18 uptime (unreliable)]
    plcg298: MISC 1091 ADDR 61797b458
    plcg298: MCG status:
    plcg298: MCi status:
    plcg298: MCi_MISC register valid
    plcg298: MCi_ADDR register valid
    plcg298: MCA: corrected filtering (some unreported errors in same region)
    plcg298: Data CACHE Level-1 Data-Read Error
    plcg298: STATUS 8c20004000101135 MCGSTATUS 0
    plcg371:

    2) plcg423: MCE 0
    plcg423: HARDWARE ERROR. This is *NOT* a software problem!
    plcg423: Please contact your hardware vendor
    plcg423: CPU 6 BANK 8 TSC 7ca01c751f525e [at 2934 Mhz 138 days 9:38:40 uptime (unreliable)]
    plcg423: MISC 1008040200081588 ADDR 3f2c58200
    plcg423: MCG status:
    plcg423: MCi status:
    plcg423: MCi_MISC register valid
    plcg423: MCi_ADDR register valid
    plcg423: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
    plcg423: Transaction: Memory read error
    plcg423: STATUS 8c0000400001009f MCGSTATUS 0
    plcg423: MCE 1
    plcg423: HARDWARE ERROR. This is *NOT* a software problem!
    plcg423: Please contact your hardware vendor
    plcg423: CPU 2 BANK 8 TSC 7ca01c751f5057 [at 2934 Mhz 138 days 9:38:40 uptime (unreliable)]
    plcg423: MISC 1008040200081588 ADDR 3f2c58200
    plcg423: MCG status:
    plcg423: MCi status:
    plcg423: MCi_MISC register valid
    plcg423: MCi_ADDR register valid
    plcg423: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
    plcg423: Transaction: Memory read error
    plcg423: STATUS 8c0000400001009f MCGSTATUS 0

  • rasoul April 20, 2012, 9:36 pm

    hi
    i have problem to install any os on laptop and test the dvd & usb i dont know how install os

  • Rinshad August 20, 2013, 5:06 am

    Hi Vivek,
    I am getting this error in mcelog ,

    MCE 0
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 0 4 northbridge TSC aeffd2efa9f1db
    ADDR 65bc76a0
    Northbridge Chipkill ECC error
    Chipkill ECC syndrome = 84ac
    bit32 = err cpu0
    bit46 = corrected ecc error
    bus error 'local node origin, request didn't time out
    generic read mem transaction
    memory access, level generic'
    STATUS 9456400184080813 MCGSTATUS 0

    Please help ..