[wplug] deciphering machine check exceptions

Eli Heady eli.heady at gmail.com
Thu Jul 12 10:51:21 EDT 2007


Upon seeing this in the system logs,

Jul 12 08:02:42 hydra Machine check events logged

I ran mcelog and saw that there have been several machine check exception 
events on this box. My question is how I should use this information to 
determine which part(s) need replacing: cpu, memory, or (Server Gods forbid) 
the system board.

The machine in question is running two Opteron 250s, so the memory controller 
is integrated, but the mce logs report ECC errors in the cpu0 L2 cache and the 
northbridge, leading me to wonder whether the mce log could be identifying 
bad memory, a bad proc, a bad northbridge or some combination of the above. 
cpu0 and its memory were installed shortly before the mce events were logged. 
I ran memtest at the time and found no errors, fwiw. Is it worth running 
memtest again, or do these logs definitively say that cpu0 is the problem?

Thanks in advance!
-eli

mcelog output:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 2 bus unit TSC 2719bcf6e3fb28 
  L2 cache ECC error
  Bus or cache array error
       bit46 = corrected ecc error
  bus error 'local node origin, request didn't time out
      prefetch mem transaction
      memory access, level generic'
STATUS 9000400000000863 MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC 27b666b2df57dc 
ADDR 265a7ff0 
  Northbridge ECC error
  ECC syndrome = 9b
       bit46 = corrected ecc error
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 944dc00000000a13 MCGSTATUS 0
MCE 2
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC 28a4afdc430224 
ADDR 265a7ff0 
  Northbridge ECC error
  ECC syndrome = 9b
       bit46 = corrected ecc error
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 944dc00000000a13 MCGSTATUS 0
MCE 3
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC 2dcbd0e50c6844 
ADDR 265a7ff0 
  Northbridge ECC error
  ECC syndrome = 9b
       bit46 = corrected ecc error
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 944dc00000000a13 MCGSTATUS 0
MCE 4
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC 3188e8972d3b74 
ADDR 265a7ff0 
  Northbridge ECC error
  ECC syndrome = 9b
       bit46 = corrected ecc error
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 944dc00000000a13 MCGSTATUS 0
MCE 5
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 2 bus unit TSC 31bce8ffd3738c 
  L2 cache ECC error
  Bus or cache array error
       bit46 = corrected ecc error
  bus error 'local node origin, request didn't time out
      prefetch mem transaction
      memory access, level generic'
STATUS 9000400000000863 MCGSTATUS 0
MCE 6
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC 31bce8ffd37bc4 
ADDR 265a7ff0 
  Northbridge ECC error
  ECC syndrome = 9b
       bit32 = err cpu0
       bit46 = corrected ecc error
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS 944dc00100000813 MCGSTATUS 0
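
For anyone who wants to read the STATUS words above by hand, here is a rough
Python sketch of how the fields could be unpacked. The bit positions are my
assumptions based on the AMD K8 machine check layout (bit 63 valid, bit 61
uncorrected, bit 58 address valid, bit 46 corrected ECC, bits 54:47 ECC
syndrome, low 16 bits error code) and should be checked against AMD's
documentation. The function name decode_status and the label strings are
made up for illustration; this is not mcelog's own code.

```python
# Sketch only: unpack an MCi_STATUS value as printed on mcelog's STATUS line.
# Bit positions assume the AMD K8 layout; verify them before relying on this.

PARTICIPATION = {0: "local node origin", 1: "local node response",
                 2: "local node observed", 3: "generic"}
TRANSACTION = {0: "generic error", 1: "generic read", 2: "generic write",
               3: "data read", 4: "data write", 5: "instruction fetch",
               6: "prefetch", 7: "evict", 8: "snoop"}
ACCESS = {0: "memory access", 1: "reserved", 2: "I/O access", 3: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "level generic"}


def decode_status(status):
    """Return a dict of the fields packed into a 64-bit STATUS value."""

    def bit(n):
        return bool((status >> n) & 1)

    code = status & 0xFFFF  # low 16 bits: architectural error code
    fields = {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "enabled": bit(60),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),
        "corrected_ecc": bit(46),
        "ecc_syndrome": (status >> 47) & 0xFF,  # assumed K8 bits 54:47
        "error_code": code,
        "bus_error": None,
    }
    # Bus error codes have the form 0000 1PPT RRRR IILL.
    if code & 0xF800 == 0x0800:
        fields["bus_error"] = {
            "participation": PARTICIPATION[(code >> 9) & 0x3],
            "timed_out": bool((code >> 8) & 1),
            "transaction": TRANSACTION.get((code >> 4) & 0xF, "unknown"),
            "access": ACCESS[(code >> 2) & 0x3],
            "level": LEVEL[code & 0x3],
        }
    return fields


if __name__ == "__main__":
    # The three distinct STATUS values from the log above.
    for value in (0x9000400000000863, 0x944dc00000000a13, 0x944dc00100000813):
        print(hex(value), decode_status(value))
```

Under these assumptions, the northbridge STATUS values decode to syndrome 9b
with the corrected-ECC bit set and the uncorrected bit clear, which agrees
with what mcelog printed.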




-- 
Eli Heady
Systems Administrator
SEEGRID Corporation
eli at seegrid.com

