[wplug] deciphering machine check exceptions
Eli Heady
eli.heady at gmail.com
Thu Jul 12 10:51:21 EDT 2007
Upon seeing this in the system logs,
Jul 12 08:02:42 hydra Machine check events logged
I ran mcelog and saw that there have been several machine check exception
events on this box. My question is how I should use this information to
determine which part(s) need replacing: cpu, memory, or (Server Gods forbid)
the system board.
The machine in question is running two Opteron 250s, so the memory controller
is integrated, but the mce logs report ECC errors in cpu0 L2 cache and the
northbridge, leading me to wonder whether the mce log could be identifying
bad memory, bad proc, bad northbridge or some combination of the above. cpu0
and its memory were installed shortly before the mce events were logged. I
ran memtest at the time and found no errors, fwiw. Is it worth running
memtest again, or do these logs definitively say that cpu0 is the problem?
Thanks in advance!
-eli
mcelog output:
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 2 bus unit TSC 2719bcf6e3fb28
L2 cache ECC error
Bus or cache array error
bit46 = corrected ecc error
bus error 'local node origin, request didn't time out
prefetch mem transaction
memory access, level generic'
STATUS 9000400000000863 MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC 27b666b2df57dc
ADDR 265a7ff0
Northbridge ECC error
ECC syndrome = 9b
bit46 = corrected ecc error
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 944dc00000000a13 MCGSTATUS 0
MCE 2
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC 28a4afdc430224
ADDR 265a7ff0
Northbridge ECC error
ECC syndrome = 9b
bit46 = corrected ecc error
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 944dc00000000a13 MCGSTATUS 0
MCE 3
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC 2dcbd0e50c6844
ADDR 265a7ff0
Northbridge ECC error
ECC syndrome = 9b
bit46 = corrected ecc error
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 944dc00000000a13 MCGSTATUS 0
MCE 4
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC 3188e8972d3b74
ADDR 265a7ff0
Northbridge ECC error
ECC syndrome = 9b
bit46 = corrected ecc error
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 944dc00000000a13 MCGSTATUS 0
MCE 5
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 2 bus unit TSC 31bce8ffd3738c
L2 cache ECC error
Bus or cache array error
bit46 = corrected ecc error
bus error 'local node origin, request didn't time out
prefetch mem transaction
memory access, level generic'
STATUS 9000400000000863 MCGSTATUS 0
MCE 6
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC 31bce8ffd37bc4
ADDR 265a7ff0
Northbridge ECC error
ECC syndrome = 9b
bit32 = err cpu0
bit46 = corrected ecc error
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 944dc00100000813 MCGSTATUS 0
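
For anyone who wants to cross-check the STATUS values above by hand, here is a
minimal Python sketch that pulls a few fields out of the 64-bit value. The bit
positions are assumptions: the generic x86 machine-check status layout (valid,
overflow, uncorrected, address-valid bits) plus the K8-specific fields that
mcelog itself prints above (bit46, bit32, ECC syndrome). The function name
decode_status is just an illustration, not part of mcelog.

```python
# Sketch only: decode selected fields of an MCi_STATUS value from mcelog.
# Bit positions are assumed from the generic x86 MCA layout and from the
# K8 fields mcelog prints (bit46, bit32, ECC syndrome); verify against the
# processor documentation before relying on them.

def decode_status(status_hex):
    """Return a dict of selected fields from a 64-bit status value."""
    status = int(status_hex, 16)

    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),            # entry holds a real error
        "overflow": bit(62),         # an earlier error was overwritten
        "uncorrected": bit(61),      # error was not corrected by hardware
        "addr_valid": bit(58),       # an ADDR value accompanies the entry
        "corrected_ecc": bit(46),    # "bit46 = corrected ecc error"
        "uncorrected_ecc": bit(45),
        "err_cpu0": bit(32),         # "bit32 = err cpu0" (northbridge)
        "syndrome": (status >> 47) & 0xFF,  # meaningful for northbridge ECC
        "error_code": status & 0xFFFF,      # low 16 bits
    }


if __name__ == "__main__":
    samples = [
        ("bank 2, bus unit", "9000400000000863"),
        ("bank 4, northbridge", "944dc00000000a13"),
        ("bank 4, northbridge", "944dc00100000813"),
    ]
    for label, value in samples:
        fields = decode_status(value)
        print(label, value)
        for name, field in fields.items():
            shown = field if isinstance(field, bool) else hex(field)
            print("   ", name, "=", shown)
```

Run against the three distinct STATUS values in the log, this should agree with
mcelog's own text: every entry is flagged corrected rather than uncorrected,
and the northbridge entries carry syndrome 9b. It only restates what the
register says; it does not by itself tell you which part to replace.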
--
Eli Heady
Systems Administrator
SEEGRID Corporation
eli at seegrid.com