[wplug] deciphering machine check exceptions

Alheid, Gregory alheidgj at upmc.edu
Thu Jul 12 11:53:22 EDT 2007


Eli Heady inquired about MCEs on an Opteron 250.

Eli,

> cpu0 and it's memory were installed shortly
>   before the mce events were logged.

At worst, since I don't know what procedure was used to install
the CPU and memory, there may be ESD damage, which may or may
not continue to produce errors.

Also, if time stamps are available, the earliest MCE
is the most likely cause.
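
As a rough sketch of that check, the TSC values from the log quoted
further down can be sorted to see which event came first. The table
is copied by hand from that log, and the helper name by_time is only
for illustration. All seven records are from CPU 0, so I am assuming
the TSC values are comparable with each other.

```python
# TSC values copied by hand from the quoted log (all recorded on CPU 0).
records = {
    0: ("bus unit", "2719bcf6e3fb28"),
    1: ("northbridge", "27b666b2df57dc"),
    2: ("northbridge", "28a4afdc430224"),
    3: ("northbridge", "2dcbd0e50c6844"),
    4: ("northbridge", "3188e8972d3b74"),
    5: ("bus unit", "31bce8ffd3738c"),
    6: ("northbridge", "31bce8ffd37bc4"),
}


def by_time(recs):
    """Return (mce_number, unit, tsc) tuples ordered by TSC, oldest first."""
    return sorted(
        ((n, unit, int(tsc, 16)) for n, (unit, tsc) in recs.items()),
        key=lambda r: r[2],
    )


ordered = by_time(records)
for n, unit, tsc in ordered:
    print(f"MCE {n}  {unit:<12} TSC {tsc:#x}")
```

With these values the oldest record is MCE 0, the L2 cache ECC error.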

I suspect that the CPU 0 L2 cache error is the cause of the
errors, but I do not know whether the bad L2 cache bit has an
external connection:
 If there is no external connection, CPU 0 is most likely OK.
 If there is a direct external connection to the L2 cache,
   again there may be ESD damage.

So at best, you had a ONE-TIME ECC error. Where displayed,
the address and ECC syndrome are the same. You need to
continue monitoring for any additional errors.

Also, I've grouped the same types of errors together:

> MCE 0
> CPU 0 2 bus unit TSC 2719bcf6e3fb28
>  L2 cache ECC error
>  Bus or cache array error
>       bit46 = corrected ecc error

> MCE 5
> CPU 0 2 bus unit TSC 31bce8ffd3738c
>   L2 cache ECC error
>   Bus or cache array error
>        bit46 = corrected ecc error

- - - - - - - - - - - - -

> MCE 6
> CPU 0 4 northbridge TSC 31bce8ffd37bc4
> ADDR 265a7ff0
>   Northbridge ECC error
>   ECC syndrome = 9b
>        bit32 = err cpu0
>        bit46 = corrected ecc error

- - - - - - - - - - - - -

> MCE 1
> CPU 0 4 northbridge TSC 27b666b2df57dc
> ADDR 265a7ff0
>   Northbridge ECC error
>   ECC syndrome = 9b
>        bit46 = corrected ecc error

> MCE 2
> CPU 0 4 northbridge TSC 28a4afdc430224
> ADDR 265a7ff0
>   Northbridge ECC error
>   ECC syndrome = 9b
>        bit46 = corrected ecc error

> MCE 3
> CPU 0 4 northbridge TSC 2dcbd0e50c6844
> ADDR 265a7ff0
>   Northbridge ECC error
>   ECC syndrome = 9b
>        bit46 = corrected ecc error

> MCE 4
> CPU 0 4 northbridge TSC 3188e8972d3b74
> ADDR 265a7ff0
>   Northbridge ECC error
>   ECC syndrome = 9b
>        bit46 = corrected ecc error
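
For a longer log, the same grouping could be done mechanically. The
following is only a sketch: the field layout is assumed from the
quoted output above, the log text is an abbreviated copy of it, and
the names parse and group are made up for this example.

```python
import re
from collections import defaultdict

# Abbreviated copy of the quoted log, in the order shown above.
LOG = """\
> MCE 0
> CPU 0 2 bus unit TSC 2719bcf6e3fb28
> MCE 5
> CPU 0 2 bus unit TSC 31bce8ffd3738c
> MCE 6
> CPU 0 4 northbridge TSC 31bce8ffd37bc4
> ADDR 265a7ff0
>   ECC syndrome = 9b
> MCE 1
> CPU 0 4 northbridge TSC 27b666b2df57dc
> ADDR 265a7ff0
>   ECC syndrome = 9b
> MCE 2
> CPU 0 4 northbridge TSC 28a4afdc430224
> ADDR 265a7ff0
>   ECC syndrome = 9b
> MCE 3
> CPU 0 4 northbridge TSC 2dcbd0e50c6844
> ADDR 265a7ff0
>   ECC syndrome = 9b
> MCE 4
> CPU 0 4 northbridge TSC 3188e8972d3b74
> ADDR 265a7ff0
>   ECC syndrome = 9b
"""


def parse(log):
    """Turn the quoted text into one dict per MCE record."""
    records, cur = [], None
    for raw in log.splitlines():
        line = raw.lstrip("> ").strip()  # drop the quote marker and indent
        if m := re.match(r"MCE (\d+)$", line):
            cur = {"mce": int(m.group(1)), "unit": None,
                   "addr": None, "syndrome": None}
            records.append(cur)
        elif cur is None:
            continue  # ignore anything before the first record
        elif m := re.match(r"CPU (\d+) (\d+) (.+?) TSC ([0-9a-f]+)$", line):
            cur["unit"] = m.group(3)
        elif m := re.match(r"ADDR ([0-9a-f]+)$", line):
            cur["addr"] = m.group(1)
        elif m := re.match(r"ECC syndrome = ([0-9a-f]+)$", line):
            cur["syndrome"] = m.group(1)
    return records


def group(records):
    """Map (unit, address, syndrome) to the sorted MCE numbers sharing it."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["unit"], r["addr"], r["syndrome"])].append(r["mce"])
    return {key: sorted(nums) for key, nums in groups.items()}


groups = group(parse(LOG))
for key, nums in groups.items():
    print(key, nums)
```

This sketch keys only on the unit, address and syndrome, so it puts
MCE 6 in with the other northbridge errors. I listed MCE 6 separately
because it also has bit32 set; adding the flag bits to the key would
reproduce that split.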




