Tuesday, December 13, 2016

motherboard - How to Troubleshoot Cause of Memory Errors

Hope someone can give me an idea of how to resolve this issue short of replacing the whole machine.


Background/History


I have an ASUS P8Z68-M Pro MB / G620 CPU / 16GB DDR3 1333MHz CL 9-9-9-24 DRAM. The system is about 4 years old, and it had memory errors about 2 years ago. At that time I bought new RAM and RMA'd the bad set, keeping the replacement as a spare.


Last week I noticed some weird errors in FreeNAS (which have been happening for some time), so I took the machine down and started running Memtest86+ v4.2, and found an easily reproducible error in one of the DIMMs at address 0019bd12878.


The first time, the memory failed on Pass 1, Test 2. The error bits were 00010000: a 0 was expected, but a 1 was read.


The second time, the error was on Pass 1, Test 1. The error bits were 00020000: again a 0 was expected, but a 1 was read.
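For reference, each of the two error masks has a single bit set, and the two bits are adjacent. A minimal Python sketch of the decoding, assuming the mask is a plain hexadecimal value with bit 0 as the least significant bit (the helper name `set_bits` is just for illustration and is not part of Memtest86+):

```python
def set_bits(mask_hex):
    """Return the positions (0 = least significant) of the bits set in a hex mask."""
    mask = int(mask_hex, 16)
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

first_failure = set_bits("00010000")   # error bits reported on Pass 1, Test 2
second_failure = set_bits("00020000")  # error bits reported on Pass 1, Test 1
print(first_failure, second_failure)
```

Under that assumption, the first failure is in bit 16 and the second in bit 17 of the word.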


The problem was very easy to reproduce: I put the bad DIMM in a different slot for each of the two tests, and it failed both times.


The problem


I replaced the bad RAM with the spare RAM from the first RMA: brand new Patriot VIPER DDR3 1600MHz CL9-9-9-24, which I set up to run at 1333MHz in the BIOS (the G620 won't take the higher multiplier). I enabled XMP in the BIOS, and then set the clock speed to 1333.


I now have a weird situation with the replacement.


This ran fine for just over 24 hours, then I started getting a few errors at 0004d2fxxxx. (This is a range of addresses. The program only shows a few on the screen, and I don't have a printer hooked up to it or any other way to capture more details.)


Without taking down the machine, I changed the Memtest86+ settings to spot-test the area that was reporting the errors, and got about 4500 errors very quickly. All the errors were reported by Test 8, "Random Patterns".
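To narrow a spot test like this, the failing addresses can be converted into megabyte offsets. The sketch below is only an illustration: it assumes the addresses are physical byte addresses in hexadecimal, that the `xxxx` digits stand for everything from 0000 to ffff, and that 1 MB means 1024 * 1024 bytes. The function name `mib_range` is hypothetical, and how the resulting bounds are entered depends on the Memtest86+ version.

```python
MIB = 1024 * 1024  # assumption: 1 MB = 2**20 bytes

def mib_range(addr_pattern):
    """Map a hex address (trailing 'x' = unknown digit) to its enclosing [start, end) range in MiB."""
    low = int(addr_pattern.replace("x", "0"), 16)   # lowest address the pattern can match
    high = int(addr_pattern.replace("x", "f"), 16)  # highest address the pattern can match
    start = low // MIB       # round the lower bound down to a whole MiB
    end = high // MIB + 1    # round the upper bound up to the next whole MiB
    return start, end

intermittent = mib_range("0004d2fxxxx")  # errors seen with the replacement RAM
original = mib_range("0019bd12878")      # reproducible error in the first bad DIMM
print(intermittent, original)
```

With those assumptions, the intermittent errors fall between 1234 MB and 1235 MB, while the original failure sits at about 6589 MB, so the two problems are in quite different parts of the address space.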


When I tried to reproduce and localize the problem by pulling one of the two DIMMs, the errors stopped. So the power cycle and/or reinserting the other DIMM cleared the problem.


I went back to the original configuration, and so far it has been running error-free for over 37 hours, which makes it less likely to be a simple thermal problem.


Questions



  1. Any suggestions on how I can localize this problem?

  2. Any other test programs I should run that might help?

  3. Is this more likely to be a memory problem or a motherboard problem (or even a CPU or power supply issue)?


Any suggestions or input would be most appreciated.


Thanks.

