Friday, January 13, 2017

windows 7 - BSOD - meaning of bugcheck?


When logging into Windows 7 today, my PC instantly BSOD'd. Using WhoCrashed I get the following report:


--



  • On Tue 12.02.2013 13:56:20 GMT your computer crashed

  • crash dump file: C:\Windows\Minidump\021213-27390-01.dmp

  • uptime: 00:00:25

  • This was probably caused by the following module: ntoskrnl.exe (nt+0x1AA698)

  • Bugcheck code: 0x1000007E (0xFFFFFFFFC0000096, 0xFFFFF80003610698, 0xFFFFF8800614C7B8, 0xFFFFF8800614C010)

  • Error: SYSTEM_THREAD_EXCEPTION_NOT_HANDLED_M

  • file path: C:\Windows\system32\ntoskrnl.exe

  • product: Microsoft® Windows® Operating System

  • company: Microsoft Corporation

  • description: NT Kernel & System

  • Bug check description: This indicates that a system thread generated an exception which the error handler did not catch.


This appears to be a typical software driver bug and is not likely to be caused by a hardware problem.
The crash took place in the Windows kernel. Possibly this problem is caused by another driver which cannot be identified at this time.


--


Now, my PC had been crashing/freezing occasionally and on specific performance-heavy tasks in the past, but the cause of it (I thought) was a flawed RAM-slot in my motherboard. Keeping that slot empty stopped the crashes.


Today, it crashed again, and I have not changed anything hardware-related.


I know I could go around Google reading what this bugcheck code means, but lately I've come to realize that personal experience from somebody with the same bugcheck or problem is much more useful, especially as this person might have found a solution.


Thank you very much!


Answer



In this case, a thread encountered the exception


C0000096: STATUS_PRIVILEGED_INSTRUCTION
Executing an instruction not allowed in current machine mode.

This error was raised by the CPU itself. Some code tried to execute an instruction that it isn't allowed to execute. This is likely caused by memory corruption, where kernel code ended up executing junk data.


This kind of error really is impossible to pinpoint from the report alone. There was an error in "kernel" code that shouldn't have happened. It's extraordinarily unlikely that there's a software bug in any of Microsoft's code, which is why you begin to look elsewhere.



  • Third Party Drivers. Kernel mode drivers have full access to the physical hardware. A stray bug in any 3rd party driver (e.g. video, sound, network, USB 3.0, SATA) can corrupt the code or data of anything else in the system. Next steps: try removing newly added hardware, try booting in safe mode, or reinstall Windows (each of these keeps some third party drivers from being loaded)

  • Bad RAM. If a bit was flipped, and it turned a perfectly benign instruction into a different, invalid instruction, you could get this error. Next steps: remove a RAM stick, move RAM to other slots, underclock the RAM, change power supplies

  • Overclocking. Sometimes extraordinarily strange things can happen when you overclock. Hopefully everyone is sending Microsoft their crash dumps, because Microsoft does investigate them. A common crash they would see is one where the CPU was executing the instruction:


    xor eax, eax;

    This is about as simple an operation as the CPU can execute; it just sets the internal CPU register EAX to zero. There's no way it can fail, except when you overclock or have some other physical problem.



tl;dr: If you've eliminated the software, then it's the hardware.


Update: Troubleshooting Methodology


i wanted to mention the steps that i went through, almost mindlessly, when looking at this error.


The first was the actual bugcheck code:


0x1000007E - SYSTEM_THREAD_EXCEPTION_NOT_HANDLED_M

Binging that on Google gives the Microsoft documentation page:



Bug Check 0x1000007E: SYSTEM_THREAD_EXCEPTION_NOT_HANDLED_M


This indicates that a system thread generated an exception which the error handler did not catch.



i know, from experience being a developer, that if my application (or one of its threads) experiences an "exception", and i don't "handle" the exception, Windows will eventually handle it by killing the application. If an unhandled exception happens while in kernel mode, the OS has no choice but to handle it by shutting down the kernel. What i was interested in is which exception was being thrown. i assumed (incorrectly, it turns out) it was an "Access Violation".


i know that all bugchecks are accompanied by four parameters that describe what actually happened:



  • Parameter 1: 0xFFFFFFFFC0000096

  • Parameter 2: 0xFFFFF80003610698

  • Parameter 3: 0xFFFFF8800614C7B8

  • Parameter 4: 0xFFFFF8800614C010


But what the hell do these mean?! That's when we turn back to the documentation page, which doesn't describe them. But it does say:



Bug check 0x1000007E has the same meaning and parameters as bug check 0x7E (SYSTEM_THREAD_EXCEPTION_NOT_HANDLED).



Excellent. And this other page documents the parameters:



The following parameters appear on the blue screen.



  • Parameter 1: The exception code that was not handled

  • Parameter 2: The address where the exception occurred

  • Parameter 3: The address of the exception record

  • Parameter 4: The address of the context record
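
As a rough sketch of how that table applies to the report in the question, the bugcheck line can be split up and labeled in a few lines of Python. The function name `parse_bugcheck` is made up for illustration, and the last two steps rest on assumptions: that the `_M` code differs from 0x7E only by the 0x10000000 bit (true numerically for this code), and that `nt+0x1AA698` in the WhoCrashed report is an offset from the load address of ntoskrnl.exe.

```python
import re

# Labels taken from the parameter list for bug check 0x7E / 0x1000007E.
PARAM_LABELS = (
    "exception code that was not handled",
    "address where the exception occurred",
    "address of the exception record",
    "address of the context record",
)


def parse_bugcheck(line):
    """Split a 'Bugcheck code: 0x... (p1, p2, p3, p4)' line into its parts."""
    numbers = [int(h, 16) for h in re.findall(r"0x[0-9A-Fa-f]+", line)]
    code, params = numbers[0], numbers[1:5]
    return code, dict(zip(PARAM_LABELS, params))


line = ("Bugcheck code: 0x1000007E (0xFFFFFFFFC0000096, 0xFFFFF80003610698, "
        "0xFFFFF8800614C7B8, 0xFFFFF8800614C010)")
code, params = parse_bugcheck(line)

# Assumption: dropping the 0x10000000 bit gives the base bug check code.
base_code = code & 0x0FFFFFFF

# Assumption: "nt+0x1AA698" is an offset from the kernel image's load address,
# so subtracting it from Parameter 2 gives that load address.
fault_address = params[PARAM_LABELS[1]]
implied_nt_base = fault_address - 0x1AA698

print(f"bug check {code:#x} (base {base_code:#x})")
for label, value in params.items():
    print(f"  {label}: {value:#018x}")
print(f"implied ntoskrnl.exe base: {implied_nt_base:#018x}")
```

If both assumptions hold, Parameter 2 and the module offset in the report point at the same spot, and the implied load address comes out page-aligned, which is at least consistent.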



This is what i wanted, the exception code that was not handled. In your case it was exception code:


0xFFFFFFFFC0000096

i know, from experience, that you're running 64-bit Windows, because that code is 64 bits long. Really i only want the lower 32 bits (the leading F's are just sign extension):


0xC0000096
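
A minimal sketch of that truncation, plus a look at how the 32-bit value is laid out. The field split used here (2 severity bits, a 12-bit facility, a 16-bit code) is the standard NTSTATUS layout; the helper names are my own.

```python
def low32(value):
    """Keep only the lower 32 bits of a 64-bit bugcheck parameter."""
    return value & 0xFFFFFFFF


def sign_extend_32_to_64(value32):
    """Widen a 32-bit value to 64 bits the way a signed integer is widened."""
    if value32 & 0x80000000:
        return value32 | 0xFFFFFFFF00000000
    return value32


def decode_ntstatus(status):
    """Break a 32-bit NTSTATUS value into severity, facility and code."""
    severities = ("success", "informational", "warning", "error")
    return {
        "severity": severities[status >> 30],   # top two bits
        "facility": (status >> 16) & 0xFFF,     # 12-bit facility
        "code": status & 0xFFFF,                # low 16 bits
    }


param1 = 0xFFFFFFFFC0000096
status = low32(param1)

print(hex(status))                              # 0xc0000096
print(sign_extend_32_to_64(status) == param1)   # True
print(decode_ntstatus(status))
```

The top two bits being set is what marks the value as an error status, and because the top bit is set, widening it to 64 bits fills the upper half with F's.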

Normally i would have expected to find that error code in winerror.h in my development directory; but it wasn't there. It took some Binging, but i found that searching for:



winerror C0000096



led me to a page on winehq that declared the constant:


STATUS_PRIVILEGED_INSTRUCTION = 0xC0000096

Binging for that constant led me to a canonical Microsoft documentation page:



STATUS_PRIVILEGED_INSTRUCTION: Executing an instruction not allowed in current machine mode.
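
To illustrate the lookup, here is a small hand-picked table of exception codes that can show up as Parameter 1. It is nowhere near complete; the values are the ones defined in ntstatus.h, and the function name `name_exception` is hypothetical.

```python
# A few NTSTATUS exception codes and their symbolic names (not a full list).
KNOWN_EXCEPTIONS = {
    0x80000003: "STATUS_BREAKPOINT",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC0000094: "STATUS_INTEGER_DIVIDE_BY_ZERO",
    0xC0000096: "STATUS_PRIVILEGED_INSTRUCTION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}


def name_exception(param1):
    """Map bugcheck Parameter 1 to a symbolic name, if it is in the table."""
    status = param1 & 0xFFFFFFFF  # drop the sign-extended upper 32 bits
    return KNOWN_EXCEPTIONS.get(status, f"unknown status {status:#010x}")


print(name_exception(0xFFFFFFFFC0000096))
print(name_exception(0xFFFFFFFFC0000005))
```

The second call is the "Access Violation" case, included only for comparison with the code in this report.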



I also know that this exception is thrown by the CPU itself. i know that because "Privileged Instruction" means you tried to execute a CPU instruction you're not allowed to execute. i also know this because the page is called Hardware Exceptions.


So we're at the point where some code was running that tried to execute a CPU instruction it wasn't supposed to. There are two possibilities:



  • memory was corrupted; the software wasn't written to try to execute that code, but that's what just ended up in RAM

  • it really is buggy software, and it tried to do something it's not allowed to do.


Given that Microsoft's code is constantly being field tested on millions of machines every day, it's more likely:



  • to be a problem with your hardware

  • to be a bug in someone else's code causing problems




Anyway, that was how i worked through that bugcheck. Maybe knowing how i went through it can help you the next time you have a bugcheck.

