When the "probably caused by" is incorrect, how does that work?

Patrick

Sysnative Staff
Joined
Jun 7, 2012
Posts
4,618
So I've been analyzing BSODs for a fair bit of time now. I've solved hundreds and hundreds of cases, and I can do a fair bit of complicated things, but I am having trouble understanding something so simple that I feel a bit silly and embarrassed. However, we're all here to learn, and without questions there would be no answers!

I was always under the assumption that when the "probably caused by" is a signed Microsoft driver, or just an incorrect fault like ntoskrnl.exe, it's because Windows has no idea which driver caused the crash, or whether a driver was even the cause in the first place, so it reports an incorrect fault. Okay, if that is the case, how exactly does it go about choosing that incorrect fault? We all know that whatever causes a blue screen occurs before Windows actually recognizes it. So, if something causes a blue screen, Windows doesn't find out in time, and the culprit is gone, does it randomly go "okay, ntoskrnl.exe... I saw you first, I don't like you, you did it!", and that's it?

Basically, I'm just trying to wrap my mind around how Windows ends up reporting an incorrect fault, and why, if a driver is the issue, Windows cannot always catch it in the act.
 
First off, the fault usually isn't incorrect. Most times when it says that ntoskrnl.exe was at fault, it was. But what it doesn't say is what caused ntoskrnl.exe to crash - the true cause.
We assume that ntoskrnl.exe isn't at fault because it contains the kernel (core) of the OS, and it's protected by System File Protection. So it's less likely that ntoskrnl.exe is the actual cause. Rather, it's usually that ntoskrnl.exe was crashed by a misbehaving 3rd party app/driver.

But ntoskrnl.exe can be at fault, and it's most likely that you'd see it in a hacked copy of an OS - so you can never just dismiss it without checking for other signs.
 
Ah! So that is why ntoskrnl is mentioned often in various dump files.

Yes. To add to what usasma has already said, and to put it a slightly different way, ntoskrnl (or its equivalent, e.g. when PAE is enabled) usually gets blamed because of the call stack. I don't know how much you know about programming, so what follows is an extreme oversimplification; there is plenty of advanced reading on the internet should you be interested in the details. For example, I am mixing user mode and kernel mode in ways you can't really, but it works for my example. Very simply, blocks of code are called functions. These blocks of code call other blocks of code, which call yet more blocks of code, perhaps even in a separate module (a file, simplistically), and the end result is a long chain of code calling other code.

Look at this stack excerpt:

nt!KeBugCheckEx
nt!KiBugCheckDispatch+0x69
nt!KiPageFault+0x260
nt!MiInsertPageInFreeOrZeroedList+0x54c

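Reading from the bottom up: nt!MiInsertPageInFreeOrZeroedList leads to nt!KiPageFault (strictly speaking, a trap handler entered on the fault rather than an ordinary call), which calls nt!KiBugCheckDispatch, which calls nt!KeBugCheckEx.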

Now look at this fictitious stack:

nt!KernelCreateNewThread+0x08
MyProgram!CreateNewThread+0x08
MyProgram!Main+0x08

Main is the name given to the first function called in a new process (well, sometimes at least!). Notice how MyProgram starts off, calls CreateNewThread which decides the settings which the new thread will be created with, and then passes those settings to the kernel to actually create the thread. Basically everything has to go through the kernel at some point, and this means that the kernel almost always ends up on the call stack somewhere.

But why have a stack at all? Well, the kernel has now created this new thread, but it personally doesn't have a use for it. MyProgram is what wants it. And so the kernel must keep a record of who wanted the thread object, so that it can be returned to that bit of code. This is where the stack comes in. Notice how the stack shows that MyProgram!CreateNewThread is who wanted the object? So the kernel begins a stack unwind, and starts crawling back up the stack, until it gets to the block of code which wants the new thread.

Stacks are kept absolutely up to date. Whenever a function is called, it is pushed onto the stack. Whenever a function returns, it is popped from the stack.


So...

the stack starts off as MyProgram!Main+0x08

When CreateNewThread is called, it is pushed onto the stack, creating

MyProgram!CreateNewThread+0x08
MyProgram!Main+0x08

When CreateNewThread calls KernelCreateNewThread (in ntoskrnl, coincidentally), that is also pushed onto the stack...

nt!KernelCreateNewThread+0x08
MyProgram!CreateNewThread+0x08
MyProgram!Main+0x08

When KernelCreateNewThread is finished, and returns the new thread, it is popped:

MyProgram!CreateNewThread+0x08
MyProgram!Main+0x08

CreateNewThread's job is done, so it immediately returns the new thread as well and is popped...

MyProgram!Main+0x08

Now the Main method wants to use the thread object, so it calls and pushes...

MyProgram!UseNewThread+0x08
MyProgram!Main+0x08

which eventually completes, returns, and pops:

MyProgram!Main+0x08

Finally, when the program is ready to quit, Main returns and pops (it returns back to the kernel code which created the new process).

That is the stack...very simplistically.
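If it helps to see the push/pop sequence above in code, here is a minimal sketch in Python. It is only a toy model: the stack is a plain list, the frame names are the fictitious ones from my example (not real APIs), and the offsets are left out.

```python
# Toy model of a call stack: a list whose last element is the top.
# Frame names mirror the fictitious example above; none are real APIs.

def run_example():
    stack = []
    snapshots = []

    def push(frame):
        stack.append(frame)
        # Record the stack top-first, the way a debugger prints it.
        snapshots.append(list(reversed(stack)))

    def pop():
        stack.pop()
        snapshots.append(list(reversed(stack)))

    push("MyProgram!Main")
    push("MyProgram!CreateNewThread")
    push("nt!KernelCreateNewThread")
    pop()   # KernelCreateNewThread returns the new thread
    pop()   # CreateNewThread returns it to Main
    push("MyProgram!UseNewThread")
    pop()   # UseNewThread completes
    pop()   # Main returns, the program quits
    return snapshots


if __name__ == "__main__":
    for snapshot in run_example():
        print(snapshot)
```

Each printed list is one of the stack states walked through above, ending with an empty stack once Main has returned.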


In this stack:

nt!KernelCreateNewThread+0x08
MyProgram!CreateNewThread+0x08
MyProgram!Main+0x08

!analyze -v will correctly identify MyProgram as the cause of any bugs (of which there are none in this fictitious example).

But now let us assume there is a bug:


nt!KernelCreateNewThread+0x08
MyProgram!CreateNewProcess+0x08
MyProgram!Main+0x08

This is very wrong. Assume here that Main calls CreateNewProcess, expecting a new process to be created. CreateNewProcess sets up all the settings as though it will create a new process. But then, accidentally (read BUG!), it calls nt!KernelCreateNewThread, instead of nt!KernelCreateNewProcess, passing process data to a thread function.

Windows will become really confused here. What is nt!KernelCreateNewThread supposed to do with process data? It has no idea. It crashes = BSOD.

nt!KernelCreateNewThread is not buggy though. It was passed bad data, and didn't know what to do with it. MyProgram!CreateNewProcess is what is buggy, by calling the wrong kernel mode function. MyProgram needs updating, in the hope that this bug has been fixed.

In the above call stack, MyProgram should get the blame.
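The same situation can be reproduced in miniature with ordinary Python functions. This is only an analogy, not kernel code: `kernel_create_new_thread` and `create_new_process` are hypothetical stand-ins for the fictitious functions above, and a `KeyError` plays the part of the crash.

```python
import traceback


def kernel_create_new_thread(settings):
    # Stand-in for the fictitious nt!KernelCreateNewThread.
    # It only understands thread settings and fails on anything else.
    return {"kind": "thread", "priority": settings["priority"]}


def create_new_process():
    # BUG: this builds *process* settings, then calls the *thread* routine.
    process_settings = {"image_path": "example.exe"}
    return kernel_create_new_thread(process_settings)


def failing_frames():
    """Return the function names on the stack at the failure, top first."""
    try:
        create_new_process()
    except KeyError as exc:
        frames = traceback.extract_tb(exc.__traceback__)
        return [frame.name for frame in reversed(frames)]
    return []


if __name__ == "__main__":
    print(failing_frames())
```

The failure happens inside `kernel_create_new_thread`, so that is the top frame, but the bug lives one frame down in `create_new_process`. As long as that caller is still visible on the stack, it can be given the blame.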

But now let us assume that there has been a stack trash. This call stack:

nt!KernelCreateNewThread+0x08
MyProgram!CreateNewProcess+0x08
MyProgram!Main+0x08

becomes this:

nt!KernelCreateNewThread+0x08
0x7c90eb94+0x08

What is the only driver remaining? nt (ntoskrnl). Was nt!KernelCreateNewThread really at fault? No. Can !analyze -v do anything other than blame ntoskrnl? No. Does it do just that? Yes.

There. Now you see ntoskrnl getting the blame, even though it wasn't buggy, but still for a very good reason.
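To make that concrete, here is a rough sketch in Python of what is left to work with once the stack is trashed. It is not how !analyze -v is implemented; `module_of` and `modules_on_stack` are hypothetical helpers that just split the `module!function+offset` text used in my examples.

```python
def module_of(frame):
    """Return the module part of a 'module!function+offset' frame, or None
    for a raw address that no longer maps to any known module."""
    if "!" not in frame:
        return None
    return frame.split("!", 1)[0]


def modules_on_stack(frames):
    # Keep only the frames that still resolve to a module, in stack order.
    resolved = (module_of(frame) for frame in frames)
    return [module for module in resolved if module is not None]


intact = [
    "nt!KernelCreateNewThread+0x08",
    "MyProgram!CreateNewProcess+0x08",
    "MyProgram!Main+0x08",
]
trashed = [
    "nt!KernelCreateNewThread+0x08",
    "0x7c90eb94+0x08",
]

if __name__ == "__main__":
    print(modules_on_stack(intact))
    print(modules_on_stack(trashed))
```

With the intact stack, MyProgram is still among the candidates. With the trashed stack, nt is the only name left, so nt is the only thing that can be blamed.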

P.S. There are many other reasons why a driver might not show, other than a stack trash. Stack trashes almost always result in 0x1E crashes. I will try to get an example soon. There are a good two pages in Windows Internals on stack trashes if you're still interested, though.

Hope this helps.

Richard
 
neimiro and usasma did an excellent job explaining things. I'd like to summarize a couple things just to add to some already good points.

Concerning the "probably caused by" output, it is generated by a generic analysis of the situation that works primarily on the call stack that faulted at the time. It rates the drivers listed in the call stack as potential suspects. As far as I understand it, third-party drivers that lack symbols or a signature are the prime suspects, and signed Windows kernel drivers are the least suspect. When the faulting call stack consists only of Windows kernel code like the nt module (this can happen, for example, when the faulting call stack belongs to a worker thread), it has no choice but to blame a Windows driver, picking one from the list of candidates according to the ranking system (the nt module appears to be rated highest and therefore least suspect).
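A ranking like that could be sketched along these lines. To be clear, this is my guess at the general shape, not the debugger's actual logic: the `suspicion` scores, the dictionary fields, and the tie-breaking rule (the module nearest the top of the stack wins) are all assumptions made for illustration.

```python
def suspicion(module):
    """Hypothetical score: higher means more suspect."""
    if module["microsoft_signed"]:
        return 0          # Windows' own modules are least suspect
    score = 1             # any third-party module starts above that
    if not module["has_symbols"]:
        score += 1
    if not module["signed"]:
        score += 1
    return score


def probably_caused_by(stack_modules):
    # max() keeps the first of several equally ranked items, so ties go
    # to the module nearest the top of the stack (an assumption).
    return max(stack_modules, key=suspicion)["name"]


nt = {"name": "nt", "microsoft_signed": True, "signed": True, "has_symbols": True}
other_ms = {"name": "examplems", "microsoft_signed": True, "signed": True, "has_symbols": True}
third_party = {"name": "example3p", "microsoft_signed": False, "signed": True, "has_symbols": False}

if __name__ == "__main__":
    print(probably_caused_by([nt, third_party]))
    print(probably_caused_by([nt, other_ms]))
```

When a third-party module is on the stack it outranks nt. When only Windows modules are present, something from that list still has to be named.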

That's why the BSOD mentions it, and why it's also mentioned in WinDbg. However, if you're wondering why a fault wasn't found earlier, that's already been explained well by neimiro and usasma. That's why, as JC pointed out, it's good to turn on Driver Verifier: it adds extra checks that Windows uses to bugcheck the system, in the hope that it will fault earlier, at the time the original bug occurs, rather than long after the crime has been committed, when someone steps into the scene unexpectedly. The attempt is always to catch it in the act, never afterwards. If you can't get that close (which usually only happens with live kernel debugging anyway), then you try to arrive at the scene as early as possible, where the tracks are fresh and the culprit is still nearby.
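The idea of "catching it in the act" can be shown with a toy example in Python. This is not how Driver Verifier is implemented; the `Pool` class, the `verify` flag and the driver names are all made up to show the difference between failing at the time of the bad write and failing later in innocent code.

```python
class Pool:
    """Toy memory pool: a flat list of cells, each owner gets a range."""

    def __init__(self, size, verify=False):
        self.cells = [0] * size
        self.verify = verify
        self.owners = {}  # owner name -> (start, length)

    def allocate(self, owner, start, length):
        self.owners[owner] = (start, length)

    def write(self, owner, offset, value):
        start, length = self.owners[owner]
        # Extra check, only when verification is on: stay inside your range.
        if self.verify and not 0 <= offset < length:
            raise RuntimeError(f"{owner} wrote outside its allocation")
        self.cells[start + offset] = value


def scenario(verify):
    pool = Pool(8, verify=verify)
    pool.allocate("kernel", 0, 4)
    pool.allocate("baddriver", 4, 4)
    try:
        pool.write("kernel", 0, 1234)
        pool.write("baddriver", -4, 9999)   # BUG: lands on the kernel's cell 0
        if pool.cells[0] != 1234:           # much later, the kernel reads it back
            raise RuntimeError("kernel found corrupted data")
    except RuntimeError as exc:
        return str(exc)
    return "no crash"


if __name__ == "__main__":
    print(scenario(verify=False))
    print(scenario(verify=True))
```

Without the extra check, the failure shows up in the kernel's own code after the damage is done. With it, the failure happens at the bad write and names the one responsible.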

There are exceptions to this, however. For example, when a leak of IRPs or some other form of memory occurs, the more it leaks, the more evident the problem (and therefore the source) becomes. A good solid case of the sort is one in which IRPs from AV software were failing to be completed, so nonpaged pool filled up with unfinished IRPs and a crash followed when a system service faulted because it failed to allocate the resources it needed. Looking at the list of queued IRPs, it is very evident that something is amiss when you see all those pending IRPs, and you can then work back to the driver responsible for them and discover the problem. Early on, the leak wouldn't have manifested itself much, making the bad IRPs much harder to spot unless you were following a strict, coherent trail of info straight to the one responsible.
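Here is a small Python sketch of why the leak gets easier to pin down the longer it runs. The IRP records, the driver names and the `threshold` value are invented for illustration; in a real dump you would be reading this information out of the debugger, not out of a list of dictionaries.

```python
from collections import Counter


def pending_by_driver(irps):
    # Count only the IRPs that have not been completed yet.
    return Counter(irp["driver"] for irp in irps if irp["pending"])


def likely_leaker(irps, threshold):
    """Return the driver with the most pending IRPs, if it stands out."""
    counts = pending_by_driver(irps)
    if not counts:
        return None
    driver, count = counts.most_common(1)[0]
    return driver if count >= threshold else None


def make_irps(driver, count, pending=True):
    return [{"driver": driver, "pending": pending} for _ in range(count)]


# Early in the leak: a few pending IRPs, nothing stands out.
early = make_irps("exampleav", 2) + make_irps("exampledisk", 1)
# Much later: the same driver now owns hundreds of unfinished IRPs.
late = make_irps("exampleav", 500) + make_irps("exampledisk", 1)

if __name__ == "__main__":
    print(likely_leaker(early, threshold=100))
    print(likely_leaker(late, threshold=100))
```

Early on, nothing crosses the threshold and there is no obvious suspect. Later, one driver owns almost every pending IRP and the trail leads straight to it.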

It's all about correlating a cause with a culprit, using deductive reasoning and other sleuthlike capacities to find good solid answers. Automated analysis can only go so far, and it's up to us to start where it stopped, even if it means sometimes starting back at square one and looking at the big picture.
 
