I'd like to take you all on a little adventure with me.
I'm working with a missionary I'm good friends with who has been experiencing some BSODs. At the moment he's only given me a few dumps. While a couple were inconclusive (despite all being taken with Driver Verifier enabled), one sorta stuck out, which is attached and mentioned below:
The video scheduler has detected that fatal violation has occurred. This resulted
in a condition that video scheduler can no longer progress. Any other values after
parameter 1 must be individually examined according to the subtype.
Arg1: 0000000000000001, The driver has reported an invalid fence ID.
LAST_CONTROL_TRANSFER: from fffff880044d822f to fffff80002e93d00
fffff880`0a714de8 fffff880`044d822f : 00000000`00000119 00000000`00000001 00000000`00008be2 00000000`00008c5c : nt!KeBugCheckEx
fffff880`0a714df0 fffff880`04137eb9 : 00000000`00000000 00000000`00008be2 00000000`00000000 00000000`00008c5c : watchdog!WdLogEvent5+0x11b
fffff880`0a714e40 fffff880`04138125 : fffffa80`09b4f000 fffff880`0a714f70 00000000`000011ac fffff8a0`12a87c10 : dxgmms1!VidSchiVerifyDriverReportedFenceId+0xad
fffff880`0a714e70 fffff880`04137f76 : 00000000`00008be2 fffff880`0a715001 fffffa80`09b43000 00000000`00000001 : dxgmms1!VidSchDdiNotifyInterruptWorker+0x19d
fffff880`0a714ec0 fffff880`0403f13f : fffffa80`087a5040 fffff800`02e968a4 fffff800`00000002 fffff800`00000000 : dxgmms1!VidSchDdiNotifyInterrupt+0x9e
fffff880`0a714ef0 fffff880`00c1ecca : 00000000`00000000 fffffa80`087a3040 00000000`00000000 fffff800`02e966ef : dxgkrnl!DxgNotifyInterruptCB+0x83
fffff880`0a714f20 00000000`00000000 : fffffa80`087a3040 00000000`00000000 fffff800`02e966ef fffff880`03164180 : atikmpag+0x4cca
fffff880`04137eb9 c744244053eeffff mov dword ptr [rsp+40h],0FFFFEE53h
Now just so you know, I initially had about as much understanding of this as you probably do while reading it. I had absolutely no clue what Fence IDs are. However, I did some lookin up and found the following concerning em: Windows Vista and Later Display Driver Model Operation Flow.
So I went through the motions of it and got a bit of an idea what a Fence ID is. It's apparently a "ticket" for the GPU to have access to process a DMA buffer. For those unaware, DMA means Direct Memory Access, which in this case is a way for the GPU to mess with system memory directly without havin to hassle the CPU or OS. This is the apparent process. Do you see anything familiar in relation to the call stack listed above in the crashdump?
The DirectX graphics kernel subsystem calls the display miniport driver's DxgkDdiSubmitCommand function to queue the DMA buffer to the GPU execution unit. Each DMA buffer submitted to the GPU contains a fence identifier, which is a number. After the GPU finishes processing the DMA buffer, the GPU generates an interrupt.
The display miniport driver is notified of the interrupt in its DxgkDdiInterruptRoutine function. The display miniport driver should read, from the GPU, the fence identifier of the DMA buffer that just completed.
The display miniport driver should call the DxgkCbNotifyInterrupt function to notify the DirectX graphics kernel subsystem that the DMA buffer completed. The display miniport driver should also call the DxgkCbQueueDpc function to queue a deferred procedure call (DPC).
So where in the process of this did the crash occur? As you can tell, it's during the "NotifyInterrupt" call at the very end, on step 16, which notifies DirectX that a DMA buffer completed. Part of this notification is a pointer to a data structure (DXGKARGCB_NOTIFY_INTERRUPT_DATA), and part of the data in that structure is the fence ID.
Apparently what we have here is that after the GPU finished processing the DMA buffer, it notified the graphics driver that it was done and gave it the ID number for the DMA buffer (the Fence ID). The graphics driver passed this along as part of its notification to DirectX, DirectX took a look at the Fence ID, and bugged out, thinking, "This fence ID doesn't look familiar at all. Something ain't right!" So it told Windows to stop everything cuz it *appears* as if the GPU got illegal access to memory.
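To make that hand-off a bit more concrete, here's a toy Python model of the submit / notify / verify sequence. Big caveat: the real check inside dxgmms1!VidSchiVerifyDriverReportedFenceId isn't public, so the rule used here (a reported fence ID must be newer than the last completed one and no newer than the last submitted one) is purely my assumption, and ToyScheduler and InvalidFenceId are made-up names for illustration.

```python
class InvalidFenceId(Exception):
    """Stands in for the 0x119 bugcheck (hypothetical name)."""


class ToyScheduler:
    """Toy model of a scheduler that tracks fence IDs for DMA buffers."""

    def __init__(self):
        self.last_submitted = 0   # highest fence ID handed to the "GPU"
        self.last_completed = 0   # highest fence ID reported as finished

    def submit(self):
        """Queue a DMA buffer and return the fence ID assigned to it."""
        self.last_submitted += 1
        return self.last_submitted

    def notify_dma_completed(self, reported):
        """Driver reports a completed fence ID; verify it before accepting.

        ASSUMED rule: last_completed < reported <= last_submitted.
        """
        if not (self.last_completed < reported <= self.last_submitted):
            raise InvalidFenceId(
                f"reported {reported:#x}, expected a value in "
                f"({self.last_completed:#x}, {self.last_submitted:#x}]"
            )
        self.last_completed = reported


sched = ToyScheduler()
fence_ids = [sched.submit() for _ in range(3)]   # three buffers queued
sched.notify_dma_completed(2)                     # accepted

try:
    sched.notify_dma_completed(1)                 # older than last completed
    rejected = False
except InvalidFenceId:
    rejected = True
```

In this model a stale or never-issued ID gets rejected no matter whether the GPU, the driver, or memory corruption produced it, which is why the bugcheck by itself doesn't say who's at fault.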
Part of me thinks this isn't so much a graphics driver issue as it is a graphics hardware issue. That's my initial diagnosis, and right now I'm still working with him to gather more info on this to verify what's what. As for my end, right now I'd like to know a few things in case anyone can help me:
If anyone else has had similar bsods that they've resolved and the culprits behind em. Was it typically hardware, and what hardware was it? Was it the drivers?
I'd like to know what the FenceID was. However, I'm unfamiliar with the dt command in Windbg and I'm not sure where to point it to or how. For those wondering, this command points to a data structure and reveals its contents and info on it. Since this is part of the notification process to DirectX about the DMA buffer completion, I should be able to see the FenceID inside the notification data structure.
I'd like to know what the FenceID was prior to the DMA buffer completion. If I knew this as well as what it was after the completion (when it bugged out), I can discern if the DMA buffer access itself was bad, or if the returned FenceID from the GPU ended up gettin corrupted somehow. Not sure how or if it's even possible to get this info, though.
This obviously isn't the end of my journey on this. I'll keep posting as I make progress on finding an answer and get more info from the guy about the situation.
Comments from previous discussion on this:
It looks like the crash is in the directx routine that reports the out of order fence returns. There are quite a number of bugs logged on this for Windows 7, and they run the gamut of ATI, Nvidia, and Intel video drivers as root causes. What is actually happening under the covers is that these FenceIDs are being returned out-of-order, and thus the bugcheck (why dx says "that's not right", because there's a proper way to return these). Again, in every case I can find, it was a driver (not hardware) issue, and the external vendor would be tasked with resolving the issues with their driver on customer hardware.
Unfortunately, the problem happens in the external driver before it hits directx, so I can't tell you why it's happening, but the likelihood it's a hardware issue is probably almost nil if it isn't also bugchecking with a 116. It could be power-related, though, so if the machine is older checking the PSU isn't a bad idea.
Sorry I can't provide the debug, but the directx drivers aren't public on purpose, and I don't feel comfortable putting any of that out here even amongst this small group given the protection around this source.
Interrupts (or ISRs; Interrupt Service Routines) to handle device I/O need to be done very quickly or risk holding up the entire system (because of high IRQL), so what usually happens is the interrupt is designed to merely create a DPC, or Deferred Procedure Call, to defer (hence the name) the responsibility of handling the I/O till later. The DPC itself, once it is next in the DPC queue, will then do the actual servicing of the device's I/O. The interrupt is only there to notify the system to prepare for I/O, while it is the DPC itself that does all the work. Windows Internals 5th Edition explains all this in the I/O System chapter. If you have the 6th edition, you'll have to wait until Part 2 of it comes out.
So what's going on is that the interrupt has already done its work, and it is now the DPC (the actual I/O) that's doing the work, in which DirectX is involved (obviously some form of video/audio I/O). You can check the DPC queue for each processor using !dpcs in Windbg. Obviously this information, like most, is usually not available in a minidump, but if you give it the number of the processor that was running at the time of the crash (you can tell by the Windbg prompt which proc you're in) you may get lucky, though I doubt it.
Understand that the KDPC data structure is opaque, in that it is an internal structure with limited public information on it, so you kinda have to walk it out, fiddle with it, and figure it out on your own. Also, it's not something that a driver is allowed to manipulate, only the Windows kernel can. So if you discover that a driver has tampered with this or even is attempting to write to it, you know the driver is being unscrupulous (a driver can point to it, though, just not edit). That's not to say it's the case you're dealing with, however.
Gah, I don't think the OP will respond though. He mentioned he does not have time for hardware diagnostics / troubleshooting, so it's likely I won't be able to take in much knowledge from this specific analysis. Also, I tried running a !dpcs command and got the following -
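For anyone who'd like the ISR/DPC split from this comment spelled out, here's a minimal Python sketch. It's only a toy: ToyCpu, its methods, and the device name are hypothetical, and nothing here reflects the real KDPC layout or kernel APIs.

```python
from collections import deque


class ToyCpu:
    """Toy model of one processor with its own DPC queue."""

    def __init__(self):
        self.dpc_queue = deque()   # pending deferred work, first in first out
        self.log = []              # what happened, in order

    def isr(self, device, fence_id):
        """Interrupt handler: do the bare minimum, then defer the rest."""
        self.log.append(f"ISR {device}: queued DPC")
        self.dpc_queue.append((device, fence_id))

    def drain_dpcs(self):
        """Later, at lower priority: run queued DPCs to do the real work."""
        while self.dpc_queue:
            device, fence_id = self.dpc_queue.popleft()
            self.log.append(f"DPC {device}: serviced fence {fence_id}")


cpu = ToyCpu()
cpu.isr("gpu", 1)      # two interrupts arrive back to back
cpu.isr("gpu", 2)
cpu.drain_dpcs()       # the deferred work only runs afterwards
```

Both ISR entries land in the log before either DPC entry, which is the whole point of the split: the interrupt path stays short and the heavy lifting happens later.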
3: kd> !dpcs
CPU Type KDPC Function
Failed to read DPC at 0xfffffa800615b0c8
Failed to read DPC at 0xfffff88002fd5318
When you load up a crashdump, the processor and thread context that's initially loaded is the one that was most recent during time of the crashdump, as in the one that was active at that time. You can tell the thread by doing !thread, but the processor is much easier, by just looking at the Windbg prompt:
Oh, of course, if you change the processor context using ~, then the prompt will adjust accordingly, but this is the one that showed up for me when I opened this particular kernel dump. Instead of defaulting to processor 0, it automatically was set to proc 2, which was running at time of crash.
Yah, I saw that, good catch. Look at Arg 2-4, which are actually the fence ids. Either Arg3 is the fence id it expected, or the fence id directly prior to the one we're dealing with. Either way, it's evident we're looking at one messed up fence id in Arg 2. Clearly it's an overwritten value, perhaps from stack overflow or some other driver nonsense.
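If you want to pull those values out without eyeballing the hex, here's a small Python sketch that parses the nt!KeBugCheckEx frame from the stack quoted at the top of the thread. I'm assuming the four columns shown for that frame are the bugcheck code followed by parameters 1 through 3, which matches the 0x119 and Arg1 = 1 in the analysis output. Which of the two remaining values is the reported fence ID and which is the expected one is not something this can tell you.

```python
# KeBugCheckEx frame copied verbatim from the stack trace in the first post.
FRAME = (
    "fffff880`0a714de8 fffff880`044d822f : 00000000`00000119 "
    "00000000`00000001 00000000`00008be2 00000000`00008c5c : nt!KeBugCheckEx"
)


def to_int(token):
    """Convert a WinDbg 64-bit value such as fffff880`0a714de8 to an int."""
    return int(token.replace("`", ""), 16)


def parse_frame(line):
    """Split a stack frame line into (child_sp, ret_addr, args, symbol)."""
    left, args, symbol = [part.strip() for part in line.split(" : ")]
    child_sp, ret_addr = [to_int(tok) for tok in left.split()]
    return child_sp, ret_addr, [to_int(tok) for tok in args.split()], symbol


_, _, (code, arg1, arg2, arg3), symbol = parse_frame(FRAME)
gap = arg3 - arg2   # distance between the two fence-looking values

print(f"{symbol}: code={code:#x} arg1={arg1:#x} arg2={arg2:#x} arg3={arg3:#x}")
print(f"arg3 - arg2 = {gap}")
```

For the dump in the first post the two values are 0x8be2 and 0x8c5c, which are 122 apart.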
A fence is an instruction that contains 64 bits of data and an address. The display miniport driver can insert a fence in the direct memory access (DMA) stream that is sent to the graphics processing unit (GPU). When the GPU reads the fence, the GPU writes the fence data at the specified fence address. However, before the GPU can write the fence data to memory, it must ensure that all of the pixels from the primitives that precede the fence instruction are retired and properly written to memory. Note: The GPU is not required to stall the entire pipeline while it waits for the last pixel from the primitives that precede the fence instruction to retire; the GPU can instead run the primitives that follow the fence instruction.
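As a rough illustration of that description, here's a toy Python model of a DMA stream with a fence in it. It's strictly sequential, so it doesn't model the overlap mentioned in the note, and the command tuples, the address, and the fence value are all made up for the example.

```python
FENCE_MASK = (1 << 64) - 1   # a fence carries 64 bits of data


def run_dma_stream(commands, memory):
    """Process commands in order; write fence data once prior draws retired.

    Returns a list of (address, primitives_retired_before_the_fence).
    """
    retired = []
    fence_log = []
    for command in commands:
        kind = command[0]
        if kind == "draw":
            retired.append(command[1])            # primitive fully retired
        elif kind == "fence":
            _, address, data = command
            memory[address] = data & FENCE_MASK   # GPU writes the fence data
            fence_log.append((address, tuple(retired)))
        else:
            raise ValueError(f"unknown command: {kind}")
    return fence_log


memory = {}
stream = [
    ("draw", "tri0"),
    ("draw", "tri1"),
    ("fence", 0x1000, 0x2A),   # arbitrary address and fence value
    ("draw", "tri2"),
]
fence_log = run_dma_stream(stream, memory)
```

After the run, the fence value sits at the fence address and the log shows that only tri0 and tri1 had retired when it was written. Reading that memory location is how the driver would learn which buffer finished.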
Hardware that supports per-GPU-context virtual address space must support the following types of fences:
Regular fences are fences that can be inserted in a DMA buffer that is created in user mode. Because the content of a DMA buffer from user mode is not trusted, fences within such a DMA buffer must refer to a virtual address in the GPU context address space and not to a physical address. Access to such a virtual address is bound by the same memory validation mechanism as any other virtual address that the GPU accesses.
Privileged fences are fences that can be inserted only in a DMA buffer that is created (and only accessible) in kernel mode. Fences within such a DMA buffer refer to a physical address in memory.
Note that if the fence target address was accessible in user mode, malicious software could perform a graphics operation over the memory location for the fence and therefore override the content of what the kernel expected to receive.