Vir Gnarus
BSOD Kernel Dump Expert
- Mar 2, 2012
- 474
Hi all,
I'd like to take you all on a little adventure with me.
I'm working with a missionary that I'm good friends with that has been experiencing some bsods. At the moment he's only given me a few. While a couple were inconclusive (despite all being DV-enabled), one sorta stuck out, which is attached and mentioned below:
Now just so you know, I had initially hardly just as much understanding on this as you probably do while reading this. I have absolutely no clue what Fence IDs are. However, I did some lookin up and noticed the following concerning em: Windows Vista and Later Display Driver Model Operation Flow.
So I go through the motions of it and I get a bit of an idea what a Fence ID is. It's apparently a "ticket" for the GPU to have access to process a DMA buffer. For those unaware, DMA means Direct Memory Access, which means a connection for - in this case - the GPU to be able to mess with system memory directly without havin to hassle the cpu or OS. This is the apparent process. Do you see anything familiar in relation to the call stack listed above in the crashdump?
So where in the process of this did the crash occur? As you can tell, it's during the "NotifyInterrupt" function at the very end, on step 16 - all notifying that a DMA buffer completed. Part of this notification is a pointer pointing to a data structure (DXGKARGCB_NOTIFY_INTERRUPT_DATA), and part of the data in that structure is the fence ID.
Apparently what we have here, is that after the GPU finished processing the DMA buffer, it notified the graphics driver that it finished doing what it wanted to do and gave it the id number for the DMA buffer (the Fence ID). The graphics driver gives this as part of the notification to DirectX that it got done, DirectX took a look at the Fence ID, and bugs out, thinking, "This fence ID doesn't look familiar at all. Something ain't right!" So it tells Windows to stop everything cuz it *appears* as if the gpu got illegal access to memory.
Part of me thinks this isn't so much a graphics driver issue as it is a graphics hardware issue. That's my initial diagnosis, and right now I'm still working with him to gather more info on this to verify what's what. As for my end, right now I'd like to know a few things in case anyone can help me:
This obviously isn't the end of my journey on this. I'll be continuing as I progress with finding an answer on this and extra more info from the guy about the situation.
UPDATE:
Comments from previous discussion on this:
I'd like to take you all on a little adventure with me.
I'm working with a missionary that I'm good friends with that has been experiencing some bsods. At the moment he's only given me a few. While a couple were inconclusive (despite all being DV-enabled), one sorta stuck out, which is attached and mentioned below:
Code:
[FONT=Courier New]VIDEO_SCHEDULER_INTERNAL_ERROR (119)
The video scheduler has detected that fatal violation has occurred. This resulted
in a condition that video scheduler can no longer progress. Any other values after
parameter 1 must be individually examined according to the subtype.
Arguments:
Arg1: 0000000000000001, The driver has reported an invalid fence ID.
Arg2: 0000000000008be2
Arg3: 0000000000008c5c
Arg4: 0000000000008c5b
Debugging Details:
------------------
CUSTOMER_CRASH_COUNT: 1
DEFAULT_BUCKET_ID: VERIFIER_ENABLED_VISTA_MINIDUMP
BUGCHECK_STR: 0x119
PROCESS_NAME: svchost.exe
CURRENT_IRQL: a
LAST_CONTROL_TRANSFER: from fffff880044d822f to fffff80002e93d00
STACK_TEXT:
fffff880`0a714de8 fffff880`044d822f : 00000000`00000119 00000000`00000001 00000000`00008be2 00000000`00008c5c : nt!KeBugCheckEx
fffff880`0a714df0 fffff880`04137eb9 : 00000000`00000000 00000000`00008be2 00000000`00000000 00000000`00008c5c : watchdog!WdLogEvent5+0x11b
fffff880`0a714e40 fffff880`04138125 : fffffa80`09b4f000 fffff880`0a714f70 00000000`000011ac fffff8a0`12a87c10 : dxgmms1!VidSchiVerifyDriverReportedFenceId+0xad
fffff880`0a714e70 fffff880`04137f76 : 00000000`00008be2 fffff880`0a715001 fffffa80`09b43000 00000000`00000001 : dxgmms1!VidSchDdiNotifyInterruptWorker+0x19d
fffff880`0a714ec0 fffff880`0403f13f : fffffa80`087a5040 fffff800`02e968a4 fffff800`00000002 fffff800`00000000 : dxgmms1!VidSchDdiNotifyInterrupt+0x9e
fffff880`0a714ef0 fffff880`00c1ecca : 00000000`00000000 fffffa80`087a3040 00000000`00000000 fffff800`02e966ef : dxgkrnl!DxgNotifyInterruptCB+0x83
fffff880`0a714f20 00000000`00000000 : fffffa80`087a3040 00000000`00000000 fffff800`02e966ef fffff880`03164180 : atikmpag+0x4cca
STACK_COMMAND: kb
FOLLOWUP_IP:
dxgmms1!VidSchiVerifyDriverReportedFenceId+ad
fffff880`04137eb9 c744244053eeffff mov dword ptr [rsp+40h],0FFFFEE53h
SYMBOL_STACK_INDEX: 2
SYMBOL_NAME: dxgmms1!VidSchiVerifyDriverReportedFenceId+ad
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: dxgmms1
IMAGE_NAME: dxgmms1.sys
DEBUG_FLR_IMAGE_TIMESTAMP: 4ce799c1
FAILURE_BUCKET_ID: X64_0x119_VRF_dxgmms1!VidSchiVerifyDriverReportedFenceId+ad
BUCKET_ID: X64_0x119_VRF_dxgmms1!VidSchiVerifyDriverReportedFenceId+ad
Followup: MachineOwner[/FONT]
Now just so you know, I had initially hardly just as much understanding on this as you probably do while reading this. I have absolutely no clue what Fence IDs are. However, I did some lookin up and noticed the following concerning em: Windows Vista and Later Display Driver Model Operation Flow.
So I go through the motions of it and I get a bit of an idea what a Fence ID is. It's apparently a "ticket" for the GPU to have access to process a DMA buffer. For those unaware, DMA means Direct Memory Access, which means a connection for - in this case - the GPU to be able to mess with system memory directly without havin to hassle the cpu or OS. This is the apparent process. Do you see anything familiar in relation to the call stack listed above in the crashdump?
14.
The DirectX graphics kernel subsystem calls the display miniport driver's DxgkDdiSubmitCommand function to queue the DMA buffer to the GPU execution unit. Each DMA buffer submitted to the GPU contains a fence identifier, which is a number. After the GPU finishes processing the DMA buffer, the GPU generates an interrupt.
15.
The display miniport driver is notified of the interrupt in its DxgkDdiInterruptRoutine function. The display miniport driver should read, from the GPU, the fence identifier of the DMA buffer that just completed.
16.
The display miniport driver should call the DxgkCbNotifyInterrupt function to notify the DirectX graphics kernel subsystem that the DMA buffer completed. The display miniport driver should also call the DxgkCbQueueDpc function to queue a deferred procedure call (DPC).
So where in the process of this did the crash occur? As you can tell, it's during the "NotifyInterrupt" function at the very end, on step 16 - all notifying that a DMA buffer completed. Part of this notification is a pointer pointing to a data structure (DXGKARGCB_NOTIFY_INTERRUPT_DATA), and part of the data in that structure is the fence ID.
Apparently what we have here, is that after the GPU finished processing the DMA buffer, it notified the graphics driver that it finished doing what it wanted to do and gave it the id number for the DMA buffer (the Fence ID). The graphics driver gives this as part of the notification to DirectX that it got done, DirectX took a look at the Fence ID, and bugs out, thinking, "This fence ID doesn't look familiar at all. Something ain't right!" So it tells Windows to stop everything cuz it *appears* as if the gpu got illegal access to memory.
Part of me thinks this isn't so much a graphics driver issue as it is a graphics hardware issue. That's my initial diagnosis, and right now I'm still working with him to gather more info on this to verify what's what. As for my end, right now I'd like to know a few things in case anyone can help me:
- If anyone else has had similar bsods that they've resolved and the culprits behind em. Was it typically hardware, and what hardware was it? Was it the drivers?
- I'd like to know what the fenceID was. However, I'm unfamiliar with the dt command in Windbg and I'm not sure where to point it too and how. To those wondering, this command points to a data structure and reveals its contents and info on it. Since this is part of the notification process to DirectX about the DMA buffer completion, I should be able to see the FenceID inside the notification data structure.
- I'd like to know what the FenceID was prior to the DMA buffer completion. If I knew this as well as what it was after the completion (when it bugged out), I can discern if the DMA buffer access itself was bad, or if the returned FenceID from the GPU ended up gettin corrupted somehow. Not sure how or if it's even possible to get this info, though.
This obviously isn't the end of my journey on this. I'll be continuing as I progress with finding an answer on this and extra more info from the guy about the situation.
UPDATE:
Comments from previous discussion on this:
cluberti said:It looks like the crash is in the directx routine that reports the out of order fence returns. There are quite a number of bugs logged on this for Windows 7, and they run the gamut of ATI, Nvidia, and Intel video drivers as root causes. What is actually happening under the covers is that these FenceIDs are being returned out-of-order, and thus the bugcheck (why dx says "that's not right", because there's a proper way to return these). Again, in every case I can find, it was a driver (not hardware) issue, and the external vendor would be tasked with resolving the issues with their driver on customer hardware.
Unfortunately, the problem happens in the external driver before it hits directx, so I can't tell you why it's happening, but the likelihood it's a hardware issue is probably almost nil if it isn't also bugchecking with a 116. It could be power-related, though, so if the machine is older checking the PSU isn't a bad idea.
Sorry I can't provide the debug, but the directx drivers aren't public on purpose, and I don't feel comfortable putting any of that out here even amongst this small group given the protection around this source.
VirGnarus said:I can see how this could be problematic, as if I recall DMA process jobs being out of order can potentially cause memory corruption. I was figuring it was maybe just a bad fence ID altogether.
Though I wonder, what are the parameters the watchdog sent to KeBugCheckEx? Obviously the first one is the subtype of error, but are the other three the Fence IDs (like expected/received)?
cluberti said:Yes, they are the received fences. I would still recommend testing a different card to be safe, but the driver is still the likely culprit.