OK, here's the latest. I'm still mostly problem-free, but yesterday I had a recovered TDR (a single 117 live kernel report) while playing Civ 5, which crashed the game. Three hours later I got a series of six 117 live kernel reports in quick succession, this time freezing the PC without a 116 bugcheck (unlike what I had with the GTX 260 installed). Both times, the 117s were triggered by exactly the same action in game, and I've found other reports of people getting TDRs under the same circumstances, so this may be a driver glitch associated with Civ 5. However, I'm not quite ready to let things drop yet, so I've been doing some more thinking...
I was thinking about the disassembly at the pointer into nvlddmkm a little more and came up with an idea based on totally the wrong conclusion about it! (will explain that in a minute) Looking at the hardware IRQ assignments on my machine, I noticed that my video card was sharing IRQ 16 with a "VIA Rev 5" USB Universal Host Controller. Now, I already happened to have suspicions about this device. It's actually part of a TV tuner card I have installed. There are two such controllers which each claim to have a "USB Composite Device" attached to them. I don't fully understand what these do and will have to investigate further. However, what I can tell you is that they don't respond nicely to S3 sleep at all - every time I switch my computer back on from S3 sleep, they claim to disconnect and reconnect, sometimes producing errors and often causing the attached infrared controller to stop working properly. My guess would be that it's because the PCI bus loses power in S3 sleep while "normal" USB controllers do not, so when we go to S3, the power to these controllers is suddenly lost unexpectedly.
So, for my next step of diagnostics, I have disabled the one VIA USB controller that is sharing IRQ 16 (the other is on IRQ 18). I'll see how that goes and try removing the TV tuner card completely as the next step if it doesn't work.
Now, here is the "wrong conclusion" I made and some interesting extra information from it. You'll recall that the disassembly in the NVidia driver showed lots of "int" instructions surrounding all those jumps (I'm back on the full memory dump from the 116 from a few weeks ago). In fact, the instruction immediately before the stop point was also an int:
Code:
4: kd> u fffff8800f804530-10 L20
nvlddmkm+0x14f520:
fffff880`0f804520 48ff2571837100 jmp qword ptr [nvlddmkm!nvDumpConfig+0x188388 (fffff880`0ff1c898)]
fffff880`0f804527 cc int 3
fffff880`0f804528 e9b799f0ff jmp nvlddmkm+0x58ee4 (fffff880`0f70dee4)
fffff880`0f80452d cc int 3
fffff880`0f80452e cc int 3
fffff880`0f80452f cc int 3
fffff880`0f804530 48ff25d9817100 jmp qword ptr [nvlddmkm!nvDumpConfig+0x188200 (fffff880`0ff1c710)]
fffff880`0f804537 cc int 3
fffff880`0f804538 e94fa2f0ff jmp nvlddmkm+0x5978c (fffff880`0f70e78c)
fffff880`0f80453d cc int 3
fffff880`0f80453e cc int 3
fffff880`0f80453f cc int 3
fffff880`0f804540 48ff2589817100 jmp qword ptr [nvlddmkm!nvDumpConfig+0x1881c0 (fffff880`0ff1c6d0)]
fffff880`0f804547 cc int 3
fffff880`0f804548 e94ba7f0ff jmp nvlddmkm+0x59c98 (fffff880`0f70ec98)
fffff880`0f80454d cc int 3
fffff880`0f80454e cc int 3
fffff880`0f80454f cc int 3
fffff880`0f804550 48ff25c1827100 jmp qword ptr [nvlddmkm!nvDumpConfig+0x188308 (fffff880`0ff1c818)]
fffff880`0f804557 cc int 3
fffff880`0f804558 e9d3a8f0ff jmp nvlddmkm+0x59e30 (fffff880`0f70ee30)
fffff880`0f80455d cc int 3
fffff880`0f80455e cc int 3
fffff880`0f80455f cc int 3
fffff880`0f804560 48ff25d1817100 jmp qword ptr [nvlddmkm!nvDumpConfig+0x188228 (fffff880`0ff1c738)]
fffff880`0f804567 cc int 3
fffff880`0f804568 48ff2559817100 jmp qword ptr [nvlddmkm!nvDumpConfig+0x1881b8 (fffff880`0ff1c6c8)]
fffff880`0f80456f cc int 3
fffff880`0f804570 e99bb1f0ff jmp nvlddmkm+0x5a710 (fffff880`0f70f710)
fffff880`0f804575 cc int 3
fffff880`0f804576 cc int 3
fffff880`0f804577 cc int 3
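For anyone who wants to double-check the jump targets in that listing by hand, here is a minimal Python sketch. The helper name jmp_target is my own invention (it is not part of WinDbg or the NVidia driver), and it only handles the two encodings that appear above: e9 followed by a 32-bit relative offset, and 48 ff 25 followed by a 32-bit RIP-relative displacement. In both cases the offset is added to the address of the next instruction.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF

def jmp_target(addr, raw):
    """Decode the two jmp encodings seen in the listing (hypothetical helper).

    Returns ("direct", target) for e9 rel32, or ("indirect", pointer_address)
    for 48 ff 25 disp32, where pointer_address is the memory slot that holds
    the real destination.
    """
    if raw[0] == 0xE9 and len(raw) >= 5:
        rel = struct.unpack("<i", raw[1:5])[0]      # signed little-endian rel32
        return "direct", (addr + 5 + rel) & MASK64  # relative to next instruction
    if raw[:3] == bytes.fromhex("48ff25") and len(raw) >= 7:
        disp = struct.unpack("<i", raw[3:7])[0]     # signed little-endian disp32
        return "indirect", (addr + 7 + disp) & MASK64
    raise ValueError("encoding not handled by this sketch")

# Bytes copied from the first two jmp lines of the listing
kind, target = jmp_target(0xfffff8800f804528, bytes.fromhex("e9b799f0ff"))
print(kind, hex(target))    # direct 0xfffff8800f70dee4
kind, target = jmp_target(0xfffff8800f804520, bytes.fromhex("48ff2571837100"))
print(kind, hex(target))    # indirect 0xfffff8800ff1c898
```

Both results agree with the addresses the debugger printed in brackets, so the listing really is just a table of small jump stubs with single cc bytes in between.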
My mind immediately went from "int", i.e. "interrupt", to hardware interrupts and hence IRQs. In fact, after a little more reading, I realise now that int 3 is a software interrupt that invokes whatever handler sits at vector 3 of the interrupt descriptor table. We can find out what that is:
Code:
4: kd> !idt
Dumping IDT: fffff880009bd6c0
00: fffff80002ad1940 nt!KiDivideErrorFault
01: fffff80002ad1a40 nt!KiDebugTrapOrFault
02: fffff80002ad1c00 nt!KiNmiInterrupt Stack = 0xFFFFF880009BD0C0
03: fffff80002ad1f80 nt!KiBreakpointTrap
04: fffff80002ad2080 nt!KiOverflowTrap
05: fffff80002ad2180 nt!KiBoundFault
06: fffff80002ad2280 nt!KiInvalidOpcodeFault
07: fffff80002ad24c0 nt!KiNpxNotAvailableFault
08: fffff80002ad2580 nt!KiDoubleFaultAbort Stack = 0xFFFFF880009B90C0
09: fffff80002ad2640 nt!KiNpxSegmentOverrunAbort
0a: fffff80002ad2700 nt!KiInvalidTssFault
0b: fffff80002ad27c0 nt!KiSegmentNotPresentFault
0c: fffff80002ad2900 nt!KiStackFault
0d: fffff80002ad2a40 nt!KiGeneralProtectionFault
0e: fffff80002ad2b80 nt!KiPageFault
10: fffff80002ad2f40 nt!KiFloatingErrorFault
11: fffff80002ad30c0 nt!KiAlignmentFault
12: fffff80002ad31c0 nt!KiMcheckAbort Stack = 0xFFFFF880009BB0C0
13: fffff80002ad3540 nt!KiXmmException
1f: fffff80002ac77d0 nt!KiApcInterrupt
2c: fffff80002ad3700 nt!KiRaiseAssertion
2d: fffff80002ad3800 nt!KiDebugServiceTrap
2f: fffff80002b1f950 nt!KiDpcInterrupt
37: fffffa80068aba90 hal!HalpApicSpuriousService (KINTERRUPT fffffa80068aba00)
3f: fffffa80068abb30 hal!HalpApicSpuriousService (KINTERRUPT fffffa80068abaa0)
50: fffffa80068abc70 hal!HalpCmciService (KINTERRUPT fffffa80068abbe0)
52: fffffa8006853990 USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853900)
61: fffffa8006853090 serial!SerialCIsrSw (KINTERRUPT fffffa8006853000)
62: fffffa80068538d0 USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853840)
USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853780)
USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853480)
72: fffffa8006853690 USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853600)
USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853300)
82: fffffa8006853210 USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853180)
92: fffffa8006853e10 ataport!IdePortInterrupt (KINTERRUPT fffffa8006853d80)
ataport!IdePortInterrupt (KINTERRUPT fffffa8006853cc0)
USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853540)
USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa80068530c0)
portcls!CInterruptSync::`scalar deleting destructor'+0xb8 (KINTERRUPT fffffa800a1b3f00)
a0: fffffa8006853750 ndis!ndisMiniportMessageIsr (KINTERRUPT fffffa80068536c0)
a2: fffffa8006853b10 HDAudBus!HdaController::Isr (KINTERRUPT fffffa8006853a80)
b0: fffffa8006853c90 storport!RaidpAdapterMSIInterruptRoutine (KINTERRUPT fffffa8006853c00)
b1: fffffa8006853f90 ACPI!ACPIInterruptServiceRoutine (KINTERRUPT fffffa8006853f00)
b2: fffffa8006853bd0 ataport!IdePortInterrupt (KINTERRUPT fffffa8006853b40)
ataport!IdePortInterrupt (KINTERRUPT fffffa8006853e40)
USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa80068539c0)
USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa80068533c0)
USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853240)
dxgkrnl!DpiFdoLineInterruptRoutine (KINTERRUPT fffffa800a1b3e40)
c1: fffff80002a46450 hal!HalpBroadcastCallService (KINTERRUPT fffff80002a463c0)
d1: fffff80002adfe10 nt!KiSecondaryClockInterrupt
d2: fffff80002a46590 hal!HalpHpetRolloverInterrupt (KINTERRUPT fffff80002a46500)
df: fffff80002a463b0 hal!HalpApicRebootService (KINTERRUPT fffff80002a46320)
e1: fffff80002ade970 nt!KiIpiInterrupt
e2: fffffa80068abd10 hal!HalpDeferredRecoveryService (KINTERRUPT fffffa80068abc80)
e3: fffffa80068abbd0 hal!HalpLocalApicErrorService (KINTERRUPT fffffa80068abb40)
fd: fffffa80068abdb0 hal!HalpProfileInterrupt (KINTERRUPT fffffa80068abd20)
fe: fffffa80068abe50 hal!HalpPerfInterrupt (KINTERRUPT fffffa80068abdc0)
ff: 0000000000000000
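So vector 3 is nt!KiBreakpointTrap: int 3 is simply the software breakpoint instruction, which has nothing to do with hardware IRQs or with handling the bugcheck. The cc bytes sitting between those jmp stubs are most likely just padding that never gets executed. Oh well, the wrong conclusion could still have pointed me in the right direction, so we'll see how things go now that I've got IRQ 16 a little more cleared up. Unfortunately, I still don't know how to reconstruct the call stack in nvlddmkm and I still don't understand the kernel driver threading model, so I need to do some more reading.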
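The !idt dump is still useful for the interrupt-sharing question, because it shows which service routines are chained on the same vector. Here is a rough Python sketch that groups them. The function name isrs_by_vector is made up for illustration, and it assumes that any line without its own "xx:" vector prefix belongs to the vector above it, which is how the pasted output reads. Note that the dump does not say which IRQ number a vector corresponds to, so this only shows sharing, not that a given vector is IRQ 16.

```python
import re
from collections import defaultdict

# A line that starts a new vector: two hex digits, a colon, a 16-digit address
VECTOR_LINE = re.compile(r"^([0-9a-f]{2}):\s+[0-9a-f]{16}\s*(.*)$")

def isrs_by_vector(idt_text):
    """Group service routine names by IDT vector (hypothetical helper)."""
    vectors = defaultdict(list)
    current = None
    for line in idt_text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = VECTOR_LINE.match(line)
        if m:
            current, rest = m.group(1), m.group(2)
        elif current is not None:
            rest = line                      # continuation: same vector as above
        else:
            continue                         # header such as "Dumping IDT: ..."
        # Drop the trailing "(KINTERRUPT ...)" or "Stack = ..." annotation
        routine = re.split(r"\s+(?:\(KINTERRUPT|Stack =)", rest)[0].strip()
        if routine:
            vectors[current].append(routine)
    return dict(vectors)

# Excerpt copied from the dump above
SAMPLE = """\
Dumping IDT: fffff880009bd6c0
b1: fffffa8006853f90 ACPI!ACPIInterruptServiceRoutine (KINTERRUPT fffffa8006853f00)
b2: fffffa8006853bd0 ataport!IdePortInterrupt (KINTERRUPT fffffa8006853b40)
ataport!IdePortInterrupt (KINTERRUPT fffffa8006853e40)
USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa80068539c0)
USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa80068533c0)
USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853240)
dxgkrnl!DpiFdoLineInterruptRoutine (KINTERRUPT fffffa800a1b3e40)
ff: 0000000000000000
"""

for vector, routines in isrs_by_vector(SAMPLE).items():
    if len(routines) > 1:
        print(vector, "is shared by", ", ".join(routines))
```

Run on that excerpt, it reports that vector b2 is shared between ataport, three USBPORT controllers and dxgkrnl's line interrupt routine, which at least fits the picture of the graphics card sharing an interrupt line with USB controllers.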