Determining the source of Bug Check 0x133 (DPC_WATCHDOG_VIOLATION) errors on Windows

JMH

Determining the source of Bug Check 0x133 (DPC_WATCHDOG_VIOLATION) errors on Windows Server 2012

What is a bug check 0x133?
Starting in Windows Server 2012, a DPC watchdog timer is enabled which will bug check a system if too much time is spent in DPC routines. This bug check was added to help identify drivers that are deadlocked or misbehaving. The bug check is of type "DPC_WATCHDOG_VIOLATION" and has a code of 0x133. (Windows 7 also included a DPC watchdog but by default, it only took action when a kernel debugger was attached to the system.) A description of DPC routines can be found at http://msdn.microsoft.com/en-us/library/windows/hardware/ff544084(v=vs.85).aspx.

The DPC_WATCHDOG_VIOLATION bug check can be triggered in two ways. First, if a single DPC exceeds a specified number of ticks, the system will stop with 0x133 with parameter 1 of the bug check set to 0. In this case, the system's time limit for a single DPC will be in parameter 3, with the number of ticks taken by this DPC in parameter 2. Alternatively, if the system exceeds a larger timeout of time spent cumulatively in all DPCs since the IRQL was raised to DPC level, the system will stop with a 0x133 with parameter 1 set to 1. Microsoft recommends that DPCs should not run longer than 100 microseconds and ISRs should not run longer than 25 microseconds; however, the actual timeout values on the system are set much higher.
http://blogs.msdn.com/b/ntdebugging...-violation-errors-on-windows-server-2012.aspx
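
As a rough illustration of the two cases described above, here is a minimal Python sketch that turns the bug check parameters into a readable summary. The function name and the sample tick values are made up for the example; only the meaning of parameter 1 (0 or 1) and of parameters 2 and 3 in the single-DPC case comes from the text above. The other parameters in the cumulative case are deliberately left uninterpreted.

```python
DPC_WATCHDOG_VIOLATION = 0x133


def describe_dpc_watchdog(code: int, p1: int, p2: int, p3: int) -> str:
    """Summarize a 0x133 bug check from its first three parameters.

    Hypothetical helper: it only encodes what the quoted article states.
    """
    if code != DPC_WATCHDOG_VIOLATION:
        raise ValueError(f"not a DPC_WATCHDOG_VIOLATION bug check: {code:#x}")
    if p1 == 0:
        # Single DPC case: parameter 2 = ticks taken, parameter 3 = tick limit.
        return f"single DPC ran {p2} ticks (limit {p3} ticks)"
    if p1 == 1:
        # Cumulative case: total time at DPC level exceeded the larger timeout.
        return "cumulative time at DPC level exceeded the system timeout"
    return f"unrecognized parameter 1 value {p1}"


if __name__ == "__main__":
    # Sample values are invented for illustration only.
    print(describe_dpc_watchdog(0x133, 0, 0x501, 0x500))
    print(describe_dpc_watchdog(0x133, 1, 0, 0))
```

Knowing which of the two cases fired is the first fork in the analysis: the single-DPC case points at one routine, while the cumulative case means looking at everything that ran since the IRQL was raised.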
 
A good idea in theory. In practice, it seems to have mixed results. Could just be that I haven't yet learned where to look with this to find the true culprit, but in much of my experience, the driver blamed belongs to Windows and is of little use when debugging... The memory stack is really the only saving grace in those instances, and even that can sometimes be corrupted and/or missing. 0x133 has frustrated me a few times so far with Windows 8 systems. Hopefully we will all learn more in time with how to analyze that particular crash. Maybe it requires a Kernel dump?
 
One of the biggest problems is there is always the assumption the system will detect AND RECORD all the particulars in the various event logs and dumps BEFORE the fault halts the system. But that does not always happen, and even when the event is recorded, it is often so cryptic, it takes "BSOD Kernel Dump Experts" to interpret the results - IF possible, and it often is not.

What "the industry" (HW and OS/SW) needs to do is come up with a device that plugs into the USB port of a broken computer that will then retrieve, organize and interpret what happened the last few clock cycles before the crash. A "black box" recorder for computer systems, if you will.

But that, of course, will raise the hackles of privacy advocates afraid their activities will be recorded. :(

Windows' greatest asset is its flexibility - capable of running on literally millions of combinations of motherboards, CPUs, graphics and other devices from 1000s of different HW makers, running all sorts of software (including deep kernel security stuff). Fortunately, especially with the latest version of Windows and the latest hardware (at least from the reputable makers), BSODs and the like are much less frequent than just a few years ago.

Windows' greatest liability is its flexibility. If some rinky-dink device out of the back woods of some 3rd world country does not work - Windows gets the blame.
 
To me, the bottom line is that when dealing with failing hardware, the information in the mini kernel dumps cannot be relied upon and at best can only point us in a direction, like HDD failure when the exception code is something like 0xc0000185 (I/O error).
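
For anyone wondering how a code like 0xc0000185 reads as an I/O error, here is a small Python sketch that splits an NTSTATUS value into its fields. The bit layout (severity in the top two bits, a customer bit, a 12-bit facility, a 16-bit code) is the standard NTSTATUS layout rather than anything stated in this thread, and the function name is made up for the example. 0xC0000185 itself is STATUS_IO_DEVICE_ERROR.

```python
SEVERITY_NAMES = ["success", "informational", "warning", "error"]


def decode_ntstatus(status: int) -> dict:
    """Split a 32-bit NTSTATUS value into its standard bit fields."""
    return {
        "severity": SEVERITY_NAMES[(status >> 30) & 0x3],  # bits 30-31
        "customer": bool((status >> 29) & 0x1),            # bit 29
        "facility": (status >> 16) & 0xFFF,                # bits 16-27
        "code": status & 0xFFFF,                           # bits 0-15
    }


if __name__ == "__main__":
    fields = decode_ntstatus(0xC0000185)
    print(fields["severity"], hex(fields["code"]))
```

This only says what kind of status was reported. As noted above, it points in a direction (storage I/O) and does not identify the failing part.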

When mobo and PSU are at fault, the info in the dumps is nearly worthless, which in itself tells us hardware failure is the likely cause.

Also, I rely on Driver Verifier to help pinpoint hardware failure v. 3rd party driver cause.

A VERIFIER_ENABLED dump that names no 3rd party driver = unknown hardware failure.
 
What about the rare instance when Verifier does not cause any crashes for weeks at a time, but the system crashes regularly with it disabled?

In the two or three occurrences where that happened with an OP I was helping, I've seen the cause be 3rd party drivers and I've seen it be hardware, so it is an either/or situation.
 
A situation where Driver Verifier appears to prevent BSODs (for lack of a better term), i.e., no BSODs when D/V is enabled, but BSODs when D/V is disabled?

I personally have never seen such.

I do know and should have stated that very rarely, if ever, have I seen Driver Verifier flag sptd.sys (or a variant) or other virtual device boot drivers, but have seen them named as probable cause in non-VERIFIER_ENABLED dumps.
 
And then you have cases like mine which are halfway in between. Tons of clueless BSODs with DV enabled, absolutely nothing with it disabled.

It took me a while to figure out why they suddenly stopped, and why they stopped in February: I turned DV off, that was all!

I don't know what to say TBH. The machine works flawlessly and has passed every hardware test thrown at it. Still slightly suspicious of the hardware, but it does work fine ATM.
 
I have never seen DV prevent BSODs either. Not saying it cannot because I don't know what all DV changes. I have just never seen it.

I don't know what to say TBH. The machine works flawlessly and has passed every hardware test thrown at it. Still slightly suspicious of the hardware, but it does work fine ATM.
This points me to 3 possible culprits.

(1) PSU is stressed or failing.
(2) Heat
(3) Leaky capacitors (though not typically a problem on motherboards made in the last 4 - 5 years).
 
Deliciously solid article. Thanks for bringing this to my attention. Didn't think this one would slip under my radar.

@Digerati

As for a "black box" recorder, it's called a live kernel debugging session using another PC hooked to the victim PC. Fortunately, Windows 8 and newer have incorporated network-based live kernel debugging, but I'm not sure how that works as I've personally never tried it. Heard good things about it, though.

Also, do remember that any changes logged to a system's environment can get real resource intensive, and fast. Take Process Monitor, for example. Have that run for a few minutes and you're already looking at a pretty hefty file. Now just imagine adding all memory allocations and all data pertaining to them, as well as all I/O and code execution data, and one can see things can start getting out of control. A complete memory dump from a system is already huge, so imagine having that plus iterative additions to it for every execution made, AND expect the victim PC to be running swell through all that!

Only logging activity in the past few cycles is ok and all, but just so you know, crashdumps already do that! Callstacks, processor environments, even PNP triage data is stored and conserved at all times on Windows, which tells you what's happened previously and gives you a window into past activities. Driver Verifier even goes a step further by providing logs for various activities of drivers it watches, which can be accessed in a crashdump using the !verifier extension (of course, kernel dump required). Even so, many problems occur minutes, sometimes hours, days, weeks or even months before an actual symptom manifests (especially with something like a bad memory allocation), so it does no good to only record the most recent activity when the poison has already settled long outside the range of logging.

Suffice to say, a full blown logging process can't be done. However, live kernel debugging is the next best thing, because it allows someone to lay 'traps' (conditional breakpoints) so that when the culprit goes to commit the crime it'll be caught in the act. Of course, to do this one needs to sit down with all his current data, evaluate, and determine where he expects the suspect will strike next, but for someone that becomes more acquainted with this, it becomes easier to make a strong educated guess.

Btw, back to the logging thing, a somewhat more lightweight maneuver than logging every bit of activity on a system is to turn on the appropriate gflags for that system. Of course, personal experience has shown that tripping all these flags has some undesirable effects, namely slowing a system down to a horrendous crawl in many cases. However, if one only triggers the flags for data they are interested in viewing that is relevant to the problem, then it won't be so much of an issue. Another option, as presented by the original article on DPCs, is an Event Viewer Trace Log.


I think a common misconception about debugging is that people think, "More data, more data, more data!" There are indeed cases where a broad general sweep is necessary to track down a cause, but that's only if all other speculation has missed its mark. An anecdotal example is here, where I suffered long and hard, doing everything from Procmon logs, to Wireshark logs, to Trace logs, triggered crashdumps, and more, and I was still stumped on what was going on. I could see the effects in the data, and I could pinpoint it as far as some network service issue, but for the life of me I could not tell what was causing it. It was only later on that it was discovered the router the PC was hooked up to was causing the problem. The logs could only tell one side of the tale, and without an in-depth analysis of all the relevant components, I was only working with flimsy data that could tell no more of the story. Crashdumps are much the same when it comes to hardware problems. Oftentimes you can see the effects and make guesses about which hardware may be responsible, but that's all it really is: guessing. Without proper testing and diagnostic procedures on the hardware, one cannot tell from the data what's going on. Doing any of this, though, requires prior experience, skill and knowledge to view the problem, form a hypothesis, then gather the appropriate data and coordinate a solution. No matter how thorough !analyze -v can be, neither it nor any other automated tool is going to be as precise as a sharp mind.
 
As for a "black box" recorder, it's called a live kernel debugging session using another PC hooked to the victim PC.
This practice has been used for years on corporate networks, but of course, it takes additional resources which many home users and small office users just don't have. Then of course, you need to know how to interpret the data. And there's also the problem you noted, an overwhelming amount of data to sift through.

And finally, there is time - perhaps the most valuable resource. Many of my colleagues would love to take 2 or 3 hours, or longer troubleshooting hardware and analyzing logs in order to learn what happened, how to fix it, and how to keep it from recurring. But the reality is (at least for businesses) it is much quicker to simply re-image the drive, or R&R (remove and replace) major components to put the machine back in production as quickly as possible.
 
As convenient as it would be to provide some modicum of automated service for home and small office users for these kinds of problems, it just doesn't beat professional services, which is why SOHOs often call up a service tech when they experience IT predicaments. Corporate environments can afford quick hardware replacements and shell out more for redundant hardware setups and other preventive maintenance procedures that SOHOs often can't achieve, but even they still have to turn to the pros when crucial problems arise that their SLA doesn't cover.

However, you do give me a good idea. While a flash drive of some sort would not suffice, a small ethernet/USB device like a Raspberry Pi would be perfect for the operation. Of course, it'd need storage to save the dump file, but something that can automate a live kernel debugging session and grab necessary data would be pretty beneficial if another PC is not on hand.
 
but something that can automate a live kernel debugging session and grab necessary data would be pretty beneficial if another PC is not on hand.
Yes, but again, it assumes the system has time to detect and record before halting. So I feel the value would be limited at best, most of the time - which is unfortunate.
 
I've never done a live kernel debug session, but would like to at some point.

I think my biggest limitation in doing so accurately would be the many Windbg kd commands that I have read about, but have yet to use.
 
The ones specific for live debugging or just in general? Either way I agree, I still have yet to skim the surface of its potential.

@Digerati:

The system does go into a collection mode when a BSOD occurs, by setting the IRQL to the highest available so no other process thread will impede, and then it'll perform the crashdump generation by first collecting data from various hardware and drivers - when available - to store in the dump file prior to creating it. It's how stuff like WHEA errors and other special error-based information gets preserved in a crashdump so that we can access that important data using something like !errrec. There's also of course the aforementioned logging that occurs at all times in case a crash does occur that would normally prevent it from collecting said data. So you should be fairly relieved that at least that much is going on.
 
The ones specific for live debugging or just in general? Either way I agree, I still have yet to skim the surface of its potential.

Both I would say.

I certainly don't know them like you do. I am sure that I have learned more about the kd commands from you v. debugger.chm

And, since I'm no coder either, I can also attribute the things I've learned about stacks and other items in the dumps to you as well.

I entered the BSOD arena by accident 5 years ago when a system here was BSOD'ing to death. First I learned Vista (1st time ever behind the Windows Desktop). Then after the warranty people returned my laptop to me with exponentially increasing BSODs, I learned just enough about Windbg to figure out it was an outdated wifi driver that caused it all.

I started posting to try and help and simply could not believe all these new Vista system owners' BSODs were caused by hardware failure as was routinely implied. Not knowing anything about hardware at the time either, I focused solely on 3rd party drivers for the first 2+ years.

And here we are today - I'm still a driver person, but at least now recognize that hardware failure does occur and causes BSODs! :grin1:

Many here will tell you that is one huge step for me. :lol: (Quiet Bruce!)
 
The system does go into a collection mode when a BSOD occurs
There's also of course the aforementioned logging that occurs at all times in case a crash does occur
No! Sorry but that is not true with every crash! "Ideally" you would be correct and the system would have lots of time to go into collection mode and start recording status before the halt by saving files to the agonizingly slow hard drive! But you cannot assume "all crashes" are ideal! They aren't. Many are instant - one clock cycle all is good, the next, not good. This is clearly evidenced by the many many examples of no Event log entries found after a crash.

I wish you were right and Windows AND computing hardware had the capability to stay up a second or two every time there was a problem so it could take a snapshot. But the facts and years of evidence are clear - it does not always happen. Any number of things can cause a CPU to suddenly halt, and when that happens, all the time in the world is not enough to record the status of the crash.

The only way to [almost] ensure what you are suggesting happens is to CONSTANTLY record status (like the black box) of both the hardware and all software running. But Windows does not do that. I also note BSODs can be caused by hardware failure, or software (to include drivers). And not all crashes produce the BSOD error screen.

So, I am sorry, but to suggest Windows always provides crash data is incorrect. As a technician with a shop, it would sure make my job easier if it did.
 
Actually, no event log entry means either the system was set up not to log one or there was a problem accessing the storage device to write the entry. I think the entry is actually attached to the dump in the paging file, and then when the system restarts and checks for a dump file it also sees the event log entry and appends it to the log, but I can't recall for certain. Anyway, it has something to do with disk I/O being held up. If an event log entry wasn't written, most likely the crashdump wasn't either.

Windows crashes are actually a much cleaner process than you may perceive it to be. The data collection steps and all are actually done before the blue screen appears, which is why you're able to receive the bugcheck code and its following parameters. The only operation still left to complete after the blue screen shows up is dumping everything to the paging file. If there's ever a problem with that, it simply halts or fails with an error message on the blue screen. If there's a system stability issue that compromises even the KeBugCheckEx thread from doing its job for creating the BSOD, then often it's going to result in some other symptom, typically being a system freeze.

You're right in that one clock cycle can be all good and the next bad, but even that means Windows has to detect that it's bad, and it does that by an exception handling process that goes up the chain of command until Windows kernel verifies this is unresolvable and has to BSOD the system to prevent potential disaster. Of course, again this is with the result being a BSOD, which is the cleanest and best means of a system failure one can achieve. Anything else and the system is just going to lock up because it cannot continue operating, for if it could continue operating, it would've operated on conducting the BSOD.

I agree with you on the CPU halt thing being a major pain in the butt, because it's hard to track down. But did you know that a system can actually be entered with a live debugging session even during a CPU halt? If the halt did not occur from an excessively high IRQL state, one can use an NMI (non-maskable interrupt) switch to trigger a BSOD and therefore gain access to the system. This is far more reliable than the manually-initiated crash using the Scroll Lock key combination, since it's an interrupt that cannot be disregarded. The only exception to this of course is if the system was halted by some hardware instability, notably if the CPU itself has bugged out, in which case the NMI would not be handled and the system will stay halted. At least in that case you'll know it's a hardware problem!
 
Windows crashes are actually a much cleaner process than you may perceive it to be. The data collection steps and all are actually done before the blue screen appears, which is why you're able to receive the bugcheck code and its following parameters.
NO! Again, you are making assumptions you just cannot make. You are assuming Windows will detect a crash before it happens. The facts are there, that is not always the case and depends entirely on what crashed and how it crashed.

Windows crashes are actually a much cleaner process than you may perceive it to be.
Not at all! But I am afraid you assume all crashes are clean crashes (whatever that means). And they just aren't. Some might be, but you cannot tell or predict which will be. And neither can Windows.

The data collection steps and all are actually done before the blue screen appears
OF COURSE! I don't see what point you are trying to make. Of course the data is collected before the BSOD is displayed. How else would the system know what error code and message to display on the BSOD screen? :confused2:

If there's a system stability issue that compromises even the KeBugCheckEx thread from doing its job for creating the BSOD, then often it's going to result in some other symptom, typically being a system freeze.
Exactly! And that is exactly why you cannot make all those other assumptions that you do! ???

Also, you seem to be assuming a "crash" and "system freeze" are two discrete issues. You cannot assume that either. Sudden halts, unexpected reboots, or sudden shutdowns are all "crashes". And crashes can be caused by software or hardware problems that may be slow to occur, or instantaneous.

You're right in that one clock cycle can be all good and the next bad, but even that means Windows has to detect that it's bad
No it doesn't! :( I don't know where you are coming from but hardware does not work that way. If the CPU is in a halted state, nothing is running - not even Windows. Therefore, Windows cannot and will not detect that next bad clock cycle, because it does not exist.
 
Ah, I guess our definition of crash is different, which is why we keep getting confused. Pardon me on that.

Though I still don't see how any 'black box' can record everything, especially since most functionality is done in isolation within the internals of separate devices. A CPU has its own internal caches, and even that is reported to Windows all the time - especially during a BSOD - unless the CPU fails to respond, in which case Windows goes on with the BSOD as much as it can without the data (this happens frequently in 0x101 bugchecks with hung CPU cores). A storage device's internal controller contains its own diagnostic logs like SMART, as well as standardized error codes, but unless it reports them to the system, the system is completely and utterly clueless about its current condition. A motherboard's PCI-E bus is multifaceted, but oftentimes if a communication error occurs between an end device and a bridge, or bridge-to-bridge, the only one that gets notified about it is the root port, and all it knows is who informed it of the problem, not who did it, because it can be the bus itself that did it! None of this information is retrievable, either due to physical limitations or proprietary restrictions, and certainly an external device like a black box utility, hooked up either to the USB, PCI-E, or some other port, isn't really going to do any better than Windows would. About the closest thing you could get to something like that would be a type of diagnostic motherboard which collects data on everything running through its circuits, but even that doesn't resolve the fact that most of the work done is inside the components connected to the mobo! Even so, I've seen diagnostic mobos before (as well as POST PCI cards), and I myself have greatly desired one for my own kit, but it's still only a part of a professional's diagnostic kit, and not the kit entirely.

There's also, again, the whole performance thing involved. Having Windows set up in a super-paranoid state is crippling enough, but have all the hardware constantly report on its operations as well, and the system would be reduced to something akin to a 386 IBM! There's a reason why Windows is able to run as it does, and that's because it's willing to compromise on reliability by assuming that the drivers and software running on it were designed properly and are working well, and that the hardware, too, is in good condition. If it had to be schizophrenic about everything, nothing would get done. But it does a good job by setting up security checks to anticipate corruption by stopping the system early, which is only improved with Driver Verifier and certain gflags.

Not all BSODs are just responses to the problem itself, you know. Many times Windows will run algorithms and detect patterned behavior associated with creating problems, and it will BSOD because it expects things are going to go bad if it continues. Windows 8 has an example of this with improved resource contention detection: if it finds a resource being contended for by various threads in an improper manner, it'll BSOD before a problem even arises (that problem being a hung system). Windows Vista/7 have it too, but it is not as robust. These are common methods to prevent permanent corruption to the system, or to try to alert as quickly as possible before the suspect flees the scene, but since Windows has to compromise for speed and functionality, usually it doesn't quite happen early enough.

I understand the desire to have an all-in-one diagnostic item for a computer, but computers are far too compartmentalized to be able to do that in some manner. Either a device/application has to report it has an issue to the system, or the system has to actively scrutinize the device/app's activity and make an estimation that what the device/application is doing or intending to do is invalid. In reality both of these are actually performed by Windows, the CPU, and the motherboard all the time, working in conjunction with each other. However, it's just not done in an entirely robust, 100% perfect way, as the closer it would get to that, the closer it'd be to having to stop a car at every piston movement and do a 180-point inspection. With a system that needs to have reliability, performance and efficiency, they all have to be weighed against each other to determine the right balance. I'm not claiming that the current system is a perfect balance, which it isn't, and there's always room for improvement - but there's just no room for miracle working in this. When one aspect of this balancing act is improved, the others just have to give to compensate.
 
I don't think it is all that impossible as we head toward Windows 9 now for improvements in crash recording/ reporting to be implemented far beyond what we think is possible now.

I'm really the new kid on the block here, having learned Windows under Vista and therefore am probably spoiled. I rarely work/ have worked on XP systems/ OPs because of the lack of info v. Vista, 7, 8.

I count 138 Event Viewer log files (not all are diagnostically helpful, I know) in my Windows 7 system. I recall XP (Home) had 2 or 3? Who knows what the future holds.





As I recall, John Carrona, MS MVP, (usasma) followed a BSOD through to Blue Screen, then through re-boot with ProcMon.

This is what I found of his write-up so far - BSOD Crash Dump Generation




Granted, Windows may not record bugcheck info for a BSOD prior to system shutdown when catastrophic hardware failure brings the system down so fast there is no time to do so, but I usually find Event Viewer records like these after reboot -
Code:
Event[2491]:
  Log Name: System
  Source: Microsoft-Windows-Kernel-Power
  Date: 2012-01-20T20:01:16.222
  Event ID: 41
  Level: Critical
  Opcode: Info
  User: S-1-5-18
  User Name: NT AUTHORITY\SYSTEM
  Description: 
The system has rebooted without cleanly shutting down first. This error could 
be caused if the system stopped responding, crashed, or lost power unexpectedly.

Obviously not much info to go on, but that is when I start looking through other files & request additional info from the OP.
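
If you have a long text export in the layout shown above, a short script can pull out just the Kernel-Power Event ID 41 records so you know which timestamps to line up against the dumps. This is only a sketch in Python: the function names are made up, the sample text is trimmed from the record above, and it assumes every record starts with an `Event[n]:` line followed by indented `Field: value` lines.

```python
import re

SAMPLE = """Event[2491]:
  Log Name: System
  Source: Microsoft-Windows-Kernel-Power
  Date: 2012-01-20T20:01:16.222
  Event ID: 41
  Level: Critical
"""


def parse_events(text: str) -> list:
    """Parse 'Event[n]:' records with indented 'Field: value' lines."""
    events = []
    current = None
    for line in text.splitlines():
        header = re.match(r"^Event\[(\d+)\]:\s*$", line)
        if header:
            current = {"Index": int(header.group(1))}
            events.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first record
        field = re.match(r"^\s+([A-Za-z ]+):\s*(.*)$", line)
        if field:
            current[field.group(1).strip()] = field.group(2).strip()
    return events


def unexpected_shutdowns(events: list) -> list:
    """Keep only Kernel-Power Event ID 41 records."""
    return [
        e for e in events
        if e.get("Source") == "Microsoft-Windows-Kernel-Power"
        and e.get("Event ID") == "41"
    ]


if __name__ == "__main__":
    for event in unexpected_shutdowns(parse_events(SAMPLE)):
        print(event["Date"], "- unexpected shutdown (Event ID 41)")
```

Unindented description text is skipped on purpose, since for Event 41 the description adds nothing beyond the ID itself.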
 
