PCI-E WHEA errors (0x124)

Vir Gnarus · Mar 23, 2012

Quick Reference:
PCI_EXPRESS_AER_CAPABILITY structure
PCI_EXPRESS_BRIDGE_AER_CAPABILITY structure
PCI_EXPRESS_ROOTPORT_AER_CAPABILITY structure

Hi everybody,

I'd like to present a small personal discovery of mine. The WHEA errors you usually come across, like those described in the BSOD Methods & Tips section, are primarily related to CPU or memory issues. However, there are times when you may come across one that involves the PCI-E bus. On motherboards with both PCI and PCI-E, the same controller is responsible for both, so when this shows up it can often be a PCI device as well, as you will find later on. In this case, the output is quite a bit different from the usual and may appear daunting and indecipherable at first. Have no fear! With the right knowledge base (MSDN) you shouldn't have much of a problem figuring it out.

Let's start by doing an !analyze -v:

Code:

3: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 00000004, PCI Express Error
Arg2: 869348d4, Address of the WHEA_ERROR_RECORD structure.
Arg3: 00000000
Arg4: 00000000

Debugging Details:
------------------

TRIAGER: Could not open triage file : C:\Program Files (x86)\Windows Kits\8.0\Debuggers\x64\triage\modclass.ini, error 2

BUGCHECK_STR:  0x124_GenuineIntel

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  WIN7_DRIVER_FAULT

PROCESS_NAME:  System

CURRENT_IRQL:  a

STACK_TEXT:  
80e4cb2c 8341afcd 00000124 00000004 869348d4 nt!KeBugCheckEx+0x1e
80e4cb68 83506fc4 869334e1 869348d4 8691bc10 hal!HalBugCheckSystem+0xab
80e4cb9c 8c7ce609 8691b638 8690a780 80e4cd20 nt!WheaReportHwError+0x230
80e4cbb4 8c7cf088 869344b4 00000000 8691b638 pci!ExpressRootPortAerInterruptRoutine+0x1e7
80e4cbd8 8c7cf264 8690a780 86934008 80e4cbfc pci!ExpressRootPortInterruptRoutine+0x1a
80e4cbe8 834a9cff 8690a780 86934008 00000001 pci!ExpressRootPortMessageRoutine+0x10
80e4cbfc 83474ded 8690a780 86934008 80e4cc28 nt!KiInterruptMessageDispatch+0x12
80e4cbfc 93be45d6 8690a780 86934008 80e4cc28 nt!KiInterruptDispatch+0x6d
WARNING: Stack unwind information not available. Following frames may be wrong.
80e4cc98 8349ada4 888d8d48 80e35800 80e30000 intelppm+0x15d6
80e4cd20 834985ad 00000000 0000000e ab16ab16 nt!PoIdle+0x524
80e4cd24 00000000 0000000e ab16ab16 8bdf8bdf nt!KiIdleLoop+0xd


STACK_COMMAND:  kb

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: GenuineIntel

IMAGE_NAME:  GenuineIntel

DEBUG_FLR_IMAGE_TIMESTAMP:  0

FAILURE_BUCKET_ID:  0x124_GenuineIntel_PCIEXPRESS

BUCKET_ID:  0x124_GenuineIntel_PCIEXPRESS

Followup: MachineOwner
---------

As you can figure, like any WHEA BSOD, it's not going to be very explanatory. Really, the only things to look at are the arguments for the BSOD. In this case, Arg1 holds 0x4, which means a PCI Express error. Arg2, as usual, holds the address of the WHEA error record. As with any 0x124 bugcheck, we take this value and point !errrec at it:

Code:

3: kd> !errrec 869348d4
===============================================================================
Common Platform Error Record @ 869348d4
-------------------------------------------------------------------------------
Record Id     : 01cd07d8bce4740f
Severity      : Fatal (1)
Length        : 672
Creator       : Microsoft
Notify Type   : PCI Express Error
Timestamp     : 3/22/2012 3:06:44 (UTC)
Flags         : 0x00000000

===============================================================================
Section 0     : PCI Express
-------------------------------------------------------------------------------
Descriptor    @ 86934954
Section       @ 869349e4
Offset        : 272
Length        : 208
Flags         : 0x00000001 Primary
Severity      : Recoverable

Port Type     : Root Port
Version       : 1.1
Command/Status: 0x4010/0x0507
Device Id     :
  VenId:DevId : 8086:340a
  Class code  : 030400
  Function No : 0x00
  Device No   : 0x03
  Segment     : 0x0000
  Primary Bus : 0x00
  Second. Bus : 0x00
  Slot        : 0x0000
Dev. Serial # : 0000000000000000
Express Capability Information @ 86934a18
  Device Caps : 00008021 Role-Based Error Reporting: 1
  Device Ctl  : 0107 ur FE NF CE
  Dev Status  : 0003 ur fe NF CE
   Root Ctl   : 0008 fs nfs cs

AER Information @ ffffffff86934a54
  Uncorrectable Error Status    : 00000020 ur ecrc mtlp rof uc ca cto fcp ptlp SD dlp und
  Uncorrectable Error Mask      : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
  Uncorrectable Error Severity  : 00062010 ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und
  Correctable Error Status      : 00000000 adv rtto rnro dllp tlp re
  Correctable Error Mask        : 00000000 adv rtto rnro dllp tlp re
  Caps & Control                : 00000005 ecrcchken ecrcchkcap ecrcgenen ecrcgencap FEP
  Header Log                    : 00000000 00000000 00000000 00000000
  Root Error Command            : 00000000 fen nfen cen
  Root Error Status             : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
  Correctable Error Source ID   : 00,00,00
  Correctable Error Source ID   : 00,00,00

===============================================================================
Section 1     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ 8693499c
Section       @ 86934ab4
Offset        : 480
Length        : 192
Flags         : 0x00000000
Severity      : Informational

Proc. Type    : x86/x64
Instr. Set    : x86
CPU Version   : 0x00000000000106a5
Processor ID  : 0x0000000000000006

This definitely doesn't look like the typical WHEA output, so we need to digest the information bit by bit. Much of it doesn't need close scrutiny in most cases. The most important content here is the AER info; AER stands for Advanced Error Reporting. To the untrained eye it looks like a bunch of gibberish, but in this article we will discover what it means, and that it really is pretty simple stuff.

But first, let's figure out just which component of the PC the error came from. We start by looking at the Device Id. The first entry gives the VenId (Vendor ID) and DevId (Device ID). We can look these numbers up in an online database; I personally picked PCIDatabase.com. Entering the DevId, which is more specific than the VenId, we discover that this came from an Intel 7500 Chipset PCIe Root Port. On an X58 board, this root port belongs to the I/O Hub (the northbridge side of the chipset), not the ICH10 southbridge. Given that the client described his motherboard as an ASRock X58 board, this isn't surprising. More to the point, between the Port Type and the description returned for the DevId, we can see the report came from the root port of the PCI-E bus. This is important, as you'll discover later.

Anyway, now on to the actual AER information. Since I was as befuddled by the output as you probably are, I searched Google for keywords like "AER" and "PCI Express" to learn what I was looking at. Sure enough, MSDN has an article on it, covering a particular data structure with all the details and descriptions for its members. The initial article I found was this one. Everything described there matches the AER info presented in the WHEA error record, so I figured this would be easy. I checked the Uncorrectable Error Status (PCI_EXPRESS_UNCORRECTABLE_ERROR_STATUS) in the article to see descriptions of all the odd clusters of letters that are displayed.

As I had guessed, this is a set of bit flags, each pertaining to a specific type of error. If the bit for a specific error is set (the value is 1, not 0), then that error occurred. Evidently, in the WHEA error record displayed by !errrec, each cluster of letters is a very short abbreviation of the error name for the related bit. Whatever is lowercase is a bit that was not set, and whatever is capitalized was set. As we can tell, the SD bit was set, so we should look through the article to see what SD means.

This is when things get problematic. Nothing in the article sounds remotely related to the letters SD, and the other abbreviations don't seem to have a matching error described there either. Something's not right here, so let's go back to the main article on the PCI_EXPRESS_AER_CAPABILITY structure that we initially found and read up a bit.

Sure enough, in the remarks, the following is stated:

Root ports and root complex event collectors use the PCI_EXPRESS_ROOTPORT_AER_CAPABILITY structure instead of the PCI_EXPRESS_AER_CAPABILITY structure to describe the PCIe advanced error reporting capability structure.

This is why it was important to discover what was reporting the error. It could have come from a PCI-E bridge (which has its own AER data structure), from the root port as we found here, or from the PCI-E device itself. So now we go to the article describing the structure specific to root ports, then back to the Uncorrectable Error Status subarticle, and read up on the error bits and their descriptions. As you can see, the errors now correspond well to those represented in the WHEA record. For example, ecrc means ECRC Error, rof means Receiver Overflow, and so on. In this specific case, SD was the only one capitalized, so let's look at its description. It says Surprise Down. To me this is self-explanatory, but just in case, I googled it along with PCI Express to be sure. As I figured, it means there was an unexpected (Surprise!) loss of connection between the PCI-E hub and the PCI-E device. The device in question could only be the video card, since the video card was the only PCI-E device the client had installed while troubleshooting (read Update).
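To make the bit-flag reading concrete, here is a minimal Python sketch (not part of WinDbg) that rebuilds the upper/lower-case line from a raw status value. The bit positions are the ones the PCI Express AER Uncorrectable Error Status register defines, as far as I understand the spec; the function and table names are made up for illustration.

```python
# Bit position -> (abbreviation printed by !errrec, full error name).
# Positions are assumed from the PCIe AER Uncorrectable Error Status layout.
UNCORRECTABLE_BITS = {
    0: ("und", "Undefined"),
    4: ("dlp", "Data Link Protocol Error"),
    5: ("sd", "Surprise Down"),
    12: ("ptlp", "Poisoned TLP"),
    13: ("fcp", "Flow Control Protocol Error"),
    14: ("cto", "Completion Timeout"),
    15: ("ca", "Completer Abort"),
    16: ("uc", "Unexpected Completion"),
    17: ("rof", "Receiver Overflow"),
    18: ("mtlp", "Malformed TLP"),
    19: ("ecrc", "ECRC Error"),
    20: ("ur", "Unsupported Request"),
}


def decode_uncorrectable(status):
    """Return the (abbreviation, name) pairs whose bits are set, lowest bit first."""
    return [UNCORRECTABLE_BITS[bit] for bit in sorted(UNCORRECTABLE_BITS)
            if status & (1 << bit)]


def errrec_style(status):
    """Format like the !errrec line: highest bit first, set bits in upper case."""
    parts = []
    for bit in sorted(UNCORRECTABLE_BITS, reverse=True):
        abbr = UNCORRECTABLE_BITS[bit][0]
        parts.append(abbr.upper() if status & (1 << bit) else abbr)
    return " ".join(parts)


if __name__ == "__main__":
    # Uncorrectable Error Status from the record above.
    print(errrec_style(0x00000020))
    print(decode_uncorrectable(0x00000020))
```

Feeding it the Uncorrectable Error Severity value 00062010 from the same record should likewise give back the MTLP, ROF, FCP and DLP capitals that !errrec shows.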

So now that we have a specific description of the error, we need to think about what would cause it. As far as I can speculate, either the root port or the card could be responsible. In addition, I would expect anything physically impeding the connection, such as dust or another contaminant in the slot, or a card that isn't seated squarely, to create this problem. The client is still in the process of determining this. However, on googling it, I found this to be rather prevalent among owners of the ASRock X58 motherboard, especially in that they tend to have a lot of PCI-E WHEA error events in their event logs. I figure this may be related.

I'll get back on a definitive answer of this case. Until then, you at least now know what to do when you come across one of these WHEA errors. Have fun!

UPDATE:

The culprit actually turned out to be the TV tuner card in a PCI slot (not PCI-E). I was wrong to assume the client had only the video card installed. I had also forgotten that the same controller is responsible for both the PCI and PCI-E buses, so on newer boards this error can come from PCI cards as well.

Vir Gnarus · Apr 26, 2012

Bump: Updated.

zigzag3143 · Apr 26, 2012

Thorough analysis thanks. Keep them coming. The DMP looks vaguely familiar.

Vir Gnarus · Apr 26, 2012

Yep, it should be. It's from a case I helped solve at SF. I attached the dmp file just now.

Teln3t · Apr 30, 2012

Hi....I've been having an extremely similar issue as the person you helped assist for the past month or so now. I'm very technically minded and I've always been able to debug my way to answers before. But not this time. Your case is the only one somewhat similar to mine, Vir Gnarus. I'm wondering if your expertise could be of use.

Here's the issue. My computer crashes very randomly. Mostly it will go all day, but late at night or in the wee hours of the morning it will crash. The crash is a sudden black screen....no blue screen. Occasionally there is audio that repeats, but not always. Both monitors go black and I'm forced to cold-reboot. After the first crash of the day, the others follow very shortly after when trying to do intensive gaming, loading games, etc.

Here is the debug output and !errrec

Code:

-------------------------------------------------------------------------------------------


Loading Dump File [C:\Windows\Minidump\043012-22386-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available


Symbol search path is: SRV*c:\symbols*http://msdl.microsoft.com/download/symbols
Executable search path is: 
Windows 7 Kernel Version 7600 MP (8 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 7600.16617.amd64fre.win7_gdr.100618-1621
Machine Name:
Kernel base = 0xfffff800`03661000 PsLoadedModuleList = 0xfffff800`0389ee50
Debug session time: Mon Apr 30 01:44:53.763 2012 (GMT-4)
System Uptime: 0 days 0:07:47.012
Loading Kernel Symbols
...............................................................
................................................................
........................
Loading User Symbols
Loading unloaded module list
....
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************


Use !analyze -v to get detailed debugging information.


BugCheck 124, {4, fffffa8005c8e038, 0, 0}


Probably caused by : hardware


Followup: MachineOwner
---------


7: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************


WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000004, PCI Express Error
Arg2: fffffa8005c8e038, Address of the WHEA_ERROR_RECORD structure.
Arg3: 0000000000000000
Arg4: 0000000000000000


Debugging Details:
------------------




BUGCHECK_STR:  0x124_4


CUSTOMER_CRASH_COUNT:  1


DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT


PROCESS_NAME:  System


CURRENT_IRQL:  7


STACK_TEXT:  
fffff880`03199a78 fffff800`0362a903 : 00000000`00000124 00000000`00000004 fffffa80`05c8e038 00000000`00000000 : nt!KeBugCheckEx
fffff880`03199a80 fffff800`037e7593 : 00000000`00000001 fffffa80`05c760c0 00000000`00000000 fffffa80`05c75b70 : hal!HalBugCheckSystem+0x1e3
fffff880`03199ac0 fffff880`00e20aff : fffffa80`00000750 fffffa80`05c760c0 00000000`00000001 fffffa80`05c8d7f0 : nt!WheaReportHwError+0x263
fffff880`03199b20 fffff880`00e20526 : 00000000`00000000 fffff880`03199c70 fffffa80`051c4d80 00000000`000000ff : pci!ExpressRootPortAerInterruptRoutine+0x27f
fffff880`03199b80 fffff800`036cd53c : fffff880`03171180 fffff880`03199c70 fffffa80`051c4d80 00000000`00000001 : pci!ExpressRootPortInterruptRoutine+0x36
fffff880`03199bf0 fffff800`036d9ec2 : fffff880`03171180 fffff880`00000001 00000000`00000001 fffff880`00000000 : nt!KiInterruptDispatch+0x16c
fffff880`03199d80 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x32




STACK_COMMAND:  kb


FOLLOWUP_NAME:  MachineOwner


MODULE_NAME: hardware


IMAGE_NAME:  hardware


DEBUG_FLR_IMAGE_TIMESTAMP:  0


FAILURE_BUCKET_ID:  X64_0x124_4_PCIEXPRESS


BUCKET_ID:  X64_0x124_4_PCIEXPRESS


Followup: MachineOwner
---------


7: kd> !errrec fffffa8005c8e038
===============================================================================
Common Platform Error Record @ fffffa8005c8e038
-------------------------------------------------------------------------------
Record Id     : 01cd269348ca35c5
Severity      : Fatal (1)
Length        : 672
Creator       : Microsoft
Notify Type   : PCI Express Error
Timestamp     : 4/30/2012 5:44:53
Flags         : 0x00000000


===============================================================================
Section 0     : PCI Express
-------------------------------------------------------------------------------
Descriptor    @ fffffa8005c8e0b8
Section       @ fffffa8005c8e148
Offset        : 272
Length        : 208
Flags         : 0x00000001 Primary
Severity      : Fatal


Port Type     : Root Port
Version       : 1.1
Command/Status: 0x4010/0x0547
Device Id     :
  VenId:DevId : 8086:340a
  Class code  : 030400
  Function No : 0x00
  Device No   : 0x03
  Segment     : 0x0000
  Primary Bus : 0x00
  Second. Bus : 0x00
  Slot        : 0x0000
Dev. Serial # : 0000000000000000
Express Capability Information @ fffffa8005c8e17c
  Device Caps : 00008021 Role-Based Error Reporting: 1
  Device Ctl  : 0107 ur FE NF CE
  Dev Status  : 000d UR FE nf CE
   Root Ctl   : 0008 fs nfs cs


AER Information @ fffffa8005c8e1b8
  Uncorrectable Error Status    : 00140000 UR ecrc MTLP rof uc ca cto fcp ptlp sd dlp und
  Uncorrectable Error Mask      : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
  Uncorrectable Error Severity  : 00062010 ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und
  Correctable Error Status      : 000020c1 ADV rtto rnro DLLP TLP RE
  Correctable Error Mask        : 00000000 adv rtto rnro dllp tlp re
  Caps & Control                : 00000014 ecrcchken ecrcchkcap ecrcgenen ecrcgencap FEP
  Header Log                    : 34000000 02000030 00000000 00000000
  Root Error Command            : 00000000 fen nfen cen
  Root Error Status             : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
  Correctable Error Source ID   : 00,00,00
  Correctable Error Source ID   : 00,00,00


===============================================================================
Section 1     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa8005c8e100
Section       @ fffffa8005c8e218
Offset        : 480
Length        : 192
Flags         : 0x00000000
Severity      : Informational


Proc. Type    : x86/x64
Instr. Set    : x64
CPU Version   : 0x00000000000106a5
Processor ID  : 0x0000000000000007
---------------------------------------------------------------------------------------------------------

That's as far as I can go....I've Googled through hell and high water and your issue with that one man is the only thing that has come remotely close to this strange error. Please....does any of this make a lightbulb go on? Does anything here really describe the issue? The most technical information I could find pertaining to this 'nt!KeBugCheckEx' was here: http://uninformed.org/index.cgi?v=3&a=3&p=17

I'll keep this window tabbed for the next few days in case you respond. I'm crossing my fingers.

Vir Gnarus · Apr 30, 2012

Hi mate,

As mentioned in the OP, the best thing to do is to first figure out what actually made the report, then follow that up by going to the appropriate structure for it and then look at the error code associated with the error given.

For the first step, check the Port Type in the WHEA error structure you saw in the !errrec output:

Code:

Port Type     : Root Port

So it's the root port that generated the error. Since that's the case, like in the OP, we need to look at the PCI_EXPRESS_ROOTPORT_AER_CAPABILITY structure (listed at the top of the OP) and check the PCI_EXPRESS_UNCORRECTABLE_ERROR_STATUS substructure under it. Then match the names of the errors to the ones capitalized in the list of abbreviated errors in the AER data of the !errrec output:

Code:

AER Information @ fffffa8005c8e1b8
   Uncorrectable Error Status    : 00140000 UR ecrc MTLP rof uc ca cto fcp ptlp sd dlp und
   Uncorrectable Error Mask      : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
   Uncorrectable Error Severity  : 00062010 ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und
   Correctable Error Status      : 000020c1 ADV rtto rnro DLLP TLP RE
   Correctable Error Mask        : 00000000 adv rtto rnro dllp tlp re
   Caps & Control                : 00000014 ecrcchken ecrcchkcap ecrcgenen ecrcgencap FEP
   Header Log                    : 34000000 02000030 00000000 00000000
   Root Error Command            : 00000000 fen nfen cen
   Root Error Status             : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
   Correctable Error Source ID   : 00,00,00
   Correctable Error Source ID   : 00,00,00

UR and MTLP are the ones capitalized. In the article for the Uncorrectable Error Status, I see these correspond to UnsupportedRequestError and MalformedTLP. Checking the PCI-Express Base Specification Revision 1.1*, I looked through the list of error codes and found that UnsupportedRequestError is not a fatal error, but MalformedTLP is. The UR error is commonly used to report request errors, so it's not unusual to see it alongside the MTLP error.
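The same fatal/non-fatal split can also be read straight off the record, assuming the AER convention that a 1 in the Uncorrectable Error Severity register marks that error as fatal and a 0 marks it non-fatal. A small Python sketch of that check follows; the constant and function names are mine, and the bit positions (18 for MTLP, 20 for UR) are assumed from the spec.

```python
# Assumed AER bit positions for the two errors discussed above.
MTLP = 1 << 18  # Malformed TLP
UR = 1 << 20    # Unsupported Request


def split_by_severity(status, severity):
    """Split the reported errors into (fatal, non_fatal) bit masks.

    A reported error is treated as fatal when its bit is also set in the
    Uncorrectable Error Severity register, and as non-fatal otherwise.
    """
    fatal = status & severity
    non_fatal = status & ~severity
    return fatal, non_fatal


if __name__ == "__main__":
    # Status and Severity values copied from the !errrec output above.
    fatal, non_fatal = split_by_severity(0x00140000, 0x00062010)
    print("MTLP fatal:", bool(fatal & MTLP))
    print("UR fatal:  ", bool(fatal & UR))
```

With the values from this dump, only the MTLP bit lands in the fatal mask, which agrees with what the spec tables say.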

From what I skimmed in the PDF of the PCI-E Base Spec, the MTLP error can be triggered by a number of causes. What I'm most concerned about, though, is the source of the malformed TLP. Right now I'm trying to figure out how one can determine that using WinDBG.

* - I figured which revision of the specs to use by looking up the version of your root port, which is mentioned below its name in the !errrec output.

Teln3t · Apr 30, 2012

Fantastic work! I can't read the whole post that I made on here, but I'm sure you saw it, judging by your post. But yes, that's the problem for me too. I can't find the cause for the life of me. At first I thought it was a faulty voltage regulator on the GPU (that turned out to be false). Later, I thought it was the motherboard (but a full system hardware diagnostic found no faulty GPU, RAM, MB, or CPU).

The only thing that keeps coming to mind is a power failure of some sort...but I can't pinpoint where!

If you need anything else from me, please let me know.

Vir Gnarus · Apr 30, 2012

A TLP can go either way through the PCI-E bus. It can originate from the CPU, memory, a PCI card, etc., which is what makes this particularly difficult. I do see the Header Log for the error, which can give us clues, but I'll have to interpret the packet header it contains to figure out just what's happening. So far I can tell what format the TLP is, but not the Requester ID in the header, which would tell us the source of the TLP.
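For anyone who wants to try pulling fields out of the Header Log, here is a rough Python sketch. It rests on several assumptions that I have not confirmed against this dump: that the first two values printed are header DWORDs 0 and 1 with header byte 0 in the most significant byte, that the PCIe 1.1 layout applies (Fmt in bits 6:5 and Type in bits 4:0 of byte 0), and that the logged TLP is a request, so that header bytes 4 and 5 hold the Requester ID. The function name is hypothetical.

```python
def decode_header_log(dw0, dw1):
    """Extract Fmt, Type and a candidate Requester ID from two Header Log DWORDs.

    Assumes header byte 0 sits in bits 31:24 of dw0 and that the TLP is a
    request, so bits 31:16 of dw1 are the Requester ID (bus/device/function).
    """
    byte0 = (dw0 >> 24) & 0xFF
    requester_id = (dw1 >> 16) & 0xFFFF
    return {
        "fmt": (byte0 >> 5) & 0x3,       # header format (size, with/without data)
        "type": byte0 & 0x1F,            # TLP type field
        "bus": requester_id >> 8,        # Requester ID bits 15:8
        "device": (requester_id >> 3) & 0x1F,  # Requester ID bits 7:3
        "function": requester_id & 0x7,  # Requester ID bits 2:0
    }


if __name__ == "__main__":
    # First two Header Log values from the !errrec output above.
    for name, value in decode_header_log(0x34000000, 0x02000030).items():
        print(f"{name:8}: {value:#x}")
```

If the byte-order assumption is wrong, the fields come out scrambled, so treat the result as a hint to cross-check against the device tree rather than an answer.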

Your best bet is to start speculating that this may be caused by a PCI-E card or PCI card. The PCI-E root port services both PCI legacy and PCI-E buses.

As for it being a power issue, I kinda don't think so. Communication over the PCI-E bus goes through three layers: Physical Layer, Data Link Layer, and Transaction Layer. There are error checks at all three, and anything that fails at a lower layer (like the Physical Layer) doesn't get passed up to the upper layers. A malformed TLP is a Transaction Layer error, which means the CRC checks and whatnot at the lower layers passed. So I think this is more likely a bug in the piece of hardware (like one of the PCI cards) where the TLP originated; otherwise it wouldn't have passed the lower-level checks. I guess a power issue residing in a piece of hardware, rather than in the bus linking them, could cause this, but I doubt it. However, again, I don't have a solid foundation of knowledge on all of this. This is all pretty new to me as well.

Btw, I was not responsible for moderating your post. I guess it was moderated because it appeared to be a request for help with a specific issue (which it is), so it didn't look relevant here. However, I personally think it's a good example of troubleshooting using this approach. If the diagnostic endeavor has to go beyond this, then I'll move your stuff into its own thread so people can assist better.

Teln3t · Apr 30, 2012

This is all very new ground for me too, Vir Gnarus. I've programmed for years in C++ and the Winsock API, PHP, and worked on finding memory exploits in various programs. Yet all of this BSOD and/or 'Black-Screen' error troubleshooting goes very...very deep. When you take into consideration the vast number of variables that can arise from multiple pieces of software, drivers, hardware...things can easily get complicated. So yes, this is very new ground for me as well.

There is no rush my friend, as I said before in my first post, the computer is generally fine for the whole day once I take it out of sleep mode. However, towards the late-night (12-4am) it crashes when trying to do any intensive gaming or even whilst just loading an intensive gaming-map.

If you think that this one needs a new thread, even though the checks are similar, I'm totally okay with that. That's either your call or the mod's call. The whole goal is to be able to provide a resource of knowledge for future Googlers.

I do enjoy a challenge however. And, there is no rush. Take your time and if you need ANY information, please let me know.

P.S. Oh, and I never knew of the Physical Layer, Data Link Layer, and Transaction Layer thing. That much is very interesting to me.

Vir Gnarus · Apr 30, 2012

Yah, the layers operate a lot like those of the OSI model for computer networking. In essence, a motherboard is just a miniature network that ties all the pieces of hardware together, and all of them are like little computers sending info to and fro. Once I realized this, understanding them came a lot easier. Who would have thought my classes in networking would help me with PCI-Express!

I'll dig through the PCI-Express spec documentation some more to find anything. If you want more details on the general concept behind PCI-Express, I found this, all the way up to the Transaction Layer page, to be a good primer on PCI-Express and the rules behind it. The PCI-Express spec documentation can also be found by googling it.

Teln3t · May 1, 2012

Hey Vir, just thought I would throw you an update for today. My computer crashed again tonight, in the same 12-4AM time frame. I felt compelled to go into the standard Windows event logger and check for anything interesting. I saw that there were 100+ warning messages leading up to the crash (and I believe the warnings caused it).

Here's the error

Code:

A corrected hardware error has occurred.


Component: PCI Express Root Port
Error Source: Advanced Error Reporting (PCI Express)


Bus:Device:Function: 0x0:0x3:0x0
Vendor ID:Device ID: 0x8086:0x340a
Class Code: 0x30400


The details view of this entry contains further information.

ANYWAY. I did some Googling and found a post on the second page of this EVGA thread. Keep in mind that we ATI users were also getting this error/warning message.
http://www.evga.com/forums/fb.ashx?m=647987
The post is by tpb7463 (the guy with the cat icon).

Cool huh?
Well, I went ahead and modified the registry values and I'm sitting here typing this message without a crash yet. But, something is definitely up. I don't know if this will fix the issue (hell, it probably won't). But, we'll see how things go.

Nevertheless, the standard Windows event logger told me to view the 'Details' tab for further detailed information. Thanks Windows.

Code:


ErrorSource 4

FRUId {00000000-0000-0000-0000-000000000000}

FRUText 

ValidBits 0xdf

PortType 4

Version 0x101

Command 0x10

Status 0x547

Bus 0x0

Device 0x3

Function 0x0

Segment 0x0

SecondaryBus 0x0

Slot 0x0

VendorID 0x8086

DeviceID 0x340a

ClassCode 0x30400

DeviceSerialNumber 0x0

BridgeControl 0x0

BridgeStatus 0x0

UncorrectableErrorStatus 0x0

CorrectableErrorStatus 0x1000

HeaderLog 00000000000000000000000000000000

Length 672

RawData 435045521002FFFFFFFF02000200000002000000A0020000282205001E040C140000000000000000000000000000000000000000000000000000000000000000BDC407CF89B7184EB3C41F732CB571311FC093CF161AFC4DB8BC9C4DAF67C104DB7FBA6C4626CD0100000000000000000000000000000000000000000000000010010000D0000000010200000100000054E995D9C1BB0F43AD91B44DCB3C6F3500000000000000000000000000000000020000000000000000000000000000000000000000000000E0010000C00000000102000000000000ADCC7698B447DB4BB65E16F193C4F3DB00000000000000000000000000000000030000000000000000000000000000000000000000000000DF000000000000000400000001010000100047050000000086800A3400040300030000000000000000000000000000000000000010E042012180000007010100023D3B03400001F11F790206C0074801080001000000000000000000000000000000000000000000000000000000000001000115000000000000000010200600001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000043010000000000000002000000000000A506010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007000000000000000000000000000000000000000000000000000000000000000000000000000000
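For anyone who wants to poke at the RawData string: it appears to begin with the Common Platform Error Record header. The Python sketch below parses only the first 24 bytes and is an illustration, not a full parser. It assumes the header layout from the UEFI CPER definition (signature, revision, signature end, section count, severity, validation bits, record length, all little-endian); the field names are my own labels.

```python
import struct

# First 24 bytes (48 hex digits) of the RawData value shown above.
RAW_PREFIX = "435045521002FFFFFFFF02000200000002000000A0020000"

# Assumed layout: 4-byte signature, u16 revision, u32 signature end,
# u16 section count, u32 severity, u32 validation bits, u32 record length.
HEADER_FORMAT = "<4sHIHIII"


def parse_cper_header(raw_hex):
    """Decode the leading fixed fields of a CPER record given as a hex string."""
    size = struct.calcsize(HEADER_FORMAT)
    data = bytes.fromhex(raw_hex[: size * 2])
    (signature, revision, signature_end, section_count,
     severity, validation_bits, record_length) = struct.unpack(HEADER_FORMAT, data)
    return {
        "signature": signature,
        "revision": revision,
        "signature_end": signature_end,
        "section_count": section_count,
        "severity": severity,
        "validation_bits": validation_bits,
        "record_length": record_length,
    }


if __name__ == "__main__":
    for name, value in parse_cper_header(RAW_PREFIX).items():
        print(f"{name:16}: {value}")
```

The record length it reports should agree with the Length field in the Details view, and the section count with the two sections !errrec lists for the crash record, which is a reasonable sanity check that the layout assumption holds.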

Vir Gnarus · May 1, 2012

I'm finding out a lot more about this PCI-E thing as this progresses. Thanks for the additional information.

I did some more research in the PCI-E spec documentation, and it does sound very much related to this case here, as far as I can tell. Here's the process I used to figure this out. Again, the PCI-E spec documentation for Revision 1.1 is crucial here. Google it to find a PDF version of it (the primary site is members-only).

I looked at the detailed information and it says CorrectableErrorStatus of 0x1000. I sifted through the PCI-E spec doc and found the AER Correctable Error Status bits and what error each one represents. Only one bit is set here, bit 12 (remember, numbering starts at 0, not 1), and that bit is the Replay Timer Timeout error. Having no clue what the Replay Timer is, I looked it up in the spec doc. I got this:

REPLAY_TIMER - Counts time since last Ack or Nak DLLP received

Ack and Nak in networking terms mean Acknowledged and Not Acknowledged, which are the two most generic responses a device (or a computer, in computer networking) will give to something. So really this timer just tracks how long it has been since the last Ack or Nak DLLP was received. What's a DLLP? Again, to the spec doc! Under Terms and Acronyms, the following is available:

Data Link Layer Packet,
DLLP A Packet generated in the Data Link Layer to support Link management
functions.

Ah ha, so it's lower than a TLP (Transaction Layer Packet) and it's basically for link management (typical of the Data Link Layer). So this tells us there's a timeout for activity on a particular link. I can now start seeing why this would be a power issue.
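The bit lookup described above can be sketched in a few lines. The bit-to-name table is my transcription of the AER Correctable Error Status register from the PCI-E spec, so double-check it against your copy; `decode_correctable_status` is just an illustrative helper name.

```python
# Bit positions in the AER Correctable Error Status register
# (transcribed from the PCI-E spec; verify against your revision).
CORRECTABLE_ERROR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}


def decode_correctable_status(status):
    """Return the names of all bits set in a CorrectableErrorStatus value."""
    names = []
    for bit in range(32):
        if status & (1 << bit):
            # Bits not in the table are reported by position rather than guessed at.
            names.append(CORRECTABLE_ERROR_BITS.get(bit, "Reserved/unknown bit %d" % bit))
    return names


# CorrectableErrorStatus from the event details: only bit 12 is set.
print(decode_correctable_status(0x1000))
```

The same helper works when several bits are set at once, which is handy if the event log shows other values later on.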

TBP (the guy with the cat avatar) on the thread you referenced didn't seem to mention much relevant to this. However, a guy a few posts down, named terminou, does present something viable as a solution (or at least a workaround). He mentions that turning off ASPM (Active State Power Management) - or Link State Power Management - for the PCI Express bus seemed to have done the trick. You can do this by going to your Power plan for the PC (Power Options > Change plan settings > Change advanced power settings), then going to PCI Express and turning off the associated item. See if that will stifle those error messages.

I'm not sure, though, how this is related to the Malformed TLP fatal error that gives you the PCI-E BSODs. Still, we can work with that so far and go from there. Maybe these are two separate errors caused by two separate issues, with the Malformed TLP related to the HDCP incompatibility and the bunch of Replay Timer Timeouts coming from ASPM. Again, check and see, and let us evaluate the results.

Teln3t · May 1, 2012

Alright, just set the PCI-E power management to off. I'll give the computer a restart and let you know about the error messages and if there are any additional crashes.

Teln3t · May 1, 2012

Update: Nope, computer crashed and I just rebooted from it. Oddly enough this time it just restarted. There IS a new memory dump, but it's much different than the others.

Well, I guess it can't hurt:

Code:

Microsoft (R) Windows Debugger Version 6.11.0001.404 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.




Loading Dump File [C:\Windows\MEMORY.DMP]
Kernel Summary Dump File: Only kernel address space is available


Symbol search path is: SRV*c:\symbols*http://msdl.microsoft.com/download/symbols
Executable search path is: 
Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
Missing image name, possible paged-out or corrupt data.
*** WARNING: Unable to verify timestamp for Unknown_Module_00350033`00330031
*** ERROR: Module load completed but symbols could not be loaded for Unknown_Module_00350033`00330031
Debugger can not determine kernel base address
Windows 7 Kernel Version 7600 MP (8 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 7600.16617.amd64fre.win7_gdr.100618-1621
Machine Name:
Kernel base = 0xfffff800`03651000 PsLoadedModuleList = 0xfffff800`0388ee50
Debug session time: Tue May  1 11:56:25.058 2012 (GMT-4)
System Uptime: 0 days 0:20:56.323
Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
Missing image name, possible paged-out or corrupt data.
*** WARNING: Unable to verify timestamp for Unknown_Module_00350033`00330031
*** ERROR: Module load completed but symbols could not be loaded for Unknown_Module_00350033`00330031
Debugger can not determine kernel base address
Loading Kernel Symbols
Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
Missing image name, possible paged-out or corrupt data.
.Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141


WARNING: .reload failed, module list may be incomplete
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************


Use !analyze -v to get detailed debugging information.


BugCheck 101, {19, 0, fffff88003100180, 6}


***** Debugger could not find nt in module list, module list might be corrupt, error 0x80070057.


Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
Missing image name, possible paged-out or corrupt data.
Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
WARNING: .reload failed, module list may be incomplete
Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
Missing image name, possible paged-out or corrupt data.
Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
WARNING: .reload failed, module list may be incomplete
Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
Missing image name, possible paged-out or corrupt data.
Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
WARNING: .reload failed, module list may be incomplete
Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
Missing image name, possible paged-out or corrupt data.
Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
WARNING: .reload failed, module list may be incomplete
Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
Missing image name, possible paged-out or corrupt data.
Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
WARNING: .reload failed, module list may be incomplete
Probably caused by : Unknown_Image ( ANALYSIS_INCONCLUSIVE )


Followup: MachineOwner
---------


0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************


CLOCK_WATCHDOG_TIMEOUT (101)
An expected clock interrupt was not received on a secondary processor in an
MP system within the allocated interval. This indicates that the specified
processor is hung and not processing interrupts.
Arguments:
Arg1: 0000000000000019, Clock interrupt time out interval in nominal clock ticks.
Arg2: 0000000000000000, 0.
Arg3: fffff88003100180, The PRCB address of the hung processor.
Arg4: 0000000000000006, 0.


Debugging Details:
------------------


***** Debugger could not find nt in module list, module list might be corrupt, error 0x80070057.


Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
Missing image name, possible paged-out or corrupt data.
Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
WARNING: .reload failed, module list may be incomplete
Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
Missing image name, possible paged-out or corrupt data.
Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
WARNING: .reload failed, module list may be incomplete
Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
Missing image name, possible paged-out or corrupt data.
Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
WARNING: .reload failed, module list may be incomplete
Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
Missing image name, possible paged-out or corrupt data.
Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
WARNING: .reload failed, module list may be incomplete
Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
Missing image name, possible paged-out or corrupt data.
Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
WARNING: .reload failed, module list may be incomplete


BUGCHECK_STR:  CLOCK_WATCHDOG_TIMEOUT_8_PROC


DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT


CURRENT_IRQL:  0


STACK_TEXT:  
fffff880`033e7508 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0xfffff800`036c1740




STACK_COMMAND:  kb


SYMBOL_NAME:  ANALYSIS_INCONCLUSIVE


FOLLOWUP_NAME:  MachineOwner


MODULE_NAME: Unknown_Module


IMAGE_NAME:  Unknown_Image


DEBUG_FLR_IMAGE_TIMESTAMP:  0


BUCKET_ID:  CORRUPT_MODULELIST


Followup: MachineOwner
---------

I just can't help but think that I have a faulty motherboard...

Vir Gnarus · May 1, 2012

Yes, this is quite different. It's more CPU related. While drivers can cause it with things like race conditions or erroneous use of IRQLs, it all involves the CPU getting stuck in a state it can't escape from.

Btw, in your event log, is it still getting pounded by those other PCI-E Correctable Error events?

Teln3t · May 1, 2012

Checking....

Nope, the last ones to happen were the cause of the first crash last night at 12:40PM

As you can see this is very strange. The computer will run fine all day, but at night time, after the first crash, the computer crashes periodically for the rest of the night. It's very...very strange.

EDIT: If you'd like, we can do a remote TeamViewer session and you can look through all the system logs. Like I said, any information that would be useful you can have.

Vir Gnarus · May 1, 2012

So it appears the power management thing resolved the event log errors, which were all correctable, but the uncorrectable ones continue. I had a feeling they were unrelated.

Anyways, I'll still try and digest all of this to see what we can do. The thing I want to figure out is exactly what sent the Malformed TLPs that were causing the original BSODs. If this CLOCK_WATCHDOG_TIMEOUT BSOD is any indication of what's been going on so far, then it looks to be a problematic CPU. We'll see.

Vir Gnarus · May 2, 2012

I'm looking more into the AER info and I don't believe it's able to tell us anything important about the exact cause. It appears the erroneous TLP was passed along a number of times, getting marked with separate errors, before it reached the device that flagged it as a fatal error (causing the BSOD). We'd be hard-pressed to find any identifier showing where it originated.

If you have any minidumps at all (located in the /Windows/Minidump directory), please zip em up and send em to us. I'd like to see all of them for any cross-patterns I may find.

Teln3t · May 2, 2012

Alright then. Here are all of them. Some older than others but hey, maybe you can find something.

http://speedy.sh/kPZnD/download/All-Dump-Files.rar

Vir Gnarus · May 3, 2012

Can you attach them to your post directly? They are small enough that zipping them should work. I cannot access them due to firewall restrictions against that site.
