  1. #1

    Join Date
    Mar 2012
    Posts
    469

    PCI-E WHEA errors (0x124)

    Quick Reference:
    PCI_EXPRESS_AER_CAPABILITY structure
    PCI_EXPRESS_BRIDGE_AER_CAPABILITY structure
    PCI_EXPRESS_ROOTPORT_AER_CAPABILITY structure



    Hi everybody,

    I'd like to present a small personal discovery of mine. Most WHEA errors you come across will, as the BSOD Methods & Tips section describes, be related to CPU or memory issues. Occasionally, though, you'll get one that involves the PCI-E bus. On motherboards with both PCI and PCI-E slots, the same controller is responsible for both, so the culprit can turn out to be a PCI device as well, as you will see later on. In this case the output is quite different from the usual and may appear daunting and indecipherable at first. Have no fear! With the right knowledge base (MSDN) you shouldn't have much of a problem figuring it out.

    Let's start by doing an !analyze -v:

    Code:
    3: kd> !analyze -v
    *******************************************************************************
    *                                                                             *
    *                        Bugcheck Analysis                                    *
    *                                                                             *
    *******************************************************************************
    
    WHEA_UNCORRECTABLE_ERROR (124)
    A fatal hardware error has occurred. Parameter 1 identifies the type of error
    source that reported the error. Parameter 2 holds the address of the
    WHEA_ERROR_RECORD structure that describes the error conditon.
    Arguments:
    Arg1: 00000004, PCI Express Error
    Arg2: 869348d4, Address of the WHEA_ERROR_RECORD structure.
    Arg3: 00000000
    Arg4: 00000000
    
    Debugging Details:
    ------------------
    
    TRIAGER: Could not open triage file : C:\Program Files (x86)\Windows Kits\8.0\Debuggers\x64\triage\modclass.ini, error 2
    
    BUGCHECK_STR:  0x124_GenuineIntel
    
    CUSTOMER_CRASH_COUNT:  1
    
    DEFAULT_BUCKET_ID:  WIN7_DRIVER_FAULT
    
    PROCESS_NAME:  System
    
    CURRENT_IRQL:  a
    
    STACK_TEXT:  
    80e4cb2c 8341afcd 00000124 00000004 869348d4 nt!KeBugCheckEx+0x1e
    80e4cb68 83506fc4 869334e1 869348d4 8691bc10 hal!HalBugCheckSystem+0xab
    80e4cb9c 8c7ce609 8691b638 8690a780 80e4cd20 nt!WheaReportHwError+0x230
    80e4cbb4 8c7cf088 869344b4 00000000 8691b638 pci!ExpressRootPortAerInterruptRoutine+0x1e7
    80e4cbd8 8c7cf264 8690a780 86934008 80e4cbfc pci!ExpressRootPortInterruptRoutine+0x1a
    80e4cbe8 834a9cff 8690a780 86934008 00000001 pci!ExpressRootPortMessageRoutine+0x10
    80e4cbfc 83474ded 8690a780 86934008 80e4cc28 nt!KiInterruptMessageDispatch+0x12
    80e4cbfc 93be45d6 8690a780 86934008 80e4cc28 nt!KiInterruptDispatch+0x6d
    WARNING: Stack unwind information not available. Following frames may be wrong.
    80e4cc98 8349ada4 888d8d48 80e35800 80e30000 intelppm+0x15d6
    80e4cd20 834985ad 00000000 0000000e ab16ab16 nt!PoIdle+0x524
    80e4cd24 00000000 0000000e ab16ab16 8bdf8bdf nt!KiIdleLoop+0xd
    
    
    STACK_COMMAND:  kb
    
    FOLLOWUP_NAME:  MachineOwner
    
    MODULE_NAME: GenuineIntel
    
    IMAGE_NAME:  GenuineIntel
    
    DEBUG_FLR_IMAGE_TIMESTAMP:  0
    
    FAILURE_BUCKET_ID:  0x124_GenuineIntel_PCIEXPRESS
    
    BUCKET_ID:  0x124_GenuineIntel_PCIEXPRESS
    
    Followup: MachineOwner
    ---------
    As with any WHEA BSOD, the analysis itself isn't very explanatory. Really, the only things worth looking at are the bugcheck arguments. Here Arg1 holds subcode 0x4, which means a PCI Express error, and Arg2, as usual, holds the address of the WHEA error record. As with any 0x124 bugcheck, we take this value and point !errrec at it:

    Code:
    3: kd> !errrec 869348d4
    ===============================================================================
    Common Platform Error Record @ 869348d4
    -------------------------------------------------------------------------------
    Record Id     : 01cd07d8bce4740f
    Severity      : Fatal (1)
    Length        : 672
    Creator       : Microsoft
    Notify Type   : PCI Express Error
    Timestamp     : 3/22/2012 3:06:44 (UTC)
    Flags         : 0x00000000
    
    ===============================================================================
    Section 0     : PCI Express
    -------------------------------------------------------------------------------
    Descriptor    @ 86934954
    Section       @ 869349e4
    Offset        : 272
    Length        : 208
    Flags         : 0x00000001 Primary
    Severity      : Recoverable
    
    Port Type     : Root Port
    Version       : 1.1
    Command/Status: 0x4010/0x0507
    Device Id     :
      VenId:DevId : 8086:340a
      Class code  : 030400
      Function No : 0x00
      Device No   : 0x03
      Segment     : 0x0000
      Primary Bus : 0x00
      Second. Bus : 0x00
      Slot        : 0x0000
    Dev. Serial # : 0000000000000000
    Express Capability Information @ 86934a18
      Device Caps : 00008021 Role-Based Error Reporting: 1
      Device Ctl  : 0107 ur FE NF CE
      Dev Status  : 0003 ur fe NF CE
       Root Ctl   : 0008 fs nfs cs
    
    AER Information @ ffffffff86934a54
      Uncorrectable Error Status    : 00000020 ur ecrc mtlp rof uc ca cto fcp ptlp SD dlp und
      Uncorrectable Error Mask      : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
      Uncorrectable Error Severity  : 00062010 ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und
      Correctable Error Status      : 00000000 adv rtto rnro dllp tlp re
      Correctable Error Mask        : 00000000 adv rtto rnro dllp tlp re
      Caps & Control                : 00000005 ecrcchken ecrcchkcap ecrcgenen ecrcgencap FEP
      Header Log                    : 00000000 00000000 00000000 00000000
      Root Error Command            : 00000000 fen nfen cen
      Root Error Status             : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
      Correctable Error Source ID   : 00,00,00
      Correctable Error Source ID   : 00,00,00
    
    ===============================================================================
    Section 1     : Processor Generic
    -------------------------------------------------------------------------------
    Descriptor    @ 8693499c
    Section       @ 86934ab4
    Offset        : 480
    Length        : 192
    Flags         : 0x00000000
    Severity      : Informational
    
    Proc. Type    : x86/x64
    Instr. Set    : x86
    CPU Version   : 0x00000000000106a5
    Processor ID  : 0x0000000000000006
    This definitely doesn't look like the typical WHEA output, so we need to digest the information bit by bit. Much of it rarely needs close scrutiny. The most important content here is the AER info, which stands for Advanced Error Reporting. To the untrained eye it looks like a bunch of gibberish, but in this article we'll discover what it means, and that it really is pretty simple stuff.

    But first, let's figure out which component of the PC the error came from. We start with the Device Id. The first entry gives the VenId (Vendor ID) and DevId (Device ID). We can look these numbers up in an online database; I personally picked PCIDatabase.com. Entering the DevId, which is more specific than the VenId, shows that this came from an Intel 7500 Chipset PCIe Root Port. On an X58 board this root port belongs to the chipset's I/O Hub (IOH) rather than to the ICH10 southbridge. Given that the client described his motherboard as an ASRock X58, this information isn't surprising. More to the point, both the Port Type and the DevId description tell us the report came from a root port of the PCI-E bus. This is important, as you'll discover later.
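    To match the reporting port against other tools, it can help to turn the Device Id block into the usual segment:bus:device.function form. A minimal Python sketch, assuming the field labels exactly as !errrec prints them above. The helper names `parse_device_id` and `describe` are made up for illustration, and treating Primary Bus as the bus the port sits on is an assumption:

```python
# Device Id lines copied from the !errrec output above.
DEVICE_ID_TEXT = """
  VenId:DevId : 8086:340a
  Function No : 0x00
  Device No   : 0x03
  Segment     : 0x0000
  Primary Bus : 0x00
"""

def parse_device_id(text):
    """Turn 'label : value' lines into a dict keyed by the label."""
    fields = {}
    for line in text.strip().splitlines():
        label, _, value = line.partition(" : ")
        fields[label.strip()] = value.strip()
    return fields

def describe(fields):
    """Extract the numeric IDs and build a segment:bus:device.function string."""
    vendor, device = (int(part, 16) for part in fields["VenId:DevId"].split(":"))
    segment = int(fields["Segment"], 16)
    bus = int(fields["Primary Bus"], 16)   # assumption: bus the port itself sits on
    dev = int(fields["Device No"], 16)
    func = int(fields["Function No"], 16)
    return {
        "vendor_id": vendor,
        "device_id": device,
        "bdf": f"{segment:04x}:{bus:02x}:{dev:02x}.{func:x}",
    }

info = describe(parse_device_id(DEVICE_ID_TEXT))
print(f"{info['vendor_id']:04x}:{info['device_id']:04x} at {info['bdf']}")
```

    The vendor:device pair is what goes into the ID database, and the bus/device/function triple identifies the same port wherever else it shows up.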

    Anyway, now on to the actual AER information. Since I was as befuddled by the output as you probably are, I searched Google for keywords like "AER" and "PCI Express" to find out what I was looking at. Sure enough, MSDN has an article on it, describing a particular data structure with all the details and descriptions of its members. The first article I found was the one on the PCI_EXPRESS_AER_CAPABILITY structure. Everything described there matches the AER info in the WHEA error record, so I figured this would be easy. I checked the Uncorrectable Error Status (PCI_EXPRESS_UNCORRECTABLE_ERROR_STATUS) subarticle for descriptions of all the odd clusters of letters in the output.

    As I had guessed, these are bit flags, each pertaining to a specific type of error. If the bit for a specific error is set (the value is 1 and not 0), then that error occurred. In the WHEA error record displayed by !errrec, each cluster of letters is a short abbreviation of the error name for the related bit. Lowercase means the bit was not set, and uppercase means it was. As we can tell, the SD bit was set, so we should look through the article to see what SD means.
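    The uppercase/lowercase convention can be reproduced in a few lines of Python. This is only a sketch: the bit positions are taken from the AER Uncorrectable Error Status layout in the PCI Express Base Specification as I understand it, and `render_flags` is a made-up helper, not anything WinDbg exposes:

```python
# (bit position, abbreviation) in the order !errrec prints them.
# Bit positions are an assumption to verify against the PCIe spec.
UNCORRECTABLE_BITS = [
    (20, "ur"),    # Unsupported Request Error
    (19, "ecrc"),  # ECRC Error
    (18, "mtlp"),  # Malformed TLP
    (17, "rof"),   # Receiver Overflow
    (16, "uc"),    # Unexpected Completion
    (15, "ca"),    # Completer Abort
    (14, "cto"),   # Completion Timeout
    (13, "fcp"),   # Flow Control Protocol Error
    (12, "ptlp"),  # Poisoned TLP
    (5, "sd"),     # Surprise Down
    (4, "dlp"),    # Data Link Protocol Error
    (0, "und"),    # Undefined
]

def render_flags(value, layout=UNCORRECTABLE_BITS):
    """Uppercase the abbreviation of every bit that is set in value."""
    return " ".join(
        name.upper() if (value >> bit) & 1 else name
        for bit, name in layout
    )

# Register values copied from the AER Information block above.
print(render_flags(0x00000020))  # Uncorrectable Error Status
print(render_flags(0x00062010))  # Uncorrectable Error Severity
```

    Feeding it the two register values from the dump gives back the same strings !errrec printed, which is a decent sanity check on the bit table.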

    This is where things get problematic. Nothing in the article sounds remotely related to the letters SD, and the other abbreviations don't seem to match any error described there either. Something's not right, so let's go back to the main article on the PCI_EXPRESS_AER_CAPABILITY structure that we initially found and read a bit further.

    Sure enough, in the remarks, the following is stated:

    Root ports and root complex event collectors use the PCI_EXPRESS_ROOTPORT_AER_CAPABILITY structure instead of the PCI_EXPRESS_AER_CAPABILITY structure to describe the PCIe advanced error reporting capability structure.
    This is why it was important to discover what was reporting the error. It could have come from a PCI-E bridge (which has its own AER data structure), from the root port as we found here, or from the PCI-E device itself. So now we go to the article describing the structure specific to root ports, then back to the Uncorrectable Error Status subarticle, and read up on the error bits and their descriptions. The errors now correspond well to those shown in the WHEA record. For example, ecrc means ECRC Error, rof means Receiver Overflow, and so on. In this specific case SD was the only one capitalized, and its description mentions Surprise Down. To me this is self-explanatory, but just in case I googled it along with PCI Express to find out for sure. As I figured, it means there was an unexpected (Surprise!) loss of connection between the root port and the PCI-E device. I took the device in question to be the video card, since during the client's troubleshooting the only PCI-E device he had installed was his video card (read Update).
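    The choice of structure boils down to a small lookup on the Port Type line. A rough Python sketch of that rule: only the "Root Port" spelling actually appears in this thread, so the other strings matched below are guesses at how !errrec might label them, and `aer_structure_for` is a hypothetical helper:

```python
def aer_structure_for(port_type):
    """Pick the MSDN structure that documents the AER block for a reporter.

    port_type is the 'Port Type' text from !errrec. Only "Root Port" is
    confirmed by the output in this thread; other spellings are assumptions.
    """
    kind = port_type.strip().lower()
    # Root ports and root complex event collectors share one structure.
    if "root port" in kind or "event collector" in kind:
        return "PCI_EXPRESS_ROOTPORT_AER_CAPABILITY"
    # Bridges have their own AER structure.
    if "bridge" in kind:
        return "PCI_EXPRESS_BRIDGE_AER_CAPABILITY"
    # Everything else (endpoints, switch ports) uses the generic one.
    return "PCI_EXPRESS_AER_CAPABILITY"

print(aer_structure_for("Root Port"))
```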

    Now that we have a specific description of the error, we need to think about what could cause it. My speculation is that either the root port or the card could be responsible. I would also expect anything physically impeding the connection, like dust or another contaminant in the slot, or a card sitting crooked in the slot, to create this problem. The client is still in the process of determining this. However, on googling this, I found such errors to be rather prevalent among owners of ASRock X58 motherboards, many of whom see a lot of PCI-E WHEA error events in their event logs. I suspect this may be related.

    I'll report back with a definitive answer on this case. Until then, you at least now know what to do when you come across one of these WHEA errors. Have fun!


    UPDATE:

    The culprit actually turned out to be the TV tuner card in a PCI slot (not PCI-E). I was wrong to assume the client had only the video card installed. I had also forgotten that the same controller is responsible for both the PCI and PCI-E buses, so on newer boards this error can come from PCI cards as well.
    Last edited by Vir Gnarus; 04-30-2012 at 09:51 AM.



  2. #2

    Join Date
    Mar 2012
    Posts
    469

    Re: PCI-E WHEA errors (0x124)

    Bump: Updated.

  3. #3
    zigzag3143's Avatar
    Join Date
    Mar 2012
    Posts
    3,741

    Re: PCI-E WHEA errors (0x124)

    Thorough analysis, thanks. Keep them coming. The DMP looks vaguely familiar.

    MS-MVP Windows IT-PRO 2010-2017
    MCC-2013-2017
    Wankiya & Dyami
    Team ZigZag





  4. #4

    Join Date
    Mar 2012
    Posts
    469

    Re: PCI-E WHEA errors (0x124)

    Yep, it should be. It's from a case I helped solve at SF. I attached the dmp file just now.

  5. #5

    Re: PCI-E WHEA errors (0x124)

    Hi. For the past month or so I've been having an issue extremely similar to that of the person you helped. I'm very technically minded and have always been able to debug my way to an answer before, but not this time. Your case is the only one somewhat similar to mine, Vir Gnarus, so I'm wondering if your expertise could be of use.

    Here's the issue. My computer crashes very randomly. Mostly it will run all day, then crash late at night or in the wee hours of the morning. The crash is a sudden black screen, no blue screen. Occasionally there is audio that repeats, but not always. Both monitors go black and I'm forced to cold-reboot. After the first crash of the day, the others follow very shortly after when I try to do intensive gaming, load games, etc.

    Here is the debug output and !errrec:

    Code:
    -------------------------------------------------------------------------------------------
    
    
    Loading Dump File [C:\Windows\Minidump\043012-22386-01.dmp]
    Mini Kernel Dump File: Only registers and stack trace are available
    
    
    Symbol search path is: SRV*c:\symbols*http://msdl.microsoft.com/download/symbols
    Executable search path is: 
    Windows 7 Kernel Version 7600 MP (8 procs) Free x64
    Product: WinNt, suite: TerminalServer SingleUserTS
    Built by: 7600.16617.amd64fre.win7_gdr.100618-1621
    Machine Name:
    Kernel base = 0xfffff800`03661000 PsLoadedModuleList = 0xfffff800`0389ee50
    Debug session time: Mon Apr 30 01:44:53.763 2012 (GMT-4)
    System Uptime: 0 days 0:07:47.012
    Loading Kernel Symbols
    ...............................................................
    ................................................................
    ........................
    Loading User Symbols
    Loading unloaded module list
    ....
    *******************************************************************************
    *                                                                             *
    *                        Bugcheck Analysis                                    *
    *                                                                             *
    *******************************************************************************
    
    
    Use !analyze -v to get detailed debugging information.
    
    
    BugCheck 124, {4, fffffa8005c8e038, 0, 0}
    
    
    Probably caused by : hardware
    
    
    Followup: MachineOwner
    ---------
    
    
    7: kd> !analyze -v
    *******************************************************************************
    *                                                                             *
    *                        Bugcheck Analysis                                    *
    *                                                                             *
    *******************************************************************************
    
    
    WHEA_UNCORRECTABLE_ERROR (124)
    A fatal hardware error has occurred. Parameter 1 identifies the type of error
    source that reported the error. Parameter 2 holds the address of the
    WHEA_ERROR_RECORD structure that describes the error conditon.
    Arguments:
    Arg1: 0000000000000004, PCI Express Error
    Arg2: fffffa8005c8e038, Address of the WHEA_ERROR_RECORD structure.
    Arg3: 0000000000000000
    Arg4: 0000000000000000
    
    
    Debugging Details:
    ------------------
    
    
    
    
    BUGCHECK_STR:  0x124_4
    
    
    CUSTOMER_CRASH_COUNT:  1
    
    
    DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT
    
    
    PROCESS_NAME:  System
    
    
    CURRENT_IRQL:  7
    
    
    STACK_TEXT:  
    fffff880`03199a78 fffff800`0362a903 : 00000000`00000124 00000000`00000004 fffffa80`05c8e038 00000000`00000000 : nt!KeBugCheckEx
    fffff880`03199a80 fffff800`037e7593 : 00000000`00000001 fffffa80`05c760c0 00000000`00000000 fffffa80`05c75b70 : hal!HalBugCheckSystem+0x1e3
    fffff880`03199ac0 fffff880`00e20aff : fffffa80`00000750 fffffa80`05c760c0 00000000`00000001 fffffa80`05c8d7f0 : nt!WheaReportHwError+0x263
    fffff880`03199b20 fffff880`00e20526 : 00000000`00000000 fffff880`03199c70 fffffa80`051c4d80 00000000`000000ff : pci!ExpressRootPortAerInterruptRoutine+0x27f
    fffff880`03199b80 fffff800`036cd53c : fffff880`03171180 fffff880`03199c70 fffffa80`051c4d80 00000000`00000001 : pci!ExpressRootPortInterruptRoutine+0x36
    fffff880`03199bf0 fffff800`036d9ec2 : fffff880`03171180 fffff880`00000001 00000000`00000001 fffff880`00000000 : nt!KiInterruptDispatch+0x16c
    fffff880`03199d80 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x32
    
    
    
    
    STACK_COMMAND:  kb
    
    
    FOLLOWUP_NAME:  MachineOwner
    
    
    MODULE_NAME: hardware
    
    
    IMAGE_NAME:  hardware
    
    
    DEBUG_FLR_IMAGE_TIMESTAMP:  0
    
    
    FAILURE_BUCKET_ID:  X64_0x124_4_PCIEXPRESS
    
    
    BUCKET_ID:  X64_0x124_4_PCIEXPRESS
    
    
    Followup: MachineOwner
    ---------
    
    
    7: kd> !errrec fffffa8005c8e038
    ===============================================================================
    Common Platform Error Record @ fffffa8005c8e038
    -------------------------------------------------------------------------------
    Record Id     : 01cd269348ca35c5
    Severity      : Fatal (1)
    Length        : 672
    Creator       : Microsoft
    Notify Type   : PCI Express Error
    Timestamp     : 4/30/2012 5:44:53
    Flags         : 0x00000000
    
    
    ===============================================================================
    Section 0     : PCI Express
    -------------------------------------------------------------------------------
    Descriptor    @ fffffa8005c8e0b8
    Section       @ fffffa8005c8e148
    Offset        : 272
    Length        : 208
    Flags         : 0x00000001 Primary
    Severity      : Fatal
    
    
    Port Type     : Root Port
    Version       : 1.1
    Command/Status: 0x4010/0x0547
    Device Id     :
      VenId:DevId : 8086:340a
      Class code  : 030400
      Function No : 0x00
      Device No   : 0x03
      Segment     : 0x0000
      Primary Bus : 0x00
      Second. Bus : 0x00
      Slot        : 0x0000
    Dev. Serial # : 0000000000000000
    Express Capability Information @ fffffa8005c8e17c
      Device Caps : 00008021 Role-Based Error Reporting: 1
      Device Ctl  : 0107 ur FE NF CE
      Dev Status  : 000d UR FE nf CE
       Root Ctl   : 0008 fs nfs cs
    
    
    AER Information @ fffffa8005c8e1b8
      Uncorrectable Error Status    : 00140000 UR ecrc MTLP rof uc ca cto fcp ptlp sd dlp und
      Uncorrectable Error Mask      : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
      Uncorrectable Error Severity  : 00062010 ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und
      Correctable Error Status      : 000020c1 ADV rtto rnro DLLP TLP RE
      Correctable Error Mask        : 00000000 adv rtto rnro dllp tlp re
      Caps & Control                : 00000014 ecrcchken ecrcchkcap ecrcgenen ecrcgencap FEP
      Header Log                    : 34000000 02000030 00000000 00000000
      Root Error Command            : 00000000 fen nfen cen
      Root Error Status             : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
      Correctable Error Source ID   : 00,00,00
      Correctable Error Source ID   : 00,00,00
    
    
    ===============================================================================
    Section 1     : Processor Generic
    -------------------------------------------------------------------------------
    Descriptor    @ fffffa8005c8e100
    Section       @ fffffa8005c8e218
    Offset        : 480
    Length        : 192
    Flags         : 0x00000000
    Severity      : Informational
    
    
    Proc. Type    : x86/x64
    Instr. Set    : x64
    CPU Version   : 0x00000000000106a5
    Processor ID  : 0x0000000000000007
    ---------------------------------------------------------------------------------------------------------


    That's as far as I can go. I've Googled through hell and high water, and the case you worked on is the only thing that comes remotely close to this strange error. Please, does any of this make a lightbulb go on? Does anything here really describe the issue? The most technical information I could find pertaining to this 'nt!KeBugCheckEx' was here: http://uninformed.org/index.cgi?v=3&a=3&p=17



    I'll keep this window tabbed for the next few days in case you respond. I'm crossing my fingers.
    Last edited by Vir Gnarus; 04-30-2012 at 09:53 AM. Reason: Added [code] tags for cleaner post.

  6. #6

    Join Date
    Mar 2012
    Posts
    469

    Re: PCI-E WHEA errors (0x124)

    Hi mate,

    As mentioned in the OP, the best approach is to first figure out what actually made the report, then go to the appropriate structure for it and look up the error bits that were flagged.

    For the first step, check the Port Type in the WHEA error structure you saw in the !errrec output:

    Code:
    Port Type     : Root Port
    So it's the root port that generated the error. That being the case, as in the OP, we need to look at the PCI_EXPRESS_ROOTPORT_AER_CAPABILITY structure (listed at the top of the OP) and check its PCI_EXPRESS_UNCORRECTABLE_ERROR_STATUS substructure. Then match the error names there to the capitalized abbreviations in the AER data of the !errrec output:

    Code:
    AER Information @ fffffa8005c8e1b8
       Uncorrectable Error Status    : 00140000 UR ecrc MTLP rof uc ca cto fcp ptlp sd dlp und
       Uncorrectable Error Mask      : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
       Uncorrectable Error Severity  : 00062010 ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und
       Correctable Error Status      : 000020c1 ADV rtto rnro DLLP TLP RE
       Correctable Error Mask        : 00000000 adv rtto rnro dllp tlp re
       Caps & Control                : 00000014 ecrcchken ecrcchkcap ecrcgenen ecrcgencap FEP
       Header Log                    : 34000000 02000030 00000000 00000000
       Root Error Command            : 00000000 fen nfen cen
       Root Error Status             : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
       Correctable Error Source ID   : 00,00,00
       Correctable Error Source ID   : 00,00,00
    UR and MTLP are the ones flagged. In the article for the Uncorrectable Error Status, I see these correspond to UnsupportedRequestError and MalformedTLP. Checking the PCI Express Base Specification Revision 1.1*, I looked through the list of errors and found that UnsupportedRequestError is not a fatal error, but MalformedTLP is. The UR error is commonly used to report request errors, so it's not unusual to see it alongside the MTLP error.

    From what I skimmed in the PDF of the PCI-E Base Spec, the MTLP error can be triggered by a number of causes. What I'm most concerned about, though, is the source of the malformed TLP. Right now I'm trying to figure out how to determine that using WinDbg.

    * - I figured out which revision of the spec to use from the version of your root port, which is listed just below the Port Type in the !errrec output.
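    The fatal/non-fatal split can also be read straight off the dump. As far as I understand the AER registers, for each bit set in Uncorrectable Error Status, the same bit in Uncorrectable Error Severity says how the port is configured to report it (1 = fatal, 0 = non-fatal). A small Python sketch of that cross-check follows; the bit table is my reading of the spec and `classify` is a made-up helper:

```python
# Bit positions shared by the uncorrectable status and severity registers.
# Treat this table as an assumption to verify against the PCIe spec.
ERROR_BITS = {
    20: "Unsupported Request",
    19: "ECRC Error",
    18: "Malformed TLP",
    17: "Receiver Overflow",
    16: "Unexpected Completion",
    15: "Completer Abort",
    14: "Completion Timeout",
    13: "Flow Control Protocol Error",
    12: "Poisoned TLP",
    5: "Surprise Down",
    4: "Data Link Protocol Error",
}

def classify(status, severity):
    """Map every error flagged in status to 'fatal' or 'non-fatal'."""
    verdicts = {}
    for bit in sorted(ERROR_BITS, reverse=True):
        if (status >> bit) & 1:
            fatal = (severity >> bit) & 1
            verdicts[ERROR_BITS[bit]] = "fatal" if fatal else "non-fatal"
    return verdicts

# Values from the AER Information block quoted above.
print(classify(0x00140000, 0x00062010))
```

    For this dump it flags Malformed TLP as fatal and Unsupported Request as non-fatal, which agrees with what the spec lookup gave.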
    Last edited by Vir Gnarus; 04-30-2012 at 11:36 AM.

  7. #7

    Re: PCI-E WHEA errors (0x124)

    Fantastic work! I can't see the whole post I made on here, but I'm sure you saw it, judging by your reply. And yes, that's the problem for me too: I can't find the cause for the life of me. At first I thought it was a faulty voltage regulator on the GPU (that turned out to be false). Later I thought it was the motherboard (but a full system hardware diagnostic found no fault in the GPU, RAM, motherboard, or CPU).

    The only thing that keeps coming to mind is a power failure of some sort...but I can't pinpoint where!

    If you need anything else from me, please let me know.
    Last edited by Vir Gnarus; 04-30-2012 at 12:52 PM. Reason: Edit: no need to copy. I approved post. :)

  8. #8

    Join Date
    Mar 2012
    Posts
    469

    Re: PCI-E WHEA errors (0x124)

    A TLP can travel either way across the PCI-E bus. It can originate from the CPU, memory, a PCI card, etc., which is what makes this particularly difficult. I do see the Header Log for the error, which can give us clues, but I'll have to interpret the packet header it contains to figure out what's happening. So far I can tell what format the TLP is, but I haven't worked out the Requester ID in the header, which would tell us the source of the TLP.
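    For anyone who wants to attempt the same decode, here is a rough Python sketch that pulls the generic fields out of the Header Log from post #5. It rests on two assumptions that nothing in this thread confirms: that byte 0 of the TLP header lands in bits 31:24 of the first logged DWORD, and that the second DWORD carries a Requester ID (true for requests, not for completions). `decode_tlp_header` is a hypothetical helper:

```python
# Header Log DWORDs copied from the !errrec output in post #5.
HEADER_LOG = [0x34000000, 0x02000030, 0x00000000, 0x00000000]

def decode_tlp_header(dwords):
    """Split the first two header DWORDs into generic TLP fields.

    Assumes byte 0 of the header sits in bits 31:24 of the first DWORD and
    that DW1 holds a Requester ID in its upper 16 bits.
    """
    dw0, dw1 = dwords[0], dwords[1]
    requester_id = (dw1 >> 16) & 0xFFFF
    return {
        "fmt": (dw0 >> 29) & 0x3,      # header size / data-present encoding
        "type": (dw0 >> 24) & 0x1F,    # transaction type encoding
        "length_dw": dw0 & 0x3FF,      # payload length in DWORDs
        "requester": "{:02x}:{:02x}.{:x}".format(
            requester_id >> 8,           # bus number
            (requester_id >> 3) & 0x1F,  # device number
            requester_id & 0x7,          # function number
        ),
        "tag": (dw1 >> 8) & 0xFF,
        "last_byte": dw1 & 0xFF,       # byte enables or message code
    }

fields = decode_tlp_header(HEADER_LOG)
for name, value in fields.items():
    print(name, value)
```

    What the Fmt/Type pair and the last byte of DW1 mean has to be looked up in the spec's TLP header tables, so treat the decoded requester as a lead to verify, not as the answer.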

    Your best bet is to start from the assumption that this is caused by a PCI-E card or PCI card. The PCI-E root port services both the legacy PCI and PCI-E buses.

    As for it being a power issue, I kind of doubt it. Communication over the PCI-E bus goes through three layers: the Physical Layer, the Data Link Layer, and the Transaction Layer. Error checks occur at all three, and anything that fails at a lower layer (like the Physical Layer) doesn't get passed up to the layers above. A malformed TLP is a Transaction Layer error, which means the CRC checks and whatnot at the lower layers passed. So I think this is more likely a bug in the piece of hardware where the TLP originated (like one of the PCI cards); otherwise it would not have passed the lower-level checks. I suppose a power issue inside a piece of hardware, rather than in the bus linking them, could cause this, but I doubt it. Then again, I don't have a solid foundation of knowledge on all of this. It's all pretty new to me as well.
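    The same dump also logs correctable errors, and sorting them by layer is a quick way to see which layers reported anything. A minimal Python sketch, where both the bit positions and the layer each error is attributed to are my reading of the PCI Express spec and should be checked against it; `group_by_layer` is a made-up helper:

```python
# (bit, abbreviation, layer) for the AER Correctable Error Status register.
# Bit positions and layer attribution are assumptions to verify in the spec.
CORRECTABLE_BITS = [
    (13, "adv", "transaction"),   # Advisory Non-Fatal Error
    (12, "rtto", "data link"),    # Replay Timer Timeout
    (8, "rnro", "data link"),     # REPLAY_NUM Rollover
    (7, "dllp", "data link"),     # Bad DLLP
    (6, "tlp", "data link"),      # Bad TLP
    (0, "re", "physical"),        # Receiver Error
]

def group_by_layer(status):
    """List the abbreviations of the set bits under the layer that reports them."""
    layers = {"physical": [], "data link": [], "transaction": []}
    for bit, name, layer in CORRECTABLE_BITS:
        if (status >> bit) & 1:
            layers[layer].append(name)
    return layers

# Correctable Error Status from the !errrec output in post #5.
for layer, names in group_by_layer(0x000020C1).items():
    print(layer, names)
```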

    Btw, I was not responsible for moderating your post. I guess it was held because it appeared to be a request for help with a specific issue (which it is), so it didn't look relevant here. However, I personally think it's a good example of troubleshooting with this approach. If the diagnosis has to go beyond this, I'll move your posts into their own thread so people can assist better.

  9. #9

    Re: PCI-E WHEA errors (0x124)

    This is all very new ground for me too, Vir Gnarus. I've programmed for years in C++ with the Winsock API and in PHP, and worked on finding memory exploits in various programs. Yet all of this BSOD and 'black screen' troubleshooting goes very, very deep. When you take into consideration the vast number of variables that can arise from multiple pieces of software, drivers, and hardware, things can easily get complicated.

    There is no rush, my friend. As I said in my first post, the computer is generally fine for the whole day once I take it out of sleep mode. However, late at night (12-4 AM) it crashes when I try to do any intensive gaming, or even while just loading an intensive game map.

    If you think that this one needs a new thread, even though the checks are similar, I'm totally okay with that. That's either your call or the mod's call. The whole goal is to be able to provide a resource of knowledge for future Googlers.

    I do enjoy a challenge however. And, there is no rush. Take your time and if you need ANY information, please let me know.


    P.S. Oh, and I never knew of the Physical Layer, Data Link Layer, and Transaction Layer thing. That much is very interesting to me.

  10. #10

    Join Date
    Mar 2012
    Posts
    469

    Re: PCI-E WHEA errors (0x124)

    Yeah, the layers operate a lot like those of the OSI model in computer networking. In essence, a motherboard is just a miniature network that ties all the pieces of hardware together, and all of them are like little computers sending info to and fro. Once I realized this, understanding them came a lot easier. Who'd have thought my networking classes would help me with PCI Express!

    I'll dig through the PCI Express spec some more to see what I can find. If you want more detail on the general concepts behind PCI Express, I found this, all the way up to the transaction layer page, to be a good primer on PCI Express and the rules behind it. The spec itself can also be found by googling it.

  11. #11

    Re: PCI-E WHEA errors (0x124)

    Hey Vir, just thought I would throw you an update for today. My computer crashed again tonight, in the same 12-4 AM time frame. I felt compelled to go into the standard Windows event log and look for anything interesting. I saw that 100+ warning messages led up to the crash (and I believe the warnings caused it).



    Here's the error
    Code:
    A corrected hardware error has occurred.
    
    
    Component: PCI Express Root Port
    Error Source: Advanced Error Reporting (PCI Express)
    
    
    Bus:Device:Function: 0x0:0x3:0x0
    Vendor ID:Device ID: 0x8086:0x340a
    Class Code: 0x30400
    
    
    The details view of this entry contains further information.
    ANYWAY. I did some Googling and found a post on the second page of this EVGA thread. Keep in mind that we ATI users were also getting this error/warning message.
    http://www.evga.com/forums/fb.ashx?m=647987
    The post by tpb7463 (the guy with the cat icon)

    Cool huh?
    Well, I went ahead and modified the registry values and I'm sitting here typing this message without a crash yet. But, something is definitely up. I don't know if this will fix the issue (hell, it probably won't). But, we'll see how things go.

    Nevertheless, the standard Windows event log told me to view the 'Details' tab for further information. Thanks, Windows.
    Code:
    ErrorSource 4
    FRUId {00000000-0000-0000-0000-000000000000}
    FRUText
    ValidBits 0xdf
    PortType 4
    Version 0x101
    Command 0x10
    Status 0x547
    Bus 0x0
    Device 0x3
    Function 0x0
    Segment 0x0
    SecondaryBus 0x0
    Slot 0x0
    VendorID 0x8086
    DeviceID 0x340a
    ClassCode 0x30400
    DeviceSerialNumber 0x0
    BridgeControl 0x0
    BridgeStatus 0x0
    UncorrectableErrorStatus 0x0
    CorrectableErrorStatus 0x1000
    HeaderLog 00000000000000000000000000000000
    Length 672
    RawData 435045521002FFFFFFFF02000200000002000000A0020000282205001E040C140000000000000000000000000000000000000000000000000000000000000000BDC407CF89B7184EB3C41F732CB571311FC093CF161AFC4DB8BC9C4DAF67C104DB7FBA6C4626CD0100000000000000000000000000000000000000000000000010010000D0000000010200000100000054E995D9C1BB0F43AD91B44DCB3C6F3500000000000000000000000000000000020000000000000000000000000000000000000000000000E0010000C00000000102000000000000ADCC7698B447DB4BB65E16F193C4F3DB00000000000000000000000000000000030000000000000000000000000000000000000000000000DF000000000000000400000001010000100047050000000086800A3400040300030000000000000000000000000000000000000010E042012180000007010100023D3B03400001F11F790206C0074801080001000000000000000000000000000000000000000000000000000000000001000115000000000000000010200600001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000043010000000000000002000000000000A506010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007000000000000000000000000000000000000000000000000000000000000000000000000000000
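    For what it's worth, the start of that RawData blob can be picked apart with a few lines of Python. This is only a sketch: it assumes the blob begins with the usual WHEA error record header (signature, revision, signature end, section count, severity, valid bits, length, all little-endian), and the severity names in the table are my reading of the WHEA_ERROR_SEVERITY enum rather than anything shown in the log. The function name is made up for illustration.

```python
import struct

# First 24 bytes of the RawData field from the event details, as hex.
RAW_PREFIX_HEX = "435045521002FFFFFFFF02000200000002000000A0020000"

# Assumed meaning of the severity field (WHEA_ERROR_SEVERITY enum).
SEVERITY_NAMES = {0: "Recoverable", 1: "Fatal", 2: "Corrected", 3: "Informational"}


def parse_record_header(raw):
    """Unpack the leading fields of a WHEA error record header (hypothetical helper)."""
    # "<" = little-endian with no padding; 4s H I H I I I = 24 bytes total.
    signature, revision, sig_end, section_count, severity, valid_bits, length = (
        struct.unpack_from("<4sHIHIII", raw, 0)
    )
    return {
        "signature": signature.decode("ascii"),
        "revision": revision,
        "signature_end": sig_end,
        "section_count": section_count,
        "severity": SEVERITY_NAMES.get(severity, "Unknown (%d)" % severity),
        "valid_bits": valid_bits,
        "length": length,
    }


header = parse_record_header(bytes.fromhex(RAW_PREFIX_HEX))
for name, value in header.items():
    print(name, "=", value)
```

    The signature bytes spell out CPER, the length field comes out as 672, which matches the Length line in the details, and the severity decodes as Corrected, which lines up with the "A corrected hardware error has occurred" text.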

  12. #12

    Join Date
    Mar 2012
    Posts
    469

    Re: PCI-E WHEA errors (0x124)

    I'm finding out a lot more about this PCI-E thing as this progresses. Thanks for the additional information.

    I did some more reading through the PCI-E spec documentation, and it does sound very much related to this case, as far as I can tell. Here's the process I used to figure this out. Again, the PCI-E spec documentation for Revision 1.1 is crucial here. Google it to find a PDF version of it (the primary site is member-only access).

    I looked at the detailed information and it shows a CorrectableErrorStatus of 0x1000. I sifted through the PCI-E spec doc for the AER Correctable Error Status bits and what error each one represents. Only one bit is set, bit 12 (remember, numbering starts at 0, not 1), which is the Replay Timer Timeout error. Having no clue what the Replay Timer is, I looked it up in the spec doc. I got this:

    REPLAY_TIMER - Counts time since last Ack or Nak DLLP received
    Ack and Nak are networking terms for Acknowledged and Not Acknowledged, the two most generic responses a device (or a computer, in networking) gives to something it receives. So the timer simply counts how long it has been since the last Ack or Nak DLLP came back, and a timeout means the sender waited too long for one. What's a DLLP? Again, to the spec doc! Under Terms and Acronyms, the following is available:

    Data Link Layer Packet, DLLP: A Packet generated in the Data Link Layer to support Link management functions.
    Ah ha, so it's lower than a TLP (Transaction Layer Packet) and it's basically for link management (typical of the Data Link Layer). So this tells us an acknowledgement timed out on a particular link. I can now start to see why this could be a power issue.
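
    The bit lookup I did by hand on CorrectableErrorStatus can be sketched in a few lines of Python. The bit names below are the ones I read out of the AER Correctable Error Status register description in the spec; the function name is just something I made up for illustration, so treat this as a rough aid rather than a reference decoder.

```python
# Bit positions of the AER Correctable Error Status register, as listed
# in the PCI-E spec (Revision 1.1). Bits not listed here are reserved.
CORRECTABLE_ERROR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}


def decode_correctable_status(value):
    """Return the names of all bits set in a CorrectableErrorStatus value."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(CORRECTABLE_ERROR_BITS.get(bit, "Reserved bit %d" % bit))
    return names


# 0x1000 is the value from the event log details.
print(decode_correctable_status(0x1000))
```

    With 0x1000 only bit 12 is set, so the list has the single entry Replay Timer Timeout.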

    tpb7463 (the guy with the cat avatar) on the thread you referenced didn't seem to mention much relevant to this. However, a guy a few posts down, named terminou, does present something viable as a solution (or at least a workaround). He mentions that turning off ASPM (Active State Power Management), which Windows calls Link State Power Management, for the PCI Express bus seemed to do the trick. You can do this in your power plan (Power Options > Change plan settings > Change advanced power settings): go to PCI Express and set Link State Power Management to Off. See if that stifles those error messages.
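
    If you'd rather script the same change than click through the dialog, here is a rough Python sketch that builds the powercfg commands. The SCHEME_CURRENT, SUB_PCIEXPRESS and ASPM aliases and the value 0 for "Off" are assumptions on my part (run powercfg /aliases on your machine to confirm them before trusting this), and the function names are made up. By default it only prints the commands; it touches nothing unless you pass --apply from an elevated prompt.

```python
import subprocess
import sys


def aspm_off_commands():
    """Build powercfg command lines that set Link State Power Management to Off.

    Assumption: the aliases and the index 0 (= Off) match your Windows version.
    """
    setting = ["SCHEME_CURRENT", "SUB_PCIEXPRESS", "ASPM", "0"]
    return [
        ["powercfg", "/setacvalueindex"] + setting,  # plugged in
        ["powercfg", "/setdcvalueindex"] + setting,  # on battery
        ["powercfg", "/setactive", "SCHEME_CURRENT"],  # re-apply the plan
    ]


def run_commands(commands, dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_commands(aspm_off_commands(), dry_run="--apply" not in sys.argv)
```

    The dry-run default is deliberate: you can compare the printed commands against what the Power Options dialog shows before changing anything.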

    I'm not sure, though, how this relates to the Malformed TLP fatal error that gives you the PCI-E BSODs. Still, we can work with what we have so far and go from there. Maybe these are two separate errors with two separate causes: the Malformed TLP related to the HDCP incompatibility, and the bunch of Replay Timer Timeouts coming from ASPM. Again, check and let us evaluate the results.

  13. #13

    Re: PCI-E WHEA errors (0x124)

    Alright, just set the PCI-E power management to off. I'll give the computer a restart and let you know about the error messages and if there are any additional crashes.

  14. #14

    Re: PCI-E WHEA errors (0x124)

    Update: Nope, computer crashed and I just rebooted from it. Oddly enough this time it just restarted. There IS a new memory dump, but it's much different than the others.


    well, I guess it can't hurt:

    Code:
    
    Microsoft (R) Windows Debugger Version 6.11.0001.404 AMD64
    Copyright (c) Microsoft Corporation. All rights reserved.
    
    
    
    
    Loading Dump File [C:\Windows\MEMORY.DMP]
    Kernel Summary Dump File: Only kernel address space is available
    
    
    Symbol search path is: SRV*c:\symbols*http://msdl.microsoft.com/download/symbols
    Executable search path is: 
    Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
    Missing image name, possible paged-out or corrupt data.
    *** WARNING: Unable to verify timestamp for Unknown_Module_00350033`00330031
    *** ERROR: Module load completed but symbols could not be loaded for Unknown_Module_00350033`00330031
    Debugger can not determine kernel base address
    Windows 7 Kernel Version 7600 MP (8 procs) Free x64
    Product: WinNt, suite: TerminalServer SingleUserTS
    Built by: 7600.16617.amd64fre.win7_gdr.100618-1621
    Machine Name:
    Kernel base = 0xfffff800`03651000 PsLoadedModuleList = 0xfffff800`0388ee50
    Debug session time: Tue May  1 11:56:25.058 2012 (GMT-4)
    System Uptime: 0 days 0:20:56.323
    Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
    Missing image name, possible paged-out or corrupt data.
    *** WARNING: Unable to verify timestamp for Unknown_Module_00350033`00330031
    *** ERROR: Module load completed but symbols could not be loaded for Unknown_Module_00350033`00330031
    Debugger can not determine kernel base address
    Loading Kernel Symbols
    Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
    Missing image name, possible paged-out or corrupt data.
    .Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
    
    
    WARNING: .reload failed, module list may be incomplete
    *******************************************************************************
    *                                                                             *
    *                        Bugcheck Analysis                                    *
    *                                                                             *
    *******************************************************************************
    
    
    Use !analyze -v to get detailed debugging information.
    
    
    BugCheck 101, {19, 0, fffff88003100180, 6}
    
    
    ***** Debugger could not find nt in module list, module list might be corrupt, error 0x80070057.
    
    
    Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
    Missing image name, possible paged-out or corrupt data.
    Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
    WARNING: .reload failed, module list may be incomplete
    Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
    Missing image name, possible paged-out or corrupt data.
    Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
    WARNING: .reload failed, module list may be incomplete
    Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
    Missing image name, possible paged-out or corrupt data.
    Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
    WARNING: .reload failed, module list may be incomplete
    Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
    Missing image name, possible paged-out or corrupt data.
    Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
    WARNING: .reload failed, module list may be incomplete
    Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
    Missing image name, possible paged-out or corrupt data.
    Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
    WARNING: .reload failed, module list may be incomplete
    Probably caused by : Unknown_Image ( ANALYSIS_INCONCLUSIVE )
    
    
    Followup: MachineOwner
    ---------
    
    
    0: kd> !analyze -v
    *******************************************************************************
    *                                                                             *
    *                        Bugcheck Analysis                                    *
    *                                                                             *
    *******************************************************************************
    
    
    CLOCK_WATCHDOG_TIMEOUT (101)
    An expected clock interrupt was not received on a secondary processor in an
    MP system within the allocated interval. This indicates that the specified
    processor is hung and not processing interrupts.
    Arguments:
    Arg1: 0000000000000019, Clock interrupt time out interval in nominal clock ticks.
    Arg2: 0000000000000000, 0.
    Arg3: fffff88003100180, The PRCB address of the hung processor.
    Arg4: 0000000000000006, 0.
    
    
    Debugging Details:
    ------------------
    
    
    ***** Debugger could not find nt in module list, module list might be corrupt, error 0x80070057.
    
    
    Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
    Missing image name, possible paged-out or corrupt data.
    Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
    WARNING: .reload failed, module list may be incomplete
    Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
    Missing image name, possible paged-out or corrupt data.
    Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
    WARNING: .reload failed, module list may be incomplete
    Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
    Missing image name, possible paged-out or corrupt data.
    Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
    WARNING: .reload failed, module list may be incomplete
    Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
    Missing image name, possible paged-out or corrupt data.
    Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
    WARNING: .reload failed, module list may be incomplete
    Unable to read NT module Base Name string at 00740068`4010f410 - NTSTATUS 0xC0000141
    Missing image name, possible paged-out or corrupt data.
    Unable to read KLDR_DATA_TABLE_ENTRY at 0063002f`00300038 - NTSTATUS 0xC0000141
    WARNING: .reload failed, module list may be incomplete
    
    
    BUGCHECK_STR:  CLOCK_WATCHDOG_TIMEOUT_8_PROC
    
    
    DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT
    
    
    CURRENT_IRQL:  0
    
    
    STACK_TEXT:  
    fffff880`033e7508 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0xfffff800`036c1740
    
    
    
    
    STACK_COMMAND:  kb
    
    
    SYMBOL_NAME:  ANALYSIS_INCONCLUSIVE
    
    
    FOLLOWUP_NAME:  MachineOwner
    
    
    MODULE_NAME: Unknown_Module
    
    
    IMAGE_NAME:  Unknown_Image
    
    
    DEBUG_FLR_IMAGE_TIMESTAMP:  0
    
    
    BUCKET_ID:  CORRUPT_MODULELIST
    
    
    Followup: MachineOwner
    ---------
    I just can't help but think that I have a faulty motherboard...

  15. #15

    Join Date
    Mar 2012
    Posts
    469

    Re: PCI-E WHEA errors (0x124)

    Yes, this is quite different. It's more CPU related. While drivers can cause it with things like race conditions or erroneous use of IRQLs, it all involves the CPU getting stuck in a state it can't escape from.
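
    As an aside, if you ever want to compare the summary line across several dumps, the "BugCheck" line is easy to pull apart. Here is a small Python sketch; the regex and function name are my own, and it assumes the line always has the shape WinDbg printed in your output (hex code, then hex arguments in braces).

```python
import re

# Matches lines such as: BugCheck 101, {19, 0, fffff88003100180, 6}
BUGCHECK_RE = re.compile(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}")


def parse_bugcheck(line):
    """Return (code, [args]) as integers, or None if the line does not match."""
    match = BUGCHECK_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    args = [int(arg.strip(), 16) for arg in match.group(2).split(",")]
    return code, args


code, args = parse_bugcheck("BugCheck 101, {19, 0, fffff88003100180, 6}")
print("bugcheck 0x%X" % code)
print("timeout interval: %d clock ticks" % args[0])
print("PRCB of hung processor: 0x%x" % args[2])
```

    For your dump this gives bugcheck 0x101 with a timeout interval of 0x19, i.e. 25 nominal clock ticks, matching the Arg1 description in the !analyze -v output.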

    Btw, in your event log, is it still getting pounded by those other PCI-E Correctable Error events?

  16. #16

    Re: PCI-E WHEA errors (0x124)

    Quote Originally Posted by Vir Gnarus View Post
    Yes, this is quite different. It's more CPU related. While drivers can cause it with things like race conditions or erroneous use of IRQLs, it all involves the CPU getting stuck in a state it can't escape from.

    Btw, in your event log, is it still getting pounded by those other PCI-E Correctable Error events?
    Checking....

    Nope, the last ones were the ones leading up to the first crash last night at 12:40 AM.


    As you can see this is very strange. The computer will run fine all day, but at night time, after the first crash, the computer crashes periodically for the rest of the night. It's very...very strange.


    EDIT: If you'd like, we can set up a remote TeamViewer session and look through all the system logs. Like I said, any information that would be useful, you can have.

  17. #17

    Join Date
    Mar 2012
    Posts
    469

    Re: PCI-E WHEA errors (0x124)

    So it appears the power management thing resolved the event log errors, which were all correctable, but the uncorrectable ones continue. I had a feeling they were unrelated.

    Anyways, I'll still try to digest all of this to see what we can do. The thing I want to figure out is exactly what sent the Malformed TLPs that were causing the original BSODs. If this CLOCK_WATCHDOG_TIMEOUT BSOD is any indication of what's been going on so far, then it looks to be a problematic CPU. We'll see.

  18. #18

    Join Date
    Mar 2012
    Posts
    469

    Re: PCI-E WHEA errors (0x124)

    I'm looking more into the AER info and I don't believe it can tell us anything important about the exact cause. It appears the erroneous TLP was passed along a number of times, getting marked with separate errors, before it reached the device that flagged it as a fatal error (causing the BSOD). We'd be hard-pressed to find any identifier showing where it originated.

    If you have any minidumps at all (located in /Windows/Minidump directory) please zip em up and send em to us. I'd like to see all of them for any cross-patterns I may find.

  19. #19

    Re: PCI-E WHEA errors (0x124)

    Alright then. Here are all of them. Some older than others but hey, maybe you can find something.


    http://speedy.sh/kPZnD/download/All-Dump-Files.rar

  20. #20

    Join Date
    Mar 2012
    Posts
    469

    Re: PCI-E WHEA errors (0x124)

    Can you attach them to your post directly? They are small enough that zipping them should work. I cannot access them due to firewall restrictions against that site.
