Quick Reference:
PCI_EXPRESS_AER_CAPABILITY structure
PCI_EXPRESS_BRIDGE_AER_CAPABILITY structure
PCI_EXPRESS_ROOTPORT_AER_CAPABILITY structure
Hi everybody,
I'd like to present a small personal discovery of mine. Most WHEA errors you come across, like the ones described in the BSOD Methods & Tips section, are primarily related to CPU or memory issues. Occasionally, however, you will find one that involves the PCI-E bus. On motherboards with both PCI and PCI-E slots, the same controller is responsible for both, so the error can involve a PCI device as well, as you will find later on. In this case the output is quite different from the usual and may appear daunting and indecipherable at first. Have no fear! With the right knowledge base (MSDN), you shouldn't have much of a problem figuring it out.
Let's start by doing an !analyze -v:
Code:
3: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 00000004, PCI Express Error
Arg2: 869348d4, Address of the WHEA_ERROR_RECORD structure.
Arg3: 00000000
Arg4: 00000000
Debugging Details:
------------------
TRIAGER: Could not open triage file : C:\Program Files (x86)\Windows Kits\8.0\Debuggers\x64\triage\modclass.ini, error 2
BUGCHECK_STR: 0x124_GenuineIntel
CUSTOMER_CRASH_COUNT: 1
DEFAULT_BUCKET_ID: WIN7_DRIVER_FAULT
PROCESS_NAME: System
CURRENT_IRQL: a
STACK_TEXT:
80e4cb2c 8341afcd 00000124 00000004 869348d4 nt!KeBugCheckEx+0x1e
80e4cb68 83506fc4 869334e1 869348d4 8691bc10 hal!HalBugCheckSystem+0xab
80e4cb9c 8c7ce609 8691b638 8690a780 80e4cd20 nt!WheaReportHwError+0x230
80e4cbb4 8c7cf088 869344b4 00000000 8691b638 pci!ExpressRootPortAerInterruptRoutine+0x1e7
80e4cbd8 8c7cf264 8690a780 86934008 80e4cbfc pci!ExpressRootPortInterruptRoutine+0x1a
80e4cbe8 834a9cff 8690a780 86934008 00000001 pci!ExpressRootPortMessageRoutine+0x10
80e4cbfc 83474ded 8690a780 86934008 80e4cc28 nt!KiInterruptMessageDispatch+0x12
80e4cbfc 93be45d6 8690a780 86934008 80e4cc28 nt!KiInterruptDispatch+0x6d
WARNING: Stack unwind information not available. Following frames may be wrong.
80e4cc98 8349ada4 888d8d48 80e35800 80e30000 intelppm+0x15d6
80e4cd20 834985ad 00000000 0000000e ab16ab16 nt!PoIdle+0x524
80e4cd24 00000000 0000000e ab16ab16 8bdf8bdf nt!KiIdleLoop+0xd
STACK_COMMAND: kb
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: GenuineIntel
IMAGE_NAME: GenuineIntel
DEBUG_FLR_IMAGE_TIMESTAMP: 0
FAILURE_BUCKET_ID: 0x124_GenuineIntel_PCIEXPRESS
BUCKET_ID: 0x124_GenuineIntel_PCIEXPRESS
Followup: MachineOwner
---------
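Arg1 in the output above is an error-source code. As a minimal sketch (Python, purely for illustration), the lookup below turns that code into a label. The table follows my reading of Microsoft's documentation for bugcheck 0x124 and the labels are paraphrased, so treat it as an assumption and verify it there. describe_whea_source is a made-up helper name, not a debugger command.

```python
# Error-source codes for Arg1 of bugcheck 0x124.
# Labels are paraphrased from the bugcheck documentation (assumption: verify there).
WHEA_SOURCES = {
    0x0: "Machine check exception",
    0x1: "Corrected machine check",
    0x2: "Corrected platform error",
    0x3: "Nonmaskable interrupt (NMI)",
    0x4: "PCI Express error",
    0x5: "Other/generic hardware error",
    0x6: "Initialization (INIT) error",
    0x7: "BOOT error",
    0x8: "SCI generic error",
    0x9: "Itanium machine check abort",
    0xA: "Itanium corrected machine check",
    0xB: "Itanium corrected platform error",
}


def describe_whea_source(arg1: int) -> str:
    """Return a readable label for the Arg1 value of a 0x124 bugcheck."""
    return WHEA_SOURCES.get(arg1, f"Unknown error source (0x{arg1:x})")


# Arg1 from the dump above
print(describe_whea_source(0x00000004))
```

Anything outside the table comes back as "Unknown error source", which is a hint to go check the documentation rather than guess.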
As with any WHEA BSOD, the analysis isn't very explanatory. Really, the only things worth looking at are the bugcheck arguments. In this case, Arg1 holds the value 0x4, which means a PCI Express error. Arg2, as usual, holds the address of the WHEA error record. As with any 0x124 bugcheck, we take this value and point !errrec at it:
Code:
3: kd> !errrec 869348d4
===============================================================================
Common Platform Error Record @ 869348d4
-------------------------------------------------------------------------------
Record Id : 01cd07d8bce4740f
Severity : Fatal (1)
Length : 672
Creator : Microsoft
Notify Type : PCI Express Error
Timestamp : 3/22/2012 3:06:44 (UTC)
Flags : 0x00000000
===============================================================================
Section 0 : PCI Express
-------------------------------------------------------------------------------
Descriptor @ 86934954
Section @ 869349e4
Offset : 272
Length : 208
Flags : 0x00000001 Primary
Severity : Recoverable
Port Type : Root Port
Version : 1.1
Command/Status: 0x4010/0x0507
Device Id :
VenId:DevId : 8086:340a
Class code : 030400
Function No : 0x00
Device No : 0x03
Segment : 0x0000
Primary Bus : 0x00
Second. Bus : 0x00
Slot : 0x0000
Dev. Serial # : 0000000000000000
Express Capability Information @ 86934a18
Device Caps : 00008021 Role-Based Error Reporting: 1
Device Ctl : 0107 ur FE NF CE
Dev Status : 0003 ur fe NF CE
Root Ctl : 0008 fs nfs cs
AER Information @ ffffffff86934a54
Uncorrectable Error Status : 00000020 ur ecrc mtlp rof uc ca cto fcp ptlp SD dlp und
Uncorrectable Error Mask : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
Uncorrectable Error Severity : 00062010 ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und
Correctable Error Status : 00000000 adv rtto rnro dllp tlp re
Correctable Error Mask : 00000000 adv rtto rnro dllp tlp re
Caps & Control : 00000005 ecrcchken ecrcchkcap ecrcgenen ecrcgencap FEP
Header Log : 00000000 00000000 00000000 00000000
Root Error Command : 00000000 fen nfen cen
Root Error Status : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
Correctable Error Source ID : 00,00,00
Correctable Error Source ID : 00,00,00
===============================================================================
Section 1 : Processor Generic
-------------------------------------------------------------------------------
Descriptor @ 8693499c
Section @ 86934ab4
Offset : 480
Length : 192
Flags : 0x00000000
Severity : Informational
Proc. Type : x86/x64
Instr. Set : x86
CPU Version : 0x00000000000106a5
Processor ID : 0x0000000000000006
This definitely doesn't look like the typical WHEA output, so we need to digest the information bit by bit. Much of it doesn't need close scrutiny in most cases. The most important content here is the AER info; AER stands for Advanced Error Reporting. To the untrained eye it looks like a bunch of gibberish, but in this article we will discover what it means, and that it really is pretty simple stuff.
But first, let's figure out just what component of the PC the error came from. We start by looking at the Device Id block. Its first entry gives the VenId (Vendor ID) and DevId (Device ID). We can look these numbers up in an online database; I personally picked PCIDatabase.com. Searching on the DevId, which is more specific than the VenId, shows that this is an Intel 7500 Chipset PCIe Root Port. The same device ID covers the 5520/5500/X58 I/O Hub family, so on this board the root port sits in the X58 I/O Hub rather than the ICH10 southbridge. Given that the client described his motherboard as an ASRock X58 board, this is no surprise. More importantly, both the Port Type field and the DevId description tell us that the error was reported by the root port of the PCI-E bus. This is important, as you'll discover later.
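If you want to pull the two IDs out of the VenId:DevId line before searching a database, a small sketch like the following will do. split_pci_id is a hypothetical helper of my own; the only assumption about the format is what the dump shows, namely two hexadecimal values separated by a colon.

```python
from typing import Tuple


def split_pci_id(ven_dev: str) -> Tuple[int, int]:
    """Split a 'VenId:DevId' string such as '8086:340a' into two integers."""
    vendor, device = ven_dev.strip().split(":")
    return int(vendor, 16), int(device, 16)


# Value taken from the Device Id block in the !errrec output
vendor_id, device_id = split_pci_id("8086:340a")
print(f"vendor=0x{vendor_id:04x} device=0x{device_id:04x}")
```

Vendor 0x8086 is Intel; the device half is the part worth searching on.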
Anyway, on to the actual AER information. Since I was as befuddled by the output as you probably are, I searched Google for keywords like "AER" and "PCI Express" to find out what I was looking at. Sure enough, MSDN has an article on it, covering a particular data structure with all the details and descriptions of its members. The initial article I found was this one (PCI_EXPRESS_AER_CAPABILITY structure). Everything described there matches the AER info presented in the WHEA error record, so I figured this would be easy. I checked the Uncorrectable Error Status (PCI_EXPRESS_UNCORRECTABLE_ERROR_STATUS) in the article to see descriptions of all the weird clusters of letters that are displayed.
As I had guessed, this is a set of bit flags, each pertaining to a specific type of error. If a bit is set (its value is 1 rather than 0), that error occurred. In the WHEA error record displayed by !errrec, each cluster of letters is a short abbreviation for the error name of the corresponding bit. Lowercase means the bit was not set; uppercase means it was set. As we can tell, the SD bit was set, so we should look through the article to see what SD means.
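The uppercase/lowercase convention is easy to apply mechanically. Here is a minimal sketch that picks the set flags out of one rendered line; set_flags is a hypothetical helper name, and the input strings are copied from the AER section of the dump above.

```python
from typing import List


def set_flags(rendered: str) -> List[str]:
    """Return the abbreviations that !errrec printed in uppercase (bit set)."""
    return [token for token in rendered.split() if token.isupper()]


# Flag text copied from the Uncorrectable Error Status and Severity rows
status = "ur ecrc mtlp rof uc ca cto fcp ptlp SD dlp und"
severity = "ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und"

print(set_flags(status))
print(set_flags(severity))
```

Pass in only the flag text, not the whole row, since the row label and hex value would confuse the simple split.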
This is where things get problematic. Nothing in the article sounds remotely related to the letters SD, and the other abbreviations don't seem to match any error described there either. Something's not right here, so let's go back to the main article on the PCI_EXPRESS_AER_CAPABILITY structure that we initially found and read up a bit.
Sure enough, in the remarks, the following is stated:
Root ports and root complex event collectors use the PCI_EXPRESS_ROOTPORT_AER_CAPABILITY structure instead of the PCI_EXPRESS_AER_CAPABILITY structure to describe the PCIe advanced error reporting capability structure.
This is why it was important to discover what reported the error. It could have come from a PCI-E bridge (which has its own AER data structure), from the root port as we found here, or from the PCI-E device itself. So now we go to the article describing the structure specific to root ports, then back to the Uncorrectable Error Status subarticle, and read up on the error bits and their descriptions. As you can see, the errors now correspond well to those shown in the WHEA record. For example, ecrc means ECRC Error, rof means Receiver Overflow, etc. In this specific case SD was the only one capitalized, so let's look at its description. It mentions Surprise Down. To me this is self-explanatory, but just in case, I googled it along with "PCI Express" to be sure. As I figured, it means there was an unexpected (Surprise!) loss of connection between the PCI-E root port and the PCI-E device. The device in question could only be the video card, since it was the only PCI-E device the client had installed while troubleshooting (read Update).
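To tie the pieces together, here is a rough sketch that decodes the raw Uncorrectable Error Status value the same way !errrec renders it. The bit positions are my reading of the AER register layout in the PCI Express specification, not something taken from the debugger, so treat them as an assumption; they do reproduce both the Status and Severity rows of the dump above. The keys of the structure table are my own shorthand, and all function names are hypothetical.

```python
from typing import List

# Which MSDN structure describes the AER block, by who reported the error.
# Keys are shorthand of mine, not strings printed by !errrec.
AER_STRUCTURE_BY_REPORTER = {
    "root port": "PCI_EXPRESS_ROOTPORT_AER_CAPABILITY",
    "bridge": "PCI_EXPRESS_BRIDGE_AER_CAPABILITY",
    "device": "PCI_EXPRESS_AER_CAPABILITY",
}

# (bit, abbreviation, name), listed in the order !errrec prints them.
# Bit positions are assumed from the PCIe spec's AER register layout.
UNCORRECTABLE_BITS = [
    (20, "ur", "Unsupported Request"),
    (19, "ecrc", "ECRC Error"),
    (18, "mtlp", "Malformed TLP"),
    (17, "rof", "Receiver Overflow"),
    (16, "uc", "Unexpected Completion"),
    (15, "ca", "Completer Abort"),
    (14, "cto", "Completion Timeout"),
    (13, "fcp", "Flow Control Protocol Error"),
    (12, "ptlp", "Poisoned TLP"),
    (5, "sd", "Surprise Down"),
    (4, "dlp", "Data Link Protocol Error"),
    (0, "und", "Undefined"),
]


def render_flags(value: int) -> str:
    """Render a register value the way the dump does: uppercase means set."""
    return " ".join(
        abbr.upper() if (value >> bit) & 1 else abbr
        for bit, abbr, _ in UNCORRECTABLE_BITS
    )


def error_names(value: int) -> List[str]:
    """Return the full names of every error bit that is set."""
    return [name for bit, _, name in UNCORRECTABLE_BITS if (value >> bit) & 1]


# Values from the AER section of the dump
print(AER_STRUCTURE_BY_REPORTER["root port"])
print(render_flags(0x00000020))
print(error_names(0x00000020))
```

Feeding in the Severity value 0x00062010 gives the same capitalization as the Severity row in the record, which is a decent sanity check on the bit table.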
So now that we have a specific description of the error, we need to think about what could cause it. As far as I can speculate, either the root port or the card could be responsible. I would also expect anything that physically impedes the connection, such as dust or another contaminant in the slot, or a card that isn't fully seated, to create this problem. The client is still in the process of determining this. However, while googling, I found this error to be rather common among owners of ASRock X58 motherboards, many of whom have a lot of PCI-E WHEA error events in their event logs. I suspect this is related.
I'll report back with a definitive answer on this case. Until then, you at least now know what to do when you come across one of these WHEA errors. Have fun!
UPDATE:
The culprit actually turned out to be the TV tuner card in a PCI slot (not PCI-E). I was wrong to assume the client had only the video card installed. I had also forgotten that the same controller is responsible for both the PCI and PCI-E buses, so this error can come from PCI cards as well on newer boards.