PCI WHEA Error/BSOD

wutangfan2222 (New member, joined Jul 20, 2022, 3 posts)

Hi everyone,

I'm having a hard time with my server and can't figure out what's going on with it. It's a Server 2012 Standard machine, and I'm running 2 VMs on it, if that makes any difference.

I used this PCI-E WHEA errors (0x124) thread to help me analyze the dump file, and I'm very thankful to Vir Gnarus for it: it got me much further than I would have gotten on my own.

Our server went down with a BSOD at about 2 PM EST yesterday, with the stop code WHEA_UNCORRECTABLE_ERROR. The only recent change I made was at the router level: we had to add a site-to-site VPN connection to Azure. I think this may be relevant partly because it's the only change I made, and partly because the remote VPN subnet is the same as the heartbeat subnet for our cluster (10.0.0.1 is the heartbeat address). I'm not sure whether this could somehow have caused a feedback loop and brought the NIC down. Maybe this is nonsense, but it's where my mind keeps going.
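
The overlap itself is easy to confirm on paper before looking at the router again. Below is a minimal Python sketch using the standard `ipaddress` module. The /24 prefix lengths are an assumption, since the post gives only the heartbeat address 10.0.0.1 and not the masks, and `subnets_overlap` is a hypothetical helper name. It only shows whether two ranges collide, which is a routing question, and says nothing on its own about the cause of the crash.

```python
import ipaddress

def subnets_overlap(a: str, b: str) -> bool:
    """Return True if the two CIDR ranges share any addresses."""
    net_a = ipaddress.ip_network(a, strict=False)
    net_b = ipaddress.ip_network(b, strict=False)
    return net_a.overlaps(net_b)

# ASSUMPTION: /24 masks are placeholders; substitute the real prefix lengths.
heartbeat_subnet = "10.0.0.0/24"   # contains the heartbeat address 10.0.0.1
remote_vpn_subnet = "10.0.0.0/24"  # remote side of the site-to-site VPN

print(subnets_overlap(heartbeat_subnet, remote_vpn_subnet))

# Check whether the heartbeat address falls inside the remote range.
heartbeat_ip = ipaddress.ip_address("10.0.0.1")
print(heartbeat_ip in ipaddress.ip_network(remote_vpn_subnet))
```

If both lines print `True`, traffic for the heartbeat address could be matched by the VPN route, depending on how the router orders its routes.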

Here is the bugcheck info/analysis

kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon. Try !errrec Address of the WHEA_ERROR_RECORD structure to get more details.
Arguments:
Arg1: 0000000000000004, PCI Express Error
Arg2: fffffa8019f65038, Address of the WHEA_ERROR_RECORD structure.
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:
------------------


KEY_VALUES_STRING: 1

Key : Analysis.CPU.mSec
Value: 1499

Key : Analysis.DebugAnalysisManager
Value: Create

Key : Analysis.Elapsed.mSec
Value: 1495

Key : Analysis.Init.CPU.mSec
Value: 3937

Key : Analysis.Init.Elapsed.mSec
Value: 2245282

Key : Analysis.Memory.CommitPeak.Mb
Value: 82

Key : WER.OS.Branch
Value: win8_rtm

Key : WER.OS.Timestamp
Value: 2012-07-25T12:47:00Z

Key : WER.OS.Version
Value: 8.0.9200.16384


BUGCHECK_CODE: 124

BUGCHECK_P1: 4

BUGCHECK_P2: fffffa8019f65038

BUGCHECK_P3: 0

BUGCHECK_P4: 0

CUSTOMER_CRASH_COUNT: 1

PROCESS_NAME: System

DPC_STACK_BASE: FFFFF80182898FB0

STACK_OVERFLOW: Stack Limit: fffff80182892fb0. Use (kF) and (!stackusage) to investigate stack usage.

STACK_TEXT:
fffff801`82892958 fffff801`8104493d : 00000000`00000124 00000000`00000004 fffffa80`19f65038 00000000`00000000 : nt!KeBugCheckEx
fffff801`82892960 fffff801`811dcc09 : 00000000`00000001 fffffa80`19f910f8 00000000`00000000 fffffa80`19f65038 : hal!HalBugCheckSystem+0xf9
fffff801`828929a0 fffff880`0128a36b : fffffa80`00000750 fffffa80`19f910f8 00000000`00000000 fffffa80`19f66010 : nt!WheaReportHwError+0x249
fffff801`82892a00 fffff880`01289d75 : 00000000`00000000 fffff801`82892aa0 fffff880`03be6780 00000000`00000000 : pci!ExpressRootPortAerInterruptRoutine+0x2ab
fffff801`82892a60 fffff801`810f7106 : fffff801`81379180 fffffa80`1abf63a8 00000061`568ab8a8 00000000`00000001 : pci!ExpressRootPortInterruptRoutine+0x3d
fffff801`82892ad0 fffff801`81129552 : fffff801`81379180 fffff801`81379180 00000000`00183db0 fffff801`813d3880 : nt!KiInterruptDispatch+0x1d6
fffff801`82892c60 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x32


MODULE_NAME: GenuineIntel

IMAGE_NAME: GenuineIntel.sys

STACK_COMMAND: .thread ; .cxr ; kb

FAILURE_BUCKET_ID: 0x124_4_GenuineIntel_PCIEXPRESS_SURPRISE_DOWN_ERROR_IMAGE_GenuineIntel.sys

OS_VERSION: 8.0.9200.16384

BUILDLAB_STR: win8_rtm

OSPLATFORM_TYPE: x64

OSNAME: Windows 8

FAILURE_ID_HASH: {a1c876dc-7f6f-101c-5993-282d7471de42}

Followup: MachineOwner
---------

0: kd> !errrec
0: kd> dt hal!_MCi_STATUS
+0x000 McaErrorCode : Uint2B
+0x002 ModelErrorCode : Uint2B
+0x004 OtherInformation : Pos 0, 23 Bits
+0x004 ActionRequired : Pos 23, 1 Bit
+0x004 Signalling : Pos 24, 1 Bit
+0x004 ContextCorrupt : Pos 25, 1 Bit
+0x004 AddressValid : Pos 26, 1 Bit
+0x004 MiscValid : Pos 27, 1 Bit
+0x004 ErrorEnabled : Pos 28, 1 Bit
+0x004 UncorrectedError : Pos 29, 1 Bit
+0x004 StatusOverFlow : Pos 30, 1 Bit
+0x004 Valid : Pos 31, 1 Bit
+0x000 QuadPart : Uint8B
0: kd> .formats
Numeric expression missing from '<EOL>'
0: kd> dt nt!_KPRCB -y VendorString
+0x5948 VendorString : [13] UChar
0: kd> !prcb
PRCB for Processor 0 at fffff780ffff0000:
Current IRQL -- 8
Threads-- Current fffff801813d3880 Next fffffa801aa39b00 Idle fffff801813d3880
Processor Index 0 Number (0, 0) GroupSetMember 1
Interrupt Count -- 0e9e5e41
Times -- Dpc 00020ad8 Interrupt 00006706
Kernel 00281360 User 0000d0e6
0: kd> dt nt!_KPRCB -y VendorString fffff780ffff0000
+0x5948 VendorString : [13] "GenuineIntel"
0: kd> !errrec fffffa80080d2028
===============================================================================
Common Platform Error Record @ fffffa80080d2028
-------------------------------------------------------------------------------
Signature : *** INVALID ***
Revision : 0.0
Record Id : 0000000000000000
Severity : Recoverable (0)
Length : 0
Creator : {00000000-0000-0000-0000-000000000000}
Notify Type : {00000000-0000-0000-0000-000000000000}
Flags : 0x00000000

0: kd> !errrec fffffa8019f65038
===============================================================================
Common Platform Error Record @ fffffa8019f65038
-------------------------------------------------------------------------------
Record Id : 01d89ba8f795a29c
Severity : Fatal (1)
Length : 672
Creator : Microsoft
Notify Type : PCI Express Error
Timestamp : 7/20/2022 7:25:31 (UTC)
Flags : 0x00000000

===============================================================================
Section 0 : PCI Express
-------------------------------------------------------------------------------
Descriptor @ fffffa8019f650b8
Section @ fffffa8019f65148
Offset : 272
Length : 208
Flags : 0x00000001 Primary
Severity : Recoverable

Port Type : Root Port
Version : 1.1
Command/Status: 0x0010/0x0407
Device Id :
VenId:DevId : 8086:0e08
Class code : 030400
Function No : 0x00
Device No : 0x03
Segment : 0x0000
Primary Bus : 0x00
Second. Bus : 0x00
Slot : 0x0000
Dev. Serial # : 0000000000000000
Express Capability Information @ fffffa8019f6517c
Device Caps : 00008021 Role-Based Error Reporting: 1
Device Ctl : 0127 ur FE NF CE
Dev Status : 0002 ur fe NF ce
Root Ctl : 0008 fs nfs cs

AER Information @ fffffa8019f651b8
Uncorrectable Error Status : 00000020 ur ecrc mtlp rof uc ca cto fcp ptlp SD dlp und
Uncorrectable Error Mask : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
Uncorrectable Error Severity : 00062010 ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und
Correctable Error Status : 00000000 adv rtto rnro dllp tlp re
Correctable Error Mask : 00000000 adv rtto rnro dllp tlp re
Caps & Control : 00000005 ecrcchken ecrcchkcap ecrcgenen ecrcgencap FEP
Header Log : 00000000 00000000 00000000 00000000
Root Error Command : 00000000 fen nfen cen
Root Error Status : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
Correctable Error Source ID : 00,00,00
Correctable Error Source ID : 00,00,00

===============================================================================
Section 1 : Processor Generic
-------------------------------------------------------------------------------
Descriptor @ fffffa8019f65100
Section @ fffffa8019f65218
Offset : 480
Length : 192
Flags : 0x00000000
Severity : Informational

Proc. Type : x86/x64
Instr. Set : x64
CPU Version : 0x00000000000306e4
Processor ID : 0x0000000000000000

Here is the relevant vendor/device ID looked up using pci-db.com

Vendor ID: 8086
Vendor Name: Intel Corporation
Device ID: 0E08
Device Name: Xeon E7 v2/Xeon E5 v2/Core i7 PCI Express Root Port 3a


------------------------------------

At first I thought it had to be an issue with the CPU, so I checked the running temps, which are fine, sitting stable around 30 °C. I will run a stress test tonight after hours anyway to try to rule that out. As I went further in the analysis I got to where I am now: the error record seems to point at PCI-E root port 3a as the culprit. I'm guessing I need to find out what's connected to that slot, reseat and clean it, and hope for the best. I can't get into the physical server yet because it's in production, but I wonder if it's one of the NICs, as I see two there. That of course leads me back to my initial idea about the router changes.
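
As a sanity check on the AER section of the error record, the register values can be decoded by hand. Here is a minimal Python sketch, assuming the standard bit positions of the PCI Express AER Uncorrectable Error Status/Mask/Severity registers. The abbreviations mirror the ones `!errrec` prints, and `decode_uncorrectable` is a hypothetical helper name, not part of any debugger API.

```python
# Bit positions shared by the AER Uncorrectable Error Status, Mask,
# and Severity registers. Abbreviations follow the !errrec output.
UNCORRECTABLE_BITS = {
    0: ("und", "Undefined"),
    4: ("dlp", "Data Link Protocol Error"),
    5: ("sd", "Surprise Down Error"),
    12: ("ptlp", "Poisoned TLP"),
    13: ("fcp", "Flow Control Protocol Error"),
    14: ("cto", "Completion Timeout"),
    15: ("ca", "Completer Abort"),
    16: ("uc", "Unexpected Completion"),
    17: ("rof", "Receiver Overflow"),
    18: ("mtlp", "Malformed TLP"),
    19: ("ecrc", "ECRC Error"),
    20: ("ur", "Unsupported Request Error"),
}

def decode_uncorrectable(value: int) -> list[str]:
    """Return the abbreviations of all known bits set in the register value."""
    return [abbr for bit, (abbr, _name) in sorted(UNCORRECTABLE_BITS.items())
            if value & (1 << bit)]

# Values copied from the AER Information block of the error record.
status = 0x00000020
mask = 0x00000000
severity = 0x00062010

print("status  :", decode_uncorrectable(status))            # ['sd']
print("unmasked:", decode_uncorrectable(status & ~mask))    # ['sd']
print("severity:", decode_uncorrectable(severity))          # ['dlp', 'fcp', 'rof', 'mtlp']
```

The only status bit set is Surprise Down, which matches the PCIEXPRESS_SURPRISE_DOWN_ERROR part of the failure bucket: the root port saw the link to whatever sits below it drop unexpectedly. The SD bit is clear in the severity register, which is consistent with the NF (non-fatal) flag shown in Dev Status.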


Can anyone help me analyze this, or is this about as far as I can go? I'm well out of my element with how deep I've gone, but I'm trying my best to figure this out.

I appreciate any and all advice, as I'm part of a two-man IT department and my higher-up unfortunately has his own work to focus on.

Thank you!
 
First try this: go into the UEFI BIOS and choose the option to reset everything to defaults.
Save and exit. Do not enable any overclocking or XMP settings, just the bare minimum to start. If it works now, change the settings back one at a time to find out which one was the culprit.

I had the same error after updating my BIOS while keeping the same settings. I then reset to defaults and re-enabled the same settings by hand, and the error and BSOD went away.
 
Please note that the user hasn't been online in almost a month so I doubt they're going to come back.
 
