wutangfan2222
New member
- Jul 20, 2022
- 3
Hi Everyone
I'm having a difficult time with my server right now and can't figure out what's going on with it. It runs Windows Server 2012 Standard and hosts 2 VMs, if that makes any difference.
I used this PCI-E WHEA errors (0x124) thread to help me analyze the dump file, and I'm very thankful to Vir Gnarus for it; it got me much further than I would have gotten on my own.
Our server went down with a BSOD at about 2 PM EST yesterday, with the stop code WHEA_UNCORRECTABLE_ERROR. The only change I made recently was at the router level: we had to set up a site-to-site VPN connection to Azure. I think this may be relevant partly because it's the only change I made, and partly because the remote VPN subnet is the same subnet as the heartbeat network for our cluster (10.0.0.1 is the heartbeat address). I'm not sure whether this could in some crazy way have caused a feedback loop and brought the NIC down. Maybe this is nonsense, but for some reason it's where my mind keeps going.
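The subnet overlap itself (not whether it could crash anything) can be checked with a rough Python sketch like the one below. The /24 prefix lengths and the helper name `subnets_overlap` are assumptions for illustration; the real masks are not given in this post.

```python
import ipaddress

def subnets_overlap(a: str, b: str) -> bool:
    """Return True if the two networks share any addresses."""
    # strict=False lets us pass a host address such as 10.0.0.1/24
    # and have it normalised to its containing network (10.0.0.0/24).
    net_a = ipaddress.ip_network(a, strict=False)
    net_b = ipaddress.ip_network(b, strict=False)
    return net_a.overlaps(net_b)

# Assumed values: only 10.0.0.1 comes from the post, the /24 masks are guesses.
heartbeat = "10.0.0.1/24"
remote_vpn = "10.0.0.0/24"

print(subnets_overlap(heartbeat, remote_vpn))
```

If this prints True, traffic for the heartbeat range could be routed ambiguously, which is a routing problem worth fixing regardless of the crash.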
Here is the bugcheck info/analysis
kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon. Try !errrec Address of the WHEA_ERROR_RECORD structure to get more details.
Arguments:
Arg1: 0000000000000004, PCI Express Error
Arg2: fffffa8019f65038, Address of the WHEA_ERROR_RECORD structure.
Arg3: 0000000000000000
Arg4: 0000000000000000
Debugging Details:
------------------
KEY_VALUES_STRING: 1
Key : Analysis.CPU.mSec
Value: 1499
Key : Analysis.DebugAnalysisManager
Value: Create
Key : Analysis.Elapsed.mSec
Value: 1495
Key : Analysis.Init.CPU.mSec
Value: 3937
Key : Analysis.Init.Elapsed.mSec
Value: 2245282
Key : Analysis.Memory.CommitPeak.Mb
Value: 82
Key : WER.OS.Branch
Value: win8_rtm
Key : WER.OS.Timestamp
Value: 2012-07-25T12:47:00Z
Key : WER.OS.Version
Value: 8.0.9200.16384
BUGCHECK_CODE: 124
BUGCHECK_P1: 4
BUGCHECK_P2: fffffa8019f65038
BUGCHECK_P3: 0
BUGCHECK_P4: 0
CUSTOMER_CRASH_COUNT: 1
PROCESS_NAME: System
DPC_STACK_BASE: FFFFF80182898FB0
STACK_OVERFLOW: Stack Limit: fffff80182892fb0. Use (kF) and (!stackusage) to investigate stack usage.
STACK_TEXT:
fffff801`82892958 fffff801`8104493d : 00000000`00000124 00000000`00000004 fffffa80`19f65038 00000000`00000000 : nt!KeBugCheckEx
fffff801`82892960 fffff801`811dcc09 : 00000000`00000001 fffffa80`19f910f8 00000000`00000000 fffffa80`19f65038 : hal!HalBugCheckSystem+0xf9
fffff801`828929a0 fffff880`0128a36b : fffffa80`00000750 fffffa80`19f910f8 00000000`00000000 fffffa80`19f66010 : nt!WheaReportHwError+0x249
fffff801`82892a00 fffff880`01289d75 : 00000000`00000000 fffff801`82892aa0 fffff880`03be6780 00000000`00000000 : pci!ExpressRootPortAerInterruptRoutine+0x2ab
fffff801`82892a60 fffff801`810f7106 : fffff801`81379180 fffffa80`1abf63a8 00000061`568ab8a8 00000000`00000001 : pci!ExpressRootPortInterruptRoutine+0x3d
fffff801`82892ad0 fffff801`81129552 : fffff801`81379180 fffff801`81379180 00000000`00183db0 fffff801`813d3880 : nt!KiInterruptDispatch+0x1d6
fffff801`82892c60 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x32
MODULE_NAME: GenuineIntel
IMAGE_NAME: GenuineIntel.sys
STACK_COMMAND: .thread ; .cxr ; kb
FAILURE_BUCKET_ID: 0x124_4_GenuineIntel_PCIEXPRESS_SURPRISE_DOWN_ERROR_IMAGE_GenuineIntel.sys
OS_VERSION: 8.0.9200.16384
BUILDLAB_STR: win8_rtm
OSPLATFORM_TYPE: x64
OSNAME: Windows 8
FAILURE_ID_HASH: {a1c876dc-7f6f-101c-5993-282d7471de42}
Followup: MachineOwner
---------
0: kd> !errrec
0: kd> dt hal!_MCi_STATUS
+0x000 McaErrorCode : Uint2B
+0x002 ModelErrorCode : Uint2B
+0x004 OtherInformation : Pos 0, 23 Bits
+0x004 ActionRequired : Pos 23, 1 Bit
+0x004 Signalling : Pos 24, 1 Bit
+0x004 ContextCorrupt : Pos 25, 1 Bit
+0x004 AddressValid : Pos 26, 1 Bit
+0x004 MiscValid : Pos 27, 1 Bit
+0x004 ErrorEnabled : Pos 28, 1 Bit
+0x004 UncorrectedError : Pos 29, 1 Bit
+0x004 StatusOverFlow : Pos 30, 1 Bit
+0x004 Valid : Pos 31, 1 Bit
+0x000 QuadPart : Uint8B
0: kd> .formats
Numeric expression missing from '<EOL>'
0: kd> dt nt!_KPRCB -y VendorString
+0x5948 VendorString : [13] UChar
0: kd> !prcb
PRCB for Processor 0 at fffff780ffff0000:
Current IRQL -- 8
Threads-- Current fffff801813d3880 Next fffffa801aa39b00 Idle fffff801813d3880
Processor Index 0 Number (0, 0) GroupSetMember 1
Interrupt Count -- 0e9e5e41
Times -- Dpc 00020ad8 Interrupt 00006706
Kernel 00281360 User 0000d0e6
0: kd> dt nt!_KPRCB -y VendorString fffff780ffff0000
+0x5948 VendorString : [13] "GenuineIntel"
0: kd> !errrec fffffa80080d2028
===============================================================================
Common Platform Error Record @ fffffa80080d2028
-------------------------------------------------------------------------------
Signature : *** INVALID ***
Revision : 0.0
Record Id : 0000000000000000
Severity : Recoverable (0)
Length : 0
Creator : {00000000-0000-0000-0000-000000000000}
Notify Type : {00000000-0000-0000-0000-000000000000}
Flags : 0x00000000
0: kd> !errrec fffffa8019f65038
===============================================================================
Common Platform Error Record @ fffffa8019f65038
-------------------------------------------------------------------------------
Record Id : 01d89ba8f795a29c
Severity : Fatal (1)
Length : 672
Creator : Microsoft
Notify Type : PCI Express Error
Timestamp : 7/20/2022 7:25:31 (UTC)
Flags : 0x00000000
===============================================================================
Section 0 : PCI Express
-------------------------------------------------------------------------------
Descriptor @ fffffa8019f650b8
Section @ fffffa8019f65148
Offset : 272
Length : 208
Flags : 0x00000001 Primary
Severity : Recoverable
Port Type : Root Port
Version : 1.1
Command/Status: 0x0010/0x0407
Device Id :
VenIdevId : 8086:0e08
Class code : 030400
Function No : 0x00
Device No : 0x03
Segment : 0x0000
Primary Bus : 0x00
Second. Bus : 0x00
Slot : 0x0000
Dev. Serial # : 0000000000000000
Express Capability Information @ fffffa8019f6517c
Device Caps : 00008021 Role-Based Error Reporting: 1
Device Ctl : 0127 ur FE NF CE
Dev Status : 0002 ur fe NF ce
Root Ctl : 0008 fs nfs cs
AER Information @ fffffa8019f651b8
Uncorrectable Error Status : 00000020 ur ecrc mtlp rof uc ca cto fcp ptlp SD dlp und
Uncorrectable Error Mask : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
Uncorrectable Error Severity : 00062010 ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und
Correctable Error Status : 00000000 adv rtto rnro dllp tlp re
Correctable Error Mask : 00000000 adv rtto rnro dllp tlp re
Caps & Control : 00000005 ecrcchken ecrcchkcap ecrcgenen ecrcgencap FEP
Header Log : 00000000 00000000 00000000 00000000
Root Error Command : 00000000 fen nfen cen
Root Error Status : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
Correctable Error Source ID : 00,00,00
Correctable Error Source ID : 00,00,00
===============================================================================
Section 1 : Processor Generic
-------------------------------------------------------------------------------
Descriptor @ fffffa8019f65100
Section @ fffffa8019f65218
Offset : 480
Length : 192
Flags : 0x00000000
Severity : Informational
Proc. Type : x86/x64
Instr. Set : x64
CPU Version : 0x00000000000306e4
Processor ID : 0x0000000000000000
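The AER register values in the record above can also be decoded by hand. This is a minimal Python sketch, assuming the standard bit layout of the PCIe AER Uncorrectable Error Status register; the table and function names are made up for illustration and are not part of any debugger API.

```python
# Bit positions in the AER Uncorrectable Error Status / Severity registers.
# The abbreviations match the flags WinDbg prints (upper case = bit set).
AER_UNCORRECTABLE_BITS = {
    0: ("und", "Undefined"),
    4: ("dlp", "Data Link Protocol Error"),
    5: ("sd", "Surprise Down Error"),
    12: ("ptlp", "Poisoned TLP"),
    13: ("fcp", "Flow Control Protocol Error"),
    14: ("cto", "Completion Timeout"),
    15: ("ca", "Completer Abort"),
    16: ("uc", "Unexpected Completion"),
    17: ("rof", "Receiver Overflow"),
    18: ("mtlp", "Malformed TLP"),
    19: ("ecrc", "ECRC Error"),
    20: ("ur", "Unsupported Request Error"),
}

def decode_aer_uncorrectable(value: int) -> list[str]:
    """Return the names of all bits set in an AER uncorrectable register."""
    return [
        name
        for bit, (_abbr, name) in sorted(AER_UNCORRECTABLE_BITS.items())
        if value & (1 << bit)
    ]

# Values copied from the !errrec output above.
print(decode_aer_uncorrectable(0x00000020))  # ['Surprise Down Error']
print(decode_aer_uncorrectable(0x00062010))
```

The status value 0x00000020 decodes to Surprise Down, which agrees with the upper-case SD flag in the dump and with the PCIEXPRESS_SURPRISE_DOWN_ERROR in the failure bucket ID.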
Here is the relevant vendor/device ID looked up using pci-db.com
Vendor ID 8086
Vendor Name Intel Corporation
Device ID 0E08
Device Name Xeon E7 v2/Xeon E5 v2/Core i7 PCI Express Root Port 3a
------------------------------------
At first I thought it had to be an issue with the CPU, so I checked the running temps, which are fine, sitting stable at around 30 °C. I will run a stress test tonight after hours anyway to try to rule that out. As I went further in the analysis I got to the point I'm at now, where it seems to be saying that PCI-E root port 3a is the culprit. I'm guessing I need to find out what's connected to that port, reseat and clean it, and hope for the best. I can't get into the physical server yet because it's in production, but I wonder if it's one of the NICs, as I see two there. That of course leads me back to my initial idea about the router changes.
Can anyone help me analyze this, or is this about as far as I can go? I am very out of my element with how deep I've gone here, but I'm trying my best to figure this out.
I'd appreciate any and all advice, as I'm part of a 2-man IT department and my higher-up unfortunately has his own work to focus on.
Thank you!