Fairly Sudden Onset of extreme instability.

fedup-fromAZ

Active member
Joined
Mar 28, 2024
Posts
34
I did originally post in the Windows Update topic here but was asked to move it to this Topic.

First things first. I am Jared. I have 18+ years of experience as a system admin / virtualization engineer.
I used to say I have seen parts of Windows nobody was ever meant to see.
All that being said, I do not have a home lab, and I have been doing architect/system and network design for the last 5 years. I do not have the resources, and my knowledge is out of date.
I still thought I could solve this myself. I cannot. Please help.

I built the PC initially in 2019 and it ran great until last year (2023). I started having random network disconnects. I ruled out everything but the motherboard, so I took the opportunity to upgrade.
I replaced/upgraded everything except the non-OS storage and the PSU (EVGA 850 Bronze).
I fought with Microsoft for a long time to get them to activate it, since to them it looked like a new PC, not an upgrade.

This version of the PC ran great for about 4 months. This week I started getting random BSODs (as many as 10 per day), constant "Aw, Snap!" errors from Chrome (every 5 minutes or so), random app crashes (10-20 per day), and the occasional hard system crash with no error (probably 2 or 3 times per day).

I started troubleshooting and found SFC and DISM both reporting corrupt system files. From my research, this is most likely a driver conflict.


I have done the following and I am still getting BSODs, RPC service crashes, and overall horrible instability.
  • sfc /scannow :: Half the time this fails, and when it does, it reports that it found corrupt files but could not fix them.
  • DISM /online /cleanup-image /restorehealth (both with and without the source pointing to a fresh official ISO mounted from USB) :: This either crashes, coinciding with a crash of the RPC service, or completes and says the source files could not be located, then proceeds to tell me how to specify a source.
  • I have updated the firmware for my motherboard and SSDs. I installed the latest chipset drivers from ASUS.
  • I even partitioned off the free space on one of my drives and did a clean install of Windows 11. It BSODed before I even got to the login screen. I did not take that any further; I wiped that partition as soon as I saw it act the same way.
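
For anyone repeating the two repair commands above, here is a minimal Python sketch that only builds the command lines (it does not run them). It assumes the mounted ISO exposes `sources\install.wim` and that image index 1 is the right edition; some ISOs ship `install.esd` instead, and the index varies by edition. The helper names `build_sfc_command` and `build_dism_command` are made up for illustration.

```python
from typing import List, Optional


def build_sfc_command() -> List[str]:
    """Return the System File Checker command from the first bullet."""
    return ["sfc", "/scannow"]


def build_dism_command(source_drive: Optional[str] = None, index: int = 1) -> List[str]:
    """Return the DISM repair command, optionally pointing at a mounted ISO.

    source_drive: drive letter of the mounted ISO, e.g. "E" or "E:".
    index: image index inside install.wim (assumption: 1; check your ISO).
    """
    cmd = ["DISM", "/Online", "/Cleanup-Image", "/RestoreHealth"]
    if source_drive is not None:
        # Normalize "E", "E:" and "E:\" to the bare letter.
        drive = source_drive.rstrip(":\\")
        # Assumption: the ISO contains sources\install.wim (not install.esd).
        cmd.append(f"/Source:WIM:{drive}:\\sources\\install.wim:{index}")
        # /LimitAccess keeps DISM from falling back to Windows Update.
        cmd.append("/LimitAccess")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_sfc_command()))
    print(" ".join(build_dism_command()))
    print(" ".join(build_dism_command("E")))
```

Both commands need an elevated prompt. If the ISO carries `install.esd`, the `/Source` prefix would need to be `ESD:` rather than `WIM:`.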

SIDE NOTE I HOPE IS NOT RELATED: The only thing that changed around the time all this started was that I bought an Asustor AS5404, created an iSCSI target, and connected it to the system using the Microsoft iSCSI initiator. It was working fine. I have since disconnected it and disabled that service, thinking it might be causing the issues.

System Specs:
Home Build
Windows 11 Pro
(about 4 months since this install, I think)
Version 10.0.22631 Build 22631

CPU

Intel Core i9 14900K

GPU
EVGA GeForce RTX 3080 Ti FTW3 ULTRA GAMING
- NVIDIA GeForce RTX 3080 Ti

Motherboard
ASUS ROG Strix Z790-E Gaming WiFi 6E

RAM - 64 GB
Slot A2: Corsair CMK64GX5M2B5600C40 DDR5
Slot B2: Corsair CMK64GX5M2B5600C40 DDR5

Storage
Samsung SSD 860 PRO 1TB
WDC WDS500G2B0A-00SM50
Samsung SSD 970 EVO 1TB
Samsung SSD 980 PRO 1TB
*Samsung SSD 970 EVO Plus 2TB* - OS DRIVE
Samsung PSSD T7 Shield SCSI Disk Device

Audio
Realtek USB Audio
A50 X Game
NVIDIA High Definition Audio
Realtek USB Audio
A50 X Voice
NVIDIA High Definition Audio
A50 X Mic Out
A50 X Stream Out
Logitech BRIO

Network
Built in Intel 2.5Gb Ethernet
PCIe TP-Link 2.5Gb (TX201)

Power
EVGA 850 Bronze


As I sat here last night watching Chrome crash multiple times and rereading what I wrote, it occurred to me that my PSU could very well have caused this issue. I will look at upgrading it next. In the meantime, as a kind of test, this morning I turned off the High Performance power mode and set the system to Balanced. It seems to be crashing WAY less. Could a weak or failing power supply cause corruption and random failures? Also, if I upgrade my PSU, can my OS be fixed? All my hardware health checks come back OK, so I do not think there is any damaged hardware yet.
 


Verifier is not currently running. I did turn it on, and the system would not boot, so it produced 0 results. I had to go into recovery mode to turn it off. I can try again, because I am not certain I did it right.

Ran Verifier. I followed the instructions posted in this forum. It will not allow the system to boot. I had to use recovery mode to disable it again. I tried twice.
 
Welcome to the BSOD forum! :-)

When you say it won't boot with Driver Verifier enabled, using the instructions here, what happens exactly? Does it BSOD-loop (common if a flaky driver is loaded at boot time), or does it freeze or crash, or restart, or power off, or what? As I'm sure you appreciate, a healthy system should boot with Driver Verifier enabled.

My first impression, from looking through your dumps and logs plus your saying that the problem exists after a clean reinstall of Windows, is that this is very probably a hardware problem. I'm seeing some services ending abruptly due to invalid memory references in the logs, and invalid memory references in the dumps, so in the first instance I suggest you focus on your RAM. There are two ways you can test RAM...
  • You can download and run Memtest86 (free), running it twice (to get 8 iterations of the 13 different tests), but this will take a long time on your 64GB and you won't be able to use the PC at all during this testing. In addition, no memory tester can find 100% of potential memory issues.

  • Since you have two 32GB RAM sticks you could try removing one and running on just the other for 24 hours or so, then swap sticks. If one stick is flaky you'll see BSODs with it but not with the other. The two main advantages of this method are: you can use the PC as normal whilst 'testing', and this is guaranteed to highlight a flaky stick.
Before we look any further, could you please test your RAM using one of the above methods - I would recommend removing a stick.
 
Hi,

Just out of curiosity, could you elaborate on what you have done to rule out everything but the motherboard? Considering a motherboard connects everything together, I'd think it's one of the last things to rule out normally, which makes me curious how you got to this conclusion.
I built the PC initially in 2019 and it ran great until last year (2023). I started having random network disconnects. I ruled out everything but the motherboard, so I took the opportunity to upgrade.
 
When you say it won't boot with Driver Verifier enabled, using the instructions here, what happens exactly? Does it BSOD-loop (common if a flaky driver is loaded at boot time)
BSOD loop.

You can download and run Memtest86 (free), running it twice (to get 8 iterations of the 13 different tests), but this will take a long time on your 64GB and you won't be able to use the PC at all during this testing. In addition, no memory tester can find 100% of potential memory issues.
My motherboard BIOS has Memtest86 built in, so I ran that overnight. It did 40 tests and took 9 hours. It found 0 errors or faults.

I'll try reseating the RAM, and I'll try each stick separately. It'll be late tonight. Thanks for the feedback.

I did notice a significant reduction in the crashes when I put the system in power saving mode. I wonder if my PSU could cause memory errors? The PCPartPicker PSU calculator says an 850W unit should be fine, but it's nearly 5 years old, and it's an 80 Plus Bronze PSU.
 
Hi,

Just out of curiosity, could you elaborate on what you have done to rule out everything but the motherboard? Considering a motherboard connects everything together, I'd think it's one of the last things to rule out normally, which makes me curious how you got to this conclusion.
That was my last build, and the only issue it had was random NIC disconnects. I replaced the cables all the way to the modem and tried different ports on the router, and the other devices on the network weren't having the issue. So I assumed the built-in NIC had failed due to a heat issue I had one weekend. I have since fixed the airflow and heat issue. After upgrading, I did more research and found that the Intel I226-V NICs have issues. So maybe I didn't NEED to upgrade, but I'd been meaning to.
 
BSOD loop.
That would suggest a boot-loaded driver is failing a Driver Verifier check - hence the BSOD loop. In the instructions for Driver Verifier here we mostly test only third-party drivers, so it is almost certainly one of these causing the BSOD loop - that may be the cause of the issue you first noticed. You could try a clean boot of Windows: go into msconfig, click the Services tab, check the box to hide all Microsoft services, and then uncheck all the third-party services you can temporarily do without. Then open Task Manager, click the Startup tab, and disable all the startup processes that you can temporarily do without. If you then enable Driver Verifier again as per the instructions here and reboot, it should boot just fine. Give that a try, and if it does boot we'll move on from there.

My motherboard BIOS has Memtest86 built in, so I ran that overnight. It did 40 tests and took 9 hours. It found 0 errors or faults.

I'll try reseating the RAM, and I'll try each stick separately. It'll be late tonight. Thanks for the feedback.
Perfect. Reseating RAM is always a good idea, though with large RAM configurations like yours I'm a big fan of removing one stick...
 
That would suggest a boot-loaded driver is failing a Driver Verifier check - hence the BSOD loop. In the instructions for Driver Verifier here we mostly test only third-party drivers, so it is almost certainly one of these causing the BSOD loop - that may be the cause of the issue you first noticed. You could try a clean boot of Windows: go into msconfig, click the Services tab, check the box to hide all Microsoft services, and then uncheck all the third-party services you can temporarily do without. Then open Task Manager, click the Startup tab, and disable all the startup processes that you can temporarily do without. If you then enable Driver Verifier again as per the instructions here and reboot, it should boot just fine. Give that a try, and if it does boot we'll move on from there.
Hi Ubuysa,
Thank you very much for the feedback. It is greatly appreciated.

The clean boot resulted in the same issue.
The first BSOD on boot caused a reboot. The 2nd boot resulted in the following BSOD and was up long enough to get a picture. This cycle continued until recovery mode intervention.
 

Attachments

  • 20240331_144911.jpg (81.5 KB)
UPDATE: A clean start of Windows has no effect; I still get constant app and service crashes. If my PSU did cause file corruption that we have to track down and fix, it won't do that again: I just replaced the PSU with a 1000W 80+ Gold unit.
 
That would suggest a boot-loaded driver is failing a Driver Verifier check - hence the BSOD loop. In the instructions for Driver Verifier here we mostly test only third-party drivers, so it is almost certainly one of these causing the BSOD loop - that may be the cause of the issue you first noticed. You could try a clean boot of Windows: go into msconfig, click the Services tab, check the box to hide all Microsoft services, and then uncheck all the third-party services you can temporarily do without. Then open Task Manager, click the Startup tab, and disable all the startup processes that you can temporarily do without. If you then enable Driver Verifier again as per the instructions here and reboot, it should boot just fine. Give that a try, and if it does boot we'll move on from there.


Perfect. Reseating RAM is always a good idea, though with large RAM configurations like yours I'm a big fan of removing one stick...
A clean start of Windows has no effect; I still get constant app and service crashes.

Also, I ran with only one stick of RAM and got a BSOD, then ran with the other stick and also got a BSOD.

But as stated above, it does not seem like the BSODs are producing new dump files. It's weird.
 
@axe0 is way more experienced than I, and since they asked for the Sysnative file upload I will defer to them.

That said, what happened on the 30th of March? Up until then you were getting BSODs, which means that Windows was able to catch the failure. From the 30th onwards there are no BSODs in your log but there are plenty of crashes, which indicates that Windows was not able to catch the failures from the 30th onwards. Is the 30th when you enabled Driver Verifier?
 
I presume the voltages are also correct within their respective ranges?

I suspect Windows wasn't able to register the crashes whilst they were happening in the past few days.

Are you still getting crashes after the PSU replacement?
 
I am one version behind on my BIOS. Should I go ahead and upgrade that, or at least try to? Or maybe roll back a few versions?
 
I'm not a BSOD expert, but since this issue was also posted in the Windows Update section, I am following this thread as well.

In your latest System event log I noticed a critical Kernel-Power Event ID 41 ("The system has rebooted without cleanly shutting down first"):

The description for Event ID 41 from source Microsoft-Windows-Kernel-Power cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

... and just before this error was logged, I see:

ACPI thermal zone \_TZ.TZ00 has been enumerated.

This reminds me of a bad temperature sensor on the motherboard, a poorly assembled CPU cooler, airflow problems, etc.

To troubleshoot this issue further, it might be helpful to disconnect all the external hardware and/or PCIe devices, such as the TP-Link 2.5Gb (TX201) adapter!
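
To check whether those Event ID 41 entries carry a stop code at all, the records can be pulled with `wevtutil` and inspected for the `BugcheckCode` field, which the event stores as a decimal number. A minimal Python sketch under those assumptions follows; the helper names are hypothetical, and the sample event used to exercise it should be a single `<Event>` element (as far as I know, `wevtutil` prints one element per record with no enclosing root, so split the output first).

```python
import xml.etree.ElementTree as ET
from typing import List

# XML namespace used by Windows event records.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}


def build_event41_query(count: int = 10) -> List[str]:
    """Build a wevtutil command for the newest Kernel-Power 41 events as XML."""
    return [
        "wevtutil", "qe", "System",
        "/q:*[System[Provider[@Name='Microsoft-Windows-Kernel-Power'] and (EventID=41)]]",
        "/f:xml", f"/c:{count}", "/rd:true",
    ]


def bugcheck_code(event_xml: str) -> int:
    """Return the BugcheckCode of one <Event> element, or 0 if it is absent."""
    root = ET.fromstring(event_xml)
    node = root.find("e:EventData/e:Data[@Name='BugcheckCode']", NS)
    return int(node.text) if node is not None and node.text else 0


def describe(event_xml: str) -> str:
    """Say whether Windows recorded a stop code before the unclean restart."""
    code = bugcheck_code(event_xml)
    if code == 0:
        return "no bugcheck recorded (power loss, hard hang, or reset)"
    return f"bugcheck 0x{code:X} recorded before the restart"


if __name__ == "__main__":
    print(" ".join(build_event41_query(5)))
```

A `BugcheckCode` of 0 would fit the pattern of crashes that Windows could not catch, whereas a non-zero value means a BSOD was recorded even if no dump file was written.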
 
This reminds me of a bad temperature sensor on the motherboard, a poorly assembled CPU cooler, airflow problems, etc.

To troubleshoot this issue further, it might be helpful to disconnect all the external hardware and/or PCIe devices, such as the TP-Link 2.5Gb (TX201) adapter!
  • What are the idle / high load temps of the i9 14900K? Idle = 33-35 °C / high load = 80-83 °C. I have not seen my CPU go above 87 °C, and that is only in very short peaks.
  • Which cooler is installed, stock or aftermarket? NZXT Kraken Elite 240 AIO
  • What is the result when you run Prime95? It runs fine with the small FFTs; with the large FFTs I get these two lines every time:
[Main thread Apr 2 10:19] TORTURE TEST FAILED on worker #4.
[Main thread Apr 2 10:20] TORTURE TEST FAILED on worker #8.
 
Results.txt
[Tue Apr 2 13:56:45 2024]
FATAL ERROR: Final result was 3E977090, expected: 6186CC09.
Hardware failure detected running 200K FFT size, consult stress.txt file.
FATAL ERROR: Final result was BA58520F, expected: 6186CC09.
Hardware failure detected running 200K FFT size, consult stress.txt file.
FATAL ERROR: Final result was 728F21AD, expected: 6186CC09.
Hardware failure detected running 200K FFT size, consult stress.txt file.
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
Self-test 200K passed!
[Tue Apr 2 14:01:48 2024]
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
Self-test 224K passed!
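
A long results.txt like the one above is easier to read as a tally per FFT size. The following is a rough Python sketch that counts passes and hardware failures, assuming the file only uses the two line formats visible in this excerpt; the function name `summarize` and the regular expressions are illustrative, not part of Prime95.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Line formats as they appear in the excerpt above.
PASS_RE = re.compile(r"^Self-test (\d+K) passed!$")
FAIL_RE = re.compile(r"^Hardware failure detected running (\d+K) FFT size")


def summarize(lines: Iterable[str]) -> Dict[str, Counter]:
    """Count passed self-tests and hardware failures per FFT size."""
    summary = {"passed": Counter(), "failed": Counter()}
    for raw in lines:
        line = raw.strip()
        match = PASS_RE.match(line)
        if match:
            summary["passed"][match.group(1)] += 1
            continue
        match = FAIL_RE.match(line)
        if match:
            summary["failed"][match.group(1)] += 1
    return summary


if __name__ == "__main__":
    # Assumption: results.txt sits in the current directory.
    with open("results.txt", encoding="utf-8", errors="replace") as handle:
        result = summarize(handle)
    for size in sorted(set(result["passed"]) | set(result["failed"])):
        print(size, "passed:", result["passed"][size], "failed:", result["failed"][size])
```

Run against the excerpt above, this would tally three hardware failures and 21 passes at 200K, and 21 passes with no failures at 224K.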
 