Server 2019 stability issues - Looking for suggestions....

Deek

Hi - seasoned server tech here. I am at a loss on this one and am just hoping someone can say something that creates a spark!
Please help if you can.

Specs:
Server 2019 - Supermicro Big Twin 4 blade chassis with 3 blades, X11-DPT-B motherboard
128GB RAM, Xeon Silver, 2x Intel NVMe RAID 1 arrays using SM RaidKey
Supermicro built it with 100% approved HW
UEFI Install
Nothing but native HW drivers installed (Video [ASpeed], NIC [Intel i350], Intel VROC)
No third-party software installed apart from the standard stuff.

Symptoms:
Every 3-5 days the server goes offline. Remote access via ScreenConnect and IPMI behaves the same: I can connect to the server, but just get a greyish screen.
I can see the mouse, but that is it. I cannot get to the login screen. The server pings, but services like DHCP and DNS stop working. I can connect to the ScreenConnect agent on the server,
so it is at least somewhat there.

The only telling event log I get is that it can't talk to the other DC - which I kind of expect when a server is balled up...no DNS would cause this.
No crash dumps, no useful events, no hardware errors. There are some DCOM errors we all ignore...

All three blades do this, but the DC does it every 3-5 days; the other 2 only do it maybe once every 2 weeks (but they are not in production yet).
I have never had it crash while using it. Only recovery is a reset through IPMI
No DMP's

What I have Tried:
-Bios updates (been through 2 at this point)
-Memtests all clear
-Chkdsk all volumes
-SFC /scannow
-Updated every driver I could find...multiple times
-Windows updates (duh!)
-Disabled un-used features in bios relating to IO

If anyone has a logical next step - I would love to hear it!!! How can I generate clues?
 
You say all 3 blades are experiencing similar issues? Are they all Server 2019 installs - any differences between the installs apart from the fact one's a DC? The fact it's happening on all 3 blades is very odd (side note - that BigTwin is a nice looking machine!). What's common between them - storage, RAM, power?

I'd suggest trying Driver Verifier on one of the non-production machines and seeing if it's stable. If it starts BSODing or crashing more with Driver Verifier enabled, we might be able to narrow it down to a driver issue: Driver Verifier - BSOD related - Windows 10, 8.1, 8, 7 + Vista

For debugging weird issues like this that don't happen regularly, it might be worth giving Sysmon a try as it can be useful at capturing events that normally go unnoticed in Event Viewer: Sysmon - Windows Sysinternals. It's a bit of a pain to set the config up correctly, but there's a very good reference here: SwiftOnSecurity/sysmon-config. Sysmon is normally used for monitoring and diagnosing malware & security issues, but can be helpful at capturing what happens before stuff goes wrong.

Another useful tool would be ProcDump, set up to capture dump files of any processes that crash: ProcDump - Windows Sysinternals. The command for that is procdump -ma -i C:\dumps, which will save any process crash dumps into C:\dumps. This can be useful if there's some process that's crashing and bringing the server down (although that's unlikely from the symptoms).

I wonder if forcing a BSOD when the machine is hung would reveal anything? See: Forcing a System Crash from the Keyboard - Windows drivers
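The keyboard-initiated crash has to be enabled in the registry first, and it only takes effect after a restart. A rough Python sketch that builds the .reg file text for it is below. The function name and the "usb"/"ps2" labels are made up for illustration; the CrashOnCtrlScroll value and the kbdhid / i8042prt service keys are the documented ones. Whether the IPMI virtual keyboard passes the right Ctrl + Scroll Lock (twice) sequence through is an open question, so treat this as something to try on a non-production blade.

```python
# Sketch: build the text of a .reg file that enables the keyboard-initiated crash.
# Function name and keyboard-type labels are illustrative assumptions.

REG_HEADER = "Windows Registry Editor Version 5.00"

# Driver service that owns the setting, by keyboard type.
KEYBOARD_SERVICES = {
    "usb": "kbdhid",    # USB keyboards
    "ps2": "i8042prt",  # PS/2 keyboards
}


def build_crash_on_ctrl_scroll_reg(keyboard_type: str = "usb") -> str:
    """Return .reg file content that sets CrashOnCtrlScroll=1 for the given keyboard type."""
    service = KEYBOARD_SERVICES[keyboard_type]
    key = (
        "HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\"
        + service
        + "\\Parameters"
    )
    lines = [
        REG_HEADER,
        "",
        "[" + key + "]",
        '"CrashOnCtrlScroll"=dword:00000001',
        "",
    ]
    return "\n".join(lines)


print(build_crash_on_ctrl_scroll_reg("usb"))
```

Save the printed text as a .reg file, import it, and reboot. Once the machine hangs, holding the right Ctrl key and pressing Scroll Lock twice should trigger the bugcheck, provided the keyboard input still reaches the kernel.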

Sorry I can't offer an answer, but hopefully I've come up with a few useful points!

-Stephen
 
Tekno - Thank you, at least it is a place to start. I will give Sysmon a try. The only thing common between them is the 2 PSUs; everything else is discrete. Thank goodness for IPMI, it has saved me a lot of truck rolls.
 
I would assume it emulates a USB HID device, you might be able to look in Device Manager and see if you can see it listed
 
Bumping this one again as I have new info. This was preceded by Event 161 saying there was an error creating the dump file, so I still cannot get a .dmp file, but it did reboot on its own this time. Also, I was running Driver Verifier at the time. Its query output will be in the next post.


System
- Provider
[ Name] Microsoft-Windows-Kernel-Power
[ Guid] {331c3b3a-2005-44c2-ac5e-77220c37d6b4}
EventID 41
Version 6
Level 1
Task 63
Opcode 0
Keywords 0x8000400000000002
- TimeCreated
[ SystemTime] 2020-05-11T22:48:46.167608400Z
EventRecordID 65640
Correlation
- Execution
[ ProcessID] 4
[ ThreadID] 8
Channel System
Computer HRK.******.local
- Security
[ UserID] S-1-5-18

- EventData

BugcheckCode 80
BugcheckParameter1 0xffffcd009a51eff0
BugcheckParameter2 0x0
BugcheckParameter3 0x0
BugcheckParameter4 0x0
SleepInProgress 0
PowerButtonTimestamp 0
BootAppStatus 0
Checkpoint 0
ConnectedStandbyInProgress false
SystemSleepTransitionsToOn 0
CsEntryScenarioInstanceId 0
BugcheckInfoFromEFI true
CheckpointStatus 0
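
A note on reading the event above: Event 41 reports BugcheckCode in decimal, while stop codes are normally quoted in hex, so the 80 here is stop code 0x50. A minimal Python sketch of that conversion follows. The lookup table is a small hand-picked subset of common stop codes and the function name is made up for illustration; it is not any Microsoft API.

```python
# Sketch: convert the decimal BugcheckCode from Event 41 into the usual hex stop code.
# The table is a short, hand-picked subset of stop codes, not a complete list.

KNOWN_BUGCHECKS = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x50: "PAGE_FAULT_IN_NONPAGED_AREA",
    0xC4: "DRIVER_VERIFIER_DETECTED_VIOLATION",
    0xD1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
}


def describe_bugcheck(decimal_code: int) -> str:
    """Format a decimal bugcheck code as hex, with a name if it is in the table."""
    name = KNOWN_BUGCHECKS.get(decimal_code, "unknown (not in this short table)")
    return f"0x{decimal_code:X} {name}"


print(describe_bugcheck(80))  # prints "0x50 PAGE_FAULT_IN_NONPAGED_AREA"
```

BugcheckParameter1 (0xffffcd009a51eff0) would then be the memory address that was referenced, which is how parameter 1 is defined for stop 0x50.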
 
Driver Verifier: Does this tell anyone anything?


C:\Windows\system32>verifier /query

Time Stamp: 05/11/2020 16:38:09.500

Verifier Flags: 0x0012892b

Standard Flags:

[X] 0x00000001 Special pool.
[X] 0x00000002 Force IRQL checking.
[X] 0x00000008 Pool tracking.
[ ] 0x00000010 I/O verification.
[X] 0x00000020 Deadlock detection.
[ ] 0x00000080 DMA checking.
[X] 0x00000100 Security checks.
[X] 0x00000800 Miscellaneous checks.
[X] 0x00020000 DDI compliance checking.

Additional Flags:

[ ] 0x00000004 Randomized low resources simulation.
[ ] 0x00000200 Force pending I/O requests.
[ ] 0x00000400 IRP logging.
[ ] 0x00002000 Invariant MDL checking for stack.
[ ] 0x00004000 Invariant MDL checking for driver.
[X] 0x00008000 Power framework delay fuzzing.
[ ] 0x00010000 Port/miniport interface checking.
[ ] 0x00040000 Systematic low resources simulation.
[ ] 0x00080000 DDI compliance checking (additional).
[ ] 0x00200000 NDIS/WIFI verification.
[ ] 0x00800000 Kernel synchronization delay fuzzing.
[ ] 0x01000000 VM switch verification.
[ ] 0x02000000 Code integrity checks.

Internal Flags:

[X] 0x00100000 Extended Verifier flags (internal).

[X] Indicates flag is enabled.

Verifier Statistics Summary

Raise IRQLs: 14851679
Acquire Spin Locks: 291006089
Synchronize Executions: 0
Trims: 195827

Pool Allocations Attempted: 102118
Pool Allocations Succeeded: 102118
Pool Allocations Succeeded SpecialPool: 102118
Pool Allocations With No Tag: 0
Pool Allocations Not Tracked: 0
Pool Allocations Failed: 0
Pool Allocations Failed Deliberately: 0

Driver Verification List

MODULE: iavroc.sys (load: 1 / unload: 0)

Pool Allocation Statistics: ( NonPaged / Paged )

Current Pool Allocations: ( 15 / 0 )
Current Pool Bytes: ( 249143848 / 0 )
Peak Pool Allocations: ( 17 / 0 )
Peak Pool Bytes: ( 249144024 / 0 )
Contiguous Memory Bytes: 0
Peak Contiguous Memory Bytes: 0

MODULE: dattofsf.sys (load: 1 / unload: 0)

Pool Allocation Statistics: ( NonPaged / Paged )

Current Pool Allocations: ( 0 / 4 )
Current Pool Bytes: ( 0 / 210 )
Peak Pool Allocations: ( 0 / 6 )
Peak Pool Bytes: ( 0 / 330 )
Contiguous Memory Bytes: 0
Peak Contiguous Memory Bytes: 0

MODULE: dattofltr.sys (load: 1 / unload: 0)

Pool Allocation Statistics: ( NonPaged / Paged )

Current Pool Allocations: ( 21 / 4 )
Current Pool Bytes: ( 80058417 / 452 )
Peak Pool Allocations: ( 89 / 7 )
Peak Pool Bytes: ( 241239411 / 4634 )
Contiguous Memory Bytes: 0
Peak Contiguous Memory Bytes: 0

MODULE: iastore.sys (load: 1 / unload: 0)

Pool Allocation Statistics: ( NonPaged / Paged )

Current Pool Allocations: ( 7 / 0 )
Current Pool Bytes: ( 71135352 / 0 )
Peak Pool Allocations: ( 7 / 1 )
Peak Pool Bytes: ( 71135352 / 3416 )
Contiguous Memory Bytes: 0
Peak Contiguous Memory Bytes: 0

MODULE: astkmd.sys (load: 1 / unload: 0)

Pool Allocation Statistics: ( NonPaged / Paged )

Current Pool Allocations: ( 2 / 0 )
Current Pool Bytes: ( 10320 / 0 )
Peak Pool Allocations: ( 2 / 1 )
Peak Pool Bytes: ( 10320 / 512 )
Contiguous Memory Bytes: 0
Peak Contiguous Memory Bytes: 0

MODULE: e1r68x64.sys (load: 1 / unload: 0)

Pool Allocation Statistics: ( NonPaged / Paged )

Current Pool Allocations: ( 1067 / 0 )
Current Pool Bytes: ( 478431 / 0 )
Peak Pool Allocations: ( 1067 / 1 )
Peak Pool Bytes: ( 478431 / 1040 )
Contiguous Memory Bytes: 0
Peak Contiguous Memory Bytes: 0

MODULE: dattobusdriver.sys (load: 1 / unload: 0)

Pool Allocation Statistics: ( NonPaged / Paged )

Current Pool Allocations: ( 5 / 1 )
Current Pool Bytes: ( 400 / 126 )
Peak Pool Allocations: ( 5 / 16 )
Peak Pool Bytes: ( 400 / 1232 )
Contiguous Memory Bytes: 0
Peak Contiguous Memory Bytes: 0

MODULE: dump_diskdump.sys (load: 0 / unload: 0)

MODULE: dump_iavroc.sys (load: 6 / unload: 5)

Pool Allocation Statistics: ( NonPaged / Paged )

Current Pool Allocations: ( 0 / 0 )
Current Pool Bytes: ( 0 / 0 )
Peak Pool Allocations: ( 0 / 0 )
Peak Pool Bytes: ( 0 / 0 )
Contiguous Memory Bytes: 0
Peak Contiguous Memory Bytes: 0

MODULE: iqvw64e.sys (load: 1 / unload: 0)

Pool Allocation Statistics: ( NonPaged / Paged )

Current Pool Allocations: ( 0 / 0 )
Current Pool Bytes: ( 0 / 0 )
Peak Pool Allocations: ( 0 / 2 )
Peak Pool Bytes: ( 0 / 176 )
Contiguous Memory Bytes: 0
Peak Contiguous Memory Bytes: 0

C:\Windows\system32>
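
As a sanity check on the output above, the "Verifier Flags" value is just a bitmask of the individual options listed under it. A small Python sketch that decodes it is below; the flag names and bit values are copied from the verifier /query output in this post, and the function name is made up for illustration.

```python
# Sketch: decode the "Verifier Flags" bitmask into the option names shown by verifier /query.
# Bit values and names are taken from the output in this post.

VERIFIER_FLAGS = {
    0x00000001: "Special pool",
    0x00000002: "Force IRQL checking",
    0x00000004: "Randomized low resources simulation",
    0x00000008: "Pool tracking",
    0x00000010: "I/O verification",
    0x00000020: "Deadlock detection",
    0x00000080: "DMA checking",
    0x00000100: "Security checks",
    0x00000200: "Force pending I/O requests",
    0x00000400: "IRP logging",
    0x00000800: "Miscellaneous checks",
    0x00002000: "Invariant MDL checking for stack",
    0x00004000: "Invariant MDL checking for driver",
    0x00008000: "Power framework delay fuzzing",
    0x00010000: "Port/miniport interface checking",
    0x00020000: "DDI compliance checking",
    0x00040000: "Systematic low resources simulation",
    0x00080000: "DDI compliance checking (additional)",
    0x00100000: "Extended Verifier flags (internal)",
    0x00200000: "NDIS/WIFI verification",
    0x00800000: "Kernel synchronization delay fuzzing",
    0x01000000: "VM switch verification",
    0x02000000: "Code integrity checks",
}


def decode_verifier_flags(mask: int):
    """Return (enabled option names in ascending bit order, leftover unrecognized bits)."""
    enabled = []
    leftover = mask
    for bit, name in VERIFIER_FLAGS.items():
        if mask & bit:
            enabled.append(name)
            leftover &= ~bit
    return enabled, leftover


enabled, leftover = decode_verifier_flags(0x0012892B)
for name in enabled:
    print("[X]", name)
print("unrecognized bits:", hex(leftover))
```

For 0x0012892b this gives the same nine options marked [X] above, with no bits left over. Note that I/O verification (0x10) and DMA checking (0x80) are not among them.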
 
On another note, I think I am having a shutdown crash as well. I have noticed it on several other servers that are 2019...but it just occurred to me that they too are Supermicro X11 boards.

After a normal reboot, I always get the "Unexpected shutdown" message on the next reboot. Not sure if that might somehow be related.
 
