Deek
Well-known member
Hi - seasoned server tech here...I am at a loss on this one and am just hoping someone can say something that creates a spark!!
Please help if you can.
Specs:
Server 2019 - Supermicro BigTwin 4-blade chassis with 3 blades, X11DPT-B motherboard
128GB RAM, Xeon Silver, 2x Intel NVMe RAID 1 arrays using SM RAID key
Supermicro built it with 100% approved HW
UEFI Install
Nothing but native HW drivers installed (Video [ASPEED], NIC [Intel i350], Intel VROC)
Not really any other 3rd-party software installed other than the std stuff.
Symptoms:
Every 3-5 days the server goes offline - remote access via ScreenConnect and IPMI behave the same. I can connect to the server, but just get a greyish screen.
I can see the mouse, but that is it. I cannot get to the login screen. The server pings, but stuff like DHCP and DNS stops working. I can connect to the ScreenConnect agent on the server,
so it is at least somewhat there.
The only telling event log I get is that it can't talk to the other DC - which I kind of expect when a server is balled up...no DNS would cause this.
No crash dumps, no useful events, no hardware errors. There are some DCOM errors we all ignore...
All three blades do this, but the DC does it every 3-5 days; the other 2 only do it maybe once every 2 weeks (but they are not in production yet).
I have never had it crash while using it. The only recovery is a reset through IPMI.
No DMPs.
What I have Tried:
-BIOS updates (been through 2 at this point)
-Memtests all clear
-Chkdsk all volumes
-SFC /scannow
-Updated every driver I could find...multiple times
-Windows updates (duh!)
-Disabled unused features in BIOS relating to IO
If anyone has a logical next step - I would love to hear it!!! How can I generate clues?