Hi guys,
I'm not sure this 100% qualifies as a BSOD issue. It's a bit of a long story, so I'll summarize it as succinctly as possible.
- It's a Windows Server 2012 R2 machine that hosts AD/DNS/SQL/RDS. It's mainly used for a rental program called Point of Rentals, so on an average day we have anywhere from 30 to 60 people remoted into the server, mostly to use that program, which is SQL based (we have SQL Server Management Studio installed on the server). The server has 112 GB of RAM.
- A few months ago, the server crashed (a BSOD, I assume) and wouldn't boot. Boot repair didn't work, but a chkdsk got it booting again.
- Two months later, we had a similar issue, except this time we had to restore the most recent backup image to get it working.
- To my dismay, a week later, it happened again. A tech found that the RAID card in the Dell server was extremely hot and looked loose, so we replaced it.
- Things worked pretty well for a while, but then we started running into issues where the server gets very slow. Sometimes it'll be slow while showing only 65% memory usage, but many other times it'll show 97% or so, even though in Task Manager the most memory-hungry process is SQL, which is using about 65 GB (the limit we set for it; we've tried other limits as well for testing).
- Rebooting the server fixes this for a couple of days or more, then it happens again.
- I've checked with Dell OpenManage and everything checks out (no failing drives, etc.). This past weekend, I updated the RAID drivers (but could not update the BIOS and RAID firmware remotely), so we'll see if that has any impact. I also cleared out and ran chkdsk on some drives, and all looks good for the most part. An sfc scan found one error it couldn't fix, but a DISM RestoreHealth looks to have fixed it.
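To put rough numbers on the memory gap described in the list above, here is a back-of-the-envelope sketch. It assumes Task Manager's percentage is measured against the full 112 GB and treats the 65 GB SQL figure as exact; the function name is purely illustrative.

```python
def unaccounted_gb(total_gb, used_fraction, known_gb):
    """Memory in use that is not explained by the known consumer."""
    return total_gb * used_fraction - known_gb


total = 112  # GB of RAM in the server
sql = 65     # GB used by SQL (its configured limit)

high = unaccounted_gb(total, 0.97, sql)  # the "97% or so" readings
low = unaccounted_gb(total, 0.65, sql)   # the slow-but-65% readings

print(f"at 97%: about {high:.1f} GB in use outside SQL")
print(f"at 65%: about {low:.1f} GB in use outside SQL")
```

Under those assumptions, the 97% readings leave roughly 44 GB in use that the SQL process does not account for, versus roughly 8 GB at the 65% readings.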
I'm technically testing it right now, but I'm trying to be proactive, so I collected some logs a few days ago in the hope that they might hint at the cause in case it happens again.
Attached is the sysnative log zip.
Also, I ran poolmon while the server was hitting high memory usage last week, and here are some screenshots (keep in mind, this is with Task Manager showing SQL as the top consumer at 65 GB of RAM, and nothing else seeming to use much at all):
Here's a snapshot of task manager during this climbing memory usage:
I tried running this command to see which drivers CM31 and MmSt were tied to, but I assume it couldn't find them because it was paged memory?
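For context, one common way to map a pool tag to a driver is to search the driver binaries for the tag's four ASCII bytes. Below is a minimal sketch of that idea in Python. The function name and the default directory are illustrative assumptions, and this is not the command referred to above.

```python
from pathlib import Path


def drivers_containing_tag(tag, driver_dir):
    """Return names of .sys files in driver_dir whose bytes contain the tag."""
    needle = tag.encode("ascii")
    matches = []
    for path in sorted(Path(driver_dir).glob("*.sys")):
        try:
            data = path.read_bytes()
        except OSError:
            continue  # skip files that cannot be read (locked, permissions)
        if needle in data:
            matches.append(path.name)
    return matches


if __name__ == "__main__":
    # Assumed default driver location; adjust for the machine in question.
    driver_dir = r"C:\Windows\System32\drivers"
    for tag in ("CM31", "MmSt"):
        print(tag, drivers_containing_tag(tag, driver_dir))
```

An empty list only means that no .sys file in that folder contains the tag bytes. A tag used by a component outside that folder (for example, the kernel image itself) would not be found this way.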
Any help is appreciated, and this is also a valuable learning experience for me. Thank you.