Hi guys,
I'm currently working with an individual in one of the other forums on a particular case and figured I'd share. At the time of writing this I'm not complete with my analysis but I have scrounged up enough to deem it worth posting here. I'll update with any new discoveries. Understand that this article covers a number of Windbg extensions and commands that I've covered one way or another in previous articles so I decided not to repeat myself on describing their purpose again here. If you still have questions, ask away. As always, dump file is attached below.
To continue the introduction by providing some context, the client started by mentioning that after a month of purchasing a system he was getting BSODs. I'm sure all of you - as well as I - will immediately suspect hardware, which is the current speculation. The fellow also went ahead and tried Driver Verifier, but it wasn't showing up any hits, which further implicates hardware. Without is is also showing up some relatively scattered crashes with no strong patterns. Hardware tests however were not very successful either so far, so the culprit is becoming rather elusive. Rather than try to continue with the usual process of elimination, I decided to get adventurous and trek through one of the fella's crashdumps, which was a 0x19 bugcheck, or BAD_POOL_HEADER.
Now, usually these minidumps - like all minidumps - typically don't harbor the data we seek, as they are very fickle creatures. Yet it was our fortunate luck that we stumbled upon one that was willing to cooperate and provide us info on the pool actually causing the problem. So therefore our journey begins.
The following is a copy-paste of my original post to the client to describe the situation to fellow analysts:
_____
[/B]Strangely, and with considerable luck, the minidump for the 0x19 crash did indeed retain the pool information we need to look into this further. Start with
!analyze -v:
Code:
6: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
BAD_POOL_HEADER (19)
The pool is already corrupt at the time of the current request.
This may or may not be due to the caller.
The internal pool links must be walked to figure out a possible cause of
the problem, and then special pool applied to the suspect tags or the driver
verifier to a suspect driver.
Arguments:
Arg1: 0000000000000020, a pool block header size is corrupt.
Arg2: fffff8a00feb0390, The pool entry we were looking for within the page.
Arg3: fffff8a00feb03c0, The next pool entry.
Arg4: 0000000005030115, (reserved)
Debugging Details:
------------------
TRIAGER: Could not open triage file : C:\Program Files (x86)\Windows Kits\8.0\Debuggers\x64\triage\modclass.ini, error 2
BUGCHECK_STR: 0x19_20
POOL_ADDRESS: fffff8a00feb0390 Paged pool
CUSTOMER_CRASH_COUNT: 1
DEFAULT_BUCKET_ID: WIN7_DRIVER_FAULT
PROCESS_NAME: svchost.exe
CURRENT_IRQL: 0
LOCK_ADDRESS: fffff80002c8db80 -- (!locks fffff80002c8db80)
Resource @ nt!PiEngineLock (0xfffff80002c8db80) Available
WARNING: SystemResourcesList->Flink chain invalid. Resource may be corrupted, or already deleted.
WARNING: SystemResourcesList->Blink chain invalid. Resource may be corrupted, or already deleted.
1 total locks
PNP_TRIAGE:
Lock address : 0xfffff80002c8db80
Thread Count : 0
Thread address: 0x0000000000000000
Thread wait : 0x0
LAST_CONTROL_TRANSFER: from fffff80002bbccae to fffff80002a91fc0
STACK_TEXT:
fffff880`09521d28 fffff800`02bbccae : 00000000`00000019 00000000`00000020 fffff8a0`0feb0390 fffff8a0`0feb03c0 : nt!KeBugCheckEx
fffff880`09521d30 fffff800`02d67663 : fffff8a0`019364d8 fffff8a0`019364a8 fffff8a0`624e4d43 fffff880`09522440 : nt!ExDeferredFreePool+0x12da
fffff880`09521de0 fffff800`02d583a8 : fffff8a0`00024010 fffff8a0`00024010 fffff880`09521f01 fffff880`09522440 : nt!CmpGetSymbolicLink+0x2b3
fffff880`09521ee0 fffff800`02d883d8 : fffffa80`07883cc8 fffff800`02a13000 fffffa80`07883b10 fffff800`00000000 : nt!CmpParseKey+0x948
fffff880`095221e0 fffff800`02d895f6 : 00000000`00000000 fffffa80`07883b10 fffff8a0`0ff09060 fffffa80`067047a0 : nt!ObpLookupObjectName+0x588
fffff880`095222d0 fffff800`02d3f088 : fffff880`030fb180 00000000`00000000 00000000`00000000 fffff880`095224a8 : nt!ObOpenObjectByName+0x306
fffff880`095223a0 fffff800`02d3f926 : fffff880`095227b8 fffff880`000f003f fffff880`095227d0 fffff800`02cc4600 : nt!CmCreateKey+0x2e1
fffff880`09522510 fffff800`02a91253 : ffffffff`ffffffff 00000000`00000000 fffff6fc`50053698 fffff800`02ab4038 : nt!NtCreateKey+0x2e
fffff880`09522560 fffff800`02a8d810 : fffff800`02cfca66 00000000`00000001 20207050`02a4b5d8 00000000`000000b5 : nt!KiSystemServiceCopyEnd+0x13
fffff880`09522768 fffff800`02cfca66 : 00000000`00000001 20207050`02a4b5d8 00000000`000000b5 fffff880`095229e0 : nt!KiServiceLinkage
fffff880`09522770 fffff800`02cf96b2 : 00000000`00000001 00000000`00000002 00000000`00000002 00000000`000f003f : nt!IopCreateRegistryKeyEx+0x7e
fffff880`09522840 fffff800`02cfa081 : 00000000`00000002 00000000`00000000 fffff880`09522a40 00000000`00000001 : nt!IopGetDeviceInterfaces+0x146
fffff880`095229b0 fffff800`02cf9351 : 00000000`00000002 00000000`00000000 00000000`00000000 fffff8a0`09334bd0 : nt!PiGetInterfaceDeviceList+0x41
fffff880`09522a10 fffff800`02d521a0 : fffff8a0`09334bd0 fffff800`00008000 fffff880`09522b01 fffff800`02f59da0 : nt!PiControlGetInterfaceDeviceList+0x111
fffff880`09522a90 fffff800`02a91253 : fffffa80`096fc060 00000000`013ee880 fffff880`09522b60 00000000`013ee908 : nt!NtPlugPlayControl+0x100
fffff880`09522ae0 00000000`77c7236a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13
00000000`013ee848 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x77c7236a
STACK_COMMAND: kb
FOLLOWUP_IP:
nt!ExDeferredFreePool+12da
fffff800`02bbccae cc int 3
SYMBOL_STACK_INDEX: 1
SYMBOL_NAME: nt!ExDeferredFreePool+12da
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: nt
IMAGE_NAME: ntkrnlmp.exe
DEBUG_FLR_IMAGE_TIMESTAMP: 503f82be
FAILURE_BUCKET_ID: X64_0x19_20_nt!ExDeferredFreePool+12da
BUCKET_ID: X64_0x19_20_nt!ExDeferredFreePool+12da
Followup: MachineOwner
---------
It's stating the header for a pool allocation has got corrupted. Each allocation of pool memory (driver/kernel reserved memory) has a header that has the pool tag which describes the purpose of the allocation - given by the driver allocating it - and some other details like the size of it. The subcode tells us that the header size defined in a pool header is incorrect and has been corrupted. Often this is caused by a driver mishandling its pool allocation and overflowing its allocated pool, which ends up spilling into the nearby pool allocation with its contents. We'll want to look at the pool table for this to see what we got, using the pool address mentioned:
Code:
6: kd> !pool fffff8a00feb0390
Pool page fffff8a00feb0390 region is Paged pool
fffff8a00feb0000 size: 110 previous size: 0 (Allocated) IoNm
fffff8a00feb0110 size: 40 previous size: 110 (Allocated) MmSm
fffff8a00feb0150 size: 40 previous size: 40 (Allocated) MmSm
fffff8a00feb0190 size: 70 previous size: 40 (Free ) CMNb (Protected)
fffff8a00feb0200 size: 40 previous size: 70 (Allocated) MmSm
fffff8a00feb0240 size: 150 previous size: 40 (Allocated) NtFs
*fffff8a00feb0390 size: 30 previous size: 150 (Free ) *CMNb (Protected)
Pooltag CMNb : Configuration Manager Name Tag, Binary : nt!cm
fffff8a00feb03c0 doesn't look like a valid small pool allocation, checking to see
if the entire page is actually part of a large page allocation...
GetUlongFromAddress: unable to read from fffff80002c2fa38
fffff8a00feb03c0 is not a valid small pool allocation, checking large pool...
unable to get pool big page table - either wrong symbols or pool tagging is disabled
fffff8a00feb03c0 is freed (or corrupt) pool
Bad previous allocation size @fffff8a00feb03c0, last size was 3
***
*** An error (or corruption) in the pool was detected;
*** Attempting to diagnose the problem.
***
*** Use !poolval fffff8a00feb0000 for more details.
Pool page [ fffff8a00feb0000 ] is __inVALID.
Analyzing linked list...
[ fffff8a00feb0390 --> fffff8a00feb0400 (size = 0x70 bytes)]: Corrupt region
Scanning for single bit errors...
None found
The highlighted pool region is the allocation made that has the fault. It will tell you the size (which is corrupted) and the size of the allocation made before it. This is rather misleading so make sure you're aware that
previous size doesn't mean the size of the region we're looking at in its 'previous state', but it's only referring to the size of the allocation before it, which is tagged
NtFs and is for an NTFS allocation. For this one, the tag is
CMNb, which it explains is the Configuration Manager Name Tag (we wouldn't get this info if it wasn't for
pooltag.txt in Windbg directory). It also gives you the driver that allocated it, which is
nt!cm, as in the Conf Manager for the NT module. Consider these elements when looking at future pool regions, but for now the only element we're interested in is the size, which says it's 0x30.
Hmm, 0x30 seems rather odd given that the previous region tagged
CMNb in the list was valued as 70 and was also recently freed, probably along with this one. Curious. We'll compare the two regions by dumping their contents based on their specified sizes. I usually use
dc instead of
dps first on pool allocation since strings are often the best clues to gain from em. Here they are:
Code:
6: kd> dc fffff8a00feb0190 fffff8a00feb0190+70
fffff8a0`0feb0190 05070104 e24e4d43 09d68b30 fffffa80 ....CMN.0.......
fffff8a0`0feb01a0 00000000 00000000 0feb01b0 fffff8a0 ................
fffff8a0`0feb01b0 0052005c 00470045 00530049 00520054 \.R.E.G.I.S.T.R.
fffff8a0`0feb01c0 005c0059 0041004d 00480043 004e0049 Y.\.M.A.C.H.I.N.
fffff8a0`0feb01d0 005c0045 00590053 00540053 004d0045 E.\.S.Y.S.T.E.M.
fffff8a0`0feb01e0 0043005c 006e006f 00720074 006c006f \.C.o.n.t.r.o.l.
fffff8a0`0feb01f0 00650053 00300074 00310030 00000000 S.e.t.0.0.1.....
fffff8a0`0feb0200 03040107 ....
6: kd> dc fffff8a00feb0390 fffff8a00feb0390+30
fffff8a0`0feb0390 05030115 e24e4d43 00000000 00000000 ....CMN.........
fffff8a0`0feb03a0 004c004c 00000000 0feb03b0 fffff8a0 L.L.............
fffff8a0`0feb03b0 0052005c 00470045 00530049 00520054 \.R.E.G.I.S.T.R.
fffff8a0`0feb03c0 005c0059 Y.\.
Hmm, without a doubt the corrupt region looks to be cut off midway. Let's see what it looks like if we give it the 'normal' size of 0x70:
Code:
6: kd> dc fffff8a00feb0390 fffff8a00feb0390+70
fffff8a0`0feb0390 05030115 e24e4d43 00000000 00000000 ....CMN.........
fffff8a0`0feb03a0 004c004c 00000000 0feb03b0 fffff8a0 L.L.............
fffff8a0`0feb03b0 0052005c 00470045 00530049 00520054 \.R.E.G.I.S.T.R.
fffff8a0`0feb03c0 005c0059 0041004d 00480043 004e0049 Y.\.M.A.C.H.I.N.
fffff8a0`0feb03d0 005c0045 00590053 00540053 004d0045 E.\.S.Y.S.T.E.M.
fffff8a0`0feb03e0 0043005c 006e006f 00720074 006c006f \.C.o.n.t.r.o.l.
fffff8a0`0feb03f0 00650053 00300074 00310030 fffff880 S.e.t.0.0.1.....
fffff8a0`0feb0400 03040104 ....
Lookin pretty. Now let's look at each header by dumping the headers as data structures. I assumed the name was going to be _POOL_HEADER and sure enough, I was correct:
Code:
6: kd> dt !_POOL_HEADER
nt!_POOL_HEADER
+0x000 PreviousSize : Pos 0, 8 Bits
+0x000 PoolIndex : Pos 8, 8 Bits
+0x000 BlockSize : Pos 16, 8 Bits
+0x000 PoolType : Pos 24, 8 Bits
+0x000 Ulong1 : Uint4B
+0x004 PoolTag : Uint4B
+0x008 ProcessBilled : Ptr64 _EPROCESS
+0x008 AllocatorBackTraceIndex : Uint2B
+0x00a PoolTagHash : Uint2B
6: kd> dt !_POOL_HEADER fffff8a00feb0190
nt!_POOL_HEADER
+0x000 PreviousSize : 0y00000100 (0x4)
+0x000 PoolIndex : 0y00000001 (0x1)
+0x000 BlockSize : 0y00000111 (0x7)
+0x000 PoolType : 0y00000101 (0x5)
+0x000 Ulong1 : 0x5070104
+0x004 PoolTag : 0xe24e4d43
+0x008 ProcessBilled : 0xfffffa80`09d68b30 _EPROCESS
+0x008 AllocatorBackTraceIndex : 0x8b30
+0x00a PoolTagHash : 0x9d6
6: kd> dt !_POOL_HEADER fffff8a00feb0390
nt!_POOL_HEADER
+0x000 PreviousSize : 0y00010101 (0x15)
+0x000 PoolIndex : 0y00000001 (0x1)
+0x000 BlockSize : 0y00000011 (0x3)
+0x000 PoolType : 0y00000101 (0x5)
+0x000 Ulong1 : 0x5030115
+0x004 PoolTag : 0xe24e4d43
+0x008 ProcessBilled : (null)
+0x008 AllocatorBackTraceIndex : 0
+0x00a PoolTagHash : 0
Aside from the manipulated
BlockSize value which is the corrupted value, I'm initially suspicious about the lack of values in the other ones like
ProcessBilled, AllocatorBackTraceIndex & PoolTagHash. However research hasn't really come up with anything solid on the last two, and
ProcessBilled appears to be only necessary under the condition the driver requesting the pool allocation didn't provide a pool tag. So I'm not confident we're dealing with any issues with those items there, to my current knowledge. So at the moment, unless I'm completely unaware, we don't appear to see any other problems in the header besides the pool size getting messed up.
So far we're looking at just the 0x30 value being erroneous. Now, when you look at it in binary and compare it to 0x70, you'll see that there's only a solitary bit difference between the two:
Code:
6: kd> .formats 30
Evaluate expression:
Hex: 00000000`00000030
Decimal: 48
Octal: 0000000000000000000060
Binary: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00110000
Chars: .......0
Time: Wed Dec 31 19:00:48 1969
Float: low 6.72623e-044 high 0
Double: 2.37152e-322
6: kd> .formats 70
Evaluate expression:
Hex: 00000000`00000070
Decimal: 112
Octal: 0000000000000000000160
Binary: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 01110000
Chars: .......p
Time: Wed Dec 31 19:01:52 1969
Float: low 1.56945e-043 high 0
Double: 5.53354e-322
So my suspicion weighs rather heavy on there being a bit flip here. Everything else just looks too clean to have it be otherwise. Now just to make sure, we're going to dump a little further more than the end of the corrupt pool allocation to see if there's anything that matches up. I would suspect the pool allocation after this should actually have the valid
PreviousSize, which should be 0x70 instead of 0x30 (actually, 0x7 instead of 0x3, since block sizes are counted by 0x10's). Let's give a look-see:
Code:
6: kd> dc fffff8a00feb0390 fffff8a00feb0390+100
fffff8a0`0feb0390 05030115 e24e4d43 00000000 00000000 ....CMN.........
fffff8a0`0feb03a0 004c004c 00000000 0feb03b0 fffff8a0 L.L.............
fffff8a0`0feb03b0 0052005c 00470045 00530049 00520054 \.R.E.G.I.S.T.R.
fffff8a0`0feb03c0 005c0059 0041004d 00480043 004e0049 Y.\.M.A.C.H.I.N.
fffff8a0`0feb03d0 005c0045 00590053 00540053 004d0045 E.\.S.Y.S.T.E.M.
fffff8a0`0feb03e0 0043005c 006e006f 00720074 006c006f \.C.o.n.t.r.o.l.
fffff8a0`0feb03f0 00650053 00300074 00310030 fffff880 S.e.t.0.0.1.....
fffff8a0`0feb0400 03040104 6d536d4d 00000000 00000000 ....MmSm........ //Notice the 'MmSm' pooltag? This is where this allocation starts.
fffff8a0`0feb0410 068882d0 fffffa80 00000040 000c0000 ........@.......
fffff8a0`0feb0420 00000000 00000000 00040000 00000000 ................
fffff8a0`0feb0430 00000000 00000000 00000000 00000000 ................
fffff8a0`0feb0440 03140104 7346744e 00000000 00000000 ....NtFs........
fffff8a0`0feb0450 000009df 00010000 00000000 00000000 ................
fffff8a0`0feb0460 00000000 00000000 00000000 00000000 ................
fffff8a0`0feb0470 00000000 00000000 00000000 00000000 ................
fffff8a0`0feb0480 00000000 00000000 00000000 00000000 ................
fffff8a0`0feb0490 0050013b ;.P.
6: kd> dt !_POOL_HEADER fffff8a0`0feb0400
nt!_POOL_HEADER
+0x000 PreviousSize : 0y00000100 (0x4)
+0x000 PoolIndex : 0y00000001 (0x1)
+0x000 BlockSize : 0y00000100 (0x4)
+0x000 PoolType : 0y00000011 (0x3)
+0x000 Ulong1 : 0x3040104
+0x004 PoolTag : 0x6d536d4d
+0x008 ProcessBilled : (null)
+0x008 AllocatorBackTraceIndex : 0
+0x00a PoolTagHash : 0
What?! This is where the monkey wrench is thrown in. It doesn't say 0x3, and it doesn't say 0x7, but instead it says previous size is 0x4, or 0x40. This isn't right! I went ahead and checked the next allocation after this one and that one correctly gave
PreviousSize of 0x4, which is the size of the allocation we're looking at now. So it wasn't that the whole pool table was messed up all the way down. Something's odd here.
TBH, this is the item here that had me pretty stumped. Part of me thinks that a little bit of math was involved in calculating the previous size, since 0x3 and 0x4 add up to 0x7 which is the correct amount, it might have thought that there was another allocation sitting here. In fact, upon pondering on it further, there probably was!
See, what it looks like now, is that this wasn't because the pool allocation for the Config. Manager happened to have been originally 0x70 and all of a sudden was reduced to 0x30. Rather, the original value
was probably in fact 0x30! So the Configuration Manager erroneously allocated 30 bytes for this region when it was supposed to be 70, and then it indeed overflowed by shoving that long string of text into it, which spilled over and completely obliterated the pool allocation made after it.
This is where the plot thickens. Now the problem is, we need to know where on earth the Config. Manager decided to allocate 30 bytes instead of 70 like its previous allocations. When dumping the raw stack of this faulting thread I do see that before the attempt to free the pool (which, btw, is the only time when Windows - without Driver Verifier - checks for pool corruption) allocation that has the erroneous size, that this thread did previously allocate some pool (e.g.
nt!MiAllocatePagedPoolPages) so now I have to figure out if this is the same pool allocation we're dealing with, and if so when it gave the request to allocate 30 bytes instead of 70. This is where things get pretty heavy, and where I seriously doubt I'll be able to pull much from this minidump, but I'm willing to give it a shot.
That's all I have for this right now, but my suspicion still rests on hardware problems. The fact it's only a bit away from being correct is just too close to be a suspected driver bug, and TBH I'm not sure any third party driver was bugging here, since this looks like it was all on the Config. Manager (which, btw, is the subsystem in Windows that implements the registry, hence the registry keys in the pool allocations). It just looks too much like a hardware fluke right now, so I still recommend approaching it with that hypothesis in mind.
To be continued...