- May 7, 2013
Since this is a live thread, I thought I would post my findings here instead of in the library.
The thread can be found here - BSOD PAGE_FAULT_IN_NONPAGED_AREA and REGISTRY_ERROR
As always, let’s begin with the bugcheck description and its parameters.
Rich (BB code):
REGISTRY_ERROR (51)
Something has gone badly wrong with the registry. If a kernel debugger
is available, get a stack trace. It can also indicate that the registry got
an I/O error while trying to read one of its files, so it can be caused by
hardware problems or filesystem corruption.
It may occur due to a failure in a refresh operation, which is used only
in by the security system, and then only when resource limits are encountered.
Arguments:
Arg1: 0000000000000001, (reserved)
Arg2: fffff8a0000232d0, (reserved)
Arg3: 0000000005d96000, depends on where Windows bugchecked, may be pointer to hive
Arg4: 0000000000000374, depends on where Windows bugchecked, may be return code of
HvCheckHive if the hive is corrupt.
The documentation on this bugcheck is very sparse, and the parameter descriptions provide no conclusive information. However, the bugcheck description does hint at either file system corruption or hard disk failure. Let’s check where the system bugchecked by examining the call stack.
Since the code on the stack had been affected by compiler optimisations, I’ve used CMKD’s !stack extension to unwind it and recover the frame parameters.
Rich (BB code):
3: kd> !stack -p
Call Stack : 17 frames
## Stack-Pointer Return-Address Call-Site
00 fffff8800797b368 fffff80007faa688 nt!KeBugCheckEx+0
Parameter[0] = 0000000000000051
Parameter[1] = 0000000000000001
Parameter[2] = fffff8a0000232d0
Parameter[3] = 0000000005d96000
01 fffff8800797b370 fffff80007f14845 nt!HvpSetRangeProtection+93700 (perf)
Parameter[0] = fffff8a0000232d0
Parameter[1] = 0000000005d96000
Parameter[2] = 0000000064891000
Parameter[3] = 0000000000000004
02 fffff8800797b3d0 fffff80007f1461c nt!HvMarkDirty+175 (perf)
Parameter[0] = 000000000026f44d
Parameter[1] = (unknown)
Parameter[2] = (unknown)
Parameter[3] = 0000000000000000
03 fffff8800797b430 fffff80007fafa10 nt!HvMarkCellDirty+150
Parameter[0] = fffff8a0000232d0
Parameter[1] = (unknown)
Parameter[2] = 0000000000000000
Parameter[3] = (unknown)
04 fffff8800797b480 fffff80007efbd6c nt!CmpMarkValueDataDirty+9af10 (perf)
Parameter[0] = fffff8a0000232d0
Parameter[1] = fffff8a0097123f4
Parameter[2] = (unknown)
Parameter[3] = (unknown)
05 fffff8800797b4c0 fffff80007efb244 nt!CmpMarkKeyDirty+16c
Parameter[0] = fffff8a0000232d0
Parameter[1] = 000000000276e540
Parameter[2] = 0000000000000001
Parameter[3] = (unknown)
06 fffff8800797b500 fffff8000803fcc9 nt!CmpFreeKeyByCell+44
Parameter[0] = fffff8a0000232d0
Parameter[1] = 0000000000000001
Parameter[2] = 0000000000000001
Parameter[3] = (unknown)
07 fffff8800797b540 fffff80007ed60f9 nt!CmpSyncSubKeysAfterDelete+d9
Parameter[0] = fffff8a0000232d0
Parameter[1] = fffff8a00b5a7024
Parameter[2] = fffff8a0000232d0
Parameter[3] = fffff8a00b1bdcdc
08 fffff8800797b5c0 fffff80007ed84de nt!CmpCopySyncTree2+179
Parameter[0] = fffff8a01110d000
Parameter[1] = 0000000000000200
Parameter[2] = 0000000000000000
Parameter[3] = fffff8a0000232d0
09 fffff8800797b670 fffff80007ed83f7 nt!CmpCopySyncTree+6e
Parameter[0] = fffff8a0000232d0
Parameter[1] = 0000000000000160
Parameter[2] = fffff8a0000232d0
Parameter[3] = 00000000003fa3b8
0a fffff8800797b6c0 fffff80007ed7fc6 nt!CmpSaveBootControlSet+307
Parameter[0] = fffff8a002ce44a0
Parameter[1] = (unknown)
Parameter[2] = (unknown)
Parameter[3] = (unknown)
[...]
Notice how often the second (reserved) bugcheck parameter, fffff8a0000232d0, appears in our call stack? Interestingly, it appears that this reserved parameter may be the address of the registry hive containing the key which had been marked as modified. The key name can be found in the raw stack.
Rich (BB code):
3: kd> !dpx
Start memory scan : 0xfffff8800797b368 ($csp)
End memory scan : 0xfffff8800797c000 (Kernel Stack Base)
0xfffff8800797b3a8 : 0xfffff80007f147ce : nt!HvMarkDirty+0xff
0xfffff8800797b3c8 : 0xfffff80007f14845 : nt!HvMarkDirty+0x176
0xfffff8800797b428 : 0xfffff80007f1461c : nt!HvMarkCellDirty+0x150
[...]
0xfffff8800797b668 : 0xfffff80007ed84de : nt!CmpCopySyncTree+0x6e
0xfffff8800797b6b8 : 0xfffff80007ed83f7 : nt!CmpSaveBootControlSet+0x307
0xfffff8800797b760 : 0xfffff8800797b770 : !du "\Registry\Machine\System\ControlSet002"
0xfffff8800797b770 : 0x006700650052005c : !du "\Registry\Machine\System\ControlSet002"
0xfffff8800797b778 : 0x0072007400730069 : !du "istry\Machine\System\ControlSet002"
0xfffff8800797b780 : 0x0061004d005c0079 : !du "y\Machine\System\ControlSet002"
0xfffff8800797b788 : 0x006e006900680063 : !du "chine\System\ControlSet002"
0xfffff8800797b790 : 0x00790053005c0065 : !du "e\System\ControlSet002"
0xfffff8800797b798 : 0x006d006500740073 : !du "stem\ControlSet002"
0xfffff8800797b7a0 : 0x006e006f0043005c : !du "\ControlSet002"
0xfffff8800797b7a8 : 0x006c006f00720074 : !du "trolSet002"
0xfffff8800797b7b0 : 0x0030007400650053 : !du "Set002"
0xfffff8800797b7c8 : 0xfffff80007f4e3f3 : nt!CmpUnlockRegistry+0x2f
[...]
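!dpx can show the key name because each of those qwords is really eight bytes of a UTF-16LE string. As a minimal sketch (Python, chosen only for illustration), the values below are copied from the listing above and decoded by hand; the helper name decode_utf16_qwords is hypothetical. The qword holding the trailing "02" is not listed by !dpx, so the decoded text stops at "ControlSet0".

```python
import struct

# Qwords copied from the !dpx output above (0xfffff8800797b770 to 0xfffff8800797b7b0).
# The following qword, which holds the trailing "02", is not part of that listing.
STACK_QWORDS = [
    0x006700650052005C,
    0x0072007400730069,
    0x0061004D005C0079,
    0x006E006900680063,
    0x00790053005C0065,
    0x006D006500740073,
    0x006E006F0043005C,
    0x006C006F00720074,
    0x0030007400650053,
]


def decode_utf16_qwords(qwords):
    """Lay the qwords out in memory order (x64 is little-endian) and decode as UTF-16LE."""
    raw = b"".join(struct.pack("<Q", q) for q in qwords)
    return raw.decode("utf-16-le")


print(decode_utf16_qwords(STACK_QWORDS))  # \Registry\Machine\System\ControlSet0
```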
To confirm this, we can load the MEX debugger extension library and then pass the path to the !mreg command.
Rich (BB code):
3: kd> !mreg -p "\Registry\Machine\System\ControlSet002"
Found KCB = fffff8a011141b98 :: \REGISTRY\MACHINE\SYSTEM\CONTROLSET002
Hive fffff8a0000232d0
KeyNode fffff8a00b49b3bc
Notice how the hive address directly corresponds to the second parameter? We’ve now found the hive (the SYSTEM hive) containing the key which has become corrupt or couldn’t be read. This key, ControlSet002, is a backup of the system’s current control set and is what Windows will query when attempting to boot from its last known good configuration. If it becomes corrupt or cannot be read, Windows will produce a Stop 0x51 bugcheck, as shown above.
Let’s go back to the bugcheck description and examine the fourth parameter, whose meaning depends on where the crash occurred. To check this, we’ll need to disassemble backwards (ub) from the return address of the call to KeBugCheckEx.
Rich (BB code):
3: kd> knL
# Child-SP RetAddr Call Site
00 fffff880`0797b368 fffff800`07faa688 nt!KeBugCheckEx
01 fffff880`0797b370 fffff800`07f14845 nt! ?? ::NNGAKEGL::`string'+0x9d9a
02 fffff880`0797b3d0 fffff800`07f1461c nt!HvMarkDirty+0x176
03 fffff880`0797b430 fffff800`07fafa10 nt!HvMarkCellDirty+0x150
04 fffff880`0797b480 fffff800`07efbd6c nt! ?? ::NNGAKEGL::`string'+0x11714
05 fffff880`0797b4c0 fffff800`07efb244 nt!CmpMarkKeyDirty+0x16c
06 fffff880`0797b500 fffff800`0803fcc9 nt!CmpFreeKeyByCell+0x44
07 fffff880`0797b540 fffff800`07ed60f9 nt!CmpSyncSubKeysAfterDelete+0xd9
08 fffff880`0797b5c0 fffff800`07ed84de nt!CmpCopySyncTree2+0x179
09 fffff880`0797b670 fffff800`07ed83f7 nt!CmpCopySyncTree+0x6e
0a fffff880`0797b6c0 fffff800`07ed7fc6 nt!CmpSaveBootControlSet+0x307
[...]
Rich (BB code):
3: kd> ub fffff800`07faa688
nt! ?? ::NNGAKEGL::`string'+0x9d77:
fffff800`07faa665 32c0 xor al,al
fffff800`07faa667 e95ecaf6ff jmp nt!HvpSetRangeProtection+0x142 (fffff800`07f170ca)
fffff800`07faa66c 448bcb mov r9d,ebx
fffff800`07faa66f 48c744242074030000 mov qword ptr [rsp+20h],374h
fffff800`07faa678 4d8bc5 mov r8,r13
fffff800`07faa67b ba01000000 mov edx,1
fffff800`07faa680 8d4a50 lea ecx,[rdx+50h]
fffff800`07faa683 e83805cdff call nt!KeBugCheckEx (fffff800`07c7abc0)
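As we can see, this code belongs to HvpSetRangeProtection (note the jump back into nt!HvpSetRangeProtection+0x142). On x64, the first four arguments to KeBugCheckEx are passed in rcx, rdx, r8 and r9, and the fifth is stored on the stack at [rsp+20h]. Here, ecx is set to 0x51 (1 + 0x50), edx to 1, r8 to the hive address held in r13 and r9d to ebx, while the fourth bugcheck parameter, 0x374, is written to [rsp+20h] as an immediate value. In other words, 0x374 is not returned by another function at runtime; it is hard-coded into this error path of HvpSetRangeProtection. What does the value correspond to? This is the question I’ve been trying to answer. It does not seem to be documented by Microsoft, and from looking at other crashes with the same bugcheck, most, if not all, of them appear to occur in HvpSetRangeProtection.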
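To make the argument mapping concrete, here is a small Python sketch that replays the five instructions before the call and returns KeBugCheckEx’s arguments in calling-convention order. The register contents are not part of the dump output above, so rbx and r13 are seeded (as an assumption) with the values of Arg3 and Arg2; the point is only to show that 0x374 comes from an immediate operand rather than from a register.

```python
MASK32 = 0xFFFFFFFF  # writes to 32-bit registers zero-extend into the full 64-bit register


def replay_ub_listing(rbx, r13):
    """Replay the instructions before 'call nt!KeBugCheckEx' and return its five arguments."""
    regs = {"rbx": rbx, "r13": r13}
    stack = {}
    regs["r9"] = regs["rbx"] & MASK32            # mov r9d,ebx
    stack[0x20] = 0x374                          # mov qword ptr [rsp+20h],374h (an immediate)
    regs["r8"] = regs["r13"]                     # mov r8,r13
    regs["rdx"] = 1                              # mov edx,1
    regs["rcx"] = (regs["rdx"] + 0x50) & MASK32  # lea ecx,[rdx+50h]
    # x64 convention: arguments 1-4 in rcx, rdx, r8, r9; argument 5 at [rsp+20h]
    return regs["rcx"], regs["rdx"], regs["r8"], regs["r9"], stack[0x20]


NAMES = ("BugCheckCode", "Arg1", "Arg2", "Arg3", "Arg4")
# Assumption: rbx and r13 are seeded from Arg3 and Arg2, since no register dump is shown
args = replay_ub_listing(rbx=0x5D96000, r13=0xFFFFF8A0000232D0)
for name, value in zip(NAMES, args):
    print(f"{name:<12} = {value:016x}")
```

Whatever rbx and r13 held, the last argument is always 0x374, which is what marks it as a constant baked into this code path.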
What about the rest of the call stack? What does that tell us? To understand this, we’ll need to describe how Windows manages the on-disk representation of the registry and the version which is loaded into memory.
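The Configuration Manager maps hive files into memory as required by the system, using the Cache Manager’s file mapping feature. During system boot, a portion of the System hive is mapped into memory with read-only protection. This may explain why HvMarkDirty calls HvpSetRangeProtection: before a mapped range of the hive can be modified, its protection presumably has to be changed from read-only to writable.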
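HvpSetRangeProtection is undocumented, so the following is only an illustration of the general constraint its name hints at: memory protection is applied to whole pages, so a byte range inside a mapped hive has to be widened to page boundaries before its protection can be changed. The sketch assumes 4 KB pages and uses a hypothetical helper, page_aligned_range.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KB pages, as on x64 Windows


def page_aligned_range(offset, length, page_size=PAGE_SIZE):
    """Return the [start, end) range of whole pages covering [offset, offset + length).

    Protection changes work on whole pages, so a byte range must be widened first.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    start = (offset // page_size) * page_size
    end = -(-(offset + length) // page_size) * page_size  # ceiling division
    return start, end


# A 0x20-byte write that straddles a page boundary touches two pages
print([hex(x) for x in page_aligned_range(0x1FF0, 0x20)])  # ['0x1000', '0x3000']
```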
Now, to ensure that the on-disk version of the hive and the version currently loaded into memory stay in synchronisation with each other, the Configuration Manager makes use of a special file called a log file. Each log file has a file extension of logN, where N represents a whole number, since every hive can have multiple log files in case of failure when updating the primary log file.
When a hive is loaded, the Configuration Manager allocates a bit array called the dirty sector array, in which each bit represents a 512-byte portion (a sector) of the hive. When a sector of the in-memory hive is modified, its bit is set to 1 (an On bit), meaning that the sector must be written back to the hive file. Likewise, a bit that is 0 (an Off bit) means that the corresponding sector of the hive in memory and on disk are in synchronisation with each other.
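The dirty sector array is essentially a bitmap. A minimal Python model, assuming one bit per 512-byte sector of the hive (the class and method names are hypothetical, not Windows structures), might look like this:

```python
SECTOR_SIZE = 512  # each bit tracks one 512-byte sector of the hive


class DirtySectorArray:
    """Hypothetical model of a dirty sector array: one bit per 512-byte hive sector."""

    def __init__(self, hive_length):
        self.sector_count = -(-hive_length // SECTOR_SIZE)  # round up to whole sectors
        self.bits = bytearray(-(-self.sector_count // 8))   # every bit starts Off (0)

    def mark_dirty(self, offset, length):
        """Turn On the bit for every sector touched by [offset, offset + length)."""
        if length <= 0 or offset < 0 or offset + length > self.sector_count * SECTOR_SIZE:
            raise ValueError("range outside the hive")
        for sector in range(offset // SECTOR_SIZE, (offset + length - 1) // SECTOR_SIZE + 1):
            self.bits[sector // 8] |= 1 << (sector % 8)

    def is_dirty(self, sector):
        return bool(self.bits[sector // 8] & (1 << (sector % 8)))

    def dirty_sectors(self):
        return [s for s in range(self.sector_count) if self.is_dirty(s)]

    def clear(self):
        """After a successful write-back every bit goes back to Off."""
        self.bits[:] = bytes(len(self.bits))


dirty = DirtySectorArray(8192)  # 16 sectors -> 16 bits -> 2 bytes
dirty.mark_dirty(1000, 100)     # bytes 1000..1099 span sectors 1 and 2
print(dirty.dirty_sectors())    # [1, 2]
dirty.clear()
print(dirty.dirty_sectors())    # []
```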
When a key is created or modified, the appropriate bits within the dirty sector array are set to On, and the Configuration Manager schedules a lazy write operation to occur 5 seconds later. This write operation updates the on-disk version of the hive file. The dirty sectors are also recorded in the hive’s log file, so if an error occurs while the hive file itself is being updated, the changes are not lost: upon the next reboot, the log file will be read by the Configuration Manager and another write will be scheduled. Otherwise, if the hive is found to be corrupt, the boot loader will attempt to repair the hive file, which may include deleting subkeys.
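The ordering that makes this kind of recovery possible is the classic write-ahead pattern: changes reach the log before the primary file. The following Python sketch is a simplified, hypothetical model of that pattern, not Windows’ actual algorithm; lazy_flush, recover and the fail_after switch are made-up names, the last one used to simulate an I/O error part-way through the hive update.

```python
SECTOR_SIZE = 512


def _sector(buf, s):
    return bytes(buf[s * SECTOR_SIZE:(s + 1) * SECTOR_SIZE])


def lazy_flush(memory, disk, log, dirty, fail_after=None):
    """Write dirty sectors to the log first, then to the primary hive file.

    fail_after simulates an I/O error after that many hive sectors were written.
    """
    log.clear()
    for s in sorted(dirty):                  # 1. log every dirty sector
        log[s] = _sector(memory, s)
    for n, s in enumerate(sorted(dirty)):    # 2. update the primary hive file
        if fail_after is not None and n == fail_after:
            raise OSError("simulated error while writing the hive file")
        disk[s * SECTOR_SIZE:(s + 1) * SECTOR_SIZE] = log[s]
    dirty.clear()                            # 3. memory and disk now agree
    log.clear()


def recover(disk, log):
    """At the next boot, replay whatever is still in the log into the hive file."""
    for s, data in sorted(log.items()):
        disk[s * SECTOR_SIZE:(s + 1) * SECTOR_SIZE] = data
    log.clear()


memory, disk, log = bytearray(2048), bytearray(2048), {}
memory[0:4] = b"ABCD"        # a change in sector 0
memory[1024:1028] = b"WXYZ"  # a change in sector 2
try:
    lazy_flush(memory, disk, log, {0, 2}, fail_after=1)
except OSError:
    pass
print(disk == memory)  # False: sector 2 never reached the hive file
recover(disk, log)
print(disk == memory)  # True: the log supplied the missing sector
```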
Now that we have a better understanding of what the dirty functions in the call stack do, let’s go back and examine the key we found on the raw stack, ControlSet002, which holds the Last Known Good Configuration.
There are multiple copies of the system’s control set, with the CurrentControlSet key merely being a symbolic link to the control set which is considered the primary one. Which control set that is is determined by the values of the Select key, and these values are queried on boot by the Service Control Manager.
Rich (BB code):
3: kd> !mreg -p "\REGISTRY\MACHINE\SYSTEM\SELECT"
Found KCB = fffff8a011141a70 :: \REGISTRY\MACHINE\SYSTEM\SELECT
Hive fffff8a0000232d0
KeyNode fffff8a00b8068ac
[ValueType] [ValueName] [ValueData]
REG_DWORD Current 1
REG_DWORD Default 1
REG_DWORD Failed 0
REG_DWORD LastKnownGood 2
The Current value indicates which control set is the current control set, as described above, whereas the LastKnownGood value indicates which control set holds the last known good configuration. The Failed value shows which control set was used for a boot attempt that subsequently failed, leading the system to revert to the last known good control set instead. Here, Current is 1 and LastKnownGood is 2, which confirms that ControlSet002 is the last known good control set.
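Mapping the Select values to key names is simple formatting of the DWORD. A small Python sketch, using the values from the !mreg output above and assuming that a value of 0 (as with Failed here) means no control set is recorded:

```python
def control_set_name(number):
    """Turn a Select value such as 2 into the key name ControlSet002."""
    if number < 1:
        # Assumption for this sketch: 0 means no control set is recorded
        raise ValueError("no control set is recorded for this value")
    return f"ControlSet{number:03d}"


# Values copied from the !mreg output for \REGISTRY\MACHINE\SYSTEM\SELECT
select = {"Current": 1, "Default": 1, "Failed": 0, "LastKnownGood": 2}
for value_name, number in select.items():
    print(value_name, "->", control_set_name(number) if number else "(none)")
```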
Windows calls the NtInitializeRegistry function to maintain the last known good configuration. Judging by the CmpSaveBootControlSet frame on our stack, this works with the control set named by the LastKnownGood value and synchronises its keys with the current control set’s keys. If the LastKnownGood value doesn’t exist, then the system will create a new control set for it.
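To picture what a sync such as CmpCopySyncTree has to do, here is a hypothetical Python model in which nested dicts stand in for keys and any other value stands in for a registry value. It is not the kernel’s algorithm; it simply makes the destination mirror the source and records each deletion, creation and update. In the real registry, every one of those changes would touch cells that need to be marked dirty.

```python
def sync_tree(source, dest, path="", changes=None):
    """Make dest mirror source and return the list of changes applied to dest.

    Nested dicts model registry keys; any other value models a registry value.
    """
    if changes is None:
        changes = []
    # First delete subkeys/values that no longer exist in the source
    for name in list(dest):
        if name not in source:
            del dest[name]
            changes.append(("delete", path + "\\" + name))
    # Then create or update everything the source has
    for name, value in source.items():
        full_path = path + "\\" + name
        if isinstance(value, dict):
            if not isinstance(dest.get(name), dict):
                dest[name] = {}
                changes.append(("create", full_path))
            sync_tree(value, dest[name], full_path, changes)
        elif name not in dest or dest[name] != value:
            dest[name] = value
            changes.append(("set", full_path))
    return changes


current = {"Services": {"A": {"Start": 2}, "B": {"Start": 3}}}
last_known_good = {"Services": {"A": {"Start": 3}, "Old": {"Start": 4}}}
changes = sync_tree(current, last_known_good)
for change in changes:
    print(change)
print(last_known_good == current)  # True
```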
Let’s put everything together and try to understand why the system crashed. Please note that most of these functions are undocumented by Microsoft, so I’ve had to draw some assumptions from the information shared above.
Rich (BB code):
3: kd> knL
# Child-SP RetAddr Call Site
00 fffff880`0797b368 fffff800`07faa688 nt!KeBugCheckEx
01 fffff880`0797b370 fffff800`07f14845 nt! ?? ::NNGAKEGL::`string'+0x9d9a << We crash here!
02 fffff880`0797b3d0 fffff800`07f1461c nt!HvMarkDirty+0x176
03 fffff880`0797b430 fffff800`07fafa10 nt!HvMarkCellDirty+0x150
04 fffff880`0797b480 fffff800`07efbd6c nt! ?? ::NNGAKEGL::`string'+0x11714
05 fffff880`0797b4c0 fffff800`07efb244 nt!CmpMarkKeyDirty+0x16c
06 fffff880`0797b500 fffff800`0803fcc9 nt!CmpFreeKeyByCell+0x44
07 fffff880`0797b540 fffff800`07ed60f9 nt!CmpSyncSubKeysAfterDelete+0xd9
08 fffff880`0797b5c0 fffff800`07ed84de nt!CmpCopySyncTree2+0x179
09 fffff880`0797b670 fffff800`07ed83f7 nt!CmpCopySyncTree+0x6e
0a fffff880`0797b6c0 fffff800`07ed7fc6 nt!CmpSaveBootControlSet+0x307
0b fffff880`0797b8a0 fffff800`07c79e53 nt!NtInitializeRegistry+0xc6
[...]
The system calls NtInitializeRegistry, which in turn calls CmpSaveBootControlSet to synchronise the differences between the current control set and the last known good control set using CmpCopySyncTree. This inevitably leads to some keys being deleted or modified (CmpSyncSubKeysAfterDelete, CmpFreeKeyByCell), which causes them to be marked as dirty via CmpMarkKeyDirty. Shortly afterwards, we crash in HvpSetRangeProtection for an unknown reason.
Following on from this, I would suggest first checking the state of any HDDs/SSDs and then checking the health of the file system itself. Otherwise, I would suggest attempting a repair install, although I’ve seen some suggestions that deleting the last known good configuration control set is also a viable solution.