Bit flips

Jared

Sysnative Staff, BSOD Kernel Dump Expert
I've stumbled across a dump file which I find interesting.

Code:
BugCheck 1A, {41792, fffff680003b7110, 8000000000, 0}

Probably caused by : memory_corruption ( ONE_BIT )

As you can tell, a one-bit error has occurred, but the question is: where?

A page table entry has become corrupt, so I took a look, and this is what I found.

Code:
6: kd> dt nt!_MMPFN fffff680003b7110
   +0x000 u1               : <unnamed-tag>
   +0x008 u2               : <unnamed-tag>
   +0x010 PteAddress       : (null) 
   +0x010 VolatilePteAddress : (null) 
   +0x010 Lock             : 0n0
   +0x010 PteLong          : 0
   +0x018 u3               : <unnamed-tag>
   +0x01c NodeBlinkLow     : 0
   +0x01e Unused           : 0y0000
   +0x01e VaType           : 0y0000
   +0x01f ViewCount        : 0 ''
   +0x01f NodeFlinkLow     : 0 ''
   +0x020 OriginalPte      : _MMPTE
   +0x028 u4               : <unnamed-tag>

Everything is null, which is strange. Now the bit (hehe, get it?) I don't understand is the 3rd and 4th parameters.

Code:
MEMORY_MANAGEMENT (1a)
    # Any other values for parameter 1 must be individually examined.
Arguments:
Arg1: 0000000000041792, A corrupt PTE has been detected. Parameter 2 contains the address of
	the PTE. Parameters 3/4 contain the low/high parts of the PTE.
Arg2: fffff680003b7110
Arg3: 0000008000000000
Arg4: 0000000000000000

It mentions the low/high parts of the PTE? What exactly does that mean? I can't find much on it.
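One possible reading of "low/high parts" is that the PTE value is split into its lower and upper 32-bit halves. That is an assumption rather than something the bugcheck text spells out, and in this 64-bit dump Arg3 appears to carry the whole value while Arg4 is zero. A minimal sketch of that split, with hypothetical helper names:

```c
#include <stdint.h>

/* Hypothetical helpers: they only illustrate one reading of
   "low/high parts", not how Windows itself fills the parameters. */

/* Lower 32 bits of a 64-bit PTE value. */
uint32_t low_part(uint64_t pte)
{
    return (uint32_t)(pte & 0xFFFFFFFFu);
}

/* Upper 32 bits of a 64-bit PTE value. */
uint32_t high_part(uint64_t pte)
{
    return (uint32_t)(pte >> 32);
}

/* Rebuild the full 64-bit value from the two halves. */
uint64_t join_parts(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}
```

Under that reading, the value 0000008000000000 would have a low part of 0 and a high part of 0x80.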

Code:
6: kd> .formats 0000008000000000
Evaluate expression:
  Hex:     00000080`00000000
  Decimal: 549755813888
  Octal:   0000000010000000000000
  Binary:  00000000 00000000 00000000 10000000 00000000 00000000 00000000 00000000
  Chars:   ........
  Time:    Mon Jan  1 16:16:15.581 1601 (UTC + 1:00)
  Float:   low 0 high 1.79366e-043
  Double:  2.71615e-312

Isn't that a bit flip, which I presume is in the low part of the PTE? I take it this invalidates the PTE?
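To pin down which bit is set in a value like Arg3, a small check can confirm that exactly one bit is set and report its index. This is only a sketch, and the function name is made up for illustration:

```c
#include <stdint.h>

/* Returns the index (0 = least significant) of the only set bit in v,
   or -1 if v has no bits set or more than one bit set. */
int single_set_bit(uint64_t v)
{
    /* v & (v - 1) clears the lowest set bit; a nonzero result
       means at least two bits were set. */
    if (v == 0 || (v & (v - 1)) != 0)
        return -1;

    int index = 0;
    while ((v >> index) != 1)
        index++;
    return index;
}
```

For 0x0000008000000000 this returns 39, which matches the decimal value 549755813888 (2^39) shown by .formats. Assuming the usual x86-64 PTE layout, where bits 12 and up hold the page frame number, bit 39 would fall inside that field.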

Can anyone enlighten me on this?
 
Oh boy, PTE management..

Off the top of my head, with 41792 as the first argument, I can't remember whether parameter 3 is the low part and 4 the high part, or vice versa. Anyway, a PTE carries all sorts of bits that are used by the hardware, and memory is split into low and high regions. The kernel can access low memory directly, while high memory has to be mapped into the kernel's address space before it can be used. That is expensive given how frequently the page tables are walked, and there are very few mapping slots, so bottlenecking is pretty easy. There are a thousand other things in addition to this as well.

AFAIK, when you have a one-bit corruption, it generally means that a value passed somewhere along this process was wrong by exactly one bit. For example, the ASCII character 8 becoming a 9:

Code:
00111000

to

Code:
00111001

In most cases it's bad RAM, because the corruption had to occur long before the PTE was checked for the bugcheck to be raised at all.
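The 8-to-9 example above can be checked mechanically: XOR the two values and test whether the result has exactly one bit set. A minimal sketch, with a hypothetical function name:

```c
#include <stdbool.h>
#include <stdint.h>

/* True when a and b differ in exactly one bit position. */
bool differs_by_one_bit(uint64_t a, uint64_t b)
{
    uint64_t diff = a ^ b;   /* 1 in every position where a and b differ */

    /* Exactly one bit set: nonzero, and clearing the lowest set bit
       leaves nothing behind. */
    return diff != 0 && (diff & (diff - 1)) == 0;
}
```

With 00111000 (0x38) and 00111001 (0x39) the XOR is 00000001, so the function returns true.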
 
Thanks Patrick, I need to look at Windows Internals about this.

I've seen it mentioned before but what do you mean by "walks" exactly?
 
Walks (in this case) is just fancy engineer talk for navigating the page tables, level by level, to find a valid entry for an address.

Code:
        pgd_t *pgd;
        pmd_t *pmd;
        pte_t *ptep, pte;

        /* Level 1: find the page global directory entry for this address. */
        pgd = pgd_offset(mm, address);
        if (pgd_none(*pgd) || pgd_bad(*pgd))
                goto out;

        /* Level 2: find the page middle directory entry. */
        pmd = pmd_offset(pgd, address);
        if (pmd_none(*pmd) || pmd_bad(*pmd))
                goto out;

        /* Level 3: find the page table entry itself. */
        ptep = pte_offset(pmd, address);
        if (!ptep)
                goto out;

        pte = *ptep;

That's an excerpt from a Linux kernel walk function that uses three offset macros to navigate the page tables. The _none() and _bad() checks ensure it only follows valid entries within the page table.
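Each offset macro in the excerpt picks an index out of the virtual address for its level. As a rough illustration of that idea, here is a sketch that assumes x86-64 4-level paging with 4 KiB pages (one level more than the three in the excerpt); the struct and function names are hypothetical:

```c
#include <stdint.h>

/* Indices consumed at each level of an x86-64 4-level page walk,
   assuming 4 KiB pages. Each table index is 9 bits wide. */
struct walk_indices {
    unsigned pml4;    /* bits 47..39 */
    unsigned pdpt;    /* bits 38..30 */
    unsigned pd;      /* bits 29..21 */
    unsigned pt;      /* bits 20..12 */
    unsigned offset;  /* bits 11..0, byte offset within the page */
};

/* Split a virtual address into the per-level table indices. */
struct walk_indices split_virtual_address(uint64_t va)
{
    struct walk_indices w;

    w.pml4   = (unsigned)((va >> 39) & 0x1FF);
    w.pdpt   = (unsigned)((va >> 30) & 0x1FF);
    w.pd     = (unsigned)((va >> 21) & 0x1FF);
    w.pt     = (unsigned)((va >> 12) & 0x1FF);
    w.offset = (unsigned)(va & 0xFFF);
    return w;
}
```

A walk then uses each index in turn to pick an entry from the table found at the previous level, stopping early if an entry is not valid.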
 
That's very helpful, thank you very much Patrick.
I've tried looking that up before but I could never find an answer.
 
Page walks are expensive too, which is why we have caches (such as the TLB) and tagging of memory addresses.
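As a toy illustration of caching translations with tags, the sketch below models a tiny direct-mapped translation cache in software. Real TLBs are hardware structures and are organised differently; the size, names, and layout here are all assumptions made for the example:

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 16u   /* hypothetical size */
#define PAGE_SHIFT  12u   /* 4 KiB pages */

struct tlb_entry {
    bool     valid;
    uint64_t tag;     /* virtual page number that owns this slot */
    uint64_t frame;   /* physical frame number it maps to */
};

struct tlb {
    struct tlb_entry slot[TLB_ENTRIES];
};

/* Remember a translation produced by a (slow) page walk. */
void tlb_insert(struct tlb *t, uint64_t va, uint64_t frame)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    struct tlb_entry *e = &t->slot[vpn % TLB_ENTRIES];

    e->valid = true;
    e->tag = vpn;
    e->frame = frame;
}

/* Returns true on a hit and stores the physical address in *pa.
   On a miss the caller would have to walk the page tables. */
bool tlb_lookup(const struct tlb *t, uint64_t va, uint64_t *pa)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    const struct tlb_entry *e = &t->slot[vpn % TLB_ENTRIES];

    /* The tag check rejects a different page that shares this slot. */
    if (!e->valid || e->tag != vpn)
        return false;

    *pa = (e->frame << PAGE_SHIFT) | (va & ((1ull << PAGE_SHIFT) - 1));
    return true;
}
```

The tag is what lets several pages share one slot safely: a lookup only hits when the stored virtual page number matches the one being translated.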
 
