[HN Gopher] The Multi-Generational LRU
       ___________________________________________________________________
        
       The Multi-Generational LRU
        
       Author : chmaynard
       Score  : 66 points
       Date   : 2021-04-02 14:48 UTC (8 hours ago)
        
 (HTM) web link (lwn.net)
 (TXT) w3m dump (lwn.net)
        
       | curt15 wrote:
       | Would ZFS's Adaptive Replacement Cache be considered for
       | inclusion after IBM's patent expires?
        
       | pengaru wrote:
        | It always seemed odd to me that page reclaim was completely
        | ignorant of the CPU/IO priority of the accessing processes.
       | 
        | If I run a backup job at a low priority (especially before
        | cgroups existed, when I couldn't scope the task's memory use),
        | there's nothing preventing it from displacing everything else
        | from the page cache as it churns through an entire filesystem,
        | even if everything else in the page cache was put there by equal
        | or higher priority processes.
       | 
        | It seems like priority should at least inform reclaim to _some_
        | degree, rather than reclaim looking only at age and activity
        | levels... e.g. the page struct could record the highest priority
        | of the processes accessing its contents, and lower-priority
        | processes wouldn't be allowed to evict that page for reclaim
        | unless memory pressure were high enough.
        
         | corbet wrote:
         | Tracking priority in that way would almost certainly be an
         | expensive thing to do. Nowadays, control groups can contain
         | that backup job pretty nicely.
        
           | skyde wrote:
            | Could you explain how a cgroup would prevent page faults
            | from causing more disk IO than is allocated to the control
            | group?
            | 
            | On most Linux distros I have seen, running a process that
            | allocates too much memory just freezes the whole user
            | interface:
            | 
            | 1. the mouse cursor moves super slowly
            | 2. it becomes almost impossible to drag and close windows ...
            | 
            | If what you are saying is true, simply running the critical
            | processes (mouse driver, X-Windows, Gnome) in a high-priority
            | cgroup would solve this problem.
        
             | corbet wrote:
             | The I/O bandwidth controller would appear to be the droid
             | you are looking for.
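              | 
              | For instance, something along these lines (a rough
              | sketch; assumes a cgroup v2 mount at /sys/fs/cgroup
              | with the io and memory controllers enabled in the
              | parent's subtree_control, root privileges, and
              | made-up device numbers and limits):
              | 
              |     #include <stdio.h>
              |     #include <stdlib.h>
              |     #include <sys/stat.h>
              |     #include <unistd.h>
              | 
              |     static void put(const char *path, const char *v)
              |     {
              |         FILE *f = fopen(path, "w");
              |         if (!f) { perror(path); exit(1); }
              |         fputs(v, f);
              |         fclose(f);
              |     }
              | 
              |     int main(void)
              |     {
              |         char pid[32];
              | 
              |         mkdir("/sys/fs/cgroup/backup", 0755);
              |         /* cap sda (8:0) to ~20 MB/s each way */
              |         put("/sys/fs/cgroup/backup/io.max",
              |             "8:0 rbps=20971520 wbps=20971520");
              |         /* throttle allocations past 256 MB */
              |         put("/sys/fs/cgroup/backup/memory.high",
              |             "268435456");
              |         /* move ourselves in, then run the backup */
              |         snprintf(pid, sizeof(pid), "%d", getpid());
              |         put("/sys/fs/cgroup/backup/cgroup.procs",
              |             pid);
              |         execlp("rsync", "rsync", "-a",
              |                "/home/", "/backup/", (char *)NULL);
              |         perror("execlp");
              |         return 1;
              |     }
              | 
              | memory.high throttles the group rather than OOM-
              | killing it, which suits a batch job like this.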
        
         | KMag wrote:
         | > like the page struct should record the highest priority
         | process accessing its contents
         | 
          | I'm not aware of any architecture where the MMU knows anything
          | about task scheduling priority when marking pages as dirty or
          | accessed, so there's a hardware limitation there. A close
          | approximation might be to look at the task's scheduling
          | priority (at the kernel level, threads and processes are both
          | tasks) when faulting in a page, set some score on the page
          | from it, and decrement the score when the eviction algorithm
          | finds the page untouched. It would then evict the first page
          | found with a score of zero or, after examining N pages, the
          | lowest-scored page seen. (One could also use weighted
          | reservoir sampling to achieve something closer to fairness,
          | so high-priority tasks don't always starve low-priority
          | tasks.)
         | 
         | In the case of multiple generations of LRUs, presumably one
         | could use the I/O priority of the faulting task to determine
         | the initial generation of a newly faulted-in page. This would
         | be a pretty minimal change (and might already be implemented
         | but that detail left out of the LWN article) and wouldn't
         | require any extra state in the page struct.
         | 
          | Though, maybe you get close enough to the same effect simply by
          | taking scheduling priority into account when paging in pages.
         | 
         | Some university OS courses must have some rough kernel
         | simulators where these sorts of policy changes can be
         | simulated, right? Or is it easier to actually just modify a
         | Linux or *BSD kernel and run some benchmarks?
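          | 
          | Concretely, the scoring idea above might look something like
          | this (a toy userspace sketch with invented names and
          | constants, not kernel code):
          | 
          |     #include <stdbool.h>
          |     #include <stddef.h>
          | 
          |     #define NR_PAGES   1024
          |     #define SCAN_LIMIT 64  /* N: probes per pass */
          | 
          |     struct page {
          |         int  score;    /* from task prio at fault-in */
          |         bool accessed; /* stand-in for hardware bit */
          |     };
          | 
          |     static struct page pages[NR_PAGES];
          |     static size_t hand;  /* clock-style position */
          | 
          |     /* called when a task faults the page in */
          |     void page_faulted_in(struct page *pg, int prio)
          |     {
          |         pg->score = prio;  /* higher prio, higher score */
          |         pg->accessed = true;
          |     }
          | 
          |     /* first zero-score page, else lowest of N probes */
          |     struct page *select_victim(void)
          |     {
          |         struct page *lowest = NULL;
          | 
          |         for (int i = 0; i < SCAN_LIMIT; i++) {
          |             struct page *pg = &pages[hand];
          |             hand = (hand + 1) % NR_PAGES;
          | 
          |             if (pg->accessed) {       /* recently used, */
          |                 pg->accessed = false; /* spare it       */
          |                 continue;
          |             }
          |             if (pg->score == 0)
          |                 return pg;   /* immediate victim */
          |             pg->score--;     /* untouched: age it */
          |             if (!lowest || pg->score < lowest->score)
          |                 lowest = pg;
          |         }
          |         return lowest;
          |     }
          | 
          | The weighted-reservoir-sampling variant would replace the
          | "lowest seen" pick with a weighted random choice, so high-
          | priority pages usually, but not always, win.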
        
           | toast0 wrote:
           | I'm x86 centric, but the MMU doesn't make decisions about
           | which pages to evict; it just handles mapping virtual
           | addresses to physical addresses based on the page table, and
           | marking pages in the page table as dirty/accessed.
           | 
           | The OS would need to manage priority information to evict
           | lower priority pages. That may be part of what's done with
           | posix_fadvise information, but I haven't dug deeply into
           | that.
           | 
            | With this multi-generational LRU, you could probably add a
           | memory priority to a process, and use that to decide which
           | generation to put pages in when a process needs them. If a
           | low (memory) priority process's pages always go in the oldest
           | generation, they would get considered for reclaim much more
           | often than a high priority process whose pages get put in the
           | youngest generation.
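            | 
            | Something like this, say (a sketch of the idea, not the
            | actual patch; the names and priority scale are made up):
            | 
            |     #define NR_GENS 4  /* 0 = youngest, 3 = oldest */
            | 
            |     struct page_entry {
            |         struct page_entry *next;
            |         /* ... page state ... */
            |     };
            | 
            |     struct lruvec {
            |         /* one list per generation */
            |         struct page_entry *gen[NR_GENS];
            |     };
            | 
            |     /* low memory priority -> older generation */
            |     static int initial_gen(int prio, int max_prio)
            |     {
            |         int g = (max_prio - prio) * NR_GENS
            |                 / (max_prio + 1);
            |         return g < NR_GENS ? g : NR_GENS - 1;
            |     }
            | 
            |     /* place a newly faulted-in page */
            |     void add_page(struct lruvec *lru,
            |                   struct page_entry *pg,
            |                   int prio, int max_prio)
            |     {
            |         int g = initial_gen(prio, max_prio);
            | 
            |         pg->next = lru->gen[g];
            |         lru->gen[g] = pg;
            |     }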
        
             | KMag wrote:
             | I know the MMU doesn't make eviction decisions, but it does
             | mark the "dirty" and "accessed" bits for pages.
             | 
             | The GP was suggesting keeping track of the scheduling
              | priorities of tasks that were "accessing its [the page's]
             | contents", not just keeping track of the highest priority
             | task mapping a given page. This would require the MMU, as
             | it's setting the dirty and accessed bits, to also keep
             | track of the maximum priority that has set either the dirty
             | or accessed bit.
             | 
              | Edit: from the GP's reply, it sounds like by "accessing"
              | they really just meant mapping, not accessing in the sense
              | of the accessed and dirty bits.
        
               | johncolanduoni wrote:
               | The dirty and accessed bits are per page table structure,
               | so they are per-process. So you can at least take process
               | priority into account, if not thread. In theory you could
               | take thread priority into account by observing and
               | resetting the dirty/accessed bits every context switch,
               | but this would be unlikely to be worth it.
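                | 
                | Userspace can already approximate per-task
                | accessed-bit sampling via procfs (a rough
                | sketch; needs permission to write the
                | target's /proc entries):
                | 
                |     #include <stdio.h>
                |     #include <unistd.h>
                | 
                |     /* kB the task touched in the window */
                |     long referenced_kb(int pid, unsigned s)
                |     {
                |         char path[64], line[256];
                |         long kb, total = 0;
                |         FILE *f;
                | 
                |         /* clear referenced/accessed bits */
                |         snprintf(path, sizeof(path),
                |                  "/proc/%d/clear_refs", pid);
                |         f = fopen(path, "w");
                |         if (!f) return -1;
                |         fputs("1\n", f);
                |         fclose(f);
                | 
                |         sleep(s);  /* let the task run */
                | 
                |         /* sum "Referenced:" over mappings */
                |         snprintf(path, sizeof(path),
                |                  "/proc/%d/smaps", pid);
                |         f = fopen(path, "r");
                |         if (!f) return -1;
                |         while (fgets(line, sizeof(line), f))
                |             if (sscanf(line,
                |                        "Referenced: %ld kB",
                |                        &kb) == 1)
                |                 total += kb;
                |         fclose(f);
                |         return total;
                |     }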
        
               | toast0 wrote:
               | > In theory you could take thread priority into account
               | by observing and resetting the dirty/accessed bits every
               | context switch, but this would be unlikely to be worth
               | it.
               | 
                | That would most likely be terrible (as you suggest). You
                | might have better luck giving each thread a copy of the
                | process page table and keeping that in sync somehow
                | (which also sounds terrible, but might be more workable).
        
           | pengaru wrote:
           | I assumed it'd be maintained when the processes mapped pages
           | into their address space, not at page fault time. Then if the
           | process' priority increased, I guess it would have to walk
           | the mapped pages to bump their priorities too for
           | completeness.
           | 
           | Not that I spent much time thinking about the implementation
           | details or seriously exploring how to go about making it
           | happen. Having to touch all mapped pages when a process
           | changes its priority sounds painful, so it probably requires
           | more sophistication to make things optimal.
        
             | KMag wrote:
              | Ahh, by "accessing its [the page's] contents", I thought you
             | meant keeping track of the priorities of the tasks actually
             | making the read and write operations that cause the MMU to
             | set the dirty and accessed bits.
             | 
              | It sounds like by "accessed" you weren't referring to the
              | page access bit set by the MMU, but rather just to having
              | the page mapped in the address space.
        
         | rini17 wrote:
          | That would require maintaining a priority for every page in
          | the cache, or even per process: if a process using the cached
          | page exits, the priority should be adjusted. Too complicated.
          | A simpler option is for the backup process to say "do not
          | cache this". That option is already implemented in the kernel
          | and available to userspace.
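          | 
          | E.g. posix_fadvise(2). A sketch of a backup-style reader
          | that drops what it has consumed from the page cache as it
          | goes (error handling trimmed):
          | 
          |     #define _POSIX_C_SOURCE 200112L
          |     #include <fcntl.h>
          |     #include <unistd.h>
          | 
          |     void stream_file(int fd)
          |     {
          |         char buf[64 << 10];
          |         off_t done = 0;
          |         ssize_t n;
          | 
          |         /* declare the access pattern up front */
          |         posix_fadvise(fd, 0, 0,
          |                       POSIX_FADV_SEQUENTIAL);
          | 
          |         while ((n = read(fd, buf, sizeof(buf))) > 0) {
          |             done += n;
          |             /* evict what we just read */
          |             posix_fadvise(fd, 0, done,
          |                           POSIX_FADV_DONTNEED);
          |         }
          |     }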
        
           | pengaru wrote:
           | Hmm yes, unwinding the priorities when processes exit is
           | complicated. But I wonder if simply leaving the priority at
           | its highest level reached would be harmless; it really
           | depends on how much influence the heuristic has.
           | 
           | It doesn't strike me as particularly wrong for a page to
           | still be classified as elevated priority if some higher
           | priority process had classified it as such even if that
           | process no longer exists. At least not for file-backed pages,
           | which are really what I'm interested in, in this context.
           | Chances are good the file gets accessed again in a similar
           | fashion, the priority stickiness could help keep it around.
           | 
           | The "do not cache this" approach is quite heavy-handed and
           | can be problematic if you want _some_ caching for things like
           | producing deltas, or in the case of a restore from backup
           | reassembling from myriad delta files, without blowing out the
           | page cache.
           | 
            | As corbet mentioned, the cgroups approach does provide a way
            | to constrain page-cache effects.
        
       | marcodiego wrote:
       | Related: https://news.ycombinator.com/item?id=26451220
        
       | greggyb wrote:
       | Bryan Cantrill does a pretty good survey of caching strategies in
       | the intro to this Papers We Love talk on the ARC in ZFS.
       | 
       | https://www.youtube.com/watch?v=F8sZRBdmqc0
        
       | devit wrote:
       | Why are multiple lists needed instead of a single list where
       | pages that are found to be accessed are put in the front and
       | reclaim acts on the back?
        
         | teraflop wrote:
         | That's explained in the second paragraph of the article:
         | 
         | > Chances are that pages that have been accessed in the recent
         | past will be useful again in the future, but there are
         | exceptions. Consider, for example, an application that is
         | reading sequentially through a file. Each page of the file will
         | be put into the page cache as it is read, but the application
         | will never need it again; in this case, recent access is not a
         | sign that the page will be used again soon.
        
           | dllthomas wrote:
            | In a similar vein, I once found a sharp decrease in cache
           | misses when I started proactively evicting outbound messages
           | from cache after sending, in a system passing messages
           | between CPUs in circular buffers. (The reasoning generalizes
           | but I would be skeptical that the results do without
           | measuring.)
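            | 
            | On x86 that can be as simple as flushing the lines you
            | just wrote (a sketch; the ring-slot layout is invented,
            | and as noted, whether it helps needs measuring):
            | 
            |     #include <emmintrin.h> /* _mm_clflush */
            |     #include <stdint.h>
            |     #include <string.h>
            | 
            |     #define LINE 64  /* cache-line size */
            | 
            |     void send_msg(uint8_t *slot,
            |                   const void *msg, size_t len)
            |     {
            |         size_t off;
            | 
            |         memcpy(slot, msg, len);
            |         /* push the message out of our cache so
            |          * it stops occupying lines we want for
            |          * other data */
            |         for (off = 0; off < len; off += LINE)
            |             _mm_clflush(slot + off);
            |         _mm_mfence(); /* flushes before notify */
            |     }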
        
       ___________________________________________________________________
       (page generated 2021-04-02 23:01 UTC)