[HN Gopher] The Multi-Generational LRU
___________________________________________________________________
The Multi-Generational LRU
Author : chmaynard
Score : 66 points
Date : 2021-04-02 14:48 UTC (8 hours ago)
(HTM) web link (lwn.net)
(TXT) w3m dump (lwn.net)
| curt15 wrote:
| Would ZFS's Adaptive Replacement Cache be considered for
| inclusion after IBM's patent expires?
| pengaru wrote:
| It always seemed odd to me that page reclaim was completely
| ignorant of CPU/IO priority of the accessing processes.
|
| If I run a backup job with a low priority, especially before
| cgroups existed where I could scope the task's memory use,
| there's nothing preventing it from displacing everything else
| from the page cache as it churns through an entire filesystem.
| Even if everything else in the page cache was put there by equal
| or higher priority processes.
|
| It seems like the priority should at least hint reclaim _some_
| amount rather than just age and activity levels... like the page
| struct should record the highest priority process accessing its
| contents, and not let lower priority processes evict that page
| for reclaim unless pressure is high enough.
| corbet wrote:
| Tracking priority in that way would almost certainly be an
| expensive thing to do. Nowadays, control groups can contain
| that backup job pretty nicely.
| skyde wrote:
| Could you explain how a cgroup would prevent page faults from
| causing more disk I/O than is allocated to the control group?
|
| On most Linux distros I have seen, running a process that
| allocates too much memory just freezes the whole user interface:
|
| 1- the mouse cursor moves super slowly
| 2- it becomes almost impossible to drag and close windows ...
|
| If what you are saying is true, simply running critical
| processes (mouse driver, X-Windows, Gnome) in a high-priority
| cgroup would solve this problem.
| corbet wrote:
| The I/O bandwidth controller would appear to be the droid
| you are looking for.
| KMag wrote:
| > like the page struct should record the highest priority
| process accessing its contents
|
| I'm not aware of any architecture where the MMU knows anything
| about task scheduling priority when marking pages as dirty or
| accessed, so there's a hardware limitation there. A close
| approximation might be to look at the task's (at the kernel
| level, threads and processes are tasks) scheduling priority
| when page-faulting in a page, use it to set some score on the
| page, and decrement the score when the eviction algorithm
| finds the page untouched. I guess it would then evict the
| first page found with a score of zero, or, after examining N
| pages, evict the lowest-scored page found. (One could also use
| weighted reservoir sampling to achieve something closer to
| fairness, so high priority tasks don't always starve low
| priority tasks.)
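|
| A toy sketch of that idea (invented names, nothing like the real
| kernel data structures, just to make the scoring scheme
| concrete):
|
|       /* Hypothetical priority-seeded eviction scoring; not real
|        * kernel code. */
|       #include <stdbool.h>
|       #include <stddef.h>
|
|       struct toy_page {
|           bool referenced;  /* stand-in for the accessed bit */
|           int  score;       /* seeded from the faulting task */
|       };
|
|       struct toy_task {
|           int sched_priority;  /* higher value = higher priority */
|       };
|
|       /* On fault-in, seed the page's score from the faulting
|        * task's scheduling priority. */
|       static void on_fault_in(struct toy_page *page,
|                               const struct toy_task *task)
|       {
|           page->score = task->sched_priority;
|           page->referenced = false;
|       }
|
|       /* Eviction scan: decay untouched pages, take the first
|        * page whose score reaches zero, else the lowest-scored
|        * page seen. */
|       static struct toy_page *pick_victim(struct toy_page *pages,
|                                           size_t npages)
|       {
|           struct toy_page *lowest = NULL;
|
|           for (size_t i = 0; i < npages; i++) {
|               struct toy_page *p = &pages[i];
|
|               if (p->referenced) {
|                   p->referenced = false;  /* recently used */
|                   continue;
|               }
|               if (p->score > 0)
|                   p->score--;             /* decay while idle */
|               if (p->score == 0)
|                   return p;               /* first zero-score page */
|               if (!lowest || p->score < lowest->score)
|                   lowest = p;
|           }
|           return lowest;  /* NULL if every page was referenced */
|       }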
|
| In the case of multiple generations of LRUs, presumably one
| could use the I/O priority of the faulting task to determine
| the initial generation of a newly faulted-in page. This would
| be a pretty minimal change (and might already be implemented
| but that detail left out of the LWN article) and wouldn't
| require any extra state in the page struct.
|
| Though, maybe you get close enough to the same effect simply by
| taking scheduling priority into account when paging in pages.
|
| Some university OS courses must have some rough kernel
| simulators where these sorts of policy changes can be
| simulated, right? Or is it easier to actually just modify a
| Linux or *BSD kernel and run some benchmarks?
| toast0 wrote:
| I'm x86 centric, but the MMU doesn't make decisions about
| which pages to evict; it just handles mapping virtual
| addresses to physical addresses based on the page table, and
| marking pages in the page table as dirty/accessed.
|
| The OS would need to manage priority information to evict
| lower priority pages. That may be part of what's done with
| posix_fadvise information, but I haven't dug deeply into
| that.
|
| With this multi-generational LRU, you could probably add a
| memory priority to a process, and use that to decide which
| generation to put pages in when a process needs them. If a
| low (memory) priority process's pages always go in the oldest
| generation, they would get considered for reclaim much more
| often than a high priority process whose pages get put in the
| youngest generation.
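|
| Something along these lines, maybe (the names and the 0-10
| priority scale are invented for illustration; this is not how
| the patch set actually assigns generations):
|
|       /* Hypothetical mapping from a per-process "memory
|        * priority" to the generation a newly faulted-in page
|        * starts in. Generation 0 is the oldest, reclaimed
|        * first. */
|       #define NR_GENERATIONS 4
|
|       struct toy_proc {
|           /* 0 = lowest (e.g. a backup job), 10 = highest */
|           int mem_priority;
|       };
|
|       static int initial_generation(const struct toy_proc *proc)
|       {
|           /* Scale 0..10 onto generations 0..NR_GENERATIONS-1. */
|           int gen = proc->mem_priority * (NR_GENERATIONS - 1) / 10;
|
|           if (gen < 0)
|               gen = 0;
|           if (gen > NR_GENERATIONS - 1)
|               gen = NR_GENERATIONS - 1;
|           return gen;
|       }
|
| A backup job with mem_priority 0 would then fault its pages
| straight into the oldest generation, so a full filesystem scan
| mostly recycles its own pages instead of pushing out everyone
| else's.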
| KMag wrote:
| I know the MMU doesn't make eviction decisions, but it does
| mark the "dirty" and "accessed" bits for pages.
|
| The GP was suggesting keeping track of the scheduling
| priorities of tasks that were "accessing its [the page's]
| contents", not just keeping track of the highest priority
| task mapping a given page. This would require the MMU, as
| it's setting the dirty and accessed bits, to also keep
| track of the maximum priority that has set either the dirty
| or accessed bit.
|
| Edit: from the GP's reply, it sounds like by "accessing"
| they really just meant mapping, not accessing in the sense
| of the accessed and dirty bits.
| johncolanduoni wrote:
| The dirty and accessed bits are per page table structure,
| so they are per-process. So you can at least take process
| priority into account, if not thread. In theory you could
| take thread priority into account by observing and
| resetting the dirty/accessed bits every context switch,
| but this would be unlikely to be worth it.
| toast0 wrote:
| > In theory you could take thread priority into account
| by observing and resetting the dirty/accessed bits every
| context switch, but this would be unlikely to be worth
| it.
|
| That would most likely be terrible (as you suggest). You
| might have better luck giving each thread a copy of the
| process page table, and keeping that in sync somehow
| (which also sounds terrible, but might be more workable).
| pengaru wrote:
| I assumed it'd be maintained when the processes mapped pages
| into their address space, not at page fault time. Then if the
| process' priority increased, I guess it would have to walk
| the mapped pages to bump their priorities too for
| completeness.
|
| Not that I spent much time thinking about the implementation
| details or seriously exploring how to go about making it
| happen. Having to touch all mapped pages when a process
| changes its priority sounds painful, so it probably requires
| more sophistication to make things optimal.
| KMag wrote:
| Ahh, by "accessing its [the page's] contents", I thought you
| meant keeping track of the priorities of the tasks actually
| making the read and write operations that cause the MMU to
| set the dirty and accessed bits.
|
| It sounds like by "accessing" you weren't referring to the
| page access bit set by the MMU, but rather just having the
| page mapped in the address space.
| rini17 wrote:
| That would require maintaining a priority for every page in
| the cache, and tying it to processes - if a process using the
| cached page exits, the priority should be adjusted. Too
| complicated. A simpler option is for the backup process to say
| "do not cache this". That option is already implemented in the
| kernel and available to userspace.
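|
| One way to do that from userspace is posix_fadvise(); roughly
| (a sketch, error handling mostly omitted):
|
|       /* Sketch: read a file for backup purposes, then tell the
|        * kernel its pages won't be needed again so they can be
|        * dropped from the page cache. */
|       #define _XOPEN_SOURCE 600
|       #include <fcntl.h>
|       #include <unistd.h>
|
|       static void backup_read(const char *path)
|       {
|           char buf[64 * 1024];
|           off_t done = 0;
|           ssize_t n;
|           int fd = open(path, O_RDONLY);
|
|           if (fd < 0)
|               return;
|
|           while ((n = read(fd, buf, sizeof(buf))) > 0) {
|               /* ... hash/copy the chunk into the backup ... */
|               posix_fadvise(fd, done, n, POSIX_FADV_DONTNEED);
|               done += n;
|           }
|           close(fd);
|       }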
| pengaru wrote:
| Hmm yes, unwinding the priorities when processes exit is
| complicated. But I wonder if simply leaving the priority at
| its highest level reached would be harmless; it really
| depends on how much influence the heuristic has.
|
| It doesn't strike me as particularly wrong for a page to
| still be classified as elevated priority if some higher
| priority process had classified it as such even if that
| process no longer exists. At least not for file-backed pages,
| which are really what I'm interested in, in this context.
| Chances are good the file gets accessed again in a similar
| fashion, and the priority stickiness could help keep it around.
|
| The "do not cache this" approach is quite heavy-handed and
| can be problematic if you want _some_ caching for things like
| producing deltas, or in the case of a restore from backup
| reassembling from myriad delta files, without blowing out the
| page cache.
|
| As corbet mentioned, the cgroups approach does provide a
| solution to constrain page cache effects.
| marcodiego wrote:
| Related: https://news.ycombinator.com/item?id=26451220
| greggyb wrote:
| Bryan Cantrill does a pretty good survey of caching strategies in
| the intro to this Papers We Love talk on the ARC in ZFS.
|
| https://www.youtube.com/watch?v=F8sZRBdmqc0
| devit wrote:
| Why are multiple lists needed instead of a single list where
| pages that are found to be accessed are put in the front and
| reclaim acts on the back?
| teraflop wrote:
| That's explained in the second paragraph of the article:
|
| > Chances are that pages that have been accessed in the recent
| past will be useful again in the future, but there are
| exceptions. Consider, for example, an application that is
| reading sequentially through a file. Each page of the file will
| be put into the page cache as it is read, but the application
| will never need it again; in this case, recent access is not a
| sign that the page will be used again soon.
| dllthomas wrote:
| In a similar vein I once found a sharp decrease in cache
| misses when I started proactively evicting outbound messages
| from cache after sending, in a system passing messages
| between CPUs in circular buffers. (The reasoning generalizes
| but I would be skeptical that the results do without
| measuring.)
___________________________________________________________________
(page generated 2021-04-02 23:01 UTC)