https://lwn.net/SubscriberLink/851184/01351eb745a6405d/

LWN.net Logo LWN
.net News from the source LWN

  * Content
      + Weekly Edition
      + Archives
      + Search
      + Kernel
      + Security
      + Distributions
      + Events calendar
      + Unread comments
      + -------------------------------------------------------------
      + LWN FAQ
      + Write for us

User: [        ] Password: [        ] [Log in]
|
[Subscribe]
|
[Register]
Subscribe / Log in / New account

The multi-generational LRU

[LWN subscriber-only content]

    Welcome to LWN.net

    The following subscription-only content has been made available
    to you by an LWN subscriber. Thousands of subscribers depend on
    LWN for the best news from the Linux and free software
    communities. If you enjoy this article, please consider
    subscribing to LWN. Thank you for visiting LWN.net!

By Jonathan Corbet
April 2, 2021
One of the key tasks assigned to the memory-management subsystem is
to optimize the system's use of the available memory; that means
pushing out pages containing unused data so that they can be put to
better use elsewhere. Predicting which pages will be accessed in the
near future is a tricky task, and the kernel has evolved a number of
mechanisms designed to improve its chances of guessing right. But the
kernel not only often gets it wrong, it also can expend a lot of CPU
time to make the incorrect choice. The multi-generational LRU patch
set posted by Yu Zhao is an attempt to improve that situation.

In general, the kernel cannot know which pages will be accessed in
the near future, so it must rely on the next-best indicator: the set
of pages that have been used recently. Chances are that pages that
have been accessed in the recent past will be useful again in the
future, but there are exceptions. Consider, for example, an
application that is reading sequentially through a file. Each page of
the file will be put into the page cache as it is read, but the
application will never need it again; in this case, recent access is
not a sign that the page will be used again soon.

The kernel tracks pages using a pair of least-recently-used (LRU)
lists. Pages that have been recently accessed are kept on the
"active" list, with just-accessed pages put at the head of the list.
Pages are taken off the tail of the list if they have not been
accessed recently and placed at the head of the "inactive" list. That
list is a sort of purgatory; if some process accesses a page on the
inactive list, it will be promoted back to the active list. Some
pages, like those from the sequentially read file described above,
start life on the inactive list, meaning they will be reclaimed
relatively quickly if there is no further need for them.

There are more details, of course. It's worth noting that there are
actually two pairs of lists, one for anonymous pages and one for
file-backed pages. If memory control groups are in use, there is a
whole set of LRU lists for each active group.

Zhao's patch set identifies a number of problems with the current
state of affairs. The active/inactive sorting is too coarse for
accurate decision making, and pages often end up on the wrong lists
anyway. The use of independent lists in control groups makes it hard
for the kernel to compare the relative age of pages across groups.
The kernel has a longstanding bias toward evicting file-backed pages
for a number of reasons, which can cause useful file-backed pages to
be tossed while idle anonymous pages remain in memory. This problem
has gotten worse in cloud-computing environments, where clients have
relatively little local storage and, thus, relatively few file-backed
pages in the first place. Meanwhile, the scanning of anonymous pages
is expensive, partly because it uses a complex reverse-mapping
mechanism that does not perform well when a lot of scanning must be
done.

Closing the generation gap

The multi-generational LRU patches try to address these problems with
two fundamental changes:

  * Add more LRU lists to cover a range of page ages between the
    current active and inactive lists; these lists are called
    "generations".
  * Change the way page scanning is done to reduce its overhead.

Newly activated pages are assigned to the youngest generation (though
there are some exceptions described below). Over time, the
memory-management subsystem will scan over a process's pages to
determine whether each has been used since the last scan; any that
have remained idle are moved to the next older generation. Pages of
any generation that show activity are moved back to the youngest
generation.

The result of this work is a spectrum of page ages, from those quite
recently accessed to those that have not been used in some time. The
number of generations can be configured into the kernel; that number
seems to be as small as four for phones to several times that for
cloud-based servers.

When the time comes to reclaim pages, only the oldest generation need
be considered. The "oldest generation" can be different for anonymous
and file-backed pages; anonymous pages can be harder to reclaim in
general (they must always be written to swap) and the new code
retains some of the bias toward reclaiming file-backed pages more
aggressively. So file-backed pages may not escape reclaim for as many
generations as anonymous pages do. The current patch only allows
reclaim of file-backed pages to get one generation ahead of that for
anonymous pages, though.

The multi-generational mechanism, it is claimed, is more accurate
than the current two-list approach; by the time a page makes it to
the oldest generation, its chances of being unneeded are rather
higher than they are for pages on the inactive list. That, in turn,
means that these pages can be reclaimed more aggressively, making
more memory available for tasks that will actually make use of it.
This mechanism allows for ready comparison of the ages of anonymous
and file-backed pages, and, by tracking the creation time of each
generation, of the ages of pages in different control groups; this
information is lost in current kernels. That, in turn, makes it
easier to identify and reclaim idle anonymous pages.

The other claimed advantage is in the change to how pages are
scanned. Pages are accessed via the page-table entries (PTEs) in
every process that has them mapped; the "recently accessed" bit lives
in those page-table entries. Current kernels, though, scan through
the pages themselves, and must use reverse-mapping to find and check
the associated PTEs; that is expensive. The multi-generational LRU
code, instead, scans over PTEs directly, an approach with better
locality. A hook in the scheduler helps to track processes that have
actually run since the last scan, so idle processes can be skipped.

The multi-generational LRU also benefits from skipping many of the
heuristics that are used in current kernels to decide which pages
should be reclaimed. There are still a few, though. For example, when
a page is first established, its generation is picked with these
rules:

  * Pages that are being faulted in are assigned to the youngest
    generation, as one would expect.
  * The activation of pages that are unmapped (pages resident in
    memory but with no PTEs pointing to them; these can include pages
    chosen for reclaim but not actually reclaimed before being
    referenced again) are added to the second-youngest generation.
    This is seemingly done to avoid making the youngest generation
    look too big, which might delay further page scanning until the
    next generation can be created.
  * Pages that are being reclaimed, but which must persist while
    their contents are written to backing store, are added to the
    second-oldest generation. That prevents another attempt to
    reclaim them while the writeback is underway.
  * Pages that are being deactivated go into the oldest generation.
    That is also the fate of pages that were brought in by the
    readahead mechanism; reading those pages is a speculative act on
    the kernel's part in the first place, with no guarantee that they
    will ever be useful.

There are a few knobs exported to user space to control this
mechanism, including the ability to turn the multi-generational code
off entirely; see this documentation patch for more information.

Generational change

The end result of all this work, it is claimed, is that page reclaim
is much more efficient and better targeted than before. Systems like
Android, when using this code, record fewer low-memory kills (when an
app process is killed due to memory pressure), Chrome OS shows fewer
out-of-memory kills, and server systems are better able to use
available memory. It looks like an improvement all around.

Given that, one might wonder why the multi-generational algorithm is
kept separate from the rest of the memory-management code and is made
optional. It is, in essence, an independent approach to page aging
and reclaim that exists alongside the current LRU lists. The answer,
presumably, is that there are a lot of workloads out there, and some
of them may not benefit from the multi-generational approach. There
will need to be a lot more testing done to learn where the
multi-generational LRU falls down and what might need to be done to
keep that from happening.

The multi-generational LRU might eventually win over the
memory-management developers, most of whom have not yet commented on
this patch set. It does seem likely, though, that it will need to
demonstrate better performance (or at least a lack of performance
regressions) across the bulk of the workloads out there, to the point
that it could be considered as a replacement for the current LRU
rather than an addition to it. The idea of maintaining two separate
LRU schemes is going to be a hard sell in the kernel community; it
would be far better to just switch over completely to the
multi-generational LRU if it is truly better.

Answering that question is certain to be a long process. Even
relatively small memory-management changes can take a while to merge;
it is just too easy to introduce performance penalties for some
users. This change is not "relatively small", so the bar for
inclusion will be high. But if the multi-generational LRU lives up to
its claims, it may just be able to clear that bar -- eventually.

           Index entries for this article
Kernel Memory management/Page replacement algorithms


[Send a free link]


-----------------------------------------
(Log in to post comments)


                  Copyright (c) 2021, Eklektix, Inc.
   Comments and public postings are copyrighted by their creators.
          Linux is a registered trademark of Linus Torvalds