[HN Gopher] Linux Memory Management FAQ
       ___________________________________________________________________
        
       Linux Memory Management FAQ
        
       Author : janselman
       Score  : 80 points
       Date   : 2021-02-12 19:25 UTC (3 hours ago)
        
 (HTM) web link (landley.net)
 (TXT) w3m dump (landley.net)
        
       | mlaretallack wrote:
        | The number of times I have had to explain how mm works is
        | draining. Yes, you can malloc 2M; no, that does not mean you
        | have 2M to use.
        
         | dataflow wrote:
         | Well, it does mean that in C. But some folks prefer to play by
         | their own rules.
        
           | barnacled wrote:
            | Actually no, malloc doesn't allocate any memory; it just
            | updates the process's VMA to say that the allocated
            | virtual range is valid. The pages are then faulted in on
            | write. This is where things like the OOM killer become
            | very confusing for people.
           | 
            | In Linux (in sane configurations), allocations are just
            | preorders.
           | 
           | EDIT: I can't reply below due to rate limiting:
           | 
           | I'd argue that overcommit just makes the difference between
           | allocation and backing very stark.
           | 
            | Your memory IS in fact allocated in the process's VMA;
            | it's just that the anonymous pages cannot necessarily be
            | backed.
           | 
            | This differs, obviously, in other OSes, as pointed out.
            | It also differs if you turn overcommit off, but since so
            | much in Linux assumes it, your system will soon break if
            | you try.
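            | 
            | A minimal sketch of the fault-on-write behaviour (Linux-
            | specific; assumes default overcommit, and reads VmRSS
            | from /proc/self/status):
            | 
            |     /* Allocate 1 GiB, then touch one byte per page:
            |      * resident memory only grows as pages are faulted
            |      * in on write. */
            |     #include <stdio.h>
            |     #include <stdlib.h>
            |     #include <string.h>
            | 
            |     static void print_rss(const char *label)
            |     {
            |         FILE *f = fopen("/proc/self/status", "r");
            |         char line[256];
            | 
            |         while (f && fgets(line, sizeof line, f))
            |             if (strncmp(line, "VmRSS:", 6) == 0)
            |                 printf("%s %s", label, line);
            |         if (f)
            |             fclose(f);
            |     }
            | 
            |     int main(void)
            |     {
            |         size_t len = 1UL << 30;  /* 1 GiB */
            |         char *p = malloc(len);   /* just sets up a VMA */
            | 
            |         if (!p)
            |             return 1;
            |         print_rss("after malloc: ");
            |         for (size_t i = 0; i < len; i += 4096)
            |             p[i] = 1;  /* each write faults a page in */
            |         print_rss("after touch:  ");
            |         free(p);
            |         return 0;
            |     }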
        
             | wahern wrote:
              | > It also differs if you turn overcommit off, but since
              | so much in Linux assumes it, your system will soon
              | break if you try.
             | 
              | I agree, reliance on overcommit has resulted in
              | stability problems in Linux. But IME stability problems
              | aren't induced by disabling overcommit; they're induced
              | by disabling swap. The stability problems occur
              | precisely
             | because by relying on magical heuristics to save the day,
             | we end up with an overall MM architecture that reacts
             | extremely poorly under memory pressure. Whether or not
             | overcommit is enabled, physical memory is a limited
             | resource, and when Linux can't relieve physical memory
             | pressure by dumping pages to disk, bad things happen,
             | especially when under heavy I/O load (e.g. the buffer cache
             | can grab pages faster than the OOM killer can free them).
        
             | quotemstr wrote:
             | And that's why strict allocation tracking (no overcommit)
             | should be the default. But those of us in favor of
             | guaranteed forward progress and sensible resource
             | accounting lost this fight a long time ago.
        
             | dataflow wrote:
             | I said "in C". You're talking "in Linux" (or
             | glibc/whatever). Which, as I already said, plays by its own
             | rules and defies C. It's broken by design.
        
             | jdsully wrote:
              | Per the C standard, malloc should return NULL if it
              | can't fulfill the request. Linux violates this, but it
              | usually works out in the end, since virtual memory
              | makes true OOM very rare.
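              | 
              | A minimal sketch of the NULL path the standard
              | requires (the absurd request below fails up front even
              | on an overcommitting system, since glibc rejects sizes
              | above PTRDIFF_MAX):
              | 
              |     #include <stdio.h>
              |     #include <stdlib.h>
              | 
              |     int main(void)
              |     {
              |         void *p = malloc((size_t)-1);
              | 
              |         if (p == NULL) {
              |             fputs("malloc returned NULL\n", stderr);
              |             return 1;
              |         }
              |         free(p);
              |         return 0;
              |     }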
        
             | wahern wrote:
             | This depends on the OS. Solaris and Windows both do strict
              | accounting by default, and overcommit is opt-in at a
              | fine-grained API level. Linux is relatively extreme in
              | its embrace
             | of overcommit. So extreme that strict accounting isn't even
             | possible--even if you disable overcommit in Linux, there
             | are too many corner cases in the kernel where a process
             | (including innocent processes) will be shot down under
             | memory pressure. Too many Linux kernel programmers designed
             | their subsystems with the overcommit mentality. That said,
             | I still always disable overcommit as it makes it less
             | likely for innocent processes to be killed when under heavy
             | load.
             | 
             | An example of a split-the-difference approach is macOS,
             | which AFAIU implements overcommit but also dynamically
             | instantiates swap so that overcommit-induced OOM killing
             | won't occur until your disk is full.
             | 
             | Also, it's worth mentioning that on _all_ these systems
             | process limits (see, e.g., setrlimit(2)) can still result
             | in malloc returning NULL.
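              | 
              | A minimal sketch of that last point (Linux; capping
              | the address space with setrlimit(2) makes malloc
              | return NULL even with overcommit enabled):
              | 
              |     #include <stdio.h>
              |     #include <stdlib.h>
              |     #include <sys/resource.h>
              | 
              |     int main(void)
              |     {
              |         /* Cap this process's address space at 64 MiB. */
              |         struct rlimit rl = { 64UL << 20, 64UL << 20 };
              | 
              |         if (setrlimit(RLIMIT_AS, &rl) != 0) {
              |             perror("setrlimit");
              |             return 1;
              |         }
              |         /* 256 MiB exceeds the cap: expect (nil). */
              |         void *p = malloc(256UL << 20);
              | 
              |         printf("malloc(256 MiB) -> %p\n", p);
              |         free(p);
              |         return 0;
              |     }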
        
       | phtrivier wrote:
        | The Drepper series of articles dates from 2007. Is it still
        | relevant, or has anything fundamental changed in memory
        | handling in the last 13 years?
        
         | Agingcoder wrote:
         | It's still relevant.
         | 
         | Other than that, I also think that even when outdated,
         | computing history is worth reading anyway, since it gives you a
         | natural understanding of _why_ we do what we do these days. In
         | your day job, it also gives you a different appreciation for
         | what people did and why they did it, and why 'this horrible
         | code' may have made sense at the time.
         | 
          | Furthermore, performance engineering is fundamentally about
          | weighing code against hardware limitations. If the hardware
          | limitations are different, you'll get different code, but
          | the principles remain the same.
          | 
          | If you're curious, write a basic emulator for older
          | hardware (the NES is a great choice); it's both fun and
          | eye-opening!
          | 
          | Edit: the NES emulator will answer 'how do you fit Super
          | Mario Bros in 32k, and how can it run on such limited
          | hardware?'
        
           | einpoklum wrote:
           | > computing history is worth reading anyway
           | 
            | Sometimes, but a description of the state of the art in
            | the past does not automatically become a historical tract
            | with the passage of time. The better ones do; the rest
            | just become outdated.
        
       | [deleted]
        
       | blt wrote:
       | Much of this applies to other OSes with virtual memory also.
        
       | dbattaglia wrote:
       | "Virtual addresses are the size of a CPU register. On 32 bit
       | systems each process has 4 gigabytes of virtual address space all
       | to itself, which is often more memory than the system actually
       | has."
       | 
       | I guess this is not the most up-to-date document?
        
         | bonzini wrote:
          | On 32-bit systems, 4 GiB is indeed often more memory than
          | the system has (think 512 MiB for some Raspberry Pis). And
          | on 64-bit x86 systems each process has 256 TiB of virtual
          | address space, which is also more memory than the system
          | has.
        
         | sigjuice wrote:
         | I think there might be some more hardware-specific nuance here.
         | e.g. /proc/cpuinfo says this on a couple of different x86_64
          | systems that I checked.
          | 
          |     address sizes : 36 bits physical, 48 bits virtual
          |     address sizes : 40 bits physical, 48 bits virtual
         | 
         | PS: I don't understand what this means, btw.
        
           | db48x wrote:
            | Your CPUs can handle 36- and 40-bit physical memory
            | addresses respectively (up to 64 GB and 1 TB of physical
            | memory), and 48-bit virtual addresses (256 TB). Your
            | operating system maintains a mapping from virtual to
            | physical addresses, usually arranging the map so that
            | every process has a separate memory space. Pointers are
            | all still 64 bits long though.
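            | 
            | A minimal sketch of "separate memory space" (POSIX;
            | after fork(), parent and child write different values
            | through the same virtual address):
            | 
            |     #include <stdio.h>
            |     #include <stdlib.h>
            |     #include <sys/wait.h>
            |     #include <unistd.h>
            | 
            |     int main(void)
            |     {
            |         int *x = malloc(sizeof *x);
            | 
            |         if (!x)
            |             return 1;
            |         *x = 0;
            |         if (fork() == 0) {
            |             /* Copy-on-write: the child gets its own
            |              * physical page behind this address. */
            |             *x = 1;
            |             printf("child : %p = %d\n", (void *)x, *x);
            |             _exit(0);
            |         }
            |         wait(NULL);
            |         /* Same virtual address, still 0 here. */
            |         printf("parent: %p = %d\n", (void *)x, *x);
            |         free(x);
            |         return 0;
            |     }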
        
             | barnacled wrote:
              | In practice the usable address space for userland is
              | 128 TiB: half of the 48-bit space, with the kernel
              | occupying the upper half, where among other things it
              | maintains a virtual mapping of the entire physical
              | address space (minus I/O ranges) [0].
              | 
              | However, newer Intel chips with 5-level paging [1]
              | will allow up to 57 bits of address space, 128 PiB in
              | theory, though in practice 64 PiB of userland memory.
              | See also [0] for discussion of the practical limits of
              | 5-level paging too!
             | 
             | [0]:https://github.com/lorenzo-stoakes/linux-mm-
             | notes/blob/maste...
             | 
             | [1]:https://en.wikipedia.org/wiki/Intel_5-level_paging
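              | 
              | A quick sketch of the arithmetic behind those figures:
              | 
              |     #include <stdio.h>
              | 
              |     int main(void)
              |     {
              |         /* 4-level paging: 48-bit virtual addresses. */
              |         printf("2^48 = %llu TiB\n", (1ULL << 48) >> 40);
              |         printf("user  = %llu TiB\n", (1ULL << 47) >> 40);
              |         /* 5-level paging: 57-bit virtual addresses. */
              |         printf("2^57 = %llu PiB\n", (1ULL << 57) >> 50);
              |         printf("user  = %llu PiB\n", (1ULL << 56) >> 50);
              |         return 0;
              |     }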
        
               | db48x wrote:
                | True, though /proc/cpuinfo only reports the sizes,
                | which are ultimately what the CPU cares about. Plus
                | the most relevant limit is what your motherboard and
                | wallet support, which is often far lower.
        
               | barnacled wrote:
                | Indeed, and as you say, realistically you are hardly
                | likely to hit those limits in any plausible (esp.
                | home) setup. The meaningful limit is usually the
                | physical one, as consumer CPUs very often have
                | stringent memory limits (often 32 GiB or so), and of
                | course you are bound by the motherboard's
                | limitations also.
               | 
               | Having said that I did write a patch to ensure that the
               | system would boot correctly with 256 TiB of RAM [0] so
               | perhaps I am not always a realist... or dream of the day
               | I can own that system ;)
               | 
               | [0]:https://git.kernel.org/pub/scm/linux/kernel/git/next/
               | linux-n...
        
               | db48x wrote:
               | You're not the only one dreaming; I had to use >200GB of
               | swap on my home system last year.
        
             | sigjuice wrote:
             | So are the 16 leftmost bits of a virtual address always 0?
        
               | jcul wrote:
                | Yes, generally for userspace addresses they are 0.
                | But more importantly, they can be used for other
                | stuff, commonly referred to as pointer tagging /
                | smuggling, etc.
                | 
                | It's a useful optimisation technique whereby you can
                | attach some extra metadata without having to
                | dereference the pointer.
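                | 
                | A minimal sketch of the idea (x86-64 assumptions;
                | the tag layout is made up for illustration):
                | 
                |     #include <stdint.h>
                |     #include <stdio.h>
                |     #include <stdlib.h>
                | 
                |     #define TAG_SHIFT 48
                |     #define TAG_MASK (0xffffULL << TAG_SHIFT)
                | 
                |     /* Stash a 16-bit tag in the unused top bits. */
                |     static void *tag(void *p, uint16_t t)
                |     {
                |         return (void *)((uintptr_t)p |
                |                 ((uintptr_t)t << TAG_SHIFT));
                |     }
                | 
                |     /* Strip the tag before dereferencing: the raw
                |      * tagged pointer is non-canonical. */
                |     static void *untag(void *p)
                |     {
                |         return (void *)((uintptr_t)p &
                |                 ~(uintptr_t)TAG_MASK);
                |     }
                | 
                |     int main(void)
                |     {
                |         int *x = malloc(sizeof *x);
                | 
                |         if (!x)
                |             return 1;
                |         *x = 42;
                |         int *t = tag(x, 0xbeef);
                |         printf("tag=%#x value=%d\n",
                |                (unsigned)((uintptr_t)t >> TAG_SHIFT),
                |                *(int *)untag(t));
                |         free(x);
                |         return 0;
                |     }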
        
               | barnacled wrote:
                | They have to be the same as the most significant
                | addressable bit, i.e. in the case of a 48-bit virtual
                | address size, bit 47 (the 48th bit).
                | 
                | This is actually kind of a cute way of dividing
                | kernel and userland space, as you just set the upper
                | bits to 1 for kernel addresses and 0 for userland.
               | 
               | EDIT: Specifically talking about x86-64 here.
               | 
               | https://github.com/lorenzo-stoakes/linux-mm-
               | notes/blob/maste...
        
               | amscanne wrote:
               | No, it must be sign-extended from the top bit of the
               | valid set. Otherwise the address is non-canonical.
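                | 
                | A minimal sketch of the canonical-address rule for
                | 48-bit virtual addresses (an illustrative check, not
                | a kernel API):
                | 
                |     #include <stdbool.h>
                |     #include <stdint.h>
                |     #include <stdio.h>
                | 
                |     /* Bits 63:47 must be all zeros (userland) or
                |      * all ones (kernel). */
                |     static bool is_canonical_48(uint64_t addr)
                |     {
                |         uint64_t top = addr >> 47;  /* 17 bits */
                | 
                |         return top == 0 || top == 0x1ffff;
                |     }
                | 
                |     int main(void)
                |     {
                |         int x;
                | 
                |         printf("%d %d\n",
                |                is_canonical_48((uintptr_t)&x),
                |                is_canonical_48(1ULL << 62));
                |         return 0;
                |     }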
        
               | xxpor wrote:
               | This is true for x86-64, not true for other architectures
               | such as arm64.
               | 
                | Apple uses the high bits to cryptographically sign
                | the pointer value.
        
               | everybodyknows wrote:
               | Fascinating. Does this confer some of the benefits of ECC
               | RAM, for pointer data only -- without the hardware cost?
        
               | db48x wrote:
               | Oddly enough the unused bits are in the middle of the
               | address. They're also sign-extended rather than filled
               | with zeros, so sometimes they are ones and other times
               | they are zeros.
        
           | throwaway8581 wrote:
            | It means that the CPU can address 40 bits' worth of
            | physical memory, but that virtual addresses can use 48
            | bits. Physical addresses are just your RAM bytes,
            | numbered from 0 up to however much you have. Virtual
            | address space is the address space of a process, which
            | includes mapped physical memory, unmapped pages, guard
            | pages, and other virtual memory tricks.
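            | 
            | A minimal sketch for poking at this on Linux (it just
            | dumps the calling process's own virtual layout):
            | 
            |     #include <stdio.h>
            | 
            |     int main(void)
            |     {
            |         /* Each line of /proc/self/maps is one mapped
            |          * virtual range with its permissions. */
            |         FILE *f = fopen("/proc/self/maps", "r");
            |         int c;
            | 
            |         if (!f)
            |             return 1;
            |         while ((c = getc(f)) != EOF)
            |             putchar(c);
            |         fclose(f);
            |         return 0;
            |     }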
        
         | littlestymaar wrote:
          | If you are thinking about the "which is often more memory
          | than the system actually has" part, I don't know if it's
          | outdated even today: the vast majority of Linux systems
          | these days are Android phones, and I wouldn't be surprised
          | at all if a good proportion of those had no more than 4GB
          | of RAM.
        
       | barnacled wrote:
        | For anybody who's interested, I also wrote up a whole bunch
        | of notes on this at https://github.com/lorenzo-stoakes/linux-
        | vm-notes, superseded by the far more recent
        | https://github.com/lorenzo-stoakes/linux-mm-notes
        | 
        | I have made a few patches to the mm subsystem, some of them
        | inspired simply by the research for these articles.
        
         | aduitsis wrote:
         | Thank you, this is great!
        
       ___________________________________________________________________
       (page generated 2021-02-12 23:00 UTC)