[HN Gopher] Linux Memory Management FAQ
___________________________________________________________________
Linux Memory Management FAQ
Author : janselman
Score : 80 points
Date : 2021-02-12 19:25 UTC (3 hours ago)
(HTM) web link (landley.net)
(TXT) w3m dump (landley.net)
| mlaretallack wrote:
| The number of times I have had to explain how mm works is
| draining. Yes, you can malloc 2M; no, that does not mean you
| have 2M to use.
| dataflow wrote:
| Well, it does mean that in C. But some folks prefer to play by
| their own rules.
| barnacled wrote:
| Actually no, malloc doesn't allocate any memory; it just
| updates the process's VMAs to say that the allocated virtual
| range is valid. The pages are then faulted in on first write
| (see the sketch at the end of this comment). This is where
| things like the OOM killer become very confusing for people.
|
| In Linux (in sane configurations) allocations are just
| preorders.
|
| EDIT: I can't reply below due to rate limiting:
|
| I'd argue that overcommit just makes the difference between
| allocation and backing very stark.
|
| Your memory IS in fact allocated in the process VMA, it's just
| that the anonymous pages cannot necessarily be backed.
|
| This differs, obviously, in other OSes, as pointed out. It also
| differs if you turn overcommit off, but since so much in Linux
| assumes it, your system will soon break if you try.
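|
| A minimal sketch of what I mean (assuming a 64-bit Linux system
| with default overcommit; the 1 GiB size is arbitrary): the
| malloc returns immediately, but resident memory only grows once
| the pages are actually written and faulted in.
|
|     #include <stdio.h>
|     #include <stdlib.h>
|     #include <string.h>
|
|     /* Print the VmRSS line from /proc/self/status. */
|     static void print_rss(const char *label)
|     {
|         char line[256];
|         FILE *f = fopen("/proc/self/status", "r");
|         if (!f)
|             return;
|         while (fgets(line, sizeof(line), f))
|             if (strncmp(line, "VmRSS:", 6) == 0)
|                 printf("%s %s", label, line);
|         fclose(f);
|     }
|
|     int main(void)
|     {
|         size_t size = 1UL << 30; /* 1 GiB */
|         char *p = malloc(size);  /* succeeds instantly: just a VMA update */
|         if (!p)
|             return 1;
|
|         print_rss("after malloc:"); /* RSS still small, pages not faulted in */
|         memset(p, 1, size);         /* each write faults a page into place   */
|         print_rss("after memset:"); /* RSS now roughly 1 GiB larger          */
|
|         free(p);
|         return 0;
|     }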
| wahern wrote:
| > It also differs if you turn overcommit off, but since so much
| in Linux assumes it, your system will soon break if you try.
|
| I agree, reliance on overcommit has resulted in stability
| problems in Linux. But IME stability problems aren't
| induced by disabling overcommit, they're induced by
| disabling swap. The stability problems occur precisely
| because by relying on magical heuristics to save the day,
| we end up with an overall MM architecture that reacts
| extremely poorly under memory pressure. Whether or not
| overcommit is enabled, physical memory is a limited
| resource, and when Linux can't relieve physical memory
| pressure by dumping pages to disk, bad things happen,
| especially when under heavy I/O load (e.g. the buffer cache
| can grab pages faster than the OOM killer can free them).
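|
| For reference, the policy being discussed here lives in
| /proc/sys/vm/overcommit_memory (0 = heuristic overcommit, 1 =
| always overcommit, 2 = strict accounting). A small sketch that
| just prints the current setting and the kernel's commit
| accounting counters:
|
|     #include <stdio.h>
|     #include <string.h>
|
|     /* Print lines from a /proc file, optionally filtered by prefix. */
|     static void print_proc(const char *path, const char *prefix)
|     {
|         char line[256];
|         FILE *f = fopen(path, "r");
|         if (!f)
|             return;
|         while (fgets(line, sizeof(line), f))
|             if (!prefix || strncmp(line, prefix, strlen(prefix)) == 0)
|                 printf("%s", line);
|         fclose(f);
|     }
|
|     int main(void)
|     {
|         printf("vm.overcommit_memory = ");
|         print_proc("/proc/sys/vm/overcommit_memory", NULL);
|         print_proc("/proc/meminfo", "CommitLimit:");  /* only enforced in mode 2 */
|         print_proc("/proc/meminfo", "Committed_AS:");
|         return 0;
|     }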
| quotemstr wrote:
| And that's why strict allocation tracking (no overcommit)
| should be the default. But those of us in favor of
| guaranteed forward progress and sensible resource
| accounting lost this fight a long time ago.
| dataflow wrote:
| I said "in C". You're talking "in Linux" (or
| glibc/whatever). Which, as I already said, plays by its own
| rules and defies C. It's broken by design.
| jdsully wrote:
| In the C standard, malloc should return NULL if it can't
| fulfill the request. Linux violates this, but it usually
| works out in the end since virtual memory makes true OOM
| very rare.
| wahern wrote:
| This depends on the OS. Solaris and Windows both do strict
| accounting by default, and overcommit is opt-in at a
| fine-grained API level. Linux is relatively extreme in its embrace
| of overcommit. So extreme that strict accounting isn't even
| possible--even if you disable overcommit in Linux, there
| are too many corner cases in the kernel where a process
| (including innocent processes) will be shot down under
| memory pressure. Too many Linux kernel programmers designed
| their subsystems with the overcommit mentality. That said,
| I still always disable overcommit as it makes it less
| likely for innocent processes to be killed when under heavy
| load.
|
| An example of a split-the-difference approach is macOS,
| which AFAIU implements overcommit but also dynamically
| instantiates swap so that overcommit-induced OOM killing
| won't occur until your disk is full.
|
| Also, it's worth mentioning that on _all_ these systems
| process limits (see, e.g., setrlimit(2)) can still result
| in malloc returning NULL.
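|
| A quick sketch of that last point (Linux/POSIX; the 256 MiB cap
| is arbitrary): with RLIMIT_AS lowered via setrlimit(2), a large
| malloc returns NULL even on an overcommitting kernel, because
| the address-space reservation itself is refused.
|
|     #include <stdio.h>
|     #include <stdlib.h>
|     #include <sys/resource.h>
|
|     int main(void)
|     {
|         /* Cap this process's address space at 256 MiB. */
|         struct rlimit rl = { 256UL << 20, 256UL << 20 };
|         if (setrlimit(RLIMIT_AS, &rl) != 0) {
|             perror("setrlimit");
|             return 1;
|         }
|
|         void *p = malloc(1UL << 30); /* 1 GiB request exceeds the cap */
|         if (p == NULL)
|             printf("malloc returned NULL: address space limit hit\n");
|         else
|             free(p);
|         return 0;
|     }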
| phtrivier wrote:
| The Drepper series of articles dates from 2007. Is it still
| relevant, or has anything fundamental changed in memory handling
| in the last 13 years?
| Agingcoder wrote:
| It's still relevant.
|
| Other than that, I also think that even when outdated,
| computing history is worth reading anyway, since it gives you a
| natural understanding of _why_ we do what we do these days. In
| your day job, it also gives you a different appreciation for
| what people did and why they did it, and why 'this horrible
| code' may have made sense at the time.
|
| Furthermore, performance engineering is fundamentally about
| pitting code against hardware limitations. If the hardware
| limitations are different, you'll get different code, but the
| principles remain the same.
|
| If you're curious, write a basic emulator for older hardware
| (the NES is a great choice); it's both fun and eye-opening!
|
| Edit: the NES emulator will answer 'how do you fit Super Mario
| Bros. in 32K, and how can it run on such limited hardware?'
| einpoklum wrote:
| > computing history is worth reading anyway
|
| Sometimes, but a description of the state of the art in the
| past does not become a historical tract with the passage of
| time. The better ones do; others just become outdated.
| [deleted]
| blt wrote:
| Much of this applies to other OSes with virtual memory also.
| dbattaglia wrote:
| "Virtual addresses are the size of a CPU register. On 32 bit
| systems each process has 4 gigabytes of virtual address space all
| to itself, which is often more memory than the system actually
| has."
|
| I guess this is not the most up-to-date document?
| bonzini wrote:
| On 32-bit systems, 4 GiB is indeed often more memory than the
| system has (think 512 MiB for some Raspberry Pis). And on
| 64-bit x86 systems each process has 256 TiB of virtual address
| space, which is also more memory than the system has.
| sigjuice wrote:
| I think there might be some more hardware-specific nuance here,
| e.g. /proc/cpuinfo says this on a couple of different x86_64
| systems that I checked:
|
|       address sizes   : 36 bits physical, 48 bits virtual
|       address sizes   : 40 bits physical, 48 bits virtual
|
| PS: I don't understand what this means, btw.
| db48x wrote:
| Your CPUs can handle 36- and 40-bit physical memory addresses
| respectively (up to 64 GiB and 1 TiB of physical memory), and
| 48-bit virtual addresses (256 TiB). Your operating system
| maintains a mapping from virtual to physical addresses, usually
| arranging the map so that every process has a separate memory
| space. Pointers are all still 64 bits long though.
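|
| A quick arithmetic sketch of where those capacities come from
| (nothing system-specific, just 2^N bytes for N address bits):
|
|     #include <stdio.h>
|     #include <stdint.h>
|
|     int main(void)
|     {
|         /* 36 and 40 physical bits, 48 virtual bits, as in the
|          * /proc/cpuinfo output above. */
|         unsigned bits[] = { 36, 40, 48 };
|         for (unsigned i = 0; i < 3; i++) {
|             uint64_t bytes = 1ULL << bits[i];
|             printf("%u bits -> %llu GiB\n", bits[i],
|                    (unsigned long long)(bytes >> 30));
|         }
|         return 0; /* prints 64, 1024 (1 TiB) and 262144 (256 TiB) GiB */
|     }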
| barnacled wrote:
| In practice the actual usable address space for userland is 64
| TiB, due to the user/kernel split and the kernel maintaining a
| virtual mapping of the entire physical address space (minus I/O
| ranges) [0].
|
| However, newer incoming Intel chips with 5-level paging [1]
| will allow up to 57 bits of address space, 128 PiB in theory,
| though in practice 32 PiB of userland memory. See also [0] for
| discussion of the practical limits with 5-level paging too!
|
| [0]:https://github.com/lorenzo-stoakes/linux-mm-
| notes/blob/maste...
|
| [1]:https://en.wikipedia.org/wiki/Intel_5-level_paging
| db48x wrote:
| True, though /proc/cpuinfo only reports the address sizes,
| which are ultimately what the CPU cares about. Plus the most
| relevant limit is what your motherboard and wallet support,
| which is often far lower.
| barnacled wrote:
| Indeed, and as you say, you are hardly likely to hit those
| limits in any realistic (esp. home) setup. The actual
| meaningful limit is usually the CPU's physical one, as home
| CPUs very often have stringent memory limits (often 32 GiB or
| so), and of course you are also constrained by the
| motherboard's limitations.
|
| Having said that, I did write a patch to ensure that the system
| would boot correctly with 256 TiB of RAM [0], so perhaps I am
| not always a realist... or I dream of the day I can own such a
| system ;)
|
| [0]:https://git.kernel.org/pub/scm/linux/kernel/git/next/
| linux-n...
| db48x wrote:
| You're not the only one dreaming; I had to use >200GB of
| swap on my home system last year.
| sigjuice wrote:
| So are the 16 leftmost bits of a virtual address always 0?
| jcul wrote:
| Yes, generally for userspace addresses they are 0. But more
| importantly they can be used for other things, commonly
| referred to as pointer tagging / smuggling, etc.
|
| It's a useful optimisation technique: you can carry some extra
| metadata alongside a pointer and read it without having to
| dereference anything.
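|
| A toy sketch of the idea (assumes x86-64 userspace pointers
| whose top 16 bits are zero; the tag layout here is made up for
| illustration, and the tag must be masked off before the pointer
| is dereferenced):
|
|     #include <assert.h>
|     #include <stdint.h>
|     #include <stdio.h>
|     #include <stdlib.h>
|
|     #define TAG_SHIFT 48
|     #define ADDR_MASK ((1ULL << TAG_SHIFT) - 1)
|
|     /* Pack a 16-bit tag into the unused upper bits of a pointer. */
|     static void *tag_ptr(void *p, uint16_t tag)
|     {
|         return (void *)(((uintptr_t)p & ADDR_MASK) |
|                         ((uintptr_t)tag << TAG_SHIFT));
|     }
|
|     /* Read the tag without dereferencing anything. */
|     static uint16_t ptr_tag(void *p)
|     {
|         return (uint16_t)((uintptr_t)p >> TAG_SHIFT);
|     }
|
|     /* Strip the tag to recover a dereferenceable pointer. */
|     static void *untag_ptr(void *p)
|     {
|         return (void *)((uintptr_t)p & ADDR_MASK);
|     }
|
|     int main(void)
|     {
|         int *x = malloc(sizeof(*x));
|         *x = 42;
|
|         void *tagged = tag_ptr(x, 0x7);
|         printf("tag = %u\n", ptr_tag(tagged));
|
|         int *clean = untag_ptr(tagged);
|         assert(*clean == 42);
|
|         free(x);
|         return 0;
|     }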
| barnacled wrote:
| They have to be the same as the top addressable bit, i.e. in
| the case of a 48-bit virtual address size, bit 47; they are
| effectively sign-extended from it (see the sketch at the end of
| this comment).
|
| This is actually kind of a cute way of dividing kernel and
| userland space, as the upper bits end up all 1s for kernel
| addresses and all 0s for userland.
|
| EDIT: Specifically talking about x86-64 here.
|
| https://github.com/lorenzo-stoakes/linux-mm-
| notes/blob/maste...
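|
| A tiny sketch of the canonical-address rule for the 48-bit case
| (the example addresses below are made up):
|
|     #include <stdbool.h>
|     #include <stdint.h>
|     #include <stdio.h>
|
|     /* With 48 implemented virtual address bits, bits 63..48 must
|      * all equal bit 47, i.e. bits 63..47 are all 0s or all 1s. */
|     static bool is_canonical_48(uint64_t addr)
|     {
|         uint64_t upper = addr >> 47;
|         return upper == 0 || upper == 0x1ffff;
|     }
|
|     int main(void)
|     {
|         uint64_t user   = 0x00007fffdeadbeefULL; /* canonical, userland half */
|         uint64_t kernel = 0xffff8000deadbeefULL; /* canonical, kernel half   */
|         uint64_t bogus  = 0x0000800000000000ULL; /* non-canonical            */
|
|         printf("%d %d %d\n", is_canonical_48(user),
|                is_canonical_48(kernel), is_canonical_48(bogus));
|         return 0; /* prints 1 1 0 */
|     }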
| amscanne wrote:
| No, it must be sign-extended from the top bit of the
| valid set. Otherwise the address is non-canonical.
| xxpor wrote:
| This is true for x86-64, not true for other architectures
| such as arm64.
|
| Apple uses the high bits to cryptographically sign the
| pointer value.
| everybodyknows wrote:
| Fascinating. Does this confer some of the benefits of ECC
| RAM, for pointer data only -- without the hardware cost?
| db48x wrote:
| Oddly enough the unused bits are in the middle of the
| address. They're also sign-extended rather than filled
| with zeros, so sometimes they are ones and other times
| they are zeros.
| throwaway8581 wrote:
| It means that the CPU can address 40 bits' worth of physical
| memory (1 TiB), but that virtual memory addresses can use 48
| bits. Physical addresses are just your RAM bytes numbered 0
| through whatever. Virtual address space is the address space
| of a process, which includes mapped physical memory, unmapped
| pages, guard pages, and other virtual memory tricks.
| littlestymaar wrote:
| If you are thinking about the "which is often more memory than
| the system actually has" part, I don't know if it's outdated
| even today: the vast majority of Linux systems these days are
| Android phones, and I wouldn't be surprised at all if a good
| proportion of those didn't have more than 4GB of RAM.
| barnacled wrote:
| For anybody who's interested, I also wrote up a whole bunch of
| notes on this at https://github.com/lorenzo-stoakes/linux-vm-notes
| which is now superseded by the far more recent
| https://github.com/lorenzo-stoakes/linux-mm-notes
|
| I have made a few patches to the mm subsystem, some inspired
| simply by the research for these articles.
| aduitsis wrote:
| Thank you, this is great!
___________________________________________________________________
(page generated 2021-02-12 23:00 UTC)