[HN Gopher] Is RAM wiped before use in another LXC container?
___________________________________________________________________
Is RAM wiped before use in another LXC container?
Author : Aachen
Score : 280 points
Date : 2023-04-05 10:22 UTC (12 hours ago)
(HTM) web link (security.stackexchange.com)
(TXT) w3m dump (security.stackexchange.com)
| pmontra wrote:
| I thought that everybody knows how the CPU and the kernel manage
| memory. Everything else follows. However as a very good developer
| told me once, "I don't know how networking works, only that I
| have to type a URL." "How is that possible?" I replied. "They
| teach it in school." And he told me, "yes, but I studied graphic
| design, then discovered that I like programming."
|
| Replace and repeat with SQL, machine language, everything below
| modern programming languages.
| jk_i_am_a_robot wrote:
| If I could answer it would be a solid "maybe". If you're willing
| to consider the Translation Lookaside Buffer as part of the
| contents of RAM, then, if the claims are to be believed, Spectre
| and Meltdown could read the contents of other processes' (and
| containers') RAM.
| ritcgab wrote:
| I think the main insight is that user-level memory allocation
| does not necessarily involve the kernel. If the application uses
| `malloc` to get memory allocated, the real "application" that
| will (possibly) request free pages from the kernel is the memory
| allocator. If the space requested can be served by the pages
| already assigned to the allocator, it can just hand them out. So
| when you `malloc`, it is not zeroed, unless you know it is
| freshly from kernel (but you don't know the internal state of the
| allocator). That is why one needs to manually zero out space
| returned from `malloc`, or use `calloc` instead.
| CodesInChaos wrote:
| Is GPU memory wiped before it's handed to a new process, and is a
| process prevented from accessing GPU memory of a different
| process? (assuming low level APIs like Vulkan)
| tryauuum wrote:
| this is certainly a more interesting question than the OP's
| johntb86 wrote:
| By the Vulkan spec it is guaranteed that memory from one
| process can't be seen by another.
| https://registry.khronos.org/vulkan/specs/1.3-extensions/htm...
| says: In particular, any guarantees made by an operating system
| about whether memory from one process can be visible to another
| process or not must not be violated by a Vulkan implementation
| for any memory allocation.
|
| In theory you could have some sort of complex per-process
| scrambling system to avoid leaking information, but I think
| implementations actually just zero the memory.
|
| GPU drivers on different operating systems can be more or less
| buggy; Windows and Linux generally seem to do the right thing,
| but MacOS is a bit more haphazard.
| NotCamelCase wrote:
| You can request Windows to zero out every GPU memory
| allocation, but it's more of a driver thing that's not exposed
| in APIs. Such an option is likely to be off by default in
| drivers, as it might induce additional, unintended overhead. In
| practice, you are likely to see memory cleared more often than
| not due to other reasons, though.
|
| You can't just peek at another process' GPU memory via a UMD app,
| either. Per-process virtual memory mechanisms similar to CPUs
| are also present in GPUs, which is the whole reason that
| resources are explicitly imported/exported across APIs via
| special API calls.
| coppsilgold wrote:
| It is if you tell the kernel.
|
| zero memory on free (3-5% average system performance impact, due
| to touching cold memory): init_on_free=1
|
| zero memory on alloc (<1% average system performance impact):
| init_on_alloc=1
| adamgordonbell wrote:
| My shortcut for answering these questions is just replace
| 'container' with process and find the answer for that.
|
| A running container is just a process that is using certain
| kernel features: cgroups, namespaces, pivot_root.
|
| LXC has a different DX around containers than OCI containers do,
| but the same rules apply.
| tomcam wrote:
| Thank you! Annoyed at myself for not thinking it through that
| way.
| commandersaki wrote:
| One exercise on pwn.college in their kernel security module is
| they have a program that forks, opens the flag file (/flag
| owned by root), reads the content, and then child process
| exits.
|
| The program continues to run to allow you to load in shell code
| that'll be run by the kernel.
|
| Your task is basically to write some shellcode to scan the
| memory for the flag. So now I know that at least some Linux-es
| don't clean up after a process exits, and you can get the
| contents of memory when you have kernel privileges. This is not
| so easy if you're scanning memory as root since /dev/mem or
| whatever won't reveal that process memory.
| dathinab wrote:
| > don't clean up after a process exits
|
| exactly, the only guarantee is that things are zeroed before
| handing them out to a different process, but there is some
| potential time gap between releasing memory back to the
| kernel and it being cleaned, a gap which can outlive the life
| of a process
|
| > and you can get the contents of memory when you have kernel
| privileges. This is not so easy [..] as root
|
| yes, root has far fewer privileges than the kernel, but can
| often gain kernel privileges.
|
| But this is where e.g. lockdown mode comes in, which denies
| the root user such privilege escalation (oversimplified,
| it's complicated). The main problem is that lockdown mode is
| not yet compatible with suspend to disk (hibernation), even
| though its documentation implies it is if you have encrypted
| hibernation. (This is misleading, as it refers to a
| not-yet-existing feature where the kernel creates an
| encrypted image which is also tamper-proof even if root
| tries to tamper with it. On the other hand, suspend to an
| encrypted partition is possible in Linux, but not enough for
| lockdown mode to work.)
| CodesInChaos wrote:
| By default memory is wiped before it's handed out again, not
| when it's freed. This improves performance, but means secrets
| can remain in RAM for longer than necessary, where they can
| be accessed by privileged attackers (software running as
| root, DMA without MMU, or hardware attacks). For unprivileged
| processes eager and lazy zeroing look the same.
| seri4l wrote:
| Apparently there's a kernel config flag to zero the memory
| on free (CONFIG_INIT_ON_FREE_DEFAULT_ON), but it has quite
| an expensive performance cost (3-5% according to the docs). I
| wonder in what kind of scenario it would make sense to
| enable it.
| MR4D wrote:
| So, if it's only 3-5% slower, then for $50-100 I could
| buy a slightly faster processor and never know the
| difference?
|
| Just trying to check my understanding of what the 3-5%
| delta is. Seems like a tiny tradeoff for any workstation
| (I wouldn't notice the difference at least). The tradeoff
| for servers might vary depending on what they are doing
| (shared versus owned, etc)
| postalrat wrote:
| How many thousand tradeoffs like this are you willing to
| pay for?
| soulofmischief wrote:
| This seems beneficial in systems where security concerns
| trump performance concerns. The above poster has probably
| made many such trade-offs already and would likely make
| more. (Full disk encryption, virtualization, protection
| rings, spectre mitigations, MMIO, ECC, etc.)
|
| With exponentially increasing processor performance it
| does make sense for workstations where physical access
| should be considered in the threat model.
| dathinab wrote:
| When running non-performance-sensitive but security-
| sensitive code. Even adding protections summing up to
| much higher performance penalties can be very acceptable.
|
| E.g. on a crypto key server. Less so if it's a server which
| encrypts data en masse, but e.g. one which signs longer-
| lived auth tokens, or one which holds intermediate
| certificates used once every few hours to create a cert
| that encrypts/signs data en masse on a different server,
| etc.
| gus_massa wrote:
| I don't understand why it is slower. It has to be zeroed
| anyway.
|
| In the normal configuration:
|
| Is it not zeroed if the memory is assigned to the same
| process???
|
| Is it zeroed when the system is idle???
|
| Is it zeroed in batches that are more memory friendly???
| dathinab wrote:
| > Is it zeroed when the system is idle???
|
| yes mainly that,
|
| and if the system isn't idle but also doesn't use all
| phys. memory it might not be zeroed for a very long time
|
| > Is it not zeroed if the memory is assigned to the same
| process???
|
| I don't know the current state of this in Linux, but at
| least in the past, for some systems and some use cases
| related to mapped memory, this was the case.
| [deleted]
| wongarsu wrote:
| Two reasons:
|
| - lots of code is written under the assumption that free
| is fast
|
| - memory is zeroed in the background, unless memory
| pressure forces the kernel to zero when handing it out
| KMag wrote:
| > I don't understand why it is slower. It has to be
| zeroed anyway.
|
| Memory pages freed from userspace might be reused in
| kernelspace.
|
| If, for instance, the memory is re-used in the kernel's
| page cache, then the kernel doesn't need to zero it out
| before copying the to-be-cached data into the page.
|
| Edit: I seem to remember back in the 1990s that the
| kernel at least in some cases wouldn't zero-out pages
| previously used by the kernel before giving them to
| userspace, sometimes resulting in kernel secrets being
| leaked to arbitrary userspace processes. Maybe I'm
| misremembering, and it was just leakage of secrets
| between userspace processes. In any case, in the 1990s,
| Linux was way too lax about leaking data from freed
| pages.
| matt_heimer wrote:
| Just a guess but since apps can fail to free memory
| correctly you probably have to zero it on allocation and
| deallocation (to be secure) when you enable the feature.
| So you aren't swapping one for the other, you are now
| doing both.
| Denvercoder9 wrote:
| > Just a guess but since apps can fail to free memory
| correctly
|
| That's not relevant here; from the perspective of the
| kernel pages are either assigned to a process, or they're
| not. If an application fails to free memory correctly,
| that only means it'll keep having pages assigned to it
| that it no longer uses, but eventually those pages will
| always be released (by the kernel upon termination of the
| process, in the worst case).
| dfox wrote:
| That is the worst case if the process had leaked that
| part of the heap, but it is an optimal case on process
| exit. On an OS with any kind of process isolation, walking
| over most of the heap before exiting just to "correctly
| free it" is a pure waste of CPU cycles, and in the worst
| case even of IO bandwidth (when it causes parts of the heap
| to be paged in).
| samus wrote:
| Paging those pages in can be avoided entirely if the
| intention is just to zero them. The kernel could either
| just "forget" them, or use copy-on-write with a properly
| zeroed-out page as a base.
| shawabawa3 wrote:
| My guess is it won't always have to be zeroed
|
| e.g. if your code is doing ptr = malloc(n);
| memcpy(ptr, mydata, n);
|
| You can presumably optimise out the zeroing of the memory
| KMag wrote:
| As far as I know, the Linux kernel never inspects the
| userspace thread to adjust behavior based on what the
| thread is going to do next. This would be a very brittle
| sort of optimization.
|
| More importantly, it's not safe. Another thread in the
| same process can see ptr between the malloc and the
| memcpy!
|
| Edit: also, of course, malloc and memcpy are C runtime
| functions, not syscalls, so checking what happens after
| malloc() would require the kernel to have much more
| sophisticated analysis than just looking a few
| instructions ahead of the calling thread's %eip/%rip.
| While handling malloc()'s mmap() or brk() allocation, the
| kernel would need to be able to look one or two call
| frames up the call stack, past the metadata accounting
| that malloc is doing to keep track of the newly acquired
| memory, perhaps look at a few conditional branches, trace
| through the GOT and PLT entries to see where the memcpy
| call is actually going, and do so in a way that is robust
| to changes in the C runtime implementation. (Of course,
| in practice, most C compilers will inline a memcpy
| implementation, so in the common case, it wouldn't have
| to chase the GOT and PLT entries, but even then, it's way
| too complicated for the kernel to figure out if anything
| non-trivial is happening between mmap()/brk() and the
| memory being overwritten.)
|
| Edit 2: To be robust in the completely general case, even
| if it were trivial to identify the inlined memcpy
| implementation, and it were clearly defined "something
| non-trivial happens", determining if "something non-
| trivial happens" between mmap()/brk() and memcpy() would
| involve solving the halting problem. (Impossible in the
| general case.)
| mjevans wrote:
| I'm NOT an expert here, but offhand:
| malloc() == 'reservation' of memory (but not paged in!)
| // if touched/updated, THEN the memory is paged in
|
| A copy _might_ not even become a copy if the kernel's
| smart enough / able to setup a hardware trigger to force
| a copy on writes to that area, at which point the
| physical memory backing two distinct logical memory zones
| would be copied and then different.
| KMag wrote:
| That's a good point that Linux doesn't actually allocate
| the pages until they're faulted in by a read or write.
| So, if it were doing some kind of thread inspection
| optimization, it would presumably just need to check if
| the faulting thread is currently in a loop that will
| overwrite at least the full page.
|
| However, that wouldn't solve the problem of other threads
| in the same process being able to see the page before
| it's fully overwritten, or debugging processes, or using
| a signal handler to invisibly jump out of the
| initialization loop in the middle, etc. There are
| workarounds to all of these issues, but they all have
| performance and complexity costs.
| sumtechguy wrote:
| malloc gets memory from the heap which may or may not be
| paged in/reused. That means you may get reused memory
| from the heap (which is up to the CRT).
|
| If you want make sure it is zero you will want calloc. If
| you know you are going to copy something in on the next
| step like your example you probably can skip calloc and
| just use malloc. calloc is nice for when you are doing
| things like linked lists/trees/buffers and do not want
| extra steps to clean out the pointers or data.
| vlovich123 wrote:
| Yes. It's zeroed in a low priority background process to
| avoid interfering with foreground apps.
| bayindirh wrote:
| Any multi-user system where users don't know each other
| and handle sensitive data.
| LinuxBender wrote:
| Do you mean _init_on_alloc=1_ and _init_on_free=1_? Here
| [1] is a thread on the options and performance impact.
| FWIW I use it on all my workstations but these days I am
| not doing anything that would be greatly impacted by it.
| I've never tried it on a gaming machine and never tried
| it on a large memory hypervisor.
|
| I wish there were flags similar to this for the GPU
| memory. Even something that zeroes GPU memory on reboot
| would be nice. I can always see the previous desktop
| after a reboot for a brief moment.
|
| [1] - https://patchwork.kernel.org/project/linux-
| mm/patch/20190617...
| ape4 wrote:
| I believe the docs but I would have thought that memset()
| would be really quick - implemented in hardware?
| dataflow wrote:
| "Real quick" is human speak. For large amounts of memory
| it's still bound by RAM speed for a machine, which is
| much lower (a couple orders of magnitude I believe) than,
| say, cache speed. Things might be different if there was
| a RAM equivalent of SSD TRIM (making the RAM module zero
| itself without transferring lots of zeros across the
| bus), but there isn't.
| throwaway894345 wrote:
| I'm completely unfamiliar with how the CPU communicates
| with the memory modules, but is there not a way for the
| CPU to tell the memory modules to zero out a whole range
| of memory rather than one byte/sector/whatever-the-
| standard-unit-is at a time?
|
| As I type this, I'm realizing how little I know about the
| protocol between the CPU and the memory modules--if
| anyone has an accessible link on the subject, I'd be
| grateful.
| dataflow wrote:
| That's what I referred to as "TRIM for RAM". I'm not
| aware of it being a thing. And I don't know the protocol,
| but I'm also not sure it's just a matter of protocol. It
| might require additional circuitry per bit of memory that
| would increase the cost.
| mjevans wrote:
| 'trim' for RAM is a virtual to physical page table hack.
| Memory that isn't backed by a page is just a zero, it
| doesn't need to be initialized. Offhand it's supposed to
| be before it's handed to a process, but I don't know if
| there are E.G. mechanisms to use some spare cycles to
| proactively zero non-allocated memory that's a candidate
| for being attached to VM space.
| andrewf wrote:
| Oldie but a goodie: https://people.freebsd.org/~lstewart/
| articles/cpumemory.pdf
| vlovich123 wrote:
| No. Memset (and bzero) aren't HW accelerated. There is a
| special CPU instruction that can do it, but in practice
| it's faster to do it in a loop. In user space you can
| frequently leverage SIMD instructions to speed it up (of
| course those aren't available in the kernel, because it
| avoids saving/restoring those and the FP registers on
| every syscall; they're only saved/restored on context
| switches).
|
| What could be interesting is if there were a CPU
| instruction to tell the RAM to do it. Then you would avoid
| the memory bandwidth impact of freeing the memory. But I
| don't think there's any such instruction in the CPU/memory
| protocol even today. Not sure why.
| dathinab wrote:
| Though modern CPUs are explicitly built to make sure
| such a loop is fast.
|
| And in some cases on some systems the DRM controller
| might zero the memory in some situations, in which case
| you could say it was done by hardware.
| pflanze wrote:
| > DRM controller
|
| Did you mean DMA controller? Or do you have more
| information?
| dathinab wrote:
| yes DMA, not the direct rendering manager ;=)
| Arrath wrote:
| That seems wild to be honest. I know how easy it is to
| say "well they can just.."
|
| But...wouldn't it be relatively trivial to have an
| instruction that tells the memory controller "set range
| from address y to x to 0" and let it handle it? Actually
| slamming a bunch of 0's out over the bus seems so very
| suboptimal.
| mlyle wrote:
| > But...wouldn't it be relatively trivial to have an
| instruction that tells the memory controller "set range
| from address y to x to 0" and let it handle it?
|
| Having the memory controller or memory module do it is
| complicated somewhat because it needs to be coherent with
| the caches, needs to obey translation, etc. If you have
| the memory controller do it, it doesn't save bandwidth.
| But, on the other hand, with a write back cache, your
| zeroing may never need to get stored to memory at all.
|
| Further, if you have the module do it, the module/sdram
| state machine needs to get more complicated... and if you
| just have one module on the channel, then you don't
| benefit in bandwidth, either.
|
| A DMA controller can be set up to do it... but in
| practice this is usually more expensive on big CPUs than
| just letting a CPU do it.
|
| It's not _really_ tying up a processor because of
| superscalar, hyperthreading, etc, either; modern
| processors have an abundance of resources, and what slows
| things down is work that must be done serially or
| resources that are most contended (like the bus to
| memory).
| Arrath wrote:
| Thanks for the answer!
| dathinab wrote:
| really quick still doesn't mean it's free, especially if
| you always have to zero all the allocated pages even if
| the process might just have used part of the page.
|
| Also the question is what is this % in relation to?
|
| Probably that freeing gets up to 5% slower, which is
| reasonable given that before, you could often use idle
| time to zero many of the pages, or might not have zeroed
| some of the pages at all (as they were never reused).
| MarkSweep wrote:
| Some processors have "hardware store elimination" that
| makes writing all zeros a bit faster than writing other
| values.
|
| https://travisdowns.github.io/blog/2020/05/13/intel-zero-
| opt...
| CodesInChaos wrote:
| I'd like the ability to control this at a process or even
| allocation (i.e. as a flag on mmap) level. That way a
| password manager could enable this, while a game could
| disable it.
| bhawks wrote:
| You want to enable this if you're concerned about forensic
| attacks. A simple example: someone has physical
| access to your device. They're able to power it down and
| boot it with their own custom kernel. If the memory has
| not been eagerly zeroed, they may be able to extract
| sensitive data from RAM.
|
| This flag puts an additional obstacle in the attacker's
| path. If you have private key material protecting
| valuable property, you definitely want to throw up as
| many roadblocks as possible.
| Wowfunhappy wrote:
| How is the attacker powering down the device while
| retaining the contents of its RAM?
| soulofmischief wrote:
| If your PC is connected to a power strip, it's my
| understanding that law enforcement can attach a live
| male-to-male power cable to the power strip and then
| remove the power strip from the wall while still powering
| the computer. That, and yeah freezing ram.
| Gracana wrote:
| Data fades slowly from DRAM, especially if you freeze it
| first.
| l33t233372 wrote:
| Perhaps by using a can of compressed air[0].
|
| [0] https://www.usenix.org/legacy/event/sec08/tech/full_p
| apers/h...
| l33t233372 wrote:
| I don't understand why this would help prevent cold boot
| attacks.
|
| Wouldn't the memory need to be freed first for this to
| have any effect?
| ender341341 wrote:
| The idea is that well-written software would release
| memory as soon as possible so with it enabled you'd have
| the secret in memory for as little time as possible.
|
| Though in my mind, well-written software should be zeroing
| the memory out before freeing if it held sensitive data.
| bhawks wrote:
| Yes it would - either through the free syscall or a
| process exit. This is a defense in depth strategy and not
| 100% perfect. If you yanked the power cord and a long
| lived process had sensitive data in memory you're still
| vulnerable. But if you had a clean power down or very
| short lifetimes of sensitive data being active in RAM it
| would afford you additional security.
| WalterBright wrote:
| ?? Cutting the power means the RAM contents vanish.
| SllX wrote:
| They vanish eventually which is usually measured in
| seconds. This can be extended to minutes or hours if
| someone performs a cold boot attack: https://security.sta
| ckexchange.com/questions/10643/recover-t...
| l33t233372 wrote:
| I find that phrasing weird.
|
| A cold boot attack relies on a cold boot of the system to
| evade kernel protections (as opposed to a warm boot, where
| the kernel can zero memory).
|
| The name has nothing to do with reducing the temperature
| of the ram to extend the time it takes bytes to vanish in
| ram.
| SllX wrote:
| I think it's a little bit of column A and a little bit of
| column B, but I admit that while I remember reading about
| using this technique a long time ago, I'm not sure of
| the nomenclature. From the StackExchange:
|
| > For those who think this is only theoretical: They were
| able to use this technique to create a bootable USB
| device which could determine someone's Truecrypt hard-
| drive encryption key automatically, just by plugging it
| in and restarting the computer. They were also able to
| recover the memory-contents 30 minutes+ later by freezing
| the ram (using a simple bottle of canned-air) and
| removing it. Using liquid nitrogen increased this time to
| hours.
| l33t233372 wrote:
| Reducing the temperature of the RAM can be done to make a
| cold boot attack easier, but it's not the origin of the
| name.
|
| For more details, see the paper Lest We Remember.
| WalterBright wrote:
| i didn't know that. Thanks!
| zamnos wrote:
| Here, I made it clickable: http://pwn.college
| gaudat wrote:
| Oh it is free too. Gotta try this.
| djbusby wrote:
| The memory wipe is a kernel build-time config option.
| CodesInChaos wrote:
| Do you know what the option is called?
|
| edit: apparently the runtime option is called
| `init_on_alloc` and the compile-time option (which
| determines the default of the runtime option) is called
| `CONFIG_INIT_ON_FREE_DEFAULT_ON`.
| jwilk wrote:
| There are two parameters:
|
| * init_on_alloc (default set by
| CONFIG_INIT_ON_ALLOC_DEFAULT_ON)
|
| * init_on_free (default set by
| CONFIG_INIT_ON_FREE_DEFAULT_ON)
| sgt wrote:
| DX?
| adamgordonbell wrote:
| Developer Experience (maybe there is a better term, but I
| mean how you use them)
|
| Docker/ OCI containers tend to be a single process and LXC
| containers have a collection of processes.
|
| But in either case they are just processes running on the
| host in a different namespace.
|
| So they feel different to use, but use the same building
| blocks (to my understanding).
| Karellen wrote:
| Shouldn't "DX" refer to the experience of people hacking on
| the Docker (or whatever) code itself? Like... how easy the
| build system is to work with if you're adding a new source
| file to the docker codebase?
|
| The people working with Docker, even if they are developers
| doing development work, are still users _of Docker_ ,
| aren't they? I mean, the GUI of an IDE is still part of its
| UX, right? Even though it's for developers doing
| development work?
| adamgordonbell wrote:
| I was thinking that developer experience is user
| experience - where the user is a developer. You are
| suggesting that the user and developer are different
| roles because even when the user is a developer there is
| still the developer who builds that tool.
|
| It's possible you are right but I'm not an expert. I
| always think of developer experience as the
| experience of developers using the tools and APIs you
| produce.
| sgt wrote:
| Unrelated, but your name rang a... bell. You're the
| corecursive guy. Great podcast!
| adamgordonbell wrote:
| Thanks!
| dwb wrote:
| "Developer experience" (sigh)
| Proven wrote:
| [dead]
| vrglvrglvrgl wrote:
| [dead]
| worthless-trash wrote:
| One day, the general technical population will understand
| containers are just processes on a system. No magic under the
| hood.
| bluedino wrote:
| Is the confusion coming from the fact that when you malloc()
| memory in C you aren't guaranteed zeroed memory?
| dale_glass wrote:
| I tried on Linux, and got all zeroes.
|
| Of course that's for a trivial program. If you freed something,
| that probably wasn't returned to the OS, and the next malloc
| might just recycle it.
| bluedino wrote:
| Malloc a bunch of memory, dirty it, free it, then malloc some
| more. And then check.
|
| Your first request in your program, you'll get clean memory
| from the OS.
| jcarrano wrote:
| I'm curious, when did OSes first begin zeroing out pages? Did the
| first operating systems already zero the memory?
| hansendc wrote:
| Uh... Did I miss the patches that add a pre-zeroed page pool to
| Linux? Wouldn't be the first time I missed something like that
| getting added, but 6.3-rc5 definitely zeroes _some_ pages at
| allocation time, and I don't see any indication of it consulting
| a prezeroed page pool:
| https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
| gregw2 wrote:
| It's not clear to me why the top rated reply which seemed
| extremely knowledgeable in discussing Linux kernel auto zeroing
| of memory mentioned a bunch of caveats including mentioning "some
| memory management libraries may reuse memory" but didn't say
| "including glibc".
|
| That's a pretty big library to gloss over from a security or
| working programmer perspective!
|
| (Corrections to perspective welcome.)
| hn92726819 wrote:
| I thought the shared memory they were referring to would be
| read only memory or something. So yes, the memory is shared,
| but your process doesn't have any access to overwrite any of
| it. Likewise, when you call library functions, when it mallocs
| something, it uses memory that your process has write access
| to, not the shared library memory.
|
| In the context of the question, I assume the asker was mostly
| interested in reading some kind of sensitive data from a
| previous process, not reading the same library-code-only memory
| or something.
|
| Note: all of this could be wrong, it was just my understanding
|
| Edit: looks like this answers that:
| https://stackoverflow.com/questions/20857134/memory-write-pr...
| IcePic wrote:
| Even if your libc IS one of the offenders, this libc code runs
| in a single process, so whether or not libc uses
| evil tricks for your process, it doesn't mean you will get
| pages from some other process's memory map, only that you may
| get your own old pages back.
| gregw2 wrote:
| Thanks, makes sense. (Looks like the poster commenting about
| glibc has corrected themself.)
| Aachen wrote:
| I found the currently accepted answer (by user10489) interesting,
| tying together a lot of concepts I had heard of but didn't
| properly understand. For example, looking in htop, chromium has
| six processes each using a literal terabyte of virtual memory,
| which is obviously greater than swap + physical RAM on my
| ordinary laptop. Or, why does an out-of-memory system hang
| instead of just telling the next process "nope, we don't have
| 300KB to allocate to you anymore" and either crash the program or
| let the program handle the failed allocation gracefully? This
| answer explains how this all works.
|
| The TL;DR answer to the actual question is: processes generally
| don't get access to each other's memory unless there is some
| trust relation (like being the parent process, or being allowed
| to attach a debugger), and being in a container doesn't change
| that, the same restrictions apply and you always get zeroed-out
| memory from the kernel. It's when you use a different allocator
| that you might get nonzeroed memory from elsewhere in your own
| process (not a random other process).
| jeffbee wrote:
| In Linux, a task that tries to allocate more than the allowed
| amount will be immediately terminated and not notified about
| anything. The reason out-of-memory systems thrash is merely
| because people do not always bother setting the limits, and the
| default assumes users prefer a process that continues through
| hours of thrashing instead of being promptly terminated.
| Sakos wrote:
| Who is "people" in this? If you mean your average user, it's
| unrealistic to expect them to know or care about details like
| this. This is the kind of thing that needs sane defaults.
| dale_glass wrote:
| > Or, why does an out-of-memory system hang instead of just
| telling the next process "nope, we don't have 300KB to allocate
| to you anymore"
|
| Blame UNIX for that, and the fork() system call.
|
| It's a design quirk. fork() duplicates the process. So suppose
| your web browser consumes 10GB RAM out of the 16GB total on the
| system, and wants to run a process for anything. Like it just
| wants to exec something tiny, like `uname`.
|
| 1. 10GB process does fork().
|
| 2. Instantly, you have two 10GB processes
|
| 3. A microsecond later, the child calls exec(), completely
| destroying its own state and replacing it with a 36K binary,
| freeing 10GB RAM.
|
| So there's two ways to go there:
|
| 1. You could require step 2 to be a full copy. Which means
| either you need more RAM, a huge chunk of which would always
| sit idle, or you need a lot of swap, for the same purpose.
|
| 2. We could overlook the memory usage increase and pretend that
| we have enough memory, and only really panic if the second
| process truly needs its own 10GB RAM that we don't have. That's
| what Linux does.
|
| The problem with #2 is that dealing with this happens
| completely in the background, at moments completely
| unpredictable to the code. The OS allocates memory when the
| child writes to memory, like doing "a=1" somewhere. A program
| can't handle memory allocation failures there because, as far
| as it knows, it isn't allocating anything.
|
| So what you get is this fragile fiction that sometimes breaks
| and requires the kernel to kill something to maintain the
| system in some sort of working state.
|
| Windows doesn't have this issue at all because it has no
| fork(). New processes aren't children and start from scratch,
| so firefox never gets another 10GB sized clone. It just starts
| a new, 36K sized process.
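The fork()-then-exec() dance described above can be sketched in a few lines of Python (a hedged illustration of the pattern, not any poster's actual code; assumes a POSIX system with `sh` on the PATH):

```python
import os

# Parent forks; for an instant the child is a copy-on-write
# duplicate of the whole process.
pid = os.fork()
if pid == 0:
    # Child: immediately replace the duplicated image with a tiny
    # program, releasing any claim on the parent's memory.
    os.execvp("sh", ["sh", "-c", "exit 7"])
    os._exit(127)  # reached only if exec fails

# Parent: collect the child's exit status.
_, status = os.waitpid(pid, 0)
code = os.WEXITSTATUS(status)
```

Between the fork() and the exec(), the child "owns" the parent's whole address space on paper, which is exactly the window overcommit papers over.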
| l33tman wrote:
| The Windows example is a non-sequitur: in both cases you end
| up with the 36K sized process, on Windows and Linux alike, if
| you want to spawn a sub-process that exec's. The fork() =>
| exec() path is not the issue (if there is an issue at all
| here), and if you use threading the memory is not forked like
| this to begin with (on either of the OSes).
|
| I guess the case you want to highlight is more if you for
| example mmap() 10 GB of RAM on that 16 GB machine that only
| has 5 GB unused swap space left and where all of the physical
| RAM is filled with dirty pages already. Should the mmap()
| succeed, and then the process is killed if it eventually
| tries to use more pages than will fit in RAM or the backing
| swap? This is the overcommit option which is selectable on
| Linux. I think the defaults seem pretty good and accept that
| a process can get killed long after the "explicit" memory
| mapping call is done.
| peheje wrote:
| I find your writing style really pleasant and understandable!
| Much more so than the StackExchange answer. I really like the
| breakdown into steps, then what could happen steps and the
| follow ups. Where can I read more (from you?) in this style
| about OS and memory management?
| josefx wrote:
| > Blame UNIX for that, and the fork() system call.
|
| Given that most code I have seen would not be able to handle
| an allocation failure gracefully, I wouldn't call it "blame".
| If the OS just silently failed memory allocations on whatever
| program tried to allocate next, you would basically end up
| with a system where random applications crash, which is
| similar to what the OOM killer does, just with no attempt to
| be smart about it. Even better, it is outright impossible to
| gracefully handle allocation failures in some languages; see
| for example variable-length arrays in C.
| Arnavion wrote:
| No code is written to handle allocation failure because it
| knows that it's running on an OS with overcommit where
| handling allocation failure is impossible. Overcommit means
| that you encounter the problem not when you call `malloc()`
| but when you do `*pointer = value;`, which is impossible to
| handle.
| kevin_thibedeau wrote:
| Plenty of code runs on systems without that behavior.
| Graceful handling of malloc failure is still useful.
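As a hedged sketch of what graceful handling can look like, here it is in Python, where the runtime surfaces allocator failure as MemoryError (the function name is made up for illustration):

```python
def load_buffer(nbytes):
    """Try to allocate a working buffer; degrade instead of crashing."""
    try:
        return bytearray(nbytes)
    except MemoryError:
        # Fall back: the caller can use a smaller buffer, stream
        # from disk, shed load, etc.
        return None

buf = load_buffer(1024)  # a small request that should succeed
```

The catch, as discussed upthread, is that under Linux's default overcommit this branch may never fire for a large request: the allocation "succeeds" and the process is killed later when the pages are touched.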
| Arnavion wrote:
| I know. I myself write code that checks the result of
| malloc. I was responding with josefx's words.
| tjoff wrote:
| That is a very weak argument.
|
| Also, why would you bother to handle it gracefully when the
| OS won't allow you to do it?
|
| Also, outright impossible in some languages? Just don't use
| VLAs then? "Problem" solved.
| josefx wrote:
| > Also, why would you bother to handle it gracefully when
| the OS won't allow you to do it?
|
| There are many situations where you can get an allocation
| failure even with overcommit enabled.
|
| > Just don't use VLAs if then? "Problem" solved.
|
| Yes, just don't use that language feature that is
| visually identical to a normal array. Then make sure that
| your standard library implementation doesn't have random
| malloc calls hidden in functions that cannot communicate
| an error and abort instead
| https://www.thingsquare.com/blog/articles/rand-may-call-mall....
| Then ensure that your dependencies follow the same standards
| of handling allocation failures ... .
|
| I concede that it might be possible, but you are working
| against an ecosystem that is actively trying to sabotage
| you.
| tjoff wrote:
| VLAs are barely used and frowned upon by most. It is not
| relevant enough to discuss.
|
| Yes, mallocs in the standard library are a problem. But this
| is rather the result of a mindset where overcommit exists
| than anything else.
| jacquesm wrote:
| > 2. Instantly, you have two 10GB processes
|
| No, that's not how it works. The page tables get duplicated
| and copy-on-write takes care of the pages. As long as they
| are identical they will be shared; there is no way that 10GB
| of RAM will be allocated to the forked process and that all
| of the data will be copied.
| blitzkrieg3 wrote:
| This is the only right answer. What actually happens is you
| instantly have two 10G processes backed by the same physical
| pages, and:
|
| 3. A microsecond later, the child calls exec(),
| decrementing the reference count to the memory shared with
| the parent[1] and faulting in a 36k binary, bringing our
| new total memory usage to 1,048,612KB (1,048,576K + 36K)
|
| CoW has existed since at least 1986, when CMU developed the
| Mach kernel.
|
| What GP is really talking about is overcommit, which is a
| feature (on by default) in Linux which allows you to ask
| for more memory than you have. This was famously a
| departure from other Unixes at the time[2], a departure
| that fueled confusion and countless flame wars in the early
| Internet.
|
| [1] https://unix.stackexchange.com/questions/469328/fork-and-cow...
| [2] https://groups.google.com/g/comp.unix.solaris/c/nLWKWW2ODZo/...
| worthless-trash wrote:
| OP is referring to what would happen in the naive
| implementation, not what actually happens.
| reisse wrote:
| The data won't be copied, but the kernel has to reserve
| memory for both processes.
| l33tman wrote:
| No, it doesn't. See "overcommit"
| enedil wrote:
| In a sense it does. Page tables would need to be copied.
| jacquesm wrote:
| Yes, but they all point to the _same_ pages. The tables
| take up a small fraction of the memory of the pages
| themselves.
| enedil wrote:
| But a large fraction, if all you do afterwards is an exec
| call. Given 8 bytes per page table entry and 4K pages,
| it's 1/512 of memory wasted. So if your process uses 8GB,
| it's 16MB. Still takes noticeable time if you spawn
| often.
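The arithmetic above checks out; a quick sketch (assuming 4 KiB pages and 8-byte leaf page-table entries, and ignoring the upper levels of the table):

```python
PAGE_SIZE = 4096   # bytes per page
PTE_SIZE = 8       # bytes per leaf page-table entry

# Fraction of mapped memory spent on leaf page-table entries.
overhead_ratio = PTE_SIZE / PAGE_SIZE          # 1/512

# Leaf page-table bytes needed to map an 8 GiB process.
mapped = 8 * 2**30
pte_bytes = (mapped // PAGE_SIZE) * PTE_SIZE   # 16 MiB
```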
| lxgr wrote:
| Aren't page tables nested? I don't know if any OS or
| hardware architecture actually supports it, but I could
| imagine the parent-level page table being virtual and
| copy-on-write itself.
| jacquesm wrote:
| I've never had the page tables be the cause of out-of-memory
| issues. They are usually pre-allocated to avoid recursive
| page faults, but nothing would stop you from making the page
| tables themselves also copy-on-write during a fork.
| reisse wrote:
| Please read the parent comments. Overcommit is necessary
| exactly because the kernel has to reserve memory for both
| processes, and overcommit allows reserving more memory
| than is physically present.
|
| If the kernel could not reserve memory for the forked
| process, overcommit would not be necessary.
| blitzkrieg3 wrote:
| This is a misconception you and parent are perpetuating.
| fork() existed in this problematic 2x memory
| implementation _way_ before overcommit, and overcommit
| was non-existent or disabled on Unix (which has fork())
| before Linux made it the default. Today with CoW we don't
| even have this "reserve memory for forked process"
| problem, so overcommit does nothing for us with regard to
| fork()/exec() (to say nothing of the vfork()/clone()
| point others have brought up). But if you want you can
| still disable overcommit on Linux and observe that your
| apps can still create new processes.
|
| What overcommit enables is more efficient use of memory
| for applications that request more memory than they use
| (which is most of them) and more efficient use of page
| cache. It also pretty much guarantees an app gets memory
| when it asks for it, at the cost of getting oom-killed
| later if the system as a whole runs out.
| lxgr wrote:
| I think you've got it backwards: With overcommit, _there
| is no memory reservation_. The forked process gets an
| exact copy of the other's page table, but with all
| writable memory marked as copy-on-write instead. The
| kernel might well be tallying these up to some number,
| but nothing important happens with it.
|
| Only without overcommit does the kernel need to start
| accounting for hypothetically-writable memory before it
| actually is written to.
| reisse wrote:
| Yes, you're right here, that was what I wanted to say but
| probably was not able to formulate.
| senko wrote:
| This is more or less what the second 2. explains:
|
| > 2. We could overlook the memory usage increase and
| pretend that we have enough memory, and only really panic
| if the second process truly needs its own 10GB RAM that we
| don't have. That's what Linux does
|
| "pretend" - share the memory and hope most of it will be
| read-only or unallocated eventually; "truly needs to own" -
| CoW
| jacquesm wrote:
| It will _never_ happen. To begin with all of the code
| pages are going to be shared because they are not
| modified.
|
| Besides that the bulk of the fork calls are just a
| preamble to starting up another process and exiting the
| current one. It's mostly a hack to ensure continuity for
| stdin/stdout/stderr and some other resources.
| msm_ wrote:
| It will _most likely_ not happen? It's absolutely
| possible to write a program that forks and where both forks
| overwrite 99% of the shared memory pages. It almost never
| happens, which is GP's point, but it's possible, and that is
| the reason it's a fragile hack.
|
| What usually happens in practice is you're almost OOM,
| and one of the processes running in the system writes to
| a page shared with another process, forcing the system to
| start good ol' OOM killer.
| jacquesm wrote:
| 99% isn't 100%.
|
| Sorry, but no, it _can't_ happen: you cannot fork a
| process and end up with twice the memory requirements
| just because of the fork. What you can do is simply
| allocate more memory than you were using before and keep
| writing.
|
| The OOM killer is a nasty hack, it essentially moves the
| decision about what stays and what goes to a process that
| is making calls way above its pay grade, but overcommit
| and OOM go hand in hand.
| [deleted]
| blitzkrieg3 wrote:
| It does not happen using fork()/exec() as described
| above. For it to happen we would need to fork() and
| continue using old variables and data buffers in the
| child that we used in the parent, which is a valid but
| rarely used pattern.
| RcouF1uZ4gsC wrote:
| In retrospect, fork was big mistake. It complicates a bunch
| of other things and introduces suboptimal patterns.
| t0suj4 wrote:
| A small reminder: in the age of Unix, multiuser systems were
| very common. Fork was the optimal solution for serving as
| many concurrent users or programs as possible while keeping
| the implementation simple.
|
| Today's RAM is cheap.
| moonchrome wrote:
| So many of the design constraints in our base abstractions
| are not relevant today, but we're still cobbling together
| solutions built on legacy technical decisions.
| archgoon wrote:
| [dead]
| reisse wrote:
| > Blame UNIX for that, and the fork() system call.
|
| At least that design failure of UNIX was fixed long ago.
| There are posix_spawn(3) and various clone(2) flavours which
| allow spawning a new process without copying the old one. And
| a lot of memory-intensive software actually uses them, so
| modern Linux distros can be used without memory
| overprovisioning.
|
| I'd rather blame people who are still using fork(2) for
| anything that can consume more than 100MB of memory.
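posix_spawn(3) is exposed in Python as os.posix_spawn() (3.8+), which makes the difference easy to sketch (a minimal, hedged example; assumes /bin/sh exists):

```python
import os

# Spawn a new process directly, without first duplicating the
# (possibly huge) calling process the way fork() would.
pid = os.posix_spawn("/bin/sh", ["sh", "-c", "exit 0"], os.environ)

# Reap the child and check its exit status.
_, status = os.waitpid(pid, 0)
spawn_ok = os.WEXITSTATUS(status) == 0
```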
| drougge wrote:
| I'm someone who likes to use fork() and then actually use
| both processes as they are, with shared copy-on-write
| memory. I'm happy to use it on things consuming much more
| than 100MB of memory. In fact that's where I like it the
| most. I'm probably a terrible person.
|
| But what would be better? This way I can massage my data in
| one process, and then fork as many other processes that use
| this data as I like without having to serialise it to disk
| and then load it again. If the data is not modified
| after fork it consumes much less memory (only the page
| tables). Usually a little is modified, consuming only a
| little memory extra. If all of it is modified it doesn't
| consume more memory than I would have otherwise (hopefully,
| not sure if the Linux implementation still keeps the pre-
| fork copy around).
|
| (And no, not threads. They would share modifications, which
| I don't want. Also since I do this in python they would
| have terrible performance.)
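A minimal sketch of that pattern (hedged: in CPython, reference-count updates dirty the pages of any object the child touches, so the sharing is less perfect than it would be in C):

```python
import os

# Parent builds the data set once.
data = list(range(100_000))

pid = os.fork()
if pid == 0:
    # Child sees the data through shared copy-on-write pages;
    # nothing was serialised to disk or explicitly copied.
    ok = sum(data[:10]) == 45
    os._exit(0 if ok else 1)

# Parent: the child's exit status tells us it saw the data.
_, status = os.waitpid(pid, 0)
child_saw_data = os.WEXITSTATUS(status) == 0
```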
| reisse wrote:
| So if I got it right, you're using fork(2) as a glorified
| shared memory interface. If my memory is (also) right,
| you can allocate a shared read-only mapping with
| shm_open(3) + mmap(2) in the parent process, and open it as a
| private copy-on-write mapping in child processes.
| jacquesm wrote:
| No, he's using fork the way it is intended.
|
| Shared memory came _much_ later than fork did.
| Asooka wrote:
| I have used fork as a stupid simple memory arena
| implementation. fork(); do work in the child; only
| malloc, never free; exit. It is much, much heavier than a
| normal memory arena would be, but also much simpler to
| use. Plus, if you can split the work in independent
| batches, you can run multiple children at a time in
| parallel.
|
| As with all such stupid simple mechanisms, I would not
| advise its use if your program spans more than one .c
| file and more than a thousand lines.
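A toy version of that fork-as-arena trick in Python (a hedged sketch; the modulus is only there to smuggle a small result back through the 8-bit exit status):

```python
import os

def run_batch(batch):
    # Do the work in a child; when it exits, all of its
    # allocations are reclaimed at once, arena-style.
    pid = os.fork()
    if pid == 0:
        result = sum(x * x for x in batch)  # throwaway "work"
        os._exit(result % 251)              # exit status is 0..255
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)

r = run_batch(range(10))  # sum of squares 0..9 is 285; 285 % 251 == 34
```

Independent batches can be forked in parallel exactly as described, since each child has its own copy-on-write view of the parent's state.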
| lxgr wrote:
| Unless you're sure that you're going to write to the
| majority of the copy-on-write memory resulting from fork(),
| this seems like overkill.
|
| Maybe there should be yet another flavor of fork() that
| does copy-on-write, but treats the memory as already-copied
| for physical memory accounting purposes? (Not sure if
| "copy-on-write but budget as distinct" is actually
| representable in Linux's or other Unixes' memory model,
| though.)
| panzi wrote:
| posix_spawn() is great, but Linux doesn't implement it.
| glibc does based on fork()+exec(). Other Unix(-like) OSes
| do implement posix_spawn() as system call. Also while you
| can use posix_spawn() in the vast majority of cases, if it
| doesn't cover certain process setup options that you need
| you still have to use fork()+exec(). But yeah, it would be
| good if Linux had it as a system call. It would probably
| help PostgreSQL.
| the8472 wrote:
| glibc uses vfork+exec to implement posix_spawn, which
| makes it much faster than fork+exec.
| AnIdiotOnTheNet wrote:
| Can a modern distro really be used without over
| provisioning? Because the last time I tried it either the
| DE or display server hard locked immediately and I had to
| reboot the system.
|
| Having this ridiculous setting as the default has basically
| ensured that we can never turn it off because developers
| expect things to work this way. They have no idea what to
| do if malloc errors on them. They like being able to make
| 1TB allocs without worrying about the consequences and just
| letting the kernel shoot processes in the head randomly
| when it all goes south. Hell, the last time this came up
| many swore that there was literally nothing a programmer
| _could_ do in the event of OOM. Learned helplessness.
|
| It's a goddamned mess and like many of Linux's goddamned
| messes not only are we still dealing with it in 2023, but
| every effort to do anything about it faces angry ranty
| backlash.
| lxgr wrote:
| Almost everything in life is overprovisioned, if you
| think about it: Your ISP, the phone network, hospitals,
| bank reserves (and deposit insurance)...
|
| What makes the approach uniquely unsuitable for memory
| management? The entire idea of swapping goes out of the
| window without overprovisioning as well, for better or
| worse.
| mrguyorama wrote:
| What an absurdly whataboutism filled response. Meanwhile
| Windows has been doing it the correct way for 20 years or
| more and never has to kill a random process just to keep
| functioning.
| lxgr wrote:
| So you're saying the correct way to support fork() is
| to... not support it? This seems pretty wasteful in the
| majority of scenarios.
|
| For example, it's a common pattern in many languages and
| frameworks to preload and fully initialize one worker
| process and then just fork that as often as required. The
| assumption there is that, while most of the memory is
| _theoretically_ writable, practically, much of it is
| written exactly once and can then be shared across all
| workers. This both saves memory and the time needed to
| uselessly copy it for every worker instance (or
| alternatively to re-initialize the worker every single
| time, which can be costly if many of its data structures
| are dynamically computed and not just read from disk).
|
| How do you do that without fork()/overprovisioning?
|
| I'm also not sure whether "giving other examples" fits
| the bill of "whataboutism", as I'm not listing other
| examples of bad things to detract from a bad thing under
| discussion - I'm claiming that all of these things are
| (mostly) good and useful :)
| AnIdiotOnTheNet wrote:
| Perhaps there is some confusion because I used
| "overprovision" when the appropriate term here is
| "overcommit", but Windows manages to work fine without
| unix-style overcommit. I suspect most OSs in history do
| not use unix's style of overcommit.
|
| > What makes the approach uniquely unsuitable for memory
| management?
|
| The fact that something like OOM killer even needs to
| exist. Killing random processes to free up memory you
| blindly promised but couldn't deliver is not a reasonable
| way to do things.
|
| Edit: https://lwn.net/Articles/627725/
| 0x0 wrote:
| You don't HAVE to use fork() + exec() in Linux as far as I
| know. There is for example posix_spawn(), which does the
| double whammy combo for you.
|
| Also, if you DO use fork() without immediately doing an
| exec(), and start writing all over your existing
| allocations... just don't?
| the8472 wrote:
| posix_spawn isn't powerful enough since it only supports a
| limited set of process setup operations. So if you need to
| do some specific syscalls before exec that aren't covered
| by it, then the fork/exec dance is still necessary. In
| principle one can use vfork+exec instead, but that is very
| hard to get right.
|
| What's really needed is io_uring_spawn but that's still a
| WIP. https://lwn.net/Articles/908268/
| whartung wrote:
| We actually ran into this a long time ago with Solaris and
| Java.
|
| Java has JSP, Java Server Pages. JSP processing translates a
| JSP file into Java source code, compiles it, then caches,
| loads, and executes the resulting class file.
|
| Back then, the server would invoke the javac compiler through
| a standard fork and exec.
|
| That's all well and good, save when you have a large app
| server image sucking up the majority of the machine. As far
| as we could tell, it was a copy-on-write kind of fork: it
| didn't actually copy the memory when forking the app server.
| Rather it tried to reserve the allocation, found it didn't
| have the space or swap, and just failed with a system OOM
| error (which differs from a Java out-of-memory/heap error).
|
| As I recall adding swap was the short term fix (once we
| convinced the ops guy that, yes it was possible to "run out
| of memory"). Long term we made sure all of our JSPs were pre-
| compiled.
|
| Later, this became a non issue for a variety of reasons,
| including being able to run the compiler natively within a
| running JVM.
| KMag wrote:
| Option 3: vfork() has existed for a long time. The child
| process temporarily borrows all of the parent's address
| space. The calling process is frozen until the child exits or
| calls a flavor of exec*. Granted, it's pretty brittle, and
| any modification of non-stack address space other than
| changing a variable of type pid_t is undefined behavior
| before exec* is called. However, it gets around the
| disadvantages of fork() while maintaining all of the
| flexibility of Unix's separation of process creation
| (fork/vfork) and process initialization (exec*).
|
| vfork followed immediately by exec gives you Windows-like
| process creation, and last I checked, despite having the
| overhead of a second syscall, was still faster than process
| creation on Windows.
| eru wrote:
| Keep in mind that copy-on-write makes analysing the situation
| a bit more complicated.
| nnntriplesec wrote:
| How so?
| pdpi wrote:
| CoW is a strategy where you don't actually copy memory
| until you write to it. So, when the 10GB process spawns a
| child process, that child process also has 10GB of
| virtual memory, but both processes are backed by the same
| pages. It's only when one of them writes to a page that a
| copy happens. When you fork+exec you never actually touch
| most of those pages, so you never actually pay for them.
|
| (Obviously, that's the super-simplified version, and I
| don't fully understand the subtleties involved, but
| that's exactly what GP means: it's harder to analyse)
| eru wrote:
| Thanks for writing the explanation.
|
| To make it slightly more complicated: you don't pay for
| the 10 GB directly, but you still pay for setting up the
| metadata, and that scales with the amount of virtual
| memory used.
| progbits wrote:
| If you would like to go deeper down the rabbit hole,
| fasterthanlime recently made a few videos that explore this
| topic:
|
| https://youtu.be/YB6LTaGRQJg https://youtu.be/c_5Jy_AVDaM
| https://youtu.be/DpnXaNkM9_M
|
| The code is in Rust but that doesn't matter for the
| explanation.
| skyeto wrote:
| What I find to be a rather interesting tidbit related to this
| is that some applications (e.g. certain garbage collectors) map
| multiple ranges of virtual memory that address the same
| physical memory.
| yakubin wrote:
| That's how you can create a gapless ring buffer:
| <https://learn.microsoft.com/en-us/windows/win32/api/memoryap...>
| (see scenario 1 in the examples).
___________________________________________________________________
(page generated 2023-04-05 23:01 UTC)