[HN Gopher] Is RAM wiped before use in another LXC container?
       ___________________________________________________________________
        
       Is RAM wiped before use in another LXC container?
        
       Author : Aachen
       Score  : 280 points
       Date   : 2023-04-05 10:22 UTC (12 hours ago)
        
 (HTM) web link (security.stackexchange.com)
 (TXT) w3m dump (security.stackexchange.com)
        
       | pmontra wrote:
        | I thought that everybody knew how the CPU and the kernel manage
       | memory. Everything else follows. However as a very good developer
       | told me once, "I don't know how networking works, only that I
       | have to type a URL." "How is that possible?" I replied, "They
       | teach it at a school." And he told me "yes, but I studied graphic
       | design, then discovered that I like programming."
       | 
       | Replace and repeat with SQL, machine language, everything below
       | modern programming languages.
        
       | jk_i_am_a_robot wrote:
        | If I could answer, it would be a solid "maybe". If you're
        | willing to consider the Translation Lookaside Buffer as part of
        | the contents of RAM then, if the claims are to be believed,
        | Spectre and Meltdown could read the contents of other processes'
        | (and containers') RAM.
        
       | ritcgab wrote:
       | I think the main insight is that user-level memory allocation
       | does not necessarily involve the kernel. If the application uses
       | `malloc` to get memory allocated, the real "application" that
       | will (possibly) request free pages from the kernel is the memory
       | allocator. If the space requested can be served by the pages
       | already assigned to the allocator, it can just hand them out. So
        | when you `malloc`, the memory is not necessarily zeroed, unless
        | you know it came freshly from the kernel (but you don't know the
        | internal state of the allocator). That is why one needs to
        | manually zero out space returned from `malloc`, or use `calloc`
        | instead.
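        | 
        | A minimal sketch of that distinction (plain standard C, not
        | tied to any particular allocator):
        | 
        |     #include <stdlib.h>
        |     #include <string.h>
        | 
        |     int main(void) {
        |         size_t n = 1024;
        | 
        |         /* contents indeterminate: may be recycled
        |            memory the allocator never re-zeroes */
        |         char *a = malloc(n);
        |         if (!a) return 1;
        |         memset(a, 0, n);  /* zero manually if needed */
        | 
        |         char *b = calloc(n, 1);  /* guaranteed zeroed */
        |         if (!b) { free(a); return 1; }
        | 
        |         free(a);
        |         free(b);
        |         return 0;
        |     }
        | 
        | (calloc can often skip the memset entirely when the block comes
        | freshly from the kernel, since those pages are already zero.)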
        
       | CodesInChaos wrote:
       | Is GPU memory wiped before it's handed to a new process, and is a
       | process prevented from accessing GPU memory of a different
       | process? (assuming low level APIs like Vulkan)
        
         | tryauuum wrote:
          | this is certainly a more interesting question than the OP's
        
         | johntb86 wrote:
         | By the Vulkan spec it is guaranteed that memory from one
         | process can't be seen by another.
         | https://registry.khronos.org/vulkan/specs/1.3-extensions/htm...
         | says: In particular, any guarantees made by an operating system
         | about whether memory from one process can be visible to another
         | process or not must not be violated by a Vulkan implementation
         | for any memory allocation.
         | 
         | In theory you could have some sort of complex per-process
         | scrambling system to avoid leaking information, but I think
         | implementations actually just zero the memory.
         | 
         | GPU drivers on different operating systems can be more or less
         | buggy; Windows and Linux generally seem to do the right thing,
         | but MacOS is a bit more haphazard.
        
         | NotCamelCase wrote:
         | You can request Windows to zero out every GPU memory
         | allocation, but it's more of a driver thing that's not exposed
          | in APIs. Such an option is likely to be off by default in drivers
         | as it might induce additional, unintended overhead. In
         | practice, you are likely to see memory cleared more often than
         | not due to other reasons, though.
         | 
          | You can't just peek at another process's GPU memory through a
          | user-mode driver (UMD) app, either. Per-process virtual memory
          | mechanisms similar to those on CPUs
         | are also present in GPUs, which is the whole reason that
         | resources are explicitly imported/exported across APIs via
         | special API calls.
        
       | coppsilgold wrote:
       | It is if you tell the kernel.
       | 
        | zero memory on free (3-5% average system performance impact, due
        | to touching cold memory):
        | 
        |     init_on_free=1
        | 
        | zero memory on alloc (<1% average system performance impact):
        | 
        |     init_on_alloc=1
        
       | adamgordonbell wrote:
       | My shortcut for answering these questions is just replace
       | 'container' with process and find the answer for that.
       | 
       | A running container is just a process that is using certain
        | kernel features: cgroups, namespaces, pivot_root.
       | 
       | LXC has a different DX around containers than OCI containers do,
       | but the same rules apply.
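        | 
        | As a rough illustration (a hand-rolled sketch, nothing like how
        | LXC or OCI runtimes are actually implemented; assumes Linux and
        | root privileges), a "container" can be as little as a process
        | dropped into a few new namespaces:
        | 
        |     #define _GNU_SOURCE
        |     #include <sched.h>
        |     #include <stdio.h>
        |     #include <unistd.h>
        |     #include <sys/wait.h>
        | 
        |     int main(void) {
        |         /* new mount, UTS, IPC and PID namespaces */
        |         if (unshare(CLONE_NEWNS | CLONE_NEWUTS |
        |                     CLONE_NEWIPC | CLONE_NEWPID)) {
        |             perror("unshare");
        |             return 1;
        |         }
        |         pid_t child = fork();  /* PID 1 in the new ns */
        |         if (child == 0) {
        |             sethostname("sandbox", 7);
        |             execlp("/bin/sh", "sh", (char *)NULL);
        |             perror("execlp");
        |             _exit(1);
        |         }
        |         waitpid(child, NULL, 0);
        |         return 0;
        |     }
        | 
        | The memory rules are exactly those of any other process on the
        | host; the namespaces only change what the process can see by
        | name (mounts, PIDs, hostname, ...).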
        
         | tomcam wrote:
         | Thank you! Annoyed at myself for not thinking it through that
         | way.
        
         | commandersaki wrote:
          | One exercise on pwn.college, in their kernel security module,
          | gives you a program that forks, opens the flag file (/flag,
          | owned by root), reads the content, and then the child process
          | exits.
         | 
         | The program continues to run to allow you to load in shell code
         | that'll be run by the kernel.
         | 
         | Your task is basically to write some shellcode to scan the
          | memory for the flag. So now I know that at least some Linuxes
         | don't clean up after a process exits, and you can get the
         | contents of memory when you have kernel privileges. This is not
         | so easy if you're scanning memory as root since /dev/mem or
         | whatever won't reveal that process memory.
        
           | dathinab wrote:
           | > don't clean up after a process exits
           | 
           | exactly, the only guarantee is that things are zeroed before
           | handing them out to a different process, but there is some
           | potential time gap between releasing memory back to the
            | kernel and it being cleaned, a gap which can outlive the life
            | of a process.
           | 
           | > and you can get the contents of memory when you have kernel
           | privileges. This is not so easy [..] as root
           | 
            | yes, root has far fewer privileges than the kernel, but can
            | often gain kernel privileges.
            | 
            | But this is where e.g. lockdown mode comes in, which denies
            | the root user such privilege escalation (oversimplified,
            | it's complicated). The main problem is that lockdown mode is
            | not yet compatible with suspend to disk (hibernation), even
            | though its documentation implies it is if you have encrypted
            | hibernation. (This is misleading, as it refers to a not yet
            | existing feature where the kernel creates an encrypted image
            | which is also tamper proof even if root tries to tamper. On
            | the other hand, suspend to an encrypted partition is possible
            | in Linux, but that is not enough for lockdown mode to work.)
        
           | CodesInChaos wrote:
           | By default memory is wiped before it's handed out again, not
           | when it's freed. This improves performance, but means secrets
           | can remain in RAM for longer than necessary, where they can
           | be accessed by privileged attackers (software running as
           | root, DMA without MMU, or hardware attacks). For unprivileged
           | processes eager and lazy zeroing look the same.
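            | 
            | A quick way to see the unprivileged view (a sketch; assumes
            | Linux and anonymous mmap, which is how allocators get fresh
            | pages from the kernel):
            | 
            |     #include <stdio.h>
            |     #include <sys/mman.h>
            | 
            |     int main(void) {
            |         size_t len = 1 << 20;
            |         unsigned char *p = mmap(NULL, len,
            |             PROT_READ | PROT_WRITE,
            |             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            |         if (p == MAP_FAILED) return 1;
            |         unsigned long sum = 0;
            |         for (size_t i = 0; i < len; i++)
            |             sum += p[i];
            |         printf("sum = %lu\n", sum);  /* always 0 */
            |         munmap(p, len);
            |         return 0;
            |     }
            | 
            | Whether the zeroing happened eagerly at free time or lazily
            | at this fault, the sum comes out 0 either way.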
        
             | seri4l wrote:
             | Apparently there's a kernel config flag to zero the memory
              | on free (CONFIG_INIT_ON_FREE_DEFAULT_ON), but it has quite
              | an expensive performance cost (3-5% according to the docs). I
             | wonder in what kind of scenario it would make sense to
             | enable it.
        
               | MR4D wrote:
               | So, if it's only 3-5% slower, then for $50-100 I could
               | buy a slightly faster processor and never know the
               | difference?
               | 
               | Just trying to check my understanding of what the 3-5%
               | delta is. Seems like a tiny tradeoff for any workstation
               | (I wouldn't notice the difference at least). The tradeoff
               | for servers might vary depending on what they are doing
               | (shared versus owned, etc)
        
               | postalrat wrote:
               | How many thousand tradeoffs like this are you willing to
               | pay for?
        
               | soulofmischief wrote:
               | This seems beneficial in systems where security concerns
               | trump performance concerns. The above poster has probably
               | made many such trade-offs already and would likely make
               | more. (Full disk encryption, virtualization, protection
               | rings, spectre mitigations, MMIO, ECC, etc.)
               | 
               | With exponentially increasing processor performance it
               | does make sense for workstations where physical access
               | should be considered in the threat model.
        
               | dathinab wrote:
                | When running code that is not performance sensitive but
                | is security sensitive. Even adding protections that sum
                | up to much higher performance penalties can be very
                | acceptable.
                | 
                | E.g. on a crypto key server. Less so if it's a server
                | that encrypts data en masse, but e.g. one that signs
                | longer-lived auth tokens, or one that holds intermediate
                | certificates used once every few hours to create a cert
                | that encrypts/signs data en masse on a different server,
                | etc.
        
               | gus_massa wrote:
               | I don't understand why it is slower. It has to be zeroed
               | anyway.
               | 
               | In the normal configuration:
               | 
               | Is it not zeroed if the memory is assigned to the same
               | process???
               | 
               | Is it zeroed when the system is idle???
               | 
               | Is it zeroed in batches that are more memory friendly???
        
               | dathinab wrote:
               | > Is it zeroed when the system is idle???
               | 
               | yes mainly that,
               | 
               | and if the system isn't idle but also doesn't use all
               | phys. memory it might not be zeroed for a very long time
               | 
               | > Is it not zeroed if the memory is assigned to the same
               | process???
               | 
                | idk what the current state of this is in Linux, but at
                | least in the past, on some systems, this was the case for
                | some use cases related to mapped memory.
        
               | [deleted]
        
               | wongarsu wrote:
               | Two reasons:
               | 
               | - lots of code is written under the assumption that free
               | is fast
               | 
               | - memory is zeroed in the background, unless memory
               | pressure forces the kernel to zero when handing it out
        
               | KMag wrote:
               | > I don't understand why it is slower. It has to be
               | zeroed anyway.
               | 
               | Memory pages freed from userspace might be reused in
               | kernelspace.
               | 
               | If, for instance, the memory is re-used in the kernel's
               | page cache, then the kernel doesn't need to zero it out
               | before copying the to-be-cached data into the page.
               | 
               | Edit: I seem to remember back in the 1990s that the
               | kernel at least in some cases wouldn't zero-out pages
               | previously used by the kernel before giving them to
               | userspace, sometimes resulting in kernel secrets being
               | leaked to arbitrary userspace processes. Maybe I'm
                | misremembering, and it was just leakage of secrets
               | between userspace processes. In any case, in the 1990s,
               | Linux was way too lax about leaking data from freed
               | pages.
        
               | matt_heimer wrote:
               | Just a guess but since apps can fail to free memory
               | correctly you probably have to zero it on allocation and
               | deallocation (to be secure) when you enable the feature.
               | So you aren't swapping one for the other, you are now
               | doing both.
        
               | Denvercoder9 wrote:
               | > Just a guess but since apps can fail to free memory
               | correctly
               | 
               | That's not relevant here; from the perspective of the
               | kernel pages are either assigned to a process, or they're
               | not. If an application fails to free memory correctly,
               | that only means it'll keep having pages assigned to it
               | that it no longer uses, but eventually those pages will
               | always be released (by the kernel upon termination of the
               | process, in the worst case).
        
               | dfox wrote:
               | That is the worst case if the process had leaked that
               | part of the heap, but it is an optimal case on process
                | exit. On an OS with any kind of process isolation,
                | walking over most of the heap before exiting just to
                | "correctly free it" is a pure waste of CPU cycles, and in
                | the worst case even of IO bandwidth (when it causes parts
                | of the heap to be paged in).
        
               | samus wrote:
                | Paging the pages in can be avoided entirely if the
                | intention is just to zero them. The kernel could either
                | just "forget" them, or use copy-on-write with a properly
                | zeroed-out page as a base.
        
               | shawabawa3 wrote:
               | My guess is it won't always have to be zeroed
               | 
                | e.g. if your code is doing
                | 
                |     ptr = malloc(size);
                |     memcpy(ptr, mydata, size);
               | 
               | You can presumably optimise out the zeroing of the memory
        
               | KMag wrote:
               | As far as I know, the Linux kernel never inspects the
               | userspace thread to adjust behavior based on what the
               | thread is going to do next. This would be a very brittle
               | sort of optimization.
               | 
               | More importantly, it's not safe. Another thread in the
               | same process can see ptr between the malloc and the
               | memcpy!
               | 
               | Edit: also, of course, malloc and memcpy are C runtime
               | functions, not syscalls, so checking what happens after
               | malloc() would require the kernel to have much more
               | sophisticated analysis than just looking a few
                | instructions ahead of the calling thread's %eip/%rip.
               | While handling malloc()'s mmap() or brk() allocation, the
               | kernel would need to be able to look one or two call
               | frames up the call stack, past the metadata accounting
               | that malloc is doing to keep track of the newly acquired
               | memory, perhaps look at a few conditional branches, trace
               | through the GOT and PLT entries to see where the memcpy
               | call is actually going, and do so in a way that is robust
               | to changes in the C runtime implementation. (Of course,
               | in practice, most C compilers will inline a memcpy
               | implementation, so in the common case, it wouldn't have
               | to chase the GOT and PLT entries, but even then, it's way
               | too complicated for the kernel to figure out if anything
               | non-trivial is happening between mmap()/brk() and the
               | memory being overwritten.)
               | 
               | Edit 2: To be robust in the completely general case, even
               | if it were trivial to identify the inlined memcpy
               | implementation, and it were clearly defined "something
               | non-trivial happens", determining if "something non-
               | trivial happens" between mmap()/brk() and memcyp() would
               | involve solving the halting problem. (Imposssible in the
               | general case.)
        
               | mjevans wrote:
                | I'm NOT an expert here, but offhand:
                | malloc() == a 'reservation' of memory (but not paged in!)
                | // If touched / updated THEN the memory's paged in
                | 
                | A copy _might_ not even become a copy if the kernel's
                | smart enough / able to set up a hardware trigger to force
                | a copy on writes to that area, at which point the
                | physical memory backing the two distinct logical memory
                | zones would be copied and would then diverge.
        
               | KMag wrote:
               | That's a good point that Linux doesn't actually allocate
               | the pages until they're faulted in by a read or write.
               | So, if it were doing some kind of thread inspection
               | optimization, it would presumably just need to check if
               | the faulting thread is currently in a loop that will
               | overwrite at least the full page.
               | 
               | However, that wouldn't solve the problem of other threads
               | in the same process being able to see the page before
               | it's fully overwritten, or debugging processes, or using
               | a signal handler to invisibly jump out of the
               | initialization loop in the middle, etc. There are
               | workarounds to all of these issues, but they all have
               | performance and complexity costs.
        
               | sumtechguy wrote:
               | malloc gets memory from the heap which may or may not be
               | paged in/reused. That means you may get reused memory
               | from the heap (which is up to the CRT).
               | 
                | If you want to make sure it is zero you will want calloc.
                | If you know you are going to copy something in on the
                | next step, like in your example, you can probably skip
                | calloc and just use malloc. calloc is nice for when you
                | are doing things like linked lists/trees/buffers and do
                | not want extra steps to clean out the pointers or data.
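                | 
                | For example (a tiny sketch with a made-up node type;
                | all-bits-zero is NULL on mainstream platforms):
                | 
                |     #include <stdlib.h>
                | 
                |     struct node {
                |         int value;
                |         struct node *next;
                |     };
                | 
                |     /* value == 0 and next == NULL right away,
                |        no separate init step needed */
                |     struct node *make_node(void) {
                |         return calloc(1, sizeof(struct node));
                |     }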
        
               | vlovich123 wrote:
               | Yes. It's zeroed in a low priority background process to
               | avoid interfering with foreground apps.
        
               | bayindirh wrote:
               | Any multi-user system where users don't know each other
               | and handle sensitive data.
        
               | LinuxBender wrote:
               | Do you mean _init_on_alloc=1_ and _init_on_free=1_? Here
               | [1] is a thread on the options and performance impact.
               | FWIW I use it on all my workstations but these days I am
               | not doing anything that would be greatly impacted by it.
                | I've never tried it on a gaming machine and never tried
               | it on a large memory hypervisor.
               | 
               | I wish there were flags similar to this for the GPU
                | memory. Even something that zeroes GPU memory on reboot
               | would be nice. I can always see the previous desktop
               | after a reboot for a brief moment.
               | 
               | [1] - https://patchwork.kernel.org/project/linux-
               | mm/patch/20190617...
        
               | ape4 wrote:
               | I believe the docs but I would have thought that memset()
               | would be really quick - implemented in hardware?
        
               | dataflow wrote:
               | "Real quick" is human speak. For large amounts of memory
               | it's still bound by RAM speed for a machine, which is
               | much lower (a couple orders of magnitude I believe) than,
               | say, cache speed. Things might be different if there was
               | a RAM equivalent of SSD TRIM (making the RAM module zero
               | itself without transferring lots of zeros across the
               | bus), but there isn't.
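                | 
                | A rough way to see that bound (a sketch; assumes
                | Linux/glibc, numbers will vary by machine):
                | 
                |     #include <stdio.h>
                |     #include <stdlib.h>
                |     #include <string.h>
                |     #include <time.h>
                | 
                |     int main(void) {
                |         size_t len = 1UL << 30;  /* 1 GiB */
                |         char *buf = malloc(len);
                |         if (!buf) return 1;
                |         memset(buf, 1, len);  /* fault pages in */
                | 
                |         struct timespec a, b;
                |         clock_gettime(CLOCK_MONOTONIC, &a);
                |         memset(buf, 0, len);
                |         clock_gettime(CLOCK_MONOTONIC, &b);
                | 
                |         double s = (b.tv_sec - a.tv_sec)
                |             + (b.tv_nsec - a.tv_nsec) / 1e9;
                |         printf("%.2f GiB/s\n", 1.0 / s);
                |         free(buf);
                |         return 0;
                |     }
                | 
                | The result typically lands near the machine's DRAM write
                | bandwidth, far below what the same call reaches on a
                | buffer that fits in cache.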
        
               | throwaway894345 wrote:
               | I'm completely unfamiliar with how the CPU communicates
               | with the memory modules, but is there not a way for the
               | CPU to tell the memory modules to zero out a whole range
               | of memory rather than one byte/sector/whatever-the-
               | standard-unit-is at a time?
               | 
               | As I type this, I'm realizing how little I know about the
               | protocol between the CPU and the memory modules--if
               | anyone has an accessible link on the subject, I'd be
               | grateful.
        
               | dataflow wrote:
               | That's what I referred to as "TRIM for RAM". I'm not
               | aware of it being a thing. And I don't know the protocol,
               | but I'm also not sure it's just a matter of protocol. It
               | might require additional circuitry per bit of memory that
               | would increase the cost.
        
               | mjevans wrote:
               | 'trim' for RAM is a virtual to physical page table hack.
                | Memory that isn't backed by a page just reads as zero; it
                | doesn't need to be initialized. Offhand it's supposed to
                | be zeroed before it's handed to a process, but I don't
                | know if there are e.g. mechanisms to use some spare
                | cycles to
               | proactively zero non-allocated memory that's a candidate
               | for being attached to VM space.
        
               | andrewf wrote:
               | Oldie but a goodie: https://people.freebsd.org/~lstewart/
               | articles/cpumemory.pdf
        
               | vlovich123 wrote:
               | No. Memset (and bzero) aren't HW accelerated. There is a
               | special CPU instruction that can do it but in practice
               | it's faster to do it in a loop. In user space you can
                | frequently leverage SIMD instructions to speed it up (of
                | course those aren't available in the kernel, because it
                | avoids saving/restoring those and the FP registers on
                | every syscall, only when you switch contexts).
                | 
                | What could be interesting is if there were a CPU
                | instruction
               | to tell the RAM to do it. Then you would avoid the memory
               | bandwidth impact of freeing the memory. But I don't think
               | there's any such instruction for the CPU/memory protocol
               | even today. Not sure why.
        
               | dathinab wrote:
                | Though modern CPUs are explicitly built to make sure
                | such a loop is fast.
                | 
                | And in some cases on some systems the DRM controller
                | might zero the memory in some situations, in which case
                | you could say it was done by hardware.
        
               | pflanze wrote:
               | > DRM controller
               | 
               | Did you mean DMA controller? Or do you have more
               | information?
        
               | dathinab wrote:
               | yes DMA, not the direct rendering manager ;=)
        
               | Arrath wrote:
               | That seems wild to be honest. I know how easy it is to
               | say "well they can just.."
               | 
               | But...wouldn't it be relatively trivial to have an
               | instruction that tells the memory controller "set range
               | from address y to x to 0" and let it handle it? Actually
               | slamming a bunch of 0's out over the bus seems so very
               | suboptimal.
        
               | mlyle wrote:
               | > But...wouldn't it be relatively trivial to have an
               | instruction that tells the memory controller "set range
               | from address y to x to 0" and let it handle it?
               | 
               | Having the memory controller or memory module do it is
               | complicated somewhat because it needs to be coherent with
               | the caches, needs to obey translation, etc. If you have
               | the memory controller do it, it doesn't save bandwidth.
               | But, on the other hand, with a write back cache, your
               | zeroing may never need to get stored to memory at all.
               | 
               | Further, if you have the module do it, the module/sdram
               | state machine needs to get more complicated... and if you
               | just have one module on the channel, then you don't
               | benefit in bandwidth, either.
               | 
               | A DMA controller can be set up to do it... but in
               | practice this is usually more expensive on big CPUs than
               | just letting a CPU do it.
               | 
               | It's not _really_ tying up a processor because of
               | superscalar, hyperthreading, etc, either; modern
                | processors have an abundance of resources, and what slows
                | things down is work that must be done serially or
                | resources that are most contended (like the bus to
               | memory).
        
               | Arrath wrote:
               | Thanks for the answer!
        
               | dathinab wrote:
               | really quick still doesn't mean it's free, especially if
               | you always have to zero all the allocated pages even if
               | the process might just have used part of the page.
               | 
               | Also the question is what is this % in relation to?
               | 
                | Probably that freeing gets up to 5% slower, which is
                | reasonable given that before, you could often use idle
                | time to zero many of the pages, or might not have zeroed
                | some of the pages at all (as they were never reused).
        
               | MarkSweep wrote:
               | Some processors have "hardware store elimination" that
               | makes writing all zeros a bit faster than writing other
               | values.
               | 
               | https://travisdowns.github.io/blog/2020/05/13/intel-zero-
               | opt...
        
               | CodesInChaos wrote:
               | I'd like the ability to control this at a process or even
               | allocation (i.e. as a flag on mmap) level. That way a
               | password manager could enable this, while a game could
               | disable it.
        
               | bhawks wrote:
                | You want to enable this if you're concerned about
                | forensic attacks. A simple example would be someone who
                | has physical access to your device. They're able to power
                | it down and boot it with their own custom kernel. If the
                | memory has not been eagerly zeroed, they may be able to
                | extract sensitive data from RAM.
               | 
               | This flag puts an additional obstacle in the attacker's
               | path. If you have private key material protecting
               | valuable property, you definitely want to throw up as
               | many roadblocks as possible.
        
               | Wowfunhappy wrote:
               | How is the attacker powering down the device while
               | retaining the contents of its RAM?
        
               | soulofmischief wrote:
               | If your PC is connected to a power strip, it's my
               | understanding that law enforcement can attach a live
               | male-to-male power cable to the power strip and then
               | remove the power strip from the wall while still powering
               | the computer. That, and yeah freezing ram.
        
               | Gracana wrote:
               | Data fades slowly from DRAM, especially if you freeze it
               | first.
        
               | l33t233372 wrote:
               | Perhaps by using a can of compressed air[0].
               | 
               | [0] https://www.usenix.org/legacy/event/sec08/tech/full_p
               | apers/h...
        
               | l33t233372 wrote:
               | I don't understand why this would help prevent cold boot
               | attacks.
               | 
                | Wouldn't the memory need to be freed first for this to
               | have any effect?
        
               | ender341341 wrote:
                | The idea is that well-written software would release
                | memory as soon as possible, so with it enabled you'd have
                | the secret in memory for as little time as possible.
                | 
                | Though in my mind, well-written software should be
                | zeroing the memory out before freeing it if it held
                | sensitive data.
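                | 
                | For example, something like this (a sketch;
                | explicit_bzero is glibc >= 2.25 / BSD; other platforms
                | have memset_s or a volatile pointer loop):
                | 
                |     #include <stdlib.h>
                |     #include <string.h>
                | 
                |     void use_key(void) {
                |         size_t len = 32;
                |         unsigned char *key = malloc(len);
                |         if (!key) return;
                |         /* ... use the key ... */
                | 
                |         /* wipe before the allocator can recycle
                |            the block; unlike a plain memset this
                |            is not optimized away */
                |         explicit_bzero(key, len);
                |         free(key);
                |     }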
        
               | bhawks wrote:
               | Yes it would - either through the free syscall or a
               | process exit. This is a defense in depth strategy and not
               | 100% perfect. If you yanked the power cord and a long
               | lived process had sensitive data in memory you're still
               | vulnerable. But if you had a clean power down or very
               | short lifetimes of sensitive data being active in RAM it
               | would afford you additional security.
        
               | WalterBright wrote:
               | ?? Cutting the power means the RAM contents vanish.
        
               | SllX wrote:
               | They vanish eventually which is usually measured in
               | seconds. This can be extended to minutes or hours if
               | someone performs a cold boot attack: https://security.sta
               | ckexchange.com/questions/10643/recover-t...
        
               | l33t233372 wrote:
               | I find that phrasing weird.
               | 
                | A cold boot attack relies on a cold boot of the system to
                | evade kernel protections (as opposed to a warm boot,
                | where the kernel can zero memory).
                | 
                | The name has nothing to do with reducing the temperature
                | of the RAM to extend the time it takes bytes to vanish
                | from RAM.
        
               | SllX wrote:
               | I think it's a little bit of column A and a little bit of
                | column B, but I admit that while I remember reading about
                | this technique a long time ago, I'm not sure of the
                | history of the nomenclature. From the StackExchange:
               | 
               | > For those who think this is only theoretical: They were
               | able to use this technique to create a bootable USB
               | device which could determine someone's Truecrypt hard-
               | drive encryption key automatically, just by plugging it
               | in and restarting the computer. They were also able to
               | recover the memory-contents 30 minutes+ later by freezing
               | the ram (using a simple bottle of canned-air) and
               | removing it. Using liquid nitrogen increased this time to
               | hours.
        
               | l33t233372 wrote:
               | Reducing the temperature of the RAM can be done to make a
               | cold boot attack easier, but it's not the origin of the
               | name.
               | 
               | For more details, see the paper Lest We Remember.
        
               | WalterBright wrote:
                | I didn't know that. Thanks!
        
           | zamnos wrote:
           | here, I made it clickable: http://pwn.college
        
             | gaudat wrote:
             | Oh it is free too. Gotta try this.
        
           | djbusby wrote:
           | The memory wipe is a kernel build-time config option.
        
             | CodesInChaos wrote:
             | Do you know what the option is called?
             | 
             | edit: apparently the runtime option is called
             | `init_on_alloc` and the compile-time option (which
             | determines the default of the runtime option) is called
             | `CONFIG_INIT_ON_FREE_DEFAULT_ON`.
        
               | jwilk wrote:
               | There are two parameters:
               | 
               | * init_on_alloc (default set by
               | CONFIG_INIT_ON_ALLOC_DEFAULT_ON)
               | 
               | * init_on_free (default set by
               | CONFIG_INIT_ON_FREE_DEFAULT_ON)
        
         | sgt wrote:
         | DX?
        
           | adamgordonbell wrote:
            | Developer Experience (maybe there is a better term, but I
            | mean how you use them).
            | 
            | Docker/OCI containers tend to be a single process, and LXC
            | containers have a collection of processes.
           | 
           | But in either case they are just processes running on the
           | host in a different namespace.
           | 
           | So they feel different to use, but use the same building
           | blocks (to my understanding).
        
             | Karellen wrote:
             | Shouldn't "DX" refer to the experience of people hacking on
             | the Docker (or whatever) code itself? Like... how easy the
             | build system is to work with if you're adding a new source
             | file to the docker codebase?
             | 
             | The people working with Docker, even if they are developers
              | doing development work, are still users _of Docker_,
             | aren't they? I mean, the GUI of an IDE is still part of its
             | UX, right? Even though it's for developers doing
             | development work?
        
               | adamgordonbell wrote:
                | I was thinking that developer experience is user
                | experience - where the user is a developer. You are
                | suggesting that the user and the developer are different
                | roles, because even when the user is a developer, there
                | is still the developer who builds that tool.
                | 
                | It's possible you are right, but I'm not an expert. I
                | always think of the developer experience as the
                | experience of developers using the tools and APIs you
                | produce.
        
             | sgt wrote:
             | Unrelated, but your name rang a... bell. You're the
             | corecursive guy. Great podcast!
        
               | adamgordonbell wrote:
               | Thanks!
        
           | dwb wrote:
           | "Developer experience" (sigh)
        
       | Proven wrote:
       | [dead]
        
       | vrglvrglvrgl wrote:
       | [dead]
        
       | worthless-trash wrote:
       | One day, the general technical population will understand
       | containers are just processes on a system. No magic under the
       | hood.
        
       | bluedino wrote:
        | Is the confusion coming from the fact that when you malloc()
        | memory in C you aren't getting zeroed memory?
        
         | dale_glass wrote:
         | I tried on Linux, and got all zeroes.
         | 
         | Of course that's for a trivial program. If you freed something,
         | that probably wasn't returned to the OS, and the next malloc
         | might just recycle it.
        
           | bluedino wrote:
           | Malloc a bunch of memory, dirty it, free it, then malloc some
           | more. And then check.
           | 
            | On your first request in your program, you'll get clean
            | memory from the OS.
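            | 
            | Something like this (a sketch; whether the old bytes come
            | back depends on the allocator's internal state, so results
            | can vary):
            | 
            |     #include <stdio.h>
            |     #include <stdlib.h>
            |     #include <string.h>
            | 
            |     int main(void) {
            |         char *a = malloc(4096);
            |         if (!a) return 1;
            |         memset(a, 'X', 4096);  /* dirty it */
            |         free(a);  /* stays in the allocator's pools */
            | 
            |         /* may be the same block; the kernel never
            |            re-zeroes it for us */
            |         char *b = malloc(4096);
            |         if (!b) return 1;
            |         printf("b[0] = 0x%02x\n", (unsigned char)b[0]);
            |         free(b);
            |         return 0;
            |     }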
        
       | jcarrano wrote:
        | I'm curious, when did OSes first begin zeroing out pages? Did the
       | first operating systems already zero the memory?
        
       | hansendc wrote:
       | Uh... Did I miss the patches that add a pre-zeroed page pool to
       | Linux? Wouldn't be the first time I missed something like that
       | getting added, but 6.3-rc5 definitely zeroes _some_ pages at
        | allocation time, and I don't see any indication of it consulting
       | a prezeroed page pool:
       | https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
        
       | gregw2 wrote:
        | It's not clear to me why the top rated reply, which seemed
        | extremely knowledgeable in discussing the Linux kernel's auto
        | zeroing of memory, mentioned a bunch of caveats, including "some
        | memory management libraries may reuse memory", but didn't say
        | "including glibc".
       | 
       | That's a pretty big library to gloss over from a security or
       | working programmer perspective!
       | 
       | (Corrections to perspective welcome.)
        
         | hn92726819 wrote:
         | I thought the shared memory they were referring to would be
         | read only memory or something. So yes, the memory is shared,
         | but your process doesn't have any access to overwrite any of
          | it. Likewise, when you call library functions and they malloc
          | something, they use memory that your process has write access
         | to, not the shared library memory.
         | 
         | In the context of the question, I assume the asker was mostly
         | interested in reading some kind of sensitive data from a
          | previous process, not reading the same library-code-only memory
         | or something.
         | 
         | Note: all of this could be wrong, it was just my understanding
         | 
         | Edit: looks like this answers that:
         | https://stackoverflow.com/questions/20857134/memory-write-pr...
        
         | IcePic wrote:
         | Even if your libc IS one of the offenders, this libc code runs
          | in a single process, so whether or not libc uses evil tricks in
          | your process, it doesn't mean you will get pages from some
          | other process's memory map, only that you may get your own old
          | pages back.
        
           | gregw2 wrote:
           | Thanks, makes sense. (Looks like the poster commenting about
           | glibc has corrected themself.)
        
       | Aachen wrote:
       | I found the currently accepted answer (by user10489) interesting,
       | tying together a lot of concepts I had heard of but didn't
       | properly understand. For example, looking in htop, chromium has
       | six processes each using a literal terabyte of virtual memory,
       | which is obviously greater than swap + physical RAM on my
       | ordinary laptop. Or, why does an out-of-memory system hang
       | instead of just telling the next process "nope, we don't have
       | 300KB to allocate to you anymore" and either crash the program or
       | let the program handle the failed allocation gracefully? This
       | answer explains how this all works.
       | 
       | The TL;DR answer to the actual question is: processes generally
       | don't get access to each other's memory unless there is some
       | trust relation (like being the parent process, or being allowed
       | to attach a debugger), and being in a container doesn't change
        | that: the same restrictions apply and you always get zeroed-out
       | memory from the kernel. It's when you use a different allocator
       | that you might get nonzeroed memory from elsewhere in your own
       | process (not a random other process).
        
         | jeffbee wrote:
         | In Linux, a task that tries to allocate more than the allowed
         | amount will be immediately terminated and not notified about
         | anything. The reason out-of-memory systems thrash is merely
         | because people do not always bother setting the limits, and the
         | default assumes users prefer a process that continues through
         | hours of thrashing instead of being promptly terminated.
        
           | Sakos wrote:
           | Who is "people" in this? If you mean your average user, it's
           | unrealistic to expect them to know or care about details like
           | this. This is the kind of thing that needs sane defaults.
        
         | dale_glass wrote:
         | > Or, why does an out-of-memory system hang instead of just
         | telling the next process "nope, we don't have 300KB to allocate
         | to you anymore"
         | 
         | Blame UNIX for that, and the fork() system call.
         | 
         | It's a design quirk. fork() duplicates the process. So suppose
         | your web browser consumes 10GB RAM out of the 16GB total on the
         | system, and wants to run a process for anything. Like it just
         | wants to exec something tiny, like `uname`.
         | 
         | 1. 10GB process does fork().
         | 
         | 2. Instantly, you have two 10GB processes
         | 
         | 3. A microsecond later, the child calls exec(), completely
         | destroying its own state and replacing it with a 36K binary,
         | freeing 10GB RAM.
         | 
         | So there's two ways to go there:
         | 
         | 1. You could require step 2 to be a full copy. Which means
         | either you need more RAM, a huge chunk of which would always
         | sit idle, or you need a lot of swap, for the same purpose.
         | 
         | 2. We could overlook the memory usage increase and pretend that
         | we have enough memory, and only really panic if the second
         | process truly needs its own 10GB RAM that we don't have. That's
         | what Linux does.
         | 
         | The problem with #2 is that dealing with this happens
         | completely in the background, at times completely unpredictable
         | to the code. The OS allocates memory when the child changes
         | memory, like does "a=1" somewhere. A program can't handle
         | memory allocations failures there because as far as it knows,
         | it's not allocating anything.
         | 
         | So what you get is this fragile fiction that sometimes breaks
         | and requires the kernel to kill something to maintain the
         | system in some sort of working state.
         | 
         | Windows doesn't have this issue at all because it has no
         | fork(). New processes aren't children and start from scratch,
         | so firefox never gets another 10GB sized clone. It just starts
         | a new, 36K sized process.
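          | 
          | The pattern in steps 1-3 boils down to this (a plain
          | fork()/exec() sketch, not the browser's actual code):
          | 
          |     #include <stdio.h>
          |     #include <unistd.h>
          |     #include <sys/wait.h>
          | 
          |     int main(void) {
          |         /* child starts as a logical copy of the
          |            parent; pages are shared copy-on-write */
          |         pid_t pid = fork();
          |         if (pid == 0) {
          |             /* throws away the copied address space
          |                and loads a tiny binary instead */
          |             execlp("uname", "uname", "-a", (char *)NULL);
          |             perror("execlp");
          |             _exit(127);
          |         }
          |         waitpid(pid, NULL, 0);
          |         return 0;
          |     }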
        
           | l33tman wrote:
            | The Windows example is a non sequitur, as you end up with the
            | 36K sized process on both Windows and Linux if you want to
            | spawn a sub-process that exec's. The fork() =>
           | exec() path is not the issue (if there is an issue at all
           | here), and if you use threading the memory is not forked like
           | this to start with (on either of the OSes).
           | 
           | I guess the case you want to highlight is more if you for
           | example mmap() 10 GB of RAM on that 16 GB machine that only
           | has 5 GB unused swap space left and where all of the physical
           | RAM is filled with dirty pages already. Should the mmap()
           | succeed, and then the process is killed if it eventually
           | tries to use more pages than will fit in RAM or the backing
           | swap? This is the overcommit option which is selectable on
           | Linux. I think the defaults seem pretty good and accept that
           | a process can get killed long after the "explicit" memory
           | mapping call is done.
        
           | peheje wrote:
           | I find your writing style really pleasant and understandable!
           | Much more so than the StackExchange answer. I really like the
           | breakdown into steps, then what could happen steps and the
           | follow ups. Where can I read more (from you?) in this style
           | about OS and memory management?
        
           | josefx wrote:
           | > blame UNIX for that, and the fork() system call.
           | 
            | Given that most code I have seen would not be able to handle
            | an allocation failure gracefully, I wouldn't call it "blame".
            | If the OS just silently failed memory allocations on whatever
            | program tried to allocate next, you would basically end up
            | with a system where random applications crash, which is
            | similar to what the OOM killer does, just with no attempt to
            | be smart about it. Even better, it is outright impossible to
           | gracefully handle allocation failures in some languages, see
           | for example variable length arrays in C.
        
             | Arnavion wrote:
             | No code is written to handle allocation failure because it
             | knows that it's running on an OS with overcommit where
             | handling allocation failure is impossible. Overcommit means
             | that you encounter the problem not when you call `malloc()`
             | but when you do `*pointer = value;`, which is impossible to
             | handle.
        
               | kevin_thibedeau wrote:
               | Plenty of code runs on systems without that behavior.
               | Graceful handling of malloc failure is still useful.
        
               | Arnavion wrote:
               | I know. I myself write code that checks the result of
               | malloc. I was responding with josefx's words.
        
             | tjoff wrote:
             | That is a very weak argument.
             | 
             | Also, why would you bother to handle it gracefully when the
             | OS won't allow you to do it?
             | 
             | Also, outright impossible in some languages? Just don't use
              | VLAs then? "Problem" solved.
        
               | josefx wrote:
               | > Also, why would you bother to handle it gracefully when
               | the OS won't allow you to do it?
               | 
               | There are many situations where you can get an allocation
               | failure even with over provisioning enabled.
               | 
               | > Just don't use VLAs if then? "Problem" solved.
               | 
               | Yes, just don't use that language feature that is
               | visually identical to a normal array. Then make sure that
               | your standard library implementation doesn't have random
               | malloc calls hidden in functions that cannot communicate
               | an error and abort instead
               | https://www.thingsquare.com/blog/articles/rand-may-call-
               | mall.... Then ensure that your dependencies follow the
               | same standards of handling allocation failures ... .
               | 
               | I concede that it might be possible, but you are working
               | against an ecosystem that is actively trying to sabotage
               | you.
        
               | tjoff wrote:
               | VLAs are barely used and frowned upon by most. It is not
               | relevant enough to discuss.
               | 
                | Yes, mallocs in the standard library are a problem. But
                | this is rather the result of a mindset where
                | overprovisioning exists than anything else.
        
           | jacquesm wrote:
           | > 2. Instantly, you have two 10GB processes
           | 
           | No, that's not how it works. The process table gets
           | duplicated and copy-on-write takes care of the pages. As long
           | as they are identical they will be shared, there is no way
           | that 10GB of RAM will be allocated to the forked process and
           | that all of the data will be copied.
        
             | blitzkrieg3 wrote:
             | This is the only right answer. What actually happens is you
             | instantly have two 10G processes which share the same
             | address space, and:
             | 
             | 3. A microsecond later, the child calls exec(),
             | decrementing the reference count to the memory shared with
             | the parent[1] and faulting in a 36k binary, bringing our
              | new total memory usage to 10,485,796KB (10,485,760KB + 36KB)
             | 
             | CoW has existed since at least 1986, when CMU developed the
             | Mach kernel.
             | 
             | What GP is really talking about is overcommit, which is a
             | feature (on by default) in Linux which allows you to ask
             | for more memory than you have. This was famously a
             | departure from other Unixes at the time[2], a departure
             | that fueled confusion and countless flame wars in the early
             | Internet.
             | 
             | [1] https://unix.stackexchange.com/questions/469328/fork-
             | and-cow... [2] https://groups.google.com/g/comp.unix.solari
             | s/c/nLWKWW2ODZo/...
        
             | worthless-trash wrote:
             | OP is referring to what would happen in the naive
             | implementation, not what actually happens.
        
             | reisse wrote:
                | The data won't be copied, but the kernel has to reserve the
             | memory for both processes.
        
               | l33tman wrote:
               | No, it doesn't. See "overcommit"
        
               | enedil wrote:
               | In a sense it does. Page tables would need to be copied.
        
               | jacquesm wrote:
               | Yes, but they all point to the _same_ pages. The tables
               | take up a small fraction of the memory of the pages
               | themselves.
        
               | enedil wrote:
               | But large fraction, if all you do afterwards is an exec
               | call. Given 8 bytes per page table entry and 4k pages,
               | it's 1/512 memory wasted. So if your process uses 8GB,
               | it's 16MB. Still takes noticeable time if you spawn
               | often.
        
               | lxgr wrote:
               | Aren't page tables nested? I don't know if any OS or
               | hardware architecture actually supports it, but I could
               | imagine the parent-level page table being virtual and
               | copy-on-write itself.
        
               | jacquesm wrote:
               | I've never had the page tables be the cause of out of
                | memory issues. Besides, they are usually pre-allocated to
                | avoid recursive page faults, but nothing
               | would stop you from making the page tables themselves
               | also copy-on-write during a fork.
        
               | reisse wrote:
               | Please read the parent comments. Overcommit is necessary
               | exactly because kernel has to reserve memory for both
               | processes, and overcommit allows to reserve more memory
               | than there is physically present.
               | 
               | If kernel could not reserve memory for forked process,
               | overcommit would not be necessary.
        
               | blitzkrieg3 wrote:
               | This is a misconception you and parent are perpetuating.
               | fork() existed in this problematic 2x memory
               | implementation _way_ before overcommit, and overcommit
               | was non-existent or disabled on Unix (which has fork())
               | before Linux made it the default. Today with CoW we don't
               | even have this "reserve memory for forked process"
               | problem, so overcommit does nothing for us with regard to
               | fork()/exec() (to say nothing of the vfork()/clone()
               | point others have brought up). But if you want you can
               | still disable overcommit on linux and observe that your
               | apps can still create new processes.
               | 
               | What overcommit enables is more efficient use of memory
               | for applications that request more memory than they use
               | (which is most of them) and more efficient use of page
               | cache. It also pretty much guarantees an app gets memory
               | when it asks for it, at the cost of getting oom-killed
               | later if the system as a whole runs out.
        
               | lxgr wrote:
               | I think you've got it backwards: With overcommit, _there
                | is no memory reservation_. The forked process gets an
                | exact copy of the other's page table, but with all
               | writable memory marked as copy-on-write instead. The
               | kernel might well be tallying these up to some number,
               | but nothing important happens with it.
               | 
                | Only without overcommit does the kernel need to
               | start accounting for hypothetically-writable memory
               | before it actually is written to.
        
               | reisse wrote:
               | Yes, you're right here, that was what I wanted to say but
               | probably was not able to formulate.
        
             | senko wrote:
             | This is more or less what the second 2. explains:
             | 
             | > 2. We could overlook the memory usage increase and
             | pretend that we have enough memory, and only really panic
             | if the second process truly needs its own 10GB RAM that we
             | don't have. That's what Linux does
             | 
             | "pretend" - share the memory and hope most of it will be
             | read-only or unallocated eventually; "truly needs to own" -
             | CoW
        
               | jacquesm wrote:
               | It will _never_ happen. To begin with all of the code
               | pages are going to be shared because they are not
               | modified.
               | 
               | Besides that the bulk of the fork calls are just a
               | preamble to starting up another process and exiting the
               | current one. It's mostly a hack to ensure continuity for
               | stdin/stdout/stderr and some other resources.
        
               | msm_ wrote:
                | It will _most likely_ not happen? It's absolutely
               | possible to write a program that forks and both forks
               | overwrite 99% of shared memory pages. It almost never
               | happens, which is GP's point, but it's possible and the
               | reason it's a fragile hack.
               | 
               | What usually happens in practice is you're almost OOM,
               | and one of the processes running in the system writes to
               | a page shared with another process, forcing the system to
               | start good ol' OOM killer.
        
               | jacquesm wrote:
               | 99% isn't 100%.
               | 
                | Sorry, but no, it _can't_ happen, you cannot fork a
               | process and end up with twice the memory requirements
               | just because of the fork. What you can do is to simply
               | allocate more memory than you were using before and keep
               | writing.
               | 
               | The OOM killer is a nasty hack: it essentially moves the
               | decision about what stays and what goes to a process that
               | is making calls way above its pay grade. But overcommit
               | and the OOM killer go hand in hand.
        
               | [deleted]
        
               | blitzkrieg3 wrote:
               | It does not happen using fork()/exec() as described
               | above. For it to happen we would need to fork() and
               | continue using old variables and data buffers in the
               | child that we used in the parent, which is a valid but
               | rarely used pattern.
        
           | RcouF1uZ4gsC wrote:
           | In retrospect, fork was a big mistake. It complicates a
           | bunch of other things and introduces suboptimal patterns.
        
             | t0suj4 wrote:
             | A small reminder: in the age of Unix, multiuser systems
             | were very common. Fork was the optimal solution for
             | serving as many concurrent users or programs as possible
             | while keeping the implementation simple.
             | 
             | Today's RAM is cheap.
        
               | moonchrome wrote:
               | So many of the design constraints behind our base
               | abstractions are not relevant today, but we're still
               | cobbling together solutions built on those legacy
               | technical decisions.
        
               | archgoon wrote:
               | [dead]
        
           | reisse wrote:
           | > Blame UNIX for that, and the fork() system call.
           | 
           | At least that design failure of UNIX was fixed long ago.
           | There are posix_spawn(3) and various clone(2) flavours
           | which allow spawning a new process without copying the
           | old one. And a lot of memory-intensive software actually
           | uses them, so modern Linux distros can be used without
           | memory overprovisioning.
           | 
           | I'd rather blame people who are still using fork(2) for
           | anything that can consume more than 100MB of memory.
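           | 
           | Rough sketch of the spawn path (untested, error handling
           | mostly omitted; "ls -l /tmp" is just a placeholder
           | command):
           | 
           |   #include <spawn.h>
           |   #include <stdio.h>
           |   #include <sys/wait.h>
           | 
           |   extern char **environ;
           | 
           |   int main(void) {
           |       pid_t pid;
           |       char *argv[] = { "ls", "-l", "/tmp", NULL };
           | 
           |       /* No copy of the parent's address space is
           |          ever set up. */
           |       int err = posix_spawnp(&pid, "ls", NULL, NULL,
           |                              argv, environ);
           |       if (err != 0) {
           |           fprintf(stderr, "spawn failed: %d\n", err);
           |           return 1;
           |       }
           |       waitpid(pid, NULL, 0);
           |       return 0;
           |   }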
        
             | drougge wrote:
             | I'm someone who likes to use fork() and then actually use
             | both processes as they are, with shared copy-on-write
             | memory. I'm happy to use it on things consuming much more
             | than 100MB of memory. In fact that's where I like it the
             | most. I'm probably a terrible person.
             | 
             | But what would be better? This way I can massage my data in
             | one process, and then fork as many other processes that use
             | this data as I like without having to serialise it to disk
             | and then load it again. If the data is not modified
             | after fork it consumes much less memory (only the page
             | tables). Usually a little is modified, consuming only a
             | little memory extra. If all of it is modified it doesn't
             | consume more memory than I would have otherwise (hopefully,
             | not sure if the Linux implementation still keeps the pre-
             | fork copy around).
             | 
             | (And no, not threads. They would share modifications, which
             | I don't want. Also since I do this in python they would
             | have terrible performance.)
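             | 
             | In C the pattern looks roughly like this (untested
             | sketch; the Python version is the same idea, and
             | NWORKERS and the dataset are made up):
             | 
             |   #include <stdlib.h>
             |   #include <string.h>
             |   #include <sys/wait.h>
             |   #include <unistd.h>
             | 
             |   #define NWORKERS 4
             | 
             |   int main(void) {
             |       /* "Massage" the data once, in the parent. */
             |       size_t n = 512u * 1024 * 1024;
             |       unsigned char *data = malloc(n);
             |       memset(data, 42, n);
             | 
             |       for (int i = 0; i < NWORKERS; i++) {
             |           if (fork() == 0) {
             |               /* Workers only read the CoW pages,
             |                  so (almost) nothing is copied. */
             |               unsigned long sum = 0;
             |               for (size_t j = i; j < n; j += NWORKERS)
             |                   sum += data[j];
             |               _exit(sum & 0xff);
             |           }
             |       }
             |       while (wait(NULL) > 0) {}
             |       free(data);
             |       return 0;
             |   }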
        
               | reisse wrote:
               | So if I got it right, you're using fork(2) as a glorified
               | shared memory interface. If my memory is (also) right,
               | you can allocate a shared read-only mapping with
               | shm_open(3) + mmap(2) in the parent process, and open it
               | as a private copy-on-write mapping in the child
               | processes.
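               | 
               | Something along these lines, if I remember the flags
               | right (untested sketch; the "/dataset" name is made
               | up and error handling is omitted):
               | 
               |   #include <fcntl.h>
               |   #include <string.h>
               |   #include <sys/mman.h>
               |   #include <sys/stat.h>
               |   #include <sys/wait.h>
               |   #include <unistd.h>
               | 
               |   int main(void) {
               |       size_t n = 4096;  /* toy size */
               | 
               |       /* Parent: create and fill the object. */
               |       int fd = shm_open("/dataset",
               |                         O_CREAT | O_RDWR, 0600);
               |       ftruncate(fd, n);
               |       char *p = mmap(NULL, n,
               |                      PROT_READ | PROT_WRITE,
               |                      MAP_SHARED, fd, 0);
               |       strcpy(p, "prepared data");
               | 
               |       if (fork() == 0) {
               |           /* Any process can open it by name and
               |              map it MAP_PRIVATE: reads share the
               |              parent's pages, writes stay local. */
               |           int cfd = shm_open("/dataset",
               |                              O_RDONLY, 0);
               |           char *q = mmap(NULL, n,
               |                          PROT_READ | PROT_WRITE,
               |                          MAP_PRIVATE, cfd, 0);
               |           q[0] = 'X';  /* parent copy unchanged */
               |           _exit(0);
               |       }
               |       wait(NULL);
               |       shm_unlink("/dataset");
               |       return 0;
               |   }
               | 
               | (Older glibc may need -lrt for shm_open.)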
        
               | jacquesm wrote:
               | No, he's using fork the way it is intended.
               | 
               | Shared memory came _much_ later than fork did.
        
               | Asooka wrote:
               | I have used fork as a stupid simple memory arena
               | implementation. fork(); do work in the child; only
               | malloc, never free; exit. It is much, much heavier than a
               | normal memory arena would be, but also much simpler to
               | use. Plus, if you can split the work in independent
               | batches, you can run multiple children at a time in
               | parallel.
               | 
               | As with all such stupid simple mechanisms, I would not
               | advise its use if your program spans more than one .c
               | file and more than a thousand lines.
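               | 
               | In outline it's something like this (untested
               | sketch; do_batch and the batch count are
               | placeholders for real work):
               | 
               |   #include <stdlib.h>
               |   #include <sys/wait.h>
               |   #include <unistd.h>
               | 
               |   /* Hypothetical per-batch work: allocate
               |      freely, never free anything. */
               |   static void do_batch(int batch) {
               |       (void)batch;
               |       for (int i = 0; i < 1000; i++)
               |           (void)malloc(1024);  /* on purpose */
               |       /* ... real work would go here ... */
               |   }
               | 
               |   int main(void) {
               |       /* Batches can also run in parallel. */
               |       for (int b = 0; b < 8; b++) {
               |           if (fork() == 0) {
               |               do_batch(b);
               |               _exit(0);  /* frees it all at once */
               |           }
               |       }
               |       while (wait(NULL) > 0) {}
               |       return 0;
               |   }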
        
             | lxgr wrote:
             | Unless you're sure that you're going to write to the
             | majority of the copy-on-write memory resulting from fork(),
             | this seems like overkill.
             | 
             | Maybe there should be yet another flavor of fork() that
             | does copy-on-write, but treats the memory as already-copied
             | for physical memory accounting purposes? (Not sure if
             | "copy-on-write but budget as distinct" is actually
             | representable in Linux's or other Unixes' memory model,
             | though.)
        
             | panzi wrote:
             | posix_spawn() is great, but Linux doesn't implement it as
             | a system call; glibc implements it in userspace on top of
             | fork()+exec(). Some other Unix(-like) OSes do implement
             | posix_spawn() as a system call. Also, while posix_spawn()
             | covers the vast majority of cases, if it doesn't cover
             | certain process setup options that you need, you still
             | have to use fork()+exec(). But yeah, it would be good if
             | Linux had it as a system call. It would probably help
             | PostgreSQL.
        
               | the8472 wrote:
               | glibc uses vfork+exec to implement posix_spawn, which
               | makes it much faster than fork+exec.
        
             | AnIdiotOnTheNet wrote:
             | Can a modern distro really be used without
             | overprovisioning? Because the last time I tried it,
             | either the DE or the display server hard-locked
             | immediately and I had to reboot the system.
             | 
             | Having this ridiculous setting as the default has basically
             | ensured that we can never turn it off because developers
             | expect things to work this way. They have no idea what to
             | do if malloc errors on them. They like being able to make
             | 1TB allocs without worrying about the consequences and just
             | letting the kernel shoot processes in the head randomly
             | when it all goes south. Hell, the last time this came up
             | many swore that there was literally nothing a programmer
             | _could_ do in the event of OOM. Learned helplessness.
             | 
             | It's a goddamned mess and like many of Linux's goddamned
             | messes not only are we still dealing with it in 2023, but
             | every effort to do anything about it faces angry ranty
             | backlash.
        
               | lxgr wrote:
               | Almost everything in life is overprovisioned, if you
               | think about it: Your ISP, the phone network, hospitals,
               | bank reserves (and deposit insurance)...
               | 
               | What makes the approach uniquely unsuitable for memory
               | management? The entire idea of swapping goes out of the
               | window without overprovisioning as well, for better or
               | worse.
        
               | mrguyorama wrote:
               | What an absurdly whataboutism-filled response.
               | Meanwhile, Windows has been doing it the correct way for
               | 20 years or more and never has to kill a random process
               | just to keep functioning.
        
               | lxgr wrote:
               | So you're saying the correct way to support fork() is
               | to... not support it? This seems pretty wasteful in the
               | majority of scenarios.
               | 
               | For example, it's a common pattern in many languages and
               | frameworks to preload and fully initialize one worker
               | process and then just fork that as often as required. The
               | assumption there is that, while most of the memory is
               | _theoretically_ writable, practically, much of it is
               | written exactly once and can then be shared across all
               | workers. This both saves memory and the time needed to
               | uselessly copy it for every worker instance (or
               | alternatively to re-initialize the worker every single
               | time, which can be costly if many of its data structures
               | are dynamically computed and not just read from disk).
               | 
               | How do you do that without fork()/overprovisioning?
               | 
               | I'm also not sure whether "giving other examples" fits
               | the bill of "whataboutism", as I'm not listing other
               | examples of bad things to detract from a bad thing under
               | discussion - I'm claiming that all of these things are
               | (mostly) good and useful :)
        
               | AnIdiotOnTheNet wrote:
               | Perhaps there is some confusion because I used
               | "overprovision" when the appropriate term here is
               | "overcommit", but Windows manages to work fine without
               | Unix-style overcommit. I suspect most OSes in history do
               | not use Unix's style of overcommit.
               | 
               | > What makes the approach uniquely unsuitable for memory
               | management?
               | 
               | The fact that something like the OOM killer even needs
               | to exist. Killing random processes to free up memory you
               | blindly promised but couldn't deliver is not a
               | reasonable way to do things.
               | 
               | Edit: https://lwn.net/Articles/627725/
        
           | 0x0 wrote:
           | You don't HAVE to use fork() + exec() in Linux as far as I
           | know. There is for example posix_spawn() which does the
           | double whammy combo for you.
           | 
           | Also, if you DO use fork() without immediately doing an
           | exec(), and start writing all over your existing
           | allocations... just don't?
        
             | the8472 wrote:
             | posix_spawn isn't powerful enough, since it only supports
             | a limited set of process setup operations. So if you need
             | to make some specific syscalls before exec that aren't
             | covered by it, then the fork/exec dance is still
             | necessary. In principle one can use vfork+exec instead,
             | but that's very hard to get right.
             | 
             | What's really needed is io_uring_spawn but that's still a
             | WIP. https://lwn.net/Articles/908268/
        
           | whartung wrote:
           | We actually ran into this a long time ago with Solaris and
           | Java.
           | 
           | Java has JSP, Java Server Pages. JSP processing translates a
           | JSP file into Java source code, compiles it, then caches,
           | loads, and executes the resulting class file.
           | 
           | Back then, the server would invoke the javac compiler through
           | a standard fork and exec.
           | 
           | That's all well and good, save when you have a large app
           | server image sucking up the majority of the machine. As
           | far as we could tell, it was a copy-on-write kind of
           | process; it didn't actually try to copy everything when
           | forking the app server. Rather, it tried to reserve the
           | allocation, found it didn't have the space or swap, and
           | just failed with a system OOM error (which differs from a
           | Java out-of-memory/heap error).
           | 
           | As I recall adding swap was the short term fix (once we
           | convinced the ops guy that, yes it was possible to "run out
           | of memory"). Long term we made sure all of our JSPs were pre-
           | compiled.
           | 
           | Later, this became a non issue for a variety of reasons,
           | including being able to run the compiler natively within a
           | running JVM.
        
           | KMag wrote:
           | Option 3: vfork() has existed for a long time. The child
           | process temporarily borrows all of the parent's address
           | space. The calling process is frozen until the child
           | exits or calls a flavor of exec*. Granted, it's pretty
           | brittle, and any modification of memory other than
           | changing a variable of type pid_t is undefined behavior
           | before exec* is called. However, it gets around the
           | disadvantages of fork() while maintaining all of the
           | flexibility of Unix's separation of process creation
           | (fork/vfork) and process initialization (exec*).
           | 
           | vfork followed immediately by exec gives you Windows-like
           | process creation, and, last I checked, despite having the
           | overhead of a second syscall, it was still faster than
           | process creation on Windows.
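           | 
           | The usual pattern looks roughly like this (untested
           | sketch; "ls -l /tmp" is just a placeholder command,
           | error handling omitted):
           | 
           |   #include <sys/wait.h>
           |   #include <unistd.h>
           | 
           |   int main(void) {
           |       /* Parent is suspended; no page tables copied. */
           |       pid_t pid = vfork();
           |       if (pid == 0) {
           |           /* Only exec* or _exit is safe here. */
           |           execlp("ls", "ls", "-l", "/tmp", (char *)NULL);
           |           _exit(127);  /* only if exec failed */
           |       }
           |       waitpid(pid, NULL, 0);
           |       return 0;
           |   }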
        
           | eru wrote:
           | Keep in mind that copy-on-write makes analysing the situation
           | a bit more complicated.
        
             | nnntriplesec wrote:
             | How so?
        
               | pdpi wrote:
               | CoW is a strategy where you don't actually copy memory
               | until you write to it. So, when the 10GB process spawns a
               | child process, that child process also has 10GB of
               | virtual memory, but both processes are backed by the same
               | pages. It's only when one of them writes to a page that a
               | copy happens. When you fork+exec you never actually touch
               | most of those pages, so you never actually pay for them.
               | 
               | (Obviously, that's the super-simplified version, and I
               | don't fully understand the subtleties involved, but
               | that's exactly what GP means: it's harder to analyse)
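               | 
               | A toy illustration of the write-triggers-a-copy
               | behaviour (untested sketch; the 64MB buffer just
               | stands in for the big process):
               | 
               |   #include <stdio.h>
               |   #include <stdlib.h>
               |   #include <string.h>
               |   #include <sys/wait.h>
               |   #include <unistd.h>
               | 
               |   int main(void) {
               |       size_t n = 64u * 1024 * 1024;
               |       char *data = malloc(n);
               |       memset(data, 'A', n);
               | 
               |       if (fork() == 0) {
               |           /* First write copies just this page;
               |              the rest stays shared. */
               |           data[0] = 'B';
               |           printf("child:  %c\n", data[0]);
               |           _exit(0);
               |       }
               |       wait(NULL);
               |       printf("parent: %c\n", data[0]);  /* A */
               |       free(data);
               |       return 0;
               |   }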
        
               | eru wrote:
               | Thanks for writing the explanation.
               | 
               | To make it slightly more complicated: you don't pay for
               | the 10 GB directly, but you still pay for setting up the
               | metadata, and that scales with the amount of virtual
               | memory used.
        
         | progbits wrote:
         | If you would like to go deeper down the rabbit hole,
         | fasterthanlime recently made a few videos that explore this
         | topic:
         | 
         | https://youtu.be/YB6LTaGRQJg https://youtu.be/c_5Jy_AVDaM
         | https://youtu.be/DpnXaNkM9_M
         | 
         | The code is in Rust but that doesn't matter for the
         | explanation.
        
         | skyeto wrote:
         | What I find to be a rather interesting tidbit related to this
         | is that some applications (e.g. certain garbage collectors) map
         | multiple ranges of virtual memory that address the same
         | physical memory.
        
           | yakubin wrote:
           | That's how you can create a gapless ring buffer:
           | <https://learn.microsoft.com/en-
           | us/windows/win32/api/memoryap...> (see scenario 1 in
           | examples).
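           | 
           | The Linux equivalent is usually to map the same memfd
           | twice, back to back; something like this (untested,
           | Linux-only sketch because of memfd_create, error
           | handling omitted):
           | 
           |   #define _GNU_SOURCE
           |   #include <stdio.h>
           |   #include <string.h>
           |   #include <sys/mman.h>
           |   #include <unistd.h>
           | 
           |   int main(void) {
           |       size_t sz = 1 << 16;  /* page-size multiple */
           |       int fd = memfd_create("ring", 0);
           |       ftruncate(fd, sz);
           | 
           |       /* Reserve 2*sz of address space, then map the
           |          same file into both halves. */
           |       char *b = mmap(NULL, 2 * sz, PROT_NONE,
           |                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
           |       mmap(b, sz, PROT_READ | PROT_WRITE,
           |            MAP_SHARED | MAP_FIXED, fd, 0);
           |       mmap(b + sz, sz, PROT_READ | PROT_WRITE,
           |            MAP_SHARED | MAP_FIXED, fd, 0);
           | 
           |       /* A write past the "end" wraps to the start. */
           |       memcpy(b + sz - 2, "wrap", 4);
           |       printf("%c%c\n", b[0], b[1]);  /* "ap" */
           |       return 0;
           |   }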
        
       ___________________________________________________________________
       (page generated 2023-04-05 23:01 UTC)