[HN Gopher] Stack size invisibility in C and the effects on "por...
       ___________________________________________________________________
        
       Stack size invisibility in C and the effects on "portability"
        
       Author : todsacerdoti
       Score  : 157 points
       Date   : 2021-09-28 14:07 UTC (8 hours ago)
        
 (HTM) web link (utcc.utoronto.ca)
 (TXT) w3m dump (utcc.utoronto.ca)
        
       | shadowgovt wrote:
       | I once worked on a project where we were building a browser
       | plugin in IE (back when plugins existed) and found ourselves
       | needing to use Open Dynamics Engine within the plugin. Under the
        | hood, ODE uses `alloca` a _lot_... Once per frame to set aside
        | enough RAM to compute a Jacobian that solves for all active
        | collisions. But on Windows, "redzone protection" can be
        | configured to fault the program if the stack takes a flying leap
        | of more than a (relatively) paltry few kB off the current top of
        | the stack -- which is the default configuration of redzone
        | protection on most single-threaded Windows apps. From ODE's
        | point of view this was fine; they didn't really care about the
        | "compiling on Windows" use case at all and considered our
        | problem to be ours.
       | But this proved disastrous in the context of running on Windows
       | in a thread, and since we were running in IE's process context,
       | we couldn't change the thread allocation rules because they're
       | baked into the process itself.
       | 
       | The nice thing about C++ is that the preprocessor is a monster.
       | We #define'd all references to alloca in the ODE codebase away
       | into calls to our own function that returned, instead of a bare
       | pointer, an object with the right pointer-dereference semantics
       | to act like a bare pointer but that was really a reference to a
       | slice of a colossal chunk of heap memory. When the context of the
       | alloca call closed, the returned object was destroyed, and its
       | destructor moved our imaginary "stack" pointer in the big chunk
       | of heap memory back into place.
       | 
       | Ugliest thing I ever wrote. ;)
        
       | pikhq wrote:
       | This article is a bit strange. It's referring to the thread stack
       | size but acting as though it's the main stack size. It's true
       | there's no portable C API for doing anything with the main stack
       | size, but musl libc doesn't behave any differently than glibc
       | here -- the main stack is allocated by the kernel. The thread
        | stack size on musl is small, yes, but the only way to even end up
        | in a thread involves using the pthread API, which lets you
        | configure a thread's stack size via pthread_attr_setstacksize.
        | You can find it here:
       | https://pubs.opengroup.org/onlinepubs/9699919799/functions/p...
       | 
       | Granted, it is maddening that C does not offer any great ways to
       | reason about what stack size is needed, but it is simply nonsense
       | to imply (as people seem to be inferring here) that there's no
       | way for an application to handle running into musl's stack size
       | limits.
        
       | unwind wrote:
        | It's interesting that nobody has mentioned the fact that the word
        | "stack" never appears in the C standard (caveat: I don't own it,
        | but I just checked a draft copy). In other words, there is no
        | assumption/guarantee that a C compiler will rely on the machine
        | stack in order to implement certain features (e.g. automatic
        | variables, parameter passing, and function call/return).
       | 
       | So I guess that means that an imaginary API to query these things
       | should not talk about "stack", since that is making an assumption
       | that is a bit out of line. That does not make naming any easier
       | ... and, as pointed out by vasama, it would be hard to use an API
       | that queries about the current function only. Perhaps something
       | like "autosizeof f" where f is a function could return, at
       | compile-time, the projected "automatic storage size" of the given
       | function (perhaps making it into a compilation error if f() uses
       | VLAs, or calls functions that do), and a runtime "autoleft()"
       | function could be used to determine how much space is left.
       | 
       | Uh, no, probably not. These things are hard, and I'm making
       | assumptions about code analysis capabilities that are probably
       | not realistic.
        
          | Joker_vD wrote:
          | LLVM-MOS targets the 6502 and so goes out of its way to use as
          | little of the hardware stack as possible: it does whole-program
          | optimization, figuring out what calls what and which functions
          | don't need to be reentrant, so it can allocate as many
          | activation records as possible statically, FORTRAN-style.
        
         | silon42 wrote:
         | I can easily imagine compiler+linker being able to forbid
         | recursion and statically calculate maximum stack needed.
         | Probably you need to ban general function invocation via
         | pointers... methods on objects could be fine, you just evaluate
         | all virtual method override combinations.
        
           | noneeeed wrote:
           | The SPARK Ada language prevents recursion by only allowing
           | you to call methods that are defined above the method you are
           | in. It has a similar rule to prevent recursive calls between
           | packages although I forget the details of that. From this you
           | can calculate the worst case stack size. As all other memory
            | has to be statically allocated, you can always calculate the
            | maximum memory footprint of any program.
           | 
           | Obviously those restrictions mean you have to write your code
           | without recursion, but in practice the uses that SPARK is put
           | to don't tend to need it. A statically allocated stack in
           | your code and a loop can solve a lot of problems (along with
           | suitable logic for gracefully handling that stack filling
           | up).
        
         | Ironlikebike wrote:
          | The stack isn't in the C standard, but it is often minimally
         | defined in the psABIs (e.g., the ELF ABI on Linux) for the
          | architecture. The psABI generally tries to balance performance
          | of addressability (on the architecture) with security
          | (position-independence), while constraining the toolchain
          | developers' implementation as little as possible. As
         | a programmer you don't have to implement to the ABI, but if you
         | want interoperability and you want to use a C language
         | compiler, you should (good luck creating your own ABI
         | otherwise).
         | 
         | I'm not aware of any way to programmatically query the rules
         | laid out in the psABI. The implementation of the corner cases
         | is often based on mutual agreement (convention) between the
         | compiler and the link-editor.
         | 
         | You could probably write a tool that maintains a representation
         | of these rules and returns the results in some sort of parse-
         | friendly EBNF, but you'd need a convention linter to make sure
         | that the implementation of your tool tracks the change in
         | convention in the toolchain components.
         | 
          | On the other hand, you could write an extension to the compiler
         | that warns you when certain C code results in behaviors around
         | the stack that might be undesirable, i.e., if you're passing
         | too many parameters to a function and it's going to spill those
         | to the stack.
         | 
         | The standard way to avoid unknown behavior in this regard is to
         | explicitly tell the compiler what you want to accomplish by not
         | relying on auto-allocation or parameter passing rules (as
         | stated in the top thread). This means explicit allocation on
          | the heap, rather than allowing the compiler to auto-
         | allocate on the stack, and using pass-by-reference to avoid
         | parameters spilling to the stack. All of this is slower since
         | accessing values in the stack is often a single instruction
         | operation (small number of cycles), vs multi-instruction loads
         | from far-away addresses (large number of cycles).
        
         | pjc50 wrote:
         | Yeah, I've written C for the PIC16 which has a hardware stack
         | .. of depth 8.
         | https://electronics.stackexchange.com/questions/158307/is-th...
         | 
          | I'm not entirely sure it was a standards-compliant compiler
          | though. I think it may have disallowed recursion?
         | 
         | Mind you, I also think people have unreasonably high
         | expectations of the C standard; in order to get things done you
         | inevitably have to make compromises and use features that
         | aren't available on _every_ platform. Which leads to autoconf
          | hell, but if you're still developing on C you've accepted its
         | imperfections.
        
         | cryptonector wrote:
         | You ask for an automatic allocation of N bytes. You can use
         | alloca() if you have that available, or VLAs if you have that
         | available.
         | 
         | If you have neither, then you still have to worry that fixed
         | sized automatic allocations might be too large, but at least
         | this can be checked statically.
         | 
         | Given either alloca() or VLAs, those can be allocated on a
         | stack or on a heap, and the spec does not say either way. The
         | OS might over-commit memory, too, and if the C run-time has
         | fixed-sized stacks there may or may not be a guard page at the
         | end.
         | 
          | However, even with all of that, you can use memcpy() or
          | similar (see elsewhere in this thread) to touch every byte, or
          | every 4KB's worth of bytes, in what might be the direction of
          | stack growth (if there's a stack; otherwise the direction is
          | irrelevant). That makes sure the allocated memory is
          | committed, and that any failure due to over-committing
          | (bounded stacks look like over-committed unbounded stacks) is
          | detected as early as possible and without corrupting memory
          | (e.g., by wildly exceeding stack bounds). Failure may ensue,
          | but it should be behavior defined by the machine rather than
          | undefined behavior.
        
           | alcover wrote:
           | > fixed sized automatic allocations [...] can be checked
           | statically
           | 
           | May I ask how you'd do this ? I need this for optimizing a C
           | project.
        
             | cryptonector wrote:
             | Check every function's automatic variables' sizes, add them
             | all up, and add whatever function call frame size the
             | compiler and ABI are using. Getting an exact accounting of
             | function call stack memory usage is tricky because you
             | might have framepointer-less code, and you might have
             | inlining of function calls, but you can figure it out
             | assuming no inlining and assuming you don't enable
             | framepointer-less-ness.
        
               | alcover wrote:
               | Thank you. It doesn't err.. look super straightforward. I
               | was hoping for an automated way of some sort.
               | 
               | This stack affair of `local arrays are super fast! but
               | keep it small, 1024 should be okay` is very puzzling to
               | aspiring C lovers like me.
        
               | cryptonector wrote:
               | There's a way to dump DWARF debug information that you
               | might then be able to parse and use to do just this.
        
              | jart wrote:
              | You can guarantee undetected stack overflow is impossible
              | if you do two things. First you mprotect() a 4096-byte
              | page at the bottom of the stack. Second, you program the
              | compiler to fail if an individual function allocates a
              | stack frame larger than that.
              | 
              |     #ifndef STACK_FRAME_UNLIMITED
              |     #pragma GCC diagnostic error "-Wframe-larger-than=4096"
              |     #if __GNUC__ >= 9
              |     #pragma GCC diagnostic error "-Walloca-larger-than=1024"
              |     #pragma GCC diagnostic error "-Wvla-larger-than=1024"
              |     #endif /* GCC 9+ */
              |     #endif /* STACK_FRAME_UNLIMITED */
        
         | b3morales wrote:
         | Conversely, I don't think `malloc` et al. say anything about a
         | "heap". All they promise is they'll allocate some memory, with
         | certain size/alignment and lifetime guarantees. It's up to the
         | platform to figure out how to fulfill those guarantees.
        
       | zabzonk wrote:
       | How can you predict how much stack you need? Isn't this basically
       | the halting problem?
        
         | [deleted]
        
         | phkahler wrote:
         | With the exception of recursion, static analysis can answer
         | that question.
        
         | Verdex wrote:
         | As far as I know, there aren't any languages which provide a
         | way to do this out of the box. If a language wants to provide
         | an environment where no stack overflows can happen, then it
         | needs to provide a runtime where the stack programs use is
         | transplanted over to the heap as it gets too large (or maybe it
         | just starts out there). At this point the answer is that you
         | only need a stack big enough to hold your runtime (which you
         | can engineer to be sufficiently small always).
         | 
         | Of course you can imagine a language where your function
         | signature includes how much stack space it takes up. The stack
         | size being recursively calculated by looking at all the
         | functions your function calls etc.
         | 
         | If recursion is included, then you're correct that this would
         | require solving the halting problem. The obvious solution would
         | be to just not allow recursion. Although, if you also consider
          | languages like Idris (which are technically sometimes not
          | Turing complete) you can set up a termination checker and then
         | certain rules can give you an upper bound on how long a
         | recursive function will run (and therefore how much stack it
         | takes up).
        
       | rurban wrote:
       | Well, there's still libsigsegv and its
        | stackoverflow_install_handler. You won't get the size, but you
        | do get the overflows.
        
       | CJefferson wrote:
        | The invisibility of the stack size is one of my continual
        | irritations. It makes it difficult to write tree searches in the
        | obvious recursive way, because there is a good chance one day
        | you will have an input big enough to overflow the stack. I wish
        | that on modern 64-bit OSes the default thread stack size was
        | just made a decent size (128MB or even 1GB) -- would it really
        | cost that much, given that until the memory was touched it
        | wouldn't really be allocated?
       | 
       | Your choices are:
       | 
       | * Make a "stack" yourself, in heap memory, and implement your
       | algorithm to manually push/pop from the stack you implemented
       | (often not hard to do, but feels like reimplementing for no
       | reason).
       | 
       | * Spawn a new thread, where in most languages you can set the
       | stack size. This is what I tend to do, although it does seem like
       | a waste to have an extra thread for no reason other than to get
       | more stack.
       | 
       | You can (in principle) set the stack size when you compile a
       | program, but how to do it seems to differ between OSes and
       | compilers, so I prefer the new thread strategy as an in-language
       | option.
        
         | LegionMammal978 wrote:
         | I recall that the Rust compiler uses the stacker crate [0] to
         | grow the stack into the heap, and it's under the rust-lang
         | organization on GitHub, so it's probably the closest you can
         | get to an "in-language option" without it actually being in the
         | standard library. However, it currently isn't supported for
         | several platforms due to the lack of a suitable OS interface.
         | 
         | [0] https://docs.rs/stacker/0.1.14/stacker/
        
         | kevincox wrote:
         | > would it really cost that much (as until the memory was
         | touched it wouldn't really be allocated).
         | 
         | The problem here is that once it is used once it is never
         | freed. (Other than being swapped out). So you effectively cache
         | all of the space your program needs forever.
         | 
         | There are some solutions like trying to mark the stack below
         | you freeable but it is hard to do this in a way that doesn't
         | cause huge performance concerns if you happen to be crossing
         | whatever boundary is picked often. IIRC Golang used to do this
         | but switched to copying, growable stacks because no matter how
         | hard they tried they always ended up in scenarios where they
         | were allocating and deallocating stacks in a hot loop.
        
           | owl57 wrote:
           | > IIRC Golang used to do this but switched to copying,
           | growable stacks because no matter how hard they tried they
           | always ended up in scenarios where they were allocating and
           | deallocating stacks in a hot loop.
           | 
           | They used to do segmented stacks. The bad overhead is not in
            | allocation, but in switching between segments when the stack
            | pointer oscillates around the boundary -- but if you want
            | to allocate a great many stacks that could grow large, you
            | don't have much choice: either this or tracking pointers that
            | need to be updated on reallocation. A normal contiguously
            | mapped stack, but with dynamic growing and shrinking of the
            | actually mapped pages, is an entirely different proposition.
        
             | Dylan16807 wrote:
             | Ah, that makes more sense, since you could easily limit
             | stack deallocation to once a second or whatever.
        
           | Pxtl wrote:
           | > The problem here is that once it is used once it is never
           | freed. (Other than being swapped out). So you effectively
           | cache all of the space your program needs forever.
           | 
           | What do you mean? It will get freed once you pop that stack-
           | frame won't it?
        
             | marcosdumay wrote:
             | No. It was never allocated either.
             | 
             | It was virtual memory before it was used, but once you use
             | it, it's mapped into physical memory and marked as dirty.
             | So it may be swapped out or live there, those are the only
             | two options.
             | 
             | It would take your OS being fully aware of your usage
             | schema for it to be able to free the memory.
        
           | LeifCarrotson wrote:
           | > So you effectively cache all of the space your program
           | needs forever.
           | 
           | All the space your program uses at the largest stack size,
           | you mean, which is different. Until you touch it, the huge
           | stack is purely virtual. And it only lasts until you restart
           | the program.
           | 
           | Yes, if you default to a 10 GB stack, then recursively tree-
           | search over 9 GB of data, then close the data and spend the
           | next week with 1 MB data that only touches the bottom of the
           | stack, you'll have 8.999 GB swapped out, which does suck.
           | 
           | But that only happens if you run the program on that 9 GB of
           | data. The alternative was to fault when you overflowed the
           | smaller stack when that data was imported.
           | 
           | It does mean if there's an infinite recursion bug, instead of
           | crashing quickly with a stack overflow, you have to wait
           | while it pages out other programs and fills the whole 10 GB
           | of stack...
        
             | kevincox wrote:
             | Yes, that is basically what I mean. If you are going to use
             | a very large amount of stack space you should consider
             | alternatively allocating something on the heap and freeing
             | it afterwards.
             | 
              | Yes, if most of your program is using roughly the same
              | amount of stack space then it is great. However, I think if
              | you are worried about running out then you are likely using
              | far more than the average part of your program.
              | 
              | So yes, it isn't intrinsically bad, but context is
              | important to evaluate the decision.
        
          | ximeng wrote:
          | Zig appears to have a proposed solution to this, similar to
          | your option 1: move recursion to the heap.
         | 
         | https://github.com/ziglang/zig/issues/1639
         | 
         | They're also looking at static analysis of the maximum stack
         | depth which might help in some cases.
        
           | ziml77 wrote:
           | Being able to make recursive functions use the heap would be
           | great. I assume it would need to be opt-in since you'll
           | change performance characteristics, but I almost always
           | rearrange recursive algorithms to use a loop and a heap
           | allocated stack structure instead. There's no reason the
           | compiler couldn't help do that for me.
        
         | phkahler wrote:
         | A good tree will have something like log(n) depth so normal
         | recursion should be fine.
         | 
         | Apart from recursive algorithms, stack size can be determined
         | by static analysis.
         | 
          | Which reminds me: what is a good tool for analyzing / viewing
          | a C++ program's call tree?
        
           | wallstprog wrote:
            | > Which reminds me: what is a good tool for analyzing /
            | viewing a C++ program's call tree?
           | 
           | As already mentioned, perf can capture call stacks, but those
           | are statistical samples, so not necessarily complete.
           | 
           | callgrind (https://www.valgrind.org/docs/manual/cl-
           | manual.html) by contrast can capture all call-stacks.
           | 
           | kCacheGrind/qCacheGrind
           | (https://kcachegrind.github.io/html/Home.html) can be used to
           | view the output from either perf or callgrind. Sources for
           | kcachegrind, including converters from perf, oprofile, etc.,
           | can be found at https://github.com/KDE/kcachegrind
        
           | xfer wrote:
           | https://llvm.org/docs/Passes.html#dot-callgraph-print-
           | call-g...
        
            | AnimalMuppet wrote:
            | Yeah. You'll run out of memory to hold the tree long before
            | you overflow the stack while recursing down it.
        
           | stncls wrote:
           | > A good tree will have something like log(n) depth so normal
           | recursion should be fine.
           | 
           | Yes, this is true for comparison/search/sort/bucketing
           | algorithms. But in other applications (e.g. graph theory,
           | networks, sparse matrices) a depth of n, for n elements, can
           | be perfectly legitimate. Or it can even be an edge case, but
           | one you don't want to handle by just segfaulting.
           | 
            | > Which reminds me: what is a good tool for analyzing /
            | viewing a C++ program's call tree?
           | 
           | Linux perf can collect the necessary data. Then for
           | frontends, I am not sure. KDAB's "hotspot" (
           | https://www.kdab.com/tag/perf/ ) can do flamegraphs, but this
           | may be more performance-oriented than what you're looking
           | for.
           | 
           | Edit: perf can collect the data _statistically_. It will miss
           | many calls, the ones that are the least time-consuming.
        
           | dllthomas wrote:
           | > A good tree will have something like log(n) depth so normal
           | recursion should be fine.
           | 
           | But in many contexts you have to be robust against a bad
           | tree.
           | 
           | > Apart from recursive algorithms, stack size can be
           | determined by static analysis.
           | 
           | Not always, in the presence of dynamically sized arrays (or
           | alloca and friends).
        
         | foobiekr wrote:
         | Honestly, there is almost no difference between a purely
         | recursive implementation on the stack and one that uses a
         | passed-in stack structure (whether extensible or fixed) in
         | terms of complexity. Though I can see why it would be nice if
         | the languages supported some sugar for this.
        
         | likeabbas wrote:
            | Could also be solved if the compiler supported tail call
            | optimization
        
           | Jensson wrote:
            | The main reason to use recursion is to use the stack
            | variables; when tail call optimization is possible, people
            | usually just prefer loops.
        
           | dnautics wrote:
           | I don't think tail call will help depth-first ("obvious
           | recursive way") tree algos.
        
             | aidenn0 wrote:
             | It doesn't help with breadth-first searches either. A non-
             | degenerate tree of depth N will use O(N) space, and that
             | space will be on the call-stack if you use a recursive
             | algorithm to access it.
             | 
             | As long as you use less than 1 page per activation record
             | though, most modern libc implementations behave safely
             | under unbounded recursion as they will allocate a guard-
             | page around each stack.
             | 
             | [edit]
             | 
                | It would also be useful if the compiler could warn about
                | activation records larger than 1 page. Absent alloca (and
                | C99 VLAs, made optional in C11), that should be
                | statically knowable.
        
               | dnautics wrote:
               | A BFS is essentially a while loop with a queue
               | datastructure. This is (relatively) easy to implement in
               | TCO-recursive erlang/elixir, which is what I'm currently
               | most proficient in. And, yeah, TCO would help.
        
               | Dylan16807 wrote:
               | What does a recursive breadth-first search look like? I
               | think you'd have to contort yourself to write that, which
               | is why people are specifically mentioning depth-first
               | where recursion is easy and natural.
        
               | aidenn0 wrote:
               | Ah you're right; the most natural BFS would be tail-
               | recursive or iterative.
        
           | cryptonector wrote:
           | TCO is an optimization. It's not always possible.
        
             | jhgb wrote:
             | It's a misnomer. It's not just "an optimization"; in some
             | cases, it alters program semantics.
        
               | cryptonector wrote:
               | Usually the only visible side-effect of TCO is stack
               | trace compression and failure to exhaust the stack. You
               | can call that semantics if you like, but I think most
               | people wouldn't. It's absurd that Java doesn't do TCO
               | because of the stack compression "issue" (as I
               | understand, that's why Java doesn't TCO).
        
               | jhgb wrote:
               | Termination instead of divergence definitely is a change
               | in semantics.
        
               | adrianN wrote:
               | C++ compilers are free to optimize infinite loops without
               | side effects away and people are generally fine with
               | that.
        
               | jhgb wrote:
               | Well, I'm sure this is not the worst thing about C++ that
               | C++ people are "generally fine with", so that really
               | doesn't tell me a lot. Not quite sure if this applies to
               | useful programs, though. Lack of tail call elimination
               | can definitely break a class of useful programs even with
               | side effects.
        
               | cryptonector wrote:
               | No, failure to fail is not a change in semantics.
               | 
               | EDIT: "My program used to fail with a stack size
               | exhaustion error, but now it works -- I'm so angry about
               | this!" -no one ever.
        
               | jhgb wrote:
               | But they'd complain about the opposite -- for example if
               | the program were an external-event-driven state
               | automaton.
        
               | cryptonector wrote:
               | TCO does not do the opposite. TCO helps avoid stack
               | exhaustion without introducing new faults.
               | 
               | TCO might cause your program to avoid one stack
               | exhaustion fault only to reveal a different fault
               | (possibly a different stack exhaustion fault). This is
               | not a semantics change. And it's not a bug either.
        
               | jhgb wrote:
               | What do you mean by "reveal a different fault"? A program
               | with an indefinite amount of tail calls does not have "a
               | fault".
        
               | cryptonector wrote:
               | If you have a program that fails due to stack exhaustion,
               | but fixing it by enabling TCO causes it to run longer and
               | therefore run into some other problem...
        
               | fsckboy wrote:
               | > You can call that semantics if you like, but I think
               | most people wouldn't.
               | 
               | Tail Call Optimization (TCO) is an optimization, yes;
               | 
               | but if Tail Call Elimination is required in the
               | implementation of a language as it is in Scheme, then
               | programmers in the language can rely on it (along with
               | lexical scope) to write iterative loops instead as
               | recursive calls. That is a change in program language
               | semantics.
               | 
               | see Guy Steele's "Debunking the 'Expensive Procedure
               | Call' Myth, or, Procedure Call Implementations Considered
               | Harmful, or, Lambda: The Ultimate GOTO"
               | http://dspace.mit.edu/handle/1721.1/5753
        
               | haberman wrote:
               | > but if Tail Call Elimination is required in the
               | implementation of a language as it is in Scheme
               | 
               | This is also supported as a non-standard extension in C,
               | C++, and Objective-C as of Clang 13: https://clang.llvm.o
               | rg/docs/AttributeReference.html#musttail (I implemented
               | this).
               | 
               | > see Guy Steele's "Debunking the 'Expensive Procedure
               | Call' Myth
               | 
               | Yes! :) I cited this paper in my blog article describing
               | how we are using musttail:
               | https://blog.reverberate.org/2021/04/21/musttail-
               | efficient-i...
        
               | cryptonector wrote:
               | My point is that adding TCO is not a breaking semantics
               | change, as its only visibility will be through a) the
               | fact that more code works with it than w/o it, b) elision
               | of tail-calling frames in stack traces. In particular I
               | don't consider (b) to be a problem. I'm aware (b) is a
               | matter of opinion, but really, TCO is much too valuable
               | to use (b) as an argument against TCO.
        
         | ploxiln wrote:
         | The initial/main thread stack size is dynamic, it grows until
         | it runs into another allocation (and these days there are guard
         | pages and such). You can already use 128MB on the initial/main
         | stack, even with musl-libc / Alpine linux (it needs to be
          | linearly probed if it's unreasonably huge, so the OS isn't
         | surprised by a random memory access out in the middle of
         | nowhere, but the compiler should do that automatically these
         | days).
         | 
         | You can also request whatever stack size you need for
         | additional threads when you create them, this has always been a
         | good idea.
         | 
         | It's only the default stack size for non-main threads, if you
         | don't specify it, which may be considered reasonably limited.
        
           | masklinn wrote:
            | > The initial/main thread stack size is dynamic
            | 
            | That seems doubtful; AFAIK all OSes have a static
            | allocation for the main thread, just one that is commonly
            | (though not necessarily) much larger than what they give
            | off-threads. As https://ariadne.space/2021/06/25/understanding-thread-
            | stack-... notes, Windows gives 1MB to all threads, while
            | macOS or OpenBSD 4.6+ give 8MB to the main thread and 512k
            | to off-threads.
           | 
           | The former would not merit consideration if it was
           | automatically increased.
           | 
           | > it grows until it runs into another allocation
           | 
           | Which it won't because vmem.
           | 
           | > and these days there are guard pages and such
           | 
           | That just makes the access segfault if the stack allocation
           | is not large enough to straight jump over the guard page.
        
             | ploxiln wrote:
             | I'm not so familiar with other operating systems, but on
             | Linux and presumably other unixes, the only thing holding
             | back the initial/main "process stack" is ulimit. Try
             | "ulimit -s unlimited". (On macOS that increases it from 8MB
             | to 64MB, but on linux it seems to really go unlimited)
        
       | grumpyprole wrote:
       | A brilliantly written article. Couldn't have said it better
       | myself.
        
       | ghoward wrote:
       | TFA is right, sadly.
       | 
        | One of the projects I want to do in the future is to implement
       | boringcc [1], and this sounds like something I would want in it.
       | 
       | The question then becomes: would it even be possible to do?
       | 
       | [1]: https://groups.google.com/g/boring-
       | crypto/c/48qa1kWignU/m/o8...
        
         | qsort wrote:
         | Wouldn't the majority of it just be, effectively, "all pointers
         | are now fat"?
         | 
         | It could disable some (many?) optimizations, but I don't see it
         | as a particularly hard thing to do (at least conceptually).
        
           | ghoward wrote:
           | No, not a hard thing to make a boringcc compiler, but I was
           | referring specifically to querying stack size.
        
         | hdjjhhvvhga wrote:
         | Probably the only way is to use platform-specific quirks, but
         | that's basically the same with everything else. The portability
         | only works for some generic code, pretty much any non-
         | trivial/real-world software uses a lot of platform-specific
         | code.
        
         | cryptonector wrote:
         | TFA is wrong. The C run-time's call frame allocation method and
         | actual resource limits leak past the C function call and
         | variable-length-array (VLA) abstractions. You can write
         | portable code to force defined (by the platform, not by the C
         | spec) resource exhaustion behavior sooner than undefined
         | behavior caused by memory corruption caused by use of much-
         | larger-than-the-stack stack allocations.
        
           | ghoward wrote:
           | Can you really? If you depend on a SIGSEGV handler, that
           | handler needs stack space unless you set up a separate stack
           | for it, and I am not sure that is portable.
        
             | cryptonector wrote:
             | The point is to cause a defined (by the platform) stack
             | exhaustion failure mode before possibly causing undefined
             | behavior by corrupting memory.
             | 
             | The C specs do not define any resource exhaustion failure
             | modes. But actual operating systems, ABIs, platforms --
             | they do.
             | 
             | If you are on a system that gives you a SIGSEGV but no way
             | to catch it on an alternate stack -or if you don't catch or
             | ignore it- then your process will crash. This is fine. It's
             | better than memory corruption.
        
               | ghoward wrote:
               | Can you give me an example of a platform where your
               | program cannot cause memory corruption if you access an
               | array on the stack out of bounds far enough to corrupt
               | actual existing memory on the heap? Because a SIGSEGV is
               | the best case scenario; the other scenarios are not
               | helped at all by platforms.
        
               | cryptonector wrote:
               | Say you alloca(1GB), and alloca() is naive and gives it
               | to you even though that 1GB spans several other stacks,
               | and now you write somewhere in the middle. If you're
               | lucky you'll hit unmapped space and trigger a segfault.
               | If you're unlucky you'll scribble over another thread's
               | stack or whatever.
               | 
               | But if you progressively write into the alloca()ed area
               | in the direction of stack growth then you cannot corrupt
               | other mappings, and you can only segfault if you exceed
               | your stack's maximum size. Of course, there can be
               | platforms with no SIGSEGV or whatever.
        
         | miloshadzic wrote:
         | The spam on that thread is really something.
        
       | yholio wrote:
       | Seems like a moot point on any 64 bit platform, where you can
       | effectively reserve GBytes of virtual memory for each of
       | thousands of threads, and page in physical memory as required.
       | 
       | It's the only sane interpretation I can give to a standard that
       | hides any way to manage stack usage - the compiler/OS layer takes
       | responsibility.
        
         | noobermin wrote:
         | If it's the OS's responsibility then Alpine Linux shouldn't
         | blame developers or say their code isn't "portable."
        
         | aidenn0 wrote:
         | Gigabytes for thousands of threads is just barely true; if you
         | reserved say 8GB each for 16k threads, you would just about use
         | up the 47 bits of address space reserved for userland on Linux
         | leaving nothing for text or heap.
         | 
         | Certainly a few MB per thread isn't too crazy though; at 8MB
         | per thread, you can have a million threads and use only a tiny
         | fraction of the address space available.
        
           | yholio wrote:
           | You could reserve, say, 46 bits for heap and 46 for stack, 70
           | terabytes each _ought to be enough for anybody_. Then start
           | allocating thread stack in successive halvings: 2 threads get
           | 35TB each, 65536 threads get 1GB each etc.
           | 
           | Effectively, you can make sure the program runs out of
           | available memory long before a stack exhaustion issue, unless
           | the stack usage pattern is severely unbalanced for some
           | thread.
        
             | jart wrote:
             | Virtual memory isn't free. 16k threads with 8gb stacks
             | technically needs like 250gb of page tables. You must be
             | using some strange memory model and microprocessor that
             | lets you have 92 bits of address space.
        
       | Koshkin wrote:
       | How do functional languages and, in general, the languages where
       | the use of recursion is pervasive - how do they deal with this?
        
         | Verdex wrote:
         | So, you can't answer "in general" because things are handled
         | differently depending on the language.
         | 
         | I believe Scheme basically only uses the heap (which is
         | necessary because of call/cc iirc).
         | 
         | Haskell uses a variation of a G machine (looks like a spineless
         | tagless g machine to be specific). The way that works is that
         | they convert your program into a giant lambda term and then
         | they work on reducing it. This of course looks nothing like a
         | traditional call stack, so you can replace that whole thing
         | with the G machine. Recursion (at least of the non-tail
         | recursive variety) will look like a lambda term slowly getting
         | bigger in heap space.
         | 
         | And then I also heard of someone (I forget which language)
         | talking about setting up some logic to look for stack overflows
         | and then throw all the current stack onto a heap and then start
         | over.
        
           | dfox wrote:
           | The part about detecting stack overflows probably refers to
           | implementation strategy in some scheme implementations (IIRC
           | chicken scheme does this) which involve using the
           | platform/machine stack as effectively hardware-assisted
           | nursery generation for generational copying GC, when it gets
           | full (usually well before true overflow) the minor GC cycle
           | gets done and the stack gets completely emptied. In these
           | designs the "stack" is usually used for all allocations, not
           | only for activation records.
        
         | miloignis wrote:
         | Tail Call Optimization & allocating their own stacks,
         | generally.
        
         | elipsitz wrote:
         | In many cases, with tail call optimization, turning specific
         | types of recursion into iteration (which doesn't use more than
         | a constant amount of stack space).
        
           | masklinn wrote:
           | Not tail call optimisation, tail call _elimination_.
           | 
           | If the language is based on recursion, you can't just hope it
           | happens commonly enough, it has to happen all the time, even
           | for corecursive functions.
           | 
           | Except I've never been super clear on Haskell having TCE or
           | not, because laziness means there are limited opportunities
           | for it to even fire.
        
       | cryptonector wrote:
       | As long as there's a guard page and as long as you can assume
       | that page is at least 4KB in size, you can detect C stack
       | boundaries safely. Basically, don't use alloca() or C99 VLAs, or
       | huge automatics, or, if you do, touch a byte every 4KB in the
       | direction of stack growth before you use bulky stack allocations.
       | I have a macro for this somewhere.
       | 
       | Also, this isn't limited to C... The real problem is not that the
       | stack size is invisible to the program, or that you can run out.
       | The real problem is that you can make a stack allocation so large
       | that you might miss the end of the stack and scribble over some
        | other memory. This is why you need to touch a byte every guard-
       | page-size bytes (in the direction of stack growth) in large stack
       | allocations before using those allocations.
       | 
       | What's specific to C is alloca() and C99 VLAs. Not that other
       | languages don't have those, but that in their absence it's
       | extremely unlikely that any function would have a frame so large
       | as to run past the guard page. A maximum frame size can be
       | computed by the compiler in the absence of alloca() or VLAs.
       | 
       | Also, the compiler could (and arguably should) generate code to
       | do this touching of a byte every guard-size-bytes thing for
       | dynamically sized stack allocations, which would make the problem
       | disappear. (You could still run out of stack space, but more
       | safely.)
        
         | Joker_vD wrote:
         | > touch a byte every 4KB in the direction of stack growth
         | before you use bulky stack allocations. I have a macro for this
         | somewhere.
         | 
         | > you need to touch a byte ever guard-page-size bytes (in the
         | direction of stack growth) in large stack allocations before
         | using those allocations
         | 
         | > Also, the compiler could (and arguably should) generate code
         | to do this touching of a byte every guard-size-bytes thing for
         | dynamically sized stack allocations, which would make the
         | problem disappear.
         | 
          | That's precisely what MSVC has done since 1995 or so:
          | inserting calls to the _chkstk() function, which touches
          | the guard pages. I
         | believe gcc also does this when it targets Windows.
        
           | cryptonector wrote:
           | Excellent. Thanks for the tip. Some references:
           | 
           | https://docs.microsoft.com/en-
           | us/windows/win32/devnotes/-win...
           | 
           | https://metricpanda.com/rival-fortress-update-45-dealing-
           | wit...
        
         | titzer wrote:
         | This doesn't work in practice, see my above comment. You can't
         | put guard pages into the middle of stack segment (using
         | mprotect()--the kernel will just override that mapping, happily
         | growing right over top of it). Even putting a guard page at the
         | end doesn't work, as you can still have spurious SIGSEGVs, as
         | you mention, because programs might just stride more than 4KiB
         | at a time; you can't trust a C compiler to insert strided
         | accesses for you, and that's just overhead.
         | 
         | The only way you can reliably do this, I've found, is to
         | allocate your own stack segment.
        
           | cryptonector wrote:
           | Of course you can have a SIGSEGV when you exceed the max
           | stack size. That's fine and to be expected.
           | 
           | What's NOT fine is for a large stack allocation to cause the
           | program to silently scribble over another stack or heap.
        
             | masklinn wrote:
             | > Of course you can have a SIGSEGV when you exceed the max
             | stack size. That's fine and to be expected.
             | 
             | It's more that you can't expect anything better, having an
             | actual stack overflow error would be a lot more helpful.
        
               | cryptonector wrote:
               | Right. With 64-bit ISAs you can address 2^64 bytes,
               | but... you can't exactly _have_ 2^64 bytes. At some point
               | the system will fail. How it fails is important. A ran-
               | out-of-resources errors is a lot better than undefined
               | behavior resulting from random memory corruption.
               | 
               | And yes, portable C has finite size_t. Therefore portable
               | C must fail in some way when limits are exceeded.
        
           | gpderetta wrote:
           | Have you considered --split-stack? It provides infinitely
           | growable stacks (up to the heap limit I guess) and a defined
           | ABI to find the stack boundaries.
           | 
           | The cost is slower function calls and an ABI break.
        
         | chrisseaton wrote:
         | > As long as there's a guard page and as long as you can assume
         | that page is at least 4KB in size, you can detect C stack
         | boundaries safely.
         | 
         | How do you do this in pure C?
        
           | cryptonector wrote:
           | You're gonna make me dig it out, ain't ya. I don't know where
           | I put it, but I've written a macro and function that does
            | this. Basically, given a pointer and an allocation size:
            | if the size is smaller than 4KB, do nothing; if it's
            | larger, loop roughly allocation-size/4KB times, touching a
            | byte every 4KB from one end of the allocation to the
            | other. The tricky parts are determining the direction in
            | which the stack grows, and the page size (which, arguably,
            | you can just hard-code to 4KB).
           | 
           | You can determine stack growth direction by having one
           | function call another with a pointer to a local variable in
           | the former and then compare that to a pointer to a local
           | variable in the latter. To do this correctly you have to cast
           | to uintptr_t.
           | 
           | And, yes, pure C does not even assume a stack, much less a
           | bounded stack. In pure C it's possible to have every call
           | frame allocated on the heap, which means that assuming that
           | they are allocated contiguously is not safe. Moreover, even
           | with stacks, the run-time could allocate a second stack when
           | the first is exhausted and push new call frames on the second
           | rather than die. In pure C the only assumptions about limits
           | have to do with the size of size_t. The real world intrudes
           | though, and in practice all C run-times allocate call frames
           | on bounded stacks.
        
             | chrisseaton wrote:
             | None of this is possible in a well-defined pure C program.
             | Pages with special memory protection aren't a thing C knows
             | anything about.
             | 
             | That's the point - you can't do it without sacrificing
             | portability.
        
               | cryptonector wrote:
               | What I'm describing is portable. It may be unnecessary in
               | C run-times that allocate call frames on the heap, but
               | those aren't realistic C run-times.
        
               | chrisseaton wrote:
               | Say I ran your code on a special C implementation that
               | tracks all memory access and disallows it when you try to
               | read from stack memory that doesn't contain a live
               | object.
               | 
               | That'd be a completely standards-compliant C
               | implementation.
               | 
               | Your code wouldn't run on it. Your code would not be
                | portable to my standards-compliant implementation.
               | Therefore your code is not portable.
        
               | cryptonector wrote:
               | You can almost certainly make it portable. After all, the
               | allocation has a pointer, size, and type, and if it is at
               | all an array, you could dereference every Nth element as
               | needed, possibly with memcpy(), and avoid aliasing rules
               | via casting to unsigned char *.
        
               | Hercuros wrote:
               | Wouldn't the compiler be allowed to elide the memory
               | accesses that are being used to poke the memory region?
               | If you are doing the poking before the memory has been
               | initialized, then I'm not sure there are any guarantees
               | that the compiler will read from the actual memory region
               | (since it's uninitialized). And if you do writes to do
               | the poking then there might be similar issues if the
               | result of the write is never used. Even for reads you
               | would have to make sure that the result gets used. Also
               | not sure if casting to volatile is allowed if the object
               | itself is not volatile. This is all true regardless of
               | UB.
               | 
               | It seems to me like it is impossible to do poking in a
               | way that could not be elided by the compiler, because
               | removing poking never alters the semantics of the program
               | (stack overflow or guard pages are not part of the
               | semantics of C, I believe).
               | 
               | Of course, in practice you'll probably get the right
               | behavior, but I'm personally always interested in
               | language lawyery questions. I do believe that there are
               | cases of security issues introduced by compilers eliding
               | memset operations on password buffers, because the
               | compiler can tell that the result of memset is never used
               | and so it doesn't actually clear the memory. These sorts
               | of things are tricky to get right.
        
               | cryptonector wrote:
               | Yes, you need to find a way to have the compiler not
               | elide memset() or this particular function. That part is
               | probably not portable.
               | 
               | You can cast to volatile, but casting away volatile-ness
               | and dereferencing the resulting pointer is UB.
        
               | chrisseaton wrote:
               | > That part is probably not portable.
               | 
               | If you need that part, and that part is not portable,
               | then your solution is not portable is it!
        
               | chrisseaton wrote:
               | So you're going to touch every byte of the memory, to
               | check it's been successfully allocated.
               | 
               | What happens when you access memory which has not been
               | successfully allocated?
        
               | cryptonector wrote:
               | > So you're going to touch every byte of the memory, to
               | check it's been successfully allocated.
               | 
               | Strictly less (not more than every 4KB), but, yes :)
               | 
               | > What happens when you access memory which has not been
               | successfully allocated?
               | 
               | Portable C has finite size_t. Actual machines, virtual or
               | otherwise, have finite resources. At some point you may
               | run out of resources. With a dynamic stack allocation
               | checker you'll fail with a segmentation fault when you
               | run out. Without a dynamic stack allocation checker (but
               | _with_ dynamic stack allocations) you may well fail in
               | random ways due to corruption of other stacks, heap,
               | mmaps, etc.
               | 
               | If your C run-time doesn't have stacks then a dynamic
               | stack allocation checker may be pointless... unless your
               | OS is over-committing memory, then it may cause failure
               | to happen sooner than it might otherwise.
        
               | chrisseaton wrote:
               | > you'll fail with a segmentation fault
               | 
               | Who says you will?
               | 
               | Nothing in the C specification says that.
        
               | cryptonector wrote:
               | The C specs assume unlimited (up to size_t) resources.
               | 
               | Real life has no machines with 64-bit ISAs and 2^64
               | bytes' worth of resources.
               | 
               | Real life isn't portable. Who knew.
               | 
               | If there's no guard pages, no segmentation faults, no OOM
               | killers, etc., then an allocation checker will cause
               | failure in much the same kinds of undefined ways as not
               | having the checker.
               | 
               | A segmentation fault _is_ what you can expect in real
               | systems, and it 's absolutely better than undefined
               | behavior.
               | 
               | You can do this portably, in C. It may not do you any
               | good, and it may waste CPU cycles and memory bus usage.
               | But it shouldn't harm. If the alternative is to not use
               | alloca() or VLAs, and to make it so you can statically
               | know your program's resource needs (which is _very_
                | limiting, and not really attempted outside contexts
                | where such limitations are very strongly desired), then do
               | that.
        
               | Joker_vD wrote:
               | > A segmentation fault _is_ what you can expect in real
               | systems,
               | 
               | Nope. Just today I've seen a stack corruption bug that
               | instead of producing a segmentation fault caused the
               | process to go into uninterruptible sleep, somehow.
        
               | chrisseaton wrote:
               | You're saying 'if run on a system that does things in a
               | way I expect then it works' but the point of the article
               | is 'but what about if you want to run on a system that's
               | not as you expect and you want to run pure C'.
               | 
               | We all know what you're saying works in practice on most
               | machines.
               | 
               | But it's not pure C and it's not portable if you follow
               | the spec.
        
               | cryptonector wrote:
               | It's portably assuming that you can cast an allocation's
               | pointer to a pointer to unsigned char. It's like assuming
               | you can memset(), since that's effectively what it's
               | doing. In the absence of memory over-commit, this will
               | have no effect beyond wasting resources. In the presence
               | of over-commit -bounded stacks resemble memory over-
               | commitment!- it will have the effect of committing the
               | memory OR failing in whatever way the machine fails when
               | resources are exhausted. However, in the latter case, and
               | when the machine uses stacks for C function calls, the
               | failure will be defined _by the machine_ rather than
               | undefined (due to memory corruption).
               | 
               | Also, in the absence of stacks, there's no "direction in
               | which the stack grows", and any attempt to determine that
               | direction will yield a nonsensical result, but doing
               | memset() in one direction or the other will also then be
               | equivalent anyways.
               | 
               | EDIT: There is one non-portable thing here: the
               | assumption that machines are more limited than size_t
               | implies. However, that's a distinction without a
               | difference for any 64-bit ABIs or larger because all such
                | machines must be more limited than size_t implies. Yes,
                | it's technically a non-portable assumption that there
                | is any failure mode at all when resources are
                | exhausted, but that observation is uselessly true. Not
                | being able to assume defined resource exhaustion
                | failure modes would be destructive -- that the C spec
                | does not define them is irrelevant.
        
               | khrbtxyz wrote:
               | _determine stack growth direction by having one function
               | call another with a pointer to a local variable in the
               | former and then compare that to a pointer to a local
               | variable in the latter._
               | 
               | This is undefined behavior.
               | 
               | See n1570.pdf, Annex J page 560, J.2 Undefined behavior
               | 
               | Pointers that do not point to the same aggregate or union
               | (nor just beyond the same array object) are compared
               | using relational operators (6.5.8).
        
               | cryptonector wrote:
               | Comparing the uintptr_t cast of two disparate pointers is
               | not undefined behavior. A comparison may not yield a
               | portable result because the integer representation of
               | pointers may not be as plain memory addresses.
               | 
               | So such a simplistic stack growth detector may get the
               | wrong answer, but it's not undefined behavior:
                | 
                |     char stack_grows_up(uintptr_t callers_local) {
                |         char my_local = (uintptr_t)&my_local > callers_local;
                |         return my_local;
                |     }
               | 
               | EDIT: Ah, and uintptr_t is an optional type. If the
               | platform doesn't have it (e.g., because it's a segmented
               | memory architecture) then this won't work, so it's not
               | portable, but aside from that it's not undefined
               | behavior, it's just that the computation of stack growth
               | direction may well be wrong.
        
             | [deleted]
        
           | shadowgovt wrote:
           | In pure C, there is no stack abstraction. Technically
           | speaking, the stack is an extremely common implementation
           | detail shared by almost all C runtimes.
           | 
           | There's no real way to answer this question at that layer of
           | abstraction; the minute you add any library calls that deal
           | with a stack at all (such as alloca), we're no longer talking
           | about pure C.
        
             | cryptonector wrote:
             | Correct. Note that alloca() can be an alias for malloc() in
             | stackless Cs. If the stack assumption is invalid, then a
             | checker will not trip over a guard page, and it will at
             | best trip over memory over-commit (which may still be
             | desirable). If the guard page assumption is invalid, then a
             | checker will cause failure just as surely as not checking
             | (provided you intended to use the whole allocation
             | anyways), and either way the failure will be undefined,
             | therefore this is still portable C!
        
             | chrisseaton wrote:
             | I was going to reply that alloca is a standard function,
             | but then I checked and it's actually not is it!
        
               | cryptonector wrote:
               | However VLAs _are_ in C99 and up. They're rather similar
               | to alloca() in that whatever mechanism you'd use to
               | implement alloca() (automatic stack allocation, or heap
               | allocation), you'd also use to implement VLAs.
        
               | wahern wrote:
               | Except alloca'd space is only dropped at the end of a
               | function. VLAs are dropped at the end of a block. This
               | makes a huge difference in loops as alloca can quickly
               | exhaust the stack.
        
               | cryptonector wrote:
               | No real difference:
               | 
               |     for (size_t i = 1; i < WHATEVER; i <<= 1) {
               |         char vla[i];
               |         vla[0] = 0;
               |         vla[i - 1] = 0;
               |         ...
               |     }
               | 
               | that is, you can still exhaust your stack trivially with
               | VLAs.
               | 
               | Sure, the two things have a different "API", but as far
               | as allocating arbitrary amounts of memory goes, they both
               | let you do it.
        
               | wahern wrote:
               | I'm not sure what your point is. You could replace the
               | VLA with malloc in that example and exhaust your entire
               | VM space.
               | 
               | In the rare cases where VLAs might prove useful, e.g. a
               | vector math library (IIRC, a major reason for VLAs was to
               | help FORTRAN programmers transition to C), this
               | difference absolutely does matter in reality.
        
               | cryptonector wrote:
               | Maybe I missed the point of your comment. Mine is that
               | the scoping difference between VLAs and alloca() isn't
               | relevant to the question at hand of how to protect
               | against memory corruption due to large stack allocations.
        
               | steveklabnik wrote:
               | Notably, C11 made VLAs an optional feature, not a
               | required one (alloca has never been in the standard).
        
         | Someone wrote:
         | > touch a byte every 4KB in the direction of stack growth
         | before you use bulky stack allocations
         | 
         | I don't think you can portably (1) do that. You have to
         | allocate before you can touch those bytes.
         | 
         | Also, if you touch those bytes _before_ allocating, I don't
         | think there's a guarantee that you won't hit some volatile
         | memory that interacts with hardware. And of course there's no
         | portable way to get the stack pointer in the first place (if
         | only because it, conceptually, doesn't exist in C).
         | 
         | (1) specific compilers can safely do that because they know how
         | they are implemented, and many compilers are fairly similar, so
         | one can make this "quite portable", but not really portable.
        
           | cryptonector wrote:
           | One way is to allocate first. alloca(1GB) typically isn't
           | going to cause a SEGV, so you can check the allocation after
           | making it. With VLAs you have to make sure (this is not
           | portable) that the compiler will put your loop iterator state
           | variables ahead of the VLA.
           | 
           | If you can assume a linear stack (which is kinda the whole
           | point of this discussion) then you can call a function that
           | will repeatedly alloca() in small chunks and touch a byte and
           | return on success or SEGV on failure, and then when it
           | returns it will release that memory but then you know you can
           | safely alloca() that much.
        
       | Spivak wrote:
       | I don't get Alpine's reasoning here. In order to be portable does
       | every C program have to work with any stack size limit? How could
       | that even work? Without a guaranteed minimum there's nothing to
       | do except "portable to common platforms-ish."
        
         | gpderetta wrote:
         | Do you know about the shortest fully conforming C compiler?
          |     #include <stdio.h>
          |     int main(int argc, char *argv[]) {
          |         printf("ERROR: resources exhausted!");
          |         return -1;
          |     }
        
       | jeff-davis wrote:
       | How does something like Ada help you measure and control stack
       | usage?
        
         | nagisa wrote:
          | Measuring stack requirements is typically implemented by
          | walking the call graph and adding up the functions' stack
          | sizes (for ordinary functions these are known at compile
          | time). There are "only" a couple of problematic cases where
          | this algorithm fails: recursion, `alloca` and the like, and
          | indirect calls.
         | 
         | For all of these problems there are more complex analyses that
         | can be done to determine e.g. maximum recursion depth or
         | possible functions that an indirect call could resolve to. And
         | failing all of these there is usually a way for an engineer to
         | override or provide this information.
        
       | trasz wrote:
       | What's the point of a low stack limit, apart from places like the
       | kernel?
        
         | m_eiman wrote:
         | ... and embedded, of course. The main problem is that there IS
         | a limit, and you have no idea what the limit is. The only way
         | you find out is when you crash.
        
           | kevin_thibedeau wrote:
           | > The only way you find out is when you crash.
           | 
           | You can fill the stack region with sentinel data and
           | periodically scan it to warn of impending doom.
        
           | trasz wrote:
           | There is - in embedded (also kernel), but not in general
           | purpose systems, like the Unix userspace. And the same piece
           | of software very rarely runs on both. So, again, what's the
           | point of this exercise?
        
           | StillBored wrote:
           | People have differing opinions of "embedded" but I would
           | claim that most embedded uses cases the stack is actually
           | specified by the programmer via linker scripts and/or
           | manually setting it up during early bootstrap. Plus, its not
           | exactly hard in most of these environments to write target
           | and/or compiler specific macros which detects top of stack
           | and can runtime check it via DEBUG() like macros or modifying
           | function entry macros. None of this is of course portable,
           | but neither is code designed to run in 1-16k of RAM and a few
           | more K of flash.
           | 
           | OTOH, there seem to be a lot of "we don't control anything"
           | IoT devices which are random overly powerful processors with
           | random uboot+linux kernels+rootfs's running
           | my_first_c_program.c level applications. All of it tossed
           | together with very little engineering effort, and frequently
           | less low level knowledge, and left to rot because its
           | basically unmaintainable, even if it happens to be the 1% of
           | devices that can get over the air updates.
        
             | formerly_proven wrote:
             | gcc's -fstack-usage works well, at least for the kind of C
             | I write on small microcontrollers. You still have to add it
             | up manually across the callgraph and of course take ISRs
             | into account.
        
         | ridaj wrote:
         | Any environment with restricted memory resources compared to
         | needs, and a need for reliable behavior
         | 
         | Embedded applications such as automotive, or industrial control
         | come to mind (or even consumer electronics, with less life-or-
         | death consequences possible)
        
         | jameshart wrote:
         | A microcontroller programmer says: "What kernel?"
        
       | JoeAltmaier wrote:
       | Time to abandon stack as a 'display'? That's why they overflow
       | (mostly) - lots of large items or lots of recursion and local
       | variables.
       | 
        | Also, having the addressable method arguments in the same
        | space as the return address allows a whole host of other bugs
        | and cracks.
        | 
        | I have long thought that the return stack should be a memory
        | region un-addressable by the code except via call and return.
       | 
       | Local method arguments could then revert to ordinary allocations
       | or some other check-able allocation, and have normal error
       | recovery.
        
       | QuadrupleA wrote:
       | Never thought about it but yes, what a huge (and easily
       | implementable) oversight - stackSize(), stackUsed(),
       | stackTopPtr() would be a few asm instructions in most cases.
        
         | PerkinWarwick wrote:
         | It seems to me that in the case of the x86, the virtual memory
         | system could detect overrunning the end of the stack and
         | announce it to you.
         | 
         | The meta-problem is what to do with the information.
        
           | masklinn wrote:
            | A clean / clear error would already be better than just
            | segfaulting with no indication (that's why e.g. Ruby and
            | Python have explicit stack depth checks and will raise an
            | exception, hopefully before they run out of C stack).
        
         | vasama wrote:
         | There is no useful way to employ these functions. It's
         | equivalent to asking if a mutex is locked. That piece of
         | information is worthless.
        
           | OskarS wrote:
           | What are you talking about? For lots of functions, it is
           | perfectly possible to statically determine a maximum stack
           | size (not all functions obviously, that would be the halting
            | problem). A trivial example would be a function that calls
           | no other functions, but you can imagine others: functions
           | with static call trees (i.e. no function pointers), don't use
           | VLAs or alloca, and that are not recursive (mutually,
           | indirectly, or otherwise) will have a static limit on
           | possible stack usage.
           | 
           | If you've formally verified that function X uses at most Y
           | pages of stack, and the remaining stack is smaller than that,
           | you could return an OUT_OF_STACK error instead of calling it.
           | Most libraries/functions wouldn't need this, but there are C
           | libraries which place an extraordinary importance on
           | reliability, and it seems like it would be a useful feature
           | for them.
           | 
           | There might be other reasons why these functions shouldn't be
           | in the standard (or are impossible on some platforms), but
           | "the information is worthless" is not one of them. Of course
           | this information can be useful.
        
           | adrianN wrote:
           | "I can't complete this recursive algorithm because I ran out
           | of stack at this particular step" is a much nicer error
           | message than "segmentation fault". For that it would be
           | useful if you could check remaining stack space before
           | recursing.
        
           | sumtechguy wrote:
           | I could see that sort of thing being very useful for a
           | debugger.
        
           | notact wrote:
           | If writing a recursive algorithm on arbitrary input, check
           | remaining stack size upon recursive function entry. If space
           | is depleted, allocate a new fiber (aka a brand new clean
           | stack), switch to it, and resume. Switch back to calling
           | fiber on stack unwind.
        
             | 10000truths wrote:
             | If you're writing a recursive algorithm on arbitrary input,
             | you should be using a stack container to store your state,
             | not the call stack.
        
           | Y_Y wrote:
           | Could you explain this please?
        
         | lifthrasiir wrote:
         | As mentioned in the OP though, there is zero guarantee about
         | the stack usage by library functions.
        
           | PerkinWarwick wrote:
           | ...or much of anything else. I've seen commercial libraries
           | monkey with processor state in some nasty ways.
        
           | OskarS wrote:
           | No, but that's a pretty weak argument not to have them:
           | they're better than nothing. Not all code uses libraries. And
           | if it was part of the C standard, libraries could use them as
           | well: libraries could fail gracefully instead of blowing the
           | stack. Many libraries already do this when malloc fails,
           | after all.
        
           | nybble41 wrote:
           | There is also no guarantee that the stack is contiguous--see
           | the "-fsplit-stack" GCC option, or search for "segmented
           | stacks". Libraries for coroutines and delimited continuations
           | also routinely segment the stack. A simple API based around
           | easy access to "top of stack", "bottom of stack", and
           | "current stack pointer" addresses is thus non-universal.
           | There may not be a meaningful limit to the amount of stack
           | available, aside from available heap memory or address space.
           | And this can change over time in response to actions other
           | than allocating or freeing memory on the stack.
        
       | titzer wrote:
       | I just wrestled with this. You cannot effectively handle stack
       | overflow situations if you use the system stack on Linux, because
       | stack segments are in magical "MAP_GROWSDOWN" regions. These
       | regions are managed by the kernel itself. There is a base address
       | and a size, and the kernel initially only maps some small number
       | of pages near the upper end (high addresses). Faulting accesses
       | to addresses that are not mapped, but in the region, just below
       | what is currently mapped cause the kernel to map in new pages
       | automatically, "growing down".
       | 
       | The customary way to handle stack overflow gracefully in userland
       | is to install a SIGSEGV handler and check the faulting address to
       | see if it is near the stack segment end, handling the signal on
       | an alternate stack. Except that determining if the access was in
       | the stack segment requires knowing the stack segment start. The
       | signal API doesn't allow querying that; you'd have to query the
       | process's mappings by other means. It's also not robust because a
       | program with big frames might just stride too far and confuse the
       | kernel's automatic growing, causing spurious SIGSEGVs when in
       | fact, there is stack remaining.
       | 
       | I found it was just easier to allocate my own stack segment and
       | immediately switch to it and just leave the loader-provided stack
       | segment alone. Virgil does this in the first 50 or so
       | instructions upon entry.
       | 
       | This also makes stack overflow very deterministic (only depends
       | on the program) and controllable--there is a compiler option to
       | set the stack size in KiB. There's no need for the compiler to
       | insert strided accesses either, as the whole segment is mapped,
       | and you can have more than a single guard page at the beginning
       | (think: 1MiB guard region).
       | 
       | Stack segments in C (and Linux) are madness if you use very much
       | and want anything to be robust to overflow.
        
         | matheusmoreira wrote:
         | Nice to see you and the Virgil project again!! Really
         | interesting to read about your Linux systems programming
         | experiences. I learned a lot reading your source code but this
         | bit about stacks escaped me.
         | 
         | Also I see x86_64 Linux targets are supported now and that's
         | awesome. Do you plan to support ARM architectures?
        
           | titzer wrote:
           | Hey, thanks! I wrote about 1/3rd of an arm32 backend but it's
           | languished because I've focused Virgil energy on only those
           | things I need for my Wizard Engine project. Probably the next
           | port I'll do will be arm64 to run on the M1 and Raspberry Pi,
           | but that'll be some time in the future.
           | 
           | Nice to hear someone has read the code. Glad you like it.
           | Cheers :)
        
         | whoisburbansky wrote:
          | Tangent, because this is the first I've heard of Virgil. From the
         | main README, what do you mean when you say you would like
         | programs written in Virgil to be dependency-free?
        
           | titzer wrote:
           | You can write programs that depend on no libraries at all,
           | not even an SDK. They just need a compiler binary (and a
           | machine to run that--x86-linux, x86-darwin, jvm, or even a
           | Wasm engine) to compile, and compile to a binary that also
           | only needs a machine to run--no shared objects, external
           | jars, etc.
           | 
           | The language has no builtin classes, interfaces, or
           | functions. In contrast to say, Java, where the language (and
           | compiler) know about java.lang.Object, java.lang.String, etc.
            | The compiler knows how to _compile_ classes, but it doesn't
            | know the name of any built-in ones.
           | 
           | Virgil is really designed to build the lowest level of
           | software. Thus all my focus on low-level things like calling
           | the kernel and mapping stacks, etc. The only thing that
           | depends on more than just the kernel is the test suite, which
           | uses bash and a little bit of C.
        
             | whoisburbansky wrote:
             | Ah, that's really cool. The clarification about it
             | targeting the lowest level really helps explain the
             | motivation, thanks, and all the best!
        
         | jeffbee wrote:
          | Managing your own pthread stacks can make threads more
          | costly though, can't it? pthread_attr_setstacksize requires
          | a stack size of at least PTHREAD_STACK_MIN, which on glibc
          | Linux tends to be 16384, four times larger than the initial
          | cost of a demand-paged stack map. I understand that on musl
          | this is 2048, so that's nice.
        
           | titzer wrote:
           | Yeah, I think there would be some overhead, because it
           | doesn't look like pthread_create allows specifying a stack
           | segment. But I wouldn't be using pthread_create (or any C
           | library for that matter), because Virgil goes all the way
           | down to kernel; there is no C environment.
           | 
           | I only experimented briefly[1] with getting Virgil threads to
           | run on Linux. You need to call SYS_clone, which forks a
           | lightweight process (i.e. shares all virtual memory), and
           | specify a stack segment. The segment in question would of
           | course, be mapped by Virgil and not have MAP_GROWSDOWN, so it
           | could be made robust to stack overflow along the lines above.
           | 
           | [1] https://github.com/titzer/virgil/blob/master/apps/Multi/M
           | ult...
        
         | stedolan wrote:
         | The way to handle stack overflow gracefully on Linux is to
         | check in your signal handler whether the faulting address was
         | between the beginning of the stack and the current stack
         | pointer. You can read the stack pointer register from the
         | signal handler's third parameter.
         | 
         | This is also how the kernel grows the stack. When there's a
         | fault, it compares the faulting address to the stack pointer
         | register. This way, big frames don't confuse the automatic
         | growing. (On Windows, by contrast, stack growth is detected
         | using a single 4k guard page, so compilers must be careful with
         | big frames and insert strided accesses)
        
         | jeffffff wrote:
          | I think maybe you can do it with userfaultfd. Performance
          | won't be as good as using a signal handler because of the
          | context switch, but you aren't restricted in what you can do
          | in the handler thread like you are with a signal handler.
        
       | zwieback wrote:
       | Kids these days - let grandpa tell you a story about writing OS/2
       | 1.x device drivers pre-virtual memory. Stack was obsessively
       | monitored, especially in interrupt handlers where you were called
       | with just a few hundred bytes of stack or so. If running low you
       | had to manually switch to your own stack and keep track which one
       | you were on.
       | 
        | Excellent point being made about C, though. Amazing how we've
        | lived with this limitation for so long. I get that C is
        | supposed to be a portable assembler, but the cases where a
        | stack wasn't used for return addresses and local variable
        | frames are rare, so some support should have been added early
        | on.
        
         | scoutt wrote:
         | Wanna feel that thrill again? Do some bare-metal embedded.
        
           | zwieback wrote:
           | Yeah, I do that quite a bit actually, when I write code for
           | smaller embedded systems I get the same rush I got back on my
           | old Apple ][.
        
         | silon42 wrote:
         | I believe even later, userspace code had to grow stack page by
         | page, no skipping.
        
       | skitter wrote:
       | What are the advantages of low maximum stack size limits (given
       | growable stacks)?
        
         | jart wrote:
         | Being able to build web servers with 10,000 threads w/ clean
         | imperative code rather than event-driven yo-yo pattern finite
         | state machines. It's what Google does. They limit stack size to
         | 60kb for just that reason.
        
         | PerkinWarwick wrote:
         | One obvious area is in small embedded systems.
         | 
         | It's surprising how hard it can be to track down a stack-caused
         | bug when you've got a bunch of interrupts banging on it.
         | 
          | Generally it's a little unnerving to just pray that you've
          | got enough stack, given its hidden usage and the possibility
          | of things like recursion. Kind of a shame, considering what
          | a slick way it is to deal with memory leak avoidance.
        
           | [deleted]
        
         | DrJokepu wrote:
         | This is speculation, but maybe it catches unintended infinite
         | recursion earlier? I'm willing to bet that the vast majority of
         | stack overflows happen due to programming errors rather than
         | genuine uses running out of stack.
        
           | projektfu wrote:
           | This is the best answer I could think of. We have to assume
           | the question is about Alpine Linux, not about embedded
           | systems per se. Small thread stacks run out of space quickly
           | (fail fast), help prevent a malicious process from gobbling
           | up lots of VM, and force you to program in a style that
           | doesn't use much stack space, although that's questionable
           | because most memory leaks are heap-based.
           | 
            | It's worth noting that the limits are adjustable on a per-
            | package basis in Alpine based on the compile-time link
            | options.
        
       | Joker_vD wrote:
       | This also makes Chicken Scheme basically guess the stack size: it
       | guesses it's 1 MiB on 64-bit platforms, and 256 KiB on 32-bit
       | ones. Of course, you can also manually override it, at your own
       | risk.
        
       ___________________________________________________________________
       (page generated 2021-09-28 23:01 UTC)