[HN Gopher] Understanding thread stack sizes and how Alpine is d...
___________________________________________________________________
Understanding thread stack sizes and how Alpine is different
Author : notacoward
Score : 93 points
Date : 2021-06-26 11:49 UTC (11 hours ago)
(HTM) web link (ariadne.space)
(TXT) w3m dump (ariadne.space)
| ohazi wrote:
| The distinction that's being characterized as "GNU/Linux" vs.
| just "Alpine" is confusing.
|
| Is Alpine using some new kernel? No, it's a Linux distribution
| that uses the Linux kernel, albeit with some unusual defaults.
|
| Does Alpine not have any of the GNU userspace tools? Also no,
| there are plenty in the Alpine package repository.
|
| Look, I get that GNU/Linux and "GnU pLuS LiNuX" is a loaded term
| and has a lot of baggage, and that everyone would like to just be
| rid of the whole mess, but the characterization used here had me
| thinking that there was some other "Alpine kernel" experimental
| OS project that I had missed that had nothing to do with Alpine
| Linux.
|
| The word "Linux" never once follows the word "Alpine" in this
| article, and it discusses overcommit mode as if it's a uniquely
| "GNU/Linux" thing. WTF does kernel overcommit have to do with
| GNU?
|
| Please just call it what it is.
| lonjil wrote:
| So in your view, having even a single GNU tool installed, or
| even available for installation, means you're using
| "GNU/Linux"? Alpine uses musl and busybox rather than the more
| common GNU equivalents.
|
| Kernel overcommit has nothing to do with GNU, but default stack
| size has a lot to do with which libc you use. Musl has a
| different default than glibc. Overcommit is mentioned because
| it is the justification for glibc having a large stack size by
| default. Musl has defaults that make fewer assumptions about
| how the system is configured.
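| Concretely, a minimal sketch (not from the article) of how to see
| the difference: ask the libc for the stack size of a default-
| created thread, via the non-portable pthread_getattr_np()
| extension that both glibc and musl provide.
|
|   #define _GNU_SOURCE
|   #include <pthread.h>
|   #include <stdio.h>
|
|   /* Runs inside a thread created with default attributes and
|      reports the stack size the libc actually gave it. */
|   static void *report(void *arg)
|   {
|       (void)arg;
|       pthread_attr_t attr;
|       size_t stacksize = 0;
|       if (pthread_getattr_np(pthread_self(), &attr) == 0) {
|           pthread_attr_getstacksize(&attr, &stacksize);
|           pthread_attr_destroy(&attr);
|       }
|       printf("default thread stack: %zu bytes\n", stacksize);
|       return NULL;
|   }
|
|   int main(void)
|   {
|       pthread_t t;
|       pthread_create(&t, NULL, report, NULL);
|       pthread_join(t, NULL);
|       return 0;
|   }
|
| On a typical glibc system this prints a value on the order of the
| 8 MiB default; on musl it is on the order of 128 KiB.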
| ohazi wrote:
| Okay, that's fair. I was confused and responded in
| frustration. Maybe glibc vs. musl would have been more clear
| than "GNU/Linux" vs. "Alpine".
|
| I still think only referring to it as "Alpine" and not _once_
| calling it "Alpine Linux" is weird.
| stefan_ wrote:
| Stop being cringe and adopt the same stack size as everyone else.
| Oh my god. What on earth are you saving?
| swiley wrote:
| IMO: it's nice to have weird platforms. I've caught lurking UB
| and memory corruption in my programs by trying to run them on
| weird OSes.
| stefan_ wrote:
| There is no UB or memory corruption in exceeding the stack
| size. It means your platform is too small to run the program.
| Of course, Alpine doesn't run on platforms that are too
| small, they just make nonsensical changes like these that
| cause incompatibilities but have zero benefits.
| ariadneconill wrote:
| Alpine runs on all sorts of small platforms. It is possible
| to run it on OpenWRT type devices.
| lmz wrote:
| https://github.com/yaegashi/muslstack seems to indicate the
| limit is coming from musl (the libc) rather than Alpine,
| which may legitimately have a hope of running on small
| platforms.
| eqvinox wrote:
| > Thread-local variables are referenced with the thread_local
| keyword. You must include threads.h in order to use it:
|
|   #include <threads.h>
|
|   void some_function(void) {
|       thread_local char scratchpad[500000];
|       memset(scratchpad, 'A', sizeof scratchpad);
|   }
|
| As an important note, thread-local storage through this keyword
| _still_ isn't supported on OpenBSD. It's a serious PITA.
|
| [------ also, copied from a reply I posted below: ------]
|
| The autofree macro is wrong since __attribute__((cleanup))
| expects a function that takes an additional level of pointer. In
| this case, it'll call "free(&scratchpad);". Which doesn't get you
| a compiler warning in C because passing a char ** as a void * is
| perfectly fine. But your heap is f*cked after this.
|
| Correct way to do it is:
|
|   void free2(char **p) { free(*p); }
|   #define autofree __attribute__((cleanup(free2)))
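| For completeness, a usage sketch of the corrected macro (the
| cleanup attribute is a GCC/Clang extension, and the names here
| are just illustrative):
|
|   #include <stdlib.h>
|   #include <string.h>
|
|   static void free2(char **p) { free(*p); }
|   #define autofree __attribute__((cleanup(free2)))
|
|   void some_function(void)
|   {
|       autofree char *scratchpad = malloc(500000);
|       if (scratchpad == NULL)
|           return;              /* free(NULL) in free2 is a no-op */
|       memset(scratchpad, 'A', 500000);
|   }   /* free2(&scratchpad) runs here and frees the buffer */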
| mjw1007 wrote:
| Another complication is that in glibc's implementation TLS
| variables come out of the same space as the per-thread stack.
|
| As far as I can tell this is something they think in principle
| should be changed.
|
| https://sourceware.org/bugzilla/show_bug.cgi?id=11787
| kstenerud wrote:
| Yeah, it sucks that programs crash on your system, but this is
| the way of things: The popular systems get targeted and tested
| in-depth, the less popular systems not so much. This is _NOT_
| the developer's fault; this is _pragmatism_.
|
| And so the mountain must come to Mohamed. Increase Alpine's
| default stack size to something more in line with the big boys.
| api wrote:
| I would at least make it equal to macOS, another very popular
| target where things are tested a lot. That's 512 KiB. 128 KiB
| is teeny.
| arghwhat wrote:
| No, they should keep their low stack size to the benefit of
| everyone.
|
| Diversity helps discover what is fundamentally a broken and
| fragile assumption that a dynamic property will always have
| some value. An assumption that _can_ fail anywhere, including
| on the OS that was initially targeted, and _will_ fail the
| moment another OS is targeted.
|
| The developer _should_ fix their broken assumption, but is
| entirely free to do so by taking control of the value at link
| time.
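| (For reference, and hedged since toolchains differ: with a GNU-
| compatible linker the whole-binary default that musl reads from
| the PT_GNU_STACK header can be raised at link time, e.g.
|
|   cc -Wl,-z,stack-size=0x100000 -o prog prog.c
|
| which requests a 1 MiB default thread stack instead of 128 KiB;
| individual threads can still override it via pthread attributes.)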
| choeger wrote:
| I absolutely agree with your point about diversity. But we
| advocates need to understand this "pragmatic" view and
| counter it properly. Normally, the argument that works is:
| "Do you monitor the thread stack size on $POPULAR_PLATFORM so
| that if _that_ changes, you won't be bitten?"
| torh wrote:
| I'm annoyed by the fact that I have to visit "new" reddit to be
| able to accept cookies, and then go back to the old design. This
| is on desktop btw.
| a1369209993 wrote:
| Wrong thread. You probably want
| https://news.ycombinator.com/item?id=27641366.
| nemetroid wrote:
| > As most threads only need a small amount of stack memory, other
| platforms use smaller limits, such as OpenBSD using only 64 KiB
| and Alpine using at most 128 KiB by default.
|
| By your own table, OpenBSD uses 512 KiB unless you're on an
| ancient version. Among the listed, Alpine is the lone outlier.
| mjw1007 wrote:
| I remember reading about arguments over whether Algol should
| permit recursive procedures, where one side of the argument was
| apparently claiming that they wouldn't be possible to implement.
|
| That seems pretty strange to modern ears, but maybe the
| underlying point was that it isn't possible to statically know
| how much stack size would be required.
|
| I suppose it wouldn't have been obvious then that if you fudge
| the issue for the first thirty years or so, everyone will just
| accept that this is the way the world is.
|
| Still, it's a bit of a shame that there are still widely-used
| systems where if you exceed the available stack space you're
| likely to face a "weird crash" rather than a clean error message
| at runtime.
| mananaysiempre wrote:
| Books contemporary with that era (incl. TAoCP Vol. 1 IIRC)
| don't place return addresses on a stack or in a register at
| all: you're generally told to modify the operand of the jump
| instruction at the end of the procedure you want to call, then
| jump to its entry point.
| It took a while before reentrancy was even recognized as a
| possibility, let alone used (by humans or compilers) by
| default.
|
| I remember reading that the call stack was _invented_ as an
| implementation device for recursion as introduced in Algol, but
| I can't recall how that claim was sourced.
| jerf wrote:
| "but maybe the underlying point was that it isn't possible to
| statically know how much stack size would be required."
|
| Given the time frame you're talking about, remember to be
| thinking in terms of kilobytes, not gigabytes. And potentially
| low-single-digit numbers of kilobytes. It could be in the range
| of hundreds of bytes dedicated to stacks at the time. Even if
| you could compute your maximum size it's easy to imagine people
| balking at the results of such a computation and thinking it's not
| worth it to even consider the possibility because you'd blow
| your stack so quickly it's not like there'd be any benefit to
| it.
| a1369209993 wrote:
| > it isn't possible to statically know how much stack size
| would be required.
|
| It's not just that - if it were merely that you couldn't know
| the size statically, you could use dynamic memory allocation.
| The problem with _that_, though, is that now every (not provably
| nonrecursive) function call can fail with a memory
| allocation error, and if your language doesn't surface that to
| the caller, and (correctly) doesn't allow spurious errors to
| appear out of nowhere (cough every modern programming language
| cough cough), there's no way to handle that error.
| totorovirus wrote:
| I think allocating something very large on the stack should have
| been caught in peer review and fixed to use the heap.
| sys_64738 wrote:
| In other words, use heap space and not stack space. This is
| pretty elementary in C programming.
| kstenerud wrote:
| That might have been true in the old days when memory wasn't
| the bottleneck, but in today's world where a cache miss is
| catastrophic, it makes MUCH more sense to use stack space where
| you can. This also has the side effect of facilitating
| idempotent functions and function purity in general.
|
| No sense in clinging to old world ideals when they no longer
| make sense.
| jlokier wrote:
| I agree that using the stack instead of heap makes sense for
| cache reasons.
|
| But in the contemporary world, the trend is increasingly to
| transform functions to "async" forms where much of the
| functions' local state _including return address_ is stored
| in heap-allocated space instead.
| ariadneconill wrote:
| That is certainly a take.
|
| Storing small data, like function-local ints, pointers, etc,
| on the stack is beneficial due to L1$ prefetching semantics,
| but storing a 512KB scratchpad on the stack (which is what
| the article is about) will totally trash your L1$ and you'll
| have MORE cache misses than you would if that scratchpad was
| not on the stack.
| dnautics wrote:
| this is too reductive. Usually stack space is faster because
| you will have fewer cache misses.
|
| future languages will be able to suss this out at compile-time:
| https://github.com/ziglang/zig/blob/2ac769eab9b7dba4cd38e5de...
| sys_64738 wrote:
| If speed is the issue, then it sounds like this is a critical
| code path that is sensitive to time. You'd be looking for
| alternative data structures at that point, so this point would
| be moot.
| IshKebab wrote:
| Sure, but 128 kB is really small even if you do that properly.
|
| Seems like it would be more sensible if the stack space could
| just grow when required. Surely not that difficult?
| pjmlp wrote:
| Enough for a full COM binary plus one overlay section. :)
| aidenn0 wrote:
| A typical stack frame is around a couple dozen words. Let's
| round that up to 32 words (256 bytes). 128K is enough for a
| 500 deep stack at that size. 128K is huge.
| CodesInChaos wrote:
| Since standard OS stacks are contiguous and unmovable, you
| can't grow them once they run out of space. However while the
| address space for the maximum stack size gets reserved, each
| page only requires backing memory once it's first used. So as
| far as physical memory consumption is concerned, typical
| stacks act as growable with a fixed maximum size. Since
| address space is huge on 64-bit systems, choosing a large
| stack size is cheap on such systems (at least if they allow
| overcommit).
|
| For the main thread, a system can also try to keep other
| allocations far from the stack without committing to any
| particular size (heap grows upwards from the bottom of the
| address space, stack downwards from the top). But this
| doesn't scale to multiple threads and leads to an
| unpredictable maximum stack size, so I prefer the fixed
| reserved space approach.
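| A rough illustration of that reserve-vs-commit behaviour on
| Linux (a sketch, not from the comment above): reserve a large
| anonymous mapping, touch only a little of it, and watch RSS
| stay small.
|
|   #include <stdio.h>
|   #include <string.h>
|   #include <sys/mman.h>
|
|   int main(void)
|   {
|       size_t reserve = 64ul << 20;   /* 64 MiB of address space */
|       char *region = mmap(NULL, reserve, PROT_READ | PROT_WRITE,
|                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
|                           -1, 0);
|       if (region == MAP_FAILED) {
|           perror("mmap");
|           return 1;
|       }
|       /* Touch only the top 16 KiB: roughly four pages become
|          resident, the rest remains a pure reservation. */
|       memset(region + reserve - 16384, 0xAA, 16384);
|       getchar();   /* inspect the process's RSS from another shell */
|       munmap(region, reserve);
|       return 0;
|   }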
| viraptor wrote:
| The main thread already does that. The thread stack does not.
| It probably made sense on 32b with relatively limited address
| space... But I'm curious why we're not applying that to all
| threads on 64b. Reserving a few tens of MB of address space per
| thread shouldn't be a big issue, right? (Without actually
| mapping those pages)
| megous wrote:
| Mapping a lot of pages has some overhead, too. So if you
| have large stacks per thread you'll take a hit.
|
| If you use huge pages to alleviate that, you'll waste a lot
| of physical memory.
| megous wrote:
| Wait until you realize OS kernel thread stack size is 8 KiB
| or in that range, if you think 128 KiB is small. :))
| tyingq wrote:
| It is easy for a single-threaded program, but since each
| thread has its own stack, it would be non-trivial for a
| multithreaded program.
| viraptor wrote:
| I'd say "be aware of stack sizes". If you can get away with
| just the appropriately sized stack in a thread, that's a nice
| performance gain over dealing with multithreading heap
| allocators.
| ynik wrote:
| It's a bit ridiculous to complicate recursive algorithms just
| because the stack sizes haven't been increased in the past 3
| decades.
|
| Nowadays we have at least a 48-bit virtual address space
| available; what's the harm in giving each thread a full GB of
| stack?
| lonjil wrote:
| If you need a bigger stack, you can get one. This stuff is
| merely the default.
| sys_64738 wrote:
| Generally recursion is not something you want in production
| code. It's cute for academia when studying algorithms, but where
| there's an iterative alternative, that should be used.
| moring wrote:
| > In general, it is my opinion that if your program is crashing
| on Alpine, it is because your program is dependent on behavior
| that is not guaranteed to actually exist, which means your
| program is not actually portable. When it comes to this kind of
| dependency, the typical issue has to deal with the thread stack
| size limit.
|
| The wording sounds as if it is trying to assign blame for the
| problem. What, then, _is_ the guaranteed thread stack size? A
| developer would obviously need to know this (and other things
| such as the amount of stack size required by variables,
| parameters and frames) to not fall into this trap of writing non-
| portable programs.
| lonjil wrote:
| > What, then, is the guaranteed thread stack size?
|
| If you _need_ a big stack, you can just ask pthread to give you
| one.
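| A minimal sketch of exactly that (error handling trimmed):
|
|   #include <pthread.h>
|
|   static void *deep_work(void *arg) { (void)arg; return NULL; }
|
|   int main(void)
|   {
|       pthread_attr_t attr;
|       pthread_t t;
|
|       pthread_attr_init(&attr);
|       /* Request an 8 MiB stack instead of the libc default. */
|       pthread_attr_setstacksize(&attr, 8u * 1024 * 1024);
|       pthread_create(&t, &attr, deep_work, NULL);
|       pthread_attr_destroy(&attr);
|       pthread_join(t, NULL);
|       return 0;
|   }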
| nsajko wrote:
| > The wording sounds as if it is trying to assign blame for the
| problem.
|
| Yes. Sadly, people often write incorrect programs.
|
| > What, then, is the guaranteed thread stack size?
|
| I can't be bothered to look up the POSIX guarantees; the gist
| of it, anyway, is that it depends on the application developer
| and system administrator.
|
| > A developer would obviously need to know this (and other
| things such as the amount of stack size required by variables,
| parameters and frames) to not fall into this trap of writing
| non-portable programs.
|
| If you're not going to calculate the exact requirements
| (probably unnecessary), guess/measure a number of bytes and
| allocate a stack that's 50 or 100 times greater than that.
| That's better than ignoring the existence of the stack, anyway.
| tyingq wrote:
| Guessing PTHREAD_STACK_MIN
| formerly_proven wrote:
| > Minimum Acceptable Value: 0
|
| Thanks, posix.
| ClumsyPilot wrote:
| It would be nice if the article explained why the current size
| was chosen and what the benefit of doing so is.
| kosinus wrote:
| This is one of the reasons why I'm no longer using Alpine as a
| base in Docker images. I ran into this limit specifically with
| node-sass.
|
| But in general, the difference in image size is negligible
| because of shared layers, and I just don't think enough testing
| happens on Alpine / musl in any given stack. Even if your app
| runtime is tested this way, how many dependencies are?
|
| Come to think of it, I'm not even sure why there was a push for
| Alpine-based Docker images at some point. Maybe it was just hype.
| aecay wrote:
| At $WORK, there's a process for automatically scanning docker
| images for packages that have CVEs against them. Any docker
| image that includes glibc instantly shoots to the top of the
| charts, mostly because of a boatload of high or critical
| severity CVEs relating to bugs in asm-implemented functions on
| platforms like ARM, POWER9, etc. Everything in our company runs
| on x86, but the CVE scanning tool is dumb, so a switch to
| alpine was heavily encouraged.
|
| This broke teams that rely on python and on node, but the
| docker image guidelines come from a team whose ideal language
| is now go (and most of whose legacy code is in java), so they
| are not really sensitive to those concerns. Ironically we tried
| to move to distroless as implemented by google[1], but that's
| based on debian which includes glibc, so the un-nuanced CVE
| checker freaks out again. That effort was quietly dropped.
|
| (I'm not actually disputing the proposition that alpine is
| better for security under certain circumstances, but I think a
| lot of "the push" comes from what might uncharitably be
| described as cargo culting, or with more insight as
| interpretations that make sense in one context [everything is a
| static binary, little to no reliance on traditional userland
| tools] being unquestioningly extended to other contexts.)
|
| [1] https://github.com/GoogleContainerTools/distroless
| arghwhat wrote:
| > Come to think of it, I'm not even sure why there was a push
| for Alpine-based Docker images at some point.
|
| The continuing push is due to the smaller footprint and better
| security properties. And no amount of sharing makes up for the
| difference between a single-MB image and a GB image.
|
| Any application can just dictate its own thread stack size.
| What is discussed here is a default.
| bradleyjg wrote:
| A slimmer image is better from an attack surface point of view.
| "Distroless" with its tree shaking takes this to its logical
| conclusion but when images on alpine started getting popular
| that wasn't available (at least to the general public).
| ldoughty wrote:
| It took some time before Ubuntu and Debian offered official
| slim containers... Before that, many programs in containers
| defaulted to Ubuntu or Debian, and it was 600MB to run
| something like Nginx or Apache, while Alpine was 40MB.
| rburhum wrote:
| Why would you overcomplicate your life and use something like the
| autofree example, that is not even portable, if you can use the
| heap which is simple to understand and do? I understand that if
| it is a hot function you may run into memory
| fragmentation/performance issues, but there are so many ways to
| deal with that with custom allocators _if it truly is a problem_.
| This is one of those perfect examples where simple is better IMHO.
| eqvinox wrote:
| The autofree example uses the heap. It just makes calling
| free() automatic when the function returns, regardless of where
| it does so. It's leak protection.
|
| It's also wrong since __attribute__((cleanup)) expects a
| function that takes an additional level of pointer. In this
| case, it'll call "free(&scratchpad);". Which doesn't get you a
| compiler warning in C because passing a char ** as a void * is
| perfectly fine. But your heap is f*cked after this.
|
| Correct way to do it is:
|
|   void free2(char **p) { free(*p); }
|   #define autofree __attribute__((cleanup(free2)))
| chrisseaton wrote:
| How can you write a program that runs without at least some
| guaranteed stack size? Are you at fault if your program doesn't
| run in a 1 KiB stack? And how do you work out what stack size your
| program takes from looking at the source code?
|
| I guess make sure your required stack size is not a function of
| input, and test against a minimum stack size.
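| One way to do that last part (a sketch, with run_workload standing
| in for whatever you want to test): run the code under test in a
| thread whose stack is deliberately tight, so overflows show up in
| CI rather than only on platforms with small defaults.
|
|   #include <pthread.h>
|
|   extern void *run_workload(void *arg);   /* hypothetical test entry */
|
|   int run_with_small_stack(void)
|   {
|       pthread_attr_t attr;
|       pthread_t t;
|       pthread_attr_init(&attr);
|       /* 128 KiB, roughly the musl/Alpine default, as the budget. */
|       pthread_attr_setstacksize(&attr, 128 * 1024);
|       int rc = pthread_create(&t, &attr, run_workload, NULL);
|       pthread_attr_destroy(&attr);
|       if (rc == 0)
|           pthread_join(t, NULL);
|       return rc;
|   }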
| simias wrote:
| If you have to work in an environment where stack size is very
| limited (typically a few KiB) you have to pay attention to
| certain things that you can brush away in more generous
| environments. In particular you need to be very careful with
| recursive functions and you probably want to use the heap or
| static storage for any object bigger than a couple dozen
| bytes.
|
| But in my experience you don't really compute a "guaranteed"
| stack size, you use your experience and knowledge of the
| program to make an educated guess, and then you apply a
| reasonable multiplier to give you some security margin.
|
| If you don't use (or severely limit) recursive calls you can
| usually just check that your deepest call stack fits within the
| bounds. Although finding the deepest call stack in the first
| place can be tricky given that compilers can aggressively
| inline function calls.
| viraptor wrote:
| Unless you do alloca() or dynamically sized local arrays, you
| can measure your stack usage in the deepest call stack. Add
| some space in each frame for potential instrumentation and you
| have your minimum.
|
| Keep in mind that this is just for thread stacks - you can set
| the size for them yourself, so ideally you'd always do it. Then
| a guaranteed minimum size becomes irrelevant.
| chrisseaton wrote:
| > Unless you do alloca() or dynamically sized local arrays,
| you can measure your stack usage in the deepest call stack.
|
| How does a normal working programmer calculate the size of
| each of their stack frames? I'm a compiler researcher and I'd
| struggle to do that. How are application developers going to
| do it?
|
| And how do you design a program to have a deterministic
| maximum call stack depth?
|
| I don't think these things are as easy as you're making out.
| viraptor wrote:
| You have 2 options: either your functions are recursive and
| you can hope and pray, or they're not and you can figure
| out which of your functions are the bottom of the call
| graph.
|
| In those leaf functions you can check &local_var and
| compare it to pthread_attr_getstack(pthread_getattr_np()).
| (Of course that's not precise for many reasons.)
|
| > And how do you design a program to have a deterministic
| maximum call stack depth?
|
| If you're running only your code - don't use recursion, or
| alloca. If you use external libraries, you have to research
| what they do and add some extra in case of updates.
|
| Bounded stack size is also a common issue if you're
| targeting small microprocessors.
|
| For non-critical apps it should be pretty easy to figure
| out the needed stack size. For cases when you want to
| guarantee it... that gets more tricky.
|
| Edit: just learned that clang has the option -fstack-usage
| which should help a lot.
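| The &local_var comparison above, spelled out with the real
| (non-portable) calls (glibc and musl both provide them; the
| result is approximate, as noted):
|
|   #define _GNU_SOURCE
|   #include <pthread.h>
|   #include <stdio.h>
|
|   /* Call from a deep/leaf function to see how much stack is left. */
|   void report_stack_headroom(void)
|   {
|       pthread_attr_t attr;
|       void *stack_base;
|       size_t stack_size;
|       char marker;   /* its address approximates the stack pointer */
|
|       if (pthread_getattr_np(pthread_self(), &attr) != 0)
|           return;
|       pthread_attr_getstack(&attr, &stack_base, &stack_size);
|       pthread_attr_destroy(&attr);
|
|       /* Stacks grow downwards on the usual targets, so this is the
|          distance left before hitting the low end of the mapping. */
|       size_t headroom = (size_t)(&marker - (char *)stack_base);
|       printf("stack: %zu KiB total, ~%zu KiB still unused\n",
|              stack_size / 1024, headroom / 1024);
|   }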
| CodesInChaos wrote:
| Dynamic dispatch (e.g. function pointers/delegates,
| virtual methods) is another case that makes figuring out
| the call graph and thus the maximum stack size difficult
| (via whole-program data-flow analysis) to impossible
| (function pointers come from outside your codebase or are
| constructed in ways analysis can't handle).
| creata wrote:
| > You have 2 options: either your functions are recursive
| and you can hope and pray
|
| Or you can try to figure out the maximum number of times
| it'll recurse: for example, the height of a red-black
| tree with less than 2^64 nodes is less than 128, iirc.
| lanstin wrote:
| One rather quickly runs into halting-problem type issues,
| especially in the function dispatch method. Imagine a DSL
| that does stuff, and is implemented by function pointers
| in the parser/interpreter, and then the question becomes
| one of program inputs. In any case, having such a small
| limit is crazy, and defending it with references to
| correctness smells of Ulrich Drepper and the memmove
| issue. The whole "sucks less" movement is a little too
| focused on purity for my taste. The only time I'm sad
| when I look at my memory usage on my personal laptop is
| when I have unused memory. Please, pre-fetch some
| news.ycombinator, cache some more inodes for my next ncdu
| or find command; I have already paid for the memory, not
| using it is silly. Sure, we software engineers get lazy,
| but, except in AWS, use all your memory, all your
| processors up to throttling, all the time. Why not?
|
| I remember one day in the 90s counting out like max
| address len and max zip code len and so on, trying to
| figure out how long to make my target stack-allocated
| buffer, and I was like fuck it, I have more important
| things to do, all my stack buffers are henceforth 65536
| bytes long.
| MaxBarraclough wrote:
| If we forbid things like recursion (including mutual
| recursion), function pointers, dynamic dispatch, and
| unbounded use of _alloca_, doesn't it then follow from the
| call graph and the per-function worst-case stack-usage
| numbers (which the compiler presumably knows)? Is that
| mistaken, or is the difficulty in generalising this
| approach to where those restrictions are lifted?
|
| I tried googling for how SPARK Ada provides assurances
| against exceeding stack-size limits, but I couldn't find a
| decent answer. I presume it does so, though.
|
| _edit: forgot about alloca_
|
| _edit 2: Turns out the AdaCore folks have a tool
| specifically for static analysis of stack-space
| requirements of Ada/C/C++ code:_
| https://www.adacore.com/gnatpro/toolsuite/gnatstack
| mjw1007 wrote:
| Not easy at all.
|
| I know that in the small-embedded world, people do work on
| such things.
|
| Eg https://github.com/japaric/cargo-call-stack
| megous wrote:
| You ask a compiler, since it knows the max stack
| requirements of every function it compiled, if it's fixed.
| If it's not fixed it may give you at least the minimum.
|
| For total depth, keeping your program simple and predictable
| helps. People certainly manage to do it even for large
| programs like Linux itself, where stack size is like 16KiB
| or so. https://elixir.bootlin.com/linux/v5.2/source/arch/x8
| 6/includ... and less on other archs. 8 KiB on arm https://e
| lixir.bootlin.com/linux/v5.13-rc7/source/arch/arm/i...
| chrisseaton wrote:
| But the compiler may not compile simple 'functions' as
| the user understands them - it may compile loop bodies,
| functions with other functions compiled in them,
| individual branches of functions, multiple versions of
| functions based on where they're called from...
|
| If I tell you as a compiler writer that this loop body
| from this function, but with this branch and this branch
| outlined, but only when called from this context, takes n
| bytes... I don't get what most working programmers are
| going to usefully do with that information.
| megous wrote:
| Where's the problem? The compiler will tell you what amount
| of stack a function will use; if it's inlined, it may not
| tell you for that function, but it will tell you for the
| function it was inlined into, which is what
| matters.
|
| If the language is complicated and has generics or
| whatever, the programmer will have to do more work to
| understand it.
|
| It's not a huge issue in C.
| chrisseaton wrote:
| > Compiler will tell you what amount of stack a function
| will use
|
| If you ask a compiler how much stack a function will use
| the answer for a non-trivial compiler for a complicated
| language is always going to be 'it depends...'
| MaxBarraclough wrote:
| As I mentioned in my other comment, AdaCore's _GNATstack_
| tool appears to be capable of reporting this information
| conservatively but with enough accuracy to be useful.
|
| https://www.adacore.com/gnatpro/toolsuite/gnatstack
| bsdetector wrote:
| > Add some space in each frame for potential instrumentation
| and you have your minimum.
|
| On exit just scan from the maximum stack to minimum looking
| for non-zero.
|
| If you have tests it should be easy to get within a few bytes
| of max stack used, which is probably just as good as
| instrumenting everything.
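| A sketch of the "paint, run, then scan" variant of that idea
| (hypothetical helper names, Linux/glibc/musl extensions assumed):
|
|   #define _GNU_SOURCE
|   #include <pthread.h>
|   #include <string.h>
|
|   #define PAINT 0xA5
|   #define GUARD (16 * 1024)   /* keep clear of the painting frames */
|
|   static void get_stack(char **base, size_t *size)
|   {
|       pthread_attr_t attr;
|       void *addr = NULL;
|       pthread_getattr_np(pthread_self(), &attr);
|       pthread_attr_getstack(&attr, &addr, size);
|       pthread_attr_destroy(&attr);
|       *base = addr;
|   }
|
|   /* Call near the top of the thread, before the workload: fill
|      the unused part of the stack with a known pattern. */
|   void stack_paint(void)
|   {
|       char *base; size_t size; char here;
|       get_stack(&base, &size);
|       if (&here - base > GUARD)
|           memset(base, PAINT, (size_t)(&here - base) - GUARD);
|   }
|
|   /* Call at the end: how deep did the workload ever reach? */
|   size_t stack_high_water(void)
|   {
|       char *base; size_t size, untouched = 0;
|       get_stack(&base, &size);
|       while (untouched < size &&
|              (unsigned char)base[untouched] == PAINT)
|           untouched++;
|       return size - untouched;   /* worst-case stack bytes used */
|   }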
| viraptor wrote:
| It's possible, but you need to watch out for some cases.
| For example let's say your furthest function declares char
| foo[4096], but uses only a few bytes of it in your testing.
| Your measurement will be 4k short.
| saagarjha wrote:
| In general, you just can't. This means that any function call
| in C can bust the stack, unfortunately. You can use
| heuristics to try to avoid using up large amounts of space
| (avoid alloca and large stack arrays, be careful about
| recursion) but other than that there isn't much you can do.
| sesuximo wrote:
| GCC has "-fstack-usage"
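| For reference, a tiny example of what that reports (hedged: the
| exact .su output format differs between compiler versions).
| Given something like:
|
|   /* demo.c -- deliberately oversized frame */
|   void some_function(volatile char *out)
|   {
|       char scratchpad[500000];
|       for (int i = 0; i < 500000; i++)
|           scratchpad[i] = 'A';
|       *out = scratchpad[499999];   /* keep the buffer live */
|   }
|
| building with "gcc -c -fstack-usage demo.c" writes a demo.su file
| with one line per function giving its frame size in bytes plus a
| static/dynamic qualifier, which makes frames like the one above
| easy to spot.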
___________________________________________________________________
(page generated 2021-06-26 23:03 UTC)