[HN Gopher] How much memory do you need to run 1M concurrent tasks?
       ___________________________________________________________________
        
       How much memory do you need to run 1M concurrent tasks?
        
       Author : pkolaczk
       Score  : 93 points
       Date   : 2023-05-21 19:58 UTC (3 hours ago)
        
 (HTM) web link (pkolaczk.github.io)
 (TXT) w3m dump (pkolaczk.github.io)
        
       | matklad wrote:
       | Benchmarks for threads might slightly underestimate memory usage.
        | For threads, some amount of memory is spent by the kernel (e.g.,
       | memory maps themselves) and won't show as maxrss.
       | 
       | > a high number of concurrent tasks can consume a significant
       | amount of memory,
       | 
       | So Go needs 3gigs for a million tasks. Relatively speaking, for
       | an app that does a million concurrent somethings, 3gb is probably
       | peanuts.
       | 
       | It seems that memory isn't the limiting factor here.
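As a quick illustration of what maxrss does and does not capture, here is a Python sketch (assuming Linux, where ru_maxrss is reported in kilobytes; macOS reports bytes):

```python
import resource

def maxrss_bytes():
    # ru_maxrss only counts pages charged to the process itself;
    # kernel-side memory such as page tables and memory-map
    # bookkeeping for thread stacks never shows up in this number.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024

print(maxrss_bytes() > 0)
```

So a thread-per-task benchmark read through maxrss alone will undercount the true system-wide cost.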
        
       | davidgrenier wrote:
        | For the sake of pitching an analog to Go's concurrency model:
        | Hopac is a concurrency library for F# (partly written in C#)
       | delivering an informal implementation of CSP (similar to Go's
       | goroutine and Clojure Core.async). Since the job does nothing but
       | wait for 10 seconds, the code is:
        |     seq {1..1000000}
        |     |> Seq.Con.iterJobIgnore (fun _ -> timeOutMillis 10000)
        |     |> run
        | 
        | I'm not sure how to calculate total memory usage, but
        | instrumenting it the following way:
        | 
        |     GC.Collect()
        |     GC.Collect()
        |     GC.WaitForPendingFinalizers()
        |     GC.TryStartNoGCRegion 3000000000L
        |     let before = GC.GetTotalMemory false
        |     seq {1..1000000}
        |     |> Seq.Con.iterJobIgnore (fun _ -> timeOutMillis 10000)
        |     |> run
        |     GC.GetTotalMemory false - before |> printfn "%d"
        |     GC.EndNoGCRegion()
       | 
        | gives the following output. I'm reading 144 MB for 1M lightweight
        | threads (in the REPL), unless I did something dumb:
        | 
        |     144449024
        |     Real: 00:00:10.902, CPU: 00:00:18.570, GC gen0: 3, gen1: 3, gen2: 3
        |     val before: int64 = 53902280L
        
       | beders wrote:
       | Another benchmark of the "I don't know what I'm actually
       | measuring" flavor.
       | 
       | I don't even know where to start: from the lack of env settings,
       | system settings, compiler settings to the fact that no concurrent
       | work was actually done, this is one giant flawed benchmark.
       | 
       | The only lesson here should be: Don't take any of this seriously.
        
         | tester756 wrote:
         | >from the lack of env settings, system settings, compiler
         | settings
         | 
         | compiler settings?? I bet he used defaults everywhere
        
         | pkolaczk wrote:
         | Just read the text. It lists versions and states it was using
         | defaults in the release mode. Most of those compilers don't
         | have any fancy options actually. E.g go is just `go build`,
         | Python is just `python3 ...`. Rust options are in the project
         | definition. Anyway, I'll add those command line invocations if
         | it causes so much controversy.
        
       | [deleted]
        
       | elteto wrote:
       | I'm surprised at Elixir failing with 1M tasks. I've definitely
       | done this sort of test with Erlang without an issue. Although I
       | can't really remember right now if I pushed it to 1M tasks.
        
         | fwlr wrote:
          | Unless I'm missing something in the writeup, it doesn't mention
          | how much RAM is on the system. A little odd to mention the CPU
          | and the OS but not the RAM, for a benchmark that doesn't care
          | about execution speed but does care about memory usage. It
          | makes the Elixir "system limit reached" error a little harder
          | to figure out.
         | 
         | edit: oh apparently it's got nothing to do with memory, 1M just
         | happens to be higher than the default process limit for Erlang
         | and thus Elixir.
        
         | crazydoggers wrote:
          | Almost certainly a system configuration issue. I'm pretty sure
         | the default process limit on BEAM is something like 200K.
         | Additionally I don't see mention of updating file handles or
         | anything else on the operating system.
         | 
         | Maybe I'm cynical but this is what happens if you rely on
         | chatGPT to give you programming answers without understanding
         | the fundamentals first.
        
         | sneako wrote:
         | "The maximum number of simultaneously alive Erlang processes is
         | by default 262,144. This limit can be configured at startup.
         | For more information, see the +P command-line flag in the
         | erl(1) manual page in ERTS."
         | 
         | https://www.erlang.org/doc/efficiency_guide/advanced.html#:~...
         | .
        
         | conradfr wrote:
         | Also someone needs to tell ChatGPT that its Elixir code has a
         | bug ;)
        
       | themerone wrote:
       | This is not a comparison of 1 million concurrent tasks.
       | 
        | For a bunch of these tests, it is measuring 1 million tasks
        | multiplexed onto a smaller number of OS threads.
       | 
       | In many implementations the cost of concurrency is going to be
       | small enough that this test is really a measure of the stack size
       | of a million functions.
        
         | creatonez wrote:
         | There is sometimes a distinction made between concurrency and
         | parallelism, where "concurrency" just means the programming
          | interface _looks_ parallel but is allowed to use out-of-order
          | task-switching under the hood, and "parallelism" is real-deal
         | usage of CPU cores. Both ways of looking at things are useful
         | in different situations.
        
       | wongarsu wrote:
       | I like the general idea of the test. Things I would have liked to
       | see (apart from basic system tweaks like allowing more file
       | handles) are a performance comparison (how much CPU time is
       | spent) and maybe a comparison between operating systems (this
       | might highlight a lot of windows-linux differences, like Windows
       | not committing memory - better have a large swap file to handle
       | stacks of 100k real threads - or windows having a very different
       | scheduler)
        
       | [deleted]
        
       | bsenftner wrote:
       | Why no C++ version?
        
         | cynicalsecurity wrote:
         | Because C/C++ would kick ass, exposing the rest of the hipster
         | technologies as lame.
        
           | Cyph0n wrote:
           | They would also kick memory safety in the ass, unless you're
           | the kind of dev who doesn't make those kinds of silly
           | mistakes
        
       | kevingadd wrote:
       | It would be interesting to see how much the C# memory usage would
       | go down if you allocated a single Task.Delay at the start and had
       | all the tasks wait for it. It would probably result in it
       | finishing in 10sec instead of climbing up to 12, while still
       | running 1 million tasks. I'm not sure how you would do this in
       | other runtimes, and there is still a risk of new problems
       | cropping up.
       | 
       | 1 million Task.Delays means 1 million queued waits in the sleep
       | machinery, which could run into some bad scaling and overhead.
       | But then 1 million tasks waiting on a single awaitable probably
       | hits overhead in other subsystems. My guess would be that 1
       | million waits is much cheaper than 1 million sleeps, because
       | sleeps have to be sorted by due time while the list of tasks
       | waiting for completion just needs to be sequential. Entries in
       | the sleep queue are also going to be bigger (at minimum, 4-8
       | bytes for due time + 4-8 bytes per completion callback) while
       | entries in a completion queue would just be 4-8 bytes for the
       | completion callback.
       | 
       | Task.WhenAll also has potentially worse behavior than the other
       | apps if you care about memory, since it will need to allocate a
       | big array of 1 million Tasks in order to pass it in. Many of the
       | other programs instead loop over the tasks, which eliminates that
       | overhead.
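The loop-over-tasks alternative can be sketched in Python's asyncio (an analog, not the article's C#; the task count and sleep are scaled down for speed):

```python
import asyncio

async def tick():
    await asyncio.sleep(0)
    return 1

async def main():
    tasks = [asyncio.create_task(tick()) for _ in range(10_000)]
    # asyncio.gather(*tasks) would unpack every task into one call,
    # analogous to Task.WhenAll taking a big array. Awaiting in a
    # loop avoids materializing that extra argument sequence; the
    # tasks still all run concurrently since they were already
    # scheduled by create_task.
    total = 0
    for t in tasks:
        total += await t
    return total

print(asyncio.run(main()))
```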
        
         | JB_Dev wrote:
         | The other thing I noticed is that they are wrapping every
         | Task.Delay(...) inside a Task.Run(...) call. That should not be
          | necessary. The inner task will yield back to the caller when it
          | hits the delay call anyway. There is no need to queue it with
          | an immediate yield as if it were background or CPU-bound work.
         | 
         | The Run() call might be creating a more complex state machine
         | and using more memory than necessary (2 tasks vs just 1 for
         | each delay). The memory usage might be even lower if the Run()
         | call is dropped.
        
       | dceddia wrote:
       | Tokio seems oddly good compared to the others. Is that because
       | it's using a thread pool instead of creating something new for
       | every task? I wonder how it would fare if the tasks were more
       | intense, and how the time-to-completion compares between these
       | approaches.
        
         | bombela wrote:
          | async Rust functions are compiled down to a finite state
          | machine, a "stackless coroutine" as I think it's called. I
          | suspect that C#/.NET does something similar, though I am just
         | speculating here. The difference between async-std and tokio
         | would then be explained by a lower overhead runtime per task in
         | tokio.
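The stackless-coroutine idea can be seen in Python too (a sketch by analogy, not how Rust or tokio is implemented): a generator is compiled to a resumable state machine whose suspended locals live in one heap object, not on an OS thread stack.

```python
def task():
    x = 1          # this local survives across the suspension point
    yield          # suspend; no OS stack is held while suspended
    return x

g = task()
next(g)            # run to the first yield
try:
    next(g)        # resume; the generator runs to completion
except StopIteration as e:
    print(e.value)
```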
        
       | Traubenfuchs wrote:
       | The Java tests should be optimized by making the ArrayList a
       | fixed size. All this resizing of the huge array backing it and
       | the potential final very high amount of empty backing array slots
       | probably makes the Java result worse than it could be.
       | 
       | I wonder if the parallel stream API with a virtual thread pool
       | would make it better or worse.
        
         | tialaramex wrote:
         | The ArrayList is just a growable array and is presumably using
         | an exponential growth strategy so it's equivalent to what
         | several of the others are doing. You could indeed tell Java how
         | big the ArrayList will grow in advance, but you could for
         | example likewise do that with the Rust Vec by asking for
         | Vec::with_capacity(num_tasks) instead of Vec::new().
         | 
         | Since this is amortized exponential growth it's likely making
         | negligible difference, although it is good style to tell data
         | structures what you know about their size early.
         | 
         | The Go approach doesn't appear to pay this penalty, it's just
         | incrementing a (presumably atomic?) counter, although I have no
         | idea where these tasks actually "live" until they decrement the
         | counter again - but most of the others are likewise using a
         | growable array.
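The amortized-growth point is easy to demonstrate with Python's list as a stand-in for ArrayList or Vec (a sketch; exact growth factors vary by runtime and version):

```python
import sys

xs = []
sizes = set()
for i in range(1000):
    xs.append(i)
    sizes.add(sys.getsizeof(xs))

# Geometric growth means 1000 appends trigger only a few dozen
# reallocations, so the set of distinct buffer sizes stays small
# and the amortized cost per append is O(1).
print(len(sizes) < 100)
```

Preallocating (like Vec::with_capacity) collapses those reallocations to one, but as noted above the savings are usually marginal.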
        
         | tester756 wrote:
         | C#'s list works the same way
        
       | cyberax wrote:
       | It's not really fair to compare async/await systems with Go and
       | Java.
       | 
       | Go and Java with lightweight threads provide a full-blown system,
       | without any limitations. You don't have to "color" your code into
       | async and blocking functions, everything "just works". Go can
       | even interrupt tight inner loops with async pre-emption.
       | 
       | The downside is that Go needs around 2k of stack minimum for each
       | goroutine. Java is similar, but has a lower per-thread constant.
       | 
       | The upside is that Go can be _much_ faster due to these
       | contiguous stacks. With async/await each yield point is basically
       | isomorphic to a segmented stack segment. I did a quick benchmark
       | to show this: https://blog.alex.net/benchmarking-go-and-c-async-
       | await
        
         | kevincox wrote:
         | On the contrary I think it is critical to compare them. You can
         | then decide if the tradeoff is worth it for your use case.
         | 
         | Even at a million tasks Go is still under 3 GiB of memory. So
         | that is roughly 3KiB of memory overhead per task. That is
         | likely negligible if your tasks are doing anything significant
         | and most people don't need to worry about this number of tasks.
         | 
         | So this comparison shows that in many cases it is worth paying
         | that price to avoid function colouring. But of course there are
         | still some use cases where the price isn't worth it.
        
         | tester756 wrote:
         | >num += Calculate(20).Result;
         | 
         | You're sure you should use .Result here instead of awaiting it?
        
           | uncheckederror wrote:
           | To expand upon this thought, here is the AsyncGuidance doc[1]
           | on why not to use .Result to get the return value of a
           | completed Task in C#.
           | 
           | To make this simple they introduced async Main[2] a few years
           | ago.
           | 
           | [1]: https://github.com/davidfowl/AspNetCoreDiagnosticScenari
           | os/b...
           | 
           | [2]: https://github.com/dotnet/csharplang/blob/main/proposals
           | /csh...
        
         | pkolaczk wrote:
          | I don't know about C#, but Rust async/await doesn't allocate
          | anything on the heap by itself. So it is not a universal
          | property of all async/await implementations, contrary to what
          | you imply in your blog post.
        
           | evntdrvn wrote:
           | For C#, you might want to try a different version with
           | ValueTask instead of Task. It's more memory-friendly.
           | 
           | It would also be interesting to try Native AOT compiling both
           | versions...
        
         | za3faran wrote:
         | One thing I wanted to add is that in golang, you end up passing
         | context.Context to all asynchronous functions to handle
         | cancellations and timeouts, so you "color" them regardless.
         | Java folks with structured concurrency have the right idea
         | here.
        
         | mav88 wrote:
         | You haven't written a concurrent Go program. Not sure what
         | you're trying to demonstrate.
        
         | ilyt wrote:
         | Your benchmark for Go runs in single goroutine
        
         | schemescape wrote:
         | Isn't the first example completely synchronous? If that's
         | what's being compared, why not use C# for both programs?
        
         | djsavvy wrote:
         | In your Go benchmark code, how is the calculation actually
         | being done across threads? I don't see the `go` keyword
         | anywhere in the source code.
        
           | h4x0rr wrote:
            | I'm wondering that as well. The C# code seems really
            | unrealistic though, using a lot of tasks for CPU-bound work.
            | A fair comparison would at least need to involve some degree
            | of IO. Does Go maybe automatically parallelize everything it
            | can? That would be one potential answer to the initial
            | question.
        
         | jimsimmons wrote:
         | None of what you say is a gotcha. An ignorant programmer stands
         | only to benefit from Go
        
       | neonsunset wrote:
       | As of today (with .NET 7 and 8 preview) 120MB base footprint
       | memory usage by .NET is indeed surprising. A few questions:
       | 
       | - How are you exactly launching the workload?
       | 
       | - How much memory does your system have?
       | 
       | - Could you try running it with .NET 7 instead?
       | 
       | The last two are especially relevant because in the recent
       | versions each release got extra work to reduce memory footprint
       | on Linux (mostly because it's an important metric for container
       | deployments).
       | 
       | Overall, 120MB sounds about right if you are launching a .NET 6
       | workload with some extra instrumentation attached while the
       | runtime picks Server GC which is much more enthusiastic about
       | allocating large segments of memory, especially if the system has
       | large amount of RAM.
       | 
       | As for Tasks - C# tasks are in many ways similar to Rust's
       | futures (both are yielding-to-the-caller state machines) and take
        | very little memory, which scales directly with the number of
        | variables that persist across "yield points" (of which you have
        | none).
        
         | DaiPlusPlus wrote:
         | > Overall, 120MB sounds about right if you are launching a .NET
         | 6 workload with some extra instrumentation attached while the
         | runtime picks Server GC which is much more enthusiastic about
         | allocating large segments of memory, especially if the system
         | has large amount of RAM.
         | 
         | Also if you have lots of CPU cores then the CLR will
         | (pre-)allocate up-to ~10MB/core for per-hardware-thread GC
         | heaps. So my 16-core (32-thread) desktop will run every .NET
         | program with a starting (OS-level) memory usage of ~200MB for a
         | Hello, World program - though actual total GC usage will be < 1
         | MB.
        
         | conradfr wrote:
         | I noticed he didn't use the latest version of each language.
         | Maybe that's just what you get on Ubuntu 22.04 LTS?
        
           | neonsunset wrote:
           | Using .NET 6 is perfectly fine! It is the latest LTS after
           | all, it's just that on machines with a lot of RAM the GC can
           | indeed pre-allocate quite a bit of memory if, based on
           | hardware configuration, it chooses to use Server GC. This,
           | however, is being tuned with each subsequent release hence
           | I'm curious to see the numbers on 7 on the author's hardware.
        
       | thethirdone wrote:
        | Surely a Sufficiently Smart Compiler could optimize all of these
        | to take only a constant amount of memory (and less time).
       | 
       | I wonder if there are any correctness constraints that would
       | actually prevent such optimizations. For example, the compiler
       | might be required to execute the exact same system calls as an
       | unoptimized program. It seems quite reasonable that the programs
       | using system threads could not be optimized significantly under
       | such a constraint.
       | 
       | I would expect the larger limit that prevents compilers from
       | optimizing this is the large amount of code that defines the
       | semantics of the lightweight threads. For example, I would expect
       | the observable / required behaviors of tokio to not prevent a SCC
       | from optimizing it, but the massive amount of Rust code that
       | defines the lightweight thread semantics cannot be reasonably
       | optimized.
       | 
       | Which programming language / framework is the closest to actually
       | being optimized to like < 4 byte / task? I would make an
       | uneducated guess of Elixir or Go.
        
       | [deleted]
        
       | wwarner wrote:
        | Would be nice to see throughput tested at this level. There may
        | be a good reason for Go to allocate more per thread if each
        | thread is really doing something. Hat tip to Rust and C# though!
        
         | bsaul wrote:
          | I suspect 1 million concurrent tasks very likely become
          | unmanageable in any real-world scenario anyway. 1 KB per task
          | gives 1 GB of total memory consumption, which means 100 KB per
          | task would be enough to hit memory limits on most systems.
          | 
          | There may be scenarios such as network connection handling
          | where those connections are kept idle 99% of the time, but you
          | still need something to handle the very rare case of receiving
          | something. However, are 1 million network connections
          | manageable from a kernel point of view anyway? Wouldn't you
          | reach system limits before that?
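One such limit is the per-process file-descriptor cap, which can be checked (and raised up to the hard limit without privileges) before any per-task memory matters. A Python sketch, assuming a Unix-like system:

```python
import resource

# Each TCP connection consumes one file descriptor, and the soft
# RLIMIT_NOFILE default is commonly 1024 -- far below 1 million.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# An unprivileged process may raise the soft limit up to `hard`:
#   resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
# Going beyond `hard` requires root / sysctl changes.
print(soft > 0)
```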
        
       | asabil wrote:
       | The maximum number of simultaneously alive Erlang processes is by
       | default 262,144. This needs to be raised to avoid the
       | `system_limit` error.
        
       | whateveracct wrote:
        | The title is asking a very quantifiable question, and I was
        | hoping the article would be a summary of expertise and deep dives
        | the author took into various languages' runtimes.
       | 
       | But then I read
       | 
       | > With a little help from ChatGPT
       | 
       | And sure enough the results & prose end up pretty much worthless.
       | Because the author didn't do any actual critical thinking or
       | technical work. (For instance, there's nothing more to "Go's
       | memory usage ballooned at 1M tasks" beyond the observation. Why
       | not? Just dig in there it's all documented you just have to read
       | some stuff.)
        
         | lexandstuff wrote:
         | Do you mind sharing why Go ballooned to 1M?
        
           | pkolaczk wrote:
            | Because goroutine stacks start at a minimum of 2 kB. But it
            | is not obvious whether this is the only factor that
            | contributes to memory usage, so I believe experimenting to
            | verify the theory is still quite valuable.
        
       | malkia wrote:
       | What's the point of running 1M concurrent tasks? I don't see a
       | good reason tbh.
        
       | d0mine wrote:
       | The results can be ignored.
       | 
       | You can't just copy-paste ChatGPT without even a sanity check
       | e.g., Python code is wrong:                  await sleep(10)
       | 
       | sleeps 10+ seconds and returns None but the intent is to pass a
       | coroutine to create_task() (await should be removed at the very
       | least)
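The corrected pattern looks like this (a sketch with the sleep and task count scaled down; the article used `asyncio.sleep(10)` and 1M tasks):

```python
import asyncio

async def main():
    # Buggy: asyncio.create_task(await asyncio.sleep(...)) awaits
    # each coroutine on the spot, running the sleeps sequentially
    # and passing None to create_task.
    # Correct: hand the un-awaited coroutine object to create_task
    # so every sleep is scheduled concurrently.
    tasks = [asyncio.create_task(asyncio.sleep(0.01)) for _ in range(1000)]
    await asyncio.gather(*tasks)
    return len(tasks)

print(asyncio.run(main()))
```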
        
         | pkolaczk wrote:
         | Ok, this one is funny. Actually ChatGPT got it right and there
         | was a create_task before, but right before committing I
         | "inlined it" and introduced a bug. Thanks for letting me know.
          | Will double-check that it didn't mess up the results.
         | 
         | And BTW - I'm very far from copy pasting from chat gpt. ChatGPT
         | helped with this code, but often required many iterations and
         | also some manual cleanup.
        
       | lordnacho wrote:
       | I'm a big Rust fan, so I'm happy Tokio holds up so well. I find
       | it to be one of the simplest and quickest ways to write task
       | based programs.
       | 
       | Gotta wonder whether you can challenge those numbers with c++
       | though. I suspect you can but with a lot of work, where Rust is
       | pretty effective written naively.
       | 
       | Of course the big issue with these comparisons is it's hard to be
       | an expert in so many different languages. How do you know if what
       | you're doing is the best way in each language?
        
       | jasonwatkinspdx wrote:
       | > "All programs were launched using the release mode if
       | available. Other options were left default."
       | 
       | So it's likely these numbers simply reflect choices in the
       | default configuration, not the ultimate limit with tuning.
       | 
       | I'm starting to dislike these low effort ChatGPT articles.
       | Getting an answer from it is not the same thing as actually
       | learning what's going on, but readers will walk away with the
       | answers uncritically.
        
         | vasco wrote:
         | > I'm starting to dislike these low effort ChatGPT articles.
         | 
         | The next few decades are going to be tough for you, I think.
         | Brace yourself!
        
         | dceddia wrote:
         | Lately I'm seeing a lot of this baked-in assumption that
         | ChatGPT writes good/idiomatic/performant code, as if the code
         | it generates is _~The Answer~_. I guess it's the modern version
         | of copy /pasting from Stack Overflow. The code I've had it
         | write isn't universally terrible but it almost always has some
         | low-hanging performance fruit (in JS for instance, it seems to
         | love spreading arrays). I'm not sure how much I trust it to
         | write good benchmark code!
        
           | kubb wrote:
           | Does spreading arrays come with a performance penalty?
        
             | spicebox wrote:
              | Spreading an array requires looping over all the array
              | elements, which is O(n), compared to something like
              | Array.push() which is just O(1), although spreading has
              | other benefits like copying rather than mutating.
        
             | dexwiz wrote:
             | Spreading allocates a new array. Most things you can do
             | with spread you can do with normal functions that either
             | just directly access an entry or mutate an array directly.
             | [first, ...rest] is nice, but will always be slower than
             | direct access.
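The same trade-off exists in Python, for comparison (a sketch using starred unpacking, Python's analog of JS spread):

```python
arr = list(range(100_000))

# Starred unpacking, like JS spread, builds a brand-new list in
# O(n) time and O(n) extra memory.
first, *rest = arr
assert rest is not arr and len(rest) == len(arr) - 1

# Plain indexing reads one element in O(1) with no new allocation.
print(arr[0] == first)
```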
        
         | bryancoxwell wrote:
         | Curious what build configuration changes would help with this.
         | Specifically with python/Go because that's what I'm most
         | familiar with, but any insight into tweaking build configs to
         | increase performance here would be really interesting to me.
        
           | [deleted]
        
         | fwlr wrote:
         | To play Devil's Advocate, this _is_ a good test of "how many
         | concurrent tasks will an uncritical developer relying on
         | ChatGPT end up achieving with low effort".
        
           | tialaramex wrote:
           | Right, realistically your product will _not_ primarily be
           | used by the most talented experts, who are optimising as best
           | as possible given intimate knowledge of internals. People
           | will do what basically works, and it 's arguably doing a
           | disservice when "benchmarks" avoid showing us also naive
           | performance without the tuning. The tuned performance is
           | worth knowing, but realistically most customers will get the
           | naive performance, ain't nobody got time to learn to tune
           | everything.
           | 
           | I think this applies all the way down to hardware features
           | and programming languages. For example the naive use of
            | sort() in C++ will go faster than in Rust (in C++ that's an
            | unstable sort; in Rust it's stable) but may astonish naive
           | programmers (if you don't know what an unstable sort is, or
           | didn't realise that's what the C++ function does). Or
           | opposite example, Rust's Vec::reserve is actively good to
           | call in your code that adds a bunch of things to a Vec, it
           | never destroys amortized growth, but C++ std::vector reserve
           | _does_ destroy amortized growth so you should avoid calling
           | it unless you know the _final_ size of the std::vector.
        
           | latency-guy2 wrote:
           | Time to fire a bunch of people then
        
       | mkoubaa wrote:
       | Is memory really the right resource bottleneck for this many
       | tasks?
        
         | phamilton wrote:
         | For a workload like websockets, where things are generally
         | idle, this is a decent benchmark.
        
       | throwawaaarrgh wrote:
        | I see C is missing. If you're going to do a language comparison,
        | might as well include the one language we know will have the best
        | performance and lowest memory usage.
        
       | withinboredom wrote:
       | Haha, I had to try a PHP build with the parallel extension[1]
       | with some rather simple code[2]:
       | 
       | - single job: 27.8 mb (respectable)
       | 
       | - 10k: 8.3 GB!
       | 
       | At this point, I gave up. I'm 99% sure trying to go to 1 million
       | will crash my computer.
       | 
       | [1]: https://www.php.net/manual/en/intro.parallel.php
       | 
       | [2]:
       | https://gist.github.com/withinboredom/8256055709c0e6a8272b3a...
        
         | withinboredom wrote:
         | I should add that the good ole' fork approach[1] only uses
         | 40.7mb for 10k, 40.7mb for 100k, and 40.7mb for 1 million.
         | 
         | Got to love COW memory. :)
         | 
         | [1]:
         | https://gist.github.com/withinboredom/c811f3c433904235a5f196...
        
       | bastawhiz wrote:
       | Understanding full well that these are pretty artificial results,
        | I'm surprised at just how well Node and Python fared against Go
       | and Java. For all the default assumptions around languages that
       | folks have, they've certainly held their own here.
        
       | znpy wrote:
       | The go memory usage can probably be explained by the fact that go
       | uses fairly large stacks for goroutines (according to
       | https://go.dev/doc/go1.2#stack_size it's 8kb per goroutine, even
       | though that's old)
        
         | pkolaczk wrote:
          | If that were true, the results would be much worse. The average
          | from my measurements is a bit above 2 kB. This website says it
         | is 2 kB: https://tpaschalis.me/goroutines-size/
        
         | amscanne wrote:
         | Stacks changed a lot in Go subsequently. I believe that in 1.2,
         | segmented stacks were still being used. (They now are
         | contiguous and copied on upsize.)
        
         | ilyt wrote:
          | It also runs _actually in parallel_, unlike the Rust and JS
          | ones.
        
       | guenthert wrote:
       | If those 'tasks' can be expressed as state machines, then I'd
       | think log2(<number of states>) * 1e6 bits would be the lower
       | limit.
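That floor is easy to compute (a sketch of the arithmetic, taking log2 of the state count per task and packing with zero overhead):

```python
import math

def lower_bound_bytes(num_states, num_tasks=1_000_000):
    # ceil(log2(states)) bits per task, no per-task overhead.
    return math.ceil(math.log2(num_states)) * num_tasks // 8

# A pure two-state task (e.g. waiting/done) needs 1 bit each, so
# 1M tasks fit in 125,000 bytes -- an information-theoretic floor,
# far below what any of the measured runtimes achieve.
print(lower_bound_bytes(2))
```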
        
       ___________________________________________________________________
       (page generated 2023-05-21 23:00 UTC)