[HN Gopher] How much memory do you need to run 1M concurrent tasks?
___________________________________________________________________
How much memory do you need to run 1M concurrent tasks?
Author : pkolaczk
Score : 93 points
Date : 2023-05-21 19:58 UTC (3 hours ago)
(HTM) web link (pkolaczk.github.io)
(TXT) w3m dump (pkolaczk.github.io)
| matklad wrote:
| Benchmarks for threads might slightly underestimate memory usage.
| For threads, some amount of memory is spent by the kernel
| (e.g., the memory maps themselves) and won't show up in
| maxrss.
|
| > a high number of concurrent tasks can consume a significant
| amount of memory,
|
| So Go needs 3 gigs for a million tasks. Relatively speaking,
| for an app that does a million concurrent somethings, 3 GB is
| probably peanuts.
|
| It seems that memory isn't the limiting factor here.
| davidgrenier wrote:
| For the sake of pitching an analog to Go's concurrency model:
| Hopac is a concurrency library for F# (partly written in C#)
| delivering an informal implementation of CSP (similar to Go's
| goroutines and Clojure's core.async). Since the job does
| nothing but wait for 10 seconds, the code is:
|         seq{1..1000000}
|         |> Seq.Con.iterJobIgnore (fun _ -> timeOutMillis 10000)
|         |> run
|
| I'm not sure how to calculate total memory usage, but
| instrumenting it the following way:
|         GC.Collect()
|         GC.Collect()
|         GC.WaitForPendingFinalizers()
|         GC.TryStartNoGCRegion 3000000000L
|         let before = GC.GetTotalMemory false
|         seq{1..1000000}
|         |> Seq.Con.iterJobIgnore (fun _ -> timeOutMillis 10000)
|         |> run
|         GC.GetTotalMemory false - before |> printfn "%d"
|         GC.EndNoGCRegion()
|
| Gives the following output. I'm reading 144 MB for 1M
| lightweight threads (in the REPL), unless I did something
| dumb:
| 144449024
| Real: 00:00:10.902, CPU: 00:00:18.570,
| GC gen0: 3, gen1: 3, gen2: 3
| val before: int64 = 53902280L
| beders wrote:
| Another benchmark of the "I don't know what I'm actually
| measuring" flavor.
|
| I don't even know where to start: from the lack of env settings,
| system settings, compiler settings to the fact that no concurrent
| work was actually done, this is one giant flawed benchmark.
|
| The only lesson here should be: Don't take any of this seriously.
| tester756 wrote:
| >from the lack of env settings, system settings, compiler
| settings
|
| compiler settings?? I bet he used defaults everywhere
| pkolaczk wrote:
| Just read the text. It lists versions and states it was using
| defaults in release mode. Most of those compilers don't have
| any fancy options anyway. E.g. Go is just `go build`, Python
| is just `python3 ...`. The Rust options are in the project
| definition. Anyway, I'll add those command-line invocations if
| it causes so much controversy.
| [deleted]
| elteto wrote:
| I'm surprised at Elixir failing with 1M tasks. I've definitely
| done this sort of test with Erlang without an issue. Although I
| can't really remember right now if I pushed it to 1M tasks.
| fwlr wrote:
| Unless I'm missing something in the writeup, it doesn't
| mention how much RAM is on the system. A little odd to mention
| the CPU and the OS but not the RAM, for a benchmark that
| doesn't care about execution speed but does care about memory
| usage. It also makes that Elixir "system limit reached" error
| a little harder to figure out.
|
| edit: oh apparently it's got nothing to do with memory, 1M just
| happens to be higher than the default process limit for Erlang
| and thus Elixir.
| crazydoggers wrote:
| Almost certainly a system configuration issue. I'm pretty
| sure the default process limit on BEAM is something like 200K.
| Additionally, I don't see any mention of updating file handle
| limits or anything else on the operating system.
|
| Maybe I'm cynical, but this is what happens when you rely on
| ChatGPT to give you programming answers without understanding
| the fundamentals first.
| sneako wrote:
| "The maximum number of simultaneously alive Erlang processes is
| by default 262,144. This limit can be configured at startup.
| For more information, see the +P command-line flag in the
| erl(1) manual page in ERTS."
|
| https://www.erlang.org/doc/efficiency_guide/advanced.html#:~...
| .
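|
| So to run this benchmark the limit has to be raised at VM
| startup; untested, but something along these lines should
| work for the Elixir version (main.exs is just a guess at the
| script name):
|         elixir --erl "+P 2000000" main.exs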
| conradfr wrote:
| Also someone needs to tell ChatGPT that its Elixir code has a
| bug ;)
| themerone wrote:
| This is not a comparison of 1 million concurrent tasks.
|
| For a bunch of these tests, it is measuring 1 million tasks
| multiplexed onto a smaller number of OS threads.
|
| In many implementations the cost of concurrency is going to be
| small enough that this test is really a measure of the stack size
| of a million functions.
| creatonez wrote:
| There is sometimes a distinction made between concurrency and
| parallelism, where "concurrency" just means the programming
| interface _looks_ parallel but is allowed to use out-of-order
| task-switching under the hood, and "parallelism" is real-deal
| usage of CPU cores. Both ways of looking at things are useful
| in different situations.
| wongarsu wrote:
| I like the general idea of the test. Things I would have liked to
| see (apart from basic system tweaks like allowing more file
| handles) are a performance comparison (how much CPU time is
| spent) and maybe a comparison between operating systems (this
| might highlight a lot of windows-linux differences, like Windows
| not committing memory - better have a large swap file to handle
| stacks of 100k real threads - or windows having a very different
| scheduler)
| [deleted]
| bsenftner wrote:
| Why no C++ version?
| cynicalsecurity wrote:
| Because C/C++ would kick ass, exposing the rest of the hipster
| technologies as lame.
| Cyph0n wrote:
| They would also kick memory safety in the ass, unless you're
| the kind of dev who doesn't make those kinds of silly
| mistakes
| kevingadd wrote:
| It would be interesting to see how much the C# memory usage would
| go down if you allocated a single Task.Delay at the start and had
| all the tasks wait for it. It would probably result in it
| finishing in 10sec instead of climbing up to 12, while still
| running 1 million tasks. I'm not sure how you would do this in
| other runtimes, and there is still a risk of new problems
| cropping up.
|
| 1 million Task.Delays means 1 million queued waits in the sleep
| machinery, which could run into some bad scaling and overhead.
| But then 1 million tasks waiting on a single awaitable probably
| hits overhead in other subsystems. My guess would be that 1
| million waits is much cheaper than 1 million sleeps, because
| sleeps have to be sorted by due time while the list of tasks
| waiting for completion just needs to be sequential. Entries in
| the sleep queue are also going to be bigger (at minimum, 4-8
| bytes for due time + 4-8 bytes per completion callback) while
| entries in a completion queue would just be 4-8 bytes for the
| completion callback.
|
| Task.WhenAll also has potentially worse behavior than the other
| apps if you care about memory, since it will need to allocate a
| big array of 1 million Tasks in order to pass it in. Many of the
| other programs instead loop over the tasks, which eliminates that
| overhead.
| JB_Dev wrote:
| The other thing I noticed is that they are wrapping every
| Task.Delay(...) inside a Task.Run(...) call. That shouldn't
| be necessary: the inner task will yield back to the caller
| when it hits the delay call anyway. There is no need to queue
| it with an immediate yield as if it were background or
| CPU-bound work.
|
| The Run() call might be creating a more complex state machine
| and using more memory than necessary (2 tasks vs just 1 for
| each delay). The memory usage might be even lower if the Run()
| call is dropped.
| dceddia wrote:
| Tokio seems oddly good compared to the others. Is that because
| it's using a thread pool instead of creating something new for
| every task? I wonder how it would fare if the tasks were more
| intense, and how the time-to-completion compares between these
| approaches.
| bombela wrote:
| Async Rust functions are compiled down to finite state
| machines; "stackless coroutines", I think they can be called.
| I suspect that C#/.NET does something similar, though I am
| just speculating here. The difference between async-std and
| Tokio would then be explained by a lower per-task runtime
| overhead in Tokio.
| Traubenfuchs wrote:
| The Java test could be improved by pre-sizing the ArrayList.
| All the resizing of the huge backing array, plus the
| potentially large number of empty backing-array slots left at
| the end, probably makes the Java result worse than it needs
| to be.
|
| I wonder if the parallel stream API with a virtual thread pool
| would make it better or worse.
| tialaramex wrote:
| The ArrayList is just a growable array and is presumably using
| an exponential growth strategy so it's equivalent to what
| several of the others are doing. You could indeed tell Java how
| big the ArrayList will grow in advance, but you could for
| example likewise do that with the Rust Vec by asking for
| Vec::with_capacity(num_tasks) instead of Vec::new().
|
| Since this is amortized exponential growth it's likely making
| negligible difference, although it is good style to tell data
| structures what you know about their size early.
|
| The Go approach doesn't appear to pay this penalty, it's just
| incrementing a (presumably atomic?) counter, although I have no
| idea where these tasks actually "live" until they decrement the
| counter again - but most of the others are likewise using a
| growable array.
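|
| My guess is the Go version is just a sync.WaitGroup, i.e.
| shaped roughly like this (a sketch, not necessarily the
| author's exact code):
|
|         package main
|
|         import (
|             "sync"
|             "time"
|         )
|
|         func main() {
|             var wg sync.WaitGroup
|             for i := 0; i < 1_000_000; i++ {
|                 wg.Add(1) // increment the counter
|                 go func() {
|                     defer wg.Done() // decrement it again
|                     time.Sleep(10 * time.Second)
|                 }()
|             }
|             wg.Wait() // block until the counter reaches zero
|         }
|
| The tasks "live" as goroutine stacks owned by the runtime
| scheduler, so nothing per-task needs to be stored in user
| code at all.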
| tester756 wrote:
| C#'s list works the same way
| cyberax wrote:
| It's not really fair to compare async/await systems with Go and
| Java.
|
| Go and Java with lightweight threads provide a full-blown system,
| without any limitations. You don't have to "color" your code into
| async and blocking functions, everything "just works". Go can
| even interrupt tight inner loops with async pre-emption.
|
| The downside is that Go needs around 2k of stack minimum for each
| goroutine. Java is similar, but has a lower per-thread constant.
|
| The upside is that Go can be _much_ faster thanks to these
| contiguous stacks. With async/await, each yield point is
| basically isomorphic to a segmented-stack boundary. I did a
| quick benchmark to show this:
| https://blog.alex.net/benchmarking-go-and-c-async-await
| kevincox wrote:
| On the contrary I think it is critical to compare them. You can
| then decide if the tradeoff is worth it for your use case.
|
| Even at a million tasks Go is still under 3 GiB of memory. So
| that is roughly 3KiB of memory overhead per task. That is
| likely negligible if your tasks are doing anything significant
| and most people don't need to worry about this number of tasks.
|
| So this comparison shows that in many cases it is worth paying
| that price to avoid function colouring. But of course there are
| still some use cases where the price isn't worth it.
| tester756 wrote:
| >num += Calculate(20).Result;
|
| You're sure you should use .Result here instead of awaiting it?
| uncheckederror wrote:
| To expand upon this thought, here is the AsyncGuidance doc[1]
| on why not to use .Result to get the return value of a
| completed Task in C#.
|
| To make this simple they introduced async Main[2] a few years
| ago.
|
| [1]: https://github.com/davidfowl/AspNetCoreDiagnosticScenari
| os/b...
|
| [2]: https://github.com/dotnet/csharplang/blob/main/proposals
| /csh...
| pkolaczk wrote:
| I don't know about C#, but Rust async/await doesn't allocate
| anything on the heap by itself. So it is not a universal
| property of all async/await implementations, contrary to what
| you imply in your blog post.
| evntdrvn wrote:
| For C#, you might want to try a different version with
| ValueTask instead of Task. It's more memory-friendly.
|
| It would also be interesting to try Native AOT compiling both
| versions...
| za3faran wrote:
| One thing I wanted to add is that in golang, you end up passing
| context.Context to all asynchronous functions to handle
| cancellations and timeouts, so you "color" them regardless.
| Java folks with structured concurrency have the right idea
| here.
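|
| For anyone who hasn't written Go: the pattern looks roughly
| like this (fetch is a hypothetical function, but the
| signature is the idiomatic shape):
|
|         package main
|
|         import (
|             "context"
|             "fmt"
|             "time"
|         )
|
|         // Every cancellable function takes a ctx first - the
|         // de-facto "color".
|         func fetch(ctx context.Context, url string) error {
|             select {
|             case <-time.After(100 * time.Millisecond): // work
|                 return nil
|             case <-ctx.Done(): // cancelled or timed out
|                 return ctx.Err()
|             }
|         }
|
|         func main() {
|             ctx, cancel := context.WithTimeout(
|                 context.Background(), 50*time.Millisecond)
|             defer cancel()
|             err := fetch(ctx, "https://example.com")
|             fmt.Println(err) // context deadline exceeded
|         }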
| mav88 wrote:
| You haven't written a concurrent Go program. Not sure what
| you're trying to demonstrate.
| ilyt wrote:
| Your benchmark for Go runs in a single goroutine.
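|
| Something like this would actually fan the work out across
| goroutines (a hypothetical sketch - Calculate here is a naive
| stand-in for the CPU-bound work in your post, which is C#):
|
|         package main
|
|         import "fmt"
|
|         // Calculate stands in for the benchmarked CPU-bound
|         // work; any pure function would do for the demo.
|         func Calculate(n int) int {
|             if n < 2 {
|                 return n
|             }
|             return Calculate(n-1) + Calculate(n-2)
|         }
|
|         func main() {
|             const tasks = 100
|             results := make(chan int)
|             for i := 0; i < tasks; i++ {
|                 // `go` is what makes it concurrent
|                 go func() { results <- Calculate(20) }()
|             }
|             num := 0
|             for i := 0; i < tasks; i++ {
|                 num += <-results // fan the results back in
|             }
|             fmt.Println(num)
|         }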
| schemescape wrote:
| Isn't the first example completely synchronous? If that's
| what's being compared, why not use C# for both programs?
| djsavvy wrote:
| In your Go benchmark code, how is the calculation actually
| being done across threads? I don't see the `go` keyword
| anywhere in the source code.
| h4x0rr wrote:
| I'm wondering that as well. The c# code seems really
| unrealistic though, using a lot of tasks for CPU bound work.
| A fair comparison would at least need to involve some degree
| of IO. Does go maybe automatically parallelize everythinh it
| can? That would be one potential answer to the initial
| question
| jimsimmons wrote:
| None of what you say is a gotcha. An ignorant programmer
| stands only to benefit from Go.
| neonsunset wrote:
| As of today (with .NET 7, and 8 in preview), a 120 MB base
| memory footprint for .NET is indeed surprising. A few
| questions:
|
| - How are you exactly launching the workload?
|
| - How much memory does your system have?
|
| - Could you try running it with .NET 7 instead?
|
| The last two are especially relevant because in the recent
| versions each release got extra work to reduce memory footprint
| on Linux (mostly because it's an important metric for container
| deployments).
|
| Overall, 120MB sounds about right if you are launching a .NET 6
| workload with some extra instrumentation attached while the
| runtime picks Server GC which is much more enthusiastic about
| allocating large segments of memory, especially if the system has
| large amount of RAM.
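|
| One easy way to rule Server GC in or out: force Workstation
| GC with an environment variable (the knob is real; the binary
| name and the expected saving are my guesses):
|         DOTNET_gcServer=0 ./benchmark-binary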
|
| As for Tasks - C# tasks are in many ways similar to Rust's
| futures (both are yielding-to-the-caller state machines) and take
| very little memory, which scales directly with the amount of
| variables that persist across "yield points" (which you have
| none).
| DaiPlusPlus wrote:
| > Overall, 120MB sounds about right if you are launching a .NET
| 6 workload with some extra instrumentation attached while the
| runtime picks Server GC which is much more enthusiastic about
| allocating large segments of memory, especially if the system
| has large amount of RAM.
|
| Also, if you have lots of CPU cores, the CLR will
| (pre-)allocate up to ~10 MB/core for per-hardware-thread GC
| heaps. So on my 16-core (32-thread) desktop, even a Hello,
| World program starts with an (OS-level) memory usage of
| ~200 MB, though its actual total GC usage will be < 1 MB.
| conradfr wrote:
| I noticed he didn't use the latest version of each language.
| Maybe that's just what you get on Ubuntu 22.04 LTS?
| neonsunset wrote:
| Using .NET 6 is perfectly fine! It is the latest LTS after
| all, it's just that on machines with a lot of RAM the GC can
| indeed pre-allocate quite a bit of memory if, based on
| hardware configuration, it chooses to use Server GC. This,
| however, is being tuned with each subsequent release, hence
| I'm curious to see the numbers for 7 on the author's
| hardware.
| thethirdone wrote:
| Surely a Sufficiently Smart Compiler could optimize all of
| these to take only a constant amount of memory (and less
| time).
|
| I wonder if there are any correctness constraints that would
| actually prevent such optimizations. For example, the compiler
| might be required to execute the exact same system calls as an
| unoptimized program. It seems quite reasonable that the programs
| using system threads could not be optimized significantly under
| such a constraint.
|
| I would expect the bigger obstacle preventing compilers from
| optimizing this to be the large amount of code that defines
| the semantics of the lightweight threads. For example, I
| would expect the observable / required behaviors of tokio not
| to prevent an SSC from optimizing it, but the massive amount
| of Rust code that defines the lightweight thread semantics
| cannot reasonably be optimized away.
|
| Which programming language / framework is closest to actually
| being optimizable to < 4 bytes / task? I would make an
| uneducated guess of Elixir or Go.
| [deleted]
| wwarner wrote:
| Would be nice to see throughput tested at this level. There
| may well be a good reason for Go to allocate more per thread
| if each thread is really doing something. Hat tip to Rust and
| C# tho!
| bsaul wrote:
| I suspect 1 million concurrent tasks very likely become
| unmanageable in any real-world scenario anyway. 1 kB per task
| gives 1 GB of total memory consumption, which means that at
| 100 kB per task you would already be hitting memory limits on
| most systems.
|
| There may be scenarios such as network connection handling
| where those connections are kept idle 99% of the time, but
| you still need something to handle the very rare case of
| receiving something. However, is 1 million network
| connections even manageable from a kernel point of view?
| Wouldn't you reach system limits before that?
| asabil wrote:
| The maximum number of simultaneously alive Erlang processes is by
| default 262,144. This needs to be raised to avoid the
| `system_limit` error.
| whateveracct wrote:
| The title is asking a very quantifiable question, and I was
| hoping the article would be a summary of expertise and deep
| dives the author took into the various languages' runtimes.
|
| But then I read
|
| > With a little help from ChatGPT
|
| And sure enough, the results & prose end up pretty much
| worthless, because the author didn't do any actual critical
| thinking or technical work. (For instance, there's nothing
| more to "Go's memory usage ballooned at 1M tasks" beyond the
| observation. Why not dig in? It's all documented; you just
| have to read some stuff.)
| lexandstuff wrote:
| Do you mind sharing why Go ballooned at 1M?
| pkolaczk wrote:
| Because goroutine stacks start at a minimum of 2 kB. But it
| is not obvious whether this is the only factor contributing
| to memory usage, so I believe experimenting to verify the
| theory is still quite valuable.
| malkia wrote:
| What's the point of running 1M concurrent tasks? I don't see a
| good reason tbh.
| d0mine wrote:
| The results can be ignored.
|
| You can't just copy-paste ChatGPT output without even a
| sanity check. E.g., the Python code is wrong:
|         await sleep(10)
|
| sleeps 10+ seconds and returns None, but the intent is to
| pass a coroutine to create_task(), i.e.
| create_task(sleep(10)); the await should be removed at the
| very least.
| pkolaczk wrote:
| Ok, this one is funny. ChatGPT actually got it right and
| there was a create_task there before, but right before
| committing I "inlined" it and introduced the bug. Thanks for
| letting me know. I will double-check that it didn't mess up
| the results.
|
| And BTW - I'm very far from copy-pasting from ChatGPT.
| ChatGPT helped with this code, but it often required many
| iterations and also some manual cleanup.
| lordnacho wrote:
| I'm a big Rust fan, so I'm happy Tokio holds up so well. I find
| it to be one of the simplest and quickest ways to write task
| based programs.
|
| Gotta wonder whether you can challenge those numbers with C++
| though. I suspect you can, but with a lot of work, whereas
| Rust is pretty effective written naively.
|
| Of course the big issue with these comparisons is it's hard to be
| an expert in so many different languages. How do you know if what
| you're doing is the best way in each language?
| jasonwatkinspdx wrote:
| > "All programs were launched using the release mode if
| available. Other options were left default."
|
| So it's likely these numbers simply reflect choices in the
| default configuration, not the ultimate limit with tuning.
|
| I'm starting to dislike these low effort ChatGPT articles.
| Getting an answer from it is not the same thing as actually
| learning what's going on, but readers will walk away with the
| answers uncritically.
| vasco wrote:
| > I'm starting to dislike these low effort ChatGPT articles.
|
| The next few decades are going to be tough for you, I think.
| Brace yourself!
| dceddia wrote:
| Lately I'm seeing a lot of this baked-in assumption that
| ChatGPT writes good/idiomatic/performant code, as if the code
| it generates is _~The Answer~_. I guess it's the modern
| version of copy/pasting from Stack Overflow. The code I've
| had it write isn't universally terrible, but it almost always
| has some low-hanging performance fruit (in JS, for instance,
| it seems to love spreading arrays). I'm not sure how much I
| trust it to write good benchmark code!
| kubb wrote:
| Does spreading arrays come with a performance penalty?
| spicebox wrote:
| Spreading an array requires looping over all the array
| elements, which is O(n), compared to something like
| Array.push() which is just O(1). Spreading has other benefits
| though, like copying rather than mutating.
| dexwiz wrote:
| Spreading allocates a new array. Most things you can do
| with spread you can do with normal functions that either
| just directly access an entry or mutate an array directly.
| [first, ...rest] is nice, but will always be slower than
| direct access.
| bryancoxwell wrote:
| Curious what build configuration changes would help with
| this. Specifically with Python/Go because those are what I'm
| most familiar with, but any insight into tweaking build
| configs to improve performance here would be really
| interesting to me.
| [deleted]
| fwlr wrote:
| To play Devil's Advocate, this _is_ a good test of "how many
| concurrent tasks will an uncritical developer relying on
| ChatGPT end up achieving with low effort".
| tialaramex wrote:
| Right, realistically your product will _not_ primarily be
| used by the most talented experts, optimising as best as
| possible with intimate knowledge of internals. People will do
| what basically works, and it's arguably a disservice when
| "benchmarks" avoid also showing us the naive performance
| without the tuning. The tuned performance is worth knowing,
| but realistically most customers will get the naive
| performance; ain't nobody got time to learn to tune
| everything.
|
| I think this applies all the way down to hardware features
| and programming languages. For example, the naive use of
| sort() in C++ will go faster than in Rust (in C++ that's an
| unstable sort; in Rust it's stable), but it may astonish
| naive programmers (if you don't know what an unstable sort
| is, or didn't realise that's what the C++ function does). Or,
| an opposite example: Rust's Vec::reserve is actively good to
| call in code that adds a bunch of things to a Vec, since it
| never destroys amortized growth, but C++ std::vector reserve
| _does_ destroy amortized growth, so you should avoid calling
| it unless you know the _final_ size of the std::vector.
| latency-guy2 wrote:
| Time to fire a bunch of people then
| mkoubaa wrote:
| Is memory really the right resource bottleneck for this many
| tasks?
| phamilton wrote:
| For a workload like websockets, where things are generally
| idle, this is a decent benchmark.
| throwawaaarrgh wrote:
| I see C is missing. If you're going to do a language comparison,
| might as well include the one language we know will have the best
| performance and lowest memory
| withinboredom wrote:
| Haha, I had to try a PHP build with the parallel extension[1]
| with some rather simple code[2]:
|
| - single job: 27.8 MB (respectable)
|
| - 10k: 8.3 GB!
|
| At this point, I gave up. I'm 99% sure trying to go to 1 million
| will crash my computer.
|
| [1]: https://www.php.net/manual/en/intro.parallel.php
|
| [2]:
| https://gist.github.com/withinboredom/8256055709c0e6a8272b3a...
| withinboredom wrote:
| I should add that the good ole' fork approach[1] only uses
| 40.7 MB for 10k, 40.7 MB for 100k, and 40.7 MB for 1 million.
|
| Got to love copy-on-write (COW) memory. :)
|
| [1]:
| https://gist.github.com/withinboredom/c811f3c433904235a5f196...
| bastawhiz wrote:
| Understanding full well that these are pretty artificial
| results, I'm surprised at just how well Node and Python fared
| against Go and Java. For all the default assumptions folks
| have about those languages, they've certainly held their own
| here.
| znpy wrote:
| The Go memory usage can probably be explained by the fact
| that Go uses fairly large stacks for goroutines (according to
| https://go.dev/doc/go1.2#stack_size it's 8 kB per goroutine,
| though that figure is old).
| pkolaczk wrote:
| If that were true, the results would be much worse. The
| average from my measurements is a bit above 2 kB. This
| website also says it is 2 kB:
| https://tpaschalis.me/goroutines-size/
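|
| You can sanity-check it with runtime.MemStats; a rough sketch
| (Sys also counts scheduler bookkeeping, not just stacks):
|
|         package main
|
|         import (
|             "fmt"
|             "runtime"
|         )
|
|         func main() {
|             const n = 1_000_000
|             var before, after runtime.MemStats
|             runtime.ReadMemStats(&before)
|             block := make(chan struct{})
|             for i := 0; i < n; i++ {
|                 go func() { <-block }() // park each goroutine
|             }
|             runtime.ReadMemStats(&after)
|             fmt.Printf("~%d bytes per goroutine\n",
|                 (after.Sys-before.Sys)/n)
|             close(block)
|         }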
| amscanne wrote:
| Stacks changed a lot in Go after that. I believe that in 1.2,
| segmented stacks were still being used. (They are now
| contiguous, and copied when they need to grow.)
| ilyt wrote:
| It also runs _actually in parallel_, unlike the Rust and JS
| ones.
| guenthert wrote:
| If those 'tasks' can be expressed as state machines, then I'd
| think log2(<number of states>) * 1e6 bits would be the lower
| limit (e.g. with 2 states per task, that's 1 bit each, so 1M
| tasks fit in ~125 kB).
___________________________________________________________________
(page generated 2023-05-21 23:00 UTC)