[HN Gopher] Goroutines are not significantly lighter than threads
___________________________________________________________________
Goroutines are not significantly lighter than threads
Author : todsacerdoti
Score : 83 points
Date : 2021-03-12 20:12 UTC (2 hours ago)
(HTM) web link (matklad.github.io)
(TXT) w3m dump (matklad.github.io)
| anonuser123456 wrote:
| >If the application does 10k concurrent things, 100mb might be
| negligible.
|
| When you think about memory in quantities of 1, 10, 100,
| 1000 million units, a few hundred MB makes a big difference
| (e.g. Android devices).
| eeZah7Ux wrote:
| On an Android device you don't need 10k coroutines.
|
| And on servers - people implemented web servers like Apache
| 25 years ago without coroutines, and they ran with very
| limited RAM.
| hermanradtke wrote:
| Hence the word "might". If it is not negligible, then one
| needs to opt for some form of async IO at the cost of more
| complexity.
| a2g4rAVy wrote:
| Developers don't care about memory on consumer devices. Look at
| any modern website.
| wmf wrote:
| This has been known since the days of NPTL and the O(1)
| scheduler. I guess it has to be relearned every decade or so.
| jeffbee wrote:
| Known but not well-known. There are many false beliefs out
| there. One of the most durable is the incorrect belief that a
| thread incurs the maximum stack size on creation, which isn't
| true on Linux.
| rufius wrote:
| For giggles, I would've liked to see Erlang/Elixir in the
| mix.
| thesz wrote:
| And Haskell. Haskell's green thread overhead is half of
| Erlang's.
| whateveracct wrote:
| Haskell's concurrency blows Go's out of the water.
| minaguib wrote:
| Good start but very simplistic
|
| What happens when these threads or goroutines start doing real
| work (CPU, IO) concurrently?
| platinumrad wrote:
| It's curious how people associated with the Rust project will
| vocally complain about unfair/shallow criticism then turn around
| and post something like this.
| dang wrote:
| Please don't post in the flamewar style to HN. It leads to
| flamewars, which are predictable, nasty, and dumb.
|
| https://news.ycombinator.com/newsguidelines.html
| tln wrote:
| Out of curiosity, how is this unfair/shallow? It's not obvious
| to me
| platinumrad wrote:
| Given the domains where Go is most likely to be used, any
| comparison of goroutines vs threads should include the
| netpoller vs blocking socket i/o.
| derefr wrote:
| Why would using threads imply using [thread-]blocking
| socket IO?
|
| Most HLL runtimes (that you'd ever want to use for writing
| a server) have some async IO completion mechanism using
| scheduler-reserved "async IO waiter" threads, exactly akin
| to netpoll. (Or they plug into the OS in a less-portable
| way that avoids blocking IO syscalls to begin with, like
| Node's libuv.)
|
| Or, if the language's runtime doesn't do it for you, then
| the popular connection-pooling abstractions that the
| language's server libraries are built on top of do it for
| you, by spawning their own AIO completion threads. Jetty's
| worker pool in Java, Tokio in Rust, etc.
|
| Honestly, who's out there writing code in 2021 where there
| are worker POSIX threads, and those threads are (directly
| or indirectly) calling read(2)?
| matklad wrote:
| I do :)
|
| I don't write web services, so I have a small, O(1) number
| of things to communicate with, for which threads with
| blocking reads work OK. They are not perfect, as there is
| no cancellation, but working around the lack of cancellation
| in a couple of places is less costly than taking a dependency
| on the relatively fresh Rust async ecosystem.
| matklad wrote:
| I don't think every benchmark should be a benchmark of
| everything. Targeted micro benchmarks are valuable, if they
| are explicit about their applicability scope.
|
| The post is very direct about looking only at the memory
| usage, and literally says that using the results to reason
| about overall performance would be wrong.
|
| I see how the article could be read as "goroutines bad,
| threads good", as it doesn't go to extraordinary lengths to
| prevent that. But I prefer to cater to a careful reader,
| and not add loud disclaimers repeatedly.
| geodel wrote:
| Exactly! Just a few days ago there was criticism of Rust's
| async story. It may well have been wrong on some technical
| details, but Rust folks took it so personally and called it
| an attack on the folks who developed the async system for
| Rust. Time and again this shows they really hate any
| criticism which is not approved by Rust cultists.
| dang wrote:
| Please don't post flamewar comments to HN. It's not what this
| site is for.
|
| https://news.ycombinator.com/newsguidelines.html
| [deleted]
| gbrown_ wrote:
| > The most commonly cited drawback of OS-level threads is that
| they use a lot of RAM. This is not true on Linux.
|
| I was under the impression the commonly cited drawback was
| context switching rather than memory usage, which I haven't
| seen much written about when these models are compared.
| zlynx wrote:
| Goroutines had excellent context switching back when Go
| preferred to run in a single thread. Version 1.3 or so?
|
| Now that Go defaults to multiple threads it has lost all of
| that advantage. Goroutine switching and channel send/receive
| have to apply all of the locking that any multithreaded
| program has to use.
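|
| A quick way to see this cost for yourself is a channel ping-
| pong microbenchmark (a minimal sketch; the GOMAXPROCS setting
| and iteration count are arbitrary choices, and the result
| mixes scheduler and channel overhead rather than isolating
| either):
|
|   package main
|
|   import (
|       "fmt"
|       "runtime"
|       "time"
|   )
|
|   func main() {
|       runtime.GOMAXPROCS(1) // rerun without this line to compare
|       ping, pong := make(chan int), make(chan int)
|       go func() {
|           for v := range ping {
|               pong <- v
|           }
|       }()
|       const n = 1_000_000
|       start := time.Now()
|       for i := 0; i < n; i++ {
|           ping <- i
|           <-pong
|       }
|       fmt.Printf("%v per send/receive round trip\n",
|           time.Since(start)/n)
|   }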
| [deleted]
| smasher164 wrote:
| I thought this article would talk about overcommit and demand
| paging, but it just counted the RSS between two processes and
| called it a day.
|
| They don't even attempt to distinguish the size of the stack from
| the rest of the runtime.
| echlebek wrote:
| With MADV_FREE behaviour, RSS isn't even a useful measurement.
| nmca wrote:
| 3x seems significant?
| gnabgib wrote:
| It does, the article and results contradict the title. I'm
| not sure that "lightweight" can be captured purely by memory
| usage either.
| fulafel wrote:
| To argue in favour of the article: if you do real work in the
| coroutine / thread, there is going to be much more data in
| the work context than in the stack, and the memory usage
| difference is likely to be negligible.
| eeZah7Ux wrote:
| No, the article correctly points out:
|
| > A thread is only 3 times as large as a goroutine. Absolute
| numbers are also significant: 10k threads require only 100
| megabytes of overhead. If the application does 10k concurrent
| things, 100mb might be negligible.
|
| If it's still not clear: if your application has a good reason
| to run 10k parallel tasks it's most likely doing something
| complex that requires plenty of RAM.
|
| It's very unlikely that you really have to save those 100MB of
| RAM and at the same time you cannot rethink the architecture to
| stop using this level of parallelism.
|
| And even so, that would justify using a coroutine library, not
| a whole programming language.
| coder543 wrote:
| Not only that, but the author takes a single data point and
| attempts to draw a line.
|
| Without measuring 10k, 20k, 30k of each and seeing how the
| memory usage changes, we are seriously comparing apples to
| oranges here.
|
| Go's allocator will hold onto a decent chunk of memory just to
| amortize the cost of reaching out to the operating system. Just
| asking the operating system how much memory the Go application
| is using doesn't tell the whole story in this benchmark.
|
| Is it 3x or 30x? The author _certainly_ doesn't know based on
| taking a single data point from each application.
|
| Goroutines are more memory efficient than full OS threads, but
| they're also cheaper to spawn and the Go runtime can optimize
| for certain common use cases very effectively.
|
| I've also personally had serious difficulty in the past with
| getting Linux distros to let me launch tens of thousands of OS
| threads. Long before I run out of memory, I hit various limits
| that block the program from spawning additional threads. Some
| of them can be adjusted, but I never managed to spawn threads
| arbitrarily up to the amount of memory that was available...
| probably just my own failing, but that alone is reason enough
| not to spawn an unbounded number of OS threads.
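|
| (A minimal sketch of the kind of measurement I mean, assuming
| parked-forever goroutines as the workload; note that
| runtime.MemStats reports only what the Go runtime obtained
| from the OS, so it still isn't the whole-process picture:)
|
|   package main
|
|   import (
|       "fmt"
|       "runtime"
|   )
|
|   func main() {
|       block := make(chan struct{}) // goroutines park here forever
|       for n := 10_000; n <= 50_000; n += 10_000 {
|           for i := 0; i < 10_000; i++ {
|               go func() { <-block }()
|           }
|           runtime.GC() // settle the heap before measuring
|           var m runtime.MemStats
|           runtime.ReadMemStats(&m)
|           // Diff successive lines to estimate the marginal
|           // cost per goroutine.
|           fmt.Printf("%6d goroutines: %4d MiB from OS\n",
|               n, m.Sys>>20)
|       }
|   }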
| labawi wrote:
| There was a related post about a month ago[1].
|
| I believe the article fails to account for, among other
| things, the virtual memory overhead that comes with sparse
| allocations for large maximum stack sizes, which would
| roughly double the numbers. With 2MB or so stack spacing, if
| you only use a single page of RAM per thread, you still need
| another whole page of RAM in the "page table" (actually a
| trie), and on Linux those are not counted in the process RES
| memory usage.
|
| That being said, creating thousands of native threads has
| lots of other pitfalls, including stack-related ones - I
| would not recommend it as a general strategy.
|
| I haven't tried creating actual threads, but for test-mapping
| stack-like staggered memory, I've had success with:
|
|   sysctl vm.overcommit_memory=1
|   sysctl vm.max_map_count=10000000
|   swapoff -a
|
| Remember to keep an eye on free memory, not process RES or
| similar, which doesn't count page tables and other overhead.
|
| [1] https://news.ycombinator.com/item?id=25997506
| matklad wrote:
| Thanks, RSS not counting memory for page tables themselves
| is a good point, I will correct that in the article
| tomorrow.
| mplewis wrote:
| Even if Goroutines were significantly heavier than threads, I
| would prefer them for many reasons.
| twic wrote:
| Such as?
| adonovan wrote:
| Faster context switches. Vastly larger recursion depth before
| stack overflow.
| twic wrote:
| How often is recursion depth a constraint on your
| programming?
| b0sk wrote:
| Easier to reason about, easier to communicate. Threads are a
| low-level primitive; goroutines seem to be designed with
| concurrency in mind.
| monocasa wrote:
| It's the channels that make goroutines easier to reason
| about, not the green threads. There are plenty of channel
| implementations on native threads.
| dilyevsky wrote:
| Do you have an example of a C++ channel library that is easy
| to use? The only one I've seen was packaged with Google's
| internal C++ fiber library, which depended on custom Linux
| kernel patches.
| monocasa wrote:
| I haven't done much actor model programming in C++ in
| nonproprietary envs so I'm not sure there.
|
| But Rust's mpsc package is a good public example of the
| concept off the top of my head running on native threads.
| They're really not that complicated, and a C++ version
| would be on the order of a few hundred lines.
|
| I had been using the concept since the 90s, as RTOSes
| really love it. Pretty much every N64 game uses native
| threads communicating via "software FIFOs".
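|
| To make the "not that complicated" claim concrete, here is a
| minimal sketch in Go of a bounded channel built only on a
| mutex and a condition variable, i.e. primitives any native-
| thread runtime has (the names and the int element type are
| mine; the same pattern maps directly onto pthreads):
|
|   package main
|
|   import (
|       "fmt"
|       "sync"
|   )
|
|   type Chan struct {
|       mu   sync.Mutex
|       cond *sync.Cond
|       buf  []int
|       max  int
|   }
|
|   func NewChan(capacity int) *Chan {
|       c := &Chan{max: capacity}
|       c.cond = sync.NewCond(&c.mu)
|       return c
|   }
|
|   func (c *Chan) Send(v int) {
|       c.mu.Lock()
|       defer c.mu.Unlock()
|       for len(c.buf) == c.max { // block while full
|           c.cond.Wait()
|       }
|       c.buf = append(c.buf, v)
|       c.cond.Broadcast() // wake any blocked receiver
|   }
|
|   func (c *Chan) Recv() int {
|       c.mu.Lock()
|       defer c.mu.Unlock()
|       for len(c.buf) == 0 { // block while empty
|           c.cond.Wait()
|       }
|       v := c.buf[0]
|       c.buf = c.buf[1:]
|       c.cond.Broadcast() // wake any blocked sender
|       return v
|   }
|
|   func main() {
|       ch := NewChan(1)
|       go func() { ch.Send(42) }()
|       fmt.Println(ch.Recv()) // prints 42
|   }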
| pantulis wrote:
| I'd say that they are easier to write code for and to reason
| about when compared with, say, POSIX threads.
| twic wrote:
| How come? How are goroutines different from real threads
| such that they're easier to reason about?
| matklad wrote:
| In theory, they are equivalent. In practice, POSIX
| blocking IO does create problems around cancellation and
| selectability. For example, there's no good way to cancel
| an outstanding blocking read call, and you might need to
| jump through weird hoops to work around that.
|
| If you do goroutines, you need to redo all IO yourself
| anyway, and that's a good opportunity to implement things
| like universal support for cancellation and timeouts.
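|
| For illustration, here is a minimal sketch of what that can
| look like in Go, using the real net.Conn deadline API (the
| readWithCancel helper itself is hypothetical, and a production
| version would also reset the deadline afterwards):
|
|   package main
|
|   import (
|       "context"
|       "fmt"
|       "net"
|       "time"
|   )
|
|   // readWithCancel unblocks a pending Read when ctx is
|   // cancelled by poking the connection's read deadline -
|   // plain POSIX read(2) has no portable equivalent.
|   func readWithCancel(ctx context.Context, conn net.Conn,
|       buf []byte) (int, error) {
|       done := make(chan struct{})
|       go func() {
|           select {
|           case <-ctx.Done():
|               // Force any in-flight Read to return now.
|               _ = conn.SetReadDeadline(time.Now())
|           case <-done:
|           }
|       }()
|       n, err := conn.Read(buf)
|       close(done)
|       if ctx.Err() != nil {
|           return n, ctx.Err()
|       }
|       return n, err
|   }
|
|   func main() {
|       client, server := net.Pipe() // in-memory net.Conn pair
|       defer client.Close()
|       defer server.Close()
|       ctx, cancel := context.WithTimeout(
|           context.Background(), 100*time.Millisecond)
|       defer cancel()
|       // Nothing ever writes to server, so this Read would
|       // block forever without the cancellation helper.
|       _, err := readWithCancel(ctx, server, make([]byte, 1))
|       fmt.Println(err) // context deadline exceeded
|   }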
| tmpz22 wrote:
| You're being a massive contrarian right now - asking others
| for supporting evidence while providing none to support your
| own position.
| hedora wrote:
| I'd be curious to know how C++20's stackless coroutines compare
| on memory usage.
| jbandela1 wrote:
| C++ stackless co-routines are pretty lightweight.
|
| A few years back, when I was writing something for a
| presentation, I spun up 50 million coroutines, arranged them
| in a ring with channels, and passed a value all the way
| around from each coroutine to the next, on a Windows laptop
| with 8 GB of RAM.
|
| I don't recall the exact amount of memory used, but my laptop
| handled it really well.
| tapirl wrote:
| > A thread is only 3 times (memory consuming) as large as a
| goroutine.
|
| Currently, the official implementation allocates a 2 KB stack
| for each new goroutine. The initial stack could be much
| smaller in theory.
| labawi wrote:
| Aside from the significant memory use from uncounted overhead
| for native threads, the fun starts when your threads start
| actually using more than a single page of stack, calling into
| who-knows-what, instead of just sleeping.
|
| Typical programs will transparently grow stack on demand, both
| with native threads (memory not mapped until used) and
| goroutines / green threads.
|
| Unlike the green threads of typical runtimes, I don't know
| of a native threading implementation that releases no-longer-
| used stack space until, I guess, the thread is destroyed, so
| sporadic large stack use can blow up your memory.
| bww wrote:
| I guess this one contrived benchmark proves it. Case closed.
| eptcyka wrote:
| Yeah, but it's significantly cheaper to switch between
| goroutines than between threads.
| Ameo wrote:
| Yeah that's what I was thinking as well. The article leads with
| "The most commonly cited drawback of OS-level threads is that
| they use a lot of RAM." and that's not the impression I had at
| all.
|
| I've always felt that the biggest reason threadless
| concurrency primitives were adopted was to avoid the non-
| negligible cost of spawning threads and of context switching
| between them.
| jchw wrote:
| Not only does 3x seem quite significant, but thread context
| switches are also a large overhead and that goes untested here.
| If thread context switch overhead were as low as usermode
| context switching, there would be no use for coroutines since
| you could just use threads instead; I doubt it's trivial.
|
| (Of course, in Go, the scheduler also weaves in the GC IIRC, so
| an apples-apples comparison may be difficult. Micro benchmarks
| are just not that useful.)
|
| P.S.: this article seems to work under the assumption that 10,000
| Goroutines is a reasonable upper limit, or at least it feels as
| though it implies that. However, you can definitely run apps with
| 100,000 or even 1,000,000.
| refenestrator wrote:
| I thought the major cost was cold cache after switch in both
| cases?
|
| Kernel context switches are pretty light compared to the
| slowdowns that follow from cold caches.
| jeffbee wrote:
| The performance of the Go runtime with a million blocked
| goroutines is pretty OK, but its performance with even 1000
| runnable goroutines is not great at all. You really need to
| think about which you are going to have.
| gnfargbl wrote:
| True, but if you regularly have 1000 runnable goroutines then
| could you not reconfigure your app to have, say, 64 runnable
| goroutines, and get better throughput? Large numbers of
| goroutines do seem to be a good fit for problems which are
| mostly waiting on the network.
| jeffbee wrote:
| The architecture of Go forces you to have 1 goroutine
| servicing every socket, so the number of runnable
| goroutines will then be at the mercy of your packet inter-
| arrival process.
| networkimprov wrote:
| I think there's a bug, tho it might not make a difference to the
| results:
|
|   for i := 0; i < 10; i++ {
|       go func() {
|           f(i) // sees whatever value i has when f() is called, usually 10
|       }()
|   }
| thewakalix wrote:
| For the curious reader: this can be solved by simply passing i
| as a parameter.
|
|   for i := 0; i < 10; i++ {
|       go func(i int) { f(i) }(i)
|   }
| networkimprov wrote:
| Or commonly, tho perhaps confusingly:
|
|   for i := 0; i < 10; i++ {
|       i := i // new variable i for every iteration
|       go func() { f(i) }()
|   }
| Groxx wrote:
| Yea. I do this personally - it's easy copypasta since it
| infers types for you, and as it shadows the iteration var
| it's impossible to use the wrong one. Same trick works for
| other local vars e.g. outside the loop, though at that
| point it's just as not-copypasta-able as the
| `func(arg){}(arg)` construct.
| mykowebhn wrote:
| This is a very common newbie mistake. Fix by passing in i.
|
|   for i := 0; i < 10; i++ {
|       go func(i int) {
|           f(i) // sees the value i had when the goroutine was spawned
|       }(i)
|   }
| thereare5lights wrote:
| This looks very much like the mistake in JavaScript with
| callbacks in for loops.
| panopticon wrote:
| Note that using `let` instead of `var` in your for-loop
| fixes this in JavaScript (ignoring IE's faulty
| implementation of `let`).
| [deleted]
| matklad wrote:
| Thanks for the correction, that is indeed a bug!
| The_rationalist wrote:
| How does the overhead of Loom's virtual threads compare to
| goroutines and Kotlin coroutines?
| aardvark179 wrote:
| Well, the costs are subtly different. The size of the virtual
| thread object is fairly small, but it is not the only cost.
| Rather than allocating a stack for a virtual thread, we
| freeze chunks of stack into special objects and thaw them
| back onto the stack of the OS thread (we can do this because
| we know there are no pointers into Java stack frames). This
| means that the cost can be very small, but you may place more
| of a load on the GC if a collection occurs between the freeze
| and the thaw which causes this stack object to be moved.
|
| [edit] I should mention that there are other potential
| overheads if you use things like thread-local variables, as
| these may require more work from the GC to be collected. We
| are working on a new mechanism which should be better in this
| regard and be a better API for many of the uses of thread
| locals.
___________________________________________________________________
(page generated 2021-03-12 23:01 UTC)