[HN Gopher] Goroutines are not significantly lighter than threads
       ___________________________________________________________________
        
       Goroutines are not significantly lighter than threads
        
       Author : todsacerdoti
       Score  : 83 points
       Date   : 2021-03-12 20:12 UTC (2 hours ago)
        
 (HTM) web link (matklad.github.io)
 (TXT) w3m dump (matklad.github.io)
        
       | anonuser123456 wrote:
       | >If the application does 10k concurrent things, 100mb might be
       | negligible.
       | 
        | When you think about memory in quantities of 1, 10, 100, or
        | 1000 million units, a few hundred MB makes a big difference
        | (e.g. Android devices).
        
         | eeZah7Ux wrote:
         | On an android device you don't need 10k coroutines.
         | 
          | And on servers - people implemented web servers like Apache 25
          | years ago without coroutines, and they ran with very limited
          | RAM.
        
         | hermanradtke wrote:
          | Hence the word "might". If it is not negligible, then one needs
          | to opt for async IO at the cost of more complexity.
        
         | a2g4rAVy wrote:
         | Developers don't care about memory on consumer devices. Look at
         | any modern website.
        
       | wmf wrote:
       | This has been known since the days of NPTL and the O(1)
       | scheduler. I guess it has to be relearned every decade or so.
        
         | jeffbee wrote:
          | Known but not well-known. There are many false beliefs out
          | there. One of the most durable is the belief that a thread
          | commits its maximum stack size on creation, which isn't true on
          | Linux: stack pages are only faulted in as they are touched.
        
       | rufius wrote:
       | For giggles, would've liked to have seen Erlang/Elixir in the
       | mix.
        
         | thesz wrote:
          | And Haskell. Haskell's green-thread overhead is half of
          | Erlang's.
        
           | whateveracct wrote:
            | Haskell's concurrency blows Go's out of the water.
        
       | minaguib wrote:
       | Good start but very simplistic
       | 
        | What happens when these threads or goroutines start doing real
        | work (CPU, IO) concurrently?
        
       | platinumrad wrote:
       | It's curious how people associated with the Rust project will
       | vocally complain about unfair/shallow criticism then turn around
       | and post something like this.
        
         | dang wrote:
         | Please don't post in the flamewar style to HN. It leads to
         | flamewars, which are predictable, nasty, and dumb.
         | 
         | https://news.ycombinator.com/newsguidelines.html
        
         | tln wrote:
         | Out of curiosity, how is this unfair/shallow? It's not obvious
         | to me
        
           | platinumrad wrote:
           | Given the domains where Go is most likely to be used, any
           | comparison of goroutines vs threads should include the
           | netpoller vs blocking socket i/o.
        
             | derefr wrote:
             | Why would using threads imply using [thread-]blocking
             | socket IO?
             | 
             | Most HLL runtimes (that you'd ever want to use for writing
             | a server) have some async IO completion mechanism using
             | scheduler-reserved "async IO waiter" threads, exactly akin
             | to netpoll. (Or they plug into the OS in a less-portable
             | way that avoids blocking IO syscalls to begin with, like
             | Node's libuv.)
             | 
             | Or, if the language's runtime doesn't do it for you, then
             | the popular connection-pooling abstractions that the
             | language's server libraries are built on top of do it for
             | you, by spawning their own AIO completion threads. Jetty's
             | worker pool in Java, Tokio in Rust, etc.
             | 
             | Honestly, who's out there writing code in 2021 where there
             | are worker POSIX threads, and those threads are (directly
             | or indirectly) calling read(2)?
        
               | matklad wrote:
               | I do :)
               | 
                | I don't write web services, so I have a small, O(1),
                | number of things to communicate with, for which threads
                | with blocking reads work OK. They are not perfect, as
                | there is no cancellation, but working around the lack of
                | cancellation in a couple of places is less costly than
                | taking a dependency on the relatively fresh Rust async
                | ecosystem.
        
             | matklad wrote:
              | I don't think every benchmark should be a benchmark of
              | everything. Targeted microbenchmarks are valuable if they
              | are explicit about their scope of applicability.
             | 
             | The post is very direct about looking only at the memory
             | usage, and literally says that using the results to reason
             | about overall performance would be wrong.
             | 
             | I see how the article could be read as "goroutines bad,
             | threads good", as it doesn't go to extraordinary lengths to
             | prevent that. But I prefer to cater to a careful reader,
             | and not add loud disclaimers repeatedly.
        
         | geodel wrote:
          | Exactly! Just a few days ago there was criticism of Rust's
          | async story. It may well have been wrong on some technical
          | details, but Rust folks took it personally and called it an
          | attack on the folks who developed the async system for Rust.
          | This shows, time and again, that they really hate any criticism
          | not approved by Rust cultists.
        
           | dang wrote:
           | Please don't post flamewar comments to HN. It's not what this
           | site is for.
           | 
           | https://news.ycombinator.com/newsguidelines.html
        
         | [deleted]
        
       | gbrown_ wrote:
       | > The most commonly cited drawback of OS-level threads is that
       | they use a lot of RAM. This is not true on Linux.
       | 
        | I was under the impression the main cost was context switching
        | rather than memory usage, and I haven't seen much about that when
        | these models are compared.
        
         | zlynx wrote:
         | Goroutines had excellent context switching back when Go
         | preferred to run in a single thread. Version 1.3 or so?
         | 
         | Now that Go defaults to multiple threads it has lost all of
         | that advantage. Goroutine switching and channel send/receive
         | has to apply all of the locking that any multithreaded program
         | has to use.
        
       | [deleted]
        
       | smasher164 wrote:
       | I thought this article would talk about overcommit and demand
       | paging, but it just counted the RSS between two processes and
       | called it a day.
       | 
       | They don't even attempt to distinguish the size of the stack from
       | the rest of the runtime.
        
         | echlebek wrote:
         | With MADV_FREE behaviour, RSS isn't even a useful measurement.
        
       | nmca wrote:
       | 3x seems significant?
        
         | gnabgib wrote:
          | It does; the article's results contradict its title. I'm not
          | sure that "lightweight" can be captured purely by memory usage
          | either.
        
         | fulafel wrote:
         | To argue in favour of the article: if you do real work in the
         | coroutine / thread, there is going to be much more data in the
         | work context than the stack and the memory usage difference is
         | likely to be negligible.
        
         | eeZah7Ux wrote:
         | No, the article correctly points out:
         | 
         | > A thread is only 3 times as large as a goroutine. Absolute
         | numbers are also significant: 10k threads require only 100
         | megabytes of overhead. If the application does 10k concurrent
         | things, 100mb might be negligible.
         | 
         | If it's still not clear: if your application has a good reason
         | to run 10k parallel tasks it's most likely doing something
         | complex that requires plenty of RAM.
         | 
         | It's very unlikely that you really have to save those 100MB of
         | RAM and at the same time you cannot rethink the architecture to
         | stop using this level of parallelism.
         | 
         | And even so, that would justify using a coroutine library, not
         | a whole programming language.
        
         | coder543 wrote:
         | Not only that, but the author takes a single data point and
         | attempts to draw a line.
         | 
         | Without measuring 10k, 20k, 30k of each and seeing how the
         | memory usage changes, we are seriously comparing apples to
         | oranges here.
         | 
         | Go's allocator will hold onto a decent chunk of memory just to
         | amortize the cost of reaching out to the operating system. Just
         | asking the operating system how much memory the Go application
         | is using doesn't tell the whole story in this benchmark.
         | 
          | Is it 3x or 30x? The author _certainly_ doesn't know based on
          | a single data point from each application.
         | 
         | Goroutines are more memory efficient than full OS threads, but
         | they're also cheaper to spawn and the Go runtime can optimize
         | for certain common use cases very effectively.
         | 
         | I've also personally had serious difficulty in the past with
         | getting Linux distros to let me launch tens of thousands of OS
         | threads. Long before I run out of memory, I hit various limits
         | that block the program from spawning additional threads. Some
         | of them can be adjusted, but I never managed to spawn threads
         | arbitrarily up to the amount of memory that was available...
         | probably just my own failing, but that alone is reason enough
         | not to spawn an unbounded number of OS threads.
        
           | labawi wrote:
           | There was a related post about a month ago[1].
           | 
            | I believe the article fails to account for, among other
            | things, the virtual-memory overhead that comes with sparse
            | allocations for large maximum stack sizes, which would about
            | double the numbers. With 2MB or so of stack spacing, even if
            | you only use a single page of RAM per thread, you still need
            | another whole page of RAM in the "page table" (actually a
            | trie), and on Linux those pages are not counted in the
            | process's RES memory usage.
           | 
           | That being said, creating thousands of native threads has
           | lots of other pitfalls, including stack ones - would not
           | recommend as a general strategy.
           | 
            | I haven't tried creating actual threads, but for test-mapping
            | stack-like staggered memory, I've had success with:
            | 
            |     sysctl vm.overcommit_memory=1
            |     sysctl vm.max_map_count=10000000
            |     swapoff -a
           | 
           | Remember to keep an eye on free memory, not process RES or
           | similar, that doesn't count page tables and other overhead.
           | 
           | [1] https://news.ycombinator.com/item?id=25997506
        
             | matklad wrote:
             | Thanks, RSS not counting memory for page tables themselves
             | is a good point, I will correct that in the article
             | tomorrow.
        
       | mplewis wrote:
       | Even if Goroutines were significantly heavier than threads, I
       | would prefer them for many reasons.
        
         | twic wrote:
         | Such as?
        
           | adonovan wrote:
           | Faster context switches. Vastly larger recursion depth before
           | stack overflow.
        
             | twic wrote:
             | How often is recursion depth a constraint on your
             | programming?
        
           | b0sk wrote:
            | Easier to reason about, easier to communicate. Threads are a
            | low-level primitive; goroutines seem to be designed with
            | concurrency in mind.
        
             | monocasa wrote:
              | It's the channels that make goroutines easier to reason
              | about, not the green threads. There are plenty of channel
              | implementations on top of native threads.
        
               | dilyevsky wrote:
               | Example of c++ channel library that is easy to use? The
               | only one I've seen was packaged with Google's internal
               | cpp fiber library that depended on custom linux kernel
               | patches.
        
               | monocasa wrote:
               | I haven't done much actor model programming in C++ in
               | nonproprietary envs so I'm not sure there.
               | 
               | But Rust's mpsc package is a good public example of the
               | concept off the top of my head running on native threads.
               | They're really not that complicated and a c++ version
               | would be on the order of a few hundred lines.
               | 
               | I had been using the concept since the 90s, as RTOSes
               | really love it. Pretty much every N64 game uses native
               | threads communicating via "software FIFOs".
        
           | pantulis wrote:
            | I'd say they make code easier to write and to reason about
            | when compared with, say, POSIX threads.
        
             | twic wrote:
             | How come? How are goroutines different from real threads
             | such that they're easier to reason about?
        
               | matklad wrote:
               | In theory, they are equivalent. In practice, POSIX
               | blocking IO does create problems around cancellation and
               | selectability. For example, there's no good way to cancel
               | an outstanding blocking read call, and you might need to
               | do weird hops to work around that.
               | 
               | If you do goroutines, you need to redo all IO yourself
               | anyway, and that's a good opportunity to implement things
               | like universal support for cancellation and timeouts.
        
           | tmpz22 wrote:
           | You're being a massive contrarian right now - asking others
           | for supporting evidence while providing none to support your
           | own position.
        
       | hedora wrote:
       | I'd be curious to know how C++20's stackless coroutines compare
       | on memory usage.
        
         | jbandela1 wrote:
         | C++ stackless co-routines are pretty lightweight.
         | 
          | A few years back, when I was writing something for a
          | presentation, I spun up 50 million coroutines, arranged them in
          | a ring with channels, and passed a value all the way around the
          | ring, on a Windows laptop with 8GB of RAM.
         | 
         | I don't recall the exact amount of memory used, but my laptop
         | handled it really well.
        
       | tapirl wrote:
       | > A thread is only 3 times (memory consuming) as large as a
       | goroutine.
       | 
        | Currently, the official implementation allocates a 2 KB stack
        | for each new goroutine. The initial stack could be much smaller
        | in theory.
        
         | labawi wrote:
          | Aside from the significant memory use from uncounted overhead
          | for native threads, the fun starts when your threads actually
          | start using more than a single page of stack, calling into who-
          | knows-what, instead of just sleeping.
          | 
          | Typical programs will transparently grow the stack on demand,
          | both with native threads (memory not mapped until used) and
          | with goroutines / green threads.
          | 
          | Unlike the green threads of typical runtimes, I don't know of a
          | native threading implementation that releases no-longer-used
          | stack space before, I guess, the thread is destroyed, so
          | sporadic large stack use can blow up your memory.
        
       | bww wrote:
       | I guess this one contrived benchmark proves it. Case closed.
        
       | eptcyka wrote:
        | Yeah, but it's significantly cheaper to switch between
        | goroutines than between threads.
        
         | Ameo wrote:
         | Yeah that's what I was thinking as well. The article leads with
         | "The most commonly cited drawback of OS-level threads is that
         | they use a lot of RAM." and that's not the impression I had at
         | all.
         | 
         | I've always felt that the biggest reasons that threadless
         | concurrency primitives were adopted was to avoid the non-
         | negligible cost of spawning threads and context switching
         | between them.
        
       | jchw wrote:
        | Not only does 3x seem quite significant, but thread context
        | switches also carry a large overhead, and that goes untested
        | here. If thread context-switch overhead were as low as usermode
        | context switching, there would be no use for coroutines, since
        | you could just use threads instead; I doubt it's trivial.
       | 
       | (Of course, in Go, the scheduler also weaves in the GC IIRC, so
       | an apples-apples comparison may be difficult. Micro benchmarks
       | are just not that useful.)
       | 
       | P.S.: this article seems to work under the assumption that 10,000
       | Goroutines is a reasonable upper limit, or at least it feels as
       | though it implies that. However, you can definitely run apps with
       | 100,000 or even 1,000,000.
        
         | refenestrator wrote:
          | I thought the major cost was the cold cache after a switch, in
          | both cases?
          | 
          | Kernel context switches are pretty light compared to the
          | slowdowns from cold caches that follow them.
        
         | jeffbee wrote:
         | The performance of the Go runtime with a million blocked
         | goroutines is pretty OK, but its performance with even 1000
         | runnable goroutines is not great at all. You really need to
         | think about which you are going to have.
        
           | gnfargbl wrote:
           | True, but if you regularly have 1000 runnable goroutines then
           | could you not reconfigure your app to have, say, 64 runnable
           | goroutines, and get better throughput? Large numbers of
           | goroutines do seem to be a good fit for problems which are
           | mostly waiting on the network.
        
             | jeffbee wrote:
             | The architecture of Go forces you to have 1 goroutine
             | servicing every socket, so the number of runnable
             | goroutines will then be at the mercy of your packet inter-
             | arrival process.
        
       | networkimprov wrote:
        | I think there's a bug, tho it might not make a difference to the
        | results:
        | 
        |     for i := 0; i < 10; i++ {
        |         go func() {
        |             f(i) // sees whatever value i has when f() is called, usually 10
        |         }()
        |     }
        
         | thewakalix wrote:
          | For the curious reader: this can be solved by simply passing i
          | as a parameter.
          | 
          |     for i := 0; i < 10; i++ {
          |         go func(i int) {
          |             f(i)
          |         }(i)
          |     }
        
           | networkimprov wrote:
            | Or commonly, tho perhaps confusingly:
            | 
            |     for i := 0; i < 10; i++ {
            |         i := i // new variable i for every iteration
            |         go func() { f(i) }()
            |     }
        
             | Groxx wrote:
             | Yea. I do this personally - it's easy copypasta since it
             | infers types for you, and as it shadows the iteration var
             | it's impossible to use the wrong one. Same trick works for
             | other local vars e.g. outside the loop, though at that
             | point it's just as not-copypasta-able as the
             | `func(arg){}(arg)` construct.
        
         | mykowebhn wrote:
          | This is a very common newbie mistake. Fix by passing in i:
          | 
          |     for i := 0; i < 10; i++ {
          |         go func(i int) {
          |             f(i) // i is now a parameter, evaluated at spawn time
          |         }(i)
          |     }
        
           | thereare5lights wrote:
            | This looks very much like the classic mistake in JavaScript
            | with callbacks in for loops.
        
             | panopticon wrote:
              | Note that using `let` instead of `var` in your for-loop
              | fixes this in JavaScript (ignoring IE's faulty
              | implementation of `let`).
        
           | [deleted]
        
         | matklad wrote:
         | Thanks for the correction, that is indeed a bug!
        
       | The_rationalist wrote:
       | How does Loom threads overhead compare to goroutines and Kotlin
       | coroutines?
        
         | aardvark179 wrote:
          | Well, the costs are subtly different. The size of the virtual
          | thread object is fairly small, but it is not the only cost.
          | Rather than allocating a stack for a virtual thread, we freeze
          | chunks of stack into special objects and thaw them back onto
          | the stack of the OS thread (we can do this because we know
          | there are no pointers into Java stack frames). This means that
          | the cost can be very small, but you may place more load on the
          | GC if a collection occurs between the freeze and the thaw,
          | which causes this stack object to be moved.
         | 
          | [edit] I should mention that there are other potential
          | overheads if you use things like thread-local variables, as
          | these may require more work from the GC to collect. We are
          | working on a new mechanism which should be better in this
          | regard and provide a better API for many of the uses of thread
          | locals.
        
       ___________________________________________________________________
       (page generated 2021-03-12 23:01 UTC)