[HN Gopher] Goroutines Under the Hood (2020)
___________________________________________________________________
Goroutines Under the Hood (2020)
Author : s3micolon0
Score : 107 points
Date : 2022-06-03 12:21 UTC (2 days ago)
(HTM) web link (osmh.dev)
(TXT) w3m dump (osmh.dev)
| rounakdatta wrote:
| While the blog is a great introductory post,
| https://www.youtube.com/watch?v=KBZlN0izeiY is a great watch if
| you're interested in the magical optimizations in goroutine
| scheduling.
| qwertox wrote:
| I've been in love with Python's asyncio for some time now, but
| I know that Go has coroutines integrated as a first-class
| citizen.
|
| This article (which I have not read but just skimmed) made me
| search for a simple example, and I landed at "A Tour of Go -
| Goroutines"[0]
|
| That is one of the cleanest examples I've ever seen on this
| topic, and it shows just how well integrated they are in the
| language.
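|
| For reference, the Tour's example is, from memory, roughly the
| following - the only concurrency syntax is the "go" keyword in
| front of an ordinary function call:
|
|     package main
|
|     import (
|         "fmt"
|         "time"
|     )
|
|     func say(s string) {
|         for i := 0; i < 5; i++ {
|             time.Sleep(100 * time.Millisecond)
|             fmt.Println(s)
|         }
|     }
|
|     func main() {
|         go say("world") // runs concurrently with the call below
|         say("hello")
|     }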
|
| [0] https://go.dev/tour/concurrency/1
| throwaway894345 wrote:
| Having used both in production for many years, Go's model is
| waayyyy better, mostly because Python's model results in a
| bunch of runtime bugs.
|
| The least serious are type errors like forgetting to await an
| async function--these can be caught with a type checker
| (although this means you need a type checker running in your CI
| and annotations for all of your dependencies).
|
| The most serious are the ones where someone calls a sync or
| CPU-heavy function (directly or transitively) and it starves
| the event loop causing timeouts in unrelated endpoints and
| eventually bringing down the entire application (load shedding
| can help mitigate this somewhat). Go dodges these problems by
| not having sync functions at all (everything is async under the
| covers) and parallelism means CPU-bound workloads don't block
| the whole event loop.
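|
| As a minimal, contrived sketch of that last point: in Go, a
| CPU-bound goroutine doesn't stall everything else, both because
| other goroutines run in parallel on other threads and because
| the runtime (since 1.14) can preempt it asynchronously:
|
|     package main
|
|     import (
|         "fmt"
|         "time"
|     )
|
|     func main() {
|         // CPU-bound work with no yields; in an asyncio-style
|         // event loop this would starve every other task.
|         go func() {
|             for { // spin forever, no function calls
|             }
|         }()
|
|         // Other goroutines keep making progress anyway.
|         for i := 0; i < 3; i++ {
|             time.Sleep(100 * time.Millisecond)
|             fmt.Println("still responsive:", i)
|         }
|     }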
| aaronbwebber wrote:
| I love Go and goroutines, but...
|
| > A newly minted goroutine is given a few kilobytes
|
| A line later:
|
| > It is practical to create hundreds of thousands of goroutines
| in the same address space
|
| So it's not practical to create 100s of Ks of goroutines - it's
| possible, sure, but the GBs of memory overhead you incur by
| actually creating that many goroutines mean that for any
| practical problem you are going to want to stick to a few
| thousand goroutines. I can almost guarantee you that you have
| something better to do with those GBs of memory than store
| goroutine stacks.
|
| Asking the scheduler to handle 100s of Ks of goroutines is not
| a great idea in my experience either.
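|
| If you want to check the actual per-goroutine overhead on your
| own setup, here's a rough sketch (not a rigorous benchmark;
| numbers vary by Go version and platform):
|
|     package main
|
|     import (
|         "fmt"
|         "runtime"
|         "sync"
|     )
|
|     func main() {
|         var before, after runtime.MemStats
|         runtime.GC()
|         runtime.ReadMemStats(&before)
|
|         const n = 100_000
|         done := make(chan struct{})
|         var wg sync.WaitGroup
|         wg.Add(n)
|         for i := 0; i < n; i++ {
|             go func() {
|                 wg.Done()
|                 <-done // park so the stack stays live
|             }()
|         }
|         wg.Wait()
|
|         runtime.ReadMemStats(&after)
|         fmt.Printf("~%d bytes/goroutine\n",
|             (after.Sys-before.Sys)/n)
|         close(done)
|     }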
| barsonme wrote:
| In addition to the other comments about memory usage, I'll
| mention that there is a proposal (that's either going to make
| it into Go 1.19 or 1.20?) that uses heuristics to determine a
| good starting stack size for goroutines.
| sumy23 wrote:
| My experience is that whatever you're doing with the goroutine
| is usually a bottleneck before the goroutine itself. E.g. if
| you make a network request, you become network-bound before
| memory-bound from goroutines.
| hamburglar wrote:
| If you asked me what "a few kb" times "hundreds of thousands"
| is, I'd have characterized it as "more than a few hundreds of
| thousands of kb", not necessarily "gigabytes," and that doesn't
| sound impractical at all. My JVM heaps are usually 16GB.
|
| And go actually does a pretty good job of scheduling hundreds
| of thousands of threads. 6 months ago I did some fairly abusive
| high-thread-count experiments solving problems in silly ways
| that required all N goroutines to participate and I didn't see
| much perf falloff on my laptop until I got to 1.5-2 million
| goroutines.
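|
| For flavor, a sketch in the spirit of those experiments (not
| the exact code): a token-passing ring where every goroutine
| must be scheduled at least once for the message to get around.
|
|     package main
|
|     import (
|         "fmt"
|         "time"
|     )
|
|     func main() {
|         const n = 1_000_000 // shrink if RAM is tight
|         first := make(chan int)
|         in := first
|         for i := 0; i < n; i++ {
|             out := make(chan int)
|             go func(in, out chan int) {
|                 out <- <-in + 1 // take the token, pass it on
|             }(in, out)
|             in = out
|         }
|         start := time.Now()
|         first <- 0 // inject the token
|         fmt.Println(<-in, time.Since(start)) // n, elapsed time
|     }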
| uluyol wrote:
| Why is spending GB on stack space a bad thing? Ultimately, in a
| server, you need to store state for each request. Whether
| that's on the stack or heap, it's still memory that necessarily
| has to be used.
| guenthert wrote:
| Despite popular belief, not everything is a (web) server. I
| can imagine many threads being appealing in e.g. simulations.
| jjnoakes wrote:
| If you need the stack space then there is no difference. The
| difference arises because if you preallocate all that stack
| space using worst case stack sizes and don't use most of it,
| you've wasted lots of memory.
|
| Also there is a ton of nuance here like overcommitted pages
| and large address spaces which mitigate some of those
| downsides.
| scottlamb wrote:
| > So it's not practical to create 100s of Ks of goroutines -
| it's possible, sure, but because you incur GBs of memory
| overhead if you are actually creating that many goroutines
| means that for any practical problem you are going to want to
| stick to a few thousand goroutines. I can almost guarantee you
| that you have something better to do with those GBs of memory
| than store goroutine stacks.
|
| You lost me in a couple places:
|
| 1) "GBs of memory overhead" being a lot. A rule of thumb I've
| seen in a datacenter situation is that (iirc) 1 hyperthread and
| 6 GiB of RAM are roughly equivalent in cost. (I'm sure it
| varies over processor/RAM generations, so you should probably
| check this on your platform rather than take my word for it.) I
| think most engineers are way too stingy with RAM. It often
| makes sense to use more of it to reduce CPU, and to just spend
| it on developer convenience. Additionally, often one goroutine
| matches up to one incoming or outgoing socket connection (see
| below). How much RAM are you spending per connection on socket
| buffers? Probably a lot more than a few kilobytes...
|
| 2) The idea that you target a certain number of goroutines.
| They model some activity, often a connection or request. I
| don't target a certain number of those; I target filling the
| machine. (Either the most constrained resource of
| CPU/RAM/SSD/disk/network if you're the only thing running
| there, or a decent chunk of it with Kubernetes or whatever,
| bin-packing to use all dimensions of the machine as best as
| possible.) Unless the goroutines' work is exclusively CPU-
| bound, of course - then you want them to match the number of
| CPUs available, in which case thousands is already too many.
| deepsun wrote:
| > I think most engineers are way too stingy with RAM. It
| often makes sense to use more of it to reduce CPU, and to
| just spend it on developer convenience.
|
| Hey! That's Java's argument!
| morelisp wrote:
| I agree that GBs for 100Ks of goroutines is not in some
| sense "a lot", in that you might still be using memory pretty
| effectively. But I don't see that a "6GB vs 1 core" tradeoff
| makes any sense to talk about.
|
| We have HTTP ingress that needs ~100 cores but could
| theoretically all fit in 1GB. We have k/v stores that need
| only 16 cores but would like 500GB. And we have data points
| at most places in-between. We can't give the ingress 600GB
| instead, and we can't give the k/v stores 100 cores. So the
| fact they're financially interchangeable is meaningless for
| capacity planning.
|
| Arguably, for _most code_ and especially in a GCd language,
| using less memory and less CPU go hand-in-hand.
| scottlamb wrote:
| If you are in aggregate making good use of all the
| dimensions of the available machines/VMs, great. I think
| often people either leave one dimension unused or (when
| buying their own hardware / selecting a VM shape) could be
| adding more RAM cheaply.
|
| > Arguably, for _most code_ and especially in a GCd
| language, using less memory and less CPU go hand-in-hand.
|
| Agreed in general. Even in a non-GC language, less dense
| data structures mean worse CPU cache utilization. But on
| the other hand, memoization and the like can provide a real
| trade-off.
|
| In this case, I don't think it's costing much CPU. The GC
| isn't traversing beyond the bounds of the stack, and it
| mostly shouldn't end up in the CPU cache either. (Just a
| partial cache line at the boundary, and some more after a
| goroutine's stack shrinks or the goroutine exits.)
| geodel wrote:
| I mean, it has many diagrams and logical explanations of
| goroutines and concurrency concepts in general, but it is
| definitely not an under-the-hood description.
| geertj wrote:
| I came here to write the same thing. Things I had hoped this
| would go into are how goroutines grow their stacks and how
| they are preempted.
| ignoramous wrote:
| There you go: Vicki Niu, _Goroutines: Under the hood_,
| https://www.youtube-nocookie.com/embed/S-MaTH8WpOM (2020).
| jeffbee wrote:
| It didn't really lift the hood at all, unfortunately. Luckily for
| us the runtime is extensively commented, e.g.
| https://github.com/golang/go/blob/master/src/runtime/proc.go
| guenthert wrote:
| So it's just coroutines on top of n:m scheduling, similar to what
| SysV offered a while ago?
| captainmuon wrote:
| One thing that really goes against my intuition is that user
| space threads (lightweight threads, goroutines) are faster than
| kernel threads. Without knowing too much assembly, I would
| assume any modern processor would make a context switch a one-
| instruction affair: interrupt -> small scheduler code picks the
| thread to run -> LOAD THREAD instruction and the processor
| swaps in all the registers and the instruction pointer.
|
| You probably can't beat that in user space, especially if you
| want to preempt threads yourself. You'd have to check after every
| step, or profile your own process or something like that. And
| indeed, Go's scheduler is cooperative.
|
| But then, why can't you get the performance of Goroutines with OS
| threads? Is it just because of legacy issues? Or does it only
| work with cooperative threading, which requires language support?
|
| One thing I'm missing from that article is how the
| cooperativeness is implemented. I think in Go (and in Java's
| Project Loom), you have "normal code", but then deep down in
| network and IO functions, you have magic "yield" instructions. So
| all the layers above can pretend they are running on regular
| threads, and you avoid the "colored function problem", but you
| get runtime behavior similar to coroutines. Which only works if
| really every blocking IO call is modified to include yielding
| behavior. If you call a blocking OS function, I assume something
| bad will happen.
| heavyset_go wrote:
| I don't have any specific recommendations to give you, but skim
| through an operating systems textbook, or a college course that
| puts its slides and whatnot online, when it comes to kernel
| context switching. It'll give you an idea of what kind of work
| a kernel must do when context switching between threads and
| processes, and why userspace multitasking can be more
| efficient.
| garaetjjte wrote:
| One of the issues is that OS schedulers are complex, and
| actually much more expensive than the context switch itself. You
| can mitigate this with user-mode scheduling of kernel threads:
| https://www.youtube.com/watch?v=KXuZi9aeGTw
| danachow wrote:
| > modern processor would make a context switch a one
| instruction affair.
|
| Reduced instruction set (complexity) is a hallmark of modern
| processor designs, not the other way around.
|
| You might want to read about what is involved in a task switch
| (either a "thread" with the same memory mapping, or a
| "process"), but it is not something that can reasonably be
| carried out in one instruction.
| duped wrote:
| > I would assume any modern processor would make a context
| switch a one instruction affair. Interrupt -> small scheduler
| code picks the thread to run -> LOAD THREAD instruction and the
| processor swaps in all the registers and the instruction
| pointer.
|
| It can't be a single instruction, since the details of what a
| "context" contains depends on the OS and ABI. For example on
| Linux, the signal mask is a part of the OS thread context (but
| usually not user thread contexts) which requires a syscall to
| retrieve it from kernel memory before saving it in the context.
|
| The reason user threads are so much faster than OS threads
| is precisely that a switch can be reduced to a handful of
| instructions, without caring about all the details that OS
| threads need to care about.
|
| > Which only works if really every blocking IO is modified to
| include yielding behavior. If you call a blocking OS function,
| I assume something bad will happen.
|
| That's exactly what Go does: it introduces yield points into
| function prologues and I/O ops. You don't have direct FFI calls
| in Go, so it's not as big of an issue. It's roughly the same
| problem as GC safepoints in multithreaded interpreters that
| support FFI.
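|
| Concretely, the effect is that a read which looks blocking only
| parks the goroutine; the OS thread moves on, and the netpoller
| wakes the goroutine when data arrives. A minimal sketch (error
| handling elided):
|
|     package main
|
|     import (
|         "fmt"
|         "net"
|     )
|
|     func main() {
|         ln, _ := net.Listen("tcp", "127.0.0.1:0")
|         done := make(chan struct{})
|         go func() {
|             conn, _ := ln.Accept() // parks this goroutine only
|             buf := make([]byte, 16)
|             n, _ := conn.Read(buf) // netpoller wakes us on data
|             fmt.Println("got:", string(buf[:n]))
|             close(done)
|         }()
|         // Main keeps running while the goroutine is parked.
|         conn, _ := net.Dial("tcp", ln.Addr().String())
|         conn.Write([]byte("hello"))
|         <-done
|     }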
| pron wrote:
| There are a few reasons why context switching in user mode
| could be faster, but that has little if anything to do with the
| performance benefits of usermode threads. The performance
| benefit of usermode threads is a result of their _quantity_,
| due to Little's law. They're not "faster", just more numerous,
| and that's what you need for higher throughput. More precisely,
| OS threads, because of their scarcity, introduce an artificial
| bound on throughput that's lower than what the hardware can
| support, and usermode threads remove that bound.
|
| More here: https://inside.java/2020/08/07/loom-performance/
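|
| To make that concrete with Little's law (illustrative numbers,
| not a benchmark): with L requests in flight and an average
| latency of W per request, sustainable throughput is
|
|     lambda = L / W
|
|     W = 100 ms, L = 10,000 OS threads    -> 100,000 req/s max
|     W = 100 ms, L = 1,000,000 goroutines -> 10,000,000 req/s max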
|
| As to why it's hard for the OS to allow that many threads, the
| OS would need to keep thread stacks small and resizable, and
| that is hard to do if you don't know the specifics of how the
| language uses the stack. For example, to accommodate low-level
| languages that allow pointers into the stack, you would need
| either to manipulate virtual memory (to keep addresses valid),
| which only works at page granularity, or to introduce split
| stacks, which would require a new kind of ABI known to
| compilers (and would probably have a cost to performance).
| ibraheemdev wrote:
| > More precisely, OS threads, because of their scarcity,
| introduce an artificial bound on throughput that's lower than
| what the hardware can support, and usermode threads remove
| that bound.
|
| Why are OS threads scarce? The OS allocates thread stacks
| lazily. Given a kernel stack of ~8kb (two pages) and a user
| stack of ~4kb, one could spawn 100k threads with just over
| 1GB. A userspace runtime will allow you to bring that number
| down, but if you're at that scale of concurrency it is
| unlikely to matter much.
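|
| (Spelled out: (8 KB kernel + 4 KB user) x 100,000 threads =
| 1.2 GB.)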
| masklinn wrote:
| > And indeed, Go's scheduler is cooperative.
|
| It hasn't been cooperative for a few versions now; the
| scheduler became preemptive in 1.14. And before that there were
| yield points in every function prologue (as well as in all IO
| primitives), so there were relatively few situations where
| cooperation was necessary.
|
| > Without knowing too much assembly, I would assume any modern
| processor would make a context switch a one instruction affair.
|
| Any context switch (to the kernel) is expensive, and way more
| than a single operation. The kernel also has a ton of stuff to
| do; it's not just "pick the thread to run": you have to restore
| the IP and SP, but you may also have to restore FPU/SSE/AVX
| state (AVX-512 is over 2KB of state) and trap state.
|
| Kernel-level context switching costs on the order of 10x what
| userland context switching does:
| https://eli.thegreenplace.net/2018/measuring-context-switchi...
|
| > LOAD THREAD
|
| There is no LOAD THREAD instruction.
| jaytaylor wrote:
| > It hasn't been cooperative for a few versions now; the
| scheduler became preemptive in 1.14. And before that there
| were yield points in every function prologue (as well as in
| all IO primitives), so there were relatively few situations
| where cooperation was necessary.
|
| Since co-op was mostly unnecessary, do you know why it was
| changed to preemptive, or what the specific cases were that
| preemptive scheduling resolves?
| mseepgood wrote:
| Tight loops without function calls.
| kgeist wrote:
| IIRC in earlier versions, an infinite loop without
| function calls could freeze the entire runtime: GC's stop
| the world event is triggered => goroutines are being
| suspended cooperatively => but this one goroutine never
| enters a function prologue and thus never suspends => all
| goroutines except one are suspended and there's no
| progress. Preemptive scheduling is much more robust. It's
| solvable in cooperative scheduling with an additional
| suspension check at the end of each loop iteration, but that
| adds overhead to all loops. If I remember correctly,
| .NET or JVM implement safe points for GC (which can be
| used to switch contexts cooperatively as well) by simply
| reading a pointer from a special preallocated virtual
| memory page which is remapped to nothing when a stop-the-
| world event is triggered, so such a read traps into an
| invalid memory handler where you can park your thread.
| But I'm not sure how costly it is for thousands/millions
| of coroutines.
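|
| Roughly the classic demonstration of that failure mode (a
| sketch; before Go 1.14, with GOMAXPROCS=1, this program never
| prints):
|
|     package main
|
|     import (
|         "fmt"
|         "runtime"
|         "time"
|     )
|
|     func main() {
|         runtime.GOMAXPROCS(1)
|         go func() {
|             for { // tight loop: no calls, no yield points
|             }
|         }()
|         time.Sleep(time.Millisecond) // let the loop take the CPU
|         // Pre-1.14 this line is never reached; async preemption
|         // (1.14+) lets the runtime interrupt the loop.
|         fmt.Println("preempted the tight loop")
|     }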
| jesboat wrote:
| > But I'm not sure how costly it is for
| thousands/millions of coroutines.
|
| Still cheap: you only need to preempt the threads which
| are actively running user code. If a coroutine is ready
| to run, but not actually running, you don't have to do
| anything with it (as long as you check for safepoints
| before entering user code.) That means your safepoint cost
| is `O(os threads currently running user code)`, which in
| most runtimes is `O(num cores)`.
| throwaway894345 wrote:
| Replace "tight" with "long-running/infinite" but yeah,
| otherwise this is correct.
| YZF wrote:
| There is a ton of context associated with OS/kernel threads.
| Virtual memory, security, I/O. While there is some hardware
| acceleration for those in modern processors there isn't
| anything like LOAD THREAD and even with CPU support it's still
| very expensive.
|
| You get an interrupt, then the kernel needs to load its own
| context (the tables it needs to access), then the kernel needs
| to do the expensive switch.
|
| In user space you have a lot less context. The actual switching
| is pretty much the cost of a function call. If you need
| preemption that's a different story and mostly depends on what
| facilities are available for that. Inserting preemption checks
| is a little hacky (hello Go ;) ) but what can you do.
|
| EDIT: It's worth noting there are indirect costs, like caches
| being stale. Lightweight/green threads will often work on
| shared data structures so the caches are more likely to have
| something useful in them. They may even share the code.
| asien wrote:
| >I would assume any modern processor would make a context
| switch a one instruction affair.
|
| That has been the historic assumption; it has been proven
| wrong by every possible benchmark.
|
| Consider TechEmpower[0] for raw stack performance: runtime-
| level threads outperform OS-level IO threads, since OS threads
| were designed to be mapped onto physical cores.
|
| This is very expensive and inefficient.
|
| Creating one thread for every request you have (Apache + PHP)
| will exhaust the hardware after a few thousand qps.
|
| A runtime can indeed have millions of those "lightweight
| threads" without killing your machine, since it creates a pool
| of physical threads and taps into IO events to efficiently
| switch or resume contexts. This is far faster.
|
| [0]
| https://www.techempower.com/benchmarks/#section=data-r20&hw=...
| gpderetta wrote:
| Just the interrupt itself is going to cost a couple of orders
| of magnitude more than the whole userspace context switch.
| [deleted]
| yellow_lead wrote:
| (2020) But so high level it's still relevant.
| bogomipz wrote:
| In the conclusion the author states:
|
| >"Go run-time scheduler multiplexes goroutines onto threads and
| when a thread blocks, the run-time moves the blocked goroutines
| to another runnable kernel thread to achieve the highest
| efficiency possible."
|
| Why would the Go run-time move the blocked goroutines to
| another runnable kernel thread? If a goroutine is currently
| blocked, it won't be schedulable regardless, no?
___________________________________________________________________
(page generated 2022-06-05 23:00 UTC)