[HN Gopher] C++ Coroutines
___________________________________________________________________
C++ Coroutines
Author : todsacerdoti
Score : 98 points
Date : 2023-02-22 16:31 UTC (6 hours ago)
(HTM) web link (nigeltao.github.io)
(TXT) w3m dump (nigeltao.github.io)
| jmull wrote:
| > Asynchronous code (e.g. callbacks) is more efficient (letting
| you do other work while waiting for things) but is also more
| complicated (manually saving and restoring state).
|
| This is a persistent misunderstanding about async... it is _not_
| generally more efficient than non-async code. In fact, the
| overhead of async may slow things down. (There can be other
| downsides as well. E.g. debugging may be more difficult.)
|
| For it to help, you need to have something better for the current
| thread to be doing than waiting for whatever it is you're waiting
| for.
|
| Whatever that is, another worker thread could be doing it, so
| what we're really talking about is that async may let you reduce
| the number of threads you're using.
|
| (Async is so useful in JS because it lets you reduce the number
| of threads to one, which lets you sidestep all kinds of hairy
| issues.)
|
| That might be good enough, depending on your situation, to
| make the downsides of async worthwhile. But it may very well
| not be.
|
| Anyway: don't use async because it's "more efficient".
| the_duke wrote:
| You are leaving out an important aspect: context switches are
| expensive, especially with all the mitigations introduced after
| Spectre et al.
|
| Doing more work in a single thread saves a lot of context
| switches, especially if IO makes up a large portion of the
| workload. And even more so with other mechanisms to minimize
| syscalls, like io_uring on Linux.
|
| Async overhead is highly dependent on the way the async runtime
| is implemented.
|
| It's true that async requires scheduling and task switching to
| be implemented in userspace, but assuming that async is used as
| an alternative to threads, then the OS would be doing a similar
| amount of work anyway.
|
| The OS may be able to do that more efficiently because it has
| more information about the system. Or it may be the opposite,
| because the application has more information about the
| workloads.
|
| "async = overhead" as a general statement is not correct.
| otabdeveloper4 wrote:
| Context switches happen regardless of whether you're using
| kernel mode ("threads") or user mode ("async") scheduling.
|
| And kernel mode actually has much more efficient and
| performant context switches, because the kernel can access
| hardware directly.
|
| (It is sometimes useful to be able to do custom user mode
| scheduling, but certainly not because of "context switches".)
| [deleted]
| the_duke wrote:
| I don't quite follow your argument there.
|
| This is unrelated to kernel threading.
|
| If you have 1 thread handling 1000 requests with some async
| io mechanism (epoll, io_uring, ...) ,instead of 1000
| threads each handling one request, there are much fewer
| threads fighting over CPU cores and the 1 thread can stay
| active much longer, hence reducing the amount of context
| switches.
|
| Especially with a mechanism like io_uring, which helps
| minimize syscalls (and hence switching to kernel threads).
| otherjason wrote:
| Do you have a citation for kernel mode having more
| efficient context switches? What kind of direct hardware
| access are you referring to that would be better than
| pushing the register context onto the stack?
|
| In my experience, the exact opposite is true, particularly
| in the era of CPU mitigations that require TLB flushes upon
| every kernel-mode context switch.
| messe wrote:
| You're right, kernel-level context switching is much
| slower than user-level context switching.
|
| User-level can also have the advantage of having more
| actual context about the task that is running, meaning
| that it's often able to avoid saving/restoring as much
| data as a kernel-level switch would. See Go's green
| threads for a great example of this kind of cooperation
| between runtime and language.
|
| > Do you have a citation for kernel mode having more
| efficient context switches? What kind of direct hardware
| access are you referring to that would be better than
| pushing the register context onto the stack?
|
| The closest thing to this that I can think of is on
| 32-bit x86, which did have hardware-assisted context
| switching via the TSS (task state segment).
|
| As it happens, everybody stopped using it because it was
| too slow, and a bit painful unless you fully bought into
| x86's awful segmentation model. Early Linux kernels use
| it if you want to see it in action.
| jmull wrote:
| Thread context switches are about scheduling threads on
| physical cores. Async is about executing in a scope on a
| thread. There's no direct conversion here. Packing a thread
| with async tasks could let you reduce the number of threads
| you have, which will probably reduce the thread context
| switching if there is contention for cores. Whether that's
| significantly better than the cost of async context switches
| really depends on specifics (even if async context switch
| cost is near zero, though we're no longer talking about C++
| coroutines in that case). But keep in mind: If your threads
| are mostly waiting, they aren't triggering a lot of context
| switches. More async means fewer, busier threads; less async
| means more, less-busy threads. To run into a problem with less
| async your threads need to be forcing switching at pretty
| fine-grained level. (BTW, most blocking for IO isn't fine-
| grained from the perspective of instructions executing on a
| core) That kind of thing happens, but has solutions other
| than async.
| naasking wrote:
| > Anyway: don't use async because it's "more efficient".
|
| I don't think efficiency has only one meaning. For instance,
| you can create millions of coroutines while you're limited to
| only thousands of threads. Doesn't that mean your program is
| making better use of your hardware and thus is more efficient?
| otabdeveloper4 wrote:
| Coroutines and threads are not that different.
|
| Threads usually preallocate some stack space, while
| coroutines usually don't. But that isn't such a big deal.
| josephg wrote:
| Coroutines also don't need to flush the TLB when they get
| scheduled. This is a pretty big deal for performance, when
| you have a lot of them.
| naasking wrote:
| That's not the only difference. Coroutines also typically
| don't have nearly as many concurrency hazards because you
| can control when you yield.
| dymk wrote:
| coroutines can be either cooperative or not, depending on
| implementation, so it depends
| jmull wrote:
| Sure, they _can be_ more efficient in _particular
| situations_.
|
| But coroutines have overhead (so do threads). Your millions
| of coroutines might be worse than your thousands of threads
| in terms of making efficient use of the cpu cores your app or
| service has available.
| nmilo wrote:
| No, your processor can only do n things at a time, where n is
| the number of cores (probably like 8) (simplifying hugely of
| course). It's pretty easy to fill out 8 cores with 1000s of
| threads, the coroutines only become useful when almost all of
| them are blocking on a socket or something, periodically
| switching on and off to handle it.
| [deleted]
| naasking wrote:
| Yes, but contexts where you're creating a million
| coroutines are almost certainly I/O bound (like a web
| server).
| KMag wrote:
| Agreed in general, that async code is generally less efficient
| than epoll/kqueue/poll/select/etc. non-blocking I/O. However,
| note that async code is often quite a bit more efficient than
| naive Java 1.3-style (pre-NIO) tons-of-blocked-threads code.
|
| Now, io_uring async code is yet another story.
|
| I was working at LimeWire when Apple finally shipped a 1.4 JVM
| as their system JVM. The whole thing got much lighter weight
| when we could use non-blocking I/O.
| junon wrote:
| Eh? Async code almost always uses the APIs you describe.
| KMag wrote:
| The async runtime is almost always built on top of non-
| blocking I/O APIs, and the extra layer of abstraction adds
| overhead. You're rarely calling epoll in your async code,
| but rather your runtime is handling all of that for you.
|
| That is, unless you're using io_uring, POSIX AIO, etc., in
| which case your async code might not have the extra layer
| of abstraction on top of the system APIs.
| mr_00ff00 wrote:
| But a single thread with async can be, and for many tasks
| is, more efficient than a single thread without async,
| because you aren't waiting on requests.
|
| One point also is that until recently most personal computers
| did not have multiple cores, aka you could think at the lowest
| level that even multiple threads were just more dangerous async
| lol
|
| Even making new threads in the JVM isn't guaranteed to result
| in more than a single OS thread.
| jcelerier wrote:
| > One point also is that until recently most personal
| computers did not have multiple cores,
|
| the Core 2 Duo was released 17 years ago
| jcranmer wrote:
| Dual core _started_ coming out ~2005. It took a few more
| years for it to dominate the product lineup. Computers tend
| to have a replacement cycle of ~5 years. Consequently, it's
| not until about 2012-ish that you could reasonably expect a
| personal computer to have a multicore processor.
|
| (Yeah, that's still a decade ago, but that's a good deal
| more recent than 17 years.)
| dymk wrote:
| A decade is a long time for computers
| duped wrote:
| Asynchronous execution is more efficient if and only if the
| time spent waiting for a call to complete is longer than the
| time spent checking if the call has completed plus the time
| spent dispatching to another asynchronous call (and waiting vs
| checking if it is completed, and so on).
|
| Remember that concurrency is not parallelism, it's not about
| the number of threads being used.
| danuker wrote:
| Why implement coroutines when you have multithreading?
|
| Coroutines are to me a brain-twisty way to make the most out of a
| single core, because your language doesn't support
| multithreading. For example: JavaScript with a single thread in
| the browser, Python with the GIL.
|
| C++ has threads which ACTUALLY run in parallel on the CPU. Why
| bother complicating the language further?
| eestrada wrote:
| Because native multithreading is expensive when it comes to (a)
| memory consumption per thread of execution and (b) context
| switching. Also, if your program spends a lot of time just
| waiting on outside resources (most notably slow/blocking IO),
| coroutine concurrency is both faster and more lightweight. If
| your program is CPU bound, then coroutines/async have little-
| to-no value over native multithreading.
|
| That being said, yeah, most async workflows are obtuse and
| difficult to follow and they force all consumers of your
| interface to also use the async primitives if they want the
| benefits. Go seems to be one of the few languages that does
| async/coroutines right.
|
| See: https://eli.thegreenplace.net/2018/go-hits-the-
| concurrency-n...
| rcme wrote:
| What I like about Go is that each co-routine is entirely
| synchronous, and the only concurrency primitives you're given
| are channels and mutexes. This makes your code so simple and
| easy to understand.
|
| I just looked up Kotlin's co-routines, and there are a number
| of annoying hoops you need to jump through:
|
| 1. Co-routines can only be built by calling runBlocking or
| coroutineScope.
|
| 2. Functions that are async but should act like blocking
| calls in a co-routine need to be marked with `suspend`.
|
| 3. There are a bunch of primitives you need to use like
| `launch` or `async`.
|
| 4. There are built-in options to make the co-routines lazy,
| but it makes the API feel sloppy.
|
| 5. Rather than locks, Kotlin encourages explicit thread
| dispatching. Sometimes you need to do this in Go too, like
| when using a C library, but in general it's rare.
| ackfoobar wrote:
| `launch` is just the `go` keyword. `async` stores the
| result in a `Deferred` (or `Promise`), which is an
| abstraction for parallel decomposition.
|
| Other than confining UI work to the main thread, I don't
| think I have seen thread confinement as an alternative to
| locks.
| rcme wrote:
| That's kind of my point. Go has the `go` keyword. Kotlin
| has runBlocking, coroutineScope, launch, async, and
| suspend. Go's simplicity feels nicer to me.
| ackfoobar wrote:
| `runBlocking` and `suspend` are the coloured-function
| point. Some hate it, I don't mind it. It has its
| benefits.
|
| ---
|
| A quick search on SO gives this example of parallel
| decomposition.
| https://stackoverflow.com/questions/57662739/pattern-for-
| fet...
|
| Starting several `async` tasks, then awaiting them is
| cleaner for me. But that's subjective.
|
| ---
|
| Kotlin's structured concurrency handles the context stuff
| by default. When you handle those concerns (cancellation
| for example) in go, it's just as complex, but more
| verbose.
| rcme wrote:
| Go's context object is used for cancellation. I think
| this is pretty simple and has the benefit of avoiding any
| type of exception handling while still giving you the
| ability to clean up anything you were doing.
|
| In general, if you like Kotlin, I can see why you'd like
| their approach to concurrency. Kotlin really likes a
| large standard library with generic blocks that act as
| syntactic sugar. I used to like that too, but as I've
| gotten older I've gotten lazier. Kotlin now has too much
| mental overhead for my old brain.
| int_19h wrote:
| The other consequence, however, is that Go coroutines only
| work with other Go code, and bespoke Go threads and stacks
| require the runtime to jump through a lot of hoops to do
| FFI correctly, with the result that the ecosystem tends to
| rewrite rather than reuse existing code in C and other
| languages.
|
| In contrast, the approach with explicit promises and await
| can be mapped all the way down to a C struct with a
| function and a data pointer representing a callback. That
| is, it is C ABI compatible, and thus it can be easily
| composed across any boundary that respects that ABI.
| rcme wrote:
| The only runtime hoop I've ever had to jump through was
| to invoke runtime.LockOSThread() inside a go-routine. In
| a lot of ways:
|             go func() {
|                 runtime.LockOSThread()
|                 // ...
|             }()
|
| is simpler than creating a thread in other languages.
| int_19h wrote:
| The runtime jumps through those hoops for you as needed,
| mostly, but there are consequences to that wrt
| performance.
| PaulDavisThe1st wrote:
| The cost of context switching between threads sharing the
| same address space is much lower than between tasks, because
| there's no need for a TLB flush. It just comes down to the
| cost of saving/restoring register state (more or less), and
| that's a nice known fixed cost on any given processor.
|
| Context switching between tasks has a variable cost, because
| the implications of the TLB flush depend on the working set
| size of the switched-to task. If it doesn't touch much
| memory, the variable cost is low; if it touches a lot, the
| variable cost can be much, much higher than the register
| save/restore (fixed cost).
| nordsieck wrote:
| > It just comes down to the cost of saving/restoring
| register state (more or less), and that's a nice known
| fixed cost on any given processor.
|
| There's also the cost of cache eviction. Not something you
| have to manually manage, but it's a cost you pay
| nonetheless. Maybe.
| PaulDavisThe1st wrote:
| Indeed, I ought to have mentioned the cost of cache
| misses post-context switch too. It too is working set
| size dependent, but the relationship looks different from
| the cost of TLB misses. However, my understanding is that
| an intra-task thread switch would also not cause cache
| invalidation (there's no need; the address space remains
| the same).
| GuB-42 wrote:
| Coroutines are when you don't want to run in parallel, because
| sequence matters. A typical example is iterators, you want the
| loop and iterator (routine and coroutine) to run in lockstep.
|
| You could use threads instead, but then you risk getting in all
| sorts of concurrency problems. And if you synchronize
| everything, then great, you have reinvented coroutines. In
| fact, you can think of coroutines as threads with built-in
| synchronization, which may be how they are implemented.
| [deleted]
| layer8 wrote:
| Coroutines are for when you would normally use callbacks or a
| state machine, but handling the callback's state or the state
| machine is complex enough that it becomes much simpler when
| expressed as a coroutine (simpler to reason about). It isn't
| really an alternative to multithreading.
| okamiueru wrote:
| To more elegantly write code that is async in nature like
| IO/Web requests? Let stuff happen and just await where needed?
|
| Seems like different type of concern than what you need
| multiprocessing for.
| injidup wrote:
| Coroutines don't really have anything to do with threads. They
| are just a way for you to model suspending and resuming
| functions.
|
| If you have every tried to implement a complicated iterator you
| will find that you model a for loop in your head but that the
| iteration of the loop is controlled from outside using the ++
| operator. The ++ operator allows the iterator loop to go one
| more time round. Normally you have to manage this with state
| stored in some object. With coroutines you can actually write a
| loop and co_yield from the middle of the loop to implement the
| ++ operator.
| grogers wrote:
| This is actually really convenient.
|
| In our code we have an abstraction that wraps a set using a
| variant. Instead of implementing an iterator/begin/end based
| on which data the variant holds, we have an iter() method
| that unwraps the variant and loops through and co_yields.
| Since you can consume a generator in a for-each loop, you use
| it the exact same way (except calling iter).
|
| In our case it doesn't save that much complexity, but it's
| definitely some. I could imagine more complicated cases where
| it is even more convenient.
| 112233 wrote:
| Coroutines can be used as another language construct, like
| lambdas, to manage object lifetime and make some control
| flow much easier to write. It is still possible to use
| threads if you have coroutines.
| [deleted]
| [deleted]
| bmurphy1976 wrote:
| Good for embedded devices (e.g. a single core ESP32).
| [deleted]
| jcelerier wrote:
| even without any threads or async, coroutines are very nice for
| writing generators and super useful when you want generic code
| which can iterate, without leaking the container implementation
| to client code
| steveklabnik wrote:
| Why use a hammer when you already have screwdrivers?
| com2kid wrote:
| > Why implement coroutines when you have multithreading?
|
| Because coroutines are easier to think about than threads, and
| avoid entire (large) classes of bugs.
|
| Because coroutines (typically) have a lower overhead than
| threads.
|
| Because if you are IO bound and not CPU bound, and therefore
| you don't need to spread work across cores, threads may not
| provide a given workload with much, or any, benefit.
| demindiro wrote:
| For an I/O heavy application (not in C++) I avoid threads for
| the main "loop" as there is a central cache that is accessed
| often. Synchronizing access to it is likely more expensive and
| error-prone than using just a single thread. It is also
| possible a few hundred or thousand tasks are running
| concurrently, which becomes very expensive very quickly with
| threads.
|
| There are a few CPU-intensive tasks but those are easy to
| isolate and send to a threadpool.
| akmittal wrote:
| Real threads can be too expensive sometimes. A single
| coroutine in Go uses just a few KB of memory, so we can
| easily spawn hundreds of thousands of coroutines, which
| isn't possible with threads.
| dragontamer wrote:
| Your point is correct but your magnitudes are off.
|
| IIRC, a thread takes up ~32kB or so worth of data in
| pthreads/Linux, or at least... somewhere around there. So on
| a rather cheap server with 64GB of RAM, you can easily fit
| 1-million threads. More expensive servers have 512GB, 1TB or
| more RAM and can easily handle this kind of load.
|
| Of course, coroutines are even smaller/more efficient and can
| go into 10-million, 100-million or higher.
| Matheus28 wrote:
| A thread on Windows is MUCH heavier than Linux. 1 MB stack
| from what I remember plus takes a lot longer to create.
| dragontamer wrote:
| dwStackSize can be set as low as 64kB though. So if
| you're in the habit of making millions-of-threads, maybe
| make them smaller on Windows?
|
| It does default to 1MB, but these sorts of things are
| configurable.
| com2kid wrote:
| Ugh I really need to finish my blog post where we had async
| with only a few dozen bytes of overhead per task.
|
| Cooperative async runtimes are obscenely efficient.
| dragontamer wrote:
| > C++ has threads which ACTUALLY run in parallel on the CPU.
| Why bother complicating the language further?
|
| Actual parallelism is surprisingly slow and resource heavy in
| practice. When everything is on one core, you keep things local
| in L3, L2, L1.
|
| However, add on a 2nd core, then you suddenly have "ping
| ponging". Lets say core#0 has data in L1 cache, and now Core#1
| needs to read-modify-write to it. That means the cores need to:
|
| 1. Core#1 begins to mark that cache-line as exclusive.
|
| 2. Core#0 needs to then respond by marking the line as invalid.
| Then Core#0 ejects the data from L1, and passes it to Core#1.
|
| 3. Core#1 can finally begin to work on the data.
|
| 4. If Core#0 uses the data again, Core#1 must do step#2 again.
|
| --------
|
| This is called "ping-ponging". An L1 cache read/write takes
| about 1 nanosecond, but a ping-pong can take 30 nanoseconds
| or more: 30x slower for no reason.
|
| You don't want to add another core to your problem unless
| you're _actually_ getting a speed benefit. Its more complex
| than it may seem at first glance. You very well could add a
| bunch of cores and then suddenly your program is way slower
| because of these kinds of issues.
|
| Another problem: False sharing. Core#0 is working on "int x",
| and Core#1 is working on "int y", but x and y are on the same
| cacheline. So they have to ping-pong even though they never
| actually touch the same data.
| PaulDavisThe1st wrote:
| Your example implies some combination of (a) excessive data
| sharing between threads (b) possible need for core pinning.
|
| If thread A is going to need to read _all_ the data touched
| by thread B, then it 's unclear why you've split the task
| across threads. If there are still good reasons, then
| probably pin A and B to core N (never pin anything to core 0,
| unrelated story), and let them interleave there.
|
| If that doesn't make sense, then yep, you'll have to face the
| cost of ping-ponging.
| Night_Thastus wrote:
| EDIT: I think I follow how this works now. I didn't understand
| the sieve itself, so the co-routines were throwing me for a
| loop.
|
| We get all the numbers from 2 to 40, print out 2, then eliminate
| all the entries divisible by 2. Now we're left with 3, 5, 7, 9,
| etc.
|
| We print out 3, then remove all the entries divisible by 3.
|
| This repeats until we've hit the end (40). The next number after
| a round of eliminations will always be a prime.
| layer8 wrote:
| Yeah, the sieve use-case is a bit too clever IMO to be a good
| introductory example.
| zabzonk wrote:
| perhaps the take here is that many features of modern c++ are not
| intended for application writers, but more for library writers?
| ghostwriter wrote:
| They serve both to a great extent, as long as code owners are
| willing to refactor their codebases. The changes do actually
| make code significantly safer and more pleasant to work with.
| My favourite ones are the concepts and class template argument
| deductions.
| steveklabnik wrote:
| I can't speak to modern C++ specifically, but in many, many
| languages, more complex features are often for library writers
| than application writers.
| corysama wrote:
| That's definitely a thing. A whole lot of the features added
| since C++11 have been specifically intended to enable library
| writers to do very complex things under the hood so they can
| present powerful yet simple interfaces to users.
| mrfox321 wrote:
| Users of coroutine-friendly libraries will have to adopt the
| async function coloring (co_await, co_yield, co_return,
| Task<T>).
|
| So saying that coroutines are low-level / not application code
| is misguided.
| nly wrote:
| Not necessarily. Coroutines are invisible at the ABI level
| (that is, coroutines are just plain functions).
|
| It's pretty easy to write a library with them (using your own
| executor) and not reveal their use to the user.
|
| Obviously sharing an executor between application and library
| is another matter.
| Joker_vD wrote:
| > sharing an executor between application and library is
| another matter.
|
| Especially when there is not one, but several libraries,
| and the application itself doesn't actually have or use
| any executors. Now that's a fun challenge.
| mrfox321 wrote:
| Good point.
|
| Although doesn't that defeat the purpose of structured
| concurrency?
|
| i.e. creating tasks that are plumbed to some background
| thread/executor.
|
| By directly invoking the coroutine, lifetime issues become
| much easier to avoid.
| artemonster wrote:
| The fact that they went for stackless is a testament to how bad
| committee-driven design process is. just a bunch of bureaucrats
| playing important actors, similarly to stackoverflow mods.
| colejohnson66 wrote:
| Can you ELI5 why stackless is bad?
| artemonster wrote:
| For coroutines to be _really_ useful, they have to be
| stackful. The guy that originally did the proposal for C++ is
| also an author of multiple coro libraries and did research on
| this. You can emulate stackful coros with stackless via
| trampoline, but heck, you can emulate stackless coros with
| closures and trampoline too! The requirement of this extra
| trampoline makes their use extra convoluted and negates the
| very slight advantages that such functionality may really
| bring. Making stackful "right" was hard, so a useless
| compromise was made, which is basically like adding a
| syntactic sugar to the language.
| [deleted]
| int_19h wrote:
| It may be a compromise, but what exactly makes it
| "useless"? A similar style of async has been in use in e.g.
| C# for over a decade now, and it has been very successful
| there. Yes, it is syntactic sugar - but so are e.g.
| lambdas. What matters in practice is that it makes some
| kinds of code much shorter and easier to read.
| artemonster wrote:
| Stackful coroutines are a feature on par with the power
| of delimited continuations (a proven fact in academia;
| you can easily find somewhat easy-to-follow papers on
| this topic), while stackless coroutines are a stupid
| gimmick. You see the "compromise"? You argued the case
| for adding a car, but after long debate, politics and
| "compromise" you got a _TOY_ car instead.
| [deleted]
| malkia wrote:
| I'm only sitting here, waiting for my thread_locals to get
| broken.... or maybe they are handled magically somehow
| jcelerier wrote:
| .. why would they be broken ? coroutines are not related to
| threads
| grogers wrote:
| I think the point is that if the coroutine suspends it could
| be resumed on a different thread. So you don't really want to
| mix thread locals and coroutines.
| jcelerier wrote:
| I mean, for sure but that's not a property of coroutines,
| any callback system can do that
|
| e.g. if you have this code with boost:
|             #include <boost/asio.hpp>
|             #include <chrono>
|             #include <iostream>
|             #include <mutex>
|             #include <thread>
|             #include <vector>
|
|             std::mutex print_mutex;
|
|             int main() {
|               using namespace boost::asio;
|               using namespace std::literals;
|               io_context io_context(20);
|               auto work_guard = make_work_guard(io_context);
|               for (int i = 0; i < 100; i++) {
|                 post(io_context, [&] {
|                   thread_local int x = 0;
|                   int z = x++;
|                   std::lock_guard _{print_mutex};
|                   std::cout << z << "\n";
|                 });
|               }
|               std::vector<std::jthread> v;
|               for (int i = 0; i < 20; i++)
|                 v.emplace_back([&] { io_context.run_for(2s); });
|             }
|
| i hope you don't expect the output to be "0, 1, 2, ..., 99"
___________________________________________________________________
(page generated 2023-02-22 23:01 UTC)