[HN Gopher] Notes on Virtual Threads and Clojure
___________________________________________________________________
Notes on Virtual Threads and Clojure
Author : Borkdude
Score : 89 points
Date : 2022-05-13 09:45 UTC (13 hours ago)
(HTM) web link (ales.rocks)
(TXT) w3m dump (ales.rocks)
| geokon wrote:
| So maybe a more basic and fundamental question - but why can't OS
| threads be as good as these virtual threads? Why do we have to
| have two layers of thread abstraction? My first reaction is
| "eww", quickly followed by "I'm prolly not understanding
| something here". Why can't we make the OS handle thousands of
| threads?
| Blackthorn wrote:
| We can; this was inspired by a talk Google gave about their
| own infrastructure, but that work was apparently never open-
| sourced.
| verdagon wrote:
| Context switching costs are slightly higher for OS threads,
| but the real bottleneck is physical memory usage. An OS thread
| starts with at least 4 KB of physical memory for its stack
| (one page on most OSes) and grows it a page at a time as
| needed. A virtual thread often starts with under 1 KB.
|
| In a program with lots of small OS threads, that's quite a bit
| of wasted space.
| native_samples wrote:
| Thousands of threads are not a problem for any modern OS if
| you have enough RAM. Loom takes that up into the millions.
|
| As for why the JVM can do this and you can't with other
| runtimes, it's because it needs a complex runtime with very
| tight integration between the compiler, the GC, the threading
| library and various other subsystems. That's why the moment you
| make a native call the JVM loses the ability to switch between
| the virtual threads and has to start up more OS threads to
| compensate - it didn't create the code that's running on the
| stack at that point, so it can't work its usual magic.
|
| Fortunately the whole 'pure Java' movement of some decades ago
| means that actually you can do a heck of a lot of stuff without
| using native code. It's an ecosystem way less dependent on C
| extensions than most other languages.
| renox wrote:
| They could: Solaris had M:N threads at one point, but 1)
| they're complicated and 2) you pay a performance cost even if
| you don't use many threads.
| pjmlp wrote:
| They were called LWPs (lightweight processes), and they were
| eventually dropped for exactly those reasons.
| kerblang wrote:
| Since nobody else brought it up: one big item is non-blocking
| I/O. The idea is that you can manage thousands of network
| connections with only a handful of threads, because with
| thread-per-connection you'll notice that your threads are
| mostly asleep in typical I/O-heavy networked systems (like the
| web-server/database stuff most of us work on). You can do non-
| blocking in Java now, but only with nasty callbacks, Futures,
| thread pools, and such.
|
| In Go you get non-blocking "for free" so that it looks like
| blocking I/O but the runtime automatically suspends & resumes
| virtual threads as I/O buffers fill & drain. One of the big
| inspirations for Loom was Go jealousy. For all Go's faults,
| this is a really nice thing to have.
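|
| A minimal Clojure sketch of that blocking-looking style on
| virtual threads (assuming JDK 21's Thread/startVirtualThread;
| the echo server itself is made up for illustration):
|
|     (import '(java.net ServerSocket))
|
|     (defn echo-server [port]
|       ;; Thread-per-connection, but with virtual threads: the
|       ;; "blocking" reads and writes park the virtual thread
|       ;; and free the carrier OS thread while I/O is pending.
|       (with-open [server (ServerSocket. port)]
|         (loop []
|           (let [sock (.accept server)]
|             (Thread/startVirtualThread
|              (fn []
|                (with-open [s sock
|                            in (.getInputStream s)
|                            out (.getOutputStream s)]
|                  (let [buf (byte-array 4096)]
|                    (loop []
|                      (let [n (.read in buf)]
|                        (when (pos? n)
|                          (.write out buf 0 n)
|                          (recur)))))))))
|           (recur))))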
| dboreham wrote:
| The OS can handle thousands of threads easily. Kernel
| threading runs into scaling problems in the hundreds of
| thousands to millions range.
| duped wrote:
| I'm going to call them "green threads" in this context, don't
| read into it because the terminology is loose.
|
| - Since GT are known to a compiler/runtime they can be
| cooperative, as opposed to preemptive; the OS cannot
| introspect into what your code is doing. That means the
| runtime may insert suspend and resume points at API
| boundaries, for example at any kind of I/O. While an OS could
| implement this in theory (switch threads at I/O syscalls), in
| practice I don't think any thread scheduler does.
|
| - GT context switches are orders of magnitude faster than OS
| thread context switches. All they need to do is swap a stack
| pointer and a few registers.
|
| - GT often have growable stacks that start extremely small, OS
| threads have larger default stack sizes that don't grow. This
| makes starting and destroying GT cheaper, and they can use less
| memory than the equivalent number of OS threads.
|
| As for why OS threads don't just work the same: there are
| trade-offs. You can't pin a GT to a single core or set thread
| affinity. Context switching an OS thread is more expensive
| because it does more things, like saving the signal mask
| (which involves a syscall, hence the cost). The large fixed
| stack size means OS threads never have to allocate memory mid-
| call when the stack runs short, unlike growable GT stacks. And
| they have to be preemptive, since the OS knows nothing about
| what the program is doing beyond a handful of I/O syscalls.
|
| They're essentially two different technologies for two use
| cases. OS threads are more robust and closer to the metal, but
| they are expensive to start up and switch between; GT are
| small and cheap.
|
| Another way to look at it is that there are _only_ OS threads,
| and "virtual" or "green" threads are distinct concurrent tasks
| scheduled onto a pool of OS threads on the fly. The compiler
| inserts a bunch of bookkeeping to make scheduling those tasks
| easier. The only thing that makes them "thread-like" is the
| API.
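|
| To make "small and cheap" concrete, a throwaway Clojure sketch
| (assuming JDK 21's Thread/startVirtualThread; the count and
| sleep are arbitrary):
|
|     ;; Start 100,000 virtual threads that each sleep briefly,
|     ;; then join them. The platform-thread equivalent would
|     ;; exhaust memory or take far longer to spawn.
|     (let [threads (doall
|                    (repeatedly 100000
|                                #(Thread/startVirtualThread
|                                  (fn [] (Thread/sleep 100)))))]
|       (run! #(.join ^Thread %) threads))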
| zozbot234 wrote:
| Some operating systems implement standard OS threading APIs
| on top of a pool of true "system threads". Notably, this even
| included some early Java implementations, before a 1-to-1
| mapping between system and "user" threads became more popular
| for some reason.
| duped wrote:
| which makes a lot of sense once they have to map to logical
| cores
| gpderetta wrote:
| M:N hybrid threads were in vogue at the turn of the
| millennium. Solaris had them, some of the *BSDs had them, and
| IBM almost got them into Linux. There was also a lot of
| research on the topic (look for "scheduler activations"). They
| went out of fashion because they slowed down parallel programs
| (as opposed to highly concurrent programs) and required a lot
| of additional complexity.
|
| Then the "C10K problem" article popularized poll-based
| architectures as opposed to threaded ones, and the rest is
| history.
| zozbot234 wrote:
| Scheduler activations are not so different from the modern
| approach using e.g. io_uring or IOCP (not really poll-based).
| The original terminology says "virtual processors" as opposed
| to system-level threads, and "user threads" as opposed to
| arbitrary tasks. Either way, a task that's about to block must
| have its state saved and subsequently restored once the event
| has been handled, and these operations are managed by the
| user-level runtime, not the OS.
| sidpatil wrote:
| My understanding is that OS thread operations require the
| processor/OS to switch between user mode and kernel mode, which
| is slow. Virtual threads don't require this, since they are
| managed entirely by the user process.
| belter wrote:
| Correct. This is the main reason.
| Copenjin wrote:
| For those wondering about the "tail call" part, which would be
| more interesting for Clojure than fibers, the Loom proposal
| [1] says:
|
| > As adding the ability to manipulate call stacks to the JVM
| will undoubtedly be required, it is also the goal of this
| project to add an even lighter-weight construct that will
| allow unwinding the stack to some point and then invoke a
| method with given arguments (basically, a generalization of
| efficient tail-calls). We will call that feature unwind-and-
| invoke, or UAI. It is not the goal of this project to add an
| automatic tail-call optimization to the JVM.
|
| [1] https://cr.openjdk.java.net/~rpressler/loom/Loom-
| Proposal.ht...
| gorjusborg wrote:
| > It is not the goal of this project to add an automatic tail-
| call optimization to the JVM.
|
| Wait what? Are we getting TCO or not with Loom?
| patrec wrote:
| I interpret this as "we will add a primitive that will allow
| you to implement TCO yourself (and more), but the behavior of
| existing java or byte code won't change, and the stack will
| continue to grow, even if a call occurs in tail position".
| Presumably because it's somewhere between hard and impossible
| to retrofit TCO without changing observable behavior.
| zasdffaa wrote:
| Serious question, isn't that just a while/for/do loop? If
| people want TCO then... just write a loop, surely? I must
| be missing something.
| patrec wrote:
| You are -- non-self tail-calls. If you have mutually
| tail-recursive functions, say f1, f2, ..., f_n, you have
| two problems without TCO:
|
| 1. There is no way to rewrite this into loops by just
| modifying the insides of these functions.
|
| 2. The remedy, which is to transform these functions and
| everything that calls them into a giant loop, causes lots
| of problems:
|
| The code will become utterly unreadable (if you do this
| transform manually).
|
| Modularity is completely broken and there's no sane way
| to expose these functions individually to external
| callers.
|
| Also, even if you don't care about either of the above:
| your compiler's optimizer will probably choke on it
| (https://blog.reverberate.org/2021/04/21/musttail-
| efficient-i...).
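|
| A minimal Clojure illustration of point 1 (hypothetical
| my-even?/my-odd?, not from the thread): each tail call still
| consumes a JVM stack frame, and no rewrite confined to the
| inside of either function can fix that:
|
|     (declare my-odd?)
|
|     (defn my-even? [n]
|       (if (zero? n) true (my-odd? (dec n))))
|
|     (defn my-odd? [n]
|       (if (zero? n) false (my-even? (dec n))))
|
|     (my-even? 1000000) ;; StackOverflowError on the JVM, no TCO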
| zasdffaa wrote:
| Well, good points but I was rather thinking the compiler
| should do it as an option.
|
| Of lesser note, I don't believe I've ever come across
| mutually tail recursive functions, or the need for them,
| and although that may reflect my lack of experience in
| some areas, I guess it's not at all common? Maybe in the
| Haskell world.
| Twisol wrote:
| They make state machines leagues easier to reason about.
| Each state is its own function, and you tail-call to the
| next state (and pass along only the data it relies on, if
| you're doing things functionally).
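|
| In Clojure the usual workaround for the missing TCO is
| clojure.core/trampoline: each state returns a thunk for the
| next state instead of calling it directly. A tiny sketch (the
| states are made up):
|
|     (declare running)
|
|     (defn idle [n]
|       (if (pos? n)
|         #(running (dec n)) ;; "tail-call" the next state as a thunk
|         :done))
|
|     (defn running [n]
|       #(idle n))
|
|     (trampoline idle 100000) ;; => :done, with constant stack depth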
| patrec wrote:
| > I don't believe I've ever come across mutually tail
| recursive functions, or the need for them
|
| Well, had you read the link I posted in my reply to your
| question, you would have ;) Anyway, there are some useful
| things that are much harder to do pleasantly and efficiently
| without it.
| ithrow wrote:
| TCO will be part of Project Valhalla, not Loom.
| funcDropShadow wrote:
| No, we won't. [1] does not even mention tail-call elimination.
| Project Loom does not change the bytecode and does not change
| its semantics, i.e. we don't get guaranteed tail-call
| elimination.
|
| [1]: https://openjdk.java.net/jeps/425
| _old_dude_ wrote:
| One issue with implementing TCO is that the "security
| model"/circus of Java is based on stack frames, so you
| cannot remove stack frames without getting into trouble.
|
| The security manager has to be removed first [1]
|
| [1] https://openjdk.java.net/jeps/411
| joe_fishfish wrote:
| Tail-call optimisation on the JVM is already somewhat
| supported with an annotation / keyword (depending on
| language). My understanding is that the project will improve
| the current implementation, but you'll still need to declare
| that a function is tail-recursive; the compiler won't do it
| automatically.
| funcDropShadow wrote:
| E.g. Scala provides such an annotation, but that is
| implemented by rewriting the recursive method to a non-
| recursive method with a loop.
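|
| Clojure's analogue is the explicit recur form, which the
| compiler likewise rewrites into a loop and which, like Scala's
| annotation, only covers self-recursion:
|
|     (defn sum-to [n]
|       (loop [i 0 acc 0]
|         (if (> i n)
|           acc
|           ;; recur is compiled to a jump, so the stack stays flat
|           (recur (inc i) (+ acc i)))))
|
|     (sum-to 1000000) ;; fine; a naive self-call here would overflow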
| anamexis wrote:
| Isn't that kinda how all TCO works?
| pjmlp wrote:
| First level TCO yes, however the annotation way doesn't
| work for mutually recursive calls.
| jwesleyharding wrote:
| To clarify just a tad, the Scala compiler does TCO out of
| the box and the annotation is added only to check that
| the method is in fact so optimizable.
| pjmlp wrote:
| Unless something changed since 2.0.9 (last time I did
| anything relevant in Scala), only one level, it isn't
| clever enough to rewrite mutually recursive calls.
|
| This is rather important, especially when coming from
| languages like Scheme that have TCO as part of the language
| specification.
| dboreham wrote:
| Quick note that this:
|
| > Each thread easily uses an additional Megabytes of memory
|
| isn't quite the problem it seems because unless the thread
| actually _uses_ large amounts of stack, the memory consumed is
| mostly virtual. With today's 64-bit architectures, there is
| plenty of virtual address space to be burned.
| funcDropShadow wrote:
| But it requires changing the virtual memory mapping tables,
| i.e. it requires a context switch to the kernel and back, and
| it decreases the effectiveness of the TLB. So there is a
| performance cost, one that is hard to quantify for very
| short-lived threads and only slightly easier to quantify for
| longer-lived threads.
|
| And every argument that OS threads are good enough is in stark
| contrast to the length to which people go to avoid using them.
| I am talking about all the async libraries/patterns/language
| features people use to avoid using OS-level threads.
| native_samples wrote:
| Well, not quite. The problem is that it only takes one call
| tree with a deep stack to cause all that memory to be
| allocated (slowly, via lots of page faults), and after that
| it stays allocated. The only way to get rid of it again is to
| swap it out, which means that if you then do another deep call
| it's even slower, as the kernel doesn't know what it can
| discard and what it can't.
|
| Deep stacks are pretty common in modern software, especially
| Java software.
| spacechild1 wrote:
| Yes! I can't count how often I've read this when people talk
| about OS threads...
| didibus wrote:
| I'm a little confused about the "structured concurrency"
| example. I don't understand what its advantages are.
|
| Are virtual threads resources that need to be explicitly
| cleaned up? Can't the pool grow and shrink based on demand?
|
| In Go for example, I don't remember having to clean up fibers
| manually.
|
| For example, how is the structured concurrency example
| different from:
|
|     (defn run-concurrently []
|       (let [f1 (future (identity 2000))
|             f2 (future (prn "Starting a long running operation"))
|             f3 (future (Thread/sleep 1000))
|             f4 (future (prn "Done."))]
|         (run! deref [f1 f2 f3 f4])
|         4))
|
|     (run-concurrently)
|
| Also as an FYI, the example can be simplified by using `with-
| open`.
| comma_at wrote:
| At a bare minimum your code might behave differently. Your
| `run! deref` will wait for each future one by one. What if one
| of them throws an exception while the others are still
| running? You'd probably want the other tasks to be cancelled
| immediately. I'm not sure what Executor.close() does exactly,
| but in theory it could do this for you automatically. Not
| something I want to write by hand, for sure [1].
|
| > In Go for example, I don't remember having to clean up fibers
| manually.
|
| And that is awful if you think about it a bit. There's no
| composition happening, your go block just goes off somewhere
| without you having any control of it. It's like goto, only it
| forks first and never gives back the process handle[2].
|
| [1] https://clojureverse.org/t/missionary-new-release-with-
| strea... [2] https://vorpus.org/blog/notes-on-structured-
| concurrency-or-g...
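|
| For reference, a hedged Clojure sketch of what JEP 428's
| incubating API looks like (assuming JDK 19's
| jdk.incubator.concurrent.StructuredTaskScope, which needs
| --add-modules jdk.incubator.concurrent; names and packages may
| differ in later builds):
|
|     (import '(jdk.incubator.concurrent
|               StructuredTaskScope$ShutdownOnFailure))
|
|     (with-open [scope (StructuredTaskScope$ShutdownOnFailure.)]
|       (let [f1 (.fork scope (fn [] (Thread/sleep 1000) :slow))
|             f2 (.fork scope (fn [] :fast))]
|         (.join scope)          ;; wait for all subtasks
|         (.throwIfFailed scope) ;; rethrow the first failure; on
|                                ;; failure the scope cancels the
|                                ;; sibling subtasks
|         [(.resultNow f1) (.resultNow f2)]))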
| didibus wrote:
| Hum, cancelling the others when the first one errors is a good
| use case.
|
| From what I read, Loom's structured concurrency will not
| automatically interrupt the other tasks, though, and .close I
| think will wait for all of them to complete. They said it's
| because you might not always want the other tasks to cancel,
| so they made it more explicit.
|
| But arguably, if you had a catch in there it would be
| triggered by the first task to throw. And I can imagine a
| similar Clojure function that waits for the first future to
| return or error. Though I guess at that point one might say
| that's just implementing structured concurrency?
|
| My question was really about the need to close virtual
| threads. Are they just abusing Closeable so that .close works
| as a "wait for all to complete even when errors are thrown"
| feature, or do you actually leak resources if you don't clean
| up virtual threads you're done using?
| Matthias247 wrote:
| The advantage is that it provides guarantees that wouldn't
| otherwise be there.
|
| Say, for example, that one of those 4 tasks writes to a file.
| With structured concurrency you have a guarantee that once the
| supertask finishes, the file is no longer in use. Without
| structured concurrency you don't, and a subtask might error
| when trying to use the file, or find garbage data in it.
| Running the supertask in a loop with structured concurrency
| carries no risk of resource contention; without it, there is
| that risk.
|
| Structured concurrency also helps put a deterministic bound on
| the amount of concurrency in the system, whereas just spawning
| more background tasks can let concurrency grow without bound.
|
| > In Go for example, I don't remember having to clean up
| fibers manually.
|
| The equivalent in Go would be to wait for subtasks to
| complete, using the WaitGroup pattern, and to notify the
| subtasks that they should finish using a Context.
| didibus wrote:
| My code does wait for all futures to complete; that's what
| (run! deref [f1 f2 f3 f4]) does.
| dgb23 wrote:
| I'm hopeful for first-class support in core.async. JVM support
| for virtual threads is the last missing piece, IMO.
| Skinney wrote:
| When the JVM lands support for virtual threads, what reasons
| remain for using core.async?
|
| core.async will still be useful in ClojureScript land, though.
| dgb23 wrote:
| Channels & alts etc. are useful constructs regardless, right?
| Skinney wrote:
| Sure, but do these have APIs that work with regular
| threads?
| cr__ wrote:
| Yep. They're decoupled completely from the "go" bits in
| core.async that are meant to make up for the lack of
| green threads. All of the channel operations have
| blocking variants that work fine with regular threads
| now, and they seem likely to work fine with fibers.
| Skinney wrote:
| Didn't know that, that's great!
| pgt wrote:
| Does that mean we can use core.async channels without the
| `go` macro? How would this look?
| bcrosby95 wrote:
| I don't know if you mean something deeper, but core.async
| has <!, >!, and alts! variants that don't require a go
| block: <!!, >!!, and alts!!.
|
| You can just call them like functions, but they block the
| thread, so naively calling one in the REPL will block it
| indefinitely. E.g.:
|
|     (>!! (chan) "this will block the REPL")
|
| The idea would be to pass a channel to multiple virtual
| threads which use the blocking versions.
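|
| A minimal sketch of that (core.async's blocking ops are real;
| the worker setup and JDK 21's Thread/startVirtualThread are
| assumptions for illustration):
|
|     (require '[clojure.core.async :as async :refer [chan >!! <!!]])
|
|     (def jobs (chan))
|
|     ;; Ten virtual-thread workers, each parked on a blocking take.
|     (dotimes [_ 10]
|       (Thread/startVirtualThread
|        (fn []
|          (loop []
|            (when-some [job (<!! jobs)]
|              (prn "handled" job)
|              (recur))))))
|
|     (>!! jobs :job-1)   ;; handed to one of the workers
|     (async/close! jobs) ;; workers exit once the channel closes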
| [deleted]
| didibus wrote:
| Yes you can. Core.async already supports using normal
| threads with the `thread` macro instead of using the `go`
| macro.
|
| All code looks exactly the same otherwise.
| [deleted]
| bcrosby95 wrote:
| Although it doesn't seem to have a nice API for doing so, it
| seems like you can change the core.async executor by using
| alter-var-root.
|
| https://stackoverflow.com/questions/18779296/clojure-core-as...
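|
| A heavily hedged sketch of that idea (these are private,
| version-specific core.async internals, so treat every name
| below as an assumption that may break between releases):
|
|     (require '[clojure.core.async.impl.protocols :as impl]
|              '[clojure.core.async.impl.dispatch :as dispatch])
|     (import '(java.util.concurrent Executors ExecutorService))
|
|     ;; Run go-block dispatch on a virtual-thread-per-task pool
|     ;; (JDK 21) instead of the default fixed thread pool.
|     (let [^ExecutorService es (Executors/newVirtualThreadPerTaskExecutor)]
|       (alter-var-root #'dispatch/executor
|                       (constantly
|                        (delay
|                         (reify impl/Executor
|                           (exec [_ runnable]
|                             (.execute es runnable)))))))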
___________________________________________________________________
(page generated 2022-05-13 23:02 UTC)