[HN Gopher] Notes on Virtual Threads and Clojure
       ___________________________________________________________________
        
       Notes on Virtual Threads and Clojure
        
       Author : Borkdude
       Score  : 89 points
       Date   : 2022-05-13 09:45 UTC (13 hours ago)
        
 (HTM) web link (ales.rocks)
 (TXT) w3m dump (ales.rocks)
        
       | geokon wrote:
       | So maybe a more basic and fundamental question - but why can't OS
       | threads be as good as these virtual threads? Why do we have to
       | have two layers of thread abstraction? My first reaction is "eww"
       | quickly followed by "I'm prolly not understanding something
       | here". Why can't we make the OS handle thousands of threads?
        
         | Blackthorn wrote:
          | We can; this was inspired by a talk Google gave about their
          | own infrastructure, but it was apparently never open sourced.
        
         | verdagon wrote:
          | Context switching costs are slightly higher for OS threads,
          | but the real bottleneck is physical memory usage. An OS
          | thread starts with an entire 4 KB (the page size on most
          | OSes) of physical memory for its stack and grows by 4 KB
          | pages as needed. A virtual thread will often start with under
          | 1 KB of physical memory.
         | 
         | In a program with lots of small OS threads, that's quite a bit
         | of wasted space.
        
         | native_samples wrote:
          | Thousands of threads are not a problem for any modern OS if
          | you have enough RAM. Loom takes it up into the millions.
         | 
         | As for why the JVM can do this and you can't with other
         | runtimes, it's because it needs a complex runtime with very
         | tight integration between the compiler, the GC, the threading
         | library and various other subsystems. That's why the moment you
         | make a native call the JVM loses the ability to switch between
         | the virtual threads and has to start up more OS threads to
         | compensate - it didn't create the code that's running on the
         | stack at that point, so it can't work its usual magic.
         | 
          | Fortunately the whole 'pure Java' movement of some decades
          | ago means that you can actually do a heck of a lot of stuff
          | without using native code. The ecosystem is far less
          | dependent on C extensions than those of most other languages.
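The "millions" claim is easy to try for yourself. Below is a minimal sketch, assuming a JDK 21+ runtime (where `Thread.ofVirtual()` is a final API); the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ManyThreads {
    // Starts n virtual threads and returns how many ran to completion.
    public static int run(int n) throws InterruptedException {
        AtomicInteger done = new AtomicInteger();
        List<Thread> threads = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            // Each virtual thread starts with a tiny, growable stack,
            // so creating huge numbers of them is cheap.
            threads.add(Thread.ofVirtual().start(done::incrementAndGet));
        }
        for (Thread t : threads) t.join();
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(100_000));
    }
}
```

The same loop with `new Thread(...)` (one OS thread each) would commit orders of magnitude more memory for stacks.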
        
         | renox wrote:
          | They could: Solaris had M:N threads at some point in time,
          | but 1) they're complicated and 2) you pay a performance cost
          | even if you don't use many threads.
        
           | pjmlp wrote:
            | Called LWPs (lightweight processes), but they were
            | eventually dropped for exactly those reasons.
        
         | kerblang wrote:
         | Since nobody else brought it up: One big item is non-blocking
         | I/O. The idea of non-blocking is that you can manage thousands
         | of network connections with only a handful of threads, because
         | when you do thread-per-connection, you'll notice that your
         | threads are mostly asleep in typical I/O heavy networked
         | systems (like web-server/database stuff that most of us work
         | on). You can do non-blocking in java now, but only with nasty
         | callbacks and Futures and thread pools and such.
         | 
         | In Go you get non-blocking "for free" so that it looks like
         | blocking I/O but the runtime automatically suspends & resumes
         | virtual threads as I/O buffers fill & drain. One of the big
         | inspirations for Loom was Go jealousy. For all Go's faults,
         | this is a really nice thing to have.
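What "blocking style on virtual threads" looks like can be sketched as below, assuming JDK 21+. `fetch` is a hypothetical stand-in for a network call; the sleep simulates I/O, during which the virtual thread is parked and its carrier OS thread is freed for other work:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BlockingStyle {
    // Straight-line "blocking" code: no callbacks, no future chaining.
    // On a virtual thread, the sleep parks only the virtual thread.
    static String fetch(int id) throws InterruptedException {
        Thread.sleep(10); // stands in for blocking network I/O
        return "response-" + id;
    }

    public static String run() throws Exception {
        // One virtual thread per task; close() waits for completion.
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            Future<String> f = exec.submit(() -> fetch(42));
            return f.get();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```

The equivalent with `CompletableFuture` callbacks works today on any JDK, but splits the logic across continuation lambdas, which is the "nasty callbacks" complaint above.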
        
         | dboreham wrote:
          | The OS can handle thousands of threads easily. Kernel
          | threading runs into scaling problems at the hundreds of
          | thousands to millions level.
        
         | duped wrote:
         | I'm going to call them "green threads" in this context, don't
         | read into it because the terminology is loose.
         | 
          | - Since GT are known to a compiler/runtime, they can be
          | cooperative as opposed to preemptive. The OS cannot
          | introspect into what your code is doing, but the runtime can
          | insert suspend and resume points at API boundaries, for
          | example at any kind of I/O. While an OS could implement this
          | in theory (switch threads at I/O syscalls), in practice I
          | don't think any thread scheduler does.
         | 
          | - GT have orders of magnitude faster context switches than OS
          | threads: all they need to do is swap a stack pointer and a
          | few registers.
         | 
         | - GT often have growable stacks that start extremely small, OS
         | threads have larger default stack sizes that don't grow. This
         | makes starting and destroying GT cheaper, and they can use less
         | memory than the equivalent number of OS threads.
         | 
          | As for why OS threads don't just work the same, there are
          | some disadvantages. You can't pin a GT to a single core, or
          | set thread affinity. Context switching an OS thread is more
          | expensive because it does more, like saving the signal mask
          | (which involves a syscall, hence the cost). The large fixed
          | stack size means OS threads never have to allocate more stack
          | memory during a function call. And they have to be
          | preemptive, since the OS knows nothing about what the program
          | is doing beyond a handful of I/O syscalls.
         | 
          | They're essentially two different technologies for two
          | different use cases: OS threads are more robust and closer to
          | the metal, but expensive to start up and switch between,
          | whereas GT are small and cheap.
         | 
         | Another way to look at it is that there are _only_ OS threads,
         | and  "virtual" or "green" threads are distinct concurrent tasks
         | scheduled onto a pool of OS threads on the fly. The compiler
         | inserts a bunch of book keeping to make scheduling those tasks
         | easier. The only thing that makes them "thread like" is the
         | API.
        
           | zozbot234 wrote:
            | Some operating systems implement standard OS threading APIs
           | on top of a pool of true "system threads". Notably, this even
           | included some early Java implementations, before a 1-to-1
           | mapping between system and "user" threads became more popular
           | for some reason.
        
             | duped wrote:
             | which makes a lot of sense once they have to map to logical
             | cores
        
         | gpderetta wrote:
          | M:N hybrid threads were in vogue at the turn of the
          | millennium. Solaris had them, some of the *BSDs had them, and
          | IBM almost got them into Linux. There was also a lot of
          | research on the topic (look for "scheduler activations").
          | They went out of fashion because they slowed down parallel
          | programs (as opposed to highly concurrent programs) and
          | required a lot of additional complexity.
          | 
          | Then the "C10K problem" article popularized poll-based
          | architectures as opposed to threaded ones, and the rest is
          | history.
        
           | zozbot234 wrote:
            | Scheduler activations are not so different from the modern
            | approach, using e.g. io_uring or IOCP (not really poll-
            | based). The original terminology references "virtual
            | processors" as opposed to system-level threads, and "user
            | threads" as opposed to arbitrary tasks, but either way a
            | task that's about to block must have its state saved and
            | subsequently restored once the event has been handled, and
            | these operations are managed by the user-level runtime, not
            | the OS.
        
         | sidpatil wrote:
         | My understanding is that OS thread operations require the
         | processor/OS to switch between user mode and kernel mode, which
         | is slow. Virtual threads don't require this, since they are
         | managed entirely by the user process.
        
           | belter wrote:
           | Correct. This is the main reason.
        
       | Copenjin wrote:
        | For those wondering about the "tail call" part that would be
        | more interesting for Clojure than fibers [1]:
        | 
        |     As adding the ability to manipulate call stacks to the JVM
        |     will undoubtedly be required, it is also the goal of this
        |     project to add an even lighter-weight construct that will
        |     allow unwinding the stack to some point and then invoke a
        |     method with given arguments (basically, a generalization of
        |     efficient tail-calls). We will call that feature
        |     unwind-and-invoke, or UAI. It is not the goal of this
        |     project to add an automatic tail-call optimization to the
        |     JVM.
        | 
        | [1] https://cr.openjdk.java.net/~rpressler/loom/Loom-
        | Proposal.ht...
        
         | gorjusborg wrote:
         | > It is not the goal of this project to add an automatic tail-
         | call optimization to the JVM.
         | 
         | Wait what? Are we getting TCO or not with Loom?
        
           | patrec wrote:
           | I interpret this as "we will add a primitive that will allow
           | you to implement TCO yourself (and more), but the behavior of
           | existing java or byte code won't change, and the stack will
           | continue to grow, even if a call occurs in tail position".
           | Presumably because it's hard to impossible to retrofit TCO
           | without changing observable behavior.
        
             | zasdffaa wrote:
             | Serious question, isn't that just a while/for/do loop? If
             | people want TCO then... just write a loop, surely? I must
             | be missing something.
        
               | patrec wrote:
                | You are -- non-self tail-calls. If you have mutually
                | tail-recursive functions, say f1, f2, ..., f_n, you
                | have two problems without TCO:
               | 
               | 1. There is no way to rewrite this into loops by just
               | modifying the insides of these functions.
               | 
               | 2. The remedy, which is to transform these functions and
               | everything that calls them into a giant loop causes lots
               | of problems:
               | 
               | The code will become utterly unreadable (if you do this
               | transform manually).
               | 
               | Modularity is completely broken and there's no sane way
               | to expose these functions individually to external
               | callers.
               | 
               | Also, even if you don't care about either of the above:
               | your compiler's optimizer will probably choke on it
               | (https://blog.reverberate.org/2021/04/21/musttail-
               | efficient-i...).
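The even/odd pair is the classic example of mutual tail recursion. A minimal Java sketch (names illustrative): without TCO the mutually recursive version grows the stack linearly, while the hand-merged loop, the "remedy" described above, runs in constant stack space:

```java
public class MutualTail {
    // Mutually tail-recursive: neither function can be turned into a
    // loop by editing only its own body, since each tail-calls the other.
    public static boolean isEven(long n) { return n == 0 ? true : isOdd(n - 1); }
    public static boolean isOdd(long n)  { return n == 0 ? false : isEven(n - 1); }

    // The manual remedy: merge both functions into one loop.
    public static boolean isEvenLoop(long n) {
        boolean even = true;
        while (n > 0) { even = !even; n--; }
        return even;
    }

    public static void main(String[] args) {
        System.out.println(isEvenLoop(1_000_000)); // constant stack
        try {
            isEven(1_000_000); // ~1M frames: overflows without TCO
        } catch (StackOverflowError e) {
            System.out.println("StackOverflowError");
        }
    }
}
```

The merged loop also shows the modularity cost: `isEven` and `isOdd` no longer exist as separately callable functions.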
        
               | zasdffaa wrote:
               | Well, good points but I was rather thinking the compiler
               | should do it as an option.
               | 
               | Of lesser note, I don't believe I've ever come across
               | mutually tail recursive functions, or the need for them,
               | and although that may reflect my lack of experience in
                | some areas, I guess it's not at all common? Maybe in
                | the Haskell world, perhaps.
        
               | Twisol wrote:
               | They make state machines leagues easier to reason about.
               | Each state is its own function, and you tail-call to the
               | next state (and pass along only the data it relies on, if
               | you're doing things functionally).
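Since the JVM has no TCO, a state machine in this style is usually expressed with a trampoline: each state function returns the next state instead of tail-calling it, and a driver loop invokes states at constant stack depth. A minimal sketch, with all names illustrative:

```java
import java.util.function.Supplier;

public class StateMachine {
    // A "state" is a function that returns the next state, or null when
    // the machine is done. The driver loop below stands in for tail calls.
    interface State extends Supplier<State> {}

    static final StringBuilder log = new StringBuilder();

    static State start()   { log.append("start>");   return StateMachine::running; }
    static State running() { log.append("running>"); return StateMachine::done; }
    static State done()    { log.append("done");     return null; }

    public static String run() {
        log.setLength(0);
        State s = StateMachine::start;
        while (s != null) s = s.get(); // constant stack depth
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With real TCO each state could simply tail-call the next and the driver loop would disappear.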
        
               | patrec wrote:
               | > I don't believe I've ever come across mutually tail
               | recursive functions, or the need for them
               | 
               | Well, had you read the link I posted in my reply to your
                | question, you would have ;) Anyway, there are some
               | useful things that are much harder to do pleasantly and
               | efficiently without it.
        
           | ithrow wrote:
            | TCO will be part of Project Valhalla, not Loom.
        
           | funcDropShadow wrote:
            | No, we won't. [1] does not even mention tail-call
            | elimination. Project Loom does not change the bytecode and
            | does not change its semantics, i.e. we don't get guaranteed
            | tail-call elimination.
           | 
           | [1]: https://openjdk.java.net/jeps/425
        
           | _old_dude_ wrote:
            | One issue with implementing TCO is that the "security
            | model"/circus of Java is based on stack frames, so you
            | cannot remove stack frames without getting into trouble.
           | 
           | The security manager has to be removed first [1]
           | 
           | [1] https://openjdk.java.net/jeps/411
        
           | joe_fishfish wrote:
           | Tail-call optimisation on the JVM is already somewhat
           | supported with an annotation / keyword (depending on
           | language). My understanding is that the project will improve
           | the current implementation, but you'll still need to define
           | when a function is tail-recursive, the compiler won't do it
           | automatically.
        
             | funcDropShadow wrote:
             | E.g. Scala provides such an annotation, but that is
             | implemented by rewriting the recursive method to a non-
             | recursive method with a loop.
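The rewrite described here can be illustrated in Java: the tail-recursive form on top, and below it the loop a @tailrec-style transform would produce, with the parameters turned into loop variables (names are illustrative):

```java
public class TailRewrite {
    // Tail-recursive form: the recursive call is the last action, with
    // the running total carried in an accumulator.
    public static long sumRec(long n, long acc) {
        if (n == 0) return acc;
        return sumRec(n - 1, acc + n);
    }

    // What the annotation-driven rewrite produces: the parameters become
    // mutable loop variables and the self-call becomes a jump.
    public static long sumLoop(long n, long acc) {
        while (n != 0) { acc = acc + n; n = n - 1; }
        return acc;
    }

    public static void main(String[] args) {
        // The loop form handles inputs the recursive form cannot.
        System.out.println(sumLoop(1_000_000, 0));
    }
}
```

This rewrite only works for self-calls, which is why the annotation approach cannot handle the mutually recursive case discussed above.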
        
               | anamexis wrote:
               | Isn't that kinda how all TCO works?
        
               | pjmlp wrote:
               | First level TCO yes, however the annotation way doesn't
               | work for mutually recursive calls.
        
               | jwesleyharding wrote:
                | To clarify just a tad: the Scala compiler does TCO out
                | of the box, and the annotation is added only to check
                | that the method is in fact so optimizable.
        
               | pjmlp wrote:
               | Unless something changed since 2.0.9 (last time I did
               | anything relevant in Scala), only one level, it isn't
               | clever enough to rewrite mutually recursive calls.
               | 
                | This is rather important, especially when coming from
                | languages like Scheme that have TCO as part of the
                | language specification.
        
       | dboreham wrote:
       | Quick note that this:
       | 
       | > Each thread easily uses an additional Megabytes of memory
       | 
       | isn't quite the problem it seems because unless the thread
       | actually _uses_ large amounts of stack, the memory consumed is
       | mostly virtual. With today's 64-bit architectures, there is
       | plenty of virtual address space to be burned.
        
         | funcDropShadow wrote:
          | But it requires changing the virtual memory mapping tables,
          | i.e. it requires a context switch to the kernel and back, and
          | it decreases the effectiveness of the TLB. I.e. there is a
          | performance cost, one that is hard to quantify for very
          | short-lived threads and only slightly better quantifiable for
          | longer-lived threads.
         | 
         | And every argument that OS threads are good enough is in stark
         | contrast to the length to which people go to avoid using them.
         | I am talking about all the async libraries/patterns/language
         | features people use to avoid using OS-level threads.
        
         | native_samples wrote:
          | Well, not quite. The problem is that it only takes one call
          | tree with a deep stack to cause all that memory to be
          | allocated (slowly, with lots of page faults), and after that,
          | it stays in use. The only way to get rid of it again is to
          | swap it out, which means that if you then do another deep
          | call it's even slower, as the kernel doesn't know what it can
          | discard and what it can't.
         | 
         | Deep stacks are pretty common in modern software especially for
         | Java software.
        
         | spacechild1 wrote:
         | Yes! I can't count how often I've read this when people talk
         | about OS threads...
        
       | didibus wrote:
        | I'm a little confused about the "structured concurrency"
        | example. I don't understand what the advantages of it are.
        | 
        | Are Virtual Threads resources that need to be explicitly
        | cleaned up? Can't the pool grow and shrink based on demand?
        | 
        | In Go for example, I don't remember having to clean up fibers
        | manually.
        | 
        | For example, how is the structured concurrency example
        | different from:
        | 
        |     (defn run-concurrently []
        |       (let [f1 (future (identity 2000))
        |             f2 (future (prn "Starting a long running operation"))
        |             f3 (future (Thread/sleep 1000))
        |             f4 (future (prn "Done."))]
        |         (run! deref [f1 f2 f3 f4])
        |         4))
        | 
        |     (run-concurrently)
        | 
        | Also as an FYI, the example can be simplified by using
        | `with-open`.
        
         | comma_at wrote:
          | At a bare minimum your code might have different behavior.
          | Your `run! deref` will wait for each future one by one. What
          | if the last one throws an exception while the others are
          | still running? You'd probably want the other tasks to get
          | canceled immediately. I'm not sure what Executor.close() does
          | exactly, but in theory it
         | could do this for you automatically. Not something I want to
         | write by hand for sure[1].
         | 
         | > In Go for example, I don't remember having to clean up fibers
         | manually.
         | 
         | And that is awful if you think about it a bit. There's no
         | composition happening, your go block just goes off somewhere
         | without you having any control of it. It's like goto, only it
         | forks first and never gives back the process handle[2].
         | 
         | [1] https://clojureverse.org/t/missionary-new-release-with-
         | strea... [2] https://vorpus.org/blog/notes-on-structured-
         | concurrency-or-g...
        
           | didibus wrote:
           | Hum, handling the first to error and interrupt the others is
           | a good use case.
           | 
            | From what I read, Loom structured concurrency will not
            | automatically interrupt the other tasks, though. And .close
            | I think will wait for all of them to complete. They said
            | it's because you might not always want other tasks to
            | cancel, so they made it more explicit.
           | 
           | But arguably, maybe if you had a catch in there it would be
           | triggered on the first task to throw. But I can also imagine
            | a similar function in Clojure to wait for the first future
            | to return or error. Though I'm guessing at that point one
            | might say that's just implementing structured concurrency?
           | 
           | My question maybe was about the need to close virtual
           | threads. Are they just abusing Closeable so you can use
           | .close as a wait for all to complete even when errors are
           | thrown feature, or do you actually leak resources if you
           | don't clean up Virtual Threads you're done using?
        
         | Matthias247 wrote:
          | The advantage is that it provides guarantees that wouldn't
          | otherwise be there.
         | 
         | Let's for example say one of those 4 tasks would write to a
         | file. With structured concurrency you have a guarantee that
         | once the supertask finishes, the file is no longer in use.
         | Without structured concurrency you won't have that, and a
         | subtask might error on trying to use the file or find garbage
         | data in it. Running the supertask in a loop with structured
         | concurrency would have no risk of running into resource
         | contention, while without it there is that risk.
         | 
          | Plus, structured concurrency also helps add a deterministic
          | bound on the amount of concurrency in the system, while just
          | spawning more background tasks can lead to concurrency
          | growing in an unbounded fashion.
         | 
         | > In Go for example, I don't remember having to clean up fibers
         | manually.
         | 
         | The equivalent in Go would be to wait for subtasks to complete,
         | using the WaitGroup pattern. And to notify the subtasks that
         | they should complete using "Context".
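The "wait for subtasks" guarantee maps naturally onto try-with-resources in Java, since ExecutorService.close() (JDK 19+) waits for submitted tasks. A minimal sketch assuming JDK 21+ for the virtual-thread executor; the names are illustrative:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class StructuredWait {
    public static int run() {
        AtomicInteger finished = new AtomicInteger();
        // The implicit close() at the end of the try block waits for all
        // submitted tasks, so nothing started here can outlive the block.
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 4; i++) {
                exec.submit(() -> {
                    try { Thread.sleep(50); } catch (InterruptedException e) {}
                    finished.incrementAndGet();
                });
            }
        }
        return finished.get(); // all 4 subtasks are guaranteed done here
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Anything the subtasks hold (a file, a connection) is therefore guaranteed released by the time control leaves the try block, which is the guarantee described above.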
        
           | didibus wrote:
            | My code waits for all futures to complete; that's what
            | 
            |     (run! deref [f1 f2 f3 f4])
            | 
            | does.
        
       | dgb23 wrote:
       | I'm hopeful for first class support in core.async. JVM support
       | for virtual threads is the last missing piece IMO.
        
         | Skinney wrote:
          | When the JVM lands support for virtual threads, what reasons
          | remain for using core.async?
         | 
         | core.async will still be useful in ClojureScript land, though.
        
           | dgb23 wrote:
            | Channels & alts etc. are useful constructs regardless,
            | right?
        
             | Skinney wrote:
              | Sure, but do these have APIs that work with regular
              | threads?
        
               | cr__ wrote:
               | Yep. They're decoupled completely from the "go" bits in
               | core.async that are meant to make up for the lack of
               | green threads. All of the channel operations have
               | blocking variants that work fine with regular threads
               | now, and they seem likely to work fine with fibers.
        
               | Skinney wrote:
               | Didn't know that, that's great!
        
               | pgt wrote:
               | Does that mean we can use core.async channels without the
               | `go` macro? How would this look?
        
               | bcrosby95 wrote:
               | I don't know if you mean something deeper, but core.async
               | has <!, >!, and alts! variants that don't require a go
               | block: <!!, >!!, and alts!!.
               | 
                | You can just use them like a function, but they block
                | the thread. So if you e.g. naively call one in the REPL
                | you will block it indefinitely:
                | 
                |     (>!! (chan) "this will block the REPL")
                | 
                | The idea would be to pass a channel to multiple virtual
                | threads which use the blocking versions.
        
               | [deleted]
        
               | didibus wrote:
               | Yes you can. Core.async already supports using normal
               | threads with the `thread` macro instead of using the `go`
               | macro.
               | 
               | All code looks exactly the same otherwise.
        
               | [deleted]
        
         | bcrosby95 wrote:
         | Although it doesn't seem to have a nice API for doing so, it
         | seems like you can change the core.async executor by using
         | alter-var-root.
         | 
         | https://stackoverflow.com/questions/18779296/clojure-core-as...
        
       ___________________________________________________________________
       (page generated 2022-05-13 23:02 UTC)