[HN Gopher] Thread Pools on the JVM
___________________________________________________________________
Thread Pools on the JVM
Author : ovis
Score : 170 points
Date : 2021-07-19 15:42 UTC (7 hours ago)
(HTM) web link (gist.github.com)
(TXT) w3m dump (gist.github.com)
| 0xffff2 wrote:
| This seems like good advice in general. Is any of it really
| specific to the JVM? If I was doing thread pooling with CPU and
| IO bound tasks, I would approach threading in a similar way in
| C++.
| cogman10 wrote:
| It'll depend on whether your language has either coroutines
| or lightweight threads.
|
| Threadpooling only matters if you have neither of those things.
|
| Otherwise, you should be using one or the other over a thread
| pool. You might still spin up a threadpool for CPU bound
| operations, but you wouldn't have one dedicated to IO.
|
| As of C++ 20, there are coroutines which you should be looking
| at (IMO).
|
| https://en.cppreference.com/w/cpp/language/coroutines
| dragontamer wrote:
| Threadpools are probably better on CPU-bound (or CPU-
| ish bound tasks, like RAM-bound) without any I/O.
|
| Coroutines / Goroutines and the like are probably better on
| I/O bound tasks where the CPU-effort in task-switching is
| significant.
|
| --------
|
| For example: Matrix Multiplication is better with a
| Threadpool. Handling 1000 simultaneous connections when you
| get Slashdotted (or "Hacker News hug of death") is better
| solved with coroutines.
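|
| A minimal sketch of the threadpool side, using only the
| standard library (multiplyRowBlock and numBlocks are
| hypothetical):
|
|   import java.util.concurrent.*;
|
|   ExecutorService pool = Executors.newFixedThreadPool(
|       Runtime.getRuntime().availableProcessors());
|   // CPU-bound, no IO: one task per block of rows, and a
|   // pool sized to the core count.
|   for (int block = 0; block < numBlocks; block++) {
|       final int b = block;
|       pool.submit(() -> multiplyRowBlock(b));
|   }
|   pool.shutdown();
|   pool.awaitTermination(1, TimeUnit.HOURS);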
| cogman10 wrote:
| I agree.
|
| Coroutines MIGHT be more efficient if what you end up
| building is a state machine anyway (as that's what the
| compiler turns most coroutines into). Otherwise,
| if it's just pure parallel CPU/memory burning with few
| state transitions/dependencies, then a dedicated CPU pool
| fixed to roughly the number of CPU cores on the box will be
| the most efficient.
|
| Heck, it can often even yield benefits to "pin" certain
| tasks to a thread to keep the CPU cache filled with relevant
| data. For example, 4 threads handling the 4 quadrants of
| the matrix rather than having the next available thread
| pick up the next task.
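|
| A sketch of that pinning idea with single-thread executors
| (multiplyQuadrant is hypothetical):
|
|   // One single-thread executor per quadrant, so each
|   // quadrant's data stays hot in one core's cache.
|   ExecutorService[] lanes = new ExecutorService[4];
|   for (int i = 0; i < 4; i++)
|       lanes[i] = Executors.newSingleThreadExecutor();
|   for (int q = 0; q < 4; q++) {
|       final int quadrant = q;
|       // The same quadrant always lands on the same thread:
|       lanes[quadrant].submit(() -> multiplyQuadrant(quadrant));
|   }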
| dragontamer wrote:
| The one that gives me a headache is thinking about how to
| oversubscribe a GPU (or worse: 4 GPUs, as in the case of
| the Summit supercomputer).
|
| It's I/O to send data to and from a GPU, and therefore it's
| somewhat of an I/O-bound task. But there's also a
| significant amount of CPU work involved. Ideally, you
| want to balance CPU-work and GPU-work to maximize the
| work being done.
|
| Fortunately, CUDA streams seem like they'd mesh pretty
| well with coroutines (if enough code were there to
| support them). But if you're reaching for the "GPU-
| button", everything is compute-bound (if not, you're
| "doing it wrong"). So now you have a question of "how
| much to oversubscribe?"
|
| Then again, that's why you just make the
| oversubscription-factor a #define and then test a lot to
| find the right factor.... EDIT: Or maybe you
| oversubscribe until the GPU / CPU runs out of VRAM / RAM.
| Oversubscription isn't really an issue with coroutines
| that are executed inside of a thread-pool: you aren't
| spending any CPU-time needlessly task-switching.
| cogman10 wrote:
| And, TBF, a lot of the IO stuff comes down to what sort of
| device you are talking to and where.
|
| For a lot of the programming I do (and I'm sure a lot of
| others on HN) IO is almost all network IO. For that,
| because it's so slow and everything is working over DMA
| anyways, coroutines end up working really well.
|
| However, once you start talking about on-system resources
| such as SSDs or the GPU, it gets more tricky. As you
| rightly point out, the GPU is especially bad because all
| GPU communication ends up being routed through the CPU.
| At least for an HDD, there's DMA, which cuts down on the
| amount of CPU work that needs to be done to access a bit
| of data.
| jsmith45 wrote:
| Only stackless co-routines require the state machine
| transformation. Stackful co-routine based user mode
| threading generally just changes out the IO primitives to
| issue an asynchronous version of the operation, then
| immediately calls into the user mode scheduler to
| pick some ready-to-resume co-routine to switch the stack
| to and resume. They might include a preemption facility
| (beyond just the OS's preemption of the underlying kernel
| threads), but that is not required and is largely a
| language/runtime design decision.
|
| The big headaches with stackful co-routine based user
| mode threading come from two sources. One is allocating
| the stack. If your language requires a contiguous stack,
| then you either need to make the stacks small and risk
| running out, or make them big, which can be a problem on
| 32-bit platforms (you can run out of address space) or
| on platforms with strict commit-charge based memory
| accounting. Both can be mitigated by allowing
| non-contiguous stacks or relocatable contiguous stacks
| (to allow small stacks to grow later without headaches),
| although obviously that can have performance
| considerations.
|
| The other stackful co-routine headache is in calling
| into code from another language (i.e. FFI) which could be
| making direct blocking system calls, starving
| you of your OS threads.
|
| I do agree that in purely CPU or memory bound
| applications a classical thread pool makes better sense.
| The main advantages of either type of co-routine based
| user mode threading primarily apply to IO-heavy or mixed
| workloads.
| [deleted]
| valbaca wrote:
| > Is any of it really specific to the JVM?
|
| Not for languages with go/coroutines (e.g. Go, Clojure,
| Crystal) as those were designed specifically to help with the
| thread-per-IO constraint.
| jackcviers3 wrote:
| Author mentions Scala. Both ZIO[1] and Cats-Effect[2] provide
| fibers (coroutines) over these specific threadpool designs today,
| without the need for Project Loom, and give the user the
| capability of selecting the pool type to use without explicit
| reference. They are unusable from Java, sadly, as the schedulers
| and ExecutionContexts and runtime are implicitly provided in
| sealed companion objects and are therefore private and
| inaccessible to Java code, even when compiling with
| ScalaThenJava. Basically, you cannot run an IO from Java code.
|
| You can expose a method on the scala side to enter the IO world
| that will take your arguments and run them in the IO environment,
| returning a result to you, or notifying some Java class using
| Observer/Observable. This can, of course, take Java lambdas and
| datatypes, thus keeping your business code in Java should you so
| desire. It's clunky, though, and I wish Java had easy IO
| primitives like Scala.
|
| 1. https://github.com/zio/zio
|
| 2. https://typelevel.org/cats-effect/versions
| rzzzt wrote:
| Quasar has similar functionality:
| https://docs.paralleluniverse.co/quasar/
| cogman10 wrote:
| Fun fact, one of the primary loom devs wrote quasar.
| AzzieElbab wrote:
| That gist is from D.J. Spiewak - one of the authors of cats
| effect :)
| u678u wrote:
| With Python, at first I was scared of the GIL making
| everything single-threaded; now I'm used to it and it works
| great. Thousands of threads used to be normal for my old
| Java projects but seem crazy to me now.
| charleslmunger wrote:
| Another tip - If you have a dynamically-sized thread pool, make
| it use a minimum of two threads. Otherwise developers will get
| used to guaranteed serialization of tasks, and you'll never be
| able to change it.
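|
| A sketch of that tip with a plain ThreadPoolExecutor (the
| upper bound and keep-alive are illustrative):
|
|   ThreadPoolExecutor pool = new ThreadPoolExecutor(
|       2,                     // minimum of two threads, so
|                              // callers can't come to rely on
|                              // serialized execution
|       64,                    // grow under load (illustrative)
|       60, TimeUnit.SECONDS,  // reclaim idle extra threads
|       new SynchronousQueue<>());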
| hellectronic wrote:
| nice!
| bobbylarrybobby wrote:
| https://www.hyrumslaw.com
| elric wrote:
| > you're almost always going to have some sort of singleton
| object somewhere in your application which just has these three
| pools, pre-configured for use
|
| I'm bemused by this statement, and I can't figure out whether
| this is an assertion rooted in supreme confidence, or just idle,
| wishful thinking.
|
| That being said, giving threading advice in a virtualized and
| containerized world is tricky. And while these three categories
| seem sensible, mapping the functions of any non-trivial system
| onto them is going to be difficult, unless the system was
| specifically designed around it.
| jfoutz wrote:
| I'm wary of unbounded thread pools. Production has a funny way of
| showing that threads always consume resources. A fun example is
| file descriptors. An unexpected database reboot is often a short
| outage, but it's crazy how quickly unbounded thread pools can
| amplify errors and delay recovery.
|
| Anyway, they have their place, but if you've got a fancy chain of
| micro services calling out to wherever, think hard before putting
| those calls in an unbounded thread pool.
| sk5t wrote:
| And you should be wary! Prefer instead a bounded thread pool
| with a bounded queue of tasks waiting for service, and also
| decide explicitly what should happen when the queue fills up or
| wait times become too high (whatever "too high" means for the
| application).
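|
| That shape in plain java.util.concurrent (the sizes and the
| overflow policy are illustrative choices, not prescriptions):
|
|   ThreadPoolExecutor pool = new ThreadPoolExecutor(
|       8, 8,                             // bounded workers
|       0L, TimeUnit.MILLISECONDS,
|       new ArrayBlockingQueue<>(1_000),  // bounded backlog
|       // The explicit decision for a full queue: push work
|       // back onto the submitting caller rather than drop it.
|       new ThreadPoolExecutor.CallerRunsPolicy());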
| jeffbee wrote:
| Unbounded thread pools are bad, bounded thread pool executors
| with unbounded work queues are bad, and bounded thread pools
| with bounded queues, FIFO policies, and silent drops are also
| bad. There are many bad ways to do this.
| cogman10 wrote:
| Loom can't land fast enough!
|
| The current issue the JVM has is that all threads have a
| corresponding operating system thread. That, unfortunately, is
| really heavy memory-wise and on the OS context switcher.
|
| Loom allows Java to have threads as lightweight as a goroutine.
| It's going to change the way everything works. You might still
| have a dedicated CPU bound thread pool (the common fork join pool
| exists and probably should be used for that). But otherwise,
| you'll just spin up virtual threads and do away with all the
| consternation over how to manage thread pools and what a thread
| pool should be used for.
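|
| A sketch against the Loom early-access API (names may still
| change before it lands; connections and handle are
| hypothetical):
|
|   try (ExecutorService exec =
|           Executors.newVirtualThreadPerTaskExecutor()) {
|       for (Socket conn : connections) {
|           // Blocking IO is fine here; it parks the virtual
|           // thread, not the carrier OS thread.
|           exec.submit(() -> handle(conn));
|       }
|   }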
| jeffbee wrote:
| Are you quite certain that a (linux, nptl) thread costs more
| memory than a goroutine? You've implied that but it's not
| obviously true.
| dragontamer wrote:
| Wouldn't any linux/nptl thread require at least the
| register-state of the entire x86 (or ARM) CPU?
|
| I don't think goroutines would need such information. At its
| yield points, a goroutine's compiler knows whether "int foobar"
| is live in "rbx" or already saved on the stack; if it's on the
| stack, rbx doesn't need to be saved.
|
| ------
|
| Linux/NPTL threads don't know when they are interrupted. So
| all register state (including AVX512 state if those are being
| used) needs to be saved. AVX512 x 32 is 2kB alone.
|
| Even if AVX512 isn't being used by a thread (Linux detects
| all AVX512 registers to be all-zero), RAX through R15 is
| 128 bytes, plus SSE registers (another 128 bytes), for ~256
| bytes of space that goroutines don't need. Plus whatever
| other process-specific information needs to be saved off
| (CPU time and other such process/thread details that Linux
| needs in order to decide which thread to schedule next).
| jeffbee wrote:
| I don't think the question is dominated by machine state, I
| think it would be more of a question of stack size. They
| are demand-paged and 4k by default for native threads, 2k
| by default for goroutines but stored on a GC'd heap that
| defaults to 100% overhead, so it sounds like a wash to me.
| dragontamer wrote:
| Hmmm.
|
| It seems like you're taking this from a perspective of
| "Pthreads in C++ vs Coroutines in Go", which is correct
| in some respects, but different from how I was taking the
| discussion.
|
| I guess I was taking it from a perspective of "pthreads
| in C++ vs Go-like coroutines reimplemented in C++", which
| would be pthreads vs C++20 coroutines. (Or really: it
| seems like this "Loom" discussion is more of a Java thing
| but probably a close analog to the PThreads in C++ vs
| C++20 Coroutines)
|
| I agree with you that the garbage collector overhead
| is a big deal in practice. But it's an aspect of the
| discussion I was purposefully avoiding. But I'm also not
| the person you responded to.
| jeffbee wrote:
| Right, I admit there are better ways to do it, but I
| don't think it's obviously true that goroutines
| specifically are either more compact or faster to switch
| between. The benefits might be imaginary. The Go runtime
| has a thread scheduler that kinda sucks actually (it
| scales badly as the number of runnable goroutines
| increases) and there are also ways of making native
| threads faster, like SwitchTo
| https://lkml.org/lkml/2020/7/22/1202
| ovis wrote:
| What benefits does loom provide vs using something like cats-
| effect fibres?
| _old_dude_ wrote:
| You can actually debug the code you write because you get a
| real stacktrace, not a few frames that show the underlying
| implementation.
| Nullabillity wrote:
| On the other hand, you'll spend a lot more time debugging
| Loom code, because it reuses the same broken-by-design
| thread API.
| elygre wrote:
| What is broken-by-design about the api?
| Nullabillity wrote:
| Fundamentally, an async API is either data-oriented
| (Futures/Promises: tell me what data this task produced)
| or job-oriented (Threads: tell me when this task is
| done). You can think of it like functions vs subroutines.
|
| Since you typically care about the data produced by the
| task, threads require you to sort out your own
| backchannel for communicating this data back (such as: a
| channel, a mutexed variable, or something else).
| Unscientifically speaking, getting this backchannel wrong
| is the source of ~99% of multithreading bugs, and they
| are a huge pain to fix.
|
| You can implement futures on top of threads by using a
| thread + oneshot channel, but that requires that you know
| about it, and keep them coupled. The point of futures is
| that this becomes the default, correct-by-default API,
| unless someone goes out of their way to do it some other
| way.
|
| On the other hand, implementing threads on top of futures
| is trivial: just return an empty token value.
|
| There are also some performance implications: depending
| on your runtime it might be able to detect that future A
| is only used by future B, and fuse them into one
| scheduling unit. This becomes harder when the channels
| are decoupled from the scheduling.
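|
| The "thread + oneshot channel" construction, sketched in
| Java with CompletableFuture playing the oneshot channel:
|
|   static <T> CompletableFuture<T> spawn(Callable<T> task) {
|       CompletableFuture<T> result = new CompletableFuture<>();
|       new Thread(() -> {
|           try {
|               result.complete(task.call());
|           } catch (Throwable t) {
|               result.completeExceptionally(t);
|           }
|       }).start();
|       return result; // the data-oriented handle
|   }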
| azth wrote:
| Good points, but as far as I can tell, there's nothing
| preventing you from spawning a bunch of Loom-thread
| backed `CompletableFuture`s and waiting on them.
| Nullabillity wrote:
| True, but Loom won't really help you there, since
| CompletableFuture.runAsync already uses a pooling
| scheduler. Same for cats-effect and zio, for that matter.
|
| (And that's aside from CompletableFuture having its own
| separate problems, like the obtrude methods)
| derefr wrote:
| > already uses a pooling scheduler
|
| A _bounded_ pooling scheduler. (The
| ForkJoinPool.commonPool.)
|
| Loom, I believe, "dummies out" the
| ForkJoinPool.commonPool --
| ForkJoinTasks/CompletableFutures/etc. by default just
| execute on Loom's unbounded virtual-thread executor.
|
| (Which happens to be _built on top of_ a ForkJoinPool,
| because it's a good scheduler. Don't fix what ain't
| broke.)
| clhodapp wrote:
| Admittedly, loom will do much better but cats-effect does
| try its best within the limitations of the current JVM:
| https://typelevel.org/cats-effect/docs/2.x/guides/tracing
| ackfoobar wrote:
| For the team that I am in, I can see a huge productivity
| boost if my teammates can write in direct style instead of
| wrapping their heads around monads.
| hamandcheese wrote:
| Scala for-expressions make it pretty easy to write "direct
| style" code. Someone on the team should probably understand
| what's going on, though. I've had decent success with ZIO on
| my team, and it seems perfectly teachable/learnable.
| ackfoobar wrote:
| I am the someone who "understands what's going on". My
| experience of the knowledge transfer was not pleasant at
| all. Maybe it's my ability to explain, maybe it's my
| teammates, maybe it's ZIO having better names for
| combinators than Cats.
|
| For-comprehension does help. But the alternative is
| callback hell all the way, so that's not saying much. It
| is still clunky compared to the regular syntax.
| bestinterest wrote:
| What's the difference between goroutines and Project Loom? Is
| there any?
| _old_dude_ wrote:
| Unlike goroutines, Loom virtual threads are not preempted by
| the scheduler. I believe you may be able to explicitly
| preempt a virtual thread, but the last time I checked it was
| not part of the public API.
| vips7L wrote:
| Unless I'm misunderstanding, virtual threads are preemptive:
| https://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part1....
| _old_dude_ wrote:
| By the OS, not by the scheduler; see
| https://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part2....
| vips7L wrote:
| What about pron's comments here then?
| https://news.ycombinator.com/item?id=27885569
|
| > Second, Loom's virtual threads can also be forcibly
| preempted by the scheduler at any safepoint to implement
| time sharing
| _old_dude_ wrote:
| For me, preemption by the Java scheduler is not currently
| supported but may be added in the future; after all,
| goroutines were not preempted in the beginning in Go either.
|
| The whole quote
|
| > Second, Loom's virtual threads can also be forcibly
| preempted by the scheduler at any safepoint to implement
| time sharing. Currently, this capability isn't exposed
| because we're yet to find a use-case for it
|
| I believe it's a reference to [1], but I may be wrong.
|
| [1] https://download.java.net/java/early_access/loom/docs/api/ja...
| vips7L wrote:
| > The whole quote
|
| Sorry I was skimming! Thanks!
| cogman10 wrote:
| Terminology mostly :D
|
| I've not looked into the goroutine implementation, so I
| couldn't tell you how it compares to what I've read loom is
| doing.
|
| Loom is looking to have some extremely compact stacks, which
| means each new "virtual thread" (as they are calling them) will
| end up having mere bytes' worth of memory allocated.
|
| Another thing coming with loom that go lacks is "structured
| concurrency". It's the notion that you might have a group of
| tasks that need to finish before moving on from a method
| (rather than needing to worry about firing and forgetting
| causing odd things to happen at odd times).
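|
| Until that lands, the closest standard-library shape is
| invokeAll, which holds the calling method until the whole
| group finishes (pool is any ExecutorService; fetchA/fetchB
| are hypothetical):
|
|   List<Callable<String>> group =
|       List.of(() -> fetchA(), () -> fetchB());
|   // Blocks until every task in the group has completed:
|   List<Future<String>> results = pool.invokeAll(group);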
| jayd16 wrote:
| >structured concurrency
|
| That's good to hear. You see a lot of these Loom
| discussions talk about implicit and magical asynchronous
| execution. I was afraid fine-grained thread control would
| be left out. It's super useful if you want to interface with
| how most GUI frameworks function (i.e. a main thread), or
| important OS threads like a thread with a bound GL context
| or what have you.
| cogman10 wrote:
| Yeah, while virtual threads are the bread and butter of
| Loom, they are also adding a lot of QoL things. In
| particular, the notion of "ScopedVariables" will be a
| godsend to a lot of concurrent work I do. It's the notion
| of "I want this bit of context to be carried through from
| one thread of execution to the next".
|
| Beyond that, one thing the loom authors have suggested is
| that when you want to limit concurrency the better way to
| do that is using concurrency constructs like semaphores
| rather than relying on a fixed pool size.
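|
| A sketch of the semaphore approach (the limit of 10 and
| queryDatabase are illustrative):
|
|   Semaphore dbPermits = new Semaphore(10);
|
|   Runnable task = () -> {
|       try {
|           dbPermits.acquire(); // parks this (virtual) thread
|           try {
|               queryDatabase();
|           } finally {
|               dbPermits.release();
|           }
|       } catch (InterruptedException e) {
|           Thread.currentThread().interrupt();
|       }
|   };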
| ccday wrote:
| Not sure if it counts as structured concurrency but Go has
| the feature you describe:
| https://gobyexample.com/waitgroups
| jayd16 wrote:
| The biggest difference is probably that the JVM will support
| both OS and lightweight threads. That's really useful for
| certain things, like talking to the GPU from a single-thread
| context.
| Spivak wrote:
| You are ignoring the downside to green threads which is that
| it's cooperative. If the thread doesn't yield control back to
| the event loop then the real OS thread backing the loop is now
| stuck.
|
| Which leads to dirty things like inserting sleep 0 at the top
| of loops, and dealing with really unbalanced scheduling when
| threads don't hit yields often enough. Plus, with Loom it might
| not be obvious that some function is a yield, since it's meant
| to be transparent, so if you grab a lock and yield you make
| everyone wait until you're scheduled again.
|
| Green threads are great! I love them and they're the only real
| solution to really concurrent IO-heavy workloads, but they're
| not a panacea and they trade one kind of discipline for another.
| sudhirj wrote:
| Sleep 0 sounds like quite a hack; Go has the neater
| https://pkg.go.dev/runtime#Gosched instead, and I assume
| there will be a Java equivalent as well. And if most stdlib
| methods and all blocking methods call it, it's going to be
| pretty difficult to hang a green thread.
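|
| Java's existing analogue is Thread.yield(), a hint to the
| scheduler rather than a guarantee (stillCrunching and
| doSomeWork are hypothetical):
|
|   while (stillCrunching()) {
|       doSomeWork();
|       Thread.yield(); // let other threads get a turn
|   }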
| brokencode wrote:
| I was under the impression that Loom was implementing
| preemptable lightweight threads. Is that not the case?
| clhodapp wrote:
| I think that's not quite it:
|
| I believe that loom is implementing cooperative lightweight
| threads and simultaneously reworking all of the blocking IO
| operations in the Java standard library to include yields.
| I guess this means that you could, for example, hold an OS-
| level thread forever by writing an infinite loop that
| doesn't do any IO...
| mikepurvis wrote:
| It sounds like it is:
| https://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part1....
|
| But the other side of that is that sometimes non-preemption
| is also a desirable property-- like in JavaScript, or
| Python asyncio, knowing that you don't need to lock over
| every little manipulation of some shared data structure
| because you're never going to yield if you didn't
| explicitly await.
| Spivak wrote:
| So loom uses interesting terminology when talking about
| this. They say that they're preemptive and not cooperative
| because there's not an explicit await/yield keyword that
| you call from your code but that isn't the whole story
| because threads are only preempted when they perform IO or
| are synchronized. So you as an author can't know for sure
| where the yield points are and aren't supposed to rely on
| them but they're still there. You're not going to be
| forcefully preempted in the middle of number crunching.
|
| I think most people would consider this a surprising notion
| of preemption, where it's out of your control-ish but also
| not arbitrary like it is for OS threads. It still leads
| to basically the same problems and constraints as
| cooperative threads.
| pron wrote:
| > So loom uses interesting terminology when talking about
| this.
|
| That is a common terminology. Wikipedia says: [1]
|
| _The term preemptive multitasking is used to distinguish
| a multitasking operating system, which permits preemption
| of tasks, from a cooperative multitasking system wherein
| processes or tasks must be explicitly programmed to yield
| when they do not need system resources. ... The term
| "preemptive multitasking" is sometimes mistakenly used
| when the intended meaning is more specific, referring
| instead to the class of scheduling policies known as
| time-shared scheduling, or time-sharing._
|
| > threads are only preempted when they perform IO or are
| synchronized
|
| First, they can be preempted by any call, explicit or
| implicit, to the runtime (or any library, for that
| matter). For all you know, class loading or even Math.sin
| might include a scheduling point (although that is
| unlikely as that's a compiler intrinsic). We make no
| promises on when scheduling can occur. Not only do
| threads not explicitly yield, code cannot statically
| determine where scheduling might occur; I don't believe
| anyone can consider this "cooperative."
|
| Second, Loom's virtual threads can also be forcibly
| preempted by the scheduler at any safepoint to implement
| time sharing. Currently, this capability isn't exposed
| because we're yet to find a use-case for it (other than
| one special case that we want to address, but isn't
| urgent). If you believe you have one, please send it to
| the loom-dev mailing list.
|
| The reason it's hard to find good use cases for time
| slicing is as follows:
|
| 1. If you have only a small number of threads that are
| frequently CPU bound. In that case, just make them
| platform threads and use the OS scheduler. Loom makes it
| easy to choose which implementation you want for each
| thread.
|
| 2. If you have a great many threads, each of which can
| infrequently become CPU-bound, then the scheduler takes
| care of that with work-stealing and other scheduling
| techniques.
|
| 3. If you have a great many threads, each of which is
| _frequently_ CPU-bound, then your cores are
| oversubscribed by orders of magnitude -- recall that we're
| talking about hundreds of thousands or possibly
| millions of threads -- and no scheduling strategy can
| help you.
|
| It's possible that there could arise real-world
| situations where infrequent CPU-boundedness might affect
| responsiveness, but we'll want to see such cases before
| deciding to expose the mechanism. Even OSes don't like
| relying on time-sharing (it happens less frequently than
| people think on well-tuned servers), and putting that
| capability in the hands of programmers is an attractive
| nuisance that will more likely cause a degradation in
| performance.
|
| [1]: https://en.wikipedia.org/wiki/Preemption_(computing)#Preempt...
| cogman10 wrote:
| Yeah... this is a place where I disagree with how the
| Loom devs define "preemptive". They are basically
| defining it as "most tasks will give up control when they
| hit a blocking operation". Yet, it's been my
| understanding that preemption means the scheduler can
| stop a currently operating task from running and switch
| to something else. That's not what happens with loom.
| hn_throwaway_99 wrote:
| Agreed, but you have other single-threaded server languages
| like NodeJS which have the same problem (a new request can
| only be handled if the current request gives up control,
| usually waiting for IO) and people have figured out how to
| handle it.
|
| I see Project Loom as really providing all the benefits of
| single threaded languages like Node (i.e. tons of
| scalability), but with an easier programming model that
| threads provide as opposed to using async/await.
| neeleshs wrote:
| :) sleep 0! I was trying to see if there is a way to preempt
| stuck threads (infinite loops etc), and wrote a small while
| loop replacement:
|
|   pwhile(() -> loopPredicate, () -> { loopBody; });
|
| All it does is add a Thread.currentThread().isInterrupted()
| check to the predicate. At this point, best to switch to
| Erlang!
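|
| A sketch of that helper, assuming a
| java.util.function.BooleanSupplier predicate:
|
|   static void pwhile(BooleanSupplier predicate, Runnable body) {
|       // Stop when interrupted, even if the predicate would
|       // never turn false on its own.
|       while (!Thread.currentThread().isInterrupted()
|               && predicate.getAsBoolean()) {
|           body.run();
|       }
|   }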
| cogman10 wrote:
| Which is why the advice would be "Don't use virtual threads
| for CPU work".
|
| It just so happens that a large number of JVM users are
| working with IO bound problems. Once you start talking about
| CPU bound problems the JVM tends not to be the thing most
| people reach for.
|
| Loom doesn't remove the CPU bound solution by adding the IO
| solution. Instead, it adds a good IO solution and keeps the
| old CPU solution when needed.
|
| In fact, there's already a really good pool in the JVM for
| common CPU-bound tasks: `ForkJoinPool.commonPool()`.
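|
| Parallel streams already run on it, for example:
|
|   long sum = LongStream.range(0, 1_000_000)
|       .parallel() // executes on ForkJoinPool.commonPool()
|       .map(i -> i * i)
|       .sum();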
| saurik wrote:
| FWIW, while you are probably correct in the context of Loom--
| a specific implementation that I honestly haven't looked at
| much--you shouldn't generalize to "green threads" of all
| forms, as you not only can totally implement this well but
| Erlang does so: as you are working with bytecode and a JIT
| anyway, you instrument the code to check occasionally whether
| it should be preempted (I believe Erlang does this for every
| potentially-backward jump, which is sufficient to guarantee
| even a broken loop can be preempted).
| cbsmith wrote:
| > That, unfortunately, is really heavy memory wise and on the
| OS context switcher.
|
| So, there was a time where a broad statement like that was
| pretty solid. These days, I don't think so. The default stack
| size (on 64-bit Linux) is 1MB, and you can manipulate that to
| be smaller if you want. That's also the _virtual_ memory. The
| actual memory usage depends on your application. There was a
| time when 1MB was a lot of memory, but these days, for a lot
| of contexts, it's kind of peanuts unless you have literally
| millions of threads (and even then...). Yes, you can be more
| memory efficient, but it wouldn't necessarily help _that_ much.
| Similarly, at least in the case of blocking IO (which is
| normally why you'd have so many threads), the overhead on the
| OS context switcher isn't necessarily that significant, as most
| threads will be blocked at any given time, and you're already
| going to have a context switch from the kernel to userspace.
| Depending on circumstance, using polling IO models can lead to
| _more_ context switching, not less.
|
| There are certainly circumstances where threads significantly
| impede your application's efficiency, but if you are really in
| that situation you likely already know it. In the broad set of
| use cases though, switching from a thread-based concurrency
| model to something else isn't going to be the big win people
| think it will be.
| vbezhenar wrote:
| Your words might be true, but the world jumped on the async
| wagon a long time ago and is going all in. Nobody likes
| threads; everyone wants lightweight threads. Emulating
| lightweight threads with promises (optionally hidden behind
| async/await transformations) is very popular. So demand for
| this feature is here.
|
| I don't know why; I, personally, never needed that feature,
| and good old threads were always enough for me. It's weird
| for me to watch non-JDBC drivers with async interfaces, when
| it was common knowledge that a JDBC data source should use
| something like 10-20 threads maximum (depending on DB CPU
| count); anything more is a sign of bad database design. And
| running 10-20 threads, obviously, is not an issue.
|
| But the demand is here. And lightweight threads are probably
| a better approach than async/await transformations.
| kllrnohj wrote:
| > So, there was a time where a broad statement like that was
| pretty solid.
|
| That time is approaching 20 years old at this point, too.
| Native threads haven't been "expensive" for a very, very long
| time now.
|
| Maybe if you're in the camp of disabling overcommit it
| matters, but otherwise the application of green threads is
| definitely a specialized niche, not generally useful.
|
| > In the broad set of use cases though, switching from a
| thread-based concurrency model to something else isn't going
| to be the big win people think it will be.
|
| I'd go even further and say it'll be a net-loss in most
| cases, especially with modern complications like
| heterogeneous compute. If your use case is specifically
| spinning up thousands of threads for IO (aka, you're a server
| & nothing else), then sure. But if you aren't, there's no win
| here, just complications (like times when you _need_ native
| thread isolation for FFI reasons, like using OpenGL)
| cbsmith wrote:
| > That time is approaching 20 years old at this point, too.
| Native threads haven't been "expensive" for a very, very
| long time now.
|
| It depends on the context, but yes. I worked on stuff
| throughout the 2000's where we ran into scaling problems
| with thread based concurrency models. At the time, running
| 100,000 threads was... challenging. But yeah, by 2010 we
| were talking about the C10M problem, because the C10K
| problem wasn't a problem any more. There are some cases
| where you really do need to handle 10's or 100's of
| millions of threads, but there aren't a lot of them.
|
| > Maybe if you're in the camp of disabling overcommit it
| matters, but otherwise the application of green threads is
| definitely a specialized niche, not generally useful.
|
| Yup, but everyone is still stuck on the old mental model of
| "threads are bad", partly driven by the assumption that
| whatever is being done to handle those extreme cases is
| what one should be doing to address their own problem
| space. :-(
|
| > I'd go even further and say it'll be a net-loss in most
| cases, especially with modern complications like
| heterogeneous compute.
|
| Even more so if you're doing polling based I/O rather than
| a reactive model. The look on people's faces when I point
| out to them that there's good reason to think that for the
| scale they are working at, they'll likely get better
| performance if they just use threads to scale...
|
| It's so weird how we talk about the context switching costs
| between threads without recognizing that the thread that does
| the poll is not the same thread that processed the IO
| request in the kernel.
| user5994461 wrote:
| >>> The default stack size (on 64-bit Linux) is 1MB
|
| The default thread stack size is 8 or 10 MB on most Linux
| distributions.
|
| The exception is Alpine, which is below 1 MB.
| ori_b wrote:
| The default reserved size is 8mb. The allocated size starts
| at a page (usually 4k), and grows in page sized increments
| as you use it.
| cbsmith wrote:
| To clarify, the 1MB is the default stack size for threads
| with the JVM on 64-bit Linux.
|
| Search for "-Xss":
| https://docs.oracle.com/en/java/javase/16/docs/specs/man/jav...
| christkv wrote:
| Are we coming full circle, going back to a variant of the
| original Java green threads?
| dragonwriter wrote:
| > Are we coming full circle, going back to a variant of the
| original Java green threads?
|
| There are basically two kinds of green threads:
|
| (1) N:1, where one OS thread hosts all the application
| threads, and
|
| (2) M:N, where M application threads are hosted on N OS
| threads.
|
| Original Java (and Ruby, and lots of other systems before
| every microcomputer was a multicore parallel system) green
| threads were N:1, which provide concurrency but not
| parallelism, which is fine when your underlying system can't
| do real parallelism anyway.
|
| Wanting to take advantage of multicore systems (at least, in
| the Ruby case, for underlying native code) drove a transition
| to native threads (which you could call an N:N threading
| model, as application and OS threads are identical).
|
| But this limits the level of concurrency to the level of
| parallelism, which can be a regression compared to N:1 models
| for applications where the level of concurrency that is
| useful is greater than the level of parallelism available.
|
| What lots of newer systems are driving toward, to solve that,
| are M:N models, which can leverage all available parallelism
| but also provide a higher degree of concurrency.
| jjtheblunt wrote:
| I worked in Solaris internals for a while at Sun during the
| early java era, and Solaris threading definitely did
| multiplexing of userspace onto os, and then os onto cores.
|
| Do you have a citation (because I can't find one)
| specifying your assertion that original Java green threads
| were not analogous to Solaris user -> os -> hardware
| multiplexing?
| dragonwriter wrote:
| > Do you have a citation (because I can't find one)
| specifying your assertion that original Java green
| threads were not analogous to Solaris user -> os ->
| hardware multiplexing?
|
| I was writing from memory of second-hand after-the-fact
| recitations of the history. Doing some followup research
| prompted by your question, if I understand this document
| [0] correctly, Java initially had N:1 green threads on
| Solaris, then M:N green threads on Solaris with 1:1
| native threads on Unix and Windows.
|
| [0]
| https://docs.oracle.com/cd/E19455-01/806-3461/6jck06gqe/
| cbsmith wrote:
| Java had M:N green thread models a LOOOOONG time ago.
|
| And Linux tried M:N thread implementations specifically to
| improve thread performance.
|
| In both cases, it turned out that just using 1:1 native
| threads ended up being a net win.
| hawk_ wrote:
| I am not aware of a built-in M:N thread model in Java, even
| from long ago, at least not in a way that let you
| control N.
| truffdog wrote:
| It was Solaris only, so there is definitely an asterisk
| somewhere.
| cbsmith wrote:
| It was all very long ago, but the NGPT project did M:N
| threading on Linux
| (https://web.archive.org/web/20020408103057/http://www-124.ib...).
|
| There were also a number of M:N JVM implementations that
| were particularly popular in the soft-realtime space back
| in the early 2000's.
|
| One of the fun trends with computing is that as hardware,
| software, and applications evolve, ideas that were once
| not terribly useful suddenly become useful again. It's
| entirely possible that M:N threads for the JVM is one of
| those cases, but it's NOT a new idea.
| cbsmith wrote:
| The old JDK 1.1 Developer's Guide had a page on the
| different thread models:
| https://docs.oracle.com/cd/E19455-01/806-3461/6jck06gqk/inde...
|
| At the time, Solaris had the only "certified" JVM that
| did M:N threads, so they really liked to make a big deal
| about it.
|
| You could control N through a JNI call to
| thr_setconcurrency. Not portable, but it worked. That
| particular capability was almost always not helpful.
| hashmash wrote:
| Not quite. The original green threads were seen as more of a
| hack until Solaris supported true threads. Green threads
| could only support one CPU core, and so without a major
| redesign, it was a dead end.
| neeleshs wrote:
| This gives some color:
| https://blogs.oracle.com/javamagazine/going-inside-javas-pro...
| carimura wrote:
| and also many resources direct from the source (those that
| are working on Loom): https://inside.java/tag/loom
| AtlasBarfed wrote:
| Basically yes.
|
| Longer answer: devs back in the day couldn't really grok the
| difference between green and real threads. Java made its
| bones as an enterprise language, which can have smart
| programmers, but they are conversely not close to the metal,
| knowledge-wise. Too many devs back in the day expected a Java
| thread to be a real thread, so Java re-engineered to
| accommodate this.
|
| I think the JDK/JVM teams also viewed it as a maturation of
| the JVM to be directly using OS resources so closely across
| platforms, rather than "hacking" it with green threads.
|
| These days, our high-performance fanciness means devs are
| demanding green-thread analogues, and Go/Elixir/others seem
| superior because of them.
|
| So to remain competitive in the marketplace, Java now needs
| threads that aren't threads even though Java used to have
| threads that weren't threads.
| cogman10 wrote:
| Yes and no.
|
| The new Loom threads will be much lighter weight than the
| original Java green threads. Further, the entire IO
| infrastructure of the JVM is being reworked for Loom to make
| sure the OS doesn't block the VM's thread. What's more, Loom
| does M:N threading.
|
| Same concept, very different implementation.
| iamcreasy wrote:
| So, with Loom, can we now tell exactly in which order these
| threads were executed, as it's not up to the OS to decide
| thread execution order anymore?
___________________________________________________________________
(page generated 2021-07-19 23:00 UTC)