[HN Gopher] Thread Pools on the JVM
___________________________________________________________________
Thread Pools on the JVM
Author : ovis
Score : 170 points
Date : 2021-07-19 15:42 UTC (7 hours ago)
(HTM) web link (gist.github.com)
(TXT) w3m dump (gist.github.com)
| 0xffff2 wrote:
| This seems like good advice in general. Is any of it really
| specific to the JVM? If I was doing thread pooling with CPU and
| IO bound tasks, I would approach threading in a similar way in
| C++.
| cogman10 wrote:
| It'll depend on whether your language has either coroutines
| or lightweight threads.
|
| Threadpooling only matters if you have neither of those things.
|
| Otherwise, you should be using one or the other over a thread
| pool. You might still spin up a threadpool for CPU bound
| operations, but you wouldn't have one dedicated to IO.
|
| As of C++ 20, there are coroutines which you should be looking
| at (IMO).
|
| https://en.cppreference.com/w/cpp/language/coroutines
| dragontamer wrote:
| Threadpools are probably better on CPU-bound (or CPU-
| ish bound tasks, like RAM-bound) without any I/O.
|
| Coroutines / Goroutines and the like are probably better on
| I/O bound tasks where the CPU-effort in task-switching is
| significant.
|
| --------
|
| For example: Matrix Multiplication is better with a
| Threadpool. Handling 1000 simultaneous connections when you
| get Slashdotted (or "Hacker News hug of death") is better
| solved with coroutines.
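|
| A minimal sketch of the threadpool side, using only the
| standard library (multiplyRowBlock and numBlocks are
| hypothetical):
|
|   import java.util.concurrent.*;
|
|   ExecutorService pool = Executors.newFixedThreadPool(
|       Runtime.getRuntime().availableProcessors());
|   // CPU-bound, no IO: one task per block of rows, and a
|   // pool sized to the core count.
|   for (int block = 0; block < numBlocks; block++) {
|       final int b = block;
|       pool.submit(() -> multiplyRowBlock(b));
|   }
|   pool.shutdown();
|   pool.awaitTermination(1, TimeUnit.HOURS);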
| cogman10 wrote:
| I agree.
|
| Coroutines MIGHT be more efficient if what you end up
| building is a state machine anyway (as that's what the
| compiler turns most coroutines into). Otherwise,
| if it's just pure parallel CPU/memory burning with few
| state transitions/dependencies, then a dedicated CPU pool
| fixed to roughly the number of CPU cores on the box will be
| the most efficient.
|
| Heck, it can often even yield benefits to "pin" certain
| tasks to a thread to keep the CPU cache filled with relevant
| data. For example, 4 threads handling the 4 quadrants of
| the matrix rather than having the next available thread
| pick up the next task.
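|
| A sketch of that pinning idea with single-thread executors
| (multiplyQuadrant is hypothetical):
|
|   // One single-thread executor per quadrant, so each
|   // quadrant's data stays hot in one core's cache.
|   ExecutorService[] lanes = new ExecutorService[4];
|   for (int i = 0; i < 4; i++)
|       lanes[i] = Executors.newSingleThreadExecutor();
|   for (int q = 0; q < 4; q++) {
|       final int quadrant = q;
|       // The same quadrant always lands on the same thread:
|       lanes[quadrant].submit(() -> multiplyQuadrant(quadrant));
|   }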
| dragontamer wrote:
| The one that gives me a headache is thinking about how to
| oversubscribe a GPU (or worse: 4 GPUs, as in the case of
| the Summit supercomputer).
|
| It's I/O to send data to and from a GPU, and therefore it's
| somewhat of an I/O-bound task. But there's also a
| significant amount of CPU work involved. Ideally, you
| want to balance CPU-work and GPU-work to maximize the
| work being done.
|
| Fortunately, CUDA streams seem like they'd mesh pretty
| well with coroutines (if enough code were there to
| support them). But if you're reaching for the "GPU-
| button", everything is compute-bound (if not, you're
| "doing it wrong"). So now you have a question of "how
| much to oversubscribe?"
|
| Then again, that's why you just make the
| oversubscription-factor a #define and then test a lot to
| find the right factor.... EDIT: Or maybe you
| oversubscribe until the GPU / CPU runs out of VRAM / RAM.
| Oversubscription isn't really an issue with coroutines
| that are executed inside of a thread-pool: you aren't
| spending any CPU-time needlessly task-switching.
| cogman10 wrote:
| And, TBF, a lot of the IO stuff comes down to what sort of
| device you are talking to and where.
|
| For a lot of the programming I do (and I'm sure a lot of
| others on HN) IO is almost all network IO. For that,
| because it's so slow and everything is working over DMA
| anyways, coroutines end up working really well.
|
| However, once you start talking about on-system resources
| such as SSDs or the GPU, it gets more tricky. As you
| rightly point out, the GPU is especially bad because all
| GPU communication ends up being routed through the CPU.
| At least for an HDD, there's DMA, which cuts down on the
| amount of CPU work that needs to be done to access a bit
| of data.
| jsmith45 wrote:
| Only stackless co-routines require the state machine
| transformation. Stackful co-routine based user mode
| threading generally just changes out the IO primitives to
| issue an asynchronous version of the operation, then
| immediately calls into the user mode scheduler to
| pick some ready-to-resume co-routine to switch the stack
| to and resume. They might include a preemption facility
| (beyond just the OS's preemption of the underlying kernel
| threads), but that is not required and is largely a
| language/runtime design decision.
|
| The big headaches with stackful co-routine based user
| mode threading come from two sources. One is allocating
| the stack. If your language requires a contiguous stack,
| then you either need to make the stacks small and risk
| running out, or make them big, which can be a problem on
| 32-bit platforms (you can run out of address space) or
| on platforms with strict commit-charge based memory
| accounting. Both can be mitigated by allowing
| non-contiguous stacks or relocatable contiguous stacks
| (to allow small stacks to grow later without headaches),
| although obviously that can have performance
| considerations.
|
| The other stackful co-routine headache is in calling
| into code from another language (i.e. FFI) which could be
| making direct blocking system calls, starving
| you of your OS threads.
|
| I do agree that in purely CPU or memory bound
| applications a classical thread pool makes better sense.
| The main advantages of either type of co-routine based
| user mode threading primarily apply to IO-heavy or mixed
| workloads.
| [deleted]
| valbaca wrote:
| > Is any of it really specific to the JVM?
|
| Not for languages with go/coroutines (e.g. Go, Clojure,
| Crystal) as those were designed specifically to help with the
| thread-per-IO constraint.
| jackcviers3 wrote:
| Author mentions Scala. Both ZIO[1] and Cats-Effect[2] provide
| fibers (coroutines) over these specific threadpool designs today,
| without the need for Project Loom, and give the user the
| capability of selecting the pool type to use without explicit
| reference. They are unusable from Java, sadly, as the schedulers
| and ExecutionContexts and runtime are implicitly provided in
| sealed companion objects and are therefore private and
| inaccessible to Java code, even when compiling with
| ScalaThenJava. Basically, you cannot run an IO from Java code.
|
| You can expose a method on the scala side to enter the IO world
| that will take your arguments and run them in the IO environment,
| returning a result to you, or notifying some Java class using
| Observer/Observable. This can, of course, take Java lambdas and
| datatypes, thus keeping your business code in Java should you so
| desire. It's clunky, though, and I wish Java had easy IO
| primitives like Scala.
|
| 1. https://github.com/zio/zio
|
| 2. https://typelevel.org/cats-effect/versions
| rzzzt wrote:
| Quasar has similar functionality:
| https://docs.paralleluniverse.co/quasar/
| cogman10 wrote:
| Fun fact, one of the primary loom devs wrote quasar.
| AzzieElbab wrote:
| That gist is from D.J. Spiewak - one of the authors of cats
| effect :)
| u678u wrote:
| With Python, at first I was scared of the GIL making
| everything single-threaded; now I'm used to it and it works
| great. Thousands of threads used to be normal for my old
| Java projects but seem crazy to me now.
| charleslmunger wrote:
| Another tip - If you have a dynamically-sized thread pool, make
| it use a minimum of two threads. Otherwise developers will get
| used to guaranteed serialization of tasks, and you'll never be
| able to change it.
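|
| A sketch of that tip with a plain ThreadPoolExecutor (the
| upper bound and keep-alive are illustrative):
|
|   ThreadPoolExecutor pool = new ThreadPoolExecutor(
|       2,                     // minimum of two threads, so
|                              // callers can't come to rely on
|                              // serialized execution
|       64,                    // grow under load (illustrative)
|       60, TimeUnit.SECONDS,  // reclaim idle extra threads
|       new SynchronousQueue<>());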
| hellectronic wrote:
| nice!
| bobbylarrybobby wrote:
| https://www.hyrumslaw.com
| elric wrote:
| > you're almost always going to have some sort of singleton
| object somewhere in your application which just has these three
| pools, pre-configured for use
|
| I'm bemused by this statement, and I can't figure out whether
| this is an assertion rooted in supreme confidence, or just idle,
| wishful thinking.
|
| That being said, giving threading advice in a virtualized and
| containerized world is tricky. And while these three categories
| seem sensible, mapping the functions of any non-trivial system
| onto them is going to be difficult, unless the system was
| specifically designed around it.
| jfoutz wrote:
| I'm wary of unbounded thread pools. Production has a funny way of
| showing that threads always consume resources. A fun example is
| file descriptors. An unexpected database reboot is often a short
| outage, but it's crazy how quickly unbounded thread pools can
| amplify errors and delay recovery.
|
| Anyway, they have their place, but if you've got a fancy chain of
| micro services calling out to wherever, think hard before putting
| those calls in an unbounded thread pool.
| sk5t wrote:
| And you should be wary! Prefer instead a bounded thread pool
| with a bounded queue of tasks waiting for service, and also
| decide explicitly what should happen when the queue fills up or
| wait times become too high (whatever "too high" means for the
| application).
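|
| That shape in plain java.util.concurrent (the sizes and the
| overflow policy are illustrative choices, not prescriptions):
|
|   ThreadPoolExecutor pool = new ThreadPoolExecutor(
|       8, 8,                             // bounded workers
|       0L, TimeUnit.MILLISECONDS,
|       new ArrayBlockingQueue<>(1_000),  // bounded backlog
|       // The explicit decision for a full queue: push work
|       // back onto the submitting caller rather than drop it.
|       new ThreadPoolExecutor.CallerRunsPolicy());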
| jeffbee wrote:
| Unbounded thread pools are bad, bounded thread pool executors
| with unbounded work queues are bad, and bounded thread pools
| with bounded queues, FIFO policies, and silent drops are also
| bad. There are many bad ways to do this.
| cogman10 wrote:
| Loom can't land fast enough!
|
| The current issue the JVM has is that all threads have a
| corresponding operating system thread. That, unfortunately, is
| really heavy memory-wise and on the OS context switcher.
|
| Loom allows Java to have threads as lightweight as a goroutine.
| It's going to change the way everything works. You might still
| have a dedicated CPU bound thread pool (the common fork join pool
| exists and probably should be used for that). But otherwise,
| you'll just spin up virtual threads and do away with all the
| consternation over how to manage thread pools and what a thread
| pool should be used for.
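|
| A sketch against the Loom early-access API (names may still
| change before it lands; connections and handle are
| hypothetical):
|
|   try (ExecutorService exec =
|           Executors.newVirtualThreadPerTaskExecutor()) {
|       for (Socket conn : connections) {
|           // Blocking IO is fine here; it parks the virtual
|           // thread, not the carrier OS thread.
|           exec.submit(() -> handle(conn));
|       }
|   }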
| jeffbee wrote:
| Are you quite certain that a (linux, nptl) thread costs more
| memory than a goroutine? You've implied that but it's not
| obviously true.
| dragontamer wrote:
| Wouldn't any linux/nptl thread require at least the
| register-state of the entire x86 (or ARM) CPU?
|
| I don't think goroutines would need such information. At its
| yield points, a goroutine's compiler knows whether "int foobar"
| is live in "rbx" or already saved on the stack; if it's on the
| stack, rbx doesn't need to be saved.
|
| ------
|
| Linux/NPTL threads don't know when they are interrupted. So
| all register state (including AVX512 state if those are being
| used) needs to be saved. AVX512 x 32 is 2kB alone.
|
| Even if AVX512 isn't being used by a thread (Linux detects
| all AVX512 registers to be all-zero), RAX through R15 is
| 128 bytes, plus SSE registers (another 128 bytes), for ~256
| bytes of space that goroutines don't need. Plus whatever
| other process-specific information needs to be saved off
| (CPU time and other such process/thread details that Linux
| needs in order to decide which thread to schedule next).
| jeffbee wrote:
| I don't think the question is dominated by machine state, I
| think it would be more of a question of stack size. They
| are demand-paged and 4k by default for native threads, 2k
| by default for goroutines but stored on a GC'd heap that
| defaults to 100% overhead, so it sounds like a wash to me.
| dragontamer wrote:
| Hmmm.
|
| It seems like you're taking this from a perspective of
| "Pthreads in C++ vs Coroutines in Go", which is correct
| in some respects, but different from how I was taking the
| discussion.
|
| I guess I was taking it from a perspective of "pthreads
| in C++ vs Go-like coroutines reimplemented in C++", which
| would be pthreads vs C++20 coroutines. (Or really: it
| seems like this "Loom" discussion is more of a Java thing
| but probably a close analog to the PThreads in C++ vs
| C++20 Coroutines)
|
| I agree with you that the garbage collector overhead
| is a big deal in practice. But it's an aspect of the
| discussion I was purposefully avoiding. But I'm also not
| the person you responded to.
| jeffbee wrote:
| Right, I admit there are better ways to do it, but I
| don't think it's obviously true that goroutines
| specifically are either more compact or faster to switch
| between. The benefits might be imaginary. The Go runtime
| has a thread scheduler that kinda sucks actually (it
| scales badly as the number of runnable goroutines
| increases) and there are also ways of making native
| threads faster, like SwitchTo
| https://lkml.org/lkml/2020/7/22/1202
| ovis wrote:
| What benefits does loom provide vs using something like cats-
| effect fibres?
| _old_dude_ wrote:
| You can actually debug the code you write because you get a
| real stacktrace, not a few frames that show the underlying
| implementation.
| Nullabillity wrote:
| On the other hand, you'll spend a lot more time debugging
| Loom code, because it reuses the same broken-by-design
| thread API.
| elygre wrote:
| What is broken-by-design about the api?
| Nullabillity wrote:
| Fundamentally, an async API is either data-oriented
| (Futures/Promises: tell me what data this task produced)
| or job-oriented (Threads: tell me when this task is
| done). You can think of it like functions vs subroutines.
|
| Since you typically care about the data produced by the
| task, threads require you to sort out your own
| backchannel for communicating this data back (such as: a
| channel, a mutexed variable, or something else).
| Unscientifically speaking, getting this backchannel wrong
| is the source of ~99% of multithreading bugs, and they
| are a huge pain to fix.
|
| You can implement futures on top of threads by using a
| thread + oneshot channel, but that requires that you know
| about it, and keep them coupled. The point of futures is
| that this becomes the default, correct-by-default API,
| unless someone goes out of their way to do it some other
| way.
|
| On the other hand, implementing threads on top of futures
| is trivial: just return an empty token value.
|
| There are also some performance implications: depending
| on your runtime it might be able to detect that future A
| is only used by future B, and fuse them into one
| scheduling unit. This becomes harder when the channels
| are decoupled from the scheduling.
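|
| The "thread + oneshot channel" construction, sketched in
| Java with CompletableFuture playing the oneshot channel:
|
|   static <T> CompletableFuture<T> spawn(Callable<T> task) {
|       CompletableFuture<T> result = new CompletableFuture<>();
|       new Thread(() -> {
|           try {
|               result.complete(task.call());
|           } catch (Throwable t) {
|               result.completeExceptionally(t);
|           }
|       }).start();
|       return result; // the data-oriented handle
|   }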
| azth wrote:
| Good points, but as far as I can tell, there's nothing
| preventing you from spawning a bunch of Loom-thread
| backed `CompletableFuture`s and waiting on them.
| Nullabillity wrote:
| True, but Loom won't really help you there, since
| CompletableFuture.runAsync already uses a pooling
| scheduler. Same for cats-effect and zio, for that matter.
|
| (And that's aside from CompletableFuture having its own
| separate problems, like the obtrude methods)
| derefr wrote:
| > already uses a pooling scheduler
|
| A _bounded_ pooling scheduler. (The
| ForkJoinPool.commonPool.)
|
| Loom, I believe, "dummies out" the
| ForkJoinPool.commonPool --
| ForkJoinTasks/CompletableFutures/etc. by default just
| execute on Loom's unbounded virtual-thread executor.
|
| (Which happens to be _built on top of_ a ForkJoinPool,
| because it's a good scheduler. Don't fix what ain't
| broke.)
| clhodapp wrote:
| Admittedly, loom will do much better but cats-effect does
| try its best within the limitations of the current JVM:
| https://typelevel.org/cats-effect/docs/2.x/guides/tracing
| ackfoobar wrote:
| For the team that I am in, I can see a huge productivity
| boost if my teammates can write in direct style instead of
| wrapping their heads around monads.
| hamandcheese wrote:
| Scala for-expressions make it pretty easy to write "direct
| style" code. Someone on the team should probably understand
| what's going on, though. I've had decent success with ZIO on
| my team, and it seems perfectly teachable/learnable.
| ackfoobar wrote:
| I am the someone who "understands what's going on". My
| experience of the knowledge transfer was not pleasant at
| all. Maybe it's my ability to explain, maybe it's my
| teammates, maybe it's ZIO having better names for
| combinators than Cats.
|
| For-comprehension does help. But the alternative is
| callback hell all the way, so that's not saying much. It
| is still clunky compared to the regular syntax.
| bestinterest wrote:
| What's the difference between goroutines and Project Loom? Is
| there any?
| _old_dude_ wrote:
| Unlike goroutines, Loom virtual threads are not preempted by
| the scheduler. I believe you may be able to explicitly
| preempt a virtual thread, but the last time I checked it was
| not part of the public API.
| vips7L wrote:
| Unless I'm misunderstanding, virtual threads are preemptive:
| https://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part1....
| _old_dude_ wrote:
| By the OS, not by the scheduler; see
| https://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part2....
| vips7L wrote:
| What about pron's comments here then?
| https://news.ycombinator.com/item?id=27885569
|
| > Second, Loom's virtual threads can also be forcibly
| preempted by the scheduler at any safepoint to implement
| time sharing
| _old_dude_ wrote:
| For me, preemption by the Java scheduler is not currently
| supported but may be added in the future; after all,
| goroutines were not preempted in the beginning in Go either.
|
| The whole quote
|
| > Second, Loom's virtual threads can also be forcibly
| preempted by the scheduler at any safepoint to implement
| time sharing. Currently, this capability isn't exposed
| because we're yet to find a use-case for it
|
| I believe it's a reference to [1], but I may be wrong.
|
| [1] https://download.java.net/java/early_access/loom/docs/api/ja...
| vips7L wrote:
| > The whole quote
|
| Sorry I was skimming! Thanks!
| cogman10 wrote:
| Terminology mostly :D
|
| I've not looked into the goroutine implementation, so I
| couldn't tell you how it compares to what I've read loom is
| doing.
|
| Loom is looking to have some extremely compact stacks, which
| means each new "virtual thread" (as they are calling them) will
| end up having mere bytes' worth of memory allocated.
|
| Another thing coming with loom that go lacks is "structured
| concurrency". It's the notion that you might have a group of
| tasks that need to finish before moving on from a method
| (rather than needing to worry about firing and forgetting
| causing odd things to happen at odd times).
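|
| Until that lands, the closest standard-library shape is
| invokeAll, which holds the calling method until the whole
| group finishes (pool is any ExecutorService; fetchA/fetchB
| are hypothetical):
|
|   List<Callable<String>> group =
|       List.of(() -> fetchA(), () -> fetchB());
|   // Blocks until every task in the group has completed:
|   List<Future<String>> results = pool.invokeAll(group);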
| jayd16 wrote:
| >structured concurrency
|
| That's good to hear. You see a lot of these Loom
| discussions talk about implicit and magical asynchronous
| execution. I was afraid fine-grained thread control would
| be left out. It's super useful if you want to interface with
| how most GUI frameworks function (i.e. a main thread), or
| important OS threads like a thread with a bound GL context
| or what have you.
| cogman10 wrote:
| Yeah, while virtual threads are the bread and butter of
| Loom, they are also adding a lot of QoL things. In
| particular, the notion of "ScopedVariables" will be a
| godsend to a lot of concurrent work I do. It's the notion
| of "I want this bit of context to be carried through from
| one thread of execution to the next".
|
| Beyond that, one thing the loom authors have suggested is
| that when you want to limit concurrency the better way to
| do that is using concurrency constructs like semaphores
| rather than relying on a fixed pool size.
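|
| A sketch of the semaphore approach (the limit of 10 and
| queryDatabase are illustrative):
|
|   Semaphore dbPermits = new Semaphore(10);
|
|   Runnable task = () -> {
|       try {
|           dbPermits.acquire(); // parks this (virtual) thread
|           try {
|               queryDatabase();
|           } finally {
|               dbPermits.release();
|           }
|       } catch (InterruptedException e) {
|           Thread.currentThread().interrupt();
|       }
|   };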
| ccday wrote:
| Not sure if it counts as structured concurrency but Go has
| the feature you describe:
| https://gobyexample.com/waitgroups
| jayd16 wrote:
| The biggest difference is probably that the JVM will support
| both OS and lightweight threads. That's really useful for
| certain things, like talking to the GPU from a single-thread
| context.
| Spivak wrote:
| You are ignoring the downside to green threads which is that
| it's cooperative. If the thread doesn't yield control back to
| the event loop then the real OS thread backing the loop is now
| stuck.
|
| Which leads to dirty things like inserting sleep 0 at the top
| of loops, and dealing with really unbalanced scheduling when
| threads don't hit yields often enough. Plus, with Loom it might
| not be obvious that some function is a yield, since it's meant
| to be transparent, so if you grab a lock and yield you make
| everyone wait until you're scheduled again.
|
| Green threads are great! I love them and they're the only real
| solution to really concurrent IO-heavy workloads, but they're
| not a panacea and they trade one kind of discipline for another.
| sudhirj wrote:
| Sleep 0 sounds like quite a hack; Go has the neater
| https://pkg.go.dev/runtime#Gosched instead, and I assume
| there will be a Java equivalent as well. And if most stdlib
| methods and all blocking methods call it, it's going to be
| pretty difficult to hang a green thread.
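|
| Java's existing analogue is Thread.yield(), a hint to the
| scheduler rather than a guarantee (stillCrunching and
| doSomeWork are hypothetical):
|
|   while (stillCrunching()) {
|       doSomeWork();
|       Thread.yield(); // let other threads get a turn
|   }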
| brokencode wrote:
| I was under the impression that Loom was implementing
| preemptable lightweight threads. Is that not the case?
| clhodapp wrote:
| I think that's not quite it:
|
| I believe that loom is implementing cooperative lightweight
| threads and simultaneously reworking all of the blocking IO
| operations in the Java standard library to include yields.
| I guess this means that you could, for example, hold an OS-
| level thread forever by writing an infinite loop that
| doesn't do any IO...
| mikepurvis wrote:
| It sounds like it is:
| https://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part1....
|
| But the other side of that is that sometimes non-preemption
| is also a desirable property-- like in JavaScript, or
| Python asyncio, knowing that you don't need to lock over
| every little manipulation of some shared data structure
| because you're never going to yield if you didn't
| explicitly await.
| Spivak wrote:
| So loom uses interesting terminology when talking about
| this. They say that they're preemptive and not cooperative
| because there's not an explicit await/yield keyword that
| you call from your code but that isn't the whole story
| because threads are only preempted when they perform IO or
| are synchronized. So you as an author can't know for sure
| where the yield points are and aren't supposed to rely on
| them but they're still there. You're not going to be
| forcefully preempted in the middle of number crunching.
|
| I think most people would consider this a surprising notion
| of preemption, where it's out of your control-ish but also
| not arbitrary like it is for OS threads. It still leads
| to basically the same problems and constraints as
| cooperative threads.
| pron wrote:
| > So loom uses interesting terminology when talking about
| this.
|
| That is a common terminology. Wikipedia says: [1]
|
| _The term preemptive multitasking is used to distinguish
| a multitasking operating system, which permits preemption
| of tasks, from a cooperative multitasking system wherein
| processes or tasks must be explicitly programmed to yield
| when they do not need system resources. ... The term
| "preemptive multitasking" is sometimes mistakenly used
| when the intended meaning is more specific, referring
| instead to the class of scheduling policies known as
| time-shared scheduling, or time-sharing._
|
| > threads are only preempted when they perform IO or are
| synchronized
|
| First, they can be preempted by any call, explicit or
| implicit, to the runtime (or any library, for that
| matter). For all you know, class loading or even Math.sin
| might include a scheduling point (although that is
| unlikely as that's a compiler intrinsic). We make no
| promises on when scheduling can occur. Not only do
| threads not explicitly yield, code cannot statically
| determine where scheduling might occur; I don't believe
| anyone can consider this "cooperative."
|
| Second, Loom's virtual threads can also be forcibly
| preempted by the scheduler at any safepoint to implement
| time sharing. Currently, this capability isn't exposed
| because we're yet to find a use-case for it (other than
| one special case that we want to address, but isn't
| urgent). If you believe you have one, please send it to
| the loom-dev mailing list.
|
| The reason it's hard to find good use cases for time
| slicing is as follows:
|
| 1. If you have only a small number of threads that are
| frequently CPU bound. In that case, just make them
| platform threads and use the OS scheduler. Loom makes it
| easy to choose which implementation you want for each
| thread.
|
| 2. If you have a great many threads, each of which can
| infrequently become CPU-bound, then the scheduler takes
| care of that with work-stealing and other scheduling
| techniques.
|
| 3. If you have a great many threads, each of which is
| _frequently_ CPU-bound, then your cores are
| oversubscribed by orders of magnitude -- recall that we're
| talking about hundreds of thousands or possibly
| millions of threads -- and no scheduling strategy can
| help you.
|
| It's possible that there could arise real-world
| situations where infrequent CPU-boundedness might affect
| responsiveness, but we'll want to see such cases before
| deciding to expose the mechanism. Even OSes don't like
| relying on time-sharing (it happens less frequently than
| people think on well-tuned servers), and putting that
| capability in the hands of programmers is an attractive
| nuisance that will more likely cause a degradation in
| performance.
|
| [1]: https://en.wikipedia.org/wiki/Preemption_(computing)#Preempt...
| cogman10 wrote:
| Yeah... this is a place where I disagree with how the
| Loom devs define "preemptive". They are basically
| defining it as "most tasks will give up control when they
| hit a blocking operation". Yet, it's been my
| understanding that preemption means the scheduler can
| stop a currently operating task from running and switch
| to something else. That's not what happens with loom.
| hn_throwaway_99 wrote:
| Agreed, but you have other single-threaded server languages
| like NodeJS which have the same problem (a new request can
| only be handled if the current request gives up control,
| usually waiting for IO) and people have figured out how to
| handle it.
|
| I see Project Loom as really providing all the benefits of
| single threaded languages like Node (i.e. tons of
| scalability), but with an easier programming model that
| threads provide as opposed to using async/await.
| neeleshs wrote:
| :) sleep 0! I was trying to see if there is a way to preempt
| stuck threads (infinite loops etc), and wrote a small while
| loop replacement:
|
|   pwhile(() -> loopPredicate, () -> { loopBody; });
|
| All it does is add a Thread.currentThread().isInterrupted()
| check to the predicate. At this point, best to switch to
| Erlang!
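|
| A sketch of that helper, assuming a
| java.util.function.BooleanSupplier predicate:
|
|   static void pwhile(BooleanSupplier predicate, Runnable body) {
|       // Stop when interrupted, even if the predicate would
|       // never turn false on its own.
|       while (!Thread.currentThread().isInterrupted()
|               && predicate.getAsBoolean()) {
|           body.run();
|       }
|   }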
| cogman10 wrote:
| Which is why the advice would be "Don't use virtual threads
| for CPU work".
|
| It just so happens that a large number of JVM users are
| working with IO bound problems. Once you start talking about
| CPU bound problems the JVM tends not to be the thing most
| people reach for.
|
| Loom doesn't remove the CPU bound solution by adding the IO
| solution. Instead, it adds a good IO solution and keeps the
| old CPU solution when needed.
|
| In fact, there's already a really good pool in the JVM for
| common CPU-bound tasks: `ForkJoinPool.commonPool()`.
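|
| Parallel streams already run on it, for example:
|
|   long sum = LongStream.range(0, 1_000_000)
|       .parallel() // executes on ForkJoinPool.commonPool()
|       .map(i -> i * i)
|       .sum();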
| saurik wrote:
| FWIW, while you are probably correct in the context of Loom--
| a specific implementation that I honestly haven't looked at
| much--you shouldn't generalize to "green threads" of all
| forms, as you not only can totally implement this well but
| Erlang does so: as you are working with bytecode and a JIT
| anyway, you instrument the code to check occasionally whether
| it should be preempted (I believe Erlang does this for every
| potentially-backward jump, which is sufficient to guarantee
| even a broken loop can be preempted).
| cbsmith wrote:
| > That, unfortunately, is really heavy memory wise and on the
| OS context switcher.
|
| So, there was a time where a broad statement like that was
| pretty solid. These days, I don't think so. The default stack
| size (on 64-bit Linux) is 1MB, and you can manipulate that to
| be smaller if you want. That's also the _virtual_ memory. The
| actual memory usage depends on your application. There was a
| time when 1MB was a lot of memory, but these days, for a lot
| of contexts, it's kind of peanuts unless you have literally
| millions of threads (and even then...). Yes, you can be more
| memory efficient, but it wouldn't necessarily help _that_ much.
| Similarly, at least in the case of blocking IO (which is
| normally why you'd have so many threads), the overhead on the
| OS context switcher isn't necessarily that significant, as most
| threads will be blocked at any given time, and you're already
| going to have a context switch from the kernel to userspace.
| Depending on circumstance, using polling IO models can lead to
| _more_ context switching, not less.
|
| There are certainly circumstances where threads significantly
| impede your application's efficiency, but if you are really in
| that situation you likely already know it. In the broad set of
| use cases though, switching from a thread-based concurrency
| model to something else isn't going to be the big win people
| think it will be.
| vbezhenar wrote:
| Your words might be true, but the world jumped on the async
| wagon a long time ago and is going all in. Nobody likes
| threads; everyone wants lightweight threads. Emulating
| lightweight threads with promises (optionally hidden behind
| async/await transformations) is very popular. So demand for
| this feature is here.
|
| I don't know why; I, personally, never needed that feature,
| and good old threads were always enough for me. It's weird
| for me to watch non-JDBC drivers with async interfaces, when
| it was common knowledge that a JDBC data source should use
| something like 10-20 threads maximum (depending on DB CPU
| count); anything more is a sign of bad database design. And
| running 10-20 threads, obviously, is not an issue.
|
| But the demand is here. And lightweight threads are probably
| a better approach than async/await transformations.
| kllrnohj wrote:
| > So, there was a time where a broad statement like that was
| pretty solid.
|
| That time is approaching 20 years old at this point, too.
| Native threads haven't been "expensive" for a very, very long
| time now.
|
| Maybe if you're in the camp of disabling overcommit it
| matters, but otherwise the application of green threads is
| definitely a specialized niche, not generally useful.
|
| > In the broad set of use cases though, switching from a
| thread-based concurrency model to something else isn't going
| to be the big win people think it will be.
|
| I'd go even further and say it'll be a net-loss in most
| cases, especially with modern complications like
| heterogeneous compute. If your use case is specifically
| spinning up thousands of threads for IO (aka, you're a server
| & nothing else), then sure. But if you aren't, there's no win
| here, just complications (like times when you _need_ native
| thread isolation for FFI reasons, like using OpenGL)
| cbsmith wrote:
| > That time is approaching 20 years old at this point, too.
| Native threads haven't been "expensive" for a very, very
| long time now.
|
| It depends on the context, but yes. I worked on stuff
| throughout the 2000's where we ran into scaling problems
| with thread based concurrency models. At the time, running
| 100,000 threads was... challenging. But yeah, by 2010 we
| were talking about the C10M problem, because the C10K
| problem wasn't a problem any more. There are some cases
| where you really do need to handle 10's or 100's of
| millions of threads, but there aren't a lot of them.
|
| > Maybe if you're in the camp of disabling overcommit it
| matters, but otherwise the application of green threads is
| definitely a specialized niche, not generally useful.
|
| Yup, but everyone is still stuck on the old mental model of
| "threads are bad", partly driven by the assumption that
| whatever is being done to handle those extreme cases is
| what one should be doing to address their own problem
| space. :-(
|
| > I'd go even further and say it'll be a net-loss in most
| cases, especially with modern complications like
| heterogeneous compute.
|
| Even more so if you're doing polling based I/O rather than
| a reactive model. The look on people's faces when I point
| out to them that there's good reason to think that for the
| scale they are working at, they'll likely get better
| performance if they just use threads to scale...
|
| It's so weird how we talk about the context switching costs
| between threads without recognizing that the thread that does
| the poll is not the same thread that processed the IO
| request in the kernel.
| user5994461 wrote:
| >>> The default stack size (on 64-bit Linux) is 1MB
|
| The default thread stack size is 8 or 10 MB on most Linux
| distributions.
|
| The exception is Alpine, which is below 1 MB.
| ori_b wrote:
| The default reserved size is 8mb. The allocated size starts
| at a page (usually 4k), and grows in page sized increments
| as you use it.
| cbsmith wrote:
| To clarify, the 1MB is the default stack size for threads
| with the JVM on 64-bit Linux.
|
| Search for "-Xss":
| https://docs.oracle.com/en/java/javase/16/docs/specs/man/jav...
| christkv wrote:
| Are we coming full circle, going back to a variant of the
| original Java green threads?
| dragonwriter wrote:
| > Are we coming full circle, going back to a variant of the
| original Java green threads?
|
| There are basically two kinds of green threads:
|
| (1) N:1, where one OS thread hosts all the application
| threads, and
|
| (2) M:N, where M application threads are hosted on N OS
| threads.
|
| Original Java (and Ruby, and lots of other systems before
| every microcomputer was a multicore parallel system) green
| threads were N:1, which provide concurrency but not
| parallelism, which is fine when your underlying system can't
| do real parallelism anyway.
|
| Wanting to take advantage of multicore systems (at least, in
| the Ruby case, for underlying native code) drove a transition
| to native threads (which you could call an N:N threading
| model, as application and OS threads are identical).
|
| But this limits the level of concurrency to the level of
| parallelism, which can be a regression compared to N:1 models
| for applications where the level of concurrency that is
| useful is greater than the level of parallelism available.
|
| What lots of newer systems are driving toward, to solve that,
| are M:N models, which can leverage all available parallelism
| but also provide a higher degree of concurrency.
| jjtheblunt wrote:
| I worked in Solaris internals for a while at Sun during the
| early java era, and Solaris threading definitely did
| multiplexing of userspace onto os, and then os onto cores.
|
| Do you have a citation (because I can't find one)
| specifying your assertion that original Java green threads
| were not analogous to Solaris user -> os -> hardware
| multiplexing?
| dragonwriter wrote:
| > Do you have a citation (because I can't find one)
| specifying your assertion that original Java green
| threads were not analogous to Solaris user -> os ->
| hardware multiplexing?
|
| I was writing from memory of second-hand after-the-fact
| recitations of the history. Doing some followup research
| prompted by your question, if I understand this document
| [0] correctly, Java initially had N:1 green threads on
| Solaris, then M:N green threads on Solaris with 1:1
| native threads on Unix and Windows.
|
| [0]
| https://docs.oracle.com/cd/E19455-01/806-3461/6jck06gqe/
| cbsmith wrote:
| Java had M:N green thread models a LOOOOONG time ago.
|
| And Linux tried M:N thread implementations specifically to
| improve thread performance.
|
| In both cases, it turned out that just using 1:1 native
| threads ended up being a net win.
| hawk_ wrote:
| I am not aware of a built-in M:N thread model in Java, even
| from long ago, at least not in a way that let you
| control N.
| truffdog wrote:
| It was Solaris only, so there is definitely an asterisk
| somewhere.
| cbsmith wrote:
| It was all very long ago, but the NGPT project did M:N
| threading on Linux
| (https://web.archive.org/web/20020408103057/http://www-124.ib...).
|
| There were also a number of M:N JVM implementations that
| were particularly popular in the soft-realtime space back
| in the early 2000's.
|
| One of the fun trends with computing is that as hardware,
| software, and applications evolve, ideas that were once
| not terribly useful suddenly become useful again. It's
| entirely possible that M:N threads for the JVM is one of
| those cases, but it's NOT a new idea.
| cbsmith wrote:
| The old JDK 1.1 Developer's Guide had a page on the
| different thread models:
| https://docs.oracle.com/cd/E19455-01/806-3461/6jck06gqk/inde...
|
| At the time, Solaris had the only "certified" JVM that
| did M:N threads, so they really liked to make a big deal
| about it.
|
| You could control N through a JNI call to
| thr_setconcurrency. Not portable, but it worked. That
| particular capability was almost always not helpful.
| hashmash wrote:
| Not quite. The original green threads were seen as more of a
| hack until Solaris supported true threads. Green threads
| could only support one CPU core, and so without a major
| redesign, it was a dead end.
| neeleshs wrote:
| This gives some color:
| https://blogs.oracle.com/javamagazine/going-inside-javas-pro...
| carimura wrote:
| and also many resources direct from the source (those that
| are working on Loom): https://inside.java/tag/loom
| AtlasBarfed wrote:
| Basically yes.
|
| Longer answer: devs back in the day couldn't really grok the
| difference between green and real threads. Java made its
| bones as an enterprise language, which can have smart
| programmers, but they are conversely not close to the metal,
| knowledge-wise. Too many devs back in the day expected a Java
| thread to be a real thread, so Java re-engineered to
| accommodate this.
|
| I think the JDK/JVM teams also viewed it as a maturation of
| the JVM to be directly using OS resources so closely across
| platforms, rather than "hacking" it with green threads.
|
| These days, our high-performance fanciness means devs are
| demanding green-thread analogues, and Go/Elixir/others seem
| superior because of them.
|
| So to remain competitive in the marketplace, Java now needs
| threads that aren't threads even though Java used to have
| threads that weren't threads.
| cogman10 wrote:
| Yes and no.
|
| The new Loom threads will be much lighter weight than the
| original Java green threads. Further, the entire IO
| infrastructure of the JVM is being reworked for Loom to make
| sure the OS doesn't block the VM's thread. What's more, Loom
| does M:N threading.
|
| Same concept, very different implementation.
| iamcreasy wrote:
| So, with Loom, can we now tell exactly in which order these
| threads were executed, as it's not up to the OS to decide
| thread execution order anymore?
___________________________________________________________________
(page generated 2021-07-19 23:00 UTC)