[HN Gopher] Resource efficient Thread Pools with Zig
___________________________________________________________________
Resource efficient Thread Pools with Zig
Author : komuW
Score : 283 points
Date : 2021-09-13 08:32 UTC (14 hours ago)
(HTM) web link (zig.news)
(TXT) w3m dump (zig.news)
| goodpoint wrote:
| How does it compare to Nim's Weave thread manager?
| chris_st wrote:
| Nim newbie here -- does Nim have something analogous to Go's
| channels for inter-thread communication?
| goodpoint wrote:
| Yes, it has channels as a built-in:
|
| https://nim-lang.org/docs/channels.html
|
| Things like Weave abstract away the need to manage threads
| and messaging explicitly.
| ur-whale wrote:
| How does it compare with a bare-metal C++ threadpool?
| judofyr wrote:
| What is a bare-metal C++ threadpool? Is that a specific library
| you're referring to?
| mcnichol wrote:
| You've been spending too much time arguing over virtualization
| or bare metal.
|
| You are at the wrong level of abstraction.
| thechao wrote:
| Weellll... I guess you could get rid of the OS and switch to
| a stack-only ABI with a callee-saves-everything policy? Then
| "thread switching" and "multiprocess" both map onto
| saving/restoring the stack (on ARM you'd have to have SP/LR
| motion to stack), and process communication would just be
| over a unified stack.
|
| What'd be cool is if _every_ core (and, thus, affinitized
| process) had its own virtual address space. That way if any
| _single_ core went down, its neighbors could stand it back
| up. Of course, you'd have to have some sort of hideous last-
| level shared virtual addressing scheme -- probably with large
| last-level pages -- to "migrate" a "process" between cores.
|
| EDIT: And security be damned! We need that last .5%!
| dragontamer wrote:
| GPUs are practically the "OS-free" compute device of choice
| these days.
|
| The way GPUs work in most cases is through work-queues. The
| CPU submits work to GPUs (either OpenCL queues or CUDA
| Streams), and the GPU "plucks" work off of the queue to do.
| There are dozens of SMs (NVidia) or CUs (AMD GCN) or WGPs
| (AMD RDNA) that pluck the work off concurrently.
|
| The kernels are then run to completion. Upon completion,
| the stream gets an event that triggers (which oftentimes
| automatically enqueues another task). If the CPU has work
| to do, it can get an async message (or maybe wait via
| blocking behavior).
|
| --------
|
| There are other models of course, not everyone runs the
| standard methodology. "Persistent kernels" start up early
| on and infinite-loop. You interact with those by passing
| data over PCIe, and the kernel itself contains queue/load-
| balancing logic to pick up the data somehow (rather than
| letting the device drivers / hardware logic do that).
|
| ------
|
| The thing about GPU-level tasks is that you have so many
| little tasks (ex: shoot a ray, or render pixel (0,0),
| render pixel(1,0), etc. etc.) that just pulling tasks off
| of a queue is your most efficient methodology. Lots of
| small, roughly equal-sized tasks can be load balanced with
| simple methodologies.
| StreamBright wrote:
| Is there an IoC / CSP (a la core.async in Clojure) library in
| Zig yet? Is something like this planned? I could not find much
| about it online.
| matu3ba wrote:
| The former (Inversion of Control) is a general method and too
| unspecific to answer. I can only hint that one of the biggest
| caveats of async is brittle composability due to missing
| cancellation routines. This will eventually be addressed.
|
| The latter (communicating sequential processes) is a formal
| language, but it looks like it is for IPC:
| https://arild.github.io/csp-presentation/#1
| scott_s wrote:
| I designed and implemented a mostly lock-free dynamic thread
| scheduler for streaming runtimes, and I learned some similar
| lessons: avoid global data and amortize the necessary
| synchronization that you have to do. One of the main
| peculiarities of a streaming context is that work-stealing is
| counter-productive. In a streaming context, it's more like
| _cutting in_ when the work would be done anyway. It's better to
| go find a part of the streaming graph that is not currently being
| executed than to steal some from another thread.
|
| The paper describing my design is "Low-Synchronization, Mostly
| Lock-Free, Elastic Scheduling for Streaming Runtimes",
| https://www.scott-a-s.com/files/pldi2017_lf_elastic_scheduli....
| The source code for the product implementation is now open
| source. Most is in
| https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP...
| and
| https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP....
| posharma wrote:
| How does this compare to OpenMP? Does it support nested
| parallelism? (disclaimer: haven't read the post thoroughly).
| jorangreef wrote:
| Incredible post. Ticks all the boxes, and it's exciting to see
| this coming to Zig. Protty has got to be a domain specialist
| when it comes to atomics and threads.
| vanderZwan wrote:
| If they weren't before, they probably are by now, after all the
| work put into this.
| gpderetta wrote:
| To be very pedantic, the Vyukov MPSC queue, while indeed awesome
| in its simplicity and performance, is not technically lock-free,
| not even obstruction-free (and Vyukov never claimed it as such [1]):
| a producer halting between the atomic swap and atomic store will
| prevent further nodes enqueued by other producers from ever being
| dequeued by the consumer.
|
| [1] edit: and in fact there is a note exactly on this on the
| linked Vyukov page.
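|
| A minimal sketch of the producer side in Rust (illustrative
| only -- `Node`, the names, and the orderings are mine, not
| Vyukov's or the article's code) shows the window:
|
|     use std::ptr;
|     use std::sync::atomic::{AtomicPtr, Ordering};
|
|     struct Node {
|         next: AtomicPtr<Node>,
|     }
|
|     fn push(head: &AtomicPtr<Node>, node: *mut Node) {
|         unsafe { (*node).next.store(ptr::null_mut(), Ordering::Relaxed) };
|         // (1) atomically publish ourselves as the newest node
|         let prev = head.swap(node, Ordering::AcqRel);
|         // If the producer halts here, `prev.next` is still null:
|         // the consumer cannot traverse past `prev`, so every node
|         // enqueued after this point stays invisible until this
|         // thread resumes.
|         // (2) link the previous node to us
|         unsafe { (*prev).next.store(node, Ordering::Release) };
|     }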
| kprotty wrote:
| I'm aware that it's technically considered blocking due to a
| producer being preempted between swap and store, causing the
| consumer to see an invalid link and report empty. I address
| this in the run queue section by noting that it's OK if empty
| is spuriously reported since the thread waking/waiting
| mechanism takes care of rescheduling such a case.
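|
| To illustrate, reusing the `Node` sketch from the parent
| comment (simplified; it ignores Vyukov's stub-node handling):
|
|     // Single-consumer pop. `tail` is the consumer's cursor.
|     fn pop(tail: &mut *mut Node) -> Option<*mut Node> {
|         let next = unsafe { (**tail).next.load(Ordering::Acquire) };
|         if next.is_null() {
|             // Either truly empty, or a producer was preempted
|             // between its swap and store: spuriously report empty
|             // and let the wake/wait machinery retry later.
|             return None;
|         }
|         *tail = next;
|         Some(next)
|     }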
| feffe wrote:
| A pet peeve of mine is calling something that uses atomic
| operations lock-less. It's very common, but once it clicks that
| atomic operations are just locks at the instruction level, it
| feels very wrong to call most such algorithms lock-less. There
| are still the same concerns about avoiding contention when using
| atomic operations as with regular mutexes. A mutex lock in the
| uncontended case never enters the kernel and is just an atomic
| operation... The main strategy, regardless of whether you use
| locks or atomics directly, is to avoid contention.
|
| The only truly lock-less algorithm that I'm aware of (for most
| CPU archs) is RCU in the read case.
| OskarS wrote:
| "Lock free" has a very precise meaning in computer science, and
| it's this: "given some number of threads of execution, an
| algorithm is lock free if in some finite number of steps, at
| least one thread is guaranteed to make forward progress". When
| using mutexes, this is not guaranteed under any amount of
| contention: the scheduler is perfectly free to preempt a thread
| holding a lock, which means no threads make progress.
|
| However, if you use a CAS loop (like this code does) to update
| some value (i.e. a while loop that repeatedly tries to execute
| a compare-and-swap), that loop will only fail to make progress
| if some other thread DID make progress. This means that such an
| algorithm is in fact lock-free: if thread X didn't progress, it
| is necessarily because some other thread did. In addition, a
| scheduler preempting a thread at any point will not stop other
| threads making progress.
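|
| As a minimal sketch (illustrative Rust, not the article's
| code), such a CAS loop looks like this:
|
|     use std::sync::atomic::{AtomicU64, Ordering};
|
|     fn increment(counter: &AtomicU64) -> u64 {
|         let mut cur = counter.load(Ordering::Relaxed);
|         loop {
|             match counter.compare_exchange(
|                 cur,
|                 cur + 1,
|                 Ordering::AcqRel,
|                 Ordering::Relaxed,
|             ) {
|                 Ok(_) => return cur + 1,
|                 // Failure means another thread updated the value
|                 // first, i.e. the system as a whole progressed.
|                 Err(observed) => cur = observed,
|             }
|         }
|     }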
|
| Whether an algorithm is lock free or not is not a matter of
| opinion, it is a mathematical statement. The fact that it
| annoys you is... weird.
| feffe wrote:
| Yes, I'm probably wrong on the semantics of lock-free. I
| should read up on that. My point was that maybe you can gain
| a factor of 2 by using atomic ops in a "lockless" fashion
| compared to mutexes, but it still scales extremely poorly
| compared to what you could do if the hardware had real async
| mechanisms that did not rely on cache coherency. The cache
| line that is subjected to an atomic operation will be evicted
| from all other CPUs' caches, back and forth.
|
| A simple example is letting all cores in a multi-core CPU
| do an atomic add at the same memory address. The scaling will
| stay at exactly 1 regardless of the number of available cores.
| I realize this is very off-topic w.r.t. this cool article and
| implementation in Zig. It's just that this problem can't really
| be solved in an efficient manner with today's hardware.
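|
| A tiny Rust sketch of that experiment (illustrative; the
| thread and iteration counts are arbitrary) -- the elapsed time
| barely improves as threads are added, i.e. total throughput
| does not scale:
|
|     use std::sync::atomic::{AtomicU64, Ordering};
|     use std::thread;
|     use std::time::Instant;
|
|     static COUNTER: AtomicU64 = AtomicU64::new(0);
|     const TOTAL: u64 = 80_000_000;
|
|     fn main() {
|         for threads in [1u64, 2, 4, 8] {
|             COUNTER.store(0, Ordering::Relaxed);
|             let start = Instant::now();
|             thread::scope(|s| {
|                 for _ in 0..threads {
|                     s.spawn(move || {
|                         // Every add bounces the cache line holding
|                         // COUNTER between cores, serializing the work.
|                         for _ in 0..TOTAL / threads {
|                             COUNTER.fetch_add(1, Ordering::Relaxed);
|                         }
|                     });
|                 }
|             });
|             println!("{} threads: {:?}", threads, start.elapsed());
|         }
|     }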
| gpderetta wrote:
| Atomic instructions are not locks on the instruction level (the
| lock prefix on x86 is just an historical artefact).
|
| edit: also, lock-free has not much to do with locks.
| feffe wrote:
| It's a lock associated with the cache line that the atomic
| operation operates on. This is because they are built on top
| of the cache coherency mechanism, a synchronous blocking
| operation on the hardware level to implement an asynchronous
| mechanism on the software level.
|
| The big frustration with today's multi-core CPUs is that
| there's simply no efficient way to communicate using message-
| passing mechanisms. I think this is something the hardware
| guys should focus on :-) Provide an async mechanism to
| communicate between cores, not relying on the cache
| coherency.
| gpderetta wrote:
| There is no lock associated with the cache line.
|
| The coherency protocol guarantees that a core can own in
| exclusive mode a cache line for a bounded number of cycles,
| this guarantees forward progress and it is different from
| an unbounded critical section. It has also nothing to do
| with the lock prefix and also applies to normal non atomic
| writes.
|
| What the lock prefix does is delay the load associated with
| the RMW so that it is executed together with the store
| before the core has a chance to lose the ownership of the
| line (technically this can also be implemented
| optimistically with speculation and replays, but you still
| need a pessimistic fallback to maintain forward progress).
|
| A message passing feature would be nice, but it would be
| necessarily non coherent which means you can only use it to
| pass serialised values, not pointers to other threads.
|
| An alternative more usable solution would be to
| aggressively speculate around memory barriers as the memory
| subsystem is already highly asynchronous.
| feffe wrote:
| The effect is the same. If someone touches the cache line,
| it's evicted from all other caches that have it,
| triggering a cache miss when other cores touch it.
| Everyone knows this. I just think it's a bit depressing
| that if you try to optimize message passing on many-core
| CPUs you'll realize you can't make it fast. No-one has been
| able to make it fast (I've checked Intel's message-passing
| code as well). If you get more than 20 Mevents/s through
| some kind of shared queue, you are lucky. That is slow if
| you compare it to how many instructions a CPU can retire.
|
| So all these event-loop frameworks try to implement async
| behaviour on top of a hardware mechanism that is
| synchronous at its core. The main issue is that the
| lowest software construct, the queue used to communicate,
| is built on this cache-line ping-pong match. What the
| software wants is still a queue; you could still send
| pointers if the memory they point to has been committed
| to memory by the time the receiving core sees them. Just
| remove the inefficient atomic-operations way of
| synchronizing the data between CPUs. Send the data over
| some kind of network pipe instead :-) As you say, the
| memory subsystem really is asynchronous.
|
| I'm convinced this is coming, it just should have arrived
| 10 years ago...
| gpderetta wrote:
| >What the software want it still a queue, you could still
| send pointers if the memory they point to has been
| committed to memory when the receiving core see them
|
| Which means that the sender will need to pessimistically
| flush anything that the receiver might need to access,
| which is likely more expensive than optimistic cache
| coherency.
|
| Non-CC systems were a thing in the past, and still are
| for things like GPUs and distributed systems, but are
| much harder to program and not necessarily more
| efficient.
|
| As the saying goes there are only two hard things in
| programming...
| omazurov wrote:
| Increasing the size of the array 10x (100_000_000) and filling a
| glaring omission:
|
| Go (go1.17.1 darwin/amd64):
|     took 5.591593544s
|     took 5.285948722s
|     took 5.218750076s
|     took 5.239224787s
|     took 5.131232207s
|
| Zig (0.9.0-dev.959+f011f1393):
|     took 3.48s
|     took 3.37s
|     took 3.44s
|     took 3.43s
|     took 3.49s
|
| Java (openjdk version "17-ea"):
|     Time: 2684 ms
|     Time: 2654 ms
|     Time: 2840 ms
|     Time: 2676 ms
|     Time: 2649 ms
|
| MacBook Pro Early 2013, 2.7 GHz Quad-Core Intel Core i7, 16 GB
| 1600 MHz DDR3
| gavinray wrote:
| Is this using the JVM/JIT, or using Graal to make a native-image
| binary and calling that? If the timing includes startup, you'd
| get better results with the latter by shaving off startup time.
|
| I think .NET 6 latest preview would slightly beat out Java here
| as well.
|
| This isn't exactly apples-to-apples since Zig/Rust/Go are
| considered "systems languages", but the JVM has GraalVM for
| compiling to native binaries/libraries and even
| importing/exporting C functions and structs. And .NET of course
| has .NET Native + "Dotnet Native Exports", including the LLVM
| experiment where it uses LLVM bitcode instead of RyuJIT.
|
| So you can make the argument that writing a native binary or a
| library which exported this benchmark function as a C-callable
| method with a C header in each language would be technically
| equivalent.
|
| The JVM and .NET ones would have a large size (several MB each)
| but would otherwise still fill this requirement.
| egberts1 wrote:
| This is the first four-state state-machine thread algorithm.
|
| And I like what I am seeing. Hopefully all transitions between
| states have been tested.
|
| Good work.
| matu3ba wrote:
| Drawing them could help figure out all (invalid) state
| transitions. Certainly would make a cool drawing.
| Jarred wrote:
| Running the benchmarks locally (zig is fastest, but that's
| probably expected)
|
| zig (~1 week older than HEAD):
|     ./zig-out/bin/qsort
|     filling
|     shuffling
|     running
|     took 177.46ms
|
| rust nightly:
|     cargo run --release --bin qsort
|     warning: the cargo feature `edition2021` has been stabilized
|     in the 1.56 release and is no longer necessary to be listed
|     in the manifest. See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field
|     for more information about using this feature.
|     Finished release [optimized] target(s) in 0.02s
|     Running `target/release/qsort`
|     filling
|     shuffling
|     running
|     took 896.91656ms
|
|     cargo run --release --bin rsort
|     warning: (same edition2021 warning as above)
|     Finished release [optimized] target(s) in 0.02s
|     Running `target/release/rsort`
|     filling
|     shuffling
|     running
|     took 212.190694ms
|
| go 1.16.3:
|     go run qsort.go
|     filling
|     shuffling
|     running
|     took 222.90356ms
| on macOS 11.5.2 with a 2.4 GHz 8-Core Intel Core i9
| Arnavion wrote:
| The fact that the Rust tokio version (the one that uses tokio
| tasks instead of threads) is slow is to be expected. tokio
| tasks aren't appropriate for running quicksort; they will have
| overhead compared to a regular threadpool because of the I/O
| reactor, waker, etc. code that will never get used.
|
| rayon or crossbeam are more appropriate since they are actually
| designed for general-purpose thread work. Using rayon's scopes
| will also get rid of the `unsafe` that's being (ab)used to
| create the fake `&'static mut` slices.
|
| Though for some reason, the rayon version (rayon_qsort.rs) uses
| scopes, but still uses `unsafe` to create `&'static mut`
| slices... ( https://github.com/kprotty/zap/pull/3 )
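|
| For reference, a minimal sketch of the scoped version
| (illustrative; the Lomuto partition and the names are mine,
| not the code in the repo):
|
|     fn partition(v: &mut [u32]) -> usize {
|         let pivot = v.len() - 1;
|         let mut store = 0;
|         for i in 0..pivot {
|             if v[i] < v[pivot] {
|                 v.swap(i, store);
|                 store += 1;
|             }
|         }
|         v.swap(store, pivot);
|         store
|     }
|
|     fn quick_sort(v: &mut [u32]) {
|         if v.len() <= 1 {
|             return;
|         }
|         let mid = partition(v);
|         let (lo, rest) = v.split_at_mut(mid);
|         let hi = &mut rest[1..]; // pivot at rest[0] is in place
|         // The scope ends only after both tasks finish, so
|         // borrowing `lo` and `hi` is sound -- no fake
|         // `&'static mut` needed.
|         rayon::scope(|s| {
|             s.spawn(move |_| quick_sort(lo));
|             s.spawn(move |_| quick_sort(hi));
|         });
|     }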
| gefhfff wrote:
| Can you explain that Tokio vs crossbeam vs rayon a bit more?
| Should I default to rayon or crossbeam because of that?
| ModernMech wrote:
| It depends on what you want to do. If you are doing I/O-
| bound work, Tokio would be what you want -- you would use
| it as a runtime for the async capabilities in Rust. If you
| have CPU-bound work, then rayon is what you want to use.
| Rayon is a work-stealing parallelism crate -- it will
| schedule work to be done, and different threads will
| pick up portions of it as they become available to do
| work. It's very easy to get 100% CPU utilization across all
| cores using Rayon if your work is naturally parallelizable,
| and the interface is dead simple: anywhere you do a
| .iter(), you can turn it into a .par_iter(), and Rayon will
| parallelize the work.
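|
| For example (a sketch using rayon's prelude):
|
|     use rayon::prelude::*;
|
|     fn sum_of_squares(input: &[i64]) -> i64 {
|         // .par_iter() instead of .iter(): rayon splits the
|         // slice across its thread pool and reduces the sums.
|         input.par_iter().map(|&x| x * x).sum()
|     }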
|
| Note there is some overhead to using Rayon -- you normally
| will be better off doing your work on a single thread
| unless you have a large number of elements in your
| collection... I found for my application I needed more than
| 1e6 elements before I saw an appreciable performance
| benefit to using Rayon.
|
| As others said, Crossbeam is for sending and receiving
| messages across threads. I use it alongside tokio and
| rayon.
| YorickPeterse wrote:
| Crossbeam is a library that provides various concurrency
| primitives, such as a queue type that allows stealing of
| work.
|
| Rayon is a library for running code in parallel. Typically
| you'll give it a (parallel) iterator of sorts, and it will
| distribute the work across a pool of threads.
|
| Tokio is a library for async IO operations.
| knuthsat wrote:
| What about smol?
| nicoburns wrote:
| Tokio is designed for concurrent I/O; rayon or tokio-
| threadpool for concurrent CPU-bound tasks.
| kprotty wrote:
| I don't think tokio's slowness here is "to be expected".
| There isn't much reason for tokio tasks to have that much
| overhead over normal thread pools. The I/O driver shouldn't
| be called in such a benchmark given there's no I/O work
| happening. Wakers only add a reference count inc/dec + an
| atomic CAS for wake(), which should only happen when the
| JoinHandle is `await`ed [4], compared to just an atomic swap
| on the join handle in the Zig case.
|
| Golang doesn't poll for I/O under such cases [0] and tokio
| should be using the `ParkThread` parker for putting threads
| to sleep [1] given `net` features aren't enabled (not sure if
| this is actually the case), which you can force with a
| custom Runtime initialization instead of `#[tokio::main]` as
| an exercise.
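|
| Something along these lines (a sketch; the worker count is
| arbitrary, and with no `enable_io()`/`enable_time()` calls, no
| I/O or timer driver is set up):
|
|     fn main() {
|         let rt = tokio::runtime::Builder::new_multi_thread()
|             .worker_threads(8)
|             .build()
|             .unwrap();
|         rt.block_on(async {
|             // spawn and await the benchmark tasks here
|         });
|     }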
|
| `crossbeam-deque` requires heap allocation for the run
| queues, heap allocates on growth, and garbage collects said
| memory. This is an overhead I wished to avoid, and it is
| something tokio has been making improvements to avoid as well
| [2].
|
| `rayon` isn't a good comparison here given `rayon::join` is
| optimized to hook directly into the scheduler and run the
| caller only until the other forked-section completes [3].
| This isn't general purpose and it takes advantage of
| unbounded stack allocation which can technically cause a
| stack overflow. Zig could do this and also take advantage of
| batch scheduling, but it complicates the code and is unfair
| here given `async` usage. Tokio, golang, and the Zig
| benchmarks require heap allocation on spawn so I believe it
| makes it a fairer comparison. This is also why I used rayon
| scopes instead of join(): less specialization, and it
| reintroduces the heap allocation from the unbounded
| concurrency.
|
| The `unsafe` there is from me copying the benchmark code from
| the tokio version to the rayon version and forgetting to
| remove the hack. In tokio, ownership of the array needed to
| be passed into the function given the lifetime was no longer
| linear from the spawn(), I assume (correct me if I'm wrong
| here, but this is what the compile error hinted at). So I
| needed to recreate the array after the function, hence
| unsafe. If there's a better way for the tokio version, please
| send a PR. I see you've done so for the rayon version and I
| gladly merged it.
|
| [0]: See `atomic.Load(&netpollWaiters) > 0` in
| https://golang.org/src/runtime/proc.go
|
| [1]: https://github.com/tokio-rs/tokio/blob/master/tokio/src/runt...
|
| [2]: https://tokio.rs/blog/2019-10-scheduler#a-better-run-queue
|
| [3]: https://docs.rs/rayon-core/1.9.1/src/rayon_core/join/mod.rs....
|
| [4]: https://github.com/tokio-rs/tokio/blob/98578a6f4a494e709f000...
| skohan wrote:
| Honestly, tokio is so complex and serves so many use-
| cases that it's hard for me to believe it could ever be truly
| optimal for any one use-case.
| kristoff_it wrote:
| Isn't Go's scheduler just as versatile?
| skohan wrote:
| I'm not an expert on the go scheduler, but my perception
| is that it is more of a focused single-purpose component
| whereas tokio seems like a sprawling swiss-army-knife of
| a library if you browse the source
| bbatha wrote:
| The tokio scheduler and the go scheduler are roughly
| equivalent. Much of tokio's bulk is reimplementing much of
| the std lib in an async-compatible way (io, os,
| blocking).
| tick_tock_tick wrote:
| It's much more versatile but also has many more years of
| optimization.
| Arnavion wrote:
| >I don't think tokio's slowness here is "to be expected".
|
| And yet on my machine the rayon version takes ~160ms and
| the tokio version takes ~1350ms. This isn't at the level of
| some minor missed performance optimization.
|
| >There isn't much reason for tokio tasks to have that much
| overhead over normal thread pools.
|
| tokio is an async runtime. tokio tasks are meant for
| distributing I/O-bound work. It would be at least a little
| more correct to use spawn_blocking for CPU-bound tasks, but
| that still doesn't work for your recursive calls because
| that's not what it's meant for.
|
| In general, if you have CPU-bound work in your tokio-using
| application, you run it on a different threadpool - tokio's
| blocking one, or a completely different one.
|
| >`rayon` isn't a good comparison here given `rayon::join`
| is optimized to hook directly into the scheduler and run
| the caller only until the other forked-section completes
| [3]. [...] This is also why I used rayon scopes instead of
| join(): less specialization and reintroduced the heap
| allocation from the unbounded concurrency.
|
| My original comment was also talking about scopes, not
| `rayon::join`. So yes, `rayon` is absolutely a good
| comparison.
| pizza234 wrote:
| I get inconsistent results, both across platforms and
| languages. To be prudent, the benchmark may be flawed and no
| conclusions should be drawn.
|
| Running on Linux (zig -Dc), rounded average of 3 runs (times
| in ms):
|
| 8-thread Intel laptop machine:
|     zig 0.9.0-dev.958:                 945
|     rust 1.5.0-nightly 21-09-12/qsort: 545
|     rust 1.5.0-nightly 21-09-12/rsort: 255
|     go 1.16.8:                         355
|
| 32-thread AMD workstation:
|     zig 0.9.0-dev.958:                 410
|     rust 1.5.0-nightly 21-09-12/qsort: 780
|     rust 1.5.0-nightly 21-09-12/rsort: 135
|     go 1.16.8:                         135
|
| Additionally, I haven't done a thorough review, but enabling
| the commented code:
|
|     // Optimized version (hooks directly into the scheduler)
|     // rayon::join(
|     //     || quick_sort(low),
|     //     || quick_sort(high)
|     // );
|
| yields a 110ms run on the faster machine.
| kprotty wrote:
| I'm not sure how this implies it is flawed. It benchmarks
| thread pools, so on a system that allows more parallel
| concurrent tasks (i.e. the 32-thread AMD CPU) the throughput
| is expected to scale somewhat, and you see this in your results.
|
| Also, `zig -Dc` links to libc (`-Dc`) but builds in debug
| mode by default, similar to Rust and C/C++ compilers. The
| readme contains instructions to compile it with optimizations
| (`-Drelease-fast`); the rust versions do so as well
| (`--release`), so you should re-evaluate your results.
|
| See one of my other comments in this post on why I didn't use
| rayon::join() by default.
| rdsubhas wrote:
| At first glance, this looks great, almost 20% speedup!
|
| But genuine question, this looks to be how almost every new
| tech begins. Initially it's FAST because it keeps things as
| barebones as possible, and everything else other than the
| specific module can be ignored.
|
| But no feature works in isolation, so as more features get
| added - scheduling constraints, safety quirks, or random
| features that cut across the language - the 20% looks like
| nothing.
|
| I did go through the blog, but how can one be sure that this
| 20% gap is a guaranteed design advantage, and not just a
| temporary deferral of everything else?
| dnautics wrote:
| > how can one be sure that this 20% gap is a guaranteed
| design advantage
|
| It's no guarantee, but in my experience, Zig is a language
| that lends itself very well to composability. I imagine that
| instead of adding features to stdlib, anyone who needs "extra
| features" in their threading system will find it quite easy
| to implement their own as a library outside of stdlib that
| can easily be swapped in and out in A/B tests for the library
| consumer. If some feature or another is REALLY wanted in the
| stdlib for some reason, it would probably be easy to drop it
| behind a comptime flag as part of a type-constructor
| function, which won't compromise (runtime) performance -- only
| a teeny-tiny hit to comptime performance.
| kristoff_it wrote:
| Nobody can predict the future, but one thing that might help
| is that Protty is in the Zig core dev team and, as the post
| shows, he really cares about efficiency.
| jorangreef wrote:
| > this looks to be how almost every new tech begins
|
| On second thought, do you think this is true? The level of
| detail shown in this post is fairly extraordinary for event
| loops just getting started.
|
| > how can one be sure that this 20% gap is a guaranteed
| design advantage
|
| I would say because the design is solid and recognizes
| principles of mechanical sympathy that are fundamental and
| unlikely to change for quite some time.
| skohan wrote:
| fwiw Zig is taking its role as a low-level, bare bones "c
| replacement" as a core design constraint. For instance, Zig
| has eschewed even fairly tame features like operator
| overloading because they're not considered in line with Zig's
| identity as a minimal, explicit language.
|
| So feature creep is always a concern, but from what I
| understand Zig has a better chance at avoiding this than
| other projects due to the language's guiding philosophy.
| adamrezich wrote:
| it seems like the design philosophy of zig is such that it's
| not going for feature creep/bloat, it's keeping the language
| very simple and eschewing unexpected complexity
| erichdongubler wrote:
| Does this already use `target-cpu=native`? Not sure if that
| would be apples-to-apples with the Zig implementation (which is
| what's important here), but I'd be surprised if it didn't yield
| some interesting wins.
| adenozine wrote:
| Rust is such a bloated, insane design-by-committee language.
| It's actually still surprising to me that corporations are
| defending it and using it. I sure won't.
|
| It's also just so dang ugly to read. If Perl and C++ had a
| mutant offspring, I think it'd be Rust. And that's just not
| respectful to Perl!
|
| I'm really happy that Zig is doing so well. I'm planning on
| contributing a little chunk of money to the ZSF whenever they
| get a 1.0 landed. I do think there are some lessons to learn
| from Rust about formal methods as foundational building blocks
| in a language. I'm eager to see Zig mature more around the
| memory-safety side of things.
| pcwalton wrote:
| There was no "design committee" for Rust.
|
| I'm increasingly skeptical that any language with lifetimes
| can be anything other than "ugly to read". People complain
| about syntax, but what they're really complaining about is
| semantics.
| [deleted]
| komuW wrote:
| When you do `go run qsort.go` you are also timing the time
| taken by Golang to build/compile the code. I suspect the same
| applies to `cargo run`
|
| You should do something like: `go build . && time ./qsort`
| masklinn wrote:
| The same occurs with Rust; we can even see that the no-op
| recompilation takes about 20ms.
|
| It's obviously just a rough estimate / comparison otherwise
| it'd be using hyperfine or something along those lines.
| Arnavion wrote:
| The "took #ms" line is output of the program. It does not
| include compilation time.
| stewbrew wrote:
| Erm, could somebody explain to me why I shouldn't read
| this as an argument pro Go, given that it is about as fast as
| the fully optimized versions of Zig and Rust?
| kristoff_it wrote:
| I think that's a perfectly reasonable way of looking at it.
| jerf wrote:
| Go's value proposition is that it has good bang-for-the-buck.
| It's fairly easy to write in. There's easier, but it's fairly
| easy. It performs fairly well. There are things that perform
| better, but it's fairly good. It scales fairly well. There are
| things that scale better, but it's pretty good.
|
| If you draw all this out on the programming landscape, I
| think one of the reasons Go has succeeded as well as it did
| is that this was a poorly covered part of the landscape.
| There were languages that were much easier, but performed
| much worse, and languages that performed much better, but
| generally required a lot more developer effort.
|
| I don't expect Go to ever top the charts on performance on a
| task like this, but it's often hanging around at only a bit
| slower, and it does that across a wide variety of tasks and
| on a wide variety of dimensions. It's not the best at much of
| anything, but it's pretty darned good for pretty cheap on a
| lot of measures.
| mamcx wrote:
| Well, working with several tasks is THE thing with Go, right?
| If it were much worse at it, Go would fail.
|
| In situations like this, it's important to remember that
| Rust/Zig are (systems) languages you could use to MAKE "Go",
| but also a million other different things with different goals.
|
| So another way to look at this is: "if I want to design a
| language with a runtime like Go, these both get that close!"
| r0f1 wrote:
| I know this is off-topic, but does this website not look eerily
| similar to dev.to?
| kristoff_it wrote:
| Yes, as others have said, it's based on the same open source
| CMS.
|
| I've created zig.news for people who want to write about Zig
| but who don't necessarily want to maintain their own blog and
| do the work necessary to publicize it.
|
| This should hopefully help more people write down their
| experiences and opinions on using Zig and alleviate the problem
| of having a "blogger aristocracy" that has a much stronger
| voice than anyone else.
|
| I briefly go over this concept in this 3-minute video:
| https://www.youtube.com/watch?v=i9pUuj6eiEg
| ibraheemdev wrote:
| I'm guessing it's using forem, the open-source platform that
| powers dev.to: https://github.com/forem/forem
| andruby wrote:
| That's correct. Both are built on Forem. It's in the footer:
| Built on Forem -- the open source software that powers DEV and
| other inclusive communities. Made with love and Ruby on
| Rails.
|
| https://github.com/forem/forem
| r0f1 wrote:
| Ah I did not know that. Thanks :)
| randyrand wrote:
| Please make Thread Priority a native feature. See GCD.
| kevingadd wrote:
| It's always great to see a robust new threadpool. Most of the
| stock ones I've used have horrible characteristics - for example
| on my Threadripper lots of production apps will spawn 48 or more
| threadpool threads and the scheduling/throughput characteristics
| on that are generally pretty dire because they needlessly compete
| for resources, etc. They tend to allocate garbage for every work
| item too, which makes them a bad choice for realtime scenarios.
| It's genuinely frustrating that most processes on my machine have
| 128 threads or more due to unintelligently scaling off
| ProcessorCount without an upper limit. For example, GIMP's bad
| threading policy means that operations like resizing an image
| take _seconds_ instead of 100ms or less.
|
| I've been using a custom threadpool for a while now but writing
| your own without the expertise necessary can be a real pain.
| smashed wrote:
| You can change the number of processors that GIMP's GEGL multi-
| threaded operations will use in the preferences.
|
| https://youtu.be/D2c8BH8q6Yc
|
| The tile size seems to be hard-coded to 128x128.
| eptcyka wrote:
| I don't think that it's the threadpool's fault that an
| application uses it incorrectly. Also, I think there are a lot
| of developers who have not considered that on today's machines,
| just spawning as many threads as there are cores is the optimal
| amount of threads in a thread pool for every use case. I
| wouldn't say it's the case of bad software but rather software
| that was written for the CPUs of 5 years ago. And generally, I
| don't think this will be an easy problem to solve due to the
| variety of heterogeneous and non-heterogeneous topologies of
| modern SoCs. In fact, I don't see this specific threadpool
| doing anything to optimize for the disparate core clusters of
| your threadripper, or to acknowledge the disparity between core
| clusters on the M1.
| matu3ba wrote:
| > I don't see this specific threadpool doing anything to
| optimize for the disparate core clusters of your
| threadripper, or to acknowledge the disparity between core
| clusters on the M1.
|
| To do that efficiently you need to pin thread groups to cores
| based on information about data usage. This smells like
| over-optimizing for specific architectures to me, if you want
| to go beyond separating stuff like hyper-threads on I/O. An
| additional annoyance: there is no POSIX way to distinguish
| hyperthreads from physical cores.
| eptcyka wrote:
| I think a general purpose threadpool should work well on
| general purpose hardware, and it seems like the most
| popular SoCs on consumer devices will have heterogeneous
| cores et al, so a good implementation would schedule the
| threadpool appropriately. I agree that there is no POSIX
| way to distinguish between hyper threads and regular
| threads, and this is something that should be improved. I'm
| not saying that the achievements made by the threadpool
| implementation are lackluster or that any of the other
| solutions solve the issues I outline any better. What I am
| saying is that the comment I was originally referring to was
| somewhat mistaken about the benefits of a more optimal, yet
| naive, threadpool library.
|
| This isn't just about hyperthreads, by the way. As long as
| the workload isn't compute heavy and often stalls on
| memory, hyperthreads are just as good as regular threads.
| On a hardware level, there is no distinction between a
| regular and a hyperthread core. Either you multiplex a
| single physical core or you don't. Anyway, there is more to
| it than slower threads and faster threads - accessing
| memory between threads will be slower depending on which
| core is trying to access which bits of memory - a core
| stealing work from a sibling core on the same chiplet will
| probably be able to do that quicker than stealing work from
| a core across the cluster if the data prefetcher has been
| doing its job correctly. Spawning more threads than
| necessary might force a CPU to power up more cores than
| necessary, resulting in slower per-core performance
| and worse power efficiency, especially if a
| fast or slow cluster needs to be turned on, where a more
| optimal scheduling of threads might not force that to
| happen. I think a general purpose thread pool by default
| should no longer spawn as many threads as there are
| _cores_, whatever that term even means, with optional
| toggles to inform whether the work that'll be scheduled
| will be compute heavy or not.
| dragontamer wrote:
| > just spawning as many threads as there are cores is the
| optimal amount of threads in a thread pool for every use case
|
| Absolutely not.
|
| Any task with any amount of I/O will have a significant
| amount of blocking. A GPU kernel may take microseconds or
| milliseconds to respond. RDMA (a memory-access over Ethernet)
| may take many microseconds.
|
| Having multiple threads per core would be more efficient: it
| gives the cores something to do while waiting for SSD, RDMA,
| or GPUs. Remember: even the earliest single-core systems from
| the 1970s had multiple threads on one core: it's just more
| efficient to have multiple terminals to read/write to at a
| time.
|
| --------
|
| One hardware-thread per thread (since SMT8 machines exist
| like POWER9 / POWER10) is only efficient in the most
| computationally expensive situations. Which is in fact, a
| rarity in today's world. Your typical programs will be
| waiting on the network interface or SSD.
|
| IIRC: there are professional thread-pools out there that
| default to 1.5 threads per hardware thread, and
| then scale up/down depending on how computationally expensive
| things look. That is: a 64-core/128-thread Threadripper would
| be overloaded with 192 threads, under the assumption that at
| least 33% of them would be waiting on I/O at any given time.
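|
| As a sketch of that sizing rule (the 1.5x factor is the
| default mentioned above; the API choice is mine, not any
| particular library's):
|
|     use std::thread;
|
|     fn main() {
|         // 1.5 software threads per hardware thread, assuming
|         // ~1/3 of tasks block on I/O at any given time.
|         let hw = thread::available_parallelism()
|             .map(|n| n.get())
|             .unwrap_or(1);
|         let workers = hw * 3 / 2;
|         println!("spawning {workers} workers on {hw} hardware threads");
|     }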
| eptcyka wrote:
| >Any task with any amount of I/O will have a significant
| amount of blocking. A GPU kernel may take microseconds or
| milliseconds to respond. RDMA (a memory-access over
| Ethernet) may take many microseconds.
|
| The argument I failed to make was that with heterogeneous
| distribution of memory bandwidth and compute resources,
| most user applications would benefit from spawning fewer
| threads than there are available cores. In I/O-heavy
| workloads, the correct thing to do is asynchronous I/O. This
| can be done for SSDs and GPUs. On contemporary systems, where
| there are heavy costs associated with context switching,
| avoiding them and servicing multiple tasks without
| blocking will always be superior to spawning more threads
| to do more blocking.
|
| When it comes to hyperthreading, I assume that the threads
| are cores - because from the OS perspective, they are, and
| you cannot distinguish two hyperthreads running on a single
| core anyway.
|
| Also, I apologize, but the first sentence you're quoting is
| not what I intended to write - my point is that most
| application developers might still think that spawning as
| many threads as there are cores is a reasonable thing to do
| in all cases - but with CPUs that comprise 4 core
| clusters of 16 cores each, it's often better to spawn far
| fewer threads than the total number of cores available.
| dragontamer wrote:
| > In I/O heavy workloads, the correct thing to do is to
| do asynchronous I/O
|
| You can't async mmap into memory reads (or the GPU-
| equivalent: cudaMallocManaged).
|
| Today's I/O is looking more-and-more like a memory
| read/write. As such, your typical "node = node->next"
| pointer traversals could very well be an incredible
| string of I/O. That indirection could be in RAM, over SSD
| (mmap), over RDMA (ethernet pretending to be RAM), or
| into GPU-RAM (cudaMallocManaged)... with an appropriate
| PCIe command (possibly atomic-PCIe) to boot.
|
| Async only works on the simplest of reading/writing
| patterns.
|
| EDIT: Consider the "persistent kernel" pattern on GPGPU
| (which starts off with a lot of the similar thought
| process you have on CPU-land. Launch only X wavefronts,
| where X is the size of the GPU). You primarily
| communicate with a persistent kernel over RAM / atomics.
| You don't use the classic cuda <<<blocks, threads>>>()
| (which admittedly has an async interface)... Instead:
| you read/write to magic managed-memory locations that
| will be eventually sync'd over PCIe to the GPU. This is
| because the "persistent kernel" is always executing. You
| launched it upon program start, there's no event to
| async-trigger on. You just pass data to the GPU. The GPU
| operates on the data at its leisure. Then it eventually
| returns data to a buffer elsewhere (probably with atomic
| compare-and-swaps, which traverse PCIe these days, to
| ensure coherence)
| touisteur wrote:
| Smells like io_uring to me on the userland/kernel
| interface. Just write to the queue and come later check
| the completion queue.
| dragontamer wrote:
| CUDA-streams probably already use io_uring under the
| hood.
|
| The issue is that a CUDA-stream based execution kernel
| will spend lots of time pulling from the queue and
| spinning up threads (sound like a familiar problem?).
|
| Instead: you sidestep this issue with persistent kernels:
| you launch exactly the number of kernels that matches
| your GPU-size, and then pass data to those kernels
| through an alternative means.
|
| -------
|
| The future of I/O is going to be atomic-operations across
| PCIe with stronger-and-stronger coherency models. PCIe
| 3.0 introduced atomic-PCIe commands. PCIe 4.0
| strengthened them. PCIe 5.0 is rumored to have even
| stronger coherency rules and "coherent grids" are being
| seriously discussed / implemented in high-speed computers
| (see Compute eXpress Link, or CXL)
|
| Any serious I/O (and GPUs will be part of that I/O
| future), will likely use PCIe atomics in some regards,
| quite similar to atomic-compare-and-swaps that the CPU
| does already between their cores.
|
| I/O is becoming indistinguishable from memcpy + atomics.
|
| -------
|
| EDIT: In effect, I'm saying that GPU-programmers are more
| concerned about GPU-efficiency rather than CPU-
| efficiency. Sure, the CPU is less efficient spinning on
| these memory-reads/writes. But the GPU is where the bulk
| of the calculations are happening. Therefore, it behooves
| the architect to optimize the GPU (even if the CPU ends
| up slightly less efficient).
|
| Whatever argument you have about "queues being more
| efficient in a hypothetical architecture", the GPU has more
| calculations and more cores to feed (aka: far more
| overhead when it comes to Amdahl's law). That means you
| want to apply those principles first to the GPU, and then
| the CPU "makes up for it", so to speak.
| eptcyka wrote:
| I was under the impression that PCI-E was perfectly
| capable of sending notifications from one device to
| another in a somewhat efficient manner. Having said that,
| this is not my area of expertise - and I do see that if
| your main concern is to feed the GPU then blocking a
| thread might be the optimal solution. I assume that MSI
| would be too much overhead and might involve some context
| switching to service the interrupt from the kernel etc to
| allow for asynchronous completion? Also, is it possible
| to have overlapping memory regions between a high speed
| networking card and the input buffer from the GPU, which
| in effect just means that the CPU just has to tell the
| GPU to start reading once the network card is done
| receiving?
|
| Having said that, I don't believe that for most
| application developers this is a major concern - in cases
| where you flood the GPU with a firehose of data to
| compute on you probably also don't care about what other
| processes run on the machine and whether your
| architectural decisions end up making people's laps
| uncomfortably hot. I also do not believe that the future
| of all I/O is just memcpy and atomics - we can already do
| that today. It doesn't really bring you any advantages
| for speed in the general case. I think the future of I/O
| is memcpy, atomics and a good signaling mechanism to
| signal I/O task completion without costly context
| switches with as little extraneous memory allocation as
| possible. Moreover, the future of consumer computing will
| probably not rely on PCI-E at all and instead have the
| GPU and the CPU share all of their memory. And hey, maybe
| Nvidia will add some cool ARM companion cores to their
| biggest chips, slap on some DDR5 slots on their cards and
| sell self-contained solutions, sidestepping PCI-E
| entirely, at least for feeding the data from the CPU to
| the GPU.
| nicoburns wrote:
| Can you limit the number of cores available to an application
| when you open it? I feel like there must be a way to do that.
| dundarious wrote:
| On Linux it's taskset, numactl, etc.
| kevingadd wrote:
| On windows, no. You can use affinity to prevent the
| application from running on some of your cores but it will
| still spin up a ton of threads (which makes the situation
| even worse sometimes)
| jedisct1 wrote:
| Pretty impressive achievement!
___________________________________________________________________
(page generated 2021-09-13 23:01 UTC)