[HN Gopher] Resource efficient Thread Pools with Zig
       ___________________________________________________________________
        
       Resource efficient Thread Pools with Zig
        
       Author : komuW
       Score  : 283 points
       Date   : 2021-09-13 08:32 UTC (14 hours ago)
        
 (HTM) web link (zig.news)
 (TXT) w3m dump (zig.news)
        
       | goodpoint wrote:
       | How does it compare to Nim's Weave thread manager?
        
         | chris_st wrote:
         | Nim newby here -- does Nim have something analogous to Go's
         | channels for inter-thread communication?
        
           | goodpoint wrote:
           | Yes, it has channels as a built-in:
           | 
           | https://nim-lang.org/docs/channels.html
           | 
           | Things like Weave abstract away the need to manage threads
           | and messaging explicitly.
        
       | ur-whale wrote:
       | How does it compare with a bare-metal C++ threadpool?
        
         | judofyr wrote:
         | What is a bare-metal C++ threadpool? Is that a specific library
         | you're referring to?
        
         | mcnichol wrote:
         | You've been spending too much time arguing over virtualization
         | or bare metal.
         | 
         | You are at the wrong level of abstraction.
        
           | thechao wrote:
           | Weellll... I guess you could get rid of the OS and switch to
           | a stack-only ABI with a callee saves-everything policy? Then
           | "thread switching" and "multiprocess" both map onto
           | saving/restoring the stack (on ARM you'd have to have SP/LR
           | motion to stack), and process communication would just be
           | over a unified stack.
           | 
           | What'd be cool is if _every_ core (and, thus, affinitized
            | process) had its own virtual address space. That way if any
           | _single_ core went down, its neighbors could stand it back
            | up. Of course, you'd have to have some sort of hideous last-
           | level shared virtual addressing scheme -- probably with large
           | last level pages -- to "migrate" a "process" between cores.
           | 
           | EDIT: And security be damned! We need that last .5%!
        
             | dragontamer wrote:
             | GPUs are practically the "OS-free" compute device of choice
             | these days.
             | 
             | The way GPUs work in most cases is through work-queues. The
             | CPU submits work to GPUs (either OpenCL queues or CUDA
             | Streams), and the GPU "plucks" work off of the queue to do.
             | There are dozens of SMs (NVidia) or CUs (AMD GCN) or WGPs
             | (AMD RDNA) that pluck the work off concurrently.
             | 
             | The kernels are then run to completion. Upon completion,
              | the stream gets an event that triggers (which oftentimes
             | automatically enqueues another task). If the CPU has work
             | to do, it can get an async message (or maybe be waiting
             | through blocking behavior).
             | 
             | --------
             | 
             | There are other models of course, not everyone runs the
             | standard methodology. "Persistent kernels" start up early
             | on and infinite-loop. You interact with those by passing
             | data over PCIe and the kernel itself contains a queue/load-
             | balancing logic to pickup the data somehow (rather than
             | letting the device drivers / hardware logic do that)
             | 
             | ------
             | 
             | The thing about GPU-level tasks is that you have so many
             | little tasks (ex: shoot a ray, or render pixel (0,0),
             | render pixel(1,0), etc. etc.) that just pulling tasks off
             | of a queue is your most efficient methodology. Lots of
             | small, roughly equal-sized tasks can be load balanced with
             | simple methodologies.
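              | 
              | As a rough CPU-side analogy of that submit-queue model (a
              | made-up Rust sketch, not GPU code): one side enqueues small
              | tasks, a few workers pluck them off concurrently, and
              | completions come back on a second channel.
              | 
              |     use std::sync::{mpsc, Arc, Mutex};
              |     use std::thread;
              | 
              |     fn main() {
              |         // Submission queue: tasks go in here...
              |         let (task_tx, task_rx) = mpsc::channel::<u32>();
              |         let task_rx = Arc::new(Mutex::new(task_rx)); // shared by workers
              |         // ...completion queue: results come back out here.
              |         let (done_tx, done_rx) = mpsc::channel::<u32>();
              | 
              |         // Workers pluck tasks off the shared queue concurrently.
              |         let workers: Vec<_> = (0..4)
              |             .map(|_| {
              |                 let rx = Arc::clone(&task_rx);
              |                 let done = done_tx.clone();
              |                 thread::spawn(move || loop {
              |                     let task = rx.lock().unwrap().recv();
              |                     match task {
              |                         Ok(t) => done.send(t * t).unwrap(),
              |                         Err(_) => break, // queue closed and drained
              |                     }
              |                 })
              |             })
              |             .collect();
              |         drop(done_tx);
              | 
              |         for t in 0..16 {
              |             task_tx.send(t).unwrap(); // "submit work"
              |         }
              |         drop(task_tx); // no more submissions
              | 
              |         // Completions arrive as each little task finishes.
              |         let results: Vec<u32> = done_rx.iter().collect();
              |         assert_eq!(results.len(), 16);
              |         for w in workers {
              |             w.join().unwrap();
              |         }
              |     }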
        
       | StreamBright wrote:
        | Is there an IoC / CSP (a la core.async in Clojure) in Zig yet? Is
        | it planned to have something like this? I could not find much
        | about it online.
        
         | matu3ba wrote:
         | The former (Inversion of Control) is a general method and too
         | unspecific to answer. I can only hint that one of the biggest
         | caveats of async is brittle composability due to missing
         | cancellation routines. This will eventually be addressed.
         | 
         | The latter (communicating sequential processes) is a formal
         | language, but it looks like it is for IPC:
         | https://arild.github.io/csp-presentation/#1
        
       | scott_s wrote:
       | I designed and implemented a mostly lock-free dynamic thread
       | scheduler for streaming runtimes, and I learned some similar
       | lessons: avoid global data and amortize the necessary
       | synchronization that you have to do. One of the main
       | peculiarities of a streaming context is that work-stealing is
       | counter-productive. In a streaming context, it's more like
        | _cutting in_ when the work would be done anyway. It's better to
       | go find a part of the streaming graph that is not currently being
       | executed than to steal some from another thread.
       | 
       | The paper describing my design is "Low-Synchronization, Mostly
       | Lock-Free, Elastic Scheduling for Streaming Runtimes",
       | https://www.scott-a-s.com/files/pldi2017_lf_elastic_scheduli....
       | The source code for the product implementation is now open
       | source. Most is in
       | https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP...
       | and
       | https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP....
        
       | posharma wrote:
       | How does this compare to OpenMP? Does it support nested
       | parallelism? (disclaimer: haven't read the post thoroughly).
        
       | jorangreef wrote:
       | Incredible post. Ticks all the boxes and exciting to see this
       | coming to Zig. Protty has got to be a domain specialist when it
       | comes to atomics and threads.
        
         | vanderZwan wrote:
         | If they weren't before, they probably are by now after all the
         | work put into this
        
       | gpderetta wrote:
       | To be very pedantic, Vyukov MPSC queue, while indeed awesome in
       | its simplicity and performance, is not technically lock-free, not
       | even obstruction free (and Vyukov never claimed it as such [1]):
       | a producer halting between the atomic swap and atomic store will
       | prevent further nodes enqueued by other producers from ever being
       | dequeued by the consumer.
       | 
       | [1] edit: and in fact there is a note exactly on this on the
       | linked Vyukov page.
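        | 
        | For concreteness, a minimal, non-production Rust sketch of the
        | shape of the algorithm (single-threaded demo only; remaining
        | nodes leak on drop). The window between the swap and the store
        | is exactly the one described above:
        | 
        |     use std::ptr;
        |     use std::sync::atomic::{AtomicPtr, Ordering};
        | 
        |     struct Node {
        |         next: AtomicPtr<Node>,
        |         value: u64,
        |     }
        | 
        |     struct Queue {
        |         tail: AtomicPtr<Node>, // producers swap themselves in here
        |         head: *mut Node,       // consumer-only; starts at a stub node
        |     }
        | 
        |     impl Queue {
        |         fn new() -> Self {
        |             let stub = Box::into_raw(Box::new(Node {
        |                 next: AtomicPtr::new(ptr::null_mut()),
        |                 value: 0,
        |             }));
        |             Queue { tail: AtomicPtr::new(stub), head: stub }
        |         }
        | 
        |         // Callable from any number of producers.
        |         fn push(&self, value: u64) {
        |             let node = Box::into_raw(Box::new(Node {
        |                 next: AtomicPtr::new(ptr::null_mut()),
        |                 value,
        |             }));
        |             // Step 1: become the new tail.
        |             let prev = self.tail.swap(node, Ordering::AcqRel);
        |             // A producer preempted *here* leaves everything pushed
        |             // after it unreachable from the consumer until it
        |             // resumes: that is the non-lock-free window.
        |             // Step 2: link the old tail to the new node.
        |             unsafe { (*prev).next.store(node, Ordering::Release) };
        |         }
        | 
        |         // Single consumer. May spuriously report empty while a
        |         // producer sits between step 1 and step 2.
        |         fn pop(&mut self) -> Option<u64> {
        |             unsafe {
        |                 let next = (*self.head).next.load(Ordering::Acquire);
        |                 if next.is_null() {
        |                     return None;
        |                 }
        |                 drop(Box::from_raw(self.head)); // retire old head/stub
        |                 self.head = next;
        |                 Some((*next).value)
        |             }
        |         }
        |     }
        | 
        |     fn main() {
        |         let mut q = Queue::new();
        |         q.push(1);
        |         q.push(2);
        |         assert_eq!(q.pop(), Some(1));
        |         assert_eq!(q.pop(), Some(2));
        |         assert_eq!(q.pop(), None);
        |     }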
        
         | kprotty wrote:
         | I'm aware that it's technically considered blocking due to a
         | producer being preempted between swap and store causing the
         | consumer to see an invalid link and report empty. I address
         | this in the run queue section by noting that it's OK if empty
         | is spuriously reported since the thread waking/waiting
         | mechanism takes care of rescheduling such a case.
        
       | feffe wrote:
        | A pet peeve of mine is calling something that uses atomic
        | operations lockless. It's very common, but once it clicks that
        | atomic operations are just locks at the instruction level, it
        | feels very wrong to call most such algorithms lockless. There are
        | still the same concerns about avoiding contention when using
        | atomic operations as with regular mutexes. A mutex lock in the
        | uncontended case never enters the kernel and is just an atomic
        | operation... The main strategy, regardless of whether you use
        | locks or atomics directly, is to avoid contention.
        | 
        | The only truly lockless algorithm that I'm aware of (for most
        | CPU archs) is RCU in the read case.
        
         | OskarS wrote:
         | "Lock free" has a very precise meaning in computer science, and
         | it's this: "given some number of threads of execution, an
         | algorithm is lock free if in some finite number of steps, at
         | least one thread is guaranteed to make forward progress". When
          | using mutexes, this is not guaranteed under any amount of
         | contention: the scheduler is perfectly free to preempt a thread
         | holding a lock, which means no threads make progress.
         | 
         | However, if you use a CAS loop (like this code does) to update
         | some value (i.e. a while loop that repeatedly tries to execute
         | a compare-and-swap), that loop will only fail to make progress
         | if some other thread DID make progress. This means that such an
         | algorithm is in fact lock-free: if thread X didn't progress, it
         | is necessarily because some other thread did. In addition, a
         | scheduler preempting a thread at any point will not stop other
         | threads making progress.
         | 
         | Whether an algorithm is lock free or not is not a matter of
         | opinion, it is a mathematical statement. The fact that it
         | annoys you is... weird.
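          | 
          | To make that concrete, here is a small Rust sketch of such a CAS
          | loop (counter and thread counts are made up, not from the
          | article): a failed compare-exchange always means another thread
          | succeeded in the meantime.
          | 
          |     use std::sync::atomic::{AtomicU64, Ordering};
          |     use std::sync::Arc;
          |     use std::thread;
          | 
          |     // Lock-free increment: retry until our CAS wins; every failed
          |     // attempt implies some other thread made progress.
          |     fn increment(counter: &AtomicU64) {
          |         let mut current = counter.load(Ordering::Relaxed);
          |         loop {
          |             match counter.compare_exchange_weak(
          |                 current,
          |                 current + 1,
          |                 Ordering::AcqRel,
          |                 Ordering::Relaxed,
          |             ) {
          |                 Ok(_) => return,
          |                 Err(observed) => current = observed, // lost the race; retry
          |             }
          |         }
          |     }
          | 
          |     fn main() {
          |         let counter = Arc::new(AtomicU64::new(0));
          |         let handles: Vec<_> = (0..4)
          |             .map(|_| {
          |                 let c = Arc::clone(&counter);
          |                 thread::spawn(move || {
          |                     for _ in 0..100_000 {
          |                         increment(&c);
          |                     }
          |                 })
          |             })
          |             .collect();
          |         for h in handles {
          |             h.join().unwrap();
          |         }
          |         assert_eq!(counter.load(Ordering::Relaxed), 400_000);
          |     }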
        
           | feffe wrote:
            | Yes, I'm probably wrong on the semantics of lock free. I
            | should read up on that. My point was that maybe you can gain
            | a factor of 2 by using atomic ops in a "lockless" fashion
            | compared to mutexes, but it still scales extremely poorly
            | compared to what you could do if the hardware had real async
            | mechanisms that did not rely on cache coherency. The cache
            | line that an atomic operation touches will be evicted
            | from all other CPUs' caches, back and forth.
            | 
            | A simple example is just letting all cores in a multi-core CPU
            | do an atomic add at the same memory address. The scaling will
            | be exactly 1 regardless of the number of available cores. I
            | realize this is very off topic wrt this cool article and
            | implementation in Zig. It's just that this problem can't
            | really be solved in an efficient manner with today's hardware.
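            | 
            | To put a rough shape on that, a small Rust sketch (loop counts
            | made up) contrasting one shared atomic counter, whose cache
            | line ping-pongs between cores, with per-thread counters that
            | are only combined at the end:
            | 
            |     use std::hint::black_box;
            |     use std::sync::atomic::{AtomicU64, Ordering};
            |     use std::thread;
            |     use std::time::Instant;
            | 
            |     static SHARED: AtomicU64 = AtomicU64::new(0);
            | 
            |     fn main() {
            |         let threads = 8;
            |         let iters = 10_000_000u64;
            | 
            |         // Every thread hammers the same cache line.
            |         let t0 = Instant::now();
            |         thread::scope(|s| {
            |             for _ in 0..threads {
            |                 s.spawn(|| {
            |                     for _ in 0..iters {
            |                         SHARED.fetch_add(1, Ordering::Relaxed);
            |                     }
            |                 });
            |             }
            |         });
            |         println!("shared atomic: {:?}", t0.elapsed());
            | 
            |         // Each thread counts privately; totals are merged once.
            |         let t0 = Instant::now();
            |         let total: u64 = thread::scope(|s| {
            |             let handles: Vec<_> = (0..threads)
            |                 .map(|_| {
            |                     s.spawn(|| {
            |                         let mut local = 0u64;
            |                         for _ in 0..iters {
            |                             local += black_box(1);
            |                         }
            |                         local
            |                     })
            |                 })
            |                 .collect();
            |             handles.into_iter().map(|h| h.join().unwrap()).sum()
            |         });
            |         println!("per-thread: {:?} (total {})", t0.elapsed(), total);
            |     }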
        
         | gpderetta wrote:
         | Atomic instructions are not locks on the instruction level (the
         | lock prefix on x86 is just an historical artefact).
         | 
          | edit: also, lock-free doesn't have much to do with locks.
        
           | feffe wrote:
           | It's a lock associated with the cache line that the atomic
           | operation operates on. This is because they are built on top
           | of the cache coherency mechanism, a synchronous blocking
           | operation on the hardware level to implement an asynchronous
           | mechanism on the software level.
           | 
           | The big frustration with today's multi core CPUs is that
            | there's simply no efficient way to communicate using message
           | passing mechanisms. I think this is something the hardware
           | guys should focus on :-) Provide an async mechanism to
           | communicate between cores, not relying on the cache
           | coherency.
        
             | gpderetta wrote:
             | There is no lock associated with the cache line.
             | 
             | The coherency protocol guarantees that a core can own in
             | exclusive mode a cache line for a bounded number of cycles,
             | this guarantees forward progress and it is different from
             | an unbounded critical section. It has also nothing to do
             | with the lock prefix and also applies to normal non atomic
             | writes.
             | 
             | What the lock prefix does is delay the load associated with
             | the RMW so that it is executed together with the store
             | before the core has a chance to lose the ownership of the
             | line (technically this can also be implemented
             | optimistically with speculation and replays, but you still
             | need a pessimistic fallback to maintain forward progress).
             | 
             | A message passing feature would be nice, but it would be
             | necessarily non coherent which means you can only use it to
             | pass serialised values, not pointers to other threads.
             | 
             | An alternative more usable solution would be to
             | aggressively speculate around memory barriers as the memory
             | subsystem is already highly asynchronous.
        
               | feffe wrote:
                | The effect is the same. If someone touches the cache
                | line, it's evicted from all other caches that hold it,
                | triggering a cache miss when other cores touch it.
                | Everyone knows this. I just think it's a bit depressing
                | that if you try to optimize message passing on many-core
                | CPUs you'll realize you can't make it fast. No-one has
                | been able to make it fast (I've checked Intel message
                | passing code as well). If you get more than 20 Mevents
                | through some kind of shared queue, you are lucky. That is
                | slow if you compare it to how many instructions a CPU can
                | retire.
                | 
                | So all these event loop frameworks try to implement async
                | behaviour on top of a hardware mechanism that is
                | synchronous at its core. The main issue is that the
                | lowest software construct, the queue used to communicate,
                | is built on this cache line ping-pong match. What the
                | software wants is still a queue, and you could still send
                | pointers if the memory they point to has been committed
                | to memory by the time the receiving core sees them. Just
                | remove the inefficient atomic-operations way of
                | synchronizing the data between CPUs. Send it over some
                | kind of network pipe instead :-) As you say, the memory
                | subsystem is really asynchronous.
                | 
                | I'm convinced this is coming; it should just have arrived
                | 10 years ago...
        
               | gpderetta wrote:
               | >What the software want it still a queue, you could still
               | send pointers if the memory they point to has been
               | committed to memory when the receiving core see them
               | 
               | Which means that the sender will need to pessimistically
               | flush anything that the receiver might need to access,
               | which is likely more expensive than optimistic cache
               | coherency.
               | 
               | Non-CC systems were a thing in the past, and still are
               | for things like GPUs and distributed systems, but are
               | much harder to program and not necessarily more
               | efficient.
               | 
               | As the saying goes there are only two hard things in
               | programming...
        
       | omazurov wrote:
       | Increasing the size of the array 10x (100_000_000) and filling a
       | glaring omission:
       | 
        | Go (go1.17.1 darwin/amd64)
        |     took 5.591593544s
        |     took 5.285948722s
        |     took 5.218750076s
        |     took 5.239224787s
        |     took 5.131232207s
        | 
        | Zig (0.9.0-dev.959+f011f1393)
        |     took 3.48s
        |     took 3.37s
        |     took 3.44s
        |     took 3.43s
        |     took 3.49s
        | 
        | Java (openjdk version "17-ea")
        |     Time: 2684 ms
        |     Time: 2654 ms
        |     Time: 2840 ms
        |     Time: 2676 ms
        |     Time: 2649 ms
       | 
       | MacBook Pro Early 2013, 2.7 GHz Quad-Core Intel Core i7, 16 GB
       | 1600 MHz DDR3
        
         | gavinray wrote:
          | Is this using the JVM/JIT, or using Graal to make a native-image
          | binary and calling that? You'd get better results with the
          | latter by shaving off startup time, if the timing includes it.
         | 
         | I think .NET 6 latest preview would slightly beat out Java here
         | as well.
         | 
         | This isn't exactly apples-to-apples since Zig/Rust/Go are
         | considered "systems languages", but the JVM has GraalVM for
         | compiling to native binaries/libraries and even
         | importing/exporting C functions and structs. And .NET of course
         | has .NET Native + "Dotnet Native Exports", including the LLVM
          | experiment where it uses LLVM bitcode instead of RyuJIT.
         | 
         | So you can make the argument that writing a native binary or a
         | library which exported this benchmark function as a C-callable
          | method with a C header in each language would technically be
          | equivalent.
         | 
         | The JVM and .NET ones would have a large size (several MB each)
         | but would otherwise still fill this requirement.
        
       | egberts1 wrote:
       | This is the first four-state-machine thread algorithm.
       | 
       | And I like what I am seeing. Hopefully all transitions between
       | states have been tested.
       | 
       | Good work.
        
         | matu3ba wrote:
          | Drawing them could help figure out all (invalid) state
         | transitions. Certainly would make a cool drawing.
        
       | Jarred wrote:
       | Running the benchmarks locally (zig is fastest, but that's
       | probably expected)
       | 
        | zig (~1 week older than HEAD):
        |     ./zig-out/bin/qsort
        |     filling
        |     shuffling
        |     running
        |     took 177.46ms
        | 
        | rust nightly:
        |     cargo run --release --bin qsort
        |     warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
        |     See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
        |     Finished release [optimized] target(s) in 0.02s
        |     Running `target/release/qsort`
        |     filling
        |     shuffling
        |     running
        |     took 896.91656ms
        | 
        |     cargo run --release --bin rsort
        |     warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
        |     See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
        |     Finished release [optimized] target(s) in 0.02s
        |     Running `target/release/rsort`
        |     filling
        |     shuffling
        |     running
        |     took 212.190694ms
        | 
        | go 1.16.3:
        |     go run qsort.go
        |     filling
        |     shuffling
        |     running
        |     took 222.90356ms
       | 
       | on macOS 11.5.2 with a 2.4 GHz 8-Core Intel Core i9
        
         | Arnavion wrote:
         | The fact that the Rust tokio version (the one that uses tokio
         | tasks instead of threads) is slow is to be expected. tokio
         | tasks aren't appropriate for running quicksort; they will have
         | overhead compared to a regular threadpool because of I/O
         | reactor, waker etc code that will never get used.
         | 
         | rayon or crossbeam are more appropriate since they are actually
         | designed for general-purpose thread work. Using rayon's scopes
         | will also get rid of the `unsafe` that's being (ab)used to
         | create the fake `&'static mut` slices.
         | 
         | Though for some reason, the rayon version (rayon_qsort.rs) uses
         | scopes, but still uses `unsafe` to create `&'static mut`
         | slices... ( https://github.com/kprotty/zap/pull/3 )
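          | 
          | For reference, a minimal sketch of what the scoped version can
          | look like (hypothetical helpers, not the repo's code):
          | rayon::scope lets the spawned closures borrow `low`/`high` from
          | the caller's stack, so no fake `&'static mut` is needed.
          | 
          |     // Assumes the `rayon` crate is added to Cargo.toml.
          | 
          |     fn quick_sort(slice: &mut [u64]) {
          |         if slice.len() <= 1 {
          |             return;
          |         }
          |         let p = partition(slice);
          |         let (low, rest) = slice.split_at_mut(p);
          |         let high = &mut rest[1..]; // skip the pivot itself
          |         // Both halves borrow from this stack frame; the scope
          |         // blocks until both spawned tasks are done.
          |         rayon::scope(|s| {
          |             s.spawn(move |_| quick_sort(low));
          |             s.spawn(move |_| quick_sort(high));
          |         });
          |     }
          | 
          |     // Plain Lomuto partition; returns the pivot's final index.
          |     fn partition(slice: &mut [u64]) -> usize {
          |         let pivot = slice.len() - 1;
          |         let mut store = 0;
          |         for i in 0..pivot {
          |             if slice[i] < slice[pivot] {
          |                 slice.swap(i, store);
          |                 store += 1;
          |             }
          |         }
          |         slice.swap(store, pivot);
          |         store
          |     }
          | 
          |     fn main() {
          |         // Deterministic pseudo-random input, no external crates.
          |         let mut x: u64 = 0x243F6A8885A308D3;
          |         let mut v: Vec<u64> = (0..100_000)
          |             .map(|_| {
          |                 x = x.wrapping_mul(6364136223846793005)
          |                     .wrapping_add(1442695040888963407);
          |                 x >> 33
          |             })
          |             .collect();
          |         quick_sort(&mut v);
          |         assert!(v.windows(2).all(|w| w[0] <= w[1]));
          |     }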
        
           | gefhfff wrote:
           | Can you explain that Tokio vs crossbeam vs rayon a bit more?
           | Should I default to rayon or crossbeam because of that?
        
             | ModernMech wrote:
             | It depends on what you want to do. If you are doing io-
             | bound work, Tokio would be what you want -- you would use
             | it as a runtime for the async capabilities in Rust. If you
             | have cpu-bound work, then rayon is what you want to use.
             | Rayon is a work-stealing parallelism crate -- it will
             | schedule work to be done, and different threads will
             | schedule portions of it as they become available to do
             | work. It's very easy to get 100% CPU utilization across all
             | cores use Rayon if your work is naturally paralellizable,
             | and the interface is dead simple: anywhere you do a
             | .iter(), you can turn it into a .par_iter(), and Rayon will
             | parallelize the work.
             | 
             | Note there is some overhead to using Rayon -- you normally
             | will be better off doing your work on a single thread
             | unless you have a large number of elements in your
             | collection... I found for my application I needed more than
             | 1e6 elements before I saw an appreciable performance
             | benefit to using Rayon.
             | 
             | As others said, Crossbeam is for sending and receiving
             | messages across threads. I use it alongside of tokio and
             | rayon.
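              | 
              | A tiny sketch of that .par_iter() point (toy numbers, and it
              | assumes the `rayon` crate is in Cargo.toml):
              | 
              |     use rayon::prelude::*;
              | 
              |     fn main() {
              |         let v: Vec<u64> = (0..10_000_000).collect();
              | 
              |         // Sequential pipeline...
              |         let sequential: u64 = v.iter().map(|&x| x % 7).sum();
              | 
              |         // ...same pipeline with .iter() swapped for .par_iter():
              |         // rayon's work-stealing pool splits it across all cores.
              |         let parallel: u64 = v.par_iter().map(|&x| x % 7).sum();
              | 
              |         assert_eq!(sequential, parallel);
              |     }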
        
             | YorickPeterse wrote:
             | Crossbeam is a library that provides various concurrency
             | primitives, such as a queue type that allows stealing of
             | work.
             | 
             | Rayon is a library for running code in parallel. Typically
             | you'll give it a (parallel) iterator of sorts, and it will
             | distribute the work across a pool of threads.
             | 
             | Tokio is a library for async IO operations.
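              | 
              | A small sketch of the work-stealing primitive in particular
              | (assumes the `crossbeam-deque` crate; the rest of the names
              | are made up):
              | 
              |     use crossbeam_deque::{Steal, Worker};
              |     use std::thread;
              | 
              |     fn main() {
              |         // The owning thread pushes/pops on its local queue;
              |         // other threads get Stealers into it.
              |         let local = Worker::new_fifo();
              |         let stealer = local.stealer();
              | 
              |         for task in 0..8 {
              |             local.push(task);
              |         }
              | 
              |         let thief = thread::spawn(move || {
              |             let mut stolen = Vec::new();
              |             loop {
              |                 match stealer.steal() {
              |                     Steal::Success(task) => stolen.push(task),
              |                     Steal::Empty => break,
              |                     Steal::Retry => continue,
              |                 }
              |             }
              |             stolen
              |         });
              | 
              |         // The owner drains whatever the thief didn't take.
              |         let mut ran_locally = Vec::new();
              |         while let Some(task) = local.pop() {
              |             ran_locally.push(task);
              |         }
              | 
              |         let stolen = thief.join().unwrap();
              |         assert_eq!(ran_locally.len() + stolen.len(), 8);
              |     }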
        
             | knuthsat wrote:
             | What about smol?
        
             | nicoburns wrote:
              | Tokio is designed for concurrent I/O; rayon or tokio-
              | threadpool for concurrent CPU-bound tasks.
        
           | kprotty wrote:
           | I don't think tokio's slowness here is "to be expected".
           | There isn't much reason for tokio tasks to have that much
           | overhead over normal thread pools. The I/O driver shouldn't
           | be called in such a benchmark given there's no I/O work
            | happening. Wakers only add a reference count inc/dec + an
            | atomic CAS for wake(), which should only happen on the
            | JoinHandle `await`s [4], compared to just an atomic swap on
            | the join handle in the Zig case.
           | 
           | Golang doesn't poll for I/O under such cases [0] and tokio
           | should be using the `ParkThread` parker for putting threads
           | to sleep [1] given `net` features aren't enabled (not sure if
            | this is actually the case), which you can force with a
           | custom Runtime initialization instead of `#[tokio::main]` as
           | an exercise.
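            | 
            | For illustration, a hand-built runtime along those lines
            | (worker count picked arbitrarily): with no .enable_io() /
            | .enable_time(), no I/O reactor or timer driver runs next to
            | the workers, unlike the defaults you get from
            | `#[tokio::main]`.
            | 
            |     use tokio::runtime::Builder; // "rt-multi-thread" feature
            | 
            |     fn main() {
            |         let rt = Builder::new_multi_thread()
            |             .worker_threads(8)
            |             .build()
            |             .unwrap();
            | 
            |         rt.block_on(async {
            |             let handle = tokio::spawn(async { 6 * 7 });
            |             assert_eq!(handle.await.unwrap(), 42);
            |         });
            |     }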
           | 
           | `crossbeam-deque` requires heap allocation for the run
           | queues, heap allocates on growth, and garbage collects said
           | memory. This is an overhead I wished to avoid and is
            | something tokio has been making improvements to avoid as well
           | [2].
           | 
           | `rayon` isn't a good comparison here given `rayon::join` is
           | optimized to hook directly into the scheduler and run the
           | caller only until the other forked-section completes [3].
           | This isn't general purpose and it takes advantage of
           | unbounded stack allocation which can technically cause a
           | stack overflow. Zig could do this and also take advantage of
           | batch scheduling, but it complicates the code and is unfair
           | here given `async` usage. Tokio, golang, and the Zig
           | benchmarks require heap allocation on spawn so I believe it
           | makes it a fairer comparison. This is also why I used rayon
           | scopes instead of join(): less specialization and
           | reintroduced the heap allocation from the unbounded
           | concurrency.
           | 
           | The `unsafe` there is from me copying the benchmark code from
           | the tokio version to the rayon version and forgetting to
           | remove the hack. In tokio, ownership of the array needed to
           | be passed into the function given the lifetime was no longer
            | linear from the spawn(), I assume (correct me if I'm wrong
           | here, but this is what the compile error hinted at). So I
           | needed to recreate the array after the function, hence
           | unsafe. If there's a better way for the tokio version, please
           | send a PR. I see you've done so for the rayon version and I
           | gladly merged it.
           | 
           | [0]: See `atomic.Load(&netpollWaiters) > 0` in
           | https://golang.org/src/runtime/proc.go
           | 
           | [1]: https://github.com/tokio-
           | rs/tokio/blob/master/tokio/src/runt...
           | 
           | [2]: https://tokio.rs/blog/2019-10-scheduler#a-better-run-
           | queue
           | 
           | [3]: https://docs.rs/rayon-
           | core/1.9.1/src/rayon_core/join/mod.rs....
           | 
           | [4]: https://github.com/tokio-
           | rs/tokio/blob/98578a6f4a494e709f000...
        
             | skohan wrote:
             | Honestly tokio is so complex, and it serves so many use-
             | cases it's hard for me to believe it could ever be truly
             | optimal for any one use-case.
        
               | kristoff_it wrote:
               | Isn't Go's scheduler just as versatile?
        
               | skohan wrote:
               | I'm not an expert on the go scheduler, but my perception
               | is that it is more of a focused single-purpose component
               | whereas tokio seems like a sprawling swiss-army-knife of
               | a library if you browse the source
        
               | bbatha wrote:
               | The tokio scheduler and the go scheduler are roughly
                | equivalent. Much of tokio's bulk is reimplementing much of
               | the std lib in an async compatible way (io, os,
               | blocking).
        
               | tick_tock_tick wrote:
               | It's much more versatile but also has many more years of
               | optimization.
        
             | Arnavion wrote:
             | >I don't think tokio's slowness here is "to be expected".
             | 
             | And yet on my machine the rayon version takes ~160ms and
              | the tokio version takes ~1350ms. This isn't at the level of
             | some minor missed performance optimization.
             | 
             | >There isn't much reason for tokio tasks to have that much
             | overhead over normal thread pools.
             | 
             | tokio is an async runtime. tokio tasks are meant to be for
             | distributing I/O-bound work. It would be at least a little
             | more correct to use spawn_blocking for CPU-bound tasks, but
             | that still doesn't work for your recursive calls because
             | that's not what it's meant for.
             | 
             | In general, if you have CPU-bound work in your tokio-using
             | application, you run it on a different threadpool - tokio's
             | blocking one, or a completely different one.
             | 
             | >`rayon` isn't a good comparison here given `rayon::join`
             | is optimized to hook directly into the scheduler and run
             | the caller only until the other forked-section completes
             | [3]. [...] This is also why I used rayon scopes instead of
             | join(): less specialization and reintroduced the heap
             | allocation from the unbounded concurrency.
             | 
             | My original comment was also talking about scopes, not
             | `rayon::join`. So yes, `rayon` is absolutely a good
             | comparison.
        
         | pizza234 wrote:
         | I get inconsistent results, both in terms of platform and
          | language. The benchmark may be flawed, so to be prudent, no
          | conclusions should be drawn.
         | 
         | Running on Linux (zig -Dc), rounded average of 3 runs.
         | 
          | 8-thread Intel laptop machine:
          |     zig 0.9.0-dev.958:                 945
          |     rust 1.5.0-nightly 21-09-12/qsort: 545
          |     rust 1.5.0-nightly 21-09-12/rsort: 255
          |     go 1.16.8:                         355
          | 
          | 32-thread AMD workstation:
          |     zig 0.9.0-dev.958:                 410
          |     rust 1.5.0-nightly 21-09-12/qsort: 780
          |     rust 1.5.0-nightly 21-09-12/rsort: 135
          |     go 1.16.8:                         135
         | 
          | Additionally, I haven't done a thorough review, but enabling
          | the commented code:
          |     // Optimized version (hooks directly into the scheduler)
          |     // rayon::join(
          |     //     || quick_sort(low),
          |     //     || quick_sort(high)
          |     // );
         | 
         | yields a 110ms run on the faster machine.
        
           | kprotty wrote:
           | I'm not sure how this implies it is flawed. It benchmarks
            | thread pools, so on a system which allows more parallel
           | concurrent tasks (i.e. 32t amd cpu) the throughput is
           | expected to scale somewhat, and you see this in your results.
           | 
           | Also, `zig -Dc` links to libc (`-Dc`) but builds in debug
           | mode by default, similar to Rust and C/C++ compilers. The
           | readme contains instructions to compile it with optimizations
            | (`-Drelease-fast`); the rust versions do so as well
            | (`--release`), so you should re-evaluate your results.
           | 
           | See one of my other comments in this post on why I didn't use
           | rayon::join() by default.
        
         | rdsubhas wrote:
         | At first glance, this looks great, almost 20% speedup!
         | 
         | But genuine question, this looks to be how almost every new
         | tech begins. Initially it's FAST because it keeps things as
         | barebones as possible, and everything else other than the
         | specific module can be ignored.
         | 
         | But no feature works in isolation, so as more features get
         | added - scheduling constraints, safety quirks, or random
         | features that cut across the language - the 20% looks like
         | nothing.
         | 
         | I did go through the blog, but how can one be sure that this
         | 20% gap is a guaranteed design advantage, and not just a
         | temporary deferral of everything else?
        
           | dnautics wrote:
           | > how can one be sure that this 20% gap is a guaranteed
           | design advantage
           | 
           | It's no guarantee, but in my experience, Zig is a language
           | that lends itself very well to composability. I imagine that
           | instead of adding features to stdlib, anyone who needs "extra
           | features" in their threading system will find it quite easy
           | to implement their own as a library outside of stdlib that
           | can easily be swapped in and out in A/B tests for the library
           | consumer. If some feature or another is REALLY wanted in the
           | stdlib for some reason it would probably be easy to drop it
           | behind a comptime-flag as part of a type constructor
            | function, which won't compromise (runtime) performance, with
            | at most a teeny tiny hit to comptime performance.
        
           | kristoff_it wrote:
           | Nobody can predict the future, but one thing that might help
           | is that Protty is in the Zig core dev team and, as the post
           | shows, he really cares about efficiency.
        
           | jorangreef wrote:
           | > this looks to be how almost every new tech begins
           | 
           | On second thought, do you think this is true? The level of
           | detail shown in this post is fairly extraordinary for event
           | loops just getting started.
           | 
           | > how can one be sure that this 20% gap is a guaranteed
           | design advantage
           | 
           | I would say because the design is solid and recognizes
           | principles of mechanical sympathy that are fundamental and
           | unlikely to change for quite some time.
        
           | skohan wrote:
           | fwiw Zig is taking its role as a low-level, bare bones "c
           | replacement" as a core design constraint. For instance zig
            | has eschewed even fairly tame features like operator
           | overloading because they're not considered in line with Zig's
           | identity as a minimal, explicit language.
           | 
           | So feature creep is always a concern, but from what I
           | understand Zig has a better chance at avoiding this than
           | other projects due to the language's guiding philosophy.
        
           | adamrezich wrote:
           | it seems like the design philosophy of zig is such that it's
           | not going for feature creep/bloat, it's keeping the language
           | very simple and eschewing unexpected complexity
        
         | erichdongubler wrote:
         | Does this already use `target-cpu=native`? Not sure if that
         | would be apples-to-apples with the Zig implementation (which is
         | what's important here), but I'd be surprised if it didn't yield
         | some interesting wins.
        
         | adenozine wrote:
         | Rust is such a bloated, insane design-by-committee language.
         | It's actually still surprising to me that corporations are
         | defending it and using it. I sure won't.
         | 
         | It's also just so dang ugly to read. If Perl and C++ had a
         | mutant offspring, I think it'd be Rust. And that's just not
         | respectful to Perl!
         | 
         | I'm really happy that Zig is doing so well. I'm planning on
         | contributing a little chunk of money to the ZSF whenever they
         | get a 1.0 landed. I do think there are some lessons to learn
         | from Rust about formal methods as foundational building blocks
         | in a language. I'm eager to see Zig mature more around the
         | memory-safety side of things.
        
           | pcwalton wrote:
           | There was no "design committee" for Rust.
           | 
           | I'm increasingly skeptical that any language with lifetimes
           | can be anything other than "ugly to read". People complain
           | about syntax, but what they're really complaining about is
           | semantics.
        
         | [deleted]
        
         | komuW wrote:
         | When you do `go run qsort.go` you are also timing the time
         | taken by Golang to build/compile the code. I suspect the same
         | applies to `cargo run`
         | 
          | You should do something like: `go build . && time ./qsort`
        
           | masklinn wrote:
           | The same occurs with Rust, we can even see that the no-op
           | recompilation takes about 20ms.
           | 
           | It's obviously just a rough estimate / comparison otherwise
           | it'd be using hyperfine or something along those lines.
        
           | Arnavion wrote:
           | The "took #ms" line is output of the program. It does not
           | include compilation time.
        
         | stewbrew wrote:
         | Erm, could somebody explain to me why I shouldn't understand
            | this as an argument pro Go given that it is about as fast as
            | the fully optimized versions of Zig and Rust?
        
           | kristoff_it wrote:
           | I think that's a perfectly reasonable way of looking at it.
        
           | jerf wrote:
           | Go's value proposition is that it has good bang-for-the-buck.
           | It's fairly easy to write in. There's easier, but it's fairly
           | easy. It performs fairly well. There are things that perform
           | better, but it's fairly good. It scales fairly well. There's
           | things that scale better, but it's pretty good.
           | 
           | If you draw all this out on the programming landscape, I
           | think one of the reasons Go has succeeded as well as it did
           | is that this was a poorly covered part of the landscape.
           | There were languages that were much easier, but performed
           | much worse, and languages that performed much better, but
           | generally required a lot more developer effort.
           | 
           | I don't expect Go to ever top the charts on performance on a
           | task like this, but it's often hanging around at only a bit
           | slower, and it does that across a wide variety of tasks and
           | on a wide variety of dimensions. It's not the best at much of
           | anything, but it's pretty darned good for pretty cheap on a
           | lot of measures.
        
           | mamcx wrote:
            | Well, working with several tasks is THE thing with Go, right?
            | If it were much worse, Go would fail.
            | 
            | In situations like this, it's important to remember that
            | Rust/Zig are (systems) languages you could use to MAKE "Go",
            | but also a million other different things with different
            | goals.
            | 
            | So another way to look at this is: "If I wanted to design a
            | language with a runtime like Go, these two already get this
            | close!"
        
       | r0f1 wrote:
       | I know this is off-topic, but does this website not look eerily
        | similar to dev.to?
        
         | kristoff_it wrote:
         | Yes, as others have said, it's based on the same open source
         | CMS.
         | 
         | I've created zig.news for people who want to write about Zig
         | but who don't necessarily want to maintain their own blog and
          | do the work necessary to publicize it.
         | 
         | This should hopefully help more people write down their
         | experience and opinions on using Zig and alleviate the problem
         | of having a "blogger aristocracy" who has a much stronger voice
         | than anyone else.
         | 
          | I briefly go over this concept in this 3-minute video:
         | https://www.youtube.com/watch?v=i9pUuj6eiEg
        
         | ibraheemdev wrote:
         | I'm guessing it's using forem, the open-source platform that
         | powers dev.to: https://github.com/forem/forem
        
         | andruby wrote:
         | That's correct. Both are built on Forem. It's in the footer:
         | Built on Forem -- the open source software that powers DEV and
         | other inclusive communities.        Made with love and Ruby on
         | Rails.
         | 
         | https://github.com/forem/forem
        
           | r0f1 wrote:
           | Ah I did not know that. Thanks :)
        
       | randyrand wrote:
        | Please make thread priority a native feature. See GCD.
        
       | kevingadd wrote:
       | It's always great to see a robust new threadpool. Most of the
       | stock ones I've used have horrible characteristics - for example
       | on my Threadripper lots of production apps will spawn 48 or more
       | threadpool threads and the scheduling/throughput characteristics
       | on that are generally pretty dire because they needlessly compete
       | for resources, etc. They tend to allocate garbage for every work
       | item too, which makes them a bad choice for realtime scenarios.
       | It's genuinely frustrating that most processes on my machine have
       | 128 threads or more due to unintelligently scaling off
       | ProcessorCount without an upper limit. For example, GIMP's bad
       | threading policy means that operations like resizing an image
       | take _seconds_ instead of 100ms or less.
       | 
       | I've been using a custom threadpool for a while now but writing
       | your own without the expertise necessary can be a real pain.
        
         | smashed wrote:
          | You can change the number of processors GIMP's GEGL multi-
          | threaded operations will use in the preferences.
         | 
         | https://youtu.be/D2c8BH8q6Yc
         | 
         | The tile size seems to be hard-coded to 128x128.
        
         | eptcyka wrote:
         | I don't think that it's the threadpool's fault that an
         | application uses it incorrectly. Also, I think there are a lot
         | of developers who have not considered that on today's machines,
         | just spawning as many threads as there are cores is the optimal
         | amount of threads in a thread pool for every use case. I
         | wouldn't say it's the case of bad software but rather software
         | that was written for the CPUs of 5 years ago. And generally, I
         | don't think this will be an easy problem to solve due to the
         | variety of heterogeneous and non-heterogeneous topology of
         | modern SoCs. In fact, I don't see this specific threadpool
         | doing anything to optimize for the disparate core clusters of
         | your threadripper, or to acknowledge the disparity between core
         | clusters on the M1.
        
           | matu3ba wrote:
           | > I don't see this specific threadpool doing anything to
           | optimize for the disparate core clusters of your
           | threadripper, or to acknowledge the disparity between core
           | clusters on the M1.
           | 
            | To do that efficiently you need to pin thread groups to cores
            | based on information about data usage. This smells like
            | over-optimizing for specific architectures to me, if you want
            | to go beyond separating stuff like hyper-threads on io. An
            | additional annoyance: there is no POSIX way to distinguish
            | hyperthreads from physical cores.
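            | 
            | A pinning sketch for reference (assumes the third-party
            | core_affinity crate; note the OS reports SMT siblings as
            | ordinary cores, so the hyperthread/physical distinction stays
            | invisible here):
            | 
            |     use std::thread;
            | 
            |     fn main() {
            |         let cores = core_affinity::get_core_ids().unwrap_or_default();
            |         let workers: Vec<_> = cores
            |             .into_iter()
            |             .map(|core| {
            |                 thread::spawn(move || {
            |                     // Pin this worker to one core.
            |                     core_affinity::set_for_current(core);
            |                     // ... run the worker loop here ...
            |                 })
            |             })
            |             .collect();
            |         for w in workers {
            |             w.join().unwrap();
            |         }
            |     }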
        
             | eptcyka wrote:
             | I think a general purpose threadpool should work well on
             | general purpose hardware, and it seems like the most
             | popular SoCs on consumer devices will have heterogeneous
             | cores et al, so a good implementation would schedule the
             | threadpool appropriately. I agree that there is no POSIX
             | way to distinguish between hyper threads and regular
             | threads, and this is something that should be improved. I'm
             | not saying that the achievements made by the threadpool
             | implementation are lackluster or that any of the other
             | solutions solve the issues I outline any better. What I am
              | saying is that the comment I was originally referring to was
             | somewhat mistaken about the benefits of a more optimal, yet
             | naive threadpool library.
             | 
             | This isn't just about hyperthreads, by the way. As long as
             | the workload isn't compute heavy and often stalls on
             | memory, hyperthreads are just as good as regular threads.
             | On a hardware level, there is no distinction between a
             | regular and a hyperthread core. Either you multiplex a
             | single physical core or you don't. Anyway, there is more to
             | it than slower threads and faster threads - accessing
             | memory between threads will be slower depending on which
             | core is trying to access which bits of memory - a core
             | stealing work from a sibling core on the same chiplet will
             | probably be able to do that quicker than stealing work from
             | a core across the cluster if the data prefetcher has been
             | doing it's job correctly. Spawning more threads than
             | necessary might force a CPU to power up more cores than
             | necessary, resulting in slower performance per core
             | performance and worse power efficiency, especially if a
             | fast or slow cluster needs to be turned on, where a more
             | optimal scheduling of threads might not force that to
             | happen. I think a general purpose thread pool by default
             | should no longer spawn as many threads as there are
             | _cores_, whatever that term even means, with optional
             | toggles to inform whether the work that'll be scheduled
             | will be compute heavy or not.
        
           | dragontamer wrote:
           | > just spawning as many threads as there are cores is the
           | optimal amount of threads in a thread pool for every use case
           | 
           | Absolutely not.
           | 
           | Any task with any amount of I/O will have a significant
           | amount of blocking. A GPU kernel may take microseconds or
           | milliseconds to respond. RDMA (a memory-access over Ethernet)
           | may take many microseconds.
           | 
           | Having multiple threads per core would be more efficient: it
           | gives the cores something to do while waiting for SSD, RDMA,
           | or GPUs. Remember: even the earliest single-core systems from
            | the 1970s had multiple threads on one core: it's just more
           | efficient to have multiple terminals to read/write to at a
           | time.
           | 
           | --------
           | 
           | One hardware-thread per thread (since SMT8 machines exist
           | like POWER9 / POWER10) is only efficient in the most
           | computationally expensive situations. Which is in fact, a
           | rarity in today's world. Your typical programs will be
           | waiting on the network interface or SSD.
           | 
            | IIRC: there are professional thread-pools out there that use
           | 1.5x threads per hardware thread as a default option, and
           | then scale up/down depending on how computationally expensive
           | things look. That is: a 64-core/128-thread Threadripper would
           | be overloaded with 192 threads, under the assumption that at
           | least 33% of them would be waiting on I/O at any given time.
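            | 
            | As a back-of-the-envelope sketch of that kind of heuristic
            | (purely illustrative numbers):
            | 
            |     use std::thread;
            | 
            |     fn main() {
            |         // Oversubscribe by ~1.5x on the assumption that roughly
            |         // a third of the workers are blocked on I/O at any moment.
            |         let hw = thread::available_parallelism()
            |             .map(|n| n.get())
            |             .unwrap_or(1);
            |         let pool_size = hw + hw / 2;
            |         println!("{hw} hardware threads -> {pool_size} pool threads");
            |     }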
        
             | eptcyka wrote:
             | >Any task with any amount of I/O will have a significant
             | amount of blocking. A GPU kernel may take microseconds or
             | milliseconds to respond. RDMA (a memory-access over
             | Ethernet) may take many microseconds.
             | 
             | The argument I failed to make was that with heterogeneous
             | distribution of memory bandwidth and compute resources,
              | most user applications would benefit from spawning fewer
             | threads than all available cores. In I/O heavy workloads,
             | the correct thing to do is to do asynchronous I/O. This can
             | be done for SSDs and GPUs. On contemporary systems where
              | there are heavy costs associated with context switching,
              | avoiding them and servicing multiple tasks without
             | blocking will always be superior to spawning more threads
             | to do more blocking.
             | 
             | When it comes to hyperthreading, I assume that the threads
             | are cores - because from the OS perspective, they are, and
             | you cannot distinguish two hyperthreads running on a single
             | core anyway.
             | 
             | Also, I apologize, but the first sentence you're quoting is
              | not what I intended to write - my point is that most
             | application developers might still think that spawning as
             | many threads as there are cores is a reasonable thing to do
              | in all cases - but with CPUs that are composed of 4 core
             | clusters with 16 cores each, it's often better to spawn far
             | less than the total amount of cores available.
        
               | dragontamer wrote:
               | > In I/O heavy workloads, the correct thing to do is to
               | do asynchronous I/O
               | 
               | You can't async mmap into memory reads (or the GPU-
               | equivalent: cudaMallocManaged).
               | 
               | Today's I/O is looking more-and-more like a memory
               | read/write. As such, your typical "node = node->next"
               | pointer traversals could very well be an incredible
               | string of I/O. That indirection could be in RAM, over SSD
               | (mmap), over RDMA (ethernet pretending to be RAM), or
               | into GPU-RAM (cudaMallocManaged)... with an appropriate
               | PCIe command (possibly atomic-PCIe) to boot.
               | 
               | Async only works on the simplest of reading/writing
               | patterns.
               | 
               | EDIT: Consider the "persistent kernel" pattern on GPGPU
               | (which starts off with a lot of the similar thought
               | process you have on CPU-land. Launch only X wavefronts,
               | where X is the size of the GPU). You primarily
               | communicate with a persistent kernel over RAM / atomics.
               | You don't use the classic cuda <<<blocks, threads>>>()
                | (which admittedly has an async interface)... Instead:
               | you read/write to magic managed-memory locations that
               | will be eventually sync'd over PCIe to the GPU. This is
               | because the "persistent kernel" is always executing. You
               | launched it upon program start, there's no event to
               | async-trigger on. You just pass data to the GPU. The GPU
               | operates on the data at its leisure. Then it eventually
               | returns data to a buffer elsewhere (probably with atomic
               | compare-and-swaps, which traverse PCIe these days, to
               | ensure coherence)
        
               | touisteur wrote:
               | Smells like io_uring to me on the userland/kernel
               | interface. Just write to the queue and come later check
               | the completion queue.
        
               | dragontamer wrote:
               | CUDA-streams probably already use io_uring under the
               | hood.
               | 
               | The issue is that a CUDA-stream based execution kernel
               | will spend lots of time pulling from the queue and
               | spinning up threads (sound like a familiar problem?).
               | 
               | Instead: you sidestep this issue with persistent kernels:
               | you launch exactly the number of kernels that matches
               | your GPU-size, and then pass data to those kernels
               | through an alternative means.
               | 
               | -------
               | 
               | The future of I/O is going to be atomic-operations across
               | PCIe with stronger-and-stronger coherency models. PCIe
               | 3.0 introduced atomic-PCIe commands. PCIe 4.0
               | strengthened them. PCIe 5.0 is rumored to have even
               | stronger coherency rules and "coherent grids" are being
               | seriously discussed / implemented in high-speed computers
               | (see Compute eXpress Link, or CXL)
               | 
               | Any serious I/O (and GPUs will be part of that I/O
               | future), will likely use PCIe atomics in some regards,
               | quite similar to atomic-compare-and-swaps that the CPU
               | does already between their cores.
               | 
               | I/O is becoming indistinguishable from memcpy + atomics.
               | 
               | -------
               | 
               | EDIT: In effect, I'm saying that GPU-programmers are more
               | concerned about GPU-efficiency rather than CPU-
               | efficiency. Sure, the CPU is less efficient spinning on
               | these memory-reads/writes. But the GPU is where the bulk
               | of the calculations are happening. Therefore, it behooves
               | the architect to optimize the GPU (even if the CPU ends
               | up slightly less efficient).
               | 
               | Whatever argument you have about "queues being more
               | efficient in hypothetical architecture", the GPU has more
               | calculations and more cores to feed (aka: far more
               | overhead when it comes to Amdahl's law). That means you
               | want to apply those principles first to the GPU, and then
               | the CPU "makes up for it", so to speak.
        
               | eptcyka wrote:
               | I was under the impression that PCI-E was perfectly
               | capable of sending notifications from one device to
               | another in a somewhat efficient manner. Having said that,
               | this is not my area of expertise - and I do see that if
               | your main concern is to feed the GPU then blocking a
               | thread might be the optimal solution. I assume that MSI
               | would be too much overhead and might involve some context
               | switching to service the interrupt from the kernel etc to
               | allow for asynchronous completion? Also, is it possible
               | to have overlapping memory regions between a high speed
               | networking card and the input buffer from the GPU, which
               | in effect just means that the CPU just has to tell the
               | GPU to start reading once the network card is done
               | receiving?
               | 
               | Having said that, I don't believe that for most
               | application developers this is a major concern - in cases
               | where you flood the GPU with a firehose of data to
               | compute on you probably also don't care about what other
               | processes run on the machine and whether your
               | architectural decisions end up making people's laps
               | uncomfortably hot. I also do not believe that the future
               | of all I/O is just memcpy and atomics - we can already do
               | that today. It doesn't really bring you any advantages
               | for speed in the general case. I think the future of I/O
               | is memcpy, atomics and a good signaling mechanism to
               | signal I/O task completion without costly context
               | switches with as little extraneous memory allocation as
               | possible. Moreover, the future of consumer computing will
               | probably not rely on PCI-E at all and instead have the
                | GPU and the CPU share all of its memory. And hey, maybe
               | Nvidia will add some cool ARM companion cores to their
               | biggest chips, slap on some DDR5 slots on their cards and
               | sell self-contained solutions, sidestepping PCI-E
               | entirely, at least for feeding the data from the CPU to
               | the GPU.
        
         | nicoburns wrote:
         | Can you limit the number of cores available to an application
         | when you open it? I feel like there must be a way to do that.
        
           | dundarious wrote:
           | On Linux it's taskset, numactl, etc.
        
           | kevingadd wrote:
           | On windows, no. You can use affinity to prevent the
           | application from running on some of your cores but it will
           | still spin up a ton of threads (which makes the situation
           | even worse sometimes)
        
       | jedisct1 wrote:
       | Pretty impressive achievement!
        
       ___________________________________________________________________
       (page generated 2021-09-13 23:01 UTC)