[HN Gopher] Welcome to the Parallel Future of Computation
       ___________________________________________________________________
        
       Welcome to the Parallel Future of Computation
        
       Author : Epholys
       Score  : 146 points
       Date   : 2024-05-17 07:37 UTC (15 hours ago)
        
 (HTM) web link (hvm-page.pages.dev)
 (TXT) w3m dump (hvm-page.pages.dev)
        
       | anonzzzies wrote:
       | Maybe I missed it, but there seems to be no license attached to
       | HVM2, nor to Bend or Kind?
        
         | drtournier wrote:
         | https://x.com/VictorTaelin/status/1791241244468806117
        
           | tekknolagi wrote:
            | (Taelin says it will likely be MIT or similar)
        
       | throwaway2562 wrote:
       | i-combinators https://www.semanticscholar.org/paper/Interaction-
       | Combinator...
        
       | anentropic wrote:
       | I remember seeing HVM on here a year or two back when it came out
       | and it looked intriguing. Exciting to see something being built
       | on top of it!
       | 
       | I would say that the play on words that gives the language its
       | name ("Bend") doesn't really make sense...
       | 
       | https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md
       | 
       | > Bending is the opposite of folding. Whatever fold consumes,
       | bend creates.
       | 
       | But in everyday language bending is not the opposite of folding,
       | they are more or less the same thing. Why not "unfold", which
       | also has a connotation of "the process of happening" as well as
       | merely the opposite of folding?
       | 
       | I have a question about the example code and output for bending:
        | type Tree:
        |   Node { ~lft, ~rgt }
        |   Leaf { val }
        | 
        | def main():
        |   bend x = 0:
        |     when x < 3:
        |       tree = Tree/Node { lft: fork(x + 1), rgt: fork(x + 1) }
        |     else:
        |       tree = Tree/Leaf { val: 7 }
        |   return tree
        | 
        | tree = fork(0)
        | tree = ![fork(1), fork(1)]
        | tree = ![![fork(2),fork(2)], ![fork(2),fork(2)]]
        | tree = ![![![fork(3),fork(3)], ![fork(3),fork(3)]],
        |        ![![fork(3),fork(3)], ![fork(3),fork(3)]]]
        | tree = ![![![7,7], ![7,7]], ![![7,7], ![7,7]]]
       | 
       | Where does the initial "tree = fork(0)" come from?
        
         | developedby wrote:
          | The first `fork` is from using bend and passing the initial
          | state:
          | 
          |     The program above will initialize a state (`x = 0`), and
          |     then, for as long as `x < 3`, it will "fork" that state
          |     in two, creating a `Tree/Node`, and continuing with
          |     `x + 1`. When `x >= 3`, it will halt and return a
          |     `Tree/Leaf` with `7`. When all is done, the result will
          |     be assigned to the `tree` variable.
        
            | anentropic wrote:
            | I would have described the logic in the exact same way, but
            | I still don't see where the initial tree = fork(0) state
            | comes from.
            | 
            | All the other "fork"s in the output are produced explicitly
            | by:
            | 
            |     Tree/Node { lft: fork(x + 1), rgt: fork(x + 1) }
        
             | anentropic wrote:
             | well, it seems fork is some kind of builtin function that
              | is intimately related to the bend construct:
              | https://github.com/HigherOrderCO/bend/blob/main/FEATURES.md#...
             | 
             | so presumably the initial fork(0) state is implicitly
             | produced by the bend
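              | 
              | For intuition, here's a rough Python analogue of that
              | bend (illustrative names only, not Bend's actual
              | desugaring; real forks reduce in parallel, while this
              | sketch recurses sequentially):

```python
# Rough Python analogue of the bend/fork example above. This is an
# illustration, not Bend's real desugaring: Bend's forks reduce in
# parallel, while this sketch recurses sequentially.

def fork(x):
    if x < 3:
        # "when x < 3" branch: a node whose children fork the next state
        return ("Node", fork(x + 1), fork(x + 1))
    else:
        # "else" branch: a leaf
        return ("Leaf", 7)

# The trace's initial "tree = fork(0)" is the bend seeding the
# recursion with the initial state x = 0.
tree = fork(0)
```

              | The recursion bottoms out at x = 3, producing the
              | ![![![7,7], ...]] tree shown in the trace.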
        
       | api wrote:
       | Oh wow do I wish this existed when I was playing with
       | evolutionary computation and genetic algorithms in college...
        
         | zackmorris wrote:
         | Me too, now you see why they never took off.
        
           | api wrote:
           | They never took off because we discovered, to our surprise to
           | some extent, that gradient descent through back propagation
           | works better than expected if you give it the right learning
           | media and the right input and output encodings. It took a ton
           | of fiddling ("graduate student descent") to figure those out.
           | 
           | Back then everyone thought it was doomed to get stuck at
           | local minima, but it turns out that has a lower probability
           | of happening if the search space has enough dimensions. It
           | works well enough to make the sand talk back to us and now
           | that particular design has sucked all the air out of the
           | room.
           | 
           | Nobody has tried EC at anywhere near the scale of GPTs/LLMs
           | because that amount of compute is expensive and at this point
           | we know those will at least work.
           | 
           | I still think EC is fascinating and would love to play with
           | it some more at some point, maybe trying it combined with
           | back propagation in novel ways. Compute only gets cheaper.
        
       | gingfreecss wrote:
       | amazing
        
       | LightMachine wrote:
       | Hey, Taelin here.
       | 
       | Thanks for posting, but our URL is wrong, it should be:
       | HigherOrderCO.com. I can't post it again (gets flagged as a
       | duplicate). Would appreciate if HN mods could change it!
       | 
       | Anyway, after years of hard work, HVM2 is finally production
       | ready, and we're delivering on a long promise: targeting GPUs!
       | And Bend is a new, Python-like and Haskell-like language to
       | interface with it.
       | 
       | Let me know if you have any questions!
        
         | LightMachine wrote:
         | Also, I wonder if you will agree with me, but I'm perhaps too
         | bothered by this title? We'd probably use something more
         | mundane, like:
         | 
         | - Bend: a high-level language that runs on GPUs, powered by
         | HVM2
         | 
         | Or something similar. I don't like this one because it feels
         | too self-important and not really informative (how do you know
         | what it is about without clicking?). But perhaps I'm
         | overthinking this?
        
           | mccoyb wrote:
           | I agree with you, I don't think this title is helping with
           | parsing what this is about, nor is it helping with
           | discoverability.
        
         | magnio wrote:
         | Congrats on the launch.
         | 
         | I know the docs say this will be fixed soon, but what is the
         | main reason for restricting number types to 24 bits? I saw in
          | the code that they are wrappers around the 32-bit system number
         | types, so what prevents Bend from changing them to U32(u32)
         | right now?
        
           | LightMachine wrote:
           | Great question!
           | 
           | Short answer: GPU
           | 
           | Long answer: CUDA
           | 
           | Seriously though. Implementing a full high-level lang in
           | parallel is HARD, so, to simplify it greatly, we made IC
           | nodes 64-bit, which allows us to use native 64-bit atomic
           | operations in many parts of the implementation. Since each
           | 64-bit node has 2 ports, that gives us 32 bits per port. And
            | since we use 3 bits for the tag, that leaves us with 29-bit
           | payloads. We used that space to easily implement unboxed
           | numbers (f24, u24, i24).
           | 
           | That said, we will have (boxed) 64-bit numbers soon! With
           | this foundation in place, adding them is a matter of coding.
           | I just want to have some time to let people use the limited
           | version, find bugs, etc., before I add more stuff.
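            | 
            | To make the bit budget concrete, here's a hypothetical
            | Python sketch of the port layout described (names and tag
            | values are made up, not HVM2's actual code):

```python
# Hypothetical sketch of the bit budget described above (not HVM2's
# actual code): a 64-bit node = two 32-bit ports; each port = 3-bit
# tag + 29-bit payload, into which unboxed 24-bit numbers fit.

TAG_BITS = 3
PAYLOAD_BITS = 32 - TAG_BITS               # 29 bits left per port
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1

def mk_port(tag, payload):
    return (tag << PAYLOAD_BITS) | (payload & PAYLOAD_MASK)

def port_tag(p):
    return p >> PAYLOAD_BITS

def port_payload(p):
    return p & PAYLOAD_MASK

def mk_node(a, b):
    # two ports packed in one 64-bit word, so a node can be updated
    # with a single native 64-bit atomic operation
    return (a << 32) | b
```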
        
       | efferifick wrote:
       | Thank you for sharing!
        
       | anonzzzies wrote:
       | What is the terminal used for that demo?
       | https://github.com/HigherOrderCO/Bend does it just skip commands
       | it cannot execute or?
        
         | LightMachine wrote:
         | It was actually just me recording iTerm2 with OBS. The theme is
         | Solarized Light. What do you mean by skip commands?
        
           | anonzzzies wrote:
            | Ah, just ctrl-c was it? Sometimes I overcomplicate things.
            | Keep up the good work!
        
             | LightMachine wrote:
             | yes!
        
         | mcintyre1994 wrote:
         | They're probably hitting ctrl+c at the end of the lines they
         | don't want to run, that's telling the terminal "cancel that"
         | but it'll usually just go to the next line and leave what you
         | typed in place, like in this video.
        
       | jes5199 wrote:
       | I think I have a use for this but I'm realizing that I don't know
       | how to build a mental model of what is going to parallelize in
          | this system. Surely some algorithms are better at getting
       | chopped up than others - how can I tell what is going on?
        
         | mjaniczek wrote:
         | I think this is an unsolved tooling question right now.
         | 
         | You could get some sense of the parallelism by using
          | `/usr/bin/time` and dividing the user time by the wall time.
         | 
         | You could look at the Task Manager / Activity Monitor / htop
         | and see if it's using 800% CPU or whatever.
         | 
         | You could use psrecord (https://pypi.org/project/psrecord/) to
          | get a relatively fine-grained CPU+mem usage graph across the
         | duration of the program.
         | 
         | But it would probably still be best to record some sort of
         | stats in the Bend/HVM itself, enabled via a CLI flag.
         | Reductions per ms, sampled across the program duration, or
         | something like that.
         | 
         | I'd be interested in anybody's ideas of what a good metric
         | would be here!
         | 
         | EDIT: CLI flag, not CPU flag
        
       | animaomnium wrote:
       | Fala Taelin, nice work! Does HVM2 compile interaction nets to
       | e.g. spirv, or is this an interpreter (like the original HVM)
       | that happens to run on the GPU?
       | 
       | I ask because a while back I was messing around with compiling
       | interaction nets to C after reducing as much of the program as
       | possible (without reducing the inputs), as a form of whole
       | program optimization. Wouldn't be too much harder to target a
       | shader language.
       | 
       | Edit: Oh I see...
       | 
       | > This repository provides a low-level IR language for specifying
       | the HVM2 nets, and a compiler from that language to C and CUDA
       | HVM
       | 
       | Will have to look at the code then!
       | 
       | https://github.com/HigherOrderCO/HVM
       | 
       | Edit: Wait nvm, it looks like the HVM2 cuda runtime is an
       | interpreter, that traverses an in-memory graph and applies
       | reductions.
       | 
       | https://github.com/HigherOrderCO/HVM/blob/5de3e7ed8f1fcee6f2...
       | 
       | I was talking about traversing an interaction net to recover a
       | lambda-calculus-like term, which can be lowered to C a la lisp in
       | small pieces with minimal runtime overhead.
       | 
       | Honestly the motivation is, you are unlikely to outperform a
       | hand-written GPU kernel for like ML workloads using Bend. In
       | theory, HVM could act as glue, stitching together and
       | parallelizing the dispatch order of compute kernels, but you need
       | a good FFI to do that. Interaction nets are hard to translate
       | across FFI boundaries. But, if you compile nets to C, keeping
       | track of FFI compute kernel nodes embedded in the interaction
       | network, you can recover a sensible FFI with no translation
       | overhead.
       | 
       | The other option is implementing HVM in hardware, which I've been
       | messing around with on a spare FPGA.
        
         | LightMachine wrote:
          | It is an interpreter that runs on GPUs, _and_ a compiler to
          | native C and CUDA. We don't target SPIR-V directly, but aim
          | to. Sadly, while the C compiler results in the expected
          | speedups (3x-4x, and much more soon), the CUDA runtime didn't
          | achieve substantial speedups compared to the non-compiled
          | version. I believe this is due to warp divergence: with non-
          | compiled procedures, we can actually merge all function calls
          | into a single "generic" interpreted function expander that
          | can be reduced by warp threads without divergence. We'll be
          | researching this more extensively going forward.
        
           | animaomnium wrote:
           | Oh that's cool! Interested to see where your research leads.
            | Could you drop me a link to where the compiler from
            | interaction nets to CUDA resides? I skimmed through the
            | HVM2 repo and just read the .cu runtime file.
           | 
           | Edit: nvm, I read through the rest of the codebase. I see
           | that HVM compiles the inet to a large static term and then
           | links against the runtime.
           | 
            | https://github.com/HigherOrderCO/HVM/blob/5de3e7ed8f1fcee6f2...
           | 
           | Will have to play around with this and look at the generated
           | assembly, see how much of the runtime a modern c/cu compiler
           | can inline.
           | 
            | Btw, nice code: very compact, clean, well-organized, and
            | easy to read. Rooting for you!
        
       | mccoyb wrote:
       | This is cool! Is the idea to put Kind2 on top of this in some
       | way?
       | 
       | I'd also love to find an example of writing a small interpreter
       | in Bend - which runs on the GPU.
        
         | LightMachine wrote:
         | Yes, Kind2 will be a type layer on top of Bend, with a similar
         | relationship as in JavaScript / TypeScript (but much more
         | integrated, less ad-hoc and with proofs!). I don't want Kind2
         | to compete directly with Lean though, as it is doing an amazing
         | job and I'm rooting for it. So, Kind2 will be just a type
         | system for Bend that happens to let you prove theorems about
         | your programs, rather than a higher promise to digitalize all
         | of maths and stuff.
        
       | mjaniczek wrote:
        | I've made a benchmark of Bend running a simple counter program
        | on CPU vs GPU, vs Haskell, Node, Python, and C, which I plan to
        | write a blog post about, probably this Sunday:
       | 
       | https://docs.google.com/spreadsheets/d/1V_DZPpc7_BP3bmOR8Ees...
       | 
       | It's magical how the GPU version is basically flat (although with
       | a high runtime init cost).
        
       | xiaodai wrote:
       | Looks cool but what's one toy problem that it can solve more
       | efficiently than others?
        
         | JackMorgan wrote:
         | Here is an example of it summing a huge set of numbers 100x
         | faster than in C.
         | 
         | https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md#par...
        
           | ashdnazg wrote:
           | Note that it's not 100x faster than C, but than bend running
           | on one CPU thread.
           | 
           | Running the equivalent C code takes ~2.3 seconds on my
           | machine. Same order of magnitude as bend on the beefy GPU.
        
         | mjaniczek wrote:
         | This is unproven (and not a toy problem), but I imagine it's
         | going to do pretty well at compilers. The amount of time I'm
         | waiting at work, hypnotizing the tsc process that sits at 100%
         | CPU, wishing it was parallel...
        
       | naltroc wrote:
       | so cool. I see it has a lib target, can we use it as a crate
       | instead of external program?
        
       | DrNosferatu wrote:
       | Is there an OpenCL backend?
        
       | yetihehe wrote:
       | Wow, Bend looks like a nice language.
       | 
       | > That's a 111x speedup by doing nothing. No thread spawning, no
       | explicit management of locks, mutexes. We just asked bend to run
       | our program on RTX, and it did. Simple as that. Note that, for
       | now, Bend only supports 24-bit machine ints (u24), thus, results
       | are always mod 2^24.
       | 
        | Ahh, not even 32-bit? Hmm, that seems pretty arbitrary to
        | someone not accustomed to GPUs who wants to solve problems
        | requiring 64 bits (a gravitational simulation of the solar
        | system at millimeter resolution could use ~58-bit ints for
        | position).
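        | 
        | For reference, wrapping mod 2^24 just means any high bits
        | beyond 24 are discarded; in plain Python (not Bend):

```python
# u24 values wrap modulo 2^24 = 16,777,216: any bits of a result
# above the 24th are discarded.
U24_MOD = 1 << 24

def u24_add(a, b):
    return (a + b) % U24_MOD

print(u24_add(U24_MOD - 1, 1))          # wraps around to 0
print(u24_add(10_000_000, 10_000_000))  # 20_000_000 - 16_777_216
```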
        
         | GGO wrote:
          | 64-bit coming soon
        
       | exabrial wrote:
       | > First, install Rust nightly.
       | 
       | Eeek.
        
       | mattdesl wrote:
       | Looking forward to using this. Curious about how far away
       | WebGPU/WASM support might be, it could provide a single cross-
       | platform backend.
        
       | abeppu wrote:
       | I for one found the 'how is this possible' video near the bottom
       | of the page to be unhelpful:
       | 
       | - surely for `3 x 3 = 9`, there is some concept of primitive
       | operations?
       | 
       | - I get that replacement of patterns in a graph can be done in
       | parallel, but (a) identifying when a rewrite rule should apply
       | and (b) communicating the state of the updated graph to worker
       | threads and (c) organizing worker threads to agree on which does
       | each task all take some effort. When is this more work than the
       | original computation (as in the 3x3 example)?
        
         | nojvek wrote:
         | 3 x 3 seemed like a pretty bad example to show how they
         | parallelize.
        
         | jiehong wrote:
          | The flip with Chinese characters in the middle tripped me
          | up. I guess they wanted it to look "complicated"...
        
       | gigatexal wrote:
       | The first graphic midway or so down the page has this tag:
       | 
       | tested on: CPU - Apple M3 Max, GPU - NVIDIA RTX 4090
       | 
       | But how? I thought eGPUs don't work on apple silicon and the
       | pci-e having Mac Pro is still M2 based, no?
        
         | GGO wrote:
         | 2 different machines
        
       | zackmorris wrote:
       | This is nice, and obvious. I've waited about 20 years since I
       | learned MATLAB and GNU Octave for someone to make a graph solver
       | like this. And about 25 years since I first had the idea, when I
       | was learning VLSI with VHDL in college and didn't see anything
       | like the functional programming of circuits in what at the time
       | was the imperative C++ world. The closest thing then was Lisp,
       | but nobody talked about how the graph representation
       | (intermediate code or i-code in the imperative world) could be
       | solved in an auto-parallelized way.
       | 
       | We still see this today in how languages go out of their way to
       | implement higher order method libraries (map/reduce/filter) but
       | then under the hood there is no multithreading, they just expect
       | the developer to annotate their loops to be parallel because the
       | languages aren't formal enough to know about side effects in the
       | innermost logic, and don't support immutability or performant
       | pass-by-value semantics with copy-on-write anyway. So we end up
       | with handwavy languages like Rust that put all of that mental
       | load onto the developer for basically no gain, they just save
       | memory by performing computation in-place imperatively.
       | 
       | I also like how Bend sidesteps the nonexistence of highly scaled
       | symmetric multiprocessing CPUs by supporting GPUs. It makes the
       | argument moot that GPUs can't be stopped because they're too big
       | to fail. Julia is the only other language I've seen that tries
       | this. I wish Clojure did, although it's been a long time since I
       | followed it so maybe it has some parallelism?
       | 
       | I would have dearly loved to work on something like Bend, had
       | someone solved the funding issue. Nobody wants to pay for pure
       | research, and nobody sees the need for languages that do what
       | everyone else is doing except easier. We have Kickstarter for
       | widgets and Patreon for influencers, but makers have to bootstrap
       | themselves or learn everything about finance or live in the right
       | city or have a large network to hopefully meet an angel investor
       | or work in academia and lose all rights to what they invent while
       | spending the majority of their time hustling for grants anyway.
       | So it just never happens and we're stuck with the same old busted
       | techniques. Like how Hollywood only has money for sequels and
       | reboots or the recording industry only has money for canned
       | corporate music and hits from already famous artists and yet
       | another cover that yanks the original better song off the radio.
       | 
       | A quarter of a century can go by in the blink of an eye if you
       | get suckered into building other people's dreams as a people-
       | pleaser. Be careful what you work on.
        
         | jjtheblunt wrote:
         | > A quarter of a century can go by in the blink of an eye if
         | you get suckered into building other people's dreams as a
         | people-pleaser. Be careful what you work on
         | 
         | well said! i find myself reflecting the same sentiment when
         | away from the computer (and i've avoided the people-pleaser
         | thing, but what you said resonates as i watch the world)
        
       | croemer wrote:
       | This got a duplicate recently:
       | https://news.ycombinator.com/item?id=40390287
       | 
       | And is a duplicate of this:
       | https://news.ycombinator.com/item?id=40383196
        
       | flakiness wrote:
        | This kind of sounds like Mojo. I wonder how these compare?
        | (Besides HVM/Bend being open source, which is awesome.)
       | 
       | https://www.modular.com/max/mojo
        
       | IsTom wrote:
        | Reminds me a lot of the Reduceron, sans FPGA.
        
       | hintymad wrote:
        | Speaking of parallel computing, any book or series of books
        | that can help an engineer learn parallel programming and go
        | from zero to hero? Ideally the books would cover intuition,
        | in-depth details, and theory. Something like The Art of
        | Multiprocessor Programming by Herlihy et al. for concurrent
        | programming, even though that book arguably still has too
        | steep a learning curve.
        
       ___________________________________________________________________
       (page generated 2024-05-17 23:01 UTC)