[HN Gopher] Welcome to the Parallel Future of Computation
___________________________________________________________________
Welcome to the Parallel Future of Computation
Author : Epholys
Score : 146 points
Date : 2024-05-17 07:37 UTC (15 hours ago)
(HTM) web link (hvm-page.pages.dev)
(TXT) w3m dump (hvm-page.pages.dev)
| anonzzzies wrote:
| Maybe I missed it, but there seems to be no license attached to
| HVM2, nor to Bend or Kind?
| drtournier wrote:
| https://x.com/VictorTaelin/status/1791241244468806117
| tekknolagi wrote:
| (Taelin says it will likely be MIT or similar)
| throwaway2562 wrote:
| i-combinators https://www.semanticscholar.org/paper/Interaction-
| Combinator...
| anentropic wrote:
| I remember seeing HVM on here a year or two back when it came out
| and it looked intriguing. Exciting to see something being built
| on top of it!
|
| I would say that the play on words that gives the language its
| name ("Bend") doesn't really make sense...
|
| https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md
|
| > Bending is the opposite of folding. Whatever fold consumes,
| bend creates.
|
| But in everyday language bending is not the opposite of folding,
| they are more or less the same thing. Why not "unfold", which
| also has a connotation of "the process of happening" as well as
| merely the opposite of folding?
|
| I have a question about the example code and output for bending:
|
|         type Tree:
|           Node { ~lft, ~rgt }
|           Leaf { val }
|
|         def main():
|           bend x = 0:
|             when x < 3:
|               tree = Tree/Node { lft: fork(x + 1), rgt: fork(x + 1) }
|             else:
|               tree = Tree/Leaf { val: 7 }
|           return tree
|
|         tree = fork(0)
|         tree = ![fork(1), fork(1)]
|         tree = ![![fork(2),fork(2)], ![fork(2),fork(2)]]
|         tree = ![![![fork(3),fork(3)], ![fork(3),fork(3)]],
|                ![![fork(3),fork(3)], ![fork(3),fork(3)]]]
|         tree = ![![![7,7], ![7,7]], ![![7,7], ![7,7]]]
|
| Where does the initial "tree = fork(0)" come from?
| developedby wrote:
| The first `fork` is from using bend and passing the initial
| state. The program above will initialize a state (`x = 0`),
| and then, for as long as `x < 3`, it will "fork" that state
| in two, creating a `Tree/Node`, and continuing with `x + 1`.
| When `x >= 3`, it will halt and return a `Tree/Leaf` with
| `7`. When all is done, the result will be assigned to the
| `tree` variable.
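For intuition, the expansion above can be modeled with a plain recursive function. This is a Python sketch, not Bend: the tuple stands in for `Tree/Node` and the integer for `Tree/Leaf`, and `bend_tree` is a hypothetical name.

```python
def bend_tree(x):
    # models Bend's `bend x = 0: when x < 3: ... else: ...` construct;
    # the implicit initial call is the `fork(0)` seen in the trace
    if x < 3:
        # Tree/Node { lft: fork(x + 1), rgt: fork(x + 1) }
        return (bend_tree(x + 1), bend_tree(x + 1))
    else:
        # Tree/Leaf { val: 7 }
        return 7

print(bend_tree(0))  # (((7, 7), (7, 7)), ((7, 7), (7, 7)))
```

In Bend, each recursive call is a `fork` that the runtime is free to reduce in parallel; in this sketch they simply run sequentially.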
| anentropic wrote:
| I would have described the logic in the exact same way, but I
| still don't see where the initial `tree = fork(0)` state
| comes from
|
| all the other "fork"s in the output are produced explicitly
| by:
|
|     Tree/Node { lft: fork(x + 1), rgt: fork(x + 1) }
| anentropic wrote:
| well, it seems fork is some kind of builtin function that is
| intimately related to the bend construct:
| https://github.com/HigherOrderCO/bend/blob/main/FEATURES.md#...
|
| so presumably the initial fork(0) state is implicitly
| produced by the bend
| api wrote:
| Oh wow do I wish this existed when I was playing with
| evolutionary computation and genetic algorithms in college...
| zackmorris wrote:
| Me too, now you see why they never took off.
| api wrote:
| They never took off because we discovered, to our surprise to
| some extent, that gradient descent through back propagation
| works better than expected if you give it the right learning
| media and the right input and output encodings. It took a ton
| of fiddling ("graduate student descent") to figure those out.
|
| Back then everyone thought it was doomed to get stuck at
| local minima, but it turns out that has a lower probability
| of happening if the search space has enough dimensions. It
| works well enough to make the sand talk back to us and now
| that particular design has sucked all the air out of the
| room.
|
| Nobody has tried EC at anywhere near the scale of GPTs/LLMs
| because that amount of compute is expensive and at this point
| we know those will at least work.
|
| I still think EC is fascinating and would love to play with
| it some more at some point, maybe trying it combined with
| back propagation in novel ways. Compute only gets cheaper.
| gingfreecss wrote:
| amazing
| LightMachine wrote:
| Hey, Taelin here.
|
| Thanks for posting, but our URL is wrong, it should be:
| HigherOrderCO.com. I can't post it again (gets flagged as a
| duplicate). Would appreciate if HN mods could change it!
|
| Anyway, after years of hard work, HVM2 is finally production
| ready, and we're delivering on a long promise: targeting GPUs!
| And Bend is a new, Python-like and Haskell-like language to
| interface with it.
|
| Let me know if you have any questions!
| LightMachine wrote:
| Also, I wonder if you will agree with me, but I'm perhaps too
| bothered by this title? We'd probably use something more
| mundane, like:
|
| - Bend: a high-level language that runs on GPUs, powered by
| HVM2
|
| Or something similar. I don't like this one because it feels
| too self-important and not really informative (how do you know
| what it is about without clicking?). But perhaps I'm
| overthinking this?
| mccoyb wrote:
| I agree with you, I don't think this title is helping with
| parsing what this is about, nor is it helping with
| discoverability.
| magnio wrote:
| Congrats on the launch.
|
| I know the docs say this will be fixed soon, but what is the
| main reason for restricting number types to 24 bits? I saw in
| the code that they are wrapper around the 32-bit system number
| types, so what prevents Bend from changing them to U32(u32)
| right now?
| LightMachine wrote:
| Great question!
|
| Short answer: GPU
|
| Long answer: CUDA
|
| Seriously though. Implementing a full high-level lang in
| parallel is HARD, so, to simplify it greatly, we made IC
| nodes 64-bit, which allows us to use native 64-bit atomic
| operations in many parts of the implementation. Since each
| 64-bit node has 2 ports, that gives us 32 bits per port. And
| since we use 3 bits for the tag, that leaves us with 29-bit
| payloads. We used that space to easily implement unboxed
| numbers (f24, u24, i24).
|
| That said, we will have (boxed) 64-bit numbers soon! With
| this foundation in place, adding them is a matter of coding.
| I just want to have some time to let people use the limited
| version, find bugs, etc., before I add more stuff.
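The bit layout described above can be sketched like this (a Python model with hypothetical helper names; the actual field encoding lives in the HVM2 source):

```python
TAG_BITS = 3
PAYLOAD_BITS = 29          # 32-bit port minus a 3-bit tag
NUM_BITS = 24              # unboxed u24/i24/f24 fit inside the payload

def make_port(tag, payload):
    # pack a 32-bit port: low 3 bits hold the tag, the rest the payload
    assert tag < (1 << TAG_BITS) and payload < (1 << PAYLOAD_BITS)
    return (payload << TAG_BITS) | tag

def port_tag(port):
    return port & ((1 << TAG_BITS) - 1)

def port_payload(port):
    return port >> TAG_BITS

def make_node(port_a, port_b):
    # two 32-bit ports form one 64-bit node, so a whole node can be
    # read or swapped with a single native 64-bit atomic operation
    return (port_b << 32) | port_a
```

The design choice is the one stated in the comment: sizing a node to exactly one 64-bit word is what makes lock-free atomic updates cheap on both CPU and GPU.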
| efferifick wrote:
| Thank you for sharing!
| anonzzzies wrote:
| What is the terminal used for that demo?
| https://github.com/HigherOrderCO/Bend does it just skip commands
| it cannot execute or?
| LightMachine wrote:
| It was actually just me recording iTerm2 with OBS. The theme is
| Solarized Light. What do you mean by skip commands?
| anonzzzies wrote:
| Ah, just ctrl-c was it? Sometimes I overthink things. Keep up
| the good work!
| LightMachine wrote:
| yes!
| mcintyre1994 wrote:
| They're probably hitting ctrl+c at the end of the lines they
| don't want to run, that's telling the terminal "cancel that"
| but it'll usually just go to the next line and leave what you
| typed in place, like in this video.
| jes5199 wrote:
| I think I have a use for this but I'm realizing that I don't know
| how to build a mental model of what is going to parallelize in
| this system. Surely some algorithms are better at getting
| chopped up than others - how can I tell what is going on?
| mjaniczek wrote:
| I think this is an unsolved tooling question right now.
|
| You could get some sense of the parallelism by using
| `/usr/bin/time` and dividing the user time by the wall time.
|
| You could look at the Task Manager / Activity Monitor / htop
| and see if it's using 800% CPU or whatever.
|
| You could use psrecord (https://pypi.org/project/psrecord/) to
| get a relatively fine-grained CPU+mem usage graph across the
| duration of the program.
|
| But it would probably still be best to record some sort of
| stats in the Bend/HVM itself, enabled via a CLI flag.
| Reductions per ms, sampled across the program duration, or
| something like that.
|
| I'd be interested in anybody's ideas of what a good metric
| would be here!
|
| EDIT: CLI flag, not CPU flag
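The user-time/wall-time idea above can also be measured in-process. A Python sketch (`parallelism_ratio` and `busy` are made-up names; `time.process_time` sums CPU time across all threads, so the ratio approximates how many cores were busy on average):

```python
import time

def parallelism_ratio(fn):
    # CPU time divided by wall-clock time: roughly 1.0 for a
    # single-threaded workload, approaching the core count for a
    # well-parallelized one
    wall0 = time.perf_counter()
    cpu0 = time.process_time()  # CPU seconds summed over all threads
    fn()
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    return cpu / wall if wall > 0 else 0.0

def busy():
    # a small single-threaded workload, so the ratio lands near 1.0
    sum(i * i for i in range(2_000_000))

print(f"parallelism: {parallelism_ratio(busy):.2f}")
```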
| animaomnium wrote:
| Hey Taelin, nice work! Does HVM2 compile interaction nets to
| e.g. spirv, or is this an interpreter (like the original HVM)
| that happens to run on the GPU?
|
| I ask because a while back I was messing around with compiling
| interaction nets to C after reducing as much of the program as
| possible (without reducing the inputs), as a form of whole
| program optimization. Wouldn't be too much harder to target a
| shader language.
|
| Edit: Oh I see...
|
| > This repository provides a low-level IR language for specifying
| the HVM2 nets, and a compiler from that language to C and CUDA
| HVM
|
| Will have to look at the code then!
|
| https://github.com/HigherOrderCO/HVM
|
| Edit: Wait nvm, it looks like the HVM2 cuda runtime is an
| interpreter, that traverses an in-memory graph and applies
| reductions.
|
| https://github.com/HigherOrderCO/HVM/blob/5de3e7ed8f1fcee6f2...
|
| I was talking about traversing an interaction net to recover a
| lambda-calculus-like term, which can be lowered to C a la lisp in
| small pieces with minimal runtime overhead.
|
| Honestly the motivation is, you are unlikely to outperform a
| hand-written GPU kernel for like ML workloads using Bend. In
| theory, HVM could act as glue, stitching together and
| parallelizing the dispatch order of compute kernels, but you need
| a good FFI to do that. Interaction nets are hard to translate
| across FFI boundaries. But, if you compile nets to C, keeping
| track of FFI compute kernel nodes embedded in the interaction
| network, you can recover a sensible FFI with no translation
| overhead.
|
| The other option is implementing HVM in hardware, which I've been
| messing around with on a spare FPGA.
| LightMachine wrote:
| It is an interpreter that runs on GPUs, _and_ a compiler to
| native C and CUDA. We don't target SPIR-V directly, but aim
| to. Sadly, while the C compiler results in the expected
| speedups (3x-4x, and much more soon), the CUDA runtime didn't
| achieve substantial speedups, compared to the non-compiled
| version. I believe this is due to warp-divergence: with non-
| compiled procedures, we can actually merge all function calls
| into a single "generic" interpreted function expander that can
| be reduced by warp threads without divergence. We'll be
| researching this more extensively going forward.
| animaomnium wrote:
| Oh that's cool! Interested to see where your research leads.
| Could you drop me a link to where the interaction net - cuda
| compiler resides? I skimmed through the HVM2 repo and just
| read the .cu runtime file.
|
| Edit: nvm, I read through the rest of the codebase. I see
| that HVM compiles the inet to a large static term and then
| links against the runtime.
|
| https://github.com/HigherOrderCO/HVM/blob/5de3e7ed8f1fcee6f2.
| ..
|
| Will have to play around with this and look at the generated
| assembly, see how much of the runtime a modern c/cu compiler
| can inline.
|
| Btw, nice code: very compact and clean, well-organized, easy
| to read. Rooting for you!
| mccoyb wrote:
| This is cool! Is the idea to put Kind2 on top of this in some
| way?
|
| I'd also love to find an example of writing a small interpreter
| in Bend - which runs on the GPU.
| LightMachine wrote:
| Yes, Kind2 will be a type layer on top of Bend, with a similar
| relationship as in JavaScript / TypeScript (but much more
| integrated, less ad-hoc and with proofs!). I don't want Kind2
| to compete directly with Lean though, as it is doing an amazing
| job and I'm rooting for it. So, Kind2 will be just a type
| system for Bend that happens to let you prove theorems about
| your programs, rather than a higher promise to digitalize all
| of maths and stuff.
| mjaniczek wrote:
| I've made a benchmark of Bend running a simple counter program on
| CPU vs GPU, vs Haskell, Node, Python, and C that I plan to write a
| blogpost about, probably this Sunday:
|
| https://docs.google.com/spreadsheets/d/1V_DZPpc7_BP3bmOR8Ees...
|
| It's magical how the GPU version is basically flat (although with
| a high runtime init cost).
| xiaodai wrote:
| Looks cool but what's one toy problem that it can solve more
| efficiently than others?
| JackMorgan wrote:
| Here is an example of it summing a huge set of numbers 100x
| faster than in C.
|
| https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md#par...
| ashdnazg wrote:
| Note that it's not 100x faster than C, but than bend running
| on one CPU thread.
|
| Running the equivalent C code takes ~2.3 seconds on my
| machine. Same order of magnitude as bend on the beefy GPU.
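For reference, the GUIDE's benchmark sums a range by splitting it as a binary tree rather than a sequential loop. A Python sketch of the same shape (sequential here, but each half is independent, which is what HVM exploits; `tree_sum` is a made-up name):

```python
def tree_sum(depth, x=0):
    # sums the integers 0 .. 2^depth - 1 by recursive halving;
    # the two recursive calls share no state, so a parallel
    # runtime is free to evaluate them on different cores
    if depth == 0:
        return x
    return tree_sum(depth - 1, 2 * x) + tree_sum(depth - 1, 2 * x + 1)

print(tree_sum(10))  # same as sum(range(1024))
```

The point of the comparison above stands: an optimizing C compiler turns the equivalent sequential loop into very fast code, so the "100x" is against single-threaded Bend, not against C.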
| mjaniczek wrote:
| This is unproven (and not a toy problem), but I imagine it's
| going to do pretty well at compilers. The amount of time I'm
| waiting at work, hypnotizing the tsc process that sits at 100%
| CPU, wishing it was parallel...
| naltroc wrote:
| so cool. I see it has a lib target; can we use it as a crate
| instead of an external program?
| DrNosferatu wrote:
| Is there an OpenCL backend?
| yetihehe wrote:
| Wow, Bend looks like a nice language.
|
| > That's a 111x speedup by doing nothing. No thread spawning, no
| explicit management of locks, mutexes. We just asked bend to run
| our program on RTX, and it did. Simple as that. Note that, for
| now, Bend only supports 24-bit machine ints (u24), thus, results
| are always mod 2^24.
|
| Ahh, not even 32-bit? Hmm, that seems pretty arbitrary to
| someone not accustomed to GPUs and wanting to solve some
| problems requiring 64 bits (gravitational simulation of the
| solar system at millimeter resolution could use ~58-bit ints
| for position).
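The "results are always mod 2^24" behavior quoted above is plain wraparound at 24 bits. A quick sketch of what that does to larger values (assumed semantics, inferred from the quoted docs; `u24` is a made-up helper):

```python
U24_MASK = (1 << 24) - 1  # Bend's u24 keeps only the low 24 bits

def u24(n):
    # truncate an arbitrary Python int to u24 wraparound semantics
    return n & U24_MASK

assert u24(2**24) == 0      # wraps to zero
assert u24(2**24 + 5) == 5
assert u24(2**58) == 0      # ~58-bit positions would not survive
```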
| GGO wrote:
| 64-bit coming soon
| exabrial wrote:
| > First, install Rust nightly.
|
| Eeek.
| mattdesl wrote:
| Looking forward to using this. Curious about how far away
| WebGPU/WASM support might be, it could provide a single cross-
| platform backend.
| abeppu wrote:
| I for one found the 'how is this possible' video near the bottom
| of the page to be unhelpful:
|
| - surely for `3 x 3 = 9`, there is some concept of primitive
| operations?
|
| - I get that replacement of patterns in a graph can be done in
| parallel, but (a) identifying when a rewrite rule should apply
| and (b) communicating the state of the updated graph to worker
| threads and (c) organizing worker threads to agree on which does
| each task all take some effort. When is this more work than the
| original computation (as in the 3x3 example)?
| nojvek wrote:
| 3 x 3 seemed like a pretty bad example to show how they
| parallelize.
| jiehong wrote:
| The flip with Chinese characters in the middle tripped me up.
| I guess they wanted it to look "complicated"...
| gigatexal wrote:
| The first graphic midway or so down the page has this tag:
|
| tested on: CPU - Apple M3 Max, GPU - NVIDIA RTX 4090
|
| But how? I thought eGPUs don't work on Apple Silicon, and the
| PCIe-equipped Mac Pro is still M2-based, no?
| GGO wrote:
| 2 different machines
| zackmorris wrote:
| This is nice, and obvious. I've waited about 20 years since I
| learned MATLAB and GNU Octave for someone to make a graph solver
| like this. And about 25 years since I first had the idea, when I
| was learning VLSI with VHDL in college and didn't see anything
| like the functional programming of circuits in what at the time
| was the imperative C++ world. The closest thing then was Lisp,
| but nobody talked about how the graph representation
| (intermediate code or i-code in the imperative world) could be
| solved in an auto-parallelized way.
|
| We still see this today in how languages go out of their way to
| implement higher order method libraries (map/reduce/filter) but
| then under the hood there is no multithreading, they just expect
| the developer to annotate their loops to be parallel because the
| languages aren't formal enough to know about side effects in the
| innermost logic, and don't support immutability or performant
| pass-by-value semantics with copy-on-write anyway. So we end up
| with handwavy languages like Rust that put all of that mental
| load onto the developer for basically no gain, they just save
| memory by performing computation in-place imperatively.
|
| I also like how Bend sidesteps the nonexistence of highly scaled
| symmetric multiprocessing CPUs by supporting GPUs. It makes the
| argument moot that GPUs can't be stopped because they're too big
| to fail. Julia is the only other language I've seen that tries
| this. I wish Clojure did, although it's been a long time since I
| followed it so maybe it has some parallelism?
|
| I would have dearly loved to work on something like Bend, had
| someone solved the funding issue. Nobody wants to pay for pure
| research, and nobody sees the need for languages that do what
| everyone else is doing except easier. We have Kickstarter for
| widgets and Patreon for influencers, but makers have to bootstrap
| themselves or learn everything about finance or live in the right
| city or have a large network to hopefully meet an angel investor
| or work in academia and lose all rights to what they invent while
| spending the majority of their time hustling for grants anyway.
| So it just never happens and we're stuck with the same old busted
| techniques. Like how Hollywood only has money for sequels and
| reboots or the recording industry only has money for canned
| corporate music and hits from already famous artists and yet
| another cover that yanks the original better song off the radio.
|
| A quarter of a century can go by in the blink of an eye if you
| get suckered into building other people's dreams as a people-
| pleaser. Be careful what you work on.
| jjtheblunt wrote:
| > A quarter of a century can go by in the blink of an eye if
| you get suckered into building other people's dreams as a
| people-pleaser. Be careful what you work on
|
| well said! i find myself reflecting the same sentiment when
| away from the computer (and i've avoided the people-pleaser
| thing, but what you said resonates as i watch the world)
| croemer wrote:
| This got a duplicate recently:
| https://news.ycombinator.com/item?id=40390287
|
| And is a duplicate of this:
| https://news.ycombinator.com/item?id=40383196
| flakiness wrote:
| This kind of sounds like Mojo. I wonder how these compare?
| (Besides HVM/Bend being opensource, which is awesome.)
|
| https://www.modular.com/max/mojo
| IsTom wrote:
| Reminds me a lot of the reduceron, sans FPGA.
| hintymad wrote:
| Speaking of parallel computing, any book or series of books that
| can help an engineer learn parallel programming and go from
| zero to hero? Ideally the books would cover intuition,
| in-depth details, and theory. Something like The Art of
| Multiprocessor Programming by Herlihy et al. for concurrent
| programming, even though that book arguably still has too
| steep a learning curve.
___________________________________________________________________
(page generated 2024-05-17 23:01 UTC)