[HN Gopher] Bend: a high-level language that runs on GPUs (via H...
___________________________________________________________________
Bend: a high-level language that runs on GPUs (via HVM2)
Author : LightMachine
Score : 494 points
Date : 2024-05-17 14:23 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| ruste wrote:
| Been watching your development for a while on Twitter. This is a
| monumental achievement and I hope it gets the recognition it
| deserves.
| shadowpho wrote:
| Wow this is very impressive!
| darlansbjr wrote:
| Would a compiler be faster by using HVM? I'd love to see a
| fully parallel version of TypeScript's tsc.
| KeplerBoy wrote:
| What's going on with the super-linear speedup going from one
| thread to all 16?
|
| 210 seconds (3.5 minutes) to 10.5 seconds is a 20x speedup, which
| isn't really expected.
| byteknight wrote:
| It's possible to see such scaling when any level of
| cache or I/O is involved.
| LightMachine wrote:
| The single-thread case ran a little slower than it should in
| this live demo due to a mistake on my part: `run` redirected to
| the Rust interpreter rather than the C interpreter, and the Rust
| one is a little bit slower. The numbers on the site and in all
| the docs are correct, though, and the actual speedup is ~12x,
| not ~16x.
| KeplerBoy wrote:
| Thanks for the explanation and the cool project.
|
| I will give bend a shot on some radar signal processing
| algorithms.
| LightMachine wrote:
| I apologize, I gave you the wrong answer.
|
| I thought you were talking about the DEMO example, which ran
| ~30% slower than expected. Instead, you were talking about the
| README, which was actually incorrect. I noticed the error and
| edited it. I explained the issue in another comment.
| kkukshtel wrote:
| Honestly incredible, and congrats on the release after what looks
| like an insane amount of work.
| davidw wrote:
| As a resident of Bend, Oregon... it was kind of funny to read
| this and I'm curious about the origin of the name.
| bytK7 wrote:
| As a fellow resident of Bend I felt the same way when I saw
| this.
| noumenon1111 wrote:
| As a native Bendite but not current Bend resident, seeing
| that word with a capital letter always makes me smell juniper
| and sagebrush a little bit.
| developedby wrote:
| Bending is an operation similar to folding, both in real life
| and in the language. While fold is recursive on data, bend is
| recursive on a boolean condition (like a pure while that
| supports multiple branching recursion points).
|
| I was actually looking forward to seeing someone from Bend
| make a comment like this.
| alex_lav wrote:
| Totally off topic but I'll be driving there later this
| afternoon. Hoping it's as beautiful as last time!
| davidw wrote:
| If you're going to be here for a bit (I am heading out of
| town on a bike trip for a few days), always happy to grab a
| beer with fellow HN people!
| alex_lav wrote:
| Unfortunately just the weekend. I'm just in Portland tho so
| will definitely be back.
| blinded wrote:
| Thought the same thing!
| yetihehe wrote:
| Bend looks like a nice language.
|
| > That's a 111x speedup by doing nothing. No thread spawning, no
| explicit management of locks, mutexes. We just asked bend to run
| our program on RTX, and it did. Simple as that. Note that, for
| now, Bend only supports 24-bit machine ints (u24), thus, results
| are always mod 2^24.
|
| Ahh, not even 32-bit? Hmm, that seems pretty arbitrary for someone
| not accustomed to GPUs and wanting to solve problems that
| require 64 bits (a gravitational simulation of the solar system at
| millimeter resolution could use ~58-bit ints for position).
| LightMachine wrote:
| We will have 64-bit boxed numbers really soon! As in, next
| month, or earlier if users find this to be a higher priority.
| yetihehe wrote:
| What other types are you planning? Maybe some floats (even if
| only on CPU targets, that would be nice).
| LightMachine wrote:
| Immutable textures and strings. Perhaps actual mutable
| arrays. Many numeric types like F64, U64, I64. And some
| vector types like F16x4.
| Archit3ch wrote:
| Is there a platform with native hardware u64? Maybe some FPGA?
| Archit3ch wrote:
| Sorry, meant u24.
| notfed wrote:
| This is really, really cool. This makes me think, "I could
| probably write a high performance GPU program fairly easily"...a
| sentence that's never formed in my head.
| developedby wrote:
| That's the main idea!
| delu wrote:
| Ten years ago, I took a course on parallel algorithms (15-210 at
| CMU). It pitched parallelism as the future of computing as
| Moore's law would hit inevitable limits. I was sold and I was
| excited to experiment with it. Unfortunately, there weren't many
| options for general parallel programming. Even the language we
| used for class (SML) wasn't parallel (there was a section at the
| end about using extensions and CUDA but it was limited from what
| I recall).
|
| Since then, I was able to make some experiments with
| multithreading (thanks Rust) and getting very creative with
| shaders (thanks Shadertoy). But a general parallel language on
| the GPU? I'm super excited to play with this!
| shwestrick wrote:
| Nowadays 210 is actually parallel! You can run 210-style code
| using MaPLe (https://github.com/MPLLang/mpl) and get
| competitive performance with respect to C/C++.
|
| If you liked 210, you might also like https://futhark-lang.org/
| which is an ML-family language that compiles to GPU with good
| performance.
| amelius wrote:
| Huh, the Maple name is already used by a well-known computer
| algebra project.
|
| https://en.wikipedia.org/wiki/Maple_(software)
| Rodeoclash wrote:
| The trend towards multiple cores in machines was one of the
| reasons I decided to learn Elixir.
| egnehots wrote:
| the interesting comparison nowadays would be against mojo:
|
| https://www.modular.com/max/mojo
| highfrequency wrote:
| > CPU, Apple M3 Max, 1 thread: 3.5 minutes
|
| > CPU, Apple M3 Max, 16 threads: 10.26 seconds
|
| Surprised to see a _more than linear_ speedup in CPU threads.
| What's going on here?
| Archit3ch wrote:
| More cores = more caches?
| LightMachine wrote:
| I believe the single-core version was running slower due to the
| memory getting full. The benchmark was adding 2^30 numbers, but
| HVM2 32-bit has a limit of 2^29 nodes. I've re-run it with 2^28
| instead, and the numbers are `33.39 seconds` (1 core) vs `2.94
| seconds` (16 cores). You can replicate the benchmark on an
| Apple M3 Max. I apologize for the mistake.
| ziedaniel1 wrote:
| Very cool idea - but unless I'm missing something, this seems
| very slow.
|
| I just wrote a simple loop in C++ to sum up 0 to 2^30. With a
| single thread without any optimizations it runs in 1.7s on my
| laptop -- matching Bend's performance on an RTX 4090! With -O3 it
| vectorizes the loop to run in less than 80ms.
|
|     #include <iostream>
|     int main() {
|         int sum = 0;
|         for (int i = 0; i < 1024*1024*1024; i++) {
|             sum += i;
|         }
|         std::cout << sum << "\n";
|         return 0;
|     }
| rroriz wrote:
| I think the point is that Bend is at a much higher level than
| C++. But to be fair: I also may be missing the point!
| 5- wrote:
| here is the same loop finishing in one second on my laptop,
| single-threaded, in a very high-level language, q:
|
|     q)\t sum til floor 2 xexp 30
|     1031
| gslepak wrote:
| The point is that Bend parallelizes everything that can be
| parallelized without developers having to do that themselves.
| molenzwiebel wrote:
| If compiled with -O3 on clang, the loop is entirely optimized
| out: https://godbolt.org/z/M1rMY6qM9. Probably not the fairest
| comparison.
| LightMachine wrote:
| Exactly, this kind of thing always happens with these loops,
| which is why I think programs that allocate are fairer. But
| then people point out that the C allocator is terrible, so we
| can't make that point :')
| ziedaniel1 wrote:
| I used GCC and checked that it wasn't optimized out (which
| actually surprised me!)
| LightMachine wrote:
| Bend has no tail-call optimization yet. It is allocating a
| 1-billion-long stack, while C is just looping. If you compare
| against a C program that does actual allocations, Bend will
| most likely be faster with a few threads.
|
| Bend's codegen is still abysmal, but these are all low-hanging
| fruit. Most of the work went into making the parallel
| evaluator _correct_ (which is extremely hard!). I know that
| sounds like "trust me", but the single-thread performance will
| get much better once we start compiling procedures, generating
| loops, etc. It just hasn't been done yet.
|
| (I wonder if I should have waited a little bit more before
| actually posting it)
| nneonneo wrote:
| If they're low-hanging fruit, why not do that before posting
| about it publicly? All that happens is that you push yourself
| into a nasty situation: people get a poor first impression of
| the system and are less likely to trust you the second time
| around, and in the (possibly unlikely) event that the
| problems turn out to be harder than you expect, you wind up
| in the really nasty situation of having to deal with failed
| expectations and pressure to fix them quickly.
| naasking wrote:
| That's how development under open source works. You can't
| please everyone.
| nneonneo wrote:
| There's a big difference between developing something and
| announcing loudly that you have something cool; the
| developers have done the latter here.
| Ar-Curunir wrote:
| That's completely unfair. They _have_ developed something
| cool, just with not all the holes plugged.
| vrmiguel wrote:
| I think it's clearly pretty cool even if not as fast as
| people expect it to be
| trenchgun wrote:
| It is pretty cool milestone achieved, just not production
| ready.
| adw wrote:
| This is very cool and it's being treated unfairly, though
| it's also obviously not ready for prime time; it's an
| existence proof.
|
| To illustrate that, many people on here have been losing
| their minds over Kolmogorov-Arnold Networks, which are
| almost identically positioned; interesting idea, kind of
| cool, does what the paper claims, potentially useful in
| the future, definitely not any use at all for any real
| use-case right now.
|
| (In part that's probably because the average
| understanding of ML here is _not_ strong, so there's more
| deference and credulousness around those claims.)
| LightMachine wrote:
| Dude, we're running unrestricted recursion and closures on
| GPUs! If that's not cool to you, I apologize, but it is
| mind-blowingly cool to me, and I wanted to share it, even
| though the codegen is still early. Hell, I was actually
| going to publish it with the interpreters only, but I
| still coded an initial compiler because I thought people
| would like to see where it could go :(
| LightMachine wrote:
| I agree with you. But then there's the entire "release
| fast, don't wait until it is perfect" thing. And then there's
| the case that people using it will guide us in iteratively
| building what is needed. I'm still trying to find that
| balance; it isn't so easy. This release comes right after
| we finally managed to compile it to GPUs, which is a huge
| milestone people could care about - but there are almost no
| micro-optimizations.
| nneonneo wrote:
| You might want to double check with objdump if the loop is
| actually vectorized, or if the compiler just optimizes it out.
| Your loop actually performs signed integer overflow, which is
| UB in C++; the compiler could legally output anything. If you
| want to avoid the UB, declare sum as unsigned (unsigned integer
| overflow is well-defined); the optimization will still happen
| but at least you'll be guaranteed that it'll be correct.
| ziedaniel1 wrote:
| I did make sure to check before posting.
|
| Good point about the signed integer overflow, though!
| exitheone wrote:
| This seems pretty cool!
|
| Question: Does this take into account memory bandwidth and caches
| between cores? Because getting them wrong can easily make
| parallel programs slower than sequential ones.
| Twirrim wrote:
| For what it's worth, I ported the sum example to pure python.
|
|     def sum(depth, x):
|         if depth == 0:
|             return x
|         else:
|             fst = sum(depth-1, x*2+0) # adds the fst half
|             snd = sum(depth-1, x*2+1) # adds the snd half
|             return fst + snd
|
|     print(sum(30, 0))
|
| under pypy3 it executes in 0m4.478s, single threaded. Under
| python 3.12, it executed in 1m42.148s, again single threaded. I
| mention that because you include benchmark information:
|
|     CPU, Apple M3 Max, 1 thread: 3.5 minutes
|     CPU, Apple M3 Max, 16 threads: 10.26 seconds
|     GPU, NVIDIA RTX 4090, 32k threads: 1.88 seconds
|
| The bend single-threaded version has been running for 42 minutes
| on my laptop, is consuming 6GB of memory, and still hasn't
| finished (12th Gen Intel(R) Core(TM) i7-1270P, Ubuntu 24.04).
| That seems to be an incredibly slow interpreter. Has this been
| tested or developed on anything other than Macs / aarch64?
|
| I appreciate this is early days, but it's hard to get excited
| about what seems to be incredibly slow performance from a really
| simple example you give. If the simple stuff is slow, what does
| that mean for the complicated stuff?
|
| If I get a chance tonight, I'll re-run it with `-s` argument, see
| if I get anything helpful.
| LightMachine wrote:
| Running for 42 minutes is most likely a bug. Yes, we haven't
| done much testing outside of the M3 Max yet. I'm aware it is 2x
| slower on non-Apple CPUs. We'll work on that.
|
| For the `sum` example, Bend has a huge disadvantage, because it
| is allocating 2 IC nodes for each numeric operation, while
| Python is not. This is obviously terribly inefficient. We'll
| avoid that soon (just like HVM1 did). It just wasn't
| implemented in HVM2 yet.
|
| Note most of the work behind Bend went into making the parallel
| evaluator _correct_. Running closures and unrestricted
| recursion on GPUs is _extremely_ hard. We've just finished
| that part, so there was basically zero effort on micro-
| optimizations. HVM2's codegen is still abysmal. (And I was
| very clear about that in the docs!)
|
| That said, please try comparing the Bitonic Sort example, where
| both are doing the same amount of allocations. I think it will
| give a much fairer idea of how Bend will perform in practice.
| HVM1 used to be 3x slower than GHC in a single core, which
| isn't bad. HVM2 should get to that point not far in the future.
|
| Now, I totally acknowledge these "this is still bad but we
| promise it will get better!!" claims can be underwhelming, and
| I understand if you don't believe my words. But I actually
| believe that, with the foundation set, these micro
| optimizations will be the easiest part, and performance will
| skyrocket from here. In any case, we'll keep working on making
| it better, and reporting the progress as milestones are
| reached.
| Twirrim wrote:
| Bitonic sort runs in 0m2.035s. Transpiled to C and compiled,
| it takes 0m0.425s.
|
| That sum example, transpiled to C and compiled, takes
| 1m12.704s, so it looks like it's just the VM case that is
| having serious issues of some description!
| vrmiguel wrote:
| > it is allocating 2 IC nodes for each numeric operation,
| while Python is not
|
| While that's true, Python would be using big integers
| (PyLongObject) for most of the computations, meaning every
| number gets allocated on the heap.
|
| If we use a Python implementation that would avoid this, like
| PyPy or Cython, the results change significantly:
|     % cat sum.py
|     def sum(depth, x):
|         if depth == 0:
|             return x
|         else:
|             fst = sum(depth-1, x*2+0) # adds the fst half
|             snd = sum(depth-1, x*2+1) # adds the snd half
|             return fst + snd
|
|     if __name__ == '__main__':
|         print(sum(30, 0))
|
|     % time pypy sum.py
|     576460751766552576
|     pypy sum.py  4.26s user 0.06s system 96% cpu 4.464 total
|
| That's on an M2 Pro. I also imagine the result in Bend would
| not be correct since it only supports 24 bit integers,
| meaning it'd overflow quite quickly when summing up to 2^30,
| is that right?
|
| [Edit: just noticed the previous comment had already
| mentioned pypy]
|
| > I'm aware it is 2x slower on non-Apple CPUs.
|
| Do you know why? As far as I can tell, HVM has no
| aarch64/Apple-specific code. Could it be because Apple
| Silicon has wider decode blocks?
|
| > can be underwhelming, and I understand if you don't believe
| my words
|
| I don't think anyone wants to rain on your parade, but
| extraordinary claims require extraordinary evidence.
|
| The work you've done in Bend and HVM sounds impressive, but I
| feel the benchmarks need more evaluation/scrutiny. Since your
| main competitor would be Mojo and not Python, comparisons to
| Mojo would be nice as well.
| LightMachine wrote:
| The only claim I made is that it scales linearly with
| cores. Nothing else!
|
| I'm personally putting a LOT of effort into making our claims
| as accurate and truthful as possible, in every single
| place. Documentation, website, demos. I spent hours in
| meetings to make sure everything is correct. Yet, sometimes
| it feels that no matter how much effort I put, people will
| just find ways to misinterpret it.
|
| We published the real benchmarks, checked and double
| checked. And then you complained some benchmarks are not so
| good. Which we acknowledged, and provided causes, and how
| we plan to address them. And then you said the benchmarks
| need more evaluation? How does that make sense in the
| context of them being underwhelming?
|
| We're not going to compare to Mojo or other languages,
| specifically because it generates hate.
|
| Our only claim is:
|
| HVM2 is the first version of our Interaction Combinator
| evaluator that runs with linear speedup on GPUs. Running
| closures on GPUs required a _colossal_ amount of correctness
| work, and we're reporting this milestone. Moreover, we
| finally managed to compile a Python-like language to it.
| That is all that is being claimed, and nothing else. The
| codegen is still abysmal and single-core performance is bad
| - that's our next focus. If anything else was claimed, it
| wasn't us!
| CyberDildonics wrote:
| _The only claim I made is that it scales linearly with
| cores. Nothing else!_
|
| The other link on the front page says:
|
| "Welcome to the Parallel Future of Computation"
| LightMachine wrote:
| Scaling with cores is synonymous with parallel.
| Dylan16807 wrote:
| "Future" has some mild speed implications but it sounds
| like you're doing reasonably there, bug nonwithstanding.
| singhblom wrote:
| It also has "Not yet" implications ...
| LightMachine wrote:
| But it literally says we believe it is the future of
| parallel computing! If it was faster than GCC today, we
| would've written present :')
| IshKebab wrote:
| I think the issue is that there is the _implicit_ claim
| that this is faster than some alternative. Otherwise, what's
| the point?
|
| If you add some disclaimer like "Note: Bend is currently
| focused on correctness and scaling. On an absolute scale
| it may still be slower than single threaded Python. We
| plan to improve the absolute performance soon." then you
| won't see these comments.
|
| Also this defensive tone does not come off well:
|
| > We published the real benchmarks, checked and double
| checked. And then you complained some benchmarks are not
| so good. Which we acknowledged, and provided causes, and
| how we plan to address them. And then you said the
| benchmarks need more evaluation? How does that make sense
| in the context of them being underwhelming?
| LightMachine wrote:
| Right below install instructions, on Bend's README.md:
|
| > But keep in mind our code gen is still in its infancy,
| and is nowhere near as mature as SOTA compilers like GCC
| and GHC.
|
| Second paragraph of Bend's GUIDE.md:
|
| > While cool, Bend is far from perfect. In absolute terms
| it is still not so fast. Compared to SOTA compilers like
| GCC or GHC, our code gen is still embarrassingly bad, and
| there is a lot to improve. That said, it does what it
| promises: scaling horizontally with cores.
|
| Limitations section of HVM2's paper:
|
| > While HVM2 achieves near-linear speedup, its compiler
| is still extremely immature, and not nearly as fast as
| state-of-the-art alternatives like GCC or GHC. In single-
| thread CPU evaluation, HVM2 is still about 5x slower
| than GHC, and this number can grow to 100x on programs
| that involve loops and mutable arrays, since HVM2 doesn't
| feature these yet.
| IshKebab wrote:
| > Right below install instructions
|
| Yeah exactly. I read most of the readme and watched the
| demo, but I'm not interested in installing it so I missed
| this. I would recommend moving this to the first section
| in its own paragraph.
|
| I understand you might not want to focus on this but it's
| important information and not a bad thing at all.
| LightMachine wrote:
| That's great feedback actually, thank you.
|
| We'll add the disclaimer before the install instructions
| instead!
| vrmiguel wrote:
| That's true, you never mentioned Python or alternatives
| in your README; I guess I got Mandela'ed by the
| comments on Hacker News, so my bad on that.
|
| People are naturally going to compare the timings and
| function you cite to what's available to the community
| right now, though, that's the only way we can picture its
| performance in real-life tasks.
|
| > Mojo or other languages, specifically because it
| generates hate
|
| Mojo launched comparing itself to Python and didn't
| generate much hate, it seems, but I digress
|
| In any case, I hope Bend and HVM can continue to improve
| even further, it's always nice to see projects like
| those, specially from another Brazilian
| LightMachine wrote:
| Thanks, and I apologize if I got defensive; it is just
| that I put so much effort into being truthful, double-
| checking, putting disclaimers everywhere about every
| possible misinterpretation. Hell, this is right below the
| install instructions:
|
| > our code gen is still in its infancy, and is nowhere
| near as mature as SOTA compilers like GCC and GHC
|
| Yet people still misinterpret. It is frustrating because
| I don't know what I could've done better.
| alfalfasprout wrote:
| Don't worry about it. Keep at it, this is a very cool
| project.
|
| FWIW, on HN people are inherently going to try to actually
| use your project, so if it's meant to be (long term) a
| faster way to run X, people will evaluate it against that
| implicit benchmark.
| jonahx wrote:
| > I spent hours in meetings to make sure everything is
| correct. Yet, sometimes it feels that no matter how much
| effort I put, people will just find ways to misinterpret
| it.
|
| from reply below:
|
| > I apologize if I got defensive, it is just that I put
| so much effort on being truthful, double-checking,
| putting disclaimers everywhere about every possible
| misinterpretation.
|
| I just want to say: don't stop. There will always be some
| people who don't notice or acknowledge the effort to be
| precise and truthful. But others will. For me, this
| attitude elevates the project to something I will be
| watching.
| dheera wrote:
| > I'm personally putting a LOT of effort to make our
| claims as accurate and truthful as possible, in every
| single place
|
| Thank you. I understand that in such an early iteration of a
| language there are going to be lots of bugs.
|
| This seems like a very, very cool project and I really
| hope it or something like it is successful at making
| utilizing the GPU less cumbersome.
| mgaunard wrote:
| Identifying what's parallelizable is valuable in the
| world of language theory, but pure functional languages
| are as trivial as it gets, so that research isn't exactly
| ground-breaking.
|
| And you're just not fast enough for anyone doing HPC,
| where the problem is not identifying what can be
| parallelized, but figuring out how to make the most of the
| hardware, i.e. the codegen.
| glitchc wrote:
| I have no dog in this fight, but feel compelled to defend the
| authors here. Recursion does not test compute, rather it tests
| the compiler's/interpreter's efficiency at standing up and
| tearing down the call stack.
|
| Clearly this language is positioned at using the gpu for
| compute-heavy applications and it's still in its early stages.
| Recursion is not the target application and should not be a
| relevant benchmark.
| light_hue_1 wrote:
| It's the author's own benchmark, but it shows the system
| doesn't work. Recursion isn't hard to eliminate for pure
| functions; there's nothing inherently wrong with it.
|
| The authors made obviously false claims: that this is an
| extremely fast implementation, when it's not. That's really
| poor form.
| rowanG077 wrote:
| Where did he claim it is fast? As far as I can see the only
| claim is that it scales linearly with cores. Which it
| actually seems to do.
| light_hue_1 wrote:
| They show benchmarks with massive performance
| improvements. That clearly implies that the system is
| fast. Where's the disclaimer that this performance is
| worse than Python but eats up an entire massive GPU?
|
| The readme also blatantly lies: "It is not the kind of
| algorithm you'd expect to run fast on GPUs." Bitonic sort
| on GPUs is a thing, and it performs well.
|
| The gaslighting is just amazing here.
| rowanG077 wrote:
| Yes, and those benchmarks are real. Showing linear speedup
| in the number of cores when writing standard code is a
| real achievement. If you assumed that somehow means this
| is a state-of-the-art compiler with super blazing
| performance, that's on no one but you. The readme lays it
| out very clearly.
| klabb3 wrote:
| This is very exciting. I don't have any GPU background, but I
| have been worrying a lot about CUDA cementing itself in the
| ecosystem. Here devs don't need CUDA directly, which would help
| decouple the ecosystem from cynical mega corps, always good!
| Anyway, enough politics..
|
| Tried to see what the language is like beyond hello world and
| found the guide[1]. It looks like a Python and quacks like a
| Haskell? For instance, variables are immutable, and tree-like
| divide and conquer data structures/algorithms are promoted for
| getting good results. That makes sense I guess! I'm not surprised
| to see a functional core, but I'm surprised to see the pythonic
| frontend, not that it matters much. I must say I highly doubt
| that it will make it much easier for Python devs to learn Bend
| though, although I don't know if that's the goal.
|
| What are some challenges in programming with these kind of
| restrictions in practice? Also, is there good FFI options?
|
| [1]: https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md
| mathiasgredal wrote:
| We have a replacement for CUDA: it is called C++17 parallel
| algorithms. It has vendor support for running on the GPU from
| Intel, AMD, and NVIDIA, and will also run on all your cores on
| the CPU. It uses the GPU vendor's compiler to convert your C++
| into something that can natively run on the GPU. With unified
| memory support, it becomes very fast to run computations on
| heap-allocated memory using the GPU, but implementations also
| support non-unified memory.
|
| Vendor support:
|
| -
| https://www.intel.com/content/www/us/en/developer/articles/g...
|
| - https://rocm.blogs.amd.com/software-tools-
| optimization/hipst...
|
| - https://docs.nvidia.com/hpc-
| sdk/archive/20.7/pdf/hpc207c++_p...
| britannio wrote:
| Incredible feat, congratulations!
| light_hue_1 wrote:
| Massive promises of amazing performance but they can't find one
| convincing example to showcase. It's hard to see what they're
| bringing to the table when even the simplest possible Haskell
| code runs just as fast on my 4-year-old laptop with an ancient
| version of GHC (8.8). No need for an RTX 4090.
|
|     module Main where
|
|     sum' :: Int -> Int -> Int
|     sum' 0 x = x
|     sum' depth x = sum' (depth - 1) ((x * 2) + 0) +
|                    sum' (depth - 1) ((x * 2) + 1)
|
|     main = print $ sum' 30 0
|
| Runs in 2.5s. Sure it's not on a GPU, but it's faster! And things
| don't get much more high level.
|
| If you're going to promise amazing performance from a high level
| language, I'd want to see a comparison against JAX.
|
| It's an improvement over traditional interaction nets, sure! But
| interaction nets have always been a failure performance-wise.
| Interaction nets are PL equivalent of genetic algorithms in ML,
| they sound like a cool idea and have a nice story, but then they
| always seem to be a dead end.
|
| Interaction nets optimize parallelism at the cost of everything
| else. Including single-threaded performance. You're just warming
| up the planet by wasting massive amounts of parallel GPU cores to
| do what a single CPU core could do more easily. They're just the
| wrong answer to this problem.
| LightMachine wrote:
| You're wrong. The Haskell code is compiled to a loop, which we
| didn't optimize for yet. I've edited the README to use the
| Bitonic Sort instead, on which allocations are unavoidable.
| Past N=20, HVM2 performs 4x faster than GHC -O2.
| light_hue_1 wrote:
| What? I ran your example, from your readme, where you promise
| a massive performance improvement, and you're accusing me of
| doing something wrong?
|
| This is exactly what a scammer would say.
|
| I guess that's the point here. Scam people who don't know
| anything about parallel computing by never comparing against
| any other method?
| LightMachine wrote:
| Thanks for the feedback! Some clarifications:
|
| 1. I didn't accuse you of doing something wrong, just that
| your claim was wrong! It has been proven that Interaction
| Combinators are an optimal model of concurrent computation.
| I also pointed to cases where it achieves practical
| efficiency, outperforming GHC's highest optimization
| level.
|
| 2. The claimed performance scaling has indeed been
| achieved, and the code is open for anyone to replicate our
| results. The machines used are listed in the repository and
| paper. If you have any trouble replicating, please let me
| know!
|
| 3. We're not selling any product. Bend is Apache-licensed.
| vegadw wrote:
| A lot of negativity in these threads. I say ~cudas~ kudos to the
| author for getting this far! The only similar project I'm aware
| of is Futhark, and that's Haskell-y syntax - great for some
| people, but pretty arcane and hard to work with for the
| general class of C/C++/Python/JS/Java/etc. devs. My biggest
| complaint with this is that, unlike Futhark, it only targets
| CUDA or multi-core. Futhark can target OpenCL, CUDA, ISPC,
| HIP, single-core CPU, or multi-core CPU. The performance
| problems others are pointing out I'm certain can be tackled.
| pjmlp wrote:
| Chapel has a decent use in HPC.
|
| Also, NVidia has sponsored variants of Haskell, .NET, Java,
| and Julia on CUDA, has a Python JIT, and is collaborating
| with the Mojo folks.
| neonsunset wrote:
| Take a look at ILGPU. It's very nice and has been around for a
| long time! (just no one knows about it, sadly)
|
| Short example: https://github.com/m4rs-
| mt/ILGPU/blob/master/Samples/SimpleM...
|
| Supports even advanced bits like inline PTX assembly:
| https://github.com/m4rs-mt/ILGPU/blob/master/Samples/InlineP...
| Archit3ch wrote:
| Pure functions only? This is disappointing. Furthermore, it
| invites a comparison with JAX.
| chc4 wrote:
| 24-bit integers and floats, no array datatype, and a maximum 4GB
| heap of nodes are very harsh restrictions, especially for any
| workloads that would actually want to be running on a GPU. The
| limitations in the HVM2 whitepaper about unsound evaluation
| around closures and infinite loops, because it evaluates both
| sides of a conditional without any short-circuiting, are also
| extremely concerning.
|
| Before you reply "these are things we can address in the future":
| that doesn't matter. Everyone can address everything in the
| future. They are currently hard technical barriers to its use,
| with no way of knowing the level of effort that will require or
| the knock-on effects, especially since some of these issues have
| been "we can fix that later" for ten years.
|
| I also _highly_ recommend changing your benchmark numbers from
| "interactions per second" to a standard measurement like FLOPS.
| No one else on earth knows how many of those interactions are
| pure overhead from your evaluation semantics rather than useful
| work. They come across as an attempt to wow an audience with big
| numbers rather than an apples-to-apples comparison with other
| languages.
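A rough sketch of the kind of conversion being suggested, using figures quoted elsewhere in this thread (the depth-30 tree sum running in about 10.5 s on 16 threads; the numbers are illustrative assumptions, not official benchmarks):

```python
def useful_adds(depth):
    # A full binary tree of the given depth has 2**depth leaves
    # and performs 2**depth - 1 additions of "useful" work.
    return 2 ** depth - 1

ops = useful_adds(30)       # ~1.07e9 useful additions
wall_time_s = 10.5          # 16-thread time quoted in this thread
print(ops / wall_time_s)    # ~1e8 useful ops/s, hardware-comparable
```

Dividing useful operations by wall time gives a figure that can be set directly against other languages, and against reported interactions/sec to estimate the interpreter's overhead.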
| LightMachine wrote:
| So use a metric that makes absolutely no sense in the given
| domain, instead of one that is completely correct, sensible,
| accurate, established in the literature, and vastly superior in
| context? What even is a FLOP in the context of Interaction Net
| evaluation? These things aren't even interchangeable.
| croemer wrote:
| Dupe of https://news.ycombinator.com/item?id=40387394 and
| https://news.ycombinator.com/item?id=40383196
| CorrectingYou wrote:
| OP comes around with some of the coolest things posted in HN
| recently, and all he gets is extensive criticism, when it is
| clear that this is an early version :/
| swayvil wrote:
| The coolest things are often the most difficult to understand.
|
| Difficult to understand is often threatening.
|
| Criticism is a popular response to threat and is the form of
| reply that requires the least understanding.
| andrewp123 wrote:
| I just wanted to comment on how good the homepage is - it's
| immediately clear what you do. Most people working with
| "combinators" would feel a need to use lots of scary lingo, but
| OP actually shows the simple idea behind the tool (this is the
| opposite take of most academics, who instead show every last
| detail and never tell you what's going on). I really appreciate
| it - we need more of this.
| markush_ wrote:
| Exciting project, congrats on the release!
| funny_name wrote:
| What kind of software would this language be good for? I assume
| it's not the kind of language you'd use for web servers exactly.
| trenchgun wrote:
| Erlang-like actor models would be well suited, so yeah, you
| could use it for web servers (assuming they are able to finish
| the language). It's a general purpose high level programming
| language.
| wolfspaw wrote:
| Nice!
|
| Python-like + High-performance.
|
| And, different from Mojo, it's fully open-source.
| MrLeap wrote:
| This is incredible. This is the kind of work we need to crack
| open the under utilized GPUs out there. I know LLMs are all the
| rage, but there's more gold in them hills.
| gsuuon wrote:
| Congrats on the HVM2 launch! Been following for a while, excited
| to see where this project goes. For others who are lost on the
| interaction net stuff, there was a neat show hn that gave a more
| hands-on interactive intro:
| https://news.ycombinator.com/item?id=37406742 (the 'Get Started'
| writeup was really helpful)
| kerkeslager wrote:
| This looks like the language I've wanted for a long time. I'm
| excited to see how this plays out.
| robust-cactus wrote:
| This is awesome and much needed. Keep going, forget the overly
| pedantic folks, the vision is great and early results are
| exciting.
| Arch485 wrote:
| I want to congratulate the author on this, it's super cool.
| Making correct automatic parallelization is nothing to sneeze at,
| and something you should absolutely be proud of.
|
| I'm excited to see how this project progresses.
| mbforbes wrote:
| Congratulations on the launch and hard work so far! We need
| projects like this. Great readme and demo as well.
|
| Every time I try to write shaders, or even peek through my
| fingers at CUDA C(++) code, I recoil in disbelief that we don't
| have high level programming yet on the GPU. I can't wait until we
| do. The more great projects attacking it the better in my book.
| npalli wrote:
| Is the recursive sum the best function to show multi-threading or
| GPU speedups? Seems unlikely. FWIW, I ported the Python example
| to Julia and it ran in about 2.5 seconds, the same as the C++
| version. Pure Python 3.12 took 183 seconds.
|         function sum(depth, x)
|             if depth == 0
|                 return x
|             else
|                 fst = sum(depth-1, x*2+0)
|                 snd = sum(depth-1, x*2+1)
|             end
|             return fst + snd
|         end
|
|         println(sum(30,0))
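For reference, the pure-Python version of the same benchmark presumably looks something like this (a sketch; the parent's ~183 s figure is for depth 30, reduced here so it finishes quickly):

```python
def tree_sum(depth, x):
    # Same recursion as the Julia port above: sums the leaf labels
    # 0 .. 2**depth - 1 of a full binary tree.
    if depth == 0:
        return x
    fst = tree_sum(depth - 1, x * 2 + 0)
    snd = tree_sum(depth - 1, x * 2 + 1)
    return fst + snd

print(tree_sum(20, 0))  # sum of 0..2**20-1; depth 30 reproduces the timing
```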
| temp123789246 wrote:
| Congrats!
|
| I've been watching HVM for a while and think it's extremely cool.
|
| My intuition is that this will eventually be a really big deal.
| praetor22 wrote:
| Look, I understand the value proposition and how cool it is from
| a theoretical standpoint, but I honestly don't think this will
| ever become relevant.
|
| Here are some notes from my first impressions and after skimming
| through the paper. And yes, I am aware that this is very very
| early software.
|
| 1. Bend looks like an extremely limited DSL. No FFI. No way of
| interacting with raw buffers. Weird 24-bit floating-point
| format.
|
| 2. There's a reason why ICs are not relevant: performance is and
| will always be terrible. There is no other way to put it; graph
| traversal simply doesn't map well onto hardware.
|
| 3. The premise of optimal reduction is valid. However, you still
| need to write the kernels in a way that can be parallelized
| (i.e., no data dependencies, use of recursion).
|
| 4. There are no serious examples that directly compare Bend/HVM
| code with its equivalent OMP/CUDA program. How am I supposed to
| evaluate the reduction in implementation complexity and what to
| expect in performance? So many claims, so few actual
| comparisons.
|
| 5. In the real world of high-performance parallel computing,
| tree-like structures are non-existent. Arrays are king, and
| that's because of the physical nature of how memory works at the
| hardware level. And do you know what works best on mutable,
| contiguous memory buffers? Loops. We'll see when HVM implements
| those.
|
| In the end, what we currently have is a half-baked language that
| is (almost) fully isolated from external data, extremely slow,
| and a massive abstraction over the underlying hardware
| (unutilized features: multilevel caches, tensor cores, SIMD,
| atomics).
|
| I apologize if this comes off as harsh; I still find the
| technical implementation and the theoretical background very
| interesting. I'm simply not (yet) convinced of its usefulness in
| the real world.
| netbioserror wrote:
| So HVM finally yields fruit. I've been eagerly awaiting this day!
| Bend seems like a very suitable candidate for a Lispy
| S-expression makeover.
| smusamashah wrote:
| I have no interest in this tech as it's apparently for backend
| stuff and not actually rendering things by itself.
|
| But the demo gif is probably the best I have seen in a Github
| readme. I watched it till the end. It was instantly engaging. I
| wanted to see the whole story unfold.
___________________________________________________________________
(page generated 2024-05-17 23:00 UTC)