[HN Gopher] Bend: a high-level language that runs on GPUs (via H...
___________________________________________________________________
Bend: a high-level language that runs on GPUs (via HVM2)
Author : LightMachine
Score : 494 points
Date : 2024-05-17 14:23 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| ruste wrote:
| Been watching your development for a while on Twitter. This is a
| monumental achievement and I hope it gets the recognition it
| deserves.
| shadowpho wrote:
| Wow this is very impressive!
| darlansbjr wrote:
| Would a compiler be faster by using HVM? I'd love to see a
| fully parallel version of TypeScript's tsc.
| KeplerBoy wrote:
| What's going on with the super-linear speedup going from one
| thread to all 16?
|
| 210 seconds (3.5 minutes) to 10.5 seconds is a 20x speedup, which
| isn't really expected.
| byteknight wrote:
| It's possible to see such scaling when any level of
| cache or I/O is involved.
| LightMachine wrote:
| The single-thread case ran a little slower than it should in
| this live demo due to a mistake on my part: `run` redirected to
| the Rust interpreter rather than the C interpreter, and the Rust
| one is a little bit slower. The numbers on the site and in all
| the docs are correct, though, and the actual speedup is ~12x,
| not ~16x.
| KeplerBoy wrote:
| Thanks for the explanation and the cool project.
|
| I will give bend a shot on some radar signal processing
| algorithms.
| LightMachine wrote:
| I apologize, I gave you the wrong answer.
|
| I thought you were talking about the DEMO example, which ran
| ~30% slower than expected. Instead, you were talking about the
| README, which was actually incorrect. I noticed the error and
| edited it. I explained the issue in another comment.
| kkukshtel wrote:
| Honestly incredible, and congrats on the release after what looks
| like an insane amount of work.
| davidw wrote:
| As a resident of Bend, Oregon... it was kind of funny to read
| this and I'm curious about the origin of the name.
| bytK7 wrote:
| As a fellow resident of Bend I felt the same way when I saw
| this.
| noumenon1111 wrote:
| As a native Bendite but not current Bend resident, seeing
| that word with a capital letter always makes me smell juniper
| and sagebrush a little bit.
| developedby wrote:
| Bending is an operation similar to folding, both in real life
| and in the language. While fold is recursive on data, bend is
| recursive on a boolean condition (like a pure while that
| supports multiple branching recursion points).
|
| I was actually looking forward to seeing someone from Bend
| make a comment like this.
| alex_lav wrote:
| Totally off topic but I'll be driving there later this
| afternoon. Hoping it's as beautiful as last time!
| davidw wrote:
| If you're going to be here for a bit (I am heading out of
| town on a bike trip for a few days), always happy to grab a
| beer with fellow HN people!
| alex_lav wrote:
| Unfortunately just the weekend. I'm just in Portland tho so
| will definitely be back.
| blinded wrote:
| Thought the same thing!
| yetihehe wrote:
| Bend looks like a nice language.
|
| > That's a 111x speedup by doing nothing. No thread spawning, no
| explicit management of locks, mutexes. We just asked bend to run
| our program on RTX, and it did. Simple as that. Note that, for
| now, Bend only supports 24-bit machine ints (u24), thus, results
| are always mod 2^24.
|
| Ahh, not even 32-bit? Hmm, that seems pretty arbitrary for someone
| not accustomed to GPUs and wanting to solve problems that
| require 64 bits (a gravitational simulation of the solar system at
| millimeter resolution could use ~58-bit ints for position).
| LightMachine wrote:
| We will have 64-bit boxed numbers really soon! As in, next
| month, or earlier if users find this to be a higher priority.
| yetihehe wrote:
| What other types are you planning? Maybe some floats (even if
| only on CPU targets, that would be nice).
| LightMachine wrote:
| Immutable textures and strings. Perhaps actual mutable
| arrays. Many numeric types like F64, U64, I64. And some
| vector types like F16x4.
| Archit3ch wrote:
| Is there a platform with native hardware u64? Maybe some FPGA?
| Archit3ch wrote:
| Sorry, meant u24.
| notfed wrote:
| This is really, really cool. This makes me think, "I could
| probably write a high performance GPU program fairly easily"...a
| sentence that's never formed in my head.
| developedby wrote:
| That's the main idea!
| delu wrote:
| Ten years ago, I took a course on parallel algorithms (15-210 at
| CMU). It pitched parallelism as the future of computing as
| Moore's law would hit inevitable limits. I was sold and I was
| excited to experiment with it. Unfortunately, there weren't many
| options for general parallel programming. Even the language we
| used for class (SML) wasn't parallel (there was a section at the
| end about using extensions and CUDA but it was limited from what
| I recall).
|
| Since then, I was able to make some experiments with
| multithreading (thanks Rust) and getting very creative with
| shaders (thanks Shadertoy). But a general parallel language on
| the GPU? I'm super excited to play with this!
| shwestrick wrote:
| Nowadays 210 is actually parallel! You can run 210-style code
| using MaPLe (https://github.com/MPLLang/mpl) and get
| competitive performance with respect to C/C++.
|
| If you liked 210, you might also like https://futhark-lang.org/
| which is an ML-family language that compiles to GPU with good
| performance.
| amelius wrote:
| Huh, the Maple name is already used by a well-known computer
| algebra project.
|
| https://en.wikipedia.org/wiki/Maple_(software)
| Rodeoclash wrote:
| The trend towards multiple cores in machines was one of the
| reasons I decided to learn Elixir.
| egnehots wrote:
| the interesting comparison nowadays would be against mojo:
|
| https://www.modular.com/max/mojo
| highfrequency wrote:
| > CPU, Apple M3 Max, 1 thread: 3.5 minutes
|
| > CPU, Apple M3 Max, 16 threads: 10.26 seconds
|
| Surprised to see a _more than linear_ speedup in CPU threads.
| What's going on here?
| Archit3ch wrote:
| More cores = more caches?
| LightMachine wrote:
| I believe the single-core version was running slower due to the
| memory getting full. The benchmark was adding 2^30 numbers, but
| HVM2 32-bit has a limit of 2^29 nodes. I've re-run it with 2^28
| instead, and the numbers are `33.39 seconds` (1 core) vs `2.94
| seconds` (16 cores). You can replicate the benchmark on an
| Apple M3 Max. I apologize for the mistake.
| ziedaniel1 wrote:
| Very cool idea - but unless I'm missing something, this seems
| very slow.
|
| I just wrote a simple loop in C++ to sum up 0 to 2^30. With a
| single thread without any optimizations it runs in 1.7s on my
| laptop -- matching Bend's performance on an RTX 4090! With -O3 it
| vectorizes the loop to run in less than 80ms.
|
|     #include <iostream>
|     int main() {
|         int sum = 0;
|         for (int i = 0; i < 1024*1024*1024; i++) {
|             sum += i;
|         }
|         std::cout << sum << "\n";
|         return 0;
|     }
| rroriz wrote:
| I think the point is that Bend is at a much higher level than
| C++. But to be fair: I also may be missing the point!
| 5- wrote:
| here is the same loop finishing in one second on my laptop,
| single-threaded, in a very high-level language, q:
|
|     q)\t sum til floor 2 xexp 30
|     1031
| gslepak wrote:
| The point is that Bend parallelizes everything that can be
| parallelized without developers having to do that themselves.
| molenzwiebel wrote:
| If compiled with -O3 on clang, the loop is entirely optimized
| out: https://godbolt.org/z/M1rMY6qM9. Probably not the fairest
| comparison.
| LightMachine wrote:
| Exactly, this kind of thing always happens with these loops,
| which is why I think programs that allocate are fairer. But
| then people point out that the C allocator is terrible, so we
| can't make that point :')
| ziedaniel1 wrote:
| I used GCC and checked that it wasn't optimized out (which
| actually surprised me!)
| LightMachine wrote:
| Bend has no tail-call optimization yet. It is allocating a
| 1-billion-long stack, while C is just looping. If you compare
| against a C program that does actual allocations, Bend will
| most likely be faster with a few threads.
|
| Bend's codegen is still abysmal, but these are all low-hanging
| fruit. Most of the work went into making the parallel
| evaluator _correct_ (which is extremely hard!). I know that
| sounds like "trust me", but the single-thread performance will
| get much better once we start compiling procedures, generating
| loops, etc. It just hasn't been done yet.
|
| (I wonder if I should have waited a little bit more before
| actually posting it)
| nneonneo wrote:
| If they're low-hanging fruit, why not do that before posting
| about it publicly? All that happens is that you push yourself
| into a nasty situation: people get a poor first impression of
| the system and are less likely to trust you the second time
| around, and in the (possibly unlikely) event that the
| problems turn out to be harder than you expect, you wind up
| in the really nasty situation of having to deal with failed
| expectations and pressure to fix them quickly.
| naasking wrote:
| That's how development under open source works. You can't
| please everyone.
| nneonneo wrote:
| There's a big difference between developing something and
| announcing loudly that you have something cool; the
| developers have done the latter here.
| Ar-Curunir wrote:
| That's completely unfair. They _have_ developed something
| cool, just with not all the holes plugged.
| vrmiguel wrote:
| I think it's clearly pretty cool even if not as fast as
| people expect it to be
| trenchgun wrote:
| It is pretty cool milestone achieved, just not production
| ready.
| adw wrote:
| This is very cool and it's being treated unfairly, though
| it's also obviously not ready for prime time; it's an
| existence proof.
|
| To illustrate that, many people on here have been losing
| their minds over Kolmogorov-Arnold Networks, which are
| almost identically positioned; interesting idea, kind of
| cool, does what the paper claims, potentially useful in
| the future, definitely not any use at all for any real
| use-case right now.
|
| (In part that's probably because the average
| understanding of ML here is _not_ strong, so there's more
| deference and credulousness around those claims.)
| LightMachine wrote:
| Dude, we're running unrestricted recursion and closures on
| GPUs! If that's not cool to you, I apologize, but it is
| mind-blowingly cool to me, and I wanted to share it, even
| though the codegen is still early. Hell, I was actually
| going to publish it with the interpreters only, but I
| still coded an initial compiler because I thought people
| would like to see where it could go :(
| LightMachine wrote:
| I agree with you. But then there's the entire "release
| fast, don't wait until it is perfect" thing. And then there's
| the case that people using it will guide us in iteratively
| building what is needed. I'm still trying to find that
| balance; it isn't so easy. This release comes right after
| we finally managed to compile it to GPUs, which is a huge
| milestone people could care about - but there are almost no
| micro-optimizations.
| nneonneo wrote:
| You might want to double check with objdump if the loop is
| actually vectorized, or if the compiler just optimizes it out.
| Your loop actually performs signed integer overflow, which is
| UB in C++; the compiler could legally output anything. If you
| want to avoid the UB, declare sum as unsigned (unsigned integer
| overflow is well-defined); the optimization will still happen
| but at least you'll be guaranteed that it'll be correct.
| ziedaniel1 wrote:
| I did make sure to check before posting.
|
| Good point about the signed integer overflow, though!
| exitheone wrote:
| This seems pretty cool!
|
| Question: Does this take into account memory bandwidth and caches
| between cores? Because getting them wrong can easily make
| parallel programs slower than sequential ones.
| Twirrim wrote:
| For what it's worth, I ported the sum example to pure python.
|
|     def sum(depth, x):
|         if depth == 0:
|             return x
|         else:
|             fst = sum(depth-1, x*2+0) # adds the fst half
|             snd = sum(depth-1, x*2+1) # adds the snd half
|             return fst + snd
|
|     print(sum(30, 0))
|
| under pypy3 it executes in 0m4.478s, single threaded. Under
| python 3.12, it executed in 1m42.148s, again single threaded. I
| mention that because you include benchmark information:
|
|     CPU, Apple M3 Max, 1 thread: 3.5 minutes
|     CPU, Apple M3 Max, 16 threads: 10.26 seconds
|     GPU, NVIDIA RTX 4090, 32k threads: 1.88 seconds
|
| The bend single-threaded version has been running for 42 minutes
| on my laptop, is consuming 6GB of memory, and still hasn't
| finished (12th Gen Intel(R) Core(TM) i7-1270P, Ubuntu 24.04).
| That seems to be an incredibly slow interpreter. Has this been
| tested or developed on anything other than Macs / aarch64?
|
| I appreciate this is early days, but it's hard to get excited
| about what seems to be incredibly slow performance from a really
| simple example you give. If the simple stuff is slow, what does
| that mean for the complicated stuff?
|
| If I get a chance tonight, I'll re-run it with `-s` argument, see
| if I get anything helpful.
| LightMachine wrote:
| Running for 42 minutes is most likely a bug. Yes, we haven't
| done much testing outside of the M3 Max yet. I'm aware it is 2x
| slower on non-Apple CPUs. We'll work on that.
|
| For the `sum` example, Bend has a huge disadvantage, because it
| is allocating 2 IC nodes for each numeric operation, while
| Python is not. This is obviously terribly inefficient. We'll
| avoid that soon (just like HVM1 did). It just wasn't
| implemented in HVM2 yet.
|
| Note most of the work behind Bend went into making the parallel
| evaluator _correct_. Running closures and unrestricted
| recursion on GPUs is _extremely_ hard. We've just finished
| that part, so there was basically zero effort on micro-
| optimizations. HVM2's codegen is still abysmal. (And I was
| very clear about that in the docs!)
|
| That said, please try comparing the Bitonic Sort example, where
| both are doing the same amount of allocations. I think it will
| give a much fairer idea of how Bend will perform in practice.
| HVM1 used to be 3x slower than GHC in a single core, which
| isn't bad. HVM2 should get to that point not far in the future.
|
| Now, I totally acknowledge these "this is still bad but we
| promise it will get better!!" claims can be underwhelming, and
| I understand if you don't believe my words. But I actually
| believe that, with the foundation set, these micro
| optimizations will be the easiest part, and performance will
| skyrocket from here. In any case, we'll keep working on making
| it better, and reporting the progress as milestones are
| reached.
| Twirrim wrote:
| Bitonic sort runs in 0m2.035s. Transpiled to C and compiled,
| it takes 0m0.425s.
|
| That sum example, transpiled to C and compiled, takes
| 1m12.704s, so it looks like it's just the VM case that is
| having serious issues of some description!
| vrmiguel wrote:
| > it is allocating 2 IC nodes for each numeric operation,
| while Python is not
|
| While that's true, Python would be using big integers
| (PyLongObject) for most of the computations, meaning every
| number gets allocated on the heap.
|
| If we use a Python implementation that would avoid this, like
| PyPy or Cython, the results change significantly:
|     % cat sum.py
|     def sum(depth, x):
|         if depth == 0:
|             return x
|         else:
|             fst = sum(depth-1, x*2+0) # adds the fst half
|             snd = sum(depth-1, x*2+1) # adds the snd half
|             return fst + snd
|
|     if __name__ == '__main__':
|         print(sum(30, 0))
|
|     % time pypy sum.py
|     576460751766552576
|     pypy sum.py  4.26s user 0.06s system 96% cpu 4.464 total
|
| That's on an M2 Pro. I also imagine the result in Bend would
| not be correct since it only supports 24 bit integers,
| meaning it'd overflow quite quickly when summing up to 2^30,
| is that right?
|
| [Edit: just noticed the previous comment had already
| mentioned pypy]
|
| > I'm aware it is 2x slower on non-Apple CPUs.
|
| Do you know why? As far as I can tell, HVM has no
| aarch64/Apple-specific code. Could it be because Apple
| Silicon has wider decode blocks?
|
| > can be underwhelming, and I understand if you don't believe
| my words
|
| I don't think anyone wants to rain on your parade, but
| extraordinary claims require extraordinary evidence.
|
| The work you've done in Bend and HVM sounds impressive, but I
| feel the benchmarks need more evaluation/scrutiny. Since your
| main competitor would be Mojo and not Python, comparisons to
| Mojo would be nice as well.
| LightMachine wrote:
| The only claim I made is that it scales linearly with
| cores. Nothing else!
|
| I'm personally putting a LOT of effort into making our claims
| as accurate and truthful as possible, in every single
| place. Documentation, website, demos. I spent hours in
| meetings to make sure everything is correct. Yet, sometimes
| it feels that no matter how much effort I put, people will
| just find ways to misinterpret it.
|
| We published the real benchmarks, checked and double
| checked. And then you complained some benchmarks are not so
| good. Which we acknowledged, and provided causes, and how
| we plan to address them. And then you said the benchmarks
| need more evaluation? How does that make sense in the
| context of them being underwhelming?
|
| We're not going to compare to Mojo or other languages,
| specifically because it generates hate.
|
| Our only claim is:
|
| HVM2 is the first version of our Interaction Combinator
| evaluator that runs with linear speedup on GPUs. Running
| closures on GPUs required a _colossal_ amount of correctness
| work, and we're reporting this milestone. Moreover, we
| finally managed to compile a Python-like language to it.
| That is all that is being claimed, and nothing else. The
| codegen is still abysmal and single-core performance is bad
| - that's our next focus. If anything else was claimed, it
| wasn't us!
| CyberDildonics wrote:
| _The only claim I made is that it scales linearly with
| cores. Nothing else!_
|
| The other link on the front page says:
|
| "Welcome to the Parallel Future of Computation"
| LightMachine wrote:
| Scaling with cores is synonymous with parallel.
| Dylan16807 wrote:
| "Future" has some mild speed implications but it sounds
| like you're doing reasonably there, bug nonwithstanding.
| singhblom wrote:
| It also has "Not yet" implications ...
| LightMachine wrote:
| But it literally says we believe it is the future of
| parallel computing! If it was faster than GCC today, we
| would've written present :')
| IshKebab wrote:
| I think the issue is that there is the _implicit_ claim
| that this is faster than some alternative. Otherwise, what's
| the point?
|
| If you add some disclaimer like "Note: Bend is currently
| focused on correctness and scaling. On an absolute scale
| it may still be slower than single threaded Python. We
| plan to improve the absolute performance soon." then you
| won't see these comments.
|
| Also this defensive tone does not come off well:
|
| > We published the real benchmarks, checked and double
| checked. And then you complained some benchmarks are not
| so good. Which we acknowledged, and provided causes, and
| how we plan to address them. And then you said the
| benchmarks need more evaluation? How does that make sense
| in the context of them being underwhelming?
| LightMachine wrote:
| Right below install instructions, on Bend's README.md:
|
| > But keep in mind our code gen is still in its infancy,
| and is nowhere near as mature as SOTA compilers like GCC
| and GHC.
|
| Second paragraph of Bend's GUIDE.md:
|
| > While cool, Bend is far from perfect. In absolute terms
| it is still not so fast. Compared to SOTA compilers like
| GCC or GHC, our code gen is still embarrassingly bad, and
| there is a lot to improve. That said, it does what it
| promises: scaling horizontally with cores.
|
| Limitations section of HVM2's paper:
|
| > While HVM2 achieves near-linear speedup, its compiler
| is still extremely immature, and not nearly as fast as
| state-of-the-art alternatives like GCC or GHC. In single-
| thread CPU evaluation, HVM2 is still about 5x slower
| than GHC, and this number can grow to 100x on programs
| that involve loops and mutable arrays, since HVM2 doesn't
| feature these yet.
| IshKebab wrote:
| > Right below install instructions
|
| Yeah exactly. I read most of the readme and watched the
| demo, but I'm not interested in installing it so I missed
| this. I would recommend moving this to the first section
| in its own paragraph.
|
| I understand you might not want to focus on this but it's
| important information and not a bad thing at all.
| LightMachine wrote:
| That's great feedback actually, thank you.
|
| We'll add the disclaimer before the install instructions
| instead!
| vrmiguel wrote:
| That's true, you never mentioned Python or alternatives
| in your README; I guess I got Mandela'ed by the
| comments on Hacker News, so my bad on that.
|
| People are naturally going to compare the timings and
| function you cite to what's available to the community
| right now, though, that's the only way we can picture its
| performance in real-life tasks.
|
| > Mojo or other languages, specifically because it
| generates hate
|
| Mojo launched comparing itself to Python and didn't
| generate much hate, it seems, but I digress
|
| In any case, I hope Bend and HVM can continue to improve
| even further, it's always nice to see projects like
| those, specially from another Brazilian
| LightMachine wrote:
| Thanks, and I apologize if I got defensive; it is just
| that I put so much effort into being truthful, double-
| checking, putting disclaimers everywhere about every
| possible misinterpretation. Hell, this is right below the
| install instructions:
|
| > our code gen is still in its infancy, and is nowhere
| near as mature as SOTA compilers like GCC and GHC
|
| Yet people still misinterpret. It is frustrating because
| I don't know what I could've done better.
| alfalfasprout wrote:
| Don't worry about it. Keep at it, this is a very cool
| project.
|
| FWIW, on HN people are inherently going to try to actually
| use your project, so if it's meant to be (long term) a
| faster way to run X, people will evaluate it against that
| implicit benchmark.
| jonahx wrote:
| > I spent hours in meetings to make sure everything is
| correct. Yet, sometimes it feels that no matter how much
| effort I put, people will just find ways to misinterpret
| it.
|
| from reply below:
|
| > I apologize if I got defensive, it is just that I put
| so much effort on being truthful, double-checking,
| putting disclaimers everywhere about every possible
| misinterpretation.
|
| I just want to say: don't stop. There will always be some
| people who don't notice or acknowledge the effort to be
| precise and truthful. But others will. For me, this
| attitude elevates the project to something I will be
| watching.
| dheera wrote:
| > I'm personally putting a LOT of effort to make our
| claims as accurate and truthful as possible, in every
| single place
|
| Thank you. I understand that in such an early iteration of a
| language there are going to be lots of bugs.
|
| This seems like a very, very cool project and I really
| hope it or something like it is successful at making
| utilizing the GPU less cumbersome.
| mgaunard wrote:
| Identifying what's parallelizable is valuable in the
| world of language theory, but pure functional languages
| are as trivial as it gets, so that research isn't exactly
| ground-breaking.
|
| And you're just not fast enough for anyone doing HPC,
| where the problem is not identifying what can be
| parallelized, but figuring out how to make the most of the
| hardware, i.e. the codegen.
| glitchc wrote:
| I have no dog in this fight, but feel compelled to defend the
| authors here. Recursion does not test compute, rather it tests
| the compiler's/interpreter's efficiency at standing up and
| tearing down the call stack.
|
| Clearly this language is positioned at using the gpu for
| compute-heavy applications and it's still in its early stages.
| Recursion is not the target application and should not be a
| relevant benchmark.
| light_hue_1 wrote:
| It's the author's own benchmark, but it shows the system
| doesn't work. Recursion isn't hard to eliminate for pure
| functions; there's nothing inherently wrong with it.
|
| The authors made obviously false claims: that this is an
| extremely fast implementation, when it's not. That's really
| poor form.
| rowanG077 wrote:
| Where did he claim it is fast? As far as I can see the only
| claim is that it scales linearly with cores. Which it
| actually seems to do.
| light_hue_1 wrote:
| They show benchmarks with massive performance
| improvements. That clearly implies that the system is
| fast. Where's the disclaimer that this performance is
| worse than Python but eats up an entire massive GPU?
|
| The readme also blatantly lies: "It is not the kind of
| algorithm you'd expect to run fast on GPUs." Bitonic sort
| on GPUs is a thing, and it performs well.
|
| The gaslighting is just amazing here.
| rowanG077 wrote:
| Yes, and those benchmarks are real. Showing linear speedup
| in the number of cores when writing standard code is a
| real achievement. If you assumed that somehow means this
| is a state-of-the-art compiler with super blazing
| performance, that's on no one but you. The readme lays it
| out very clearly.
| klabb3 wrote:
| This is very exciting. I don't have any GPU background, but I
| have been worrying a lot about CUDA cementing itself in the
| ecosystem. Here devs don't need CUDA directly, which would help
| decouple the ecosystem from cynical mega corps, always good!
| Anyway, enough politics..
|
| Tried to see what the language is like beyond hello world and
| found the guide[1]. It looks like a Python and quacks like a
| Haskell? For instance, variables are immutable, and tree-like
| divide and conquer data structures/algorithms are promoted for
| getting good results. That makes sense I guess! I'm not surprised
| to see a functional core, but I'm surprised to see the pythonic
| frontend, not that it matters much. I must say I highly doubt
| that it will make it much easier for Python devs to learn Bend
| though, although I don't know if that's the goal.
|
| What are some challenges in programming with these kind of
| restrictions in practice? Also, is there good FFI options?
|
| [1]: https://github.com/HigherOrderCO/bend/blob/main/GUIDE.md
| mathiasgredal wrote:
| We have a replacement for CUDA: it is called C++17 parallel
| algorithms. It has vendor support for running on the GPU from
| Intel, AMD, and NVIDIA, and will also run on all your cores on
| the CPU. It uses the GPU vendor's compiler to convert your C++
| into something that can natively run on the GPU. With unified
| memory support, it becomes very fast to run computations on
| heap-allocated memory using the GPU, but implementations also
| support non-unified memory.
|
| Vendor support:
|
| -
| https://www.intel.com/content/www/us/en/developer/articles/g...
|
| - https://rocm.blogs.amd.com/software-tools-
| optimization/hipst...
|
| - https://docs.nvidia.com/hpc-
| sdk/archive/20.7/pdf/hpc207c++_p...
| britannio wrote:
| Incredible feat, congratulations!
| light_hue_1 wrote:
| Massive promises of amazing performance but they can't find one
| convincing example to showcase. It's hard to see what they're
| bringing to the table when even the simplest possible Haskell
| code runs just as fast on my 4-year-old laptop with an ancient
| version of GHC (8.8). No need for an RTX 4090.
|
|     module Main where
|
|     sum' :: Int -> Int -> Int
|     sum' 0 x = x
|     sum' depth x = sum' (depth - 1) ((x * 2) + 0) +
|                    sum' (depth - 1) ((x * 2) + 1)
|
|     main = print $ sum' 30 0
|
| Runs in 2.5s. Sure it's not on a GPU, but it's faster! And things
| don't get much more high level.
|
| If you're going to promise amazing performance from a high level
| language, I'd want to see a comparison against JAX.
|
| It's an improvement over traditional interaction nets, sure! But
| interaction nets have always been a failure performance-wise.
| Interaction nets are PL equivalent of genetic algorithms in ML,
| they sound like a cool idea and have a nice story, but then they
| always seem to be a dead end.
|
| Interaction nets optimize parallelism at the cost of everything
| else. Including single-threaded performance. You're just warming
| up the planet by wasting massive amounts of parallel GPU cores to
| do what a single CPU core could do more easily. They're just the
| wrong answer to this problem.
| LightMachine wrote:
| You're wrong. The Haskell code is compiled to a loop, which we
| didn't optimize for yet. I've edited the README to use the
| Bitonic Sort instead, on which allocations are unavoidable.
| Past N=20, HVM2 performs 4x faster than GHC -O2.
| light_hue_1 wrote:
| What? I ran your example, from your readme, where you promise
| a massive performance improvement, and you're accusing me of
| doing something wrong?
|
| This is exactly what a scammer would say.
|
| I guess that's the point here. Scam people who don't know
| anything about parallel computing by never comparing against
| any other method?
| LightMachine wrote:
| Thanks for the feedback! Some clarifications:
|
| 1. I didn't accuse you of doing something wrong, just that
| your claim was wrong! It has been proven that Interaction
| Combinators are an optimal model of concurrent computation.
| I also pointed to cases where it achieves practical
| efficiency, outperforming GHC's highest optimization
| level.
|
| 2. The claimed performance scaling has indeed been
| achieved, and the code is open for anyone to replicate our
| results. The machines used are listed in the repository and
| paper. If you have any trouble replicating, please let me
| know!
|
| 3. We're not selling any product. Bend is Apache-licensed.
| vegadw wrote:
| A lot of negativity in these threads. I say ~cudas~ kudos to the
| author for getting this far! The only similar project I'm aware
| of is Futhark, and that's Haskell-y syntax - great for some
| people, but pretty arcane and hard to work with for the
| general class of C/C++/Python/JS/Java/etc. devs. My biggest
| complaint with this is that, unlike Futhark, it only targets
| CUDA or multi-core. Futhark can target OpenCL, CUDA, ISPC,
| HIP, single-core CPU, or multi-core CPU. The performance
| problems others are pointing out I'm certain can be tackled.
| pjmlp wrote:
| Chapel has a decent use in HPC.
|
| Also, NVidia has sponsored variants of Haskell, .NET, Java,
| and Julia on CUDA, has a Python JIT, and is collaborating
| with the Mojo folks.
| neonsunset wrote:
| Take a look at ILGPU. It's very nice and has been around for a
| long time! (just no one knows about it, sadly)
|
| Short example: https://github.com/m4rs-
| mt/ILGPU/blob/master/Samples/SimpleM...
|
| Supports even advanced bits like inline PTX assembly:
| https://github.com/m4rs-mt/ILGPU/blob/master/Samples/InlineP...
| Archit3ch wrote:
| Pure functions only? This is disappointing. Furthermore, it
| invites a comparison with JAX.
| chc4 wrote:
| 24-bit integers and floats, no array datatype, and a maximum 4GB
| heap of nodes are very harsh restrictions, especially for any
| workloads that would actually want to be running on a GPU. The
| limitations in the HVM2 whitepaper about unsound evaluation
| around closures and infinite loops, because it evaluates both
| sides of a conditional without any short-circuiting, are also
| extremely concerning.
|
| Before you reply "these are things we can address in the future":
| that doesn't matter. Everyone can address everything in the
| future. They are currently hard technical barriers to its use,
| with no way of knowing the level of effort that will require or
| the knock-on effects, especially since some of these issues have
| been "we can fix that later" for ten years.
|
| I also _highly_ recommend changing your benchmark numbers from
| "interactions per second" to a standard measurement like FLOPS.
| No one else on earth knows how many of those interactions are
| pure overhead from your evaluation semantics rather than useful
| work. They come across as an attempt to wow an audience with big
| numbers rather than an apples-to-apples comparison with other
| languages.
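A rough sketch of the kind of conversion being suggested, using figures quoted elsewhere in this thread (the depth-30 tree sum running in about 10.5 s on 16 threads; the numbers are illustrative assumptions, not official benchmarks):

```python
def useful_adds(depth):
    # A full binary tree of the given depth has 2**depth leaves
    # and performs 2**depth - 1 additions of "useful" work.
    return 2 ** depth - 1

ops = useful_adds(30)       # ~1.07e9 useful additions
wall_time_s = 10.5          # 16-thread time quoted in this thread
print(ops / wall_time_s)    # ~1e8 useful ops/s, hardware-comparable
```

Dividing useful operations by wall time gives a figure that can be set directly against other languages, and against reported interactions/sec to estimate the interpreter's overhead.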
| LightMachine wrote:
| So use a metric that makes absolutely no sense in the given
| domain, instead of one that is completely correct, sensible,
| accurate, established in the literature, and vastly superior in
| context? What even is a FLOP in the context of Interaction Net
| evaluation? These things aren't even interchangeable.
| croemer wrote:
| Dupe of https://news.ycombinator.com/item?id=40387394 and
| https://news.ycombinator.com/item?id=40383196
| CorrectingYou wrote:
| OP comes around with some of the coolest things posted in HN
| recently, and all he gets is extensive criticism, when it is
| clear that this is an early version :/
| swayvil wrote:
| The coolest things are often the most difficult to understand.
|
| Difficult to understand is often threatening.
|
| Criticism is a popular response to threat and is the form of
| reply that requires the least understanding.
| andrewp123 wrote:
| I just wanted to comment on how good the homepage is - it's
| immediately clear what you do. Most people working with
| "combinators" would feel a need to use lots of scary lingo, but
| OP actually shows the simple idea behind the tool (this is the
| opposite take of most academics, who instead show every last
| detail and never tell you what's going on). I really appreciate
| it - we need more of this.
| markush_ wrote:
| Exciting project, congrats on the release!
| funny_name wrote:
| What kind of software would this language be good for? I assume
| it's not the kind of language you'd use for web servers exactly.
| trenchgun wrote:
| Erlang-like actor models would be well suited, so yeah, you
| could use it for web servers (assuming they are able to finish
| the language). It's a general purpose high level programming
| language.
| wolfspaw wrote:
| Nice!
|
| Python-like + High-performance.
|
| And, different from Mojo, it's fully open-source.
| MrLeap wrote:
| This is incredible. This is the kind of work we need to crack
| open the under utilized GPUs out there. I know LLMs are all the
| rage, but there's more gold in them hills.
| gsuuon wrote:
| Congrats on the HVM2 launch! Been following for a while, excited
| to see where this project goes. For others who are lost on the
| interaction net stuff, there was a neat show hn that gave a more
| hands-on interactive intro:
| https://news.ycombinator.com/item?id=37406742 (the 'Get Started'
| writeup was really helpful)
| kerkeslager wrote:
| This looks like the language I've wanted for a long time. I'm
| excited to see how this plays out.
| robust-cactus wrote:
| This is awesome and much needed. Keep going, forget the overly
| pedantic folks, the vision is great and early results are
| exciting.
| Arch485 wrote:
| I want to congratulate the author on this, it's super cool.
| Making correct automatic parallelization is nothing to sneeze at,
| and something you should absolutely be proud of.
|
| I'm excited to see how this project progresses.
| mbforbes wrote:
| Congratulations on the launch and hard work so far! We need
| projects like this. Great readme and demo as well.
|
| Every time I try to write shaders, or even peek through my
| fingers at CUDA C(++) code, I recoil in disbelief that we don't
| have high level programming yet on the GPU. I can't wait until we
| do. The more great projects attacking it the better in my book.
| npalli wrote:
| Is the recursive sum the best function to show multi-threading or
| GPU speedups? Seems unlikely. FWIW, I ported the Python example
| to Julia and it ran in about 2.5 seconds, the same as the C++
| version. Pure Python 3.12 took 183 seconds.
|         function sum(depth, x)
|             if depth == 0
|                 return x
|             else
|                 fst = sum(depth-1, x*2+0)
|                 snd = sum(depth-1, x*2+1)
|             end
|             return fst + snd
|         end
|
|         println(sum(30,0))
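For reference, the pure-Python version of the same benchmark presumably looks something like this (a sketch; the parent's ~183 s figure is for depth 30, reduced here so it finishes quickly):

```python
def tree_sum(depth, x):
    # Same recursion as the Julia port above: sums the leaf labels
    # 0 .. 2**depth - 1 of a full binary tree.
    if depth == 0:
        return x
    fst = tree_sum(depth - 1, x * 2 + 0)
    snd = tree_sum(depth - 1, x * 2 + 1)
    return fst + snd

print(tree_sum(20, 0))  # sum of 0..2**20-1; depth 30 reproduces the timing
```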
| temp123789246 wrote:
| Congrats!
|
| I've been watching HVM for a while and think it's extremely cool.
|
| My intuition is that this will eventually be a really big deal.
| praetor22 wrote:
| Look, I understand the value proposition and how cool it is from
| a theoretical standpoint, but I honestly don't think this will
| ever become relevant.
|
| Here are some notes from my first impressions and after skimming
| through the paper. And yes, I am aware that this is very very
| early software.
|
| 1. Bend looks like an extremely limited DSL. No FFI. No way of
| interacting with raw buffers. Weird 24-bit floating-point
| format.
|
| 2. There's a reason why ICs are not relevant: performance is and
| will always be terrible. There is no other way to put it; graph
| traversal simply doesn't map well onto hardware.
|
| 3. The premise of optimal reduction is valid. However, you still
| need to write the kernels in a way that can be parallelized
| (i.e., no data dependencies, use of recursion).
|
| 4. There are no serious examples that directly compare Bend/HVM
| code with its equivalent OMP/CUDA program. How am I supposed to
| evaluate the reduction in implementation complexity and what to
| expect in performance? So many claims, so few actual
| comparisons.
|
| 5. In the real world of high-performance parallel computing,
| tree-like structures are non-existent. Arrays are king, and
| that's because of the physical nature of how memory works at the
| hardware level. And do you know what works best on mutable,
| contiguous memory buffers? Loops. We'll see when HVM implements
| those.
|
| In the end, what we currently have is a half-baked language that
| is (almost) fully isolated from external data, extremely slow,
| and a massive abstraction over the underlying hardware
| (unutilized features: multilevel caches, tensor cores, SIMD,
| atomics).
|
| I apologize if this comes off as harsh; I still find the
| technical implementation and the theoretical background very
| interesting. I'm simply not (yet) convinced of its usefulness in
| the real world.
| netbioserror wrote:
| So HVM finally yields fruit. I've been eagerly awaiting this day!
| Bend seems like a very suitable candidate for a Lispy
| S-expression makeover.
| smusamashah wrote:
| I have no interest in this tech as it's apparently for backend
| stuff and not actually rendering things by itself.
|
| But the demo gif is probably the best I have seen in a Github
| readme. I watched it till the end. It was instantly engaging. I
| wanted to see the whole story unfold.
___________________________________________________________________
(page generated 2024-05-17 23:00 UTC)