[HN Gopher] Run CUDA, unmodified, on AMD GPUs
       ___________________________________________________________________
        
       Run CUDA, unmodified, on AMD GPUs
        
       Author : Straw
       Score  : 435 points
       Date   : 2024-07-15 19:05 UTC (3 hours ago)
        
 (HTM) web link (docs.scale-lang.com)
 (TXT) w3m dump (docs.scale-lang.com)
        
       | dagmx wrote:
       | Has anyone tried this and knows how well it works? It definitely
       | sounds very compelling
        
       | arjvik wrote:
       | Who is this Spectral Compute, and where can we see more about
       | them?
        
         | msond wrote:
         | You can learn more about us on https://spectralcompute.co.uk
        
           | JonChesterfield wrote:
           | The branch free regex engine is an interesting idea. I would
           | have said that can't be implemented in finite code.
           | 
           | Compile to DFA by repeatedly differentiating then unroll the
           | machine? You'd still have back edges for the repeating
           | sections.
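            | 
            | A compact sketch of the derivative idea (purely
            | illustrative, nothing to do with their actual engine):
            | d(r, c) is the regex matching whatever r matches after
            | consuming c, matching is iterated differentiation, and
            | the distinct derivatives become the DFA states.
            | 
            |     #include <memory>
            |     #include <string>
            |     struct Re; using P = std::shared_ptr<Re>;
            |     enum K { Nul, Eps, Chr, Alt, Seq, Star };
            |     struct Re { K k; char c; P a, b; };
            |     P mk(K k, char c = 0, P a = {}, P b = {}) {
            |       return std::make_shared<Re>(Re{k, c, a, b});
            |     }
            |     bool nul(const P& r) {  // does r match ""?
            |       switch (r->k) {
            |         case Eps: case Star: return true;
            |         case Alt: return nul(r->a) || nul(r->b);
            |         case Seq: return nul(r->a) && nul(r->b);
            |         default:  return false;
            |       }
            |     }
            |     P d(const P& r, char c) {  // Brzozowski derivative
            |       switch (r->k) {
            |         case Chr: return mk(r->c == c ? Eps : Nul);
            |         case Alt: return mk(Alt, 0, d(r->a, c), d(r->b, c));
            |         case Seq: {
            |           P s = mk(Seq, 0, d(r->a, c), r->b);
            |           return nul(r->a) ? mk(Alt, 0, s, d(r->b, c)) : s;
            |         }
            |         case Star: return mk(Seq, 0, d(r->a, c), r);
            |         default:  return mk(Nul);
            |       }
            |     }
            |     bool match(P r, const std::string& s) {
            |       for (char c : s) r = d(r, c);
            |       return nul(r);
            |     }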
        
       | pixelpoet wrote:
       | Isn't this a bit legally dubious, like zluda?
        
         | janice1999 wrote:
         | It's advertised as a "clean room" re-implementation. What part
         | would be illegal?
        
           | ekelsen wrote:
           | If they had to reverse engineer any compiled code to do this,
           | I think that would be against licenses they had to agree to?
           | 
           | At least grounds for suing and starting an extensive
           | discovery process and possibly a costly injunction...
        
             | msond wrote:
             | We have not reverse engineered any compiled code in the
             | process of developing SCALE.
             | 
             | It was clean-room implemented purely from the API surface
             | and by trial-and-error with open CUDA code.
        
             | RockRobotRock wrote:
             | Isn't that exactly what a "clean room" approach avoids?
        
               | ekelsen wrote:
                | Oh, definitely. But if I were NVIDIA I'd want to verify
                | that in court after discovery rather than relying on
                | their claim on a website.
        
           | Keyframe wrote:
           | Can't run useful shit on it: https://docs.nvidia.com/deeplear
           | ning/cudnn/latest/reference/...
           | 
           | Namely:
           | 
           | "4.1 License Scope. The SDK is licensed for you to develop
           | applications only for use in systems with NVIDIA GPUs."
        
             | mkl wrote:
             | So add a cheap NVidia card alongside grunty AMD ones, and
             | check for its existence. It doesn't seem to say it needs to
             | run on NVidia GPUs.
        
               | Keyframe wrote:
               | Heh, true. On the other hand, I bet companies are eager
               | to challenge the wrath of a $3T company for a promise of
               | "maybe it'll work, not all of it but at least it'll run
               | worse, at least for now".
        
       | adzm wrote:
       | I'd love to see some benchmarks but this is something the market
       | has been yearning for.
        
         | msond wrote:
         | We're putting together benchmarks to publish at a later time,
         | and we've asked some independent third parties to work on their
         | own additionally.
        
       | acheong08 wrote:
       | Impressive if true. Unfortunately not open source and scarce on
       | exact details on how it works
       | 
       | Edit: not sure why I just sort of expect projects to be open
       | source or at least source available these days.
        
         | tempaccount420 wrote:
         | They might be hoping to be acquired by AMD
        
         | ipsum2 wrote:
         | They're using Docusaurus[1] for their website, which is most
         | commonly used with open source projects.
         | 
         | https://docusaurus.io/docs
        
           | msond wrote:
            | Actually, we use MkDocs and the excellent Material for
            | MkDocs theme: https://squidfunk.github.io/mkdocs-material/
        
         | msond wrote:
         | We're going to be publishing more details on later blog posts
         | and documentation about how this works and how we've built it.
         | 
         | Yes, we're not open source, however our license is very
         | permissive. It's both in the software distribution and viewable
         | online at https://docs.scale-lang.com/licensing/
        
           | breck wrote:
           | How about trying _Early_ Source?
           | 
           | It's open source with a long delay, but paying users get the
           | latest updates.
           | 
           | Make the git repo from "today - N years" open source, where N
           | is something like 1 or 2.
           | 
           | That way, students can learn on old versions, and when they
           | grow into professionals they can pay for access to the
            | cutting-edge builds.
           | 
           | Win win win win
           | 
           | ( https://breckyunits.com/earlySource.html)
        
             | msond wrote:
             | We're still thinking about our approach but this is a nice
             | suggestion, thank you.
             | 
             | I'm curious, for what reasons are you interested in the
             | source code yourself?
        
               | mindcrime wrote:
               | I'm not the person you replied to, and I can't speak for
               | them. But I can say that for myself, and a not small
               | number of other people, it's an ideological issue. I
               | simply do not use software that isn't F/OSS - to the
               | greatest extent that that is possible. For me, I might
               | use a VERY small amount of non F/OSS stuff, but it's very
               | hard to get me to adopt something new if it isn't.
               | 
               | Now should you make business decisions based on that?
               | Probably not. But while I don't claim to be a
               | representative sample, I am pretty sure the number of
               | people who share my beliefs in this regard is
               | substantially "non zero". _shrug_
        
               | atq2119 wrote:
               | Not GP, but a guaranteed source availability means users
               | can fix issues themselves in the future if the original
               | provider goes belly-up.
        
               | breck wrote:
               | > I'm curious, for what reasons are you interested in the
               | source code yourself?
               | 
               | I am the founder/editor of PLDB. So I try to do my best
               | to help people "build the next great programming
               | language".
               | 
                | We clone the git repos of over 1,000 compilers and
                | interpreters and use cloc to determine which languages
                | the people who are building languages are using. The
                | people who build languages are obviously the experts,
                | so as they go, so goes the world.
                | 
                | We call this measurement "Foundation Score". A Foundation
                | Score of 100 means 100 other languages use this language
                | somehow in their primary implementation.
               | 
               | It is utterly dominated by open source languages, and the
               | disparity is only getting more extreme.
               | 
               | You can see for yourself here:
               | 
               | https://pldb.io/lists/explorer.html#columns=rank~name~id~
               | app...
               | 
               | Some that might have become irrelevant have gained a
               | second wind after going open source.
               | 
               | But some keep falling further behind.
               | 
                | I look at Mathematica, a very powerful and amazing
                | language, and it makes me sad to see so few other
                | language designers using it, and the reason is that
                | it's closed source. So it's not doing so hot, and
                | that's a language from one of the world's smartest and
                | most prolific thinkers, one that's been around for
                | decades.
               | 
               | I don't see a way for a new language to catch on nowadays
               | that is not open source.
        
         | TaylorAlexander wrote:
         | Makes sense to expect this kind of thing to be open source. The
         | whole point of providing improved compatibility is to make
         | people's lives easier, and open source is usually an important
         | feature to ensure wide compatibility. It also means projects
         | can live on after the creators move to other things, people can
         | submit patches for important features or bug fixes, and
         | generally makes the system much more useful.
        
           | dylan604 wrote:
           | I don't find it wrong for someone to attempt to make money
           | back on their time and experience of doing the work. I don't
           | mind people that offer that back as open source either.
           | However, I do have a problem of people expecting everything
           | to be open/free, especially those that then go on a crusade
           | chastising those that do try to make money.
        
             | TaylorAlexander wrote:
             | I'm really trying to keep this about the engineering
             | features of a system rather than moral judgments. Open
             | source systems are simply more flexible and adaptable than
             | proprietary systems, which have their own benefits. In
             | today's world, the engineering value of open source systems
             | is becoming so important that people are looking for other
             | ways to provide for the developers creating these systems.
             | It can be surprising when a project creator builds
             | something in an area that is usually all open source, but
             | they choose a proprietary path. Just look at the problems
             | created by NVIDIA for their use of proprietary software in
             | CUDA and their GPUs. This software is an attempt to fix
             | issues created by proprietary software with another piece
             | of proprietary software, which is if nothing else an
             | interesting decision.
        
               | dylan604 wrote:
               | UNIX wasn't free. Windows wasn't free. It wasn't until
               | some knucklehead came along and did something abnormal
               | and gave away their thing. Bakers don't give away their
               | goods. Mechanics don't typically repair things for free.
               | Builders don't build things for free. Gas stations don't
               | give away gas.
               | 
               | Why do we think all software should be free, and then
               | think that those that don't give it away are the abnormal
               | ones?
        
         | dheera wrote:
         | Also, can I even buy an AMD GPU? I don't see a "buy now" button
         | or a PCIe version anywhere here
         | 
         | https://www.amd.com/en/products/accelerators/instinct/mi300/...
         | 
         | Another big AMD fuckup in my opinion. Nobody is going to drop
         | millions on these things without being able to test them out
         | first.
         | 
         | First rule of sales: If you have something for sale, take my
         | money.
        
           | nwiswell wrote:
           | > I don't see a "buy now" button or a PCIe version anywhere
           | here
           | 
           | "Buy now" buttons and online shopping carts are not generally
           | how organizations looking to spend serious money on AI buy
           | their hardware.
           | 
           | They have a long list of server hardware partners, and odds
           | are you'd already have an existing relationship with one or
           | more of them, and they'd provide a quote.
           | 
           | They even go one step further and show off some of their
           | partners' solutions:
           | 
           | https://www.amd.com/en/graphics/servers-instinct-deep-
           | learni...
           | 
           | FWIW I believe Supermicro and Exxact actually do have web-
           | based shopping carts these days, so maybe you could skip the
           | quotation and buy directly if you were so motivated? Seems
           | kind of weird at this price point.
           | 
           | https://www.exxactcorp.com/Exxact-TS4-185328443-E185328443
        
             | dheera wrote:
             | ... and that's why AMD is losing.
             | 
             | They could break the trend and offer a "buy now" button
             | instead of offering quotes and coffee chats. It's very
             | likely that will kickstart the software snowball with early
             | adopters.
             | 
             | Nobody is going to drop millions on an unproven platform.
             | 
             | > Seems kind of weird at this price point.
             | 
              | Yeah, $234K is too much for people to do a trial. It has
              | 8x MI300X GPUs.
             | 
              | Give me a single MI300X GPU in PCIe form factor for $20K
              | and I'd very seriously consider it.
        
       | juujian wrote:
       | I don't understand how AMD has messed up so badly that I feel
       | like celebrating a project like this. Features of my laptop are
       | just physically there but not usable, particularly in Linux. So
       | frustrating.
        
         | djbusby wrote:
          | Same boat: AMD CPU but nothing else. I feel like a moderate
          | improvement in their FOSS support and drivers would open new
          | hardware revenue - to say nothing of the AI channel.
        
         | ActorNightly wrote:
          | I don't know if I would call it a mess-up. AMD still has a
          | massive market in server chips, and their ARM stuff is on the
         | horizon. We all assume that graphics cards are the way forward
         | for ML, which may not be the case in the future.
         | 
         | Nvidia were just ahead in this particular category due to CUDA,
         | so AMD may have just let them run with it for now.
        
         | jeroenhd wrote:
         | AMD hardware works fine, the problem is that the major research
         | projects everyone copies are all developed specifically for
         | Nvidia.
         | 
         | Now AMD is spinning up CUDA compatibility layer after CUDA
         | compatibility layer. It's like trying to beat Windows by
         | building another ReactOS/Wine. It's an approach doomed to fail
         | unless AMD somehow manages to gain vastly more resources than
         | the competition.
         | 
         | Apple's NPU may not be very powerful, but many models have been
         | altered specifically to run on them, making their NPUs vastly
         | more useful than most equivalently powerful iGPUs. AMD doesn't
         | have that just yet, they're always catching up.
         | 
         | It'll be interesting to see what Qualcomm will do to get
         | developers to make use of their NPUs on the new laptop chips.
        
           | JonChesterfield wrote:
           | Interesting analogy. The last few programs from the windows
           | world I tried to run were flawless under wine and abjectly
           | failed under windows 11.
        
       | deliveryboyman wrote:
       | Would like to see benchmarks for the applications in the test
       | suite.
       | 
       | E.g., how does Cycles compare on AMD vs Nvidia?
        
       | Straw wrote:
        | I worked for Spectral Compute a few years ago. Very smart and
       | capable technical team.
       | 
       | At the time, not only did they target AMD (with less
       | compatibility than they have now), but also outperformed the
       | default LLVM ptx backend, and even NVCC, when compiling for
       | Nvidia GPUs!
        
       | modeless wrote:
       | A lot of people think AMD should support these translation layers
       | but I think it's a bad idea. CUDA is not designed to be vendor
       | agnostic and Nvidia can make things arbitrarily difficult both
       | technically and legally. For example I think it would be against
       | the license agreement of cuDNN or cuBLAS to run them on this. So
       | those and other Nvidia libraries would become part of the API
       | boundary that AMD would need to reimplement and support.
       | 
       | Chasing bug-for-bug compatibility is a fool's errand. The
       | important users of CUDA are open source. AMD can implement
       | support directly in the upstream projects like pytorch or
       | llama.cpp. And once support is there it can be maintained by the
       | community.
        
         | DeepYogurt wrote:
         | Ya, honestly better to leave that to third parties who can
         | dedicate themselves to it and maybe offer support or whatever.
         | Let AMD work on good first party support first.
        
         | fngjdflmdflg wrote:
         | >Nvidia can make things arbitrarily difficult both technically
         | and legally.
         | 
         | I disagree. AMD can simply not implement those APIs, similar to
         | how game emulators implement the most used APIs first and
         | sometimes never bother implementing obscure ones. It would only
         | matter that NVIDIA added eg. patented APIs to CUDA if those
         | APIs were useful. In which case AMD should have a way to do
         | them anyway. Unless NVIDIA comes up with a new patented API
         | which is both useful and impossible to implement in any other
         | way, which would be bad for AMD in any event. On the other
         | hand, if AMD start supporting CUDA and people start using AMD
         | cards, then developers will be hesitant to use APIs that only
         | work on NVIDIA cards. Right now they are losing billions of
          | dollars on this. Then again, they barely seem capable of
          | supporting ROCm on their cards, much less CUDA.
         | 
         | You have a fair point in terms of cuDNN and cuBLAS but I don't
         | know that that kind of ToS is actually binding.
        
         | dietr1ch wrote:
         | How's this situation different than the one around Java,
         | Sun/Oracle and Google?
        
           | dboreham wrote:
           | The judge might not be a coder next time.
        
             | viraptor wrote:
              | US law is highly dependent on precedent. The Google-
              | Oracle case has fortunately set one, so anything following
              | it won't start from scratch, and we may not need a coder
              | judge next time.
        
               | jjk166 wrote:
               | Google-Oracle side stepped the issue of API
               | copyrightability by saying Google's particular
               | implementation would fall under fair use. Whether APIs
               | are copyrightable remains an open question.
        
               | dylan604 wrote:
               | Until you get an activist court
        
         | blitzar wrote:
         | It would be good if AMD did something, anything.
         | 
          | Support this, reimplement that, support upstream efforts - I
          | don't really care. Any of those would cost a couple of million
          | and be worth a trillion dollars to AMD shareholders.
        
           | oezi wrote:
           | A couple of million doesn't get you anything in corporate
           | land
        
             | spacebanana7 wrote:
              | A couple dozen billion for a 10% chance of becoming NVIDIA-
              | competitive is worth it, looking at the stock prices.
        
           | slashdave wrote:
           | ROCm counts as "something"
        
             | curt15 wrote:
             | Pretty much any modern NVIDIA GPU supports CUDA. You don't
             | have to buy a datacenter-class unit to get your feet wet
             | with CUDA programming. ROCm will count as "something" when
             | the same is true for AMD GPUs.
        
               | muxr wrote:
                | Personally, I don't think AMD needs to support 5+ year
                | old GPUs. And all the recent generations are already
                | practically supported.
               | 
               | AMD only claims support for a select few GPUs, but in my
               | testing I find all the GPUs work fine if the architecture
               | is supported. I've tested rx6600, rx6700xt for example
               | and even though they aren't officially supported, they
               | work fine on ROCm.
        
         | Const-me wrote:
         | > Nvidia can make things arbitrarily difficult both technically
         | and legally
         | 
         | Pretty sure APIs are not copyrightable, e.g.
         | https://www.law.cornell.edu/supremecourt/text/18-956
         | 
         | > against the license agreement of cuDNN or cuBLAS to run them
         | on this
         | 
         | They don't run either of them, they instead implement an
         | equivalent API on top of something else. Here's a quote: "Open-
         | source wrapper libraries providing the "CUDA-X" APIs by
         | delegating to the corresponding ROCm libraries. This is how
         | libraries such as cuBLAS and cuSOLVER are handled."
        
           | dralley wrote:
           | I believe it was decided that they are copyrightable but that
           | using them for compatibility purposes is fair use.
        
             | kbolino wrote:
             | No, it's stranger than that: SCOTUS did not rule on
             | copyrightability of APIs at all, but simply ruled that even
                | _if_ they are copyrightable, what Google did (completely
                | reimplementing Sun/Oracle's public API) was still fair
                | use.
        
               | mrandish wrote:
               | It would have been nice to get a clear SCOTUS precedent
               | on this. On the other hand, I also value a SCOTUS which
               | rules minimally and narrowly by default (I also
               | appreciate SCOTUS' return to stricter constitutional
               | grounding in the past decade).
        
               | hobs wrote:
                | Incredibly loud laughing from the lawyers, whose study
                | of law is being thrown around willy-nilly because of all
                | the unprecedented joke decisions being made right now.
        
               | kbolino wrote:
               | We are stuck between a rock and a hard place politically.
               | The real decisions should be coming from Congress not the
               | courts. However, Congress is too disorganized and
               | disconnected to answer the important questions, leaving
               | the courts to either muddle along or else become semi-
               | dictatorial. In most countries, this would cause a
               | constitutional crisis, but the modern U.S. system seems
               | to be a little too resilient to such otherwise concerning
               | signals.
        
               | hobs wrote:
               | We're far past a constitutional crisis, and the courts
               | taking power nobody wanted to give to them (who wasn't
               | interested in a unitary executive at least) isn't a good
               | solution.
        
         | amelius wrote:
         | Like supporting x86 was a bad idea as well?
        
           | karolist wrote:
            | Was there a large entity steering the x86 spec alone, with a
            | huge feature lead over its competition and free to steer the
            | spec any way it chose? Also, hardware is not open-source
            | software: get big players on board and they will be able to
            | implement the spec they want every gen, whereas software has
            | more moving parts and unaligned parties involved.
        
             | cherryteastain wrote:
             | > Was there a large entity steering x86 spec alone with a
             | huge feature lead against their competition, free to steer
             | the spec in any ways they choose?
             | 
             | Ever heard of Intel?
        
               | karolist wrote:
                | I hadn't considered that angle. Is your point that Intel
                | was the creator of x86, but software chose to support it,
                | so AMD had nothing to do but play catch-up in x86
                | support to be part of the software target market? If so,
                | and if factual (I've no idea), fair point; I didn't know.
        
               | marshray wrote:
               | It was exactly the same instruction set.
               | 
               | C compilers didn't offer an "AMD" CPU target* until AMD
               | came out with the "AMD64" instruction set. Today we call
               | this "x86_64" or "x64".
               | 
               | * Feel free to point out some custom multimedia vector
               | extensions for Athlons or something, but the point
               | remains.
        
           | modeless wrote:
           | Before starting, AMD signed an agreement with Intel that gave
           | them an explicit license to x86. And x86 was a whole lot
           | smaller and simpler back then in _1982_. A completely
           | different and incomparable situation.
        
             | nostrademons wrote:
             | Technically it was after starting - AMD was founded in 1969
             | as a second-sourcer for Fairchild and National
             | Semiconductor, and had reverse-engineered the 8080 by 1975
             | and acquired a formal license to it by 1976.
             | 
             | The 1982 deal you speak of was actually pretty interesting:
             | as a condition of the x86's use in the IBM PC, IBM
             | requested a second source for x86 chips. AMD was that
             | source, and so they cross-licensed the x86 in 1982 to allow
             | the IBM PC project to proceed forward. This makes the
             | Intel/AMD deal even more important for both companies: _the
             | PC market would never have developed_ without the cross-
              | licensing, which would've been bad for all companies
             | involved. This gave Intel an ongoing stake in AMD's success
             | at least until the PC market consolidated on the x86
             | standard.
        
         | eslaught wrote:
         | Are you aware of HIP? It's officially supported and, for code
         | that avoids obscure features of CUDA like inline PTX, it's
         | pretty much a find-and-replace to get a working build:
         | 
         | https://github.com/ROCm/HIP
         | 
         | Don't believe me? Include this at the top of your CUDA code,
         | build with hipcc, and see what happens:
         | 
         | https://gitlab.com/StanfordLegion/legion/-/blob/master/runti...
         | 
         | It's incomplete because I'm lazy but you can see most things
         | are just a single #ifdef away in the implementation.
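          | 
          | The flavour of that header, in miniature (a hypothetical
          | excerpt; the real file covers far more of the API):
          | 
          |     #ifdef USE_HIP  // hypothetical build flag
          |     #include <hip/hip_runtime.h>
          |     #define cudaError_t   hipError_t
          |     #define cudaSuccess   hipSuccess
          |     #define cudaMalloc    hipMalloc
          |     #define cudaMemcpy    hipMemcpy
          |     #define cudaFree      hipFree
          |     #define cudaStream_t  hipStream_t
          |     #else
          |     #include <cuda_runtime.h>
          |     #endif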
        
           | currymj wrote:
           | if you're talking about building anything, that is already
           | too hard for ML researchers.
           | 
           | you have to be able to pip install something and just have it
           | work, reasonably fast, without crashing, and also it has to
           | not interfere with 100 other weird poorly maintained ML
           | library dependencies.
        
             | bootsmann wrote:
              | Don't most orgs that are deep enough to run custom CUDA
              | kernels have dedicated engineers for this stuff? I can't
              | imagine a person who can write raw CUDA not being able to
              | handle things more difficult than pip install.
        
               | gaogao wrote:
               | Engineers who are really, really good at CUDA are worth
               | their weight in gold, so there's more projects for them
               | than they have time. Worth their weight in gold isn't
               | figurative here - the one I know has a ski house more
               | expensive than 180 lbs of gold (~$5,320,814).
        
               | bbkane wrote:
               | Would you (or your friend) be able to drop any good CUDA
               | learning resources? I'd like to be worth my weight in
               | gold...
        
               | eigenvalue wrote:
               | That's pretty funny. Good test of value across the
               | millennia. I wonder if the best aqueduct engineers during
               | the peak of Ancient Rome's power had villas worth their
               | body weight in gold.
        
             | jchw wrote:
             | The target audience of interoperability technology is
             | whoever is building, though. Ideally, interoperability
             | technology can help software that supports only NVIDIA GPUs
             | today go on to quickly add baseline support for Intel and
             | AMD GPUs tomorrow.
             | 
             | (and for one data point, I believe Blender is actively
             | using HIP for AMD GPU support in Cycles.)
        
             | Agingcoder wrote:
              | Their target is HPC users, not ML researchers. I can
              | understand why this would be valuable to that particular
              | crowd.
        
             | eslaught wrote:
             | If your point is that HIP is not a zero-effort porting
              | solution, that is correct. HIP is a _low_-effort solution,
              | not a zero-effort solution. It targets users who already
             | use and know CUDA, and minimizes the changes that are
             | required from pre-existing CUDA code.
             | 
             | In the case of these abstraction layers, then it would be
             | the responsibility of the abstraction maintainers (or AMD)
             | to port them. Obviously, someone who does not even use CUDA
             | would not use HIP either.
             | 
             | To be honest, I have a hard time believing that a truly
             | zero-effort solution exists. Especially one that gets high
             | performance. Once you start talking about the full stack,
             | there are too many potholes and sharp edges to believe that
              | it will really work. So I am highly skeptical of the
              | original article. Not that I wouldn't want to be proved
              | wrong. But
             | what they're claiming to do is a big lift, even taking HIP
             | as a starting point.
             | 
             | The easiest, fastest (for end users), highest-performance
             | solution for ML will come when the ecosystem integrates it
             | natively. HIP would be a way to get there faster, but it
             | will take nonzero effort from CUDA-proficient engineers to
             | get there.
        
           | SushiHippie wrote:
            | AMD has HIPIFY for this, which converts CUDA code to HIP.
           | 
           | https://github.com/ROCm/HIPIFY
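            | 
            | Typical usage (output filename is your choice):
            | 
            |     hipify-perl kernel.cu > kernel.hip.cpp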
        
           | jph00 wrote:
           | Inline PTX is hardly an obscure feature. It's pretty widely
           | used in practice, at least in the AI space.
        
         | viraptor wrote:
         | Isn't cuDNN a much better case for reimplementing than CUDA? It
         | has much more choice in how things actually happen and cuDNN
         | itself chooses different implementations at runtime + does
          | fusing. It seems way more generic, and a reimplementation
          | would allow using the best AMD-targeted kernel rather than the
          | one the original has.
        
           | ckitching wrote:
           | AMD have "MIOpen" which is _basically_ cuDNN-for-AMD. Ish.
        
         | anigbrowl wrote:
          | Given AMD's prior lack of interest I'll take whatever options
         | there are. My daily driver has a Vega 10 GPU and it's been
         | quite frustrating not to be able to easily leverage it for
         | doing basic ML tasks, to the point that I've been looking at
         | buying an external nvidia GPU instead just to try out some of
         | the popular Python libraries.
        
         | Wowfunhappy wrote:
         | > CUDA is not designed to be vendor agnostic and Nvidia can
         | make things arbitrarily difficult [...] technically.
         | 
         | (Let's put the legal questions aside for a moment.)
         | 
         | nVidia changes GPU architectures every generation / few
         | generations, right? How does CUDA work across those--and how
         | can it have forwards compatibility in the future--if it's not
         | designed to be technologically agnostic?
        
           | andy_ppp wrote:
           | One way is to make sure the hardware team does certain things
           | to support easy transition to new architectures, we have seen
           | this with Apple Silicon for example!
        
         | koolala wrote:
        | CUDA v1... CUDA v2... CUDA v... CUDA isn't commonly associated
        | with a version number...
        
           | Uehreka wrote:
           | ...yes it is? https://developer.nvidia.com/cuda-toolkit-
           | archive
        
       | jarbus wrote:
       | Really, really, _really_ curious as to how they managed to pull
       | this off, if their project works as well as they claim it does.
       | If stuff as complex as paged /flash attention can "just work",
       | this is really cool.
        
         | Straw wrote:
         | My understanding from chatting with them is that tensor core
         | operations aren't supported yet, so FlashAttention likely won't
          | work. I think it's on their to-do list though!
         | 
          | Nvidia actually has more, and more capable, matrix
          | multiplication units, so even with a translation layer I
          | wouldn't expect the same performance until AMD produces better
          | ML cards.
         | 
         | Additionally, these kernels usually have high sensitivity to
         | cache and smem sizes, so they might need to be retuned.
        
           | Der_Einzige wrote:
           | So the only part that anyone actually cares about, as usual,
           | is not supported. Same story as it was in 2012 with AMD vs
           | Nvidia (and likely much before that too!). The more things
           | change, the more they stay the same.
        
         | JonChesterfield wrote:
          | CUDA is a programming language. You implement it like any
          | other. The docs are a bit sparse but not awful. Targeting
          | amdgpu is probably about as difficult as targeting x64; it
          | mostly changes the compiler runtime.
         | 
          | The inline PTX implementation is notable for being even more
          | annoying to deal with than the CUDA, but it's just bytes in /
          | different bytes out. No magic.
        
           | ckitching wrote:
           | [I work on SCALE]
           | 
           | CUDA has a couple of extra problems beyond just any other
           | programming language:
           | 
            | - CUDA is more than a language: it's a giant library (for
            | both CPU and GPU) for interacting with the GPU, and for
            | writing the GPU code. This needed reimplementing. At least
            | for the device-side stuff we can implement it _in CUDA_, so
            | when we add support for other GPU vendors the code can
            | (mostly) just be recompiled and work there :D.
            | 
            | - CUDA (the language) is not actually specified. It is,
            | informally, "whatever nvcc does". This differs significantly
            | from what Clang's CUDA support does (which is ultimately
            | what the HIP compiler is derived from).
           | 
           | PTX is indeed vastly annoying.
        
             | JonChesterfield wrote:
             | The openmp device runtime library was originally written in
             | cuda. I ported that to hip for amdgpu, discovered the
             | upstream hip compiler wasn't quite as solid as advertised,
             | then ported it to openmp with some compiler intrinsics. The
             | languages are all essentially C++ syntax with some spurious
              | noise obfuscating LLVM IR. The libc effort has gone with
              | freestanding C++ based on that experience, and we've now
              | mostly fixed the ways that goes wrong.
             | 
             | You might also find raw c++ for device libraries saner to
              | deal with than CUDA. In particular you don't need to jury-
              | rig the thing to not spuriously embed the GPU code in x64
              | ELF objects and/or pull the binaries apart. Though if
             | you're feeding the same device libraries to nvcc with
             | #ifdef around the divergence your hands are tied.
        
               | ckitching wrote:
               | > You might also find raw c++ for device libraries saner
               | to deal with than cuda.
               | 
               | Actually, we just compile all the device libraries to
               | LLVM bitcode and be done with it. Then we can write them
               | using all the clang-dialect, not-nvcc-emulating, C++23 we
               | feel like, and it'll still work when someone imports them
               | into their c++98 CUDA project from hell. :D
        
       | m3kw9 wrote:
        | This isn't a solution for pros because it will always be playing
        | catch-up, and Nvidia can always add things to make it difficult.
        | This is like emulation.
        
         | bachmeier wrote:
         | > it will always play catch up
         | 
          | That's not important if the goal is to run existing CUDA code
          | on AMD GPUs. If you want to keep writing CUDA, all you have to
          | do is write portable CUDA code in the future, regardless of
          | what Nvidia does.
         | 
          | I don't know the economics here, but if AMD provides a
          | significant cost saving, companies are going to make it work.
         | 
         | > Nvidia can always add things to make it difficult
         | 
         | Sounds like Microsoft embedding the browser in the OS. It's
         | hard to see how doing something like that wouldn't trigger an
         | antitrust case.
        
         | dboreham wrote:
         | Pros will end up overruled by bean counters if it works.
        
         | ok123456 wrote:
         | It's not emulation. It's a compiler.
        
       | joe_the_user wrote:
       | This sounds fabulous. I look forward to AMD being drawn kicking
       | and screaming into direct competition with Nvidia.
        
       | gizajob wrote:
       | Is Nvidia not likely to sue or otherwise bork this into non-
       | existence?
        
         | chx wrote:
         | Sue over what...?
        
           | gizajob wrote:
            | Whatever IP-related issues they'd want to sue over. Sorry, I
            | don't know specifically what this would infringe, but I'm
            | sure expensive legal brains could come up with something.
        
         | CoastalCoder wrote:
         | I wonder if nVidia's current anti-trust woes would make them
         | reluctant to go that route at the moment.
        
       | sakras wrote:
       | One question I always have about these sorts of translation
       | layers is how they deal with the different warp sizes. I'd
       | imagine a lot of CUDA code relies on 32-wide warps, while as far
       | as I know AMD tends to have 64-wide warps. Is there some sort of
       | emulation that needs to happen?
        
         | mpreda wrote:
          | The older AMD _GCN_ had a 64-wide wavefront, but the newer AMD
          | "RDNA" GPUs support both 64- and 32-wide wavefronts, and this
          | is configurable at runtime. It appears the narrower wavefronts
          | are better suited for games in general.
          | 
          | Not sure what the situation is with "CDNA", the compute-
          | oriented evolution of "GCN", i.e. whether CDNA is 64-wavefront
          | only or dual like RDNA.
        
         | msond wrote:
         | SCALE is not a "translation layer", it's a full source-to-
         | target compiler from CUDA-like C++ code to AMD GPUs.
         | 
         | See this part of the documentation for more details regarding
         | warp sizes: https://docs.scale-lang.com/manual/language-
         | extensions/#impr...
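          | 
          | As a generic illustration (not from the SCALE docs): reduction
          | code written against the warpSize builtin rather than a
          | literal 32 at least has a chance of porting across wave
          | widths:
          | 
          |     __device__ float warp_sum(float v) {
          |       // lane count comes from the builtin, not from 32
          |       for (int off = warpSize / 2; off > 0; off >>= 1)
          |         v += __shfl_down_sync(0xffffffffu, v, off);
          |       return v;
          |     }
          | 
          | Code that bakes 32 into mask constants or lane-index bit
          | tricks is exactly what a 64-wide target has to paper over.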
        
       | ladberg wrote:
       | I don't really see how any code that depends heavily on the
       | underlying hardware can "just work" on AMD. Most serious CUDA
       | code is aware of register file and shared memory sizes, wgmma
       | instructions, optimal tensor core memory & register layouts,
       | tensor memory accelerator instructions, etc...
       | 
       | Presumably that stuff doesn't "just work" but they don't want to
       | mention it?
        
         | lmeyerov wrote:
         | Sort of
         | 
         | A lot of our hw-aware bits are parameterized where we fill in
         | constants based on the available hw . Doable to port, same as
         | we do whenever new Nvidia architectures come out.
         | 
         | But yeah, we have tricky bits that inline PTX, and.. that will
         | be more annoying to redo.
        
           | Retr0id wrote:
           | > SCALE accepts CUDA programs as-is. [...] This is true even
           | if your program uses inline PTX asm
        
             | lmeyerov wrote:
              | Oh, that will be interesting to understand, as PTX gets
              | into trickier hw-arch-specific phenomena that different
              | brands disagree on, like memory models. Neat!
        
               | lmeyerov wrote:
               | Looks like the PTX translation is via another project
               | ZLUDA, though how they bridge the differences in
               | memory/consistency/etc models safely remains unclear to
               | me...
        
               | ckitching wrote:
               | Hi! Spectral engineer here!
               | 
               | SCALE does not use any part of ZLUDA. We have modified
               | the clang frontend to convert inline PTX asm block to
               | LLVM IR.
               | 
                | To put it in a less compiler-engineer-y way: for any
                | given
               | block of PTX, there exists a hypothetical sequence of
               | C++/CUDA code you could have written to achieve the same
               | effect, but on AMD (perhaps using funky __builtin_...
               | functions if the code includes shuffles/ballots/other-
               | weird-gpu-stuff). Our compiler effectively converts the
               | PTX into that hypothetical C++.
               | 
               | Regarding memory consistency etc.: NVIDIA document the
               | "CUDA memory consistency model" extremely thoroughly, and
               | likewise, the consistency guarantees for PTX. It is
               | therefore sufficient to ensure that we use operations at
               | least as synchronising as those called for in the
               | documented semantics of the language (be it CUDA or PTX,
               | for each operation).
               | 
               | Differing consistency _between architectures_ is the
               | AMDGPU backend's problem.
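                | 
                | A hand-written illustration (not SCALE's actual
                | output): a butterfly-shuffle PTX block like
                | 
                |     asm("shfl.sync.bfly.b32 %0, %1, %2, 0x1f, -1;"
                |         : "=r"(y) : "r"(x), "r"(m));
                | 
                | has the same documented semantics as
                | 
                |     y = __shfl_xor_sync(0xffffffffu, x, m);
                | 
                | so the frontend can emit the IR for the latter and
                | compile forward from there.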
        
               | ladberg wrote:
               | Just to check here, if you're given something like the
               | following PTX:
               | wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16
               | 
               | Do you reverse it back into C++ that does the
               | corresponding FMAs manually instead of using tensor
               | hardware? Or are you able to convert it into a series of
               | __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt instructions that
               | emulate the same behavior?
        
       | shmerl wrote:
        | The compiler isn't open source? That feels DOA in this day and
        | age. There is already ZLUDA, which is open.
       | 
       | If they plan to open it up, it can be something useful to add to
       | options of breaking CUDA lock-in.
        
         | uyzstvqs wrote:
          | ZLUDA is pretty good, except that it lacks cuDNN, which makes
          | most PyTorch projects just not work. Not sure if this project
          | covers that? That could be a game changer; otherwise, yeah,
          | ZLUDA is the better open-source option.
        
       | cheptsov wrote:
        | Sounds really awesome. Can anyone say whether this also works
        | inside a Docker container?
        
         | ckitching wrote:
         | It works exactly as well as other AMDGPU-related software (HIP
         | etc.) works inside Docker.
         | 
         | There are some delightful AMD driver issues that make certain
         | models of GPU intermittently freeze the kernel when used from
         | docker. That was great fun when building SCALE's CI system :D.
        
         | SushiHippie wrote:
          | Works as described in the ROCm documentation (at least
          | scaleinfo worked for me; I haven't tested further).
         | 
         | https://rocm.docs.amd.com/projects/install-on-linux/en/lates...
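          | 
          | For reference, the usual incantation is to pass the GPU
          | device nodes through to the container, along the lines of:
          | 
          |     docker run --device=/dev/kfd --device=/dev/dri \
          |       --group-add video <image>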
        
       | resters wrote:
       | The main cause of Nvidia's crazy valuation is AMD's unwillingness
       | to invest in making its GPUs as useful as Nvidia's for ML.
       | 
       | Maybe AMD fears antitrust action, or maybe there is something
       | about its underlying hardware approach that would limit
       | competitiveness, but the company seems to have left billions of
       | dollars on the table during the crypto mining GPU demand spike
       | and now during the AI boom demand spike.
        
         | karolist wrote:
          | I think this could be cultural differences; AMD's software
          | department has been underfunded and doing poorly for a long
          | time now.
         | 
         | * https://www.levels.fyi/companies/amd/salaries/software-
         | engin...
         | 
         | * https://www.levels.fyi/companies/nvidia/salaries/software-
         | en...
         | 
          | And it's probably better now. Nvidia was paying much more long
          | before, and their growing stock attracts even more talent.
        
           | 1024core wrote:
           | > I think this could be cultural differences, AMD's software
           | department is underfunded and doing poorly for a long time
           | now.
           | 
            | Rumor is that ML engineers (which AMD really needs) are
            | expensive, and AMD doesn't want to give them more money than
            | the rest of its SWEs (for fear of pissing off the existing
            | SWEs). So AMD is caught in a bind: it can't pay to get top
            | MLE talent, and it can't just sit by and watch NVDA eat its
            | lunch.
        
             | mepian wrote:
             | AMD recently acquired Silo AI.
        
             | karolist wrote:
              | I find this hard to believe. Every big company has levels;
              | unless your existing L7+ ICs are below market, you can
              | just pull in L7+ salaried ML engineers with a secret
              | signing bonus like literally everyone else.
        
               | Der_Einzige wrote:
                | The dirty secret in the tech industry is that AMD,
                | Intel, IBM, and historically Nvidia and Oracle (this
                | changed post-2022) were the 2nd-3rd tier tech companies.
                | Staffed heavily by FAANG rejects, they were still happy
                | to have their 100-200K in their MCOL areas, but with no
                | free food and a much more boring work
               | culture. Intel's "great place to work" corporate
               | propaganda was known as "great place to leetcode" while I
               | worked there, as Intel was always seen as a stepping
               | stone before you "made it" in a FAANG.
               | 
               | Culturally, none of these companies were happy to pay
               | anyone except the tip, top "distinguished" engineers more
               | than 300K. AMD seems to be stuck in this mentality, just
               | as IBM is.
        
         | dist-epoch wrote:
         | There are stories from credible sources that AMD software
         | engineers had to buy AMD GPUs with their own money to use in CI
         | machines.
        
         | ClassyJacket wrote:
         | I like to watch YouTube retrospectives on old failed tech
         | companies - LGR has some good ones.
         | 
         | When I think of AMD ignoring machine learning, I can't help
         | imagine a future YouTuber's voiceover explaining how this
         | caused their downfall.
         | 
         | There's a tendency sometimes to think "they know what they're
         | doing, they must have good reasons". And sometimes that's
         | right, and sometimes that's wrong. Perhaps there's some great
         | technical, legal, or economic reason I'm just not aware of. But
         | when you actually look into these things, it's surprising how
         | often the answer is indeed just shortsightedness.
         | 
         | They could end up like BlackBerry, Blockbuster, Nokia, and
         | Kodak. I guess it's not quite as severe, since they will still
         | have a market in games and therefore may well continue to
         | exist, but it will still be looked back on as a colossal
         | mistake.
         | 
         | Same with Toyota ignoring electric cars.
         | 
         | I'm not an investor, but I still have stakes in the sense that
         | Nvidia has no significant competition in the machine learning
         | space, and that sucks. GPU prices are sky high and there's
         | nobody else to turn to if there's something about Nvidia you
         | just don't like or if they decide to screw us.
        
       | paulmist wrote:
       | Doesn't seem to mention CDNA?
        
       | JonChesterfield wrote:
       | This is technically feasible so might be the real thing. Parsing
       | inline ptx and mapping that onto amdgpu would be a huge pain.
       | 
       | Working from cuda source that doesn't use inline ptx to target
       | amdgpu is roughly regex find and replace to get hip, which has
       | implemented pretty much the same functionality.
       | 
       | Some of the details would be dubious, e.g. the atomic models
       | probably don't match, and volta has a different instruction
       | pointer model, but it could all be done correctly.
       | 
        | AMD won't do this. CUDA isn't a very nice thing in general and
        | the legal team would have kittens. But other people totally
        | could.
        
         | ckitching wrote:
         | [I work on SCALE]
         | 
         | Mapping inline ptx to AMD machine code would indeed _suck_.
         | Converting it to LLVM IR right at the start of compilation
         | (when the initial IR is being generated) is much simpler, since
         | it then gets  "compiled forward" with the rest of the code.
         | It's as if you wrote C++/intrinsics/whatever instead.
         | 
         | Note that nvcc accepts a different dialect of C++ from clang
         | (and hence hipcc), so there is in fact more that separates CUDA
         | from hip (at the language level) than just find/replace. We
         | discuss this a little in [the manual](https://docs.scale-
         | lang.com/manual/dialects/)
         | 
         | Handling differences between the atomic models is, indeed,
         | "fun". But since CUDA is a programming language with documented
         | semantics for its memory consistency (and so is PTX) it is
         | entirely possible to arrange for the compiler to "play by
         | NVIDIA's rules".
        
           | JonChesterfield wrote:
           | Huh. Inline assembly is strongly associated in my mind with
           | writing things that can't be represented in LLVM IR, but in
           | the specific case of PTX - you can only write things that
           | ptxas understands, and that probably rules out wide classes
           | of horrendous behaviour. Raw bytes being used for
           | instructions and for data, ad hoc self modifying code and so
           | forth.
           | 
           | I believe nvcc is roughly an antique clang build hacked out
           | of all recognition. I remember it rejecting templates with
           | 'I' as the type name and working when changing to 'T',
           | nonsense like that. The HIP language probably corresponds
           | pretty closely to clang's cuda implementation in terms of
           | semantics (a lot of the control flow in clang treats them
           | identically), but I don't believe an exact match to nvcc was
           | considered particularly necessary for the clang -x cuda work.
           | 
            | The PTX to LLVM IR approach is clever. I think upstream would
           | be game for that, feel free to tag me on reviews if you want
           | to get that divergence out of your local codebase.
        
       | ur-whale wrote:
       | If this actually works (remains to be seen), I can only say:
       | 1) Kudos        2) Finally !
        
         | gedy wrote:
         | or: 1) CUDAs
        
         | anthonix1 wrote:
         | I just tried it with llm.c ... seems to be missing quite a few
         | key components such as cublaslt, bfloat16 support, nvtx3,
         | compiler flags such as -t
         | 
            | And it's linked against an old release of ROCm.
         | 
         | So unclear to me how it is supposed to be an improvement over
         | something like hipify
        
           | ckitching wrote:
           | Greetings, I work on SCALE.
           | 
           | It appears we implemented `--threads` but not `-t` for the
           | compiler flag. Oeps. In either case, the flag has no effect
           | at present, since fatbinary support is still in development,
           | and that's the only part of the process that could
           | conceivably be parallelised.
           | 
           |  _That said_ : clang (and hence the SCALE compiler) tends to
           | compile CUDA much faster than nvcc does, so this lack of the
           | parallelism feature is less problematic than it might at
           | first seem.
           | 
           | NVTX support (if you want more than just "no-ops to make the
           | code compile") requires cooperation with the authors of
            | profilers etc., which has not so far been available.
           | 
           | bfloat16 is not properly supported by AMD anyway: the
            | hardware doesn't do it, and HIP's implementation just lies
           | and does the math in `float`. For that reason we haven't
           | prioritised putting together the API.
           | 
           | cublasLt is a fair cop. We've got a ticket :D.
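            | 
            | (The "no-ops to make the code compile" flavour is tiny - a
            | sketch, not our actual header:
            | 
            |     #define nvtxRangePushA(msg) ((int)0)
            |     #define nvtxRangePop()      ((int)0)
            | 
            | - anything richer than that needs the profiler's side to
            | exist.)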
        
             | anthonix1 wrote:
             | Hi, why do you believe that bfloat16 is not supported? Can
             | you please provide some references (specifically the part
             | about the hardware "doesn't do it")?
             | 
             | For the hardware you are focussing on (gfx11), the
             | reference manual [2] and the list of LLVM gfx11
             | instructions supported [1] describe the bfloat16 vdot &
             | WMMA operations, and these are in fact implemented and
             | working in various software such as composable kernels and
             | rocBLAS, which I have used (and can guarantee they are not
             | simply being run as float). I've also used these in the AMD
             | fork of llm.c [3]
             | 
             | Outside of gfx11, I have also used bfloat16 in CDNA2 & 3
             | devices, and they are working and being supported.
             | 
             | Regarding cublasLt, what is your plan for support there?
             | Pass everything through to hipblasLt (hipify style) or
             | something else?
             | 
             | Cheers, -A
             | 
              | [1] https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX11.html
              | 
              | [2] https://www.amd.com/content/dam/amd/en/documents/radeon-
              | tech...
              | 
              | [3] http://github.com/anthonix/llm.c
        
       | ashvardanian wrote:
       | It's great that there is a page about current limitations [1],
       | but I am afraid that what most people describe as "CUDA" is a
       | small subset of the real CUDA functionality. Would be great to
       | have a comparison table for advanced features like warp shuffles,
       | atomics, DPX, TMA, MMA, etc. Ideally a table, mapping every PTX
       | instruction to a direct RDNA counterpart or a list of
       | instructions used to emulate it.
       | 
       | [1]: https://docs.scale-lang.com/manual/differences/
        
         | ckitching wrote:
          | You're right that most people only use a small subset of CUDA:
          | we prioritised support for features based on what was needed
          | for various open-source projects, as a way to capture the most
          | common things first.
          | 
          | A complete API comparison table is coming soon, I believe. :D
         | 
          | In a nutshell:
          | 
          | - DPX: Yes.
          | 
          | - Shuffles: Yes. Including the PTX versions, with all their
          | weird/wacky/insane arguments.
          | 
          | - Atomics: Yes, except the 128-bit atomics NVIDIA added very
          | recently.
          | 
          | - MMA: In development, though of course we can't fix the fact
          | that NVIDIA's hardware in this area is just _better_ than
          | AMD's, so don't expect performance to be as good in all cases.
          | 
          | - TMA: On the same branch as MMA, though it'll just be using
          | AMD's async copy instructions.
         | 
         | > mapping every PTX instruction to a direct RDNA counterpart or
         | a list of instructions used to emulate it.
         | 
         | We plan to publish a compatibility table of which instructons
         | are supported, but a list of the instructions used to produce
         | each PTX instruction is not in general meaningful. The inline
         | PTX handler works by converting the PTX block to LLVM IR at the
         | start of compilation (at the same time the rest of your code
         | gets turned into IR), so it then "compiles forward" with the
         | rest of the program. As a result, the actual instructions
          | chosen vary on a case-by-case basis due to the whims of the
         | optimiser. This design in principle produces better performance
         | than a hypothetical solution that turned PTX asm into AMD asm,
         | because it conveniently eliminates the optimisation barrier an
         | asm block typically represents. Care, of course, is taken to
         | handle the wacky memory consistency concerns that this implies!
         | 
         | We're documenting which ones are expected to perform worse than
         | on NVIDIA, though!
        
       | qwerty456127 wrote:
       | > gfx1030, gfx1100, gfx1010, gfx1101, gfx900...
       | 
       | How do I find out which do I have?
        
         | ckitching wrote:
         | Like this:
         | 
         | https://docs.scale-lang.com/manual/how-to-use/#identifying-g...
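          | 
          | On a working ROCm install this boils down to something like:
          | 
          |     rocminfo | grep gfx
          | 
          | which prints the gfx architecture name(s) of the GPUs present.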
        
       | galaxyLogic wrote:
       | Companies selling CUDA software should no doubt adopt this tool
        
       | yieldcrv wrote:
        | the real question here is whether anybody has gotten cheap,
        | easily available AMD GPUs to run their AI workloads, and whether
        | we can predict more people will do so
        
       | EGreg wrote:
       | But the question is, can it also run SHUDA and WUDA?
        
       ___________________________________________________________________
       (page generated 2024-07-15 23:00 UTC)