[HN Gopher] Run CUDA, unmodified, on AMD GPUs
___________________________________________________________________
Run CUDA, unmodified, on AMD GPUs
Author : Straw
Score : 435 points
Date : 2024-07-15 19:05 UTC (3 hours ago)
(HTM) web link (docs.scale-lang.com)
(TXT) w3m dump (docs.scale-lang.com)
| dagmx wrote:
| Has anyone tried this and can say how well it works? It
| definitely sounds very compelling
| arjvik wrote:
| Who is this Spectral Compute, and where can we see more about
| them?
| msond wrote:
| You can learn more about us on https://spectralcompute.co.uk
| JonChesterfield wrote:
| The branch free regex engine is an interesting idea. I would
| have said that can't be implemented in finite code.
|
| Compile to DFA by repeatedly differentiating then unroll the
| machine? You'd still have back edges for the repeating
| sections.
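|
| Something like this for the derivative step (untested
| sketch, plain C++, certainly not Spectral's branch-free
| code):
|
|     #include <memory>
|     #include <string>
|
|     // Brzozowski derivative: deriv(c, r) matches exactly the
|     // strings s such that c+s is matched by r. Memoizing the
|     // distinct derivatives enumerates the DFA states.
|     struct Re {
|         enum Kind { Empty, Eps, Chr, Alt, Cat, Star } kind;
|         char c = 0;
|         std::shared_ptr<Re> a, b;
|     };
|     using P = std::shared_ptr<Re>;
|
|     static P mk(Re::Kind k, char c = 0, P a = nullptr, P b = nullptr) {
|         return std::make_shared<Re>(Re{k, c, std::move(a), std::move(b)});
|     }
|
|     static bool nullable(const P& r) {  // matches the empty string?
|         switch (r->kind) {
|             case Re::Eps: case Re::Star: return true;
|             case Re::Alt: return nullable(r->a) || nullable(r->b);
|             case Re::Cat: return nullable(r->a) && nullable(r->b);
|             default: return false;
|         }
|     }
|
|     static P deriv(char c, const P& r) {
|         switch (r->kind) {
|             case Re::Chr: return mk(r->c == c ? Re::Eps : Re::Empty);
|             case Re::Alt:
|                 return mk(Re::Alt, 0, deriv(c, r->a), deriv(c, r->b));
|             case Re::Cat: {
|                 P d = mk(Re::Cat, 0, deriv(c, r->a), r->b);
|                 return nullable(r->a) ? mk(Re::Alt, 0, d, deriv(c, r->b))
|                                       : d;
|             }
|             case Re::Star: return mk(Re::Cat, 0, deriv(c, r->a), r);
|             default: return mk(Re::Empty);  // Empty, Eps
|         }
|     }
|
|     static bool match(P r, const std::string& s) {
|         for (char ch : s) r = deriv(ch, r);  // one step per character
|         return nullable(r);
|     }
|
| The back edges show up exactly where a Star derivative
| reproduces an already-seen state.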
| pixelpoet wrote:
| Isn't this a bit legally dubious, like zluda?
| janice1999 wrote:
| It's advertised as a "clean room" re-implementation. What part
| would be illegal?
| ekelsen wrote:
| If they had to reverse engineer any compiled code to do this,
| I think that would be against licenses they had to agree to?
|
| At least grounds for suing and starting an extensive
| discovery process and possibly a costly injunction...
| msond wrote:
| We have not reverse engineered any compiled code in the
| process of developing SCALE.
|
| It was clean-room implemented purely from the API surface
| and by trial-and-error with open CUDA code.
| RockRobotRock wrote:
| Isn't that exactly what a "clean room" approach avoids?
| ekelsen wrote:
| oh definitely. But if I was NVIDIA I'd want to verify
| that in court after discovery rather than relying on
| their claim on a website.
| Keyframe wrote:
| Can't run useful shit on it: https://docs.nvidia.com/deeplearning/cudnn/latest/reference/...
|
| Namely:
|
| "4.1 License Scope. The SDK is licensed for you to develop
| applications only for use in systems with NVIDIA GPUs."
| mkl wrote:
| So add a cheap NVidia card alongside grunty AMD ones, and
| check for its existence. It doesn't seem to say it needs to
| run on NVidia GPUs.
| Keyframe wrote:
| Heh, true. On the other hand, I bet companies are eager
| to challenge the wrath of a $3T company for a promise of
| "maybe it'll work, not all of it but at least it'll run
| worse, at least for now".
| adzm wrote:
| I'd love to see some benchmarks but this is something the market
| has been yearning for.
| msond wrote:
| We're putting together benchmarks to publish at a later time,
| and we've also asked some independent third parties to produce
| their own.
| acheong08 wrote:
| Impressive if true. Unfortunately not open source and scarce on
| exact details on how it works
|
| Edit: not sure why I just sort of expect projects to be open
| source or at least source available these days.
| tempaccount420 wrote:
| They might be hoping to be acquired by AMD
| ipsum2 wrote:
| They're using Docusaurus[1] for their website, which is most
| commonly used with open source projects.
|
| https://docusaurus.io/docs
| msond wrote:
| Actually, we use MkDocs and the excellent Material for MkDocs
| theme: https://squidfunk.github.io/mkdocs-material/
| msond wrote:
| We're going to be publishing more details on later blog posts
| and documentation about how this works and how we've built it.
|
| Yes, we're not open source, however our license is very
| permissive. It's both in the software distribution and viewable
| online at https://docs.scale-lang.com/licensing/
| breck wrote:
| How about trying _Early_ Source?
|
| It's open source with a long delay, but paying users get the
| latest updates.
|
| Make the git repo from "today - N years" open source, where N
| is something like 1 or 2.
|
| That way, students can learn on old versions, and when they
| grow into professionals they can pay for access to the
| cutting edge builds.
|
| Win win win win
|
| (https://breckyunits.com/earlySource.html)
| msond wrote:
| We're still thinking about our approach but this is a nice
| suggestion, thank you.
|
| I'm curious, for what reasons are you interested in the
| source code yourself?
| mindcrime wrote:
| I'm not the person you replied to, and I can't speak for
| them. But I can say that for myself, and a not small
| number of other people, it's an ideological issue. I
| simply do not use software that isn't F/OSS - to the
| greatest extent that that is possible. For me, I might
| use a VERY small amount of non F/OSS stuff, but it's very
| hard to get me to adopt something new if it isn't.
|
| Now should you make business decisions based on that?
| Probably not. But while I don't claim to be a
| representative sample, I am pretty sure the number of
| people who share my beliefs in this regard is
| substantially "non zero". _shrug_
| atq2119 wrote:
| Not GP, but guaranteed source availability means users
| can fix issues themselves in the future if the original
| provider goes belly-up.
| breck wrote:
| > I'm curious, for what reasons are you interested in the
| source code yourself?
|
| I am the founder/editor of PLDB. So I try to do my best
| to help people "build the next great programming
| language".
|
| We clone the git repos of over 1,000 compilers and
| interpreters and use cloc to determine what languages the
| people who are building languages are using. The people
| who build languages obviously are the experts, so as they
| go, so goes the world.
|
| We call this measurement "Foundation Score". A Foundation
| Score of 100 means 100 other languages use this language
| somehow in their primary implementation.
|
| It is utterly dominated by open source languages, and the
| disparity is only getting more extreme.
|
| You can see for yourself here:
|
| https://pldb.io/lists/explorer.html#columns=rank~name~id~app...
|
| Some that might have become irrelevant have gained a
| second wind after going open source.
|
| But some keep falling further behind.
|
| I look at Mathematica, a very powerful and amazing
| language, and it makes me sad to see so few other
| language designers using it, and the reason is that it's
| closed source. So they are not doing so hot, and
| that's a language from one of our world's smartest and
| most prolific thinkers that's been around for decades.
|
| I don't see a way for a new language to catch on nowadays
| that is not open source.
| TaylorAlexander wrote:
| Makes sense to expect this kind of thing to be open source. The
| whole point of providing improved compatibility is to make
| people's lives easier, and open source is usually an important
| feature to ensure wide compatibility. It also means projects
| can live on after the creators move to other things, people can
| submit patches for important features or bug fixes, and
| generally makes the system much more useful.
| dylan604 wrote:
| I don't find it wrong for someone to attempt to make money
| back on their time and experience of doing the work. I don't
| mind people that offer that back as open source either.
| However, I do have a problem with people expecting everything
| to be open/free, especially those that then go on a crusade
| chastising those that do try to make money.
| TaylorAlexander wrote:
| I'm really trying to keep this about the engineering
| features of a system rather than moral judgments. Open
| source systems are simply more flexible and adaptable than
| proprietary systems, which have their own benefits. In
| today's world, the engineering value of open source systems
| is becoming so important that people are looking for other
| ways to provide for the developers creating these systems.
| It can be surprising when a project creator builds
| something in an area that is usually all open source, but
| they choose a proprietary path. Just look at the problems
| created by NVIDIA for their use of proprietary software in
| CUDA and their GPUs. This software is an attempt to fix
| issues created by proprietary software with another piece
| of proprietary software, which is if nothing else an
| interesting decision.
| dylan604 wrote:
| UNIX wasn't free. Windows wasn't free. It wasn't until
| some knucklehead came along and did something abnormal
| and gave away their thing. Bakers don't give away their
| goods. Mechanics don't typically repair things for free.
| Builders don't build things for free. Gas stations don't
| give away gas.
|
| Why do we think all software should be free, and then
| think that those that don't give it away are the abnormal
| ones?
| dheera wrote:
| Also, can I even buy an AMD GPU? I don't see a "buy now" button
| or a PCIe version anywhere here
|
| https://www.amd.com/en/products/accelerators/instinct/mi300/...
|
| Another big AMD fuckup in my opinion. Nobody is going to drop
| millions on these things without being able to test them out
| first.
|
| First rule of sales: If you have something for sale, take my
| money.
| nwiswell wrote:
| > I don't see a "buy now" button or a PCIe version anywhere
| here
|
| "Buy now" buttons and online shopping carts are not generally
| how organizations looking to spend serious money on AI buy
| their hardware.
|
| They have a long list of server hardware partners, and odds
| are you'd already have an existing relationship with one or
| more of them, and they'd provide a quote.
|
| They even go one step further and show off some of their
| partners' solutions:
|
| https://www.amd.com/en/graphics/servers-instinct-deep-learni...
|
| FWIW I believe Supermicro and Exxact actually do have web-
| based shopping carts these days, so maybe you could skip the
| quotation and buy directly if you were so motivated? Seems
| kind of weird at this price point.
|
| https://www.exxactcorp.com/Exxact-TS4-185328443-E185328443
| dheera wrote:
| ... and that's why AMD is losing.
|
| They could break the trend and offer a "buy now" button
| instead of offering quotes and coffee chats. It's very
| likely that will kickstart the software snowball with early
| adopters.
|
| Nobody is going to drop millions on an unproven platform.
|
| > Seems kind of weird at this price point.
|
| Yeah $234K is too much for people to do a trial. It has
| 8xMI300X GPUs.
|
| Give me a single MI300X GPU in PCIe form factor for $20K
| and I'd very seriously consider it.
| juujian wrote:
| I don't understand how AMD has messed up so badly that I feel
| like celebrating a project like this. Features of my laptop are
| just physically there but not usable, particularly in Linux. So
| frustrating.
| djbusby wrote:
| Same boat, AMD CPU but nothing else. I feel like a moderate
| improvement of their FOSS support and drivers would open new
| hardware revenue - to say nothing about the AI channel.
| ActorNightly wrote:
| I don't know if I would call it a mess-up. AMD still has a
| massive market in server chips, and their ARM stuff is on the
| horizon. We all assume that graphics cards are the way forward
| for ML, which may not be the case in the future.
|
| Nvidia were just ahead in this particular category due to CUDA,
| so AMD may have just let them run with it for now.
| jeroenhd wrote:
| AMD hardware works fine, the problem is that the major research
| projects everyone copies are all developed specifically for
| Nvidia.
|
| Now AMD is spinning up CUDA compatibility layer after CUDA
| compatibility layer. It's like trying to beat Windows by
| building another ReactOS/Wine. It's an approach doomed to fail
| unless AMD somehow manages to gain vastly more resources than
| the competition.
|
| Apple's NPU may not be very powerful, but many models have been
| altered specifically to run on them, making their NPUs vastly
| more useful than most equivalently powerful iGPUs. AMD doesn't
| have that just yet, they're always catching up.
|
| It'll be interesting to see what Qualcomm will do to get
| developers to make use of their NPUs on the new laptop chips.
| JonChesterfield wrote:
| Interesting analogy. The last few programs from the windows
| world I tried to run were flawless under wine and abjectly
| failed under windows 11.
| deliveryboyman wrote:
| Would like to see benchmarks for the applications in the test
| suite.
|
| E.g., how does Cycles compare on AMD vs Nvidia?
| Straw wrote:
| I worked for Spectral Compute a few years ago. Very smart and
| capable technical team.
|
| At the time, not only did they target AMD (with less
| compatibility than they have now), but also outperformed the
| default LLVM ptx backend, and even NVCC, when compiling for
| Nvidia GPUs!
| modeless wrote:
| A lot of people think AMD should support these translation layers
| but I think it's a bad idea. CUDA is not designed to be vendor
| agnostic and Nvidia can make things arbitrarily difficult both
| technically and legally. For example I think it would be against
| the license agreement of cuDNN or cuBLAS to run them on this. So
| those and other Nvidia libraries would become part of the API
| boundary that AMD would need to reimplement and support.
|
| Chasing bug-for-bug compatibility is a fool's errand. The
| important users of CUDA are open source. AMD can implement
| support directly in the upstream projects like pytorch or
| llama.cpp. And once support is there it can be maintained by the
| community.
| DeepYogurt wrote:
| Ya, honestly better to leave that to third parties who can
| dedicate themselves to it and maybe offer support or whatever.
| Let AMD work on good first party support first.
| fngjdflmdflg wrote:
| >Nvidia can make things arbitrarily difficult both technically
| and legally.
|
| I disagree. AMD can simply not implement those APIs, similar to
| how game emulators implement the most used APIs first and
| sometimes never bother implementing obscure ones. It would only
| matter that NVIDIA added e.g. patented APIs to CUDA if those
| APIs were useful. In which case AMD should have a way to do
| them anyway. Unless NVIDIA comes up with a new patented API
| which is both useful and impossible to implement in any other
| way, which would be bad for AMD in any event. On the other
| hand, if AMD start supporting CUDA and people start using AMD
| cards, then developers will be hesitant to use APIs that only
| work on NVIDIA cards. Right now they are losing billions of
| dollars on this. Then again they barely seem capable of
| supporting ROCm on their cards, much less CUDA.
|
| You have a fair point in terms of cuDNN and cuBLAS but I don't
| know that that kind of ToS is actually binding.
| dietr1ch wrote:
| How's this situation different than the one around Java,
| Sun/Oracle and Google?
| dboreham wrote:
| The judge might not be a coder next time.
| viraptor wrote:
| US law is highly dependent on precedent. The Google-Oracle
| case fortunately set one, so anything following it won't
| start from scratch; we may not need a coder judge next
| time.
| jjk166 wrote:
| Google-Oracle side stepped the issue of API
| copyrightability by saying Google's particular
| implementation would fall under fair use. Whether APIs
| are copyrightable remains an open question.
| dylan604 wrote:
| Until you get an activist court
| blitzar wrote:
| It would be good if AMD did something, anything.
|
| Support this, reimplement that, support upstream efforts, don't
| really care. Any of those would cost a couple of million and be
| worth a trillion dollars to AMD shareholders.
| oezi wrote:
| A couple of million doesn't get you anything in corporate
| land
| spacebanana7 wrote:
| A couple dozen billion for a 10% chance of becoming NVIDIA
| competitive is worth it, looking at the stock prices.
| slashdave wrote:
| ROCm counts as "something"
| curt15 wrote:
| Pretty much any modern NVIDIA GPU supports CUDA. You don't
| have to buy a datacenter-class unit to get your feet wet
| with CUDA programming. ROCm will count as "something" when
| the same is true for AMD GPUs.
| muxr wrote:
| I don't think AMD needs to support 5+ year old GPUs
| personally. And all the recent generations are already
| practically supported.
|
| AMD only claims support for a select few GPUs, but in my
| testing I find all the GPUs work fine if the architecture
| is supported. I've tested rx6600, rx6700xt for example
| and even though they aren't officially supported, they
| work fine on ROCm.
| Const-me wrote:
| > Nvidia can make things arbitrarily difficult both technically
| and legally
|
| Pretty sure APIs are not copyrightable, e.g.
| https://www.law.cornell.edu/supremecourt/text/18-956
|
| > against the license agreement of cuDNN or cuBLAS to run them
| on this
|
| They don't run either of them, they instead implement an
| equivalent API on top of something else. Here's a quote: "Open-
| source wrapper libraries providing the "CUDA-X" APIs by
| delegating to the corresponding ROCm libraries. This is how
| libraries such as cuBLAS and cuSOLVER are handled."
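|
| A hypothetical shim in that spirit (my own sketch, not
| SCALE's actual code; to_rocblas_op is an assumed helper):
|
|     #include <rocblas/rocblas.h>
|
|     // Translate the cuBLAS transpose enum (CUBLAS_OP_N=0,
|     // CUBLAS_OP_T=1, CUBLAS_OP_C=2) to its rocBLAS equivalent.
|     static rocblas_operation to_rocblas_op(int op) {
|         switch (op) {
|             case 1:  return rocblas_operation_transpose;
|             case 2:  return rocblas_operation_conjugate_transpose;
|             default: return rocblas_operation_none;
|         }
|     }
|
|     // cuBLAS-shaped entry point that simply forwards to rocBLAS.
|     // (A real wrapper also maps handles, statuses, streams, etc.)
|     extern "C" int cublasSgemm_v2(void* handle, int ta, int tb,
|                                   int m, int n, int k,
|                                   const float* alpha,
|                                   const float* A, int lda,
|                                   const float* B, int ldb,
|                                   const float* beta,
|                                   float* C, int ldc) {
|         return (int)rocblas_sgemm((rocblas_handle)handle,
|                                   to_rocblas_op(ta), to_rocblas_op(tb),
|                                   m, n, k, alpha, A, lda,
|                                   B, ldb, beta, C, ldc);
|     }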
| dralley wrote:
| I believe it was decided that they are copyrightable but that
| using them for compatibility purposes is fair use.
| kbolino wrote:
| No, it's stranger than that: SCOTUS did not rule on
| copyrightability of APIs at all, but simply ruled that even
| _if_ they are copyrightable, what Google did (completely
| reimplement Sun/Oracle's public API) was still fair use.
| mrandish wrote:
| It would have been nice to get a clear SCOTUS precedent
| on this. On the other hand, I also value a SCOTUS which
| rules minimally and narrowly by default (I also
| appreciate SCOTUS' return to stricter constitutional
| grounding in the past decade).
| hobs wrote:
| Incredibly loud laughing from the lawyers whose study of
| law is being thrown around willy-nilly because of all the
| unprecedented joke decisions they are making right now.
| kbolino wrote:
| We are stuck between a rock and a hard place politically.
| The real decisions should be coming from Congress not the
| courts. However, Congress is too disorganized and
| disconnected to answer the important questions, leaving
| the courts to either muddle along or else become semi-
| dictatorial. In most countries, this would cause a
| constitutional crisis, but the modern U.S. system seems
| to be a little too resilient to such otherwise concerning
| signals.
| hobs wrote:
| We're far past a constitutional crisis, and the courts
| taking power nobody wanted to give to them (who wasn't
| interested in a unitary executive at least) isn't a good
| solution.
| amelius wrote:
| Like supporting x86 was a bad idea as well?
| karolist wrote:
| Was there a large entity steering the x86 spec alone, with
| a huge feature lead over its competition, free to steer the
| spec any way it chose? Also, hardware is not open-source
| software: get big players onboard and they will implement
| the spec they want every generation, while software has
| more moving parts and unaligned parties involved.
| cherryteastain wrote:
| > Was there a large entity steering x86 spec alone with a
| huge feature lead against their competition, free to steer
| the spec in any ways they choose?
|
| Ever heard of Intel?
| karolist wrote:
| I hadn't considered that angle. Is your point that Intel
| was the creator of x86, but software chose to support it,
| then AMD had nothing else but to play catch up in x86
| support to be part of the software target market? If so
| and factual (I've no idea), fair point, I didn't know.
| marshray wrote:
| It was exactly the same instruction set.
|
| C compilers didn't offer an "AMD" CPU target* until AMD
| came out with the "AMD64" instruction set. Today we call
| this "x86_64" or "x64".
|
| * Feel free to point out some custom multimedia vector
| extensions for Athlons or something, but the point
| remains.
| modeless wrote:
| Before starting, AMD signed an agreement with Intel that gave
| them an explicit license to x86. And x86 was a whole lot
| smaller and simpler back then in _1982_. A completely
| different and incomparable situation.
| nostrademons wrote:
| Technically it was after starting - AMD was founded in 1969
| as a second-sourcer for Fairchild and National
| Semiconductor, and had reverse-engineered the 8080 by 1975
| and acquired a formal license to it by 1976.
|
| The 1982 deal you speak of was actually pretty interesting:
| as a condition of the x86's use in the IBM PC, IBM
| requested a second source for x86 chips. AMD was that
| source, and so they cross-licensed the x86 in 1982 to allow
| the IBM PC project to proceed forward. This makes the
| Intel/AMD deal even more important for both companies: _the
| PC market would never have developed_ without the cross-
| licensing, which would've been bad for all companies
| involved. This gave Intel an ongoing stake in AMD's success
| at least until the PC market consolidated on the x86
| standard.
| eslaught wrote:
| Are you aware of HIP? It's officially supported and, for code
| that avoids obscure features of CUDA like inline PTX, it's
| pretty much a find-and-replace to get a working build:
|
| https://github.com/ROCm/HIP
|
| Don't believe me? Include this at the top of your CUDA code,
| build with hipcc, and see what happens:
|
| https://gitlab.com/StanfordLegion/legion/-/blob/master/runti...
|
| It's incomplete because I'm lazy but you can see most things
| are just a single #ifdef away in the implementation.
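|
| For a flavor of what such a header does (abridged sketch,
| not the actual Legion file):
|
|     #ifdef __HIP_PLATFORM_AMD__
|     #include <hip/hip_runtime.h>
|     // CUDA runtime names become aliases for their HIP twins;
|     // the two APIs were designed to mirror each other 1:1.
|     #define cudaError_t            hipError_t
|     #define cudaSuccess            hipSuccess
|     #define cudaStream_t           hipStream_t
|     #define cudaMalloc             hipMalloc
|     #define cudaFree               hipFree
|     #define cudaMemcpy             hipMemcpy
|     #define cudaMemcpyHostToDevice hipMemcpyHostToDevice
|     #define cudaMemcpyDeviceToHost hipMemcpyDeviceToHost
|     #define cudaDeviceSynchronize  hipDeviceSynchronize
|     #define cudaGetLastError       hipGetLastError
|     #endif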
| currymj wrote:
| if you're talking about building anything, that is already
| too hard for ML researchers.
|
| you have to be able to pip install something and just have it
| work, reasonably fast, without crashing, and also it has to
| not interfere with 100 other weird poorly maintained ML
| library dependencies.
| bootsmann wrote:
| Don't most orgs that are deep enough to run custom cuda
| kernels have dedicated engineers for this stuff? I can't
| imagine a person who can write raw cuda not being able to
| handle things more difficult than pip install.
| gaogao wrote:
| Engineers who are really, really good at CUDA are worth
| their weight in gold, so there's more projects for them
| than they have time. Worth their weight in gold isn't
| figurative here - the one I know has a ski house more
| expensive than 180 lbs of gold (~$5,320,814).
| bbkane wrote:
| Would you (or your friend) be able to drop any good CUDA
| learning resources? I'd like to be worth my weight in
| gold...
| eigenvalue wrote:
| That's pretty funny. Good test of value across the
| millennia. I wonder if the best aqueduct engineers during
| the peak of Ancient Rome's power had villas worth their
| body weight in gold.
| jchw wrote:
| The target audience of interoperability technology is
| whoever is building, though. Ideally, interoperability
| technology can help software that supports only NVIDIA GPUs
| today go on to quickly add baseline support for Intel and
| AMD GPUs tomorrow.
|
| (and for one data point, I believe Blender is actively
| using HIP for AMD GPU support in Cycles.)
| Agingcoder wrote:
| Their target is hpc users, not ml researchers. I can
| understand why this would be valuable to this particular
| crowd.
| eslaught wrote:
| If your point is that HIP is not a zero-effort porting
| solution, that is correct. HIP is a _low_-effort solution,
| not a zero effort solution. It targets users who already
| use and know CUDA, and minimizes the changes that are
| required from pre-existing CUDA code.
|
| In the case of these abstraction layers, then it would be
| the responsibility of the abstraction maintainers (or AMD)
| to port them. Obviously, someone who does not even use CUDA
| would not use HIP either.
|
| To be honest, I have a hard time believing that a truly
| zero-effort solution exists. Especially one that gets high
| performance. Once you start talking about the full stack,
| there are too many potholes and sharp edges to believe that
| it will really work. So I am highly skeptical of the original
| article. Not that I wouldn't want to be proved wrong. But
| what they're claiming to do is a big lift, even taking HIP
| as a starting point.
|
| The easiest, fastest (for end users), highest-performance
| solution for ML will come when the ecosystem integrates it
| natively. HIP would be a way to get there faster, but it
| will take nonzero effort from CUDA-proficient engineers to
| get there.
| SushiHippie wrote:
| AMD has hipify for this, which converts cuda code to hip.
|
| https://github.com/ROCm/HIPIFY
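|
| For example, it rewrites runtime calls and kernel launches
| along these lines (illustrative sketch, not actual HIPIFY
| output):
|
|     // Before (CUDA):
|     cudaMalloc(&d_buf, bytes);
|     kernel<<<grid, block, 0, stream>>>(d_buf);
|
|     // After (HIP): same shapes, hip-prefixed names, and a
|     // launch macro in place of the <<<>>> syntax.
|     hipMalloc(&d_buf, bytes);
|     hipLaunchKernelGGL(kernel, grid, block, 0, stream, d_buf);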
| jph00 wrote:
| Inline PTX is hardly an obscure feature. It's pretty widely
| used in practice, at least in the AI space.
| viraptor wrote:
| Isn't cuDNN a much better case for reimplementing than CUDA? It
| has much more choice in how things actually happen and cuDNN
| itself chooses different implementations at runtime + does
| fusing. It seems way more generic and the reimplementation
| would allow using the best AMD-targeted kernel rather than
| whichever one the original picked.
| ckitching wrote:
| AMD have "MIOpen" which is _basically_ cuDNN-for-AMD. Ish.
| anigbrowl wrote:
| Given AMD's prior lack of interest I'll take whatever options
| there are. My daily driver has a Vega 10 GPU and it's been
| quite frustrating not to be able to easily leverage it for
| doing basic ML tasks, to the point that I've been looking at
| buying an external nvidia GPU instead just to try out some of
| the popular Python libraries.
| Wowfunhappy wrote:
| > CUDA is not designed to be vendor agnostic and Nvidia can
| make things arbitrarily difficult [...] technically.
|
| (Let's put the legal questions aside for a moment.)
|
| nVidia changes GPU architectures every generation / few
| generations, right? How does CUDA work across those--and how
| can it have forwards compatibility in the future--if it's not
| designed to be technologically agnostic?
| andy_ppp wrote:
| One way is to make sure the hardware team does certain things
| to support easy transition to new architectures; we have seen
| this with Apple Silicon, for example!
| koolala wrote:
| CUDA v1... CUDA v2... CUDA v... CUDA isn't commonly
| associated with a version number...
| Uehreka wrote:
| ...yes it is? https://developer.nvidia.com/cuda-toolkit-archive
| jarbus wrote:
| Really, really, _really_ curious as to how they managed to pull
| this off, if their project works as well as they claim it does.
| If stuff as complex as paged/flash attention can "just work",
| this is really cool.
| Straw wrote:
| My understanding from chatting with them is that tensor core
| operations aren't supported yet, so FlashAttention likely won't
| work. I think it's on their to-do list though!
|
| Nvidia actually has more, and more capable, matrix multiplication
| units, so even with a translation layer I wouldn't expect the
| same performance until AMD produces better ML cards.
|
| Additionally, these kernels usually have high sensitivity to
| cache and smem sizes, so they might need to be retuned.
| Der_Einzige wrote:
| So the only part that anyone actually cares about, as usual,
| is not supported. Same story as it was in 2012 with AMD vs
| Nvidia (and likely much before that too!). The more things
| change, the more they stay the same.
| JonChesterfield wrote:
| Cuda is a programming language. You implement it like any
| other. The docs are a bit sparse but not awful. Targeting
| amdgpu is probably about as difficult as targeting x64, mostly
| changes the compiler runtime.
|
| The inline ptx implementation is notable for being even more
| annoying to deal with than the cuda, but it's just bytes in /
| different bytes out. No magic.
| ckitching wrote:
| [I work on SCALE]
|
| CUDA has a couple of extra problems beyond just any other
| programming language:
|
| - CUDA is more than a language: it's a giant library (for
| both CPU and GPU) for interacting with the GPU, and for
| writing the GPU code. This needed reimplementing. At least
| for the device-side stuff we can implement it _in CUDA_, so
| when we add support for other GPU vendors the code can
| (mostly) just be recompiled and work there :D.
|
| - CUDA (the language) is not actually specified. It is,
| informally, "whatever nvcc does". This differs significantly
| from what Clang's CUDA support does (which is ultimately
| what the HIP compiler is derived from).
|
| PTX is indeed vastly annoying.
| JonChesterfield wrote:
| The openmp device runtime library was originally written in
| cuda. I ported that to hip for amdgpu, discovered the
| upstream hip compiler wasn't quite as solid as advertised,
| then ported it to openmp with some compiler intrinsics. The
| languages are all essentially C++ syntax with some spurious
| noise obfuscating llvm IR. The libc effort has gone with
| freestanding c++ based on that experience, and we've now
| mostly fixed the ways that goes wrong.
|
| You might also find raw c++ for device libraries saner to
| deal with than cuda. In particular you don't need to jury
| rig the thing to not spuriously embed the GPU code in x64
| elf objects and/or pull the binaries apart. Though if
| you're feeding the same device libraries to nvcc with
| #ifdef around the divergence your hands are tied.
| ckitching wrote:
| > You might also find raw c++ for device libraries saner
| to deal with than cuda.
|
| Actually, we just compile all the device libraries to
| LLVM bitcode and be done with it. Then we can write them
| using all the clang-dialect, not-nvcc-emulating, C++23 we
| feel like, and it'll still work when someone imports them
| into their c++98 CUDA project from hell. :D
| m3kw9 wrote:
| This isn't a solution for pros because it will always play catch
| up and Nvidia can always add things to make it difficult. This is
| like emulation.
| bachmeier wrote:
| > it will always play catch up
|
| That's not important if the goal is to run existing CUDA code
| on AMD GPUs. If you want to keep writing CUDA, all you have
| to do is write portable CUDA code going forward, regardless
| of what Nvidia does.
|
| I don't know the economics here, but if AMD provides a
| significant cost saving, companies are going to make it work.
|
| > Nvidia can always add things to make it difficult
|
| Sounds like Microsoft embedding the browser in the OS. It's
| hard to see how doing something like that wouldn't trigger an
| antitrust case.
| dboreham wrote:
| Pros will end up overruled by bean counters if it works.
| ok123456 wrote:
| It's not emulation. It's a compiler.
| joe_the_user wrote:
| This sounds fabulous. I look forward to AMD being drawn kicking
| and screaming into direct competition with Nvidia.
| gizajob wrote:
| Is Nvidia not likely to sue or otherwise bork this into non-
| existence?
| chx wrote:
| Sue over what...?
| gizajob wrote:
| Whatever IP-related issues they'd want to sue over. Sorry, I
| don't know what this would specifically infringe, but I'm
| sure expensive legal brains could come up with something.
| CoastalCoder wrote:
| I wonder if nVidia's current anti-trust woes would make them
| reluctant to go that route at the moment.
| sakras wrote:
| One question I always have about these sorts of translation
| layers is how they deal with the different warp sizes. I'd
| imagine a lot of CUDA code relies on 32-wide warps, while as far
| as I know AMD tends to have 64-wide warps. Is there some sort of
| emulation that needs to happen?
| mpreda wrote:
| The older AMD _GCN_ had a 64-wide wavefront, but the newer
| AMD "RDNA" GPUs support both 64- and 32-wide wavefronts, and
| this is configurable at runtime. It appears the narrower
| wavefronts are better suited for games in general.
|
| Not sure what the situation is with "CDNA", the compute-
| oriented evolution of "GCN", i.e. whether CDNA is
| 64-wavefront only or dual like RDNA.
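|
| A tiny illustration (my sketch) of the assumption that
| breaks: anything that bakes in 32 instead of asking the
| device.
|
|     __global__ void lane_ids(int* out) {
|         // Hard-coded warp size: wrong on a 64-wide wavefront.
|         int lane_bad  = threadIdx.x & 31;
|         // Portable: warpSize is a built-in device-side value
|         // (32 on NVIDIA, 32 or 64 on AMD).
|         int lane_good = threadIdx.x % warpSize;
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         out[i] = lane_good - lane_bad;  // nonzero => assumption broke
|     }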
| msond wrote:
| SCALE is not a "translation layer", it's a full source-to-
| target compiler from CUDA-like C++ code to AMD GPUs.
|
| See this part of the documentation for more details regarding
| warp sizes:
| https://docs.scale-lang.com/manual/language-extensions/#impr...
| ladberg wrote:
| I don't really see how any code that depends heavily on the
| underlying hardware can "just work" on AMD. Most serious CUDA
| code is aware of register file and shared memory sizes, wgmma
| instructions, optimal tensor core memory & register layouts,
| tensor memory accelerator instructions, etc...
|
| Presumably that stuff doesn't "just work" but they don't want to
| mention it?
| lmeyerov wrote:
| Sort of
|
| A lot of our hw-aware bits are parameterized where we fill in
| constants based on the available hw. Doable to port, same as
| we do whenever new Nvidia architectures come out.
|
| But yeah, we have tricky bits that inline PTX, and... that will
| be more annoying to redo.
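|
| (Concretely, that parameterization mostly means querying
| device properties instead of hard-coding them; a minimal
| sketch:
|
|     cudaDeviceProp p;
|     cudaGetDeviceProperties(&p, /*device=*/0);
|     size_t smem = p.sharedMemPerBlock;  // drives tile sizing
|     int regs    = p.regsPerBlock;       // drives occupancy math
|     int width   = p.warpSize;           // 32 on NV, 32/64 on AMD
|
| and then plugging those into the launch configuration.)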
| Retr0id wrote:
| > SCALE accepts CUDA programs as-is. [...] This is true even
| if your program uses inline PTX asm
| lmeyerov wrote:
| Oh that will be interesting to understand, as PTX gets
| into trickier hw-arch-specific phenomena that different
| brands disagree on, like memory models. Neat!
| lmeyerov wrote:
| Looks like the PTX translation is via another project
| ZLUDA, though how they bridge the differences in
| memory/consistency/etc models safely remains unclear to
| me...
| ckitching wrote:
| Hi! Spectral engineer here!
|
| SCALE does not use any part of ZLUDA. We have modified
| the clang frontend to convert inline PTX asm block to
| LLVM IR.
|
| To put in a less compiler-engineer-ey way: for any given
| block of PTX, there exists a hypothetical sequence of
| C++/CUDA code you could have written to achieve the same
| effect, but on AMD (perhaps using funky __builtin_...
| functions if the code includes shuffles/ballots/other-
| weird-gpu-stuff). Our compiler effectively converts the
| PTX into that hypothetical C++.
|
| Regarding memory consistency etc.: NVIDIA document the
| "CUDA memory consistency model" extremely thoroughly, and
| likewise, the consistency guarantees for PTX. It is
| therefore sufficient to ensure that we use operations at
| least as synchronising as those called for in the
| documented semantics of the language (be it CUDA or PTX,
| for each operation).
|
| Differing consistency _between architectures_ is the
| AMDGPU backend's problem.
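|
| A concrete (simplified, hypothetical) example of that
| conversion:
|
|     // What the user wrote, with an inline PTX block:
|     __device__ unsigned lane_id() {
|         unsigned ret;
|         asm volatile("mov.u32 %0, %%laneid;" : "=r"(ret));
|         return ret;
|     }
|
|     // The equivalent CUDA the compiler behaves as if you had
|     // written (for a 1-D block), which then lowers through
|     // the AMDGPU backend like any other code:
|     __device__ unsigned lane_id() {
|         return threadIdx.x % warpSize;
|     }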
| ladberg wrote:
| Just to check here, if you're given something like the
| following PTX:
| wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16
|
| Do you reverse it back into C++ that does the
| corresponding FMAs manually instead of using tensor
| hardware? Or are you able to convert it into a series of
| __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt instructions that
| emulate the same behavior?
| shmerl wrote:
| Compiler isn't open source? That feels like DOA in this day and
| age. There is ZLUDA already which is open.
|
| If they plan to open it up, it can be something useful to add to
| the options for breaking CUDA lock-in.
| uyzstvqs wrote:
| ZLUDA is pretty good, except that it lacks cuDNN which makes
| most PyTorch projects just not work. Not sure if this project
| covers that? That could be a game changer; otherwise, yeah
| ZLUDA is the better open-source option.
| cheptsov wrote:
| Sounds really awesome. Any chance someone can say whether
| this also works inside a Docker container?
| ckitching wrote:
| It works exactly as well as other AMDGPU-related software (HIP
| etc.) works inside Docker.
|
| There are some delightful AMD driver issues that make certain
| models of GPU intermittently freeze the kernel when used from
| docker. That was great fun when building SCALE's CI system :D.
| SushiHippie wrote:
| Works as described in the ROCm documentation (at least the
| scaleinfo worked for me, haven't tested further)
|
| https://rocm.docs.amd.com/projects/install-on-linux/en/lates...
| resters wrote:
| The main cause of Nvidia's crazy valuation is AMD's unwillingness
| to invest in making its GPUs as useful as Nvidia's for ML.
|
| Maybe AMD fears antitrust action, or maybe there is something
| about its underlying hardware approach that would limit
| competitiveness, but the company seems to have left billions of
| dollars on the table during the crypto mining GPU demand spike
| and now during the AI boom demand spike.
| karolist wrote:
| I think this could be cultural differences; AMD's software
| department has been underfunded and doing poorly for a long
| time now.
|
| * https://www.levels.fyi/companies/amd/salaries/software-engin...
|
| * https://www.levels.fyi/companies/nvidia/salaries/software-en...
|
| And it's probably better now. Nvidia was paying much more long
| before, and their growing stock attracts even more talent.
| 1024core wrote:
| > I think this could be cultural differences; AMD's software
| department has been underfunded and doing poorly for a long
| time now.
|
| Rumor is that ML engineers (that AMD really needs) are
| expensive; and AMD doesn't want to give them more money than
| the rest of the SWEs they have (for fear of pissing off the existing
| SWEs). So AMD is caught in a bind: can't pay to get top MLE
| talent and can't just sit by and watch NVDA eat its lunch.
| mepian wrote:
| AMD recently acquired Silo AI.
| karolist wrote:
| I find this hard to believe. Every big company has
| levels, unless your existing L7+ IC is below market, you
| can just pull L7+ salaried ML engineers with some secret
| signing bonus like literally everyone else.
| Der_Einzige wrote:
| The dirty secret in the tech industry is that AMD, Intel,
| IBM, and historically Nvidia/Oracle (this changed post
| 2022) were the 2nd-3rd tier tech companies. Staffed
| heavily by the rejects of the FAANG,
| they were still happy to have their 100-200K in their
| MCOL areas, but no free food and a much more boring work
| culture. Intel's "great place to work" corporate
| propaganda was known as "great place to leetcode" while I
| worked there, as Intel was always seen as a stepping
| stone before you "made it" in a FAANG.
|
| Culturally, none of these companies were happy to pay
| anyone except the tip-top "distinguished" engineers more
| than 300K. AMD seems to be stuck in this mentality, just
| as IBM is.
| dist-epoch wrote:
| There are stories from credible sources that AMD software
| engineers had to buy AMD GPUs with their own money to use in CI
| machines.
| ClassyJacket wrote:
| I like to watch YouTube retrospectives on old failed tech
| companies - LGR has some good ones.
|
| When I think of AMD ignoring machine learning, I can't help
| imagining a future YouTuber's voiceover explaining how this
| caused their downfall.
|
| There's a tendency sometimes to think "they know what they're
| doing, they must have good reasons". And sometimes that's
| right, and sometimes that's wrong. Perhaps there's some great
| technical, legal, or economic reason I'm just not aware of. But
| when you actually look into these things, it's surprising how
| often the answer is indeed just shortsightedness.
|
| They could end up like BlackBerry, Blockbuster, Nokia, and
| Kodak. I guess it's not quite as severe, since they will still
| have a market in games and therefore may well continue to
| exist, but it will still be looked back on as a colossal
| mistake.
|
| Same with Toyota ignoring electric cars.
|
| I'm not an investor, but I still have stakes in the sense that
| Nvidia has no significant competition in the machine learning
| space, and that sucks. GPU prices are sky high and there's
| nobody else to turn to if there's something about Nvidia you
| just don't like or if they decide to screw us.
| paulmist wrote:
| Doesn't seem to mention CDNA?
| JonChesterfield wrote:
| This is technically feasible so might be the real thing. Parsing
| inline ptx and mapping that onto amdgpu would be a huge pain.
|
| Working from cuda source that doesn't use inline ptx to target
| amdgpu is roughly regex find and replace to get hip, which has
| implemented pretty much the same functionality.
|
| Some of the details would be dubious, e.g. the atomic models
| probably don't match, and volta has a different instruction
| pointer model, but it could all be done correctly.
|
| Amd won't do this. Cuda isn't a very nice thing in general and
| the legal team would have kittens. But other people totally
| could.
| ckitching wrote:
| [I work on SCALE]
|
| Mapping inline ptx to AMD machine code would indeed _suck_.
| Converting it to LLVM IR right at the start of compilation
| (when the initial IR is being generated) is much simpler, since
| it then gets "compiled forward" with the rest of the code.
| It's as if you wrote C++/intrinsics/whatever instead.
|
| Note that nvcc accepts a different dialect of C++ from clang
| (and hence hipcc), so there is in fact more that separates CUDA
| from hip (at the language level) than just find/replace. We
| discuss this a little in [the manual](https://docs.scale-lang.com/manual/dialects/).
|
| Handling differences between the atomic models is, indeed,
| "fun". But since CUDA is a programming language with documented
| semantics for its memory consistency (and so is PTX) it is
| entirely possible to arrange for the compiler to "play by
| NVIDIA's rules".
| JonChesterfield wrote:
| Huh. Inline assembly is strongly associated in my mind with
| writing things that can't be represented in LLVM IR, but in
| the specific case of PTX - you can only write things that
| ptxas understands, and that probably rules out wide classes
| of horrendous behaviour. Raw bytes being used for
| instructions and for data, ad hoc self modifying code and so
| forth.
|
| I believe nvcc is roughly an antique clang build hacked out
| of all recognition. I remember it rejecting templates with
| 'I' as the type name and working when changing to 'T',
| nonsense like that. The HIP language probably corresponds
| pretty closely to clang's cuda implementation in terms of
| semantics (a lot of the control flow in clang treats them
| identically), but I don't believe an exact match to nvcc was
| considered particularly necessary for the clang -x cuda work.
|
| The ptx to llvm IR approach is clever. I think upstream would
| be game for that, feel free to tag me on reviews if you want
| to get that divergence out of your local codebase.
| ur-whale wrote:
| If this actually works (remains to be seen), I can only say:
| 1) Kudos 2) Finally!
| gedy wrote:
| or: 1) CUDAs
| anthonix1 wrote:
| I just tried it with llm.c ... seems to be missing quite a few
| key components such as cublaslt, bfloat16 support, nvtx3,
| compiler flags such as -t
|
| And it's linked against an old release of ROCm.
|
| So unclear to me how it is supposed to be an improvement over
| something like hipify
| ckitching wrote:
| Greetings, I work on SCALE.
|
| It appears we implemented `--threads` but not `-t` for the
| compiler flag. Oops. In either case, the flag has no effect
| at present, since fatbinary support is still in development,
| and that's the only part of the process that could
| conceivably be parallelised.
|
| _That said_ : clang (and hence the SCALE compiler) tends to
| compile CUDA much faster than nvcc does, so this lack of the
| parallelism feature is less problematic than it might at
| first seem.
|
| NVTX support (if you want more than just "no-ops to make the
| code compile") requires cooperation with the authors of
| profilers etc., which has not so far been available.
|
| bfloat16 is not properly supported by AMD anyway: the
| hardware doesn't do it, and HIP's implementation just lies
| and does the math in `float`. For that reason we haven't
| prioritised putting together the API.
|
| cublasLt is a fair cop. We've got a ticket :D.
| anthonix1 wrote:
| Hi, why do you believe that bfloat16 is not supported? Can
| you please provide some references (specifically the part
| about the hardware "doesn't do it")?
|
| For the hardware you are focussing on (gfx11), the
| reference manual [2] and the list of LLVM gfx11
| instructions supported [1] describe the bfloat16 vdot &
| WMMA operations, and these are in fact implemented and
| working in various software such as composable kernels and
| rocBLAS, which I have used (and can guarantee they are not
| simply being run as float). I've also used these in the AMD
| fork of llm.c [3]
|
| Outside of gfx11, I have also used bfloat16 in CDNA2 & 3
| devices, and they are working and being supported.
|
| Regarding cublasLt, what is your plan for support there?
| Pass everything through to hipblasLt (hipify style) or
| something else?
|
| Cheers, -A
|
| [1] https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX11.html
| [2] https://www.amd.com/content/dam/amd/en/documents/radeon-tech...
| [3] http://github.com/anthonix/llm.c
| ashvardanian wrote:
| It's great that there is a page about current limitations [1],
| but I am afraid that what most people describe as "CUDA" is a
| small subset of the real CUDA functionality. Would be great to
| have a comparison table for advanced features like warp shuffles,
| atomics, DPX, TMA, MMA, etc. Ideally a table, mapping every PTX
| instruction to a direct RDNA counterpart or a list of
| instructions used to emulate it.
|
| [1]: https://docs.scale-lang.com/manual/differences/
| ckitching wrote:
| You're right that most people only use a small subset of cuda:
| we prioritised support for features based on what was needed for
| various open-source projects, as a way to try to capture the
| most common things first.
|
| A complete API comparison table is coming soon, I believe. :D
|
| In a nutshell:
|
| - DPX: Yes.
| - Shuffles: Yes. Including the PTX versions, with all their
| weird/wacky/insane arguments.
| - Atomics: Yes, except the 128-bit atomics nvidia added very
| recently.
| - MMA: In development, though of course we can't fix the
| fact that nvidia's hardware in this area is just _better_
| than AMD's, so don't expect performance to be as good in
| all cases.
| - TMA: On the same branch as MMA, though it'll just be using
| AMD's async copy instructions.
|
| > mapping every PTX instruction to a direct RDNA counterpart or
| a list of instructions used to emulate it.
|
| We plan to publish a compatibility table of which instructions
| are supported, but a list of the instructions used to produce
| each PTX instruction is not in general meaningful. The inline
| PTX handler works by converting the PTX block to LLVM IR at the
| start of compilation (at the same time the rest of your code
| gets turned into IR), so it then "compiles forward" with the
| rest of the program. As a result, the actual instructions
| chosen vary on a case-by-case basis due to the whims of the
| optimiser. This design in principle produces better performance
| than a hypothetical solution that turned PTX asm into AMD asm,
| because it conveniently eliminates the optimisation barrier an
| asm block typically represents. Care, of course, is taken to
| handle the wacky memory consistency concerns that this implies!
|
| We're documenting which ones are expected to perform worse than
| on NVIDIA, though!
| qwerty456127 wrote:
| > gfx1030, gfx1100, gfx1010, gfx1101, gfx900...
|
| How do I find out which one I have?
| ckitching wrote:
| Like this:
|
| https://docs.scale-lang.com/manual/how-to-use/#identifying-g...
| galaxyLogic wrote:
| Companies selling CUDA software should no doubt adopt this tool
| yieldcrv wrote:
| the real question here is whether anybody has gotten cheap,
| easily available AMD GPUs to run their AI workloads, and if we
| can predict more people will do so
| EGreg wrote:
| But the question is, can it also run SHUDA and WUDA?
___________________________________________________________________
(page generated 2024-07-15 23:00 UTC)