[HN Gopher] Run CUDA, unmodified, on AMD GPUs
___________________________________________________________________
Run CUDA, unmodified, on AMD GPUs
Author : Straw
Score : 1147 points
Date : 2024-07-15 19:05 UTC (1 day ago)
(HTM) web link (docs.scale-lang.com)
(TXT) w3m dump (docs.scale-lang.com)
| dagmx wrote:
| Has anyone tried this and knows how well it works? It definitely
| sounds very compelling
| arjvik wrote:
| Who is this Spectral Compute, and where can we see more about
| them?
| msond wrote:
| You can learn more about us on https://spectralcompute.co.uk
| JonChesterfield wrote:
| The branch free regex engine is an interesting idea. I would
| have said that can't be implemented in finite code.
|
| Compile to DFA by repeatedly differentiating then unroll the
| machine? You'd still have back edges for the repeating
| sections.
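|
| A minimal sketch of that "repeated differentiation" idea
| (Brzozowski derivatives over a toy regex type; memoizing the
| derivatives up to similarity is what yields the finite DFA).
| Purely illustrative, not SCALE's implementation:
|
|   #include <memory>
|   #include <string>
|   #include <iostream>
|
|   struct Re;
|   using ReP = std::shared_ptr<Re>;
|
|   struct Re {
|       enum Kind { Empty, Eps, Chr, Cat, Alt, Star } kind;
|       char c{};
|       ReP a, b;
|   };
|
|   ReP mk(Re::Kind k, char c = 0, ReP a = nullptr, ReP b = nullptr) {
|       auto r = std::make_shared<Re>();
|       r->kind = k; r->c = c; r->a = a; r->b = b;
|       return r;
|   }
|
|   // Does r accept the empty string?
|   bool nullable(const ReP& r) {
|       switch (r->kind) {
|           case Re::Eps: case Re::Star: return true;
|           case Re::Cat: return nullable(r->a) && nullable(r->b);
|           case Re::Alt: return nullable(r->a) || nullable(r->b);
|           default:      return false;       // Empty, Chr
|       }
|   }
|
|   // Brzozowski derivative of r with respect to character x.
|   ReP deriv(const ReP& r, char x) {
|       switch (r->kind) {
|           case Re::Chr:
|               return r->c == x ? mk(Re::Eps) : mk(Re::Empty);
|           case Re::Cat: {
|               ReP left = mk(Re::Cat, 0, deriv(r->a, x), r->b);
|               return nullable(r->a)
|                   ? mk(Re::Alt, 0, left, deriv(r->b, x)) : left;
|           }
|           case Re::Alt:
|               return mk(Re::Alt, 0, deriv(r->a, x), deriv(r->b, x));
|           case Re::Star:
|               return mk(Re::Cat, 0, deriv(r->a, x), r);
|           default:                          // Empty, Eps
|               return mk(Re::Empty);
|       }
|   }
|
|   // Matching: differentiate per character, then test nullability.
|   bool match(ReP r, const std::string& s) {
|       for (char x : s) r = deriv(r, x);
|       return nullable(r);
|   }
|
|   int main() {
|       // (a|b)*c
|       ReP re = mk(Re::Cat, 0,
|                   mk(Re::Star, 0, mk(Re::Alt, 0, mk(Re::Chr, 'a'),
|                                      mk(Re::Chr, 'b'))),
|                   mk(Re::Chr, 'c'));
|       std::cout << match(re, "ababc") << match(re, "abca") << "\n"; // 10
|   }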
| pixelpoet wrote:
| Isn't this a bit legally dubious, like ZLUDA?
| janice1999 wrote:
| It's advertised as a "clean room" re-implementation. What part
| would be illegal?
| ekelsen wrote:
| If they had to reverse engineer any compiled code to do this,
| I think that would be against licenses they had to agree to?
|
| At least grounds for suing and starting an extensive
| discovery process and possibly a costly injunction...
| msond wrote:
| We have not reverse engineered any compiled code in the
| process of developing SCALE.
|
| It was clean-room implemented purely from the API surface
| and by trial-and-error with open CUDA code.
| RockRobotRock wrote:
| Isn't that exactly what a "clean room" approach avoids?
| ekelsen wrote:
| oh definitely. But if I was NVIDIA I'd want to verify
| that in court after discovery rather than relying on
| their claim on a website.
| RockRobotRock wrote:
| good point
| ekelsen wrote:
| FWIW, I think this is really great work and I wish only
| the best for scale. Super impressed.
| Keyframe wrote:
| Can't run useful shit on it: https://docs.nvidia.com/deeplear
| ning/cudnn/latest/reference/...
|
| Namely:
|
| "4.1 License Scope. The SDK is licensed for you to develop
| applications only for use in systems with NVIDIA GPUs."
| mkl wrote:
| So add a cheap NVidia card alongside grunty AMD ones, and
| check for its existence. It doesn't seem to say it needs to
| run on NVidia GPUs.
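|
| For what it's worth, the "check for its existence" part is a
| one-liner against the public CUDA runtime API (this sketch only
| illustrates the detection; whether it satisfies the license
| language above is a question for lawyers, not code):
|
|   #include <cstdio>
|   #include <cuda_runtime.h>
|
|   // True if the CUDA runtime can see at least one NVIDIA GPU.
|   bool nvidia_gpu_present() {
|       int count = 0;
|       cudaError_t err = cudaGetDeviceCount(&count);
|       return err == cudaSuccess && count > 0;
|   }
|
|   int main() {
|       std::printf("NVIDIA GPU present: %s\n",
|                   nvidia_gpu_present() ? "yes" : "no");
|       return 0;
|   }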
| Keyframe wrote:
| Heh, true. On the other hand, I bet companies are eager
| to challenge the wrath of a $3T company for a promise of
| "maybe it'll work, not all of it but at least it'll run
| worse, at least for now".
| JonChesterfield wrote:
| I don't think the terms of the Nvidia SDK can restrict
| running software without said SDK. Nvidia's libraries don't
| seem to be involved here. Their hardware isn't involved
| either. It's just some ascii in a bunch of text files being
| hacked around with before running on someone else's
| hardware.
| adzm wrote:
| I'd love to see some benchmarks but this is something the market
| has been yearning for.
| msond wrote:
| We're putting together benchmarks to publish at a later time,
| and we've also asked some independent third parties to produce
| their own.
| acheong08 wrote:
| Impressive if true. Unfortunately not open source and scarce on
| exact details on how it works
|
| Edit: not sure why I just sort of expect projects to be open
| source or at least source available these days.
| tempaccount420 wrote:
| They might be hoping to be acquired by AMD
| ipsum2 wrote:
| They're using Docusaurus[1] for their website, which is most
| commonly used with open source projects.
|
| https://docusaurus.io/docs
| msond wrote:
| Actually, we use mkdocs and the excellent material for mkdocs
| theme: https://squidfunk.github.io/mkdocs-material/
| msond wrote:
| We're going to be publishing more details on later blog posts
| and documentation about how this works and how we've built it.
|
| Yes, we're not open source, however our license is very
| permissive. It's both in the software distribution and viewable
| online at https://docs.scale-lang.com/licensing/
| breck wrote:
| How about trying _Early_ Source?
|
| It's open source with a long delay, but paying users get the
| latest updates.
|
| Make the git repo from "today - N years" open source, where N
| is something like 1 or 2.
|
| That way, students can learn on old versions, and when they
| grow into professionals they can pay for access to the
| cutting-edge builds.
|
| Win win win win
|
| (https://breckyunits.com/earlySource.html)
| msond wrote:
| We're still thinking about our approach but this is a nice
| suggestion, thank you.
|
| I'm curious, for what reasons are you interested in the
| source code yourself?
| mindcrime wrote:
| I'm not the person you replied to, and I can't speak for
| them. But I can say that for myself, and a not small
| number of other people, it's an ideological issue. I
| simply do not use software that isn't F/OSS - to the
| greatest extent that that is possible. For me, I might
| use a VERY small amount of non F/OSS stuff, but it's very
| hard to get me to adopt something new if it isn't.
|
| Now should you make business decisions based on that?
| Probably not. But while I don't claim to be a
| representative sample, I am pretty sure the number of
| people who share my beliefs in this regard is
| substantially "non zero". _shrug_
| atq2119 wrote:
| Not GP, but guaranteed source availability means users
| can fix issues themselves in the future if the original
| provider goes belly-up.
| breck wrote:
| > I'm curious, for what reasons are you interested in the
| source code yourself?
|
| I am the founder/editor of PLDB. So I try to do my best
| to help people "build the next great programming
| language".
|
| We clone the git repos of over 1,000 compilers and
| interpreters and use cloc to determine what languages the
| people who are building languages are using. The people
| who build languages obviously are the experts, so as
| they go, so goes the world.
|
| We call this measurement "Foundation Score". A Foundation
| Score of 100 means 100 other languages use this language
| somehow in their primary implementation.
|
| It is utterly dominated by open source languages, and the
| disparity is only getting more extreme.
|
| You can see for yourself here:
|
| https://pldb.io/lists/explorer.html#columns=rank~name~id~
| app...
|
| Some that might have become irrelevant have gained a
| second wind after going open source.
|
| But some keep falling further behind.
|
| I look at Mathematica, a very powerful and amazing
| language, and it makes me sad to see so few other
| language designers using it, and the reason is that
| it's closed source. So they are not doing so hot, and
| that's a language from one of our world's smartest and
| most prolific thinkers that's been around for decades.
|
| I don't see a way for a new language to catch on nowadays
| that is not open source.
| msond wrote:
| Very interesting, thank you for sharing!
|
| We do believe in open source software and we do want to
| move the GPGPU market away from fully closed languages.
| The future is open for discussion but regardless, the
| status quo at the moment is a proprietary and dominant
| implementation which only supports a single vendor.
|
| > I don't see a way for a new language to catch on
| nowadays that is not open source.
|
| I do note that CUDA is itself closed source -- while
| there's an open source implementation in the LLVM
| project, it is not as bleeding edge as NVIDIA's own.
| breck wrote:
| > I do note that CUDA is itself closed source
|
| And this is a good point. However, it also has a 17 year
| head start, and many of those years were spent developing
| before people realized what a huge market there was.
|
| All it will take is one committed genius to create an
| open source alternative to CUDA to dethrone it.
|
| But they would have to have some Mojo (hint hint) to pull
| that off.
| idonotknowwhy wrote:
| I'm a big fan of open source for most things but if what
| you've got actually works, you could probably earn big
| money selling it. The biggest companies in the world are
| building / using this sort of thing.
|
| Imagine the shift of capital if, for example, Intel GPUs
| suddenly had the same ML software compatibility as Nvidia.
| TaylorAlexander wrote:
| Makes sense to expect this kind of thing to be open source. The
| whole point of providing improved compatibility is to make
| people's lives easier, and open source is usually an important
| feature to ensure wide compatibility. It also means projects
| can live on after the creators move to other things, people can
| submit patches for important features or bug fixes, and
| generally makes the system much more useful.
| dylan604 wrote:
| I don't find it wrong for someone to attempt to make money
| back on their time and experience of doing the work. I don't
| mind people that offer that back as open source either.
| However, I do have a problem with people expecting everything
| to be open/free, especially those that then go on a crusade
| chastising those that do try to make money.
| TaylorAlexander wrote:
| I'm really trying to keep this about the engineering
| features of a system rather than moral judgments. Open
| source systems are simply more flexible and adaptable than
| proprietary systems, which have their own benefits. In
| today's world, the engineering value of open source systems
| is becoming so important that people are looking for other
| ways to provide for the developers creating these systems.
| It can be surprising when a project creator builds
| something in an area that is usually all open source, but
| they choose a proprietary path. Just look at the problems
| created by NVIDIA for their use of proprietary software in
| CUDA and their GPUs. This software is an attempt to fix
| issues created by proprietary software with another piece
| of proprietary software, which is if nothing else an
| interesting decision.
| dylan604 wrote:
| UNIX wasn't free. Windows wasn't free. It wasn't until
| some knucklehead came along and did something abnormal
| and gave away their thing. Bakers don't give away their
| goods. Mechanics don't typically repair things for free.
| Builders don't build things for free. Gas stations don't
| give away gas.
|
| Why do we think all software should be free, and then
| think that those that don't give it away are the abnormal
| ones?
| talldayo wrote:
| > Why do we think all software should be free
|
| Why do people return Windows laptops when they have to
| pay for a Windows License Activation? Because every
| single OEM pays for it; you don't _expect_ to buy Windows
| because it is a failed B2C business model. Nobody wants
| it. Same goes for proprietary UNIX, and people _wish_ it
| was the case for Nvidia drivers. I own CUDA hardware and
| lament the fact that cross-industry GPGPU died so FAANG
| could sell licensed AI SDKs. The only thing stopping AI
| from being "free" is the limitations OEMs impose on
| their hardware.
|
| > that those that don't give it away are the abnormal
| ones?
|
| They are. Admit it; the internet is the new normal, if
| your software isn't as "free" as opening a website,
| you're weird. If I have to pay to access your little
| forum, I won't use it. If I have to buy your app to see
| what it's like, I'll never know what you're offering.
| Part of what makes Nvidia's business model so successful
| is that they _do_ "give away" CUDA to anyone that owns
| their hardware. There is no developer fee or mandatory
| licensing cost, it is plug-and-play with the hardware.
| Same goes for OpenAI, they'd have never succeeded if you
| had to buy "the ChatGPT App" from your App Store.
| dylan604 wrote:
| > Why do people return Windows laptops when they have to
| pay for a Windows License Activation?
|
| The internet echo chamber strikes again. Exactly how many
| people are actually doing this? Not many, and those that
| do all hang out together. The rest of the world just
| blindly goes about their day using Windows while surfing
| the web using Chrome. Sometimes, it's a good thing to get
| outside your bubble. It's a big world out there, and not
| everybody sees the world as you do
| talldayo wrote:
| > The rest of the world just blindly goes about their day
| using Windows while surfing the web using Chrome.
|
| Paying for Windows? I think you missed my point. If your
| computer doesn't ship with an OS, paid or otherwise,
| people think it's a glitch. The average consumer will
| sooner return their laptop than buy a license of
| Windows, create install media from their old device,
| and flash the new hardware with a purchased license.
| They'll get a Chromebook instead, people don't _buy_
| Windows today.
|
| The internet has conditioned the majority of modern
| technology users to reject and habitually avoid non-free
| experiences. Ad-enabled free platforms and their
| pervasive success is all the evidence you need.
| Commercial software as it existed 20 or 30 years ago is a
| dead business. Free reigns supreme.
| dylan604 wrote:
| Who/where/how does someone buy a laptop without an OS?
| I'm just not able to follow down this hypothetical path
| that you are insisting on blazing
| hamilyon2 wrote:
| That is kind of his point. You don't; Windows is bundled
| with the laptop. It's not that I agree with his points.
| Windows, for example, isn't open source in the remotest sense.
| dylan604 wrote:
| Dell offers laptops with a version of Linux preinstalled
| and supports them. System76, Lenovo, Purism as well to
| name a few. Apple also sells laptops without Windows on
| them. There are actually quite a few options that do
| this. If you don't want Windows, we have options now.
| Yes, historically, it was Windows or Apple's OS, but
| that's no longer true and not recognizing that just makes
| you look like you're pushing a false narrative on the
| situation for what purpose only you know.
| alt227 wrote:
| > Commercial software as it existed 20 or 30 years ago is
| a dead business. Free reigns supreme.
|
| What nonsense. Go into any business and you will find
| every single piece of software they use is bought and
| paid for with bells on. The 'Free World' you speak of is
| only there to get you, an individual, used to using the
| software so that businesses are made to purchase it. In
| the old days we called this 'demo' or 'shareware'. Now
| it's a 'free' or 'personal' tier subscription.
|
| Go and ask any designer if their copy of Adobe Creative
| Cloud, 3D Studio Max, or AutoCAD is free. Ask any office
| worker if Microsoft Office (including Teams and
| SharePoint etc.) or even Google Docs for business is. The
| majority of developers are running paid versions of
| JetBrains. Running an online shop? Chances are you are
| paying for Shopify software, or something like Zoho to
| manage your customers and orders.
|
| 'Free' as you put it is very much only in the online
| individual consumer world, a very small part of the
| software world.
|
| The commercial software market is more alive and
| expensive than it has ever been.
| TaylorAlexander wrote:
| > Bakers don't give away their goods. Mechanics don't
| typically repair things for free. Builders don't build
| things for free. Gas stations don't give away gas.
|
| These all have the property that they are scarce
| physical goods or services. Software is not scarce
| (though of course the labor to create it is), so this is
| a really bad comparison.
|
| And again I did not say it should or should not be free,
| I said there are engineering benefits to open source
| software and more and more people recognize those
| benefits and choose to make things free because they see
| the value and are willing to recognize the tradeoffs. I
| never said what "should" be done. "Should" is kind of a
| nonsense term when used in this way as it hides a lot of
| assumptions, so I generally do not use it, and notably
| did not use it in my comment. I want to point out the
| peculiarity in your rather strong response to a word and
| concept I never used. I think you are having an argument
| with imagined people, not a discussion with me.
|
| And for what it is worth, I am a robotics engineer and I
| am designing a completely open source solar powered
| farming robot designed to be made in a small shop in any
| city in the world (see my profile), funded by a wealthy
| robotics entrepreneur who recognizes the value in making
| this technology available to people all over the world.
|
| So I am one of those engineers making this choice, and
| not someone just asking for things without doing the same
| with my own work. Everything I produce is open source,
| including personal projects and even my personal writing.
| dTal wrote:
| Because software is information. It is closer to a
| scientific paper than a loaf of bread, and I do expect
| those to be free. I do not expect scientists to _work_
| for free, but the marginal cost of copying their output
| is 0 and the social benefit is huge.
|
| Free software, like open science, clearly has something
| going for it pragmatically. The developer hours put into
| it have paid for themselves magnitudes of times over.
| Megacorps hire people to work on free software. If you
| can't see the value, that's a you problem.
| acuozzo wrote:
| > the social benefit is huge
|
| It will be interesting to see if this is the case in the
| long run, assuming "huge" has a positive connotation in
| your post, of course.
|
| If AGI comes to pass and it winds up being a net negative
| for humanity, then the ethics of any practice which
| involves freely distributing information that can be
| endlessly copied for very little cost must be
| reevaluated.
| TaylorAlexander wrote:
| > If AGI comes to pass
|
| Increasingly, I am not putting much weight in any
| predictions about whether this will happen in the way we
| think it will, or what it could possibly mean. We might
| as well be talking about the rapture.
| voidUpdate wrote:
| If all software was free and made no money, how could
| developers pay their bills?
| TaylorAlexander wrote:
| Free software is so important to society that I believe
| the most reasonable solution is to provide for all people
| without requiring them to work for survival. Automate as much
| as possible such that work is not compulsory, and enough
| people simply want something to do (and possibly
| additional pay depending on how the system is arranged)
| that everything that needs to get done by people does get
| done.
|
| For now that is fiction, but so is "if all software was
| free". I do think though that both would lead to a faster
| rate of innovation in society versus one where critical
| information is withheld from society to pay someone's
| rent and food bills.
| einpoklum wrote:
| Most software is free and makes no money - and that has
| always been the case. There are some very popular and
| widely-used non-free systems, but most software isn't
| that, and its developers still pay the bills.
|
| This is somewhat analogous to music or books/literature.
| Most composers and performers and authors make no money
| from people copying and sharing their works. Some pay the
| bills working professionally for entities who want their
| product enough to pay for it; some do other things in
| life. Some indeed give up their work on music because
| they can't afford to not do more gainful work. And still,
| neither music nor books go away as copying them gets
| closer to being free.
| voidUpdate wrote:
| If my current employer can't make any money from the code
| we write, then it would collapse faster than a souffle
| taken out of the oven too early, and I would be out of a
| job
| einpoklum wrote:
| That does not contradict my point... also, there are
| other ways to make money from writing code than forcing
| people to pay for copies of that code.
| napoleongl wrote:
| OTOH, recipes and drawings are commonly available for
| free. So if you can support yourself, the cake and engine
| repair are free. But if you need support then you can get
| someone to bake or build for you.
| nicce wrote:
| > Makes sense to expect this kind of thing to be open source.
| The whole point of providing improved compatibility is to
| make people's lives easier, and open source is usually an
| important feature to ensure wide compatibility. It also means
| projects can live on after the creator
|
| AMD just bought a company working on similar things for
| more than $600M.
| dheera wrote:
| Also, can I even buy an AMD GPU? I don't see a "buy now" button
| or a PCIe version anywhere here
|
| https://www.amd.com/en/products/accelerators/instinct/mi300/...
|
| Another big AMD fuckup in my opinion. Nobody is going to drop
| millions on these things without being able to test them out
| first.
|
| First rule of sales: If you have something for sale, take my
| money.
| nwiswell wrote:
| > I don't see a "buy now" button or a PCIe version anywhere
| here
|
| "Buy now" buttons and online shopping carts are not generally
| how organizations looking to spend serious money on AI buy
| their hardware.
|
| They have a long list of server hardware partners, and odds
| are you'd already have an existing relationship with one or
| more of them, and they'd provide a quote.
|
| They even go one step further and show off some of their
| partners' solutions:
|
| https://www.amd.com/en/graphics/servers-instinct-deep-
| learni...
|
| FWIW I believe Supermicro and Exxact actually do have web-
| based shopping carts these days, so maybe you could skip the
| quotation and buy directly if you were so motivated? Seems
| kind of weird at this price point.
|
| https://www.exxactcorp.com/Exxact-TS4-185328443-E185328443
| dheera wrote:
| ... and that's why AMD is losing.
|
| They could break the trend and offer a "buy now" button
| instead of offering quotes and coffee chats. It's very
| likely that will kickstart the software snowball with early
| adopters.
|
| Nobody is going to drop millions on an unproven platform.
|
| > Seems kind of weird at this price point.
|
| Yeah that $234K server is too much for people to do a
| trial. It has 8xMI300X GPUs along with a bunch of other
| shit.
|
| Give me a single MI300X GPU in PCIe form factor for $20K
| and I'd very seriously consider. I'm sure there are many
| people who would help adapt the ecosystem if they were
| truly available.
| nwiswell wrote:
| Why would you be looking to dip your toe into the AMD
| ecosystem for the first time using an MI300X? It doesn't
| make any sense. It's not entry level hardware.
| dheera wrote:
| To help fix the ecosystem. It's way more affordable than
| Nvidia.
|
| I'm not looking for entry level hardware.
| nwiswell wrote:
| Yes, that's why you'd choose AMD, I'm saying that you
| don't enter the ecosystem for the first time by
| purchasing the absolute cutting edge hardware.
|
| As far as I'm aware you can't simply buy an Nvidia B200
| PCIe card over the counter, either.
| dheera wrote:
| I'm not looking to enter the ecosystem, I'm already deep
| in it and want to fix the AMD problem so that I can build
| big projects around it and undercut everyone who's using
| Nvidia.
|
| You can purchase H100 and A100 PCIe cards over the
| counter. They're great for compiling CUDA code, testing
| code before you launch a multi-node job into a cluster,
| and for running evaluations.
|
| AMD has nothing of the sort, and it's hurting them.
|
| I cannot blow 250K on an SMCI server, nor do I have the
| electricity setup for it. I _can_ blow 20K on a PCIe GPU
| and start contributing to the ecosystem, or maybe prove
| out an idea on one GPU before trying to raise millions
| from a VC to build a more cost-effective datacenter that
| actually works.
| nwiswell wrote:
| > AMD has nothing of the sort, and it's hurting them.
|
| What are you talking about? Have you looked?
|
| https://www.dell.com/en-us/shop/amd-mi210-300w-pcie-64gb-
| pas...
|
| https://www.bitworks.io/product/amd-instinct-mi210-64gb-
| hbm2...
| shaklee3 wrote:
| A 20k GPU will be passively cooled and you'll need a real
| server for that. Even the old MI210 another poster sent
| is passive.
| latchkey wrote:
| > _Give me a single MI300X GPU in PCIe form factor for
| $20K and I'd very seriously consider. I'm sure there are
| many people who would help adapt the ecosystem if they
| were truly available._
|
| I know this isn't what you're looking for entirely, but
| my business, Hot Aisle, is working on making MI300x
| available for rental. Our pricing isn't too crazy given
| that the GPU has 192GB and one week minimum isn't too
| bad. We will add on-demand hourly pricing as soon as we
| technically can.
|
| I'm also pushing hard on Dell and AMD to pre-purchase
| developer credits on our hardware, that we can then give
| away to people who want to "kick the tires".
|
| https://hotaisle.xyz/pricing/
| juujian wrote:
| I don't understand how AMD has messed up so badly that I feel
| like celebrating a project like this. Features of my laptop are
| just physically there but not usable, particularly in Linux. So
| frustrating.
| djbusby wrote:
| Same boat, AMD CPU but nothing else. I feel like a moderate
| improvement in their FOSS support and drivers would open new
| hardware revenue - to say nothing of the AI channel.
| ActorNightly wrote:
| I don't know if I would call it a mess up. AMD still has
| massive market in server chips, and their ARM stuff is on the
| horizon. We all assume that graphics cards are the way forward
| for ML, which may not be the case in the future.
|
| Nvidia were just ahead in this particular category due to CUDA,
| so AMD may have just let them run with it for now.
| jeroenhd wrote:
| AMD hardware works fine, the problem is that the major research
| projects everyone copies are all developed specifically for
| Nvidia.
|
| Now AMD is spinning up CUDA compatibility layer after CUDA
| compatibility layer. It's like trying to beat Windows by
| building another ReactOS/Wine. It's an approach doomed to fail
| unless AMD somehow manages to gain vastly more resources than
| the competition.
|
| Apple's NPU may not be very powerful, but many models have been
| altered specifically to run on them, making their NPUs vastly
| more useful than most equivalently powerful iGPUs. AMD doesn't
| have that just yet, they're always catching up.
|
| It'll be interesting to see what Qualcomm will do to get
| developers to make use of their NPUs on the new laptop chips.
| JonChesterfield wrote:
| Interesting analogy. The last few programs from the windows
| world I tried to run were flawless under wine and abjectly
| failed under windows 11.
| deliveryboyman wrote:
| Would like to see benchmarks for the applications in the test
| suite.
|
| E.g., how does Cycles compare on AMD vs Nvidia?
| Straw wrote:
| I worked for Spectral Compute a few years ago. Very smart and
| capable technical team.
|
| At the time, not only did they target AMD (with less
| compatibility than they have now), but also outperformed the
| default LLVM ptx backend, and even NVCC, when compiling for
| Nvidia GPUs!
| modeless wrote:
| A lot of people think AMD should support these translation layers
| but I think it's a bad idea. CUDA is not designed to be vendor
| agnostic and Nvidia can make things arbitrarily difficult both
| technically and legally. For example I think it would be against
| the license agreement of cuDNN or cuBLAS to run them on this. So
| those and other Nvidia libraries would become part of the API
| boundary that AMD would need to reimplement and support.
|
| Chasing bug-for-bug compatibility is a fool's errand. The
| important users of CUDA are open source. AMD can implement
| support directly in the upstream projects like pytorch or
| llama.cpp. And once support is there it can be maintained by the
| community.
| DeepYogurt wrote:
| Ya, honestly better to leave that to third parties who can
| dedicate themselves to it and maybe offer support or whatever.
| Let AMD work on good first party support first.
| fngjdflmdflg wrote:
| >Nvidia can make things arbitrarily difficult both technically
| and legally.
|
| I disagree. AMD can simply not implement those APIs, similar to
| how game emulators implement the most used APIs first and
| sometimes never bother implementing obscure ones. It would only
| matter that NVIDIA added eg. patented APIs to CUDA if those
| APIs were useful. In which case AMD should have a way to do
| them anyway. Unless NVIDIA comes up with a new patented API
| which is both useful and impossible to implement in any other
| way, which would be bad for AMD in any event. On the other
| hand, if AMD start supporting CUDA and people start using AMD
| cards, then developers will be hesitant to use APIs that only
| work on NVIDIA cards. Right now they are losing billions of
| dollars on this. Then again they barely seem capable of
| supporting ROCm on their cards, much less CUDA.
|
| You have a fair point in terms of cuDNN and cuBLAS but I don't
| know that that kind of ToS is actually binding.
| selimnairb wrote:
| Patented API? I thought Google v. Oracle settled this? Making
| an implementation of an API spec is fair use, is it not?
| fngjdflmdflg wrote:
| My understanding is that Google v. Oracle only applies to
| copyright.
| nl wrote:
| Well you can't patent an API so....
| fngjdflmdflg wrote:
| You can patent the implementation. You can't patent the
| API name DecodeH265Video() but you can still sue someone
| for implementing that function correctly.
| anticensor wrote:
| If there is only one way to solve a problem, there is
| nothing to invent, just discover, and discoveries are
| decidedly not patentable.
| dietr1ch wrote:
| How's this situation different than the one around Java,
| Sun/Oracle and Google?
| dboreham wrote:
| The judge might not be a coder next time.
| viraptor wrote:
| US law is highly dependent on precedents. The Google-
| Oracle case has fortunately set one, so anything following
| it won't start from scratch. We may not need a coder judge
| next time.
| jjk166 wrote:
| Google-Oracle side stepped the issue of API
| copyrightability by saying Google's particular
| implementation would fall under fair use. Whether APIs
| are copyrightable remains an open question.
| dylan604 wrote:
| Until you get an activist court
| blitzar wrote:
| It would be good if AMD did something, anything.
|
| Support this, reimplement that, support upstream efforts, don't
| really care. Any of those would cost a couple of million and be
| worth a trillion dollars to AMD shareholders.
| oezi wrote:
| A couple of million doesn't get you anything in corporate
| land
| spacebanana7 wrote:
| A couple dozen billion for a 10% chance of becoming NVIDIA
| competitive is worth it, looking at the stock prices.
| oezi wrote:
| Billions. Now we are talking.
| slashdave wrote:
| ROCm counts as "something"
| curt15 wrote:
| Pretty much any modern NVIDIA GPU supports CUDA. You don't
| have to buy a datacenter-class unit to get your feet wet
| with CUDA programming. ROCm will count as "something" when
| the same is true for AMD GPUs.
| muxr wrote:
| I don't think AMD needs to support 5+ year old GPUs
| personally. And all the recent generations are already
| practically supported.
|
| AMD only claims support for a select few GPUs, but in my
| testing I find all the GPUs work fine if the architecture
| is supported. I've tested rx6600, rx6700xt for example
| and even though they aren't officially supported, they
| work fine on ROCm.
| Dylan16807 wrote:
| > 5+ year old GPUs
|
| AMD had a big architecture switchover _exactly_ 5 years
| ago, and the full launch wasn't over until 4.5 years
| ago. I think that generation _should_ have full support.
| Especially because it's not like they're cutting support
| now. They didn't support it at launch, and they didn't
| support it after 1, 2, 3, 4 years either.
|
| The other way to look at things, I'd say that for a mid
| to high tier GPU to be obsolete based on performance, the
| replacement model needs to be over twice as fast. 7700XT
| is just over 50% faster than 5700XT.
| imtringued wrote:
| I'm on a 5+ year old GPU, because I don't trust AMD to
| offer a compelling GPU that actually works. An RX 570
| is good enough for the little gaming I do. It mostly acts
| as an oversized iGPU that has good Linux drivers, but
| since AMD is not supporting ROCm on this GPU, there is no
| need to hurry on upgrading to a better GPU or to get my
| feet wet on running things locally on the GPU like Stable
| Diffusion, LLMs, etc.
| jacoblambda wrote:
| ROCm supports current gen consumer GPUs officially and a
| decent chunk of recent gen consumer GPUs unofficially.
| Not all of them of course but a decent chunk.
|
| It's not ideal but I'm pretty sure CUDA didn't support
| everything from day 1. And ROCm is part of AMD's vendor
| portion of the Windows AI stack, so from the upcoming gen
| on out, basically anything that outputs video should support
| ROCm.
| ChoGGi wrote:
| No, but CUDA at least supported the 8800 GT on release
| [1]. ROCm didn't support any consumer cards on release,
| looks like they didn't support any till last year? [2]
|
| [1]https://www.gamesindustry.biz/nvidia-unveils-cuda-the-
| gpu-co...
|
| [2]https://www.tomshardware.com/news/amd-rocm-comes-to-
| windows-...
| squidgyhead wrote:
| Here is the support list:
|
| https://rocm.docs.amd.com/projects/install-on-
| linux/en/lates...
| mappu wrote:
| AMD's definition of "support" I think is different than
| what people expect, and pretty misleading - ROCm itself
| will run on almost anything, back as far as the RX
| 400/500 series:
|
| https://en.wikipedia.org/wiki/ROCm#:~:text=GCN%205%20%2D%
| 20V...
|
| Stable Diffusion ran fine for me on RX 570 and RX 6600XT
| with nothing but distro packages.
| imtringued wrote:
| I don't buy it. Even running things like llama.cpp on my
| RX 570 via Vulkan crashes the entire system.
| Nab443 wrote:
| The last time I checked, I was stuck with a pretty old
| kernel if I wanted to have the latest version of ROCm
| available for my RX 470. It was compatible at some point in
| time, but not kept compatible with recent kernels.
| mappu wrote:
| It's the responsibility of your distro to ship things
| that work together.
| slavik81 wrote:
| There are out-of-bounds writes in the BLAS libraries for
| gfx803 GPUs (such as the RX 570). That hardware might
| work fine for your use case, but there's a lot of
| failures in the test suites.
|
| I agree that the official support list is very
| conservative, but I wouldn't recommend pre-Vega GPUs for
| use with ROCm. Stick to gfx900 and newer, if you can.
| bavell wrote:
| Huh? I've been running ROCm for SD and LLMs for over a
| year and a half on my puny consumer 6750X - not even
| latest gen.
| slashdave wrote:
| AMD should focus their efforts on competitive hardware
| offerings, because that is where the need and the money
| is. Sorry, I don't think the hobbyist should be a
| priority.
| chatmasta wrote:
| Is it weird how the comments here are blaming AMD and not
| Nvidia? Sure, the obvious argument is that Nvidia has no
| practical motivation to build an open platform. But there are
| counterexamples that suggest otherwise (Android). And there
| is a compelling argument that long term, their proprietary
| firmware layer will become an insufficient moat to their
| hardware dominance.
|
| Who's the root cause? The company with the dominant platform
| that refuses to open it up, or the competitor who can't catch
| up because they're running so far behind? Even if AMD made
| their own version of CUDA that was better in every way, it
| still wouldn't gain adoption because CUDA has become the
| standard. No matter what they do, they'll need to have a
| compatibility layer. And in that case maybe it makes sense
| for them to invest in the best one that emerges from the
| community.
| lmm wrote:
| > Is it weird how the comments here are blaming AMD and not
| Nvidia?
|
| Nvidia has put in the legwork and is reaping the rewards.
| They've worked closely with the people who are actually
| using their stuff, funding development and giving loads of
| support to researchers, teachers and so on, for probably a
| decade now. Why should they give all that away?
|
| > But there are counterexamples that suggest otherwise
| (Android).
|
| How is Android a counterexample? Google makes no money off
| of it, nor does anyone else. Google keeps Android open so
| that Apple can't move everyone onto their ad platform, so
| it's worth it for them as a strategic move, but Nvidia has
| no such motive.
|
| > Even if AMD made their own version of CUDA that was
| better in every way, it still wouldn't gain adoption
| because CUDA has become the standard.
|
| Maybe. But again, that's because NVidia has been putting in
| the work to make something better for a decade or more. The
| best time for AMD to start actually trying was 10 years
| ago; the second-best time is today.
| Zambyte wrote:
| > Google makes no money off of it, nor does anyone else
|
| Google makes no money off of Android? That seems like a
| really weird claim to make. Do you really think Google
| would be anywhere near as valuable of a company if iOS
| had all of the market share that the data vacuum that is
| Android has? I can't imagine that being the case.
|
| Google makes a boatload off of Android, just like AMD
| would if they supported open GPGPU efforts aggressively.
| michaelt wrote:
| Google gave away the software platform - Android - to
| hardware vendors for free, vendors compete making the
| hardware into cheap, low-margin commodity items, and
| google makes boatloads of money from ads, tracking and
| the app store.
|
| nvidia _could_ give away the software platform - CUDA -
| to hardware vendors for free, making the hardware into
| cheap, low-margin commodity items. But how would they
| make boatloads of money when there 's nowhere to put ads,
| tracking or an app store?
| rjurney wrote:
| Android is a complement to Google's business, which is
| when open source works. What would be the complement
| worth $1 Trillion to NVIDIA to build a truly open
| platform? There isn't one. That was his point.
| chatmasta wrote:
| There's an entire derivative industry of GPUs, namely
| GenAI and LLM providers, that could be the "complement"
| to an open GPU platform. The exact design and interface
| between such a complement and platform is yet undefined,
| but I'm sure there are creative approaches to this
| problem.
| rjurney wrote:
| And NVIDIA is playing in that game too. Why would they
| not play in higher level services as well? They already
| publish the source to their entire software stack. A
| comparison to Android is completely useless. Google is a
| multi-sided platform that does lots of things for free
| for some people (web users, Android users) so it can
| charge other people for their data (ad buyers). That
| isn't the chip business whatsoever. The original comment
| only makes sense if you know nothing about their
| respective business models.
| chatmasta wrote:
| Yes, so when the ground inevitably shifts below their
| feet (it might happen years from now, but it _will_
| happen - open platforms always emerge and eventually
| proliferate), wouldn't it be better for them to own that
| platform?
|
| On the other hand, they could always wait for the most
| viable threat to emerge and then pay a few billion
| dollars to acquire it and own its direction. Google
| didn't invent Android, after all...
|
| > Google is a multi-sided platform that does lots of
| things for free for some people... That isn't the chip
| business whatsoever.
|
| This is a reductionist differentiation that overlooks the
| similarities between the platforms of "mobile" and "GPU"
| (and also mischaracterizes the business model of Google,
| who does in fact make money directly from Android sales,
| and even moved all the way down the stack to selling
| hardware). In fact there is even a potentially direct
| analogy between the two platforms: LLM is the top of the
| stack with GPU on the bottom, just like Advertising is
| the top of the stack with Mobile on the bottom.
|
| Yes, Google's top level money printer is advertising, and
| everything they do (including Android) is about
| controlling the maximum number of layers below that money
| printer. But that doesn't mean there is no benefit to
| Nvidia doing the same. They might approach it
| differently, since they currently own the bottom layer
| whereas Google started from the top layer. But the end
| result of controlling the whole stack will lead to the
| same benefits.
|
| And you even admit in your comment that Nvidia is
| investing in these higher levels. My argument is that
| they are jeopardizing the longevity of these high-level
| investments due to their reluctance to invest in an open
| platform at the bottom layer (not even the bottom, but
| one level above their hardware). This will leave them
| vulnerable to encroachment by a player that comes from a
| higher level, like OpenAI for example, who gets to define
| the open platform before Nvidia ever has a chance to own
| it.
| roenxi wrote:
| > Is it weird how the comments here are blaming AMD and not
| Nvidia?
|
| Not even a little bit. It simply isn't Nvidia's job to
| provide competitive alternatives to Nvidia. Competing is
| something AMD must take responsibility for.
|
| The only reason CUDA is such a big talking point is because
| AMD tripped over their own feet supporting accelerated BLAS
| on AMD GPUs. Realistically it probably is hard to implement
| (AMD have a lot of competent people on staff) but Nvidia
| hasn't done anything unfair apart from execute so well that
| they make all the alternatives look bad.
| jkmcf wrote:
| I agree with you, but replace NVIDIA with Apple. What
| would the EU say?
| LtWorf wrote:
| I don't think nvidia bans anyone from running code on
| their devices.
| padthai wrote:
| They do from time to time:
| https://wirelesswire.jp/2017/12/62708/
| kbolino wrote:
| This seems to be more about certain devices (consumer-
| grade GPUs) in certain settings (data centers), though I
| do question how enforceable it actually is. My guess is
| that it can only apply when you try to get discounts from
| bulk-ordering GPUs.
|
| Also, was there any followup to this story? It seems a
| bit unnecessary because nVidia has already neutered
| consumer cards for many/most data center purposes by not
| using ECC and by providing so few FP64 units that double
| precision FLOPS is barely better than CPU SIMD.
| paulmd wrote:
| it's also not really a thing anymore because of the open
| kernel driver... at that point it's just MIT licensed.
|
| of course people continued to melt down about that for
| some reason too, in the customary "nothing is ever libre
| enough!" circular firing squad. Just like streamline etc.
|
| There's a really shitty strain of fanboy thought that
| wants libre software to be actively worsened (even
| stonewalled by the kernel team if necessary) so that they
| can continue to argue against nvidia as a bad actor that
| doesn't play nicely with open source. You saw it with all
| these things but especially with the open kernel driver,
| people were really happy it didn't get upstreamed. Shitty
| behavior all around.
|
| You see it every time someone quotes Linus Torvalds on
| the issue. Some slight from 2006 is more important than
| users having good, open drivers upstreamed. Some petty
| brand preferences are legitimately far more important than
| working with and bringing that vendor into the fold long-
| term, _for a large number of people_. Most of whom don't
| even consider themselves fanboys! They just say all the
| things a fanboy would say, and act all the ways a fanboy
| would act...
| Zambyte wrote:
| https://www.pcgamer.com/nvidia-officially-confirms-hash-
| rate...
|
| Also: look into why the Nouveau driver performance is
| limited.
| paulmd wrote:
| so terrible that vendors can enforce these proprietary
| licenses on software they paid to develop /s
| Zambyte wrote:
| Huh? Why the sarcasm? You think it's a good thing that
| someone besides the person who owns the hardware has the
| final say on what the hardware is allowed to be used for?
| nemothekid wrote:
| > _Is it weird how the comments here are blaming AMD and
| not Nvidia?_
|
| It's not. Even as it is, I do not trust HIP or ROCm to be a
| viable alternative to Cuda. George Hotz did plenty of work
| trying to port various ML architectures to AMD and was met
| with countless driver bugs. The problem isn't nvidia won't
| build an open platform - the problem is AMD won't invest in
| a competitive platform. 99% of ML engineers do not write
| CUDA. For the vast majority of workloads, there are
| probably 20 engineers at Meta who write the Cuda backend
| for Pytorch that every other engineer uses. Meta could hire
| another 20 engineers to support whatever AMD has (they did,
| and it's not as robust as CUDA).
|
| Even if CUDA was open - do you expect nvidia to also write
| drivers for AMD? I don't believe 3rd parties will get
| anywhere writing "compatibility layers" because AMD's own
| GPUs aren't optimized or tested for CUDA-like workloads.
| pjmlp wrote:
| Khronos, AMD and Intel have had 15 years to make something
| out of OpenCL that could rival CUDA.
|
| Instead they managed 15 years of disappointment, in a
| standard stuck in C99, that adopted C++ and a polyglot
| bytecode too late to matter, and never produced an
| ecosystem of IDE tooling and GPU libraries.
|
| Naturally CUDA became the standard, when NVIDIA provided
| what the GPU community cared about.
| whywhywhywhy wrote:
| >Is it weird how the comments here are blaming AMD and not
| Nvidia?
|
| Because it IS AMD/Apple/etc.'s fault for the position they're
| in right now. CUDA showed where the world was heading and
| where the gains in compute would be made well over a decade
| ago now.
|
| They even had OpenCL, didn't put the right amount of effort
| into it, all the talent found CUDA easier to work with so
| built there. Then what did AMD, Apple do? Double down and
| try and make something better and compete? Nah they
| fragmented and went their own way, AMD with what feels like
| a fraction of the effort even Apple put in.
|
| From the actions of the other teams in the game it's not
| hard to imagine a world without CUDA being a world where
| this tech is running at a fraction of its potential.
| immibis wrote:
| It's always been on the straggler to catch up by cheating.
| That's just how the world works - even in open source. If
| AMD supported CUDA, it would have a bigger market share.
| That's a fact. Nvidia doesn't want that. That's a fact. But
| when Reddit started, it just scraped feeds from Digg, and
| when Facebook started, it let you link your MySpace
| credentials and scraped your MySpace account. Adversarial
| interoperability is nothing new.
| cogman10 wrote:
| Funnily, who I blame the most for there not being real
| competition to CUDA is Apple. As of late, Apple has been
| really pushing for vendor lock-in APIs rather than adopting
| open standards. The end result is you can get AMD and Intel
| onboard with some standard which is ultimately torpedoed by
| Apple. (See Apple departing from and rejecting everything
| that comes from the Khronos Group.)
|
| With the number of devs that use Apple silicon nowadays,
| I have to think that their support for khronos initiatives
| like SYCL and OpenCL would have significantly accelerated
| progress and adoption in both.
|
| We need an open standard that isn't just AMD specific to be
| successful in toppling CUDA.
| Const-me wrote:
| > Nvidia can make things arbitrarily difficult both technically
| and legally
|
| Pretty sure APIs are not copyrightable, e.g.
| https://www.law.cornell.edu/supremecourt/text/18-956
|
| > against the license agreement of cuDNN or cuBLAS to run them
| on this
|
| They don't run either of them, they instead implement an
| equivalent API on top of something else. Here's a quote: "Open-
| source wrapper libraries providing the "CUDA-X" APIs by
| delegating to the corresponding ROCm libraries. This is how
| libraries such as cuBLAS and cuSOLVER are handled."
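|
| To make "delegating" concrete, a rough sketch of what one such
| wrapper entry point could look like: a cuBLAS-shaped cublasSgemm
| forwarding straight to rocBLAS. Not SCALE's code; the
| cublasSgemm/rocblas_sgemm signatures are the public ones, while
| the type stand-ins and the mapping helper are illustrative.
|
|   #include <rocblas/rocblas.h>  // header path varies by ROCm version
|
|   // Minimal stand-ins for the cuBLAS-facing types (illustrative).
|   typedef rocblas_handle cublasHandle_t;
|   enum cublasOperation_t { CUBLAS_OP_N, CUBLAS_OP_T, CUBLAS_OP_C };
|   enum cublasStatus_t { CUBLAS_STATUS_SUCCESS,
|                         CUBLAS_STATUS_INTERNAL_ERROR };
|
|   static rocblas_operation to_roc(cublasOperation_t op) {
|       switch (op) {
|           case CUBLAS_OP_T: return rocblas_operation_transpose;
|           case CUBLAS_OP_C: return rocblas_operation_conjugate_transpose;
|           default:          return rocblas_operation_none;
|       }
|   }
|
|   // The "CUDA-X"-shaped call simply forwards to the ROCm library.
|   cublasStatus_t cublasSgemm(cublasHandle_t handle,
|                              cublasOperation_t transa,
|                              cublasOperation_t transb,
|                              int m, int n, int k,
|                              const float* alpha,
|                              const float* A, int lda,
|                              const float* B, int ldb,
|                              const float* beta,
|                              float* C, int ldc) {
|       rocblas_status s = rocblas_sgemm(handle, to_roc(transa),
|                                        to_roc(transb), m, n, k, alpha,
|                                        A, lda, B, ldb, beta, C, ldc);
|       return s == rocblas_status_success ? CUBLAS_STATUS_SUCCESS
|                                          : CUBLAS_STATUS_INTERNAL_ERROR;
|   }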
| dralley wrote:
| I believe it was decided that they are copyrightable but that
| using them for compatibility purposes is fair use.
| kbolino wrote:
| No, it's stranger than that: SCOTUS did not rule on
| copyrightability of APIs at all, but simply ruled that even
| _if_ they are copyrightable, what Google did (completely
| reimplement Sun/Oracle's public API) was still fair use.
| mrandish wrote:
| It would have been nice to get a clear SCOTUS precedent
| on this. On the other hand, I also value a SCOTUS which
| rules minimally and narrowly by default (I also
| appreciate SCOTUS' return to stricter constitutional
| grounding in the past decade).
| hobs wrote:
| Incredibly loud laughing from the lawyers whose study of
| law is being thrown around willy nilly because of all the
| unprecedented joke decisions they are making right now.
| kbolino wrote:
| We are stuck between a rock and a hard place politically.
| The real decisions should be coming from Congress not the
| courts. However, Congress is too disorganized and
| disconnected to answer the important questions, leaving
| the courts to either muddle along or else become semi-
| dictatorial. In most countries, this would cause a
| constitutional crisis, but the modern U.S. system seems
| to be a little too resilient to such otherwise concerning
| signals.
| hobs wrote:
| We're far past a constitutional crisis, and the courts
| taking power nobody wanted to give to them (who wasn't
| interested in a unitary executive at least) isn't a good
| solution.
| kbolino wrote:
| What constitutional crisis has occurred that hasn't been
| resolved?
|
| Constitutional crises involve fundamental breaks in the
| working of government that bring two or more of its
| elements into direct conflict that can't be reconciled
| through the normal means. The last of these by my
| accounting was over desegregation, which was resolved
| with the President ordering the Army to force the
| recalcitrant states to comply. Before that was a showdown
| between the New Deal Congress and the Supreme Court,
| which the former won by credibly threatening to pack the
| latter (which is IMO a much less severe crisis but still
| more substantial than anything happening today). However,
| that was almost a century ago, and Congress has not been
| that coherent lately.
| ted_dunning wrote:
| I would think the latest one where SCOTUS ruled that the
| president was a king except in matters where the SCOTUS
| decides they aren't counts as a constitutional crisis.
| FeepingCreature wrote:
| Constitutional crises are not a matter of opinion but of
| occurrence, arising from an _actual power conflict_
| between arms of the government that is caused by a
| conflicted reading of the constitutional text. Basically,
| if the system just ticks on, it's not a constitutional
| crisis.
|
| If "I think this is a very bad decision" was cause for a
| constitutional crisis, any state with more than three
| digit population would be in constitutional crisis
| perpetually.
| jolux wrote:
| > Constitutional crises are not a matter of opinion but
| of occurrence, arising from an actual power conflict
| between arms of the government that is caused by a
| conflicted reading of the constitutional text. Basically,
| if the system just ticks on, it's not a constitutional
| crisis.
|
| This happened as recently as 2021-01-06; strong evidence
| that the military subverted the president to call the
| National Guard into Washington DC and secure the
| electoral count.
| hnfong wrote:
| If Trump didn't back down it could have definitely been a
| constitutional crisis.
|
| I'd say it was narrowly averted though.
| kbolino wrote:
| That's close. Both the excessively long lame duck period
| (2 months for Congress and 2.5 months for the President)
| and disunity between the President and the rest of the
| executive branch have also been fodder for crises in the
| past (Marbury v Madison, Andrew Johnson's impeachment).
| not2b wrote:
| That is how the SC used to work: they would decide cases
| on the narrowest possible grounds. If they don't have to
| decide a tough question, but they can finesse it with
| something simpler, good enough. More recently they have
| been willing to tear up decades of established law on a
| regular basis.
| hnfong wrote:
| "Used to work"... this was 2021.
|
| And generally courts/judges just choose the scope of
| their legal opinions based on how far reaching they want
| the legal principles to apply.
|
| IMHO, copyright-ability of APIs is so far away from their
| political agenda that they probably just decided to leave
| the issue on a cliffhanger...
| immibis wrote:
| Yes, "used to". Now, in 2024, the same supreme court has
| decided that presidents have immunity in all official
| acts, from stealing documents, up to and including
| assassination attempts on their opponents. This is a
| radical shift in how the court operates.
| kbolino wrote:
| This "opponent assassination" hypothetical gets bandied
| about a lot but I have not seen any evidence that any
| court considers that to be an "official act". Official
| acts are constrained to legitimate exercises of
| constitutional authority and are not merely anything a
| President (or especially, an ex-President) does.
| not2b wrote:
| It's specifically mentioned in the dissents.
| jpadkins wrote:
| the only thing radical is the opinions of people you are
| listening to if you believe SCOTUS enabled legally
| sanctioned assassinations. It was political hyperbole
| based on nothing, and it worked (with you). Think for
| yourself.
| consf wrote:
| You're correct! Fair Use Doctrine
| amelius wrote:
| Like supporting x86 was a bad idea as well?
| karolist wrote:
| Was there a large entity steering the x86 spec alone with a
| huge feature lead against their competition, free to steer
| the spec in any way they choose? Also, hardware is not
| open source software: you get big players onboard and they
| will be able to implement the spec they want every gen,
| software has more moving parts and unaligned parties
| involved.
| cherryteastain wrote:
| > Was there a large entity steering x86 spec alone with a
| huge feature lead against their competition, free to steer
| the spec in any ways they choose?
|
| Ever heard of Intel?
| karolist wrote:
| I hadn't considered that angle. Is your point that Intel
| was the creator of x86, but software chose to support it,
| and then AMD had no choice but to play catch-up in x86
| support to be part of the software target market? If so
| and factual (I've no idea), fair point, I didn't know.
| marshray wrote:
| It was exactly the same instruction set.
|
| C compilers didn't offer an "AMD" CPU target* until AMD
| came out with the "AMD64" instruction set. Today we call
| this "x86_64" or "x64".
|
| * Feel free to point out some custom multimedia vector
| extensions for Athlons or something, but the point
| remains.
| gmokki wrote:
| And Intel named its licenced implementation of AMD64 as
| IA-32e, just to make it clear to everyone that it is
| based on the Intel architecture's 32-bit version with an
| extension. Luckily they dropped that name a few years later.
| modeless wrote:
| Before starting, AMD signed an agreement with Intel that gave
| them an explicit license to x86. And x86 was a whole lot
| smaller and simpler back then in _1982_. A completely
| different and incomparable situation.
| nostrademons wrote:
| Technically it was after starting - AMD was founded in 1969
| as a second-sourcer for Fairchild and National
| Semiconductor, and had reverse-engineered the 8080 by 1975
| and acquired a formal license to it by 1976.
|
| The 1982 deal you speak of was actually pretty interesting:
| as a condition of the x86's use in the IBM PC, IBM
| requested a second source for x86 chips. AMD was that
| source, and so they cross-licensed the x86 in 1982 to allow
| the IBM PC project to proceed forward. This makes the
| Intel/AMD deal even more important for both companies: _the
| PC market would never have developed_ without the cross-
| licensing, which would 've been bad for all companies
| involved. This gave Intel an ongoing stake in AMD's success
| at least until the PC market consolidated on the x86
| standard.
| eslaught wrote:
| Are you aware of HIP? It's officially supported and, for code
| that avoids obscure features of CUDA like inline PTX, it's
| pretty much a find-and-replace to get a working build:
|
| https://github.com/ROCm/HIP
|
| Don't believe me? Include this at the top of your CUDA code,
| build with hipcc, and see what happens:
|
| https://gitlab.com/StanfordLegion/legion/-/blob/master/runti...
|
| It's incomplete because I'm lazy but you can see most things
| are just a single #ifdef away in the implementation.
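| (A minimal sketch, not the Legion header linked above, of the
| kind of mapping the parent describes: alias the CUDA runtime
| names a codebase uses to their HIP equivalents. The hip*
| identifiers are real HIP runtime names; the file name and the
| idea of keying off __HIP_PLATFORM_AMD__ are just one plausible
| arrangement.)
|
|     // cuda_or_hip.h -- hypothetical shim, included before the code
|     #if defined(__HIP_PLATFORM_AMD__)
|     #include <hip/hip_runtime.h>
|     #define cudaError_t             hipError_t
|     #define cudaSuccess             hipSuccess
|     #define cudaStream_t            hipStream_t
|     #define cudaMalloc              hipMalloc
|     #define cudaFree                hipFree
|     #define cudaMemcpy              hipMemcpy
|     #define cudaMemcpyHostToDevice  hipMemcpyHostToDevice
|     #define cudaMemcpyDeviceToHost  hipMemcpyDeviceToHost
|     #define cudaDeviceSynchronize   hipDeviceSynchronize
|     #else
|     #include <cuda_runtime.h>
|     #endif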
| currymj wrote:
| if you're talking about building anything, that is already
| too hard for ML researchers.
|
| you have to be able to pip install something and just have it
| work, reasonably fast, without crashing, and also it has to
| not interfere with 100 other weird poorly maintained ML
| library dependencies.
| bootsmann wrote:
| Don't most orgs that are deep enough to run custom cuda
| kernels have dedicated engineers for this stuff? I can't
| imagine a person who can write raw cuda not being able to
| handle things more difficult than pip install.
| gaogao wrote:
| Engineers who are really, really good at CUDA are worth
| their weight in gold, so there's more projects for them
| than they have time. Worth their weight in gold isn't
| figurative here - the one I know has a ski house more
| expensive than 180 lbs of gold (~$5,320,814).
| bbkane wrote:
| Would you (or your friend) be able to drop any good CUDA
| learning resources? I'd like to be worth my weight in
| gold...
| throwaway81523 wrote:
| A working knowledge of C++, plus a bit of online reading
| about CUDA and the NVidia GPU architecture, plus studying
| the LCZero chess engine source code (the CUDA neural net
| part, I mean) seems like enough to get started. I did
| that and felt like I could contribute to that code, at
| least at a newbie level, given the hardware and build
| tools. At least in the pre-NNUE era, the code was pretty
| readable. I didn't pursue it though.
|
| Of course becoming "really good" is a lot different and
| like anything else, it presumably takes a lot of callused
| fingertips (from typing) to get there.
| 8n4vidtmkvmk wrote:
| Does this pay more than $500k/yr? I already know C++,
| could be tempted to learn CUDA.
| throwaway81523 wrote:
| I kinda doubt it. Nobody paid me to do that though. I was
| just interested in LCZero. To get that $500k/year, I
| think you need up to date ML understanding and not just
| CUDA. CUDA is just another programming language while ML
| is a big area of active research. You could watch some of
| the fast.ai ML videos and then enter some Kaggle
| competitions if you want to go that route.
| almostgotcaught wrote:
| You're wrong. The people building the models don't write
| CUDA kernels. The people optimizing the models write CUDA
| kernels. And you don't need to know a bunch of ML bs to
| optimize kernels. Source: I optimize GPU kernels. I don't
| make 500k but I'm not that far from it.
| throwaway81523 wrote:
| Heh I'm in the wrong business then. Interesting. Used to
| be that game programmers spent lots of time optimizing
| non-ML CUDA code. They didn't make anything like 500k at
| that time. I wonder what the ML industry has done to game
| development, or for that matter to scientific
| programming. Wow.
| HarHarVeryFunny wrote:
| How much performance difference is there between writing
| a kernel in a high level language/framework like PyTorch
| (torch.compile) or Triton, and hand optimizing? Are you
| writing kernels in PTX?
|
| What's your opinion on the future of writing optimized
| GPU code/kernels - how long before compilers are as good
| or better than (most) humans writing hand-optimized PTX?
| throwaway81523 wrote:
| The CUDA version of LCZero was around 2x or 3x faster
| than the Tensorflow(?) version iirc.
| mosselman wrote:
| The real challenge is probably getting your hands on a
| 4090 for a price you can pay before you are worth your
| weight in gold. Because an arm and a limb in gold is
| quite a lot.
| throwaway81523 wrote:
| You don't really need a 4090. An older board is plenty.
| The software is basically the same. I fooled around with
| what I think was a 1080 on Paperspace for something like
| 50 cents an hour, but it was mostly with some Pytorch
| models rather than CUDA directly.
| ahepp wrote:
| I was looking into this recently and it seems like the
| cheapest AWS instance with a CUDA GPU is something on the
| order of $1/hr. It looks like an H100 instance might be
| $15/hr (although I'm not sure if I'm looking at a monthly
| price).
|
| So yeah it's not ideal if you're on a budget, but it
| seems like there are some solutions that don't involve
| massive capex.
| throwaway81523 wrote:
| Look on vast.ai instead of AWS, you can rent machines
| with older GPUs dirt cheap. I don't see how they even
| cover the electricity bills. A 4090 machine starts at
| about $.25/hour though I didn't examine the
| configuration.
|
| A new 4090 costs around $1800
| (https://www.centralcomputer.com/asus-tuf-
| rtx4090-o24g-gaming...) and that's probably affordable to
| AWS users. I see a 2080Ti on Craigslist for $300
| (https://sfbay.craigslist.org/scz/sop/d/aptos-nvidia-
| geforce-...) though used GPUs are possibly thrashed by
| bitcoin mining. I don't have a suitable host machine,
| unfortunately.
| dotancohen wrote:
| Thrashed? What type of damage could a mostly-solid state
| device suffer? Fan problems? Worn PCi connectors?
| Deteriorating Arctic Ice from repeated heat cycling?
| ssl-3 wrote:
| Nope, none of those.
|
| When people were mining Ethereum (which was the last
| craze that GPUs were capable of playing in -- BTC has
| been off the GPU radar for a long time), profitable
| mining was fairly kind to cards compared to gaming.
|
| Folks wanted their hardware to produce as much as
| possible, for as little as possible, before it became
| outdated.
|
| The load was constant, so heat cycles weren't really a
| thing.
|
| That heat was minimized; cards were clocked (and voltages
| tweaked) to optimize the ratio of crypto output to Watts
| input. For Ethereum, this meant undervolting and
| underclocking the GPU -- which are kind to it.
|
| Fan speeds were kept both moderate and tightly
| controlled; too fast, and it would cost more (the fans
| themselves cost money to run, and money to replace). Too
| slow, and potential output was left on the table.
|
| For Ethereum, RAM got hit hard. But RAM doesn't
| necessarily care about that; DRAM in general is more or
| less just an array of solid-state capacitors. And people
| needed that RAM to work reliably -- it's NFG to spend
| money producing bad blocks.
|
| Power supplies tended to be stable, because good, cheap,
| stable, high-current, and stupidly-efficient are
| qualities that go hand-in-hand thanks to HP server PSUs
| being cheap like chips.
|
| There were exceptions, of course: Some people did not
| mine smartly.
|
| ---
|
| But this is broadly very different from how gamers treat
| hardware, wherein: Heat cycles are real, overclocking
| everything to eke out an extra few FPS is real, pushing
| things a bit too far and producing glitches can be
| tolerated sometimes, fan speeds are whatever, and power
| supplies are picked based on what they _look like_
| instead of an actual price/performance comparison.
|
| A card that was used for mining is not implicitly worse
| in any way than one that was used for gaming. Purchasing
| either thing involves non-zero risk.
| mschuster91 wrote:
| Heat. A lot of components - and not just in computers but
| _everything_ hardware - are spec'd for something called
| "duty cycles", basically how long a thing is active in a
| specific time frame.
|
| Gaming cards/rigs, which many of the early miners were
| based on, rarely run at 100% all the time, the workload
| is burst-y (and distributed amongst different areas of
| the system). In comparison, a miner runs at 100% all the
| time.
|
| On top of that, for silicon there is an effect called
| electromigration [1], where the literal movement of
| electrons erodes the material over time - made worse by
| ever shrinking feature sizes as well as, again, the chips
| being used in exactly the same way all the time.
|
| [1] https://en.wikipedia.org/wiki/Electromigration
| SonOfLilit wrote:
| replying to sibling @dotancohen, they melt, and they
| suffer from thermal expansion and compression
| robotnikman wrote:
| Are there any certifications or other ways to prove your
| knowledge to employers in order to get your foot in the
| door?
| suresk wrote:
| Having dabbled in CUDA, but not worked on it
| professionally, it feels like a lot of the complexity
| isn't really in CUDA/C++, but in the algorithms you have
| to come up with to really take advantage of the hardware.
|
| Optimizing something for SIMD execution isn't often
| straightforward and it isn't something a lot of
| developers encounter outside a few small areas. There are
| also a lot of hardware architecture considerations you
| have to work with (memory transfer speed is a big one) to
| even come close to saturating the compute units.
| iftheshoefitss wrote:
| Oh bro, forget gold, I'd like to be worth my weight in paper
| lmao
| eigenvalue wrote:
| That's pretty funny. Good test of value across the
| millennia. I wonder if the best aqueduct engineers during
| the peak of Ancient Rome's power had villas worth their
| body weight in gold.
| Winse wrote:
| Lol. For once being overweight may come with some
| advantages here.
| necovek wrote:
| Or disadvantages: you may be as rich as your skinny
| neighbour, but they are the only ones worth their weight
| in gold ;)
| Willish42 wrote:
| The fact that "worth their weight in cold" typically
| means in the single-digit millions is _fascinating_ to me
| (though I doubt I 'll be able to get there myself, maybe
| someday). I looked it up though and I think this is
| undercounting the current value of gold per ounce/lb/etc.
|
| 5320814 / 180 / 16 = ~1847.5
|
| Per https://www.apmex.com/gold-price and
| https://goldprice.org/, current value is north of $2400 /
| oz. It was around $1800 in 2020. That growth for _gold_
| of all things (up 71% in the last 5 years) is crazy to
| me.
|
| It's worth noting that anyone with a ski house that
| expensive probably has a net worth well over twice the
| price of that ski house. I guess it's time to start
| learning CUDA!
| boulos wrote:
| Note: gold uses _troy_ ounces, so adjust by ~10%. It's
| easier to just use grams or kilograms :).
| Willish42 wrote:
| Thanks, I'm a bit new to this entire concept. Do _troy_
| lbs also exist, or is that just a term when measuring
| ounces?
| atwrk wrote:
| _> That growth for _gold_ of all things (up 71% in the
| last 5 years) is crazy to me._
|
| For comparison: S&P500 grew about the same during that
| period (more than 100% from Jan 2019, about 70% from Dec
| 2019), so the higher price of gold did not outperform the
| growth of the general (financial) economy.
| dash2 wrote:
| But that's still surprising performance, because the S&P
| generates income and pays dividends. Its increase
| reflects (at least, is supposed to!) expectations of
| future higher income. Gold doesn't even bear interest....
| iftheshoefitss wrote:
| What do people study to figure out CUDA? I'm studying to
| get my GED and hope to go to school one day
| paulmd wrote:
| Computer science. This is a grad level topic probably.
|
| Nvidia literally wrote most of the textbooks in this
| field and you'd probably be taught using one of these
| anyway:
|
| https://developer.nvidia.com/cuda-books-archive
|
| "GPGPU Gems" is another "cookbook" sort of textbook that
| might be helpful starting out but you'll want a good
| understanding of the SIMT model etc.
| amelius wrote:
| Just wait until someone trains an ML model that can
| translate any CUDA code into something more portable like
| HIP.
|
| GP says it is just some #ifdefs in most cases, so an LLM
| should be able to do it, right?
| FuriouslyAdrift wrote:
| OpenAI Triton? Pytorch 2.0 already uses it.
|
| https://openai.com/index/triton/
| phkahler wrote:
| >> Don't most orgs that are deep enough to run custom
| cuda kernels have dedicated engineers for this stuff. I
| can't imagine a person who can write raw cuda not being
| able to handle things more difficult than pip install.
|
| This seems to be a fairly common problem with software. The
| people who create software regularly deal with complex
| tool chains, dependency management, configuration files,
| and so on. As a result they think that if a solution
| "exists" everything is fine. Need to edit a config file
| for your particular setup? No problem. The thing is, _I_
| have been programming stuff for decades and I really
| _hate_ having to do that stuff and will avoid tools that
| make me do it. I have my own problems to solve, and don't
| want to deal with figuring out tools no matter how
| "simple" the author thinks that is to do.
|
| A huge part of the reason commercial software exists
| today is probably because open source projects don't take
| things to this extreme. I look at some things that
| qualify as products and think they're really simplistic,
| but they take care of some minutiae that regular people
| are willing to pay for so they don't have to learn or deal
| with it. The same can be true for developers and ML
| researchers or whatever.
| jchw wrote:
| The target audience of interoperability technology is
| whoever is building, though. Ideally, interoperability
| technology can help software that supports only NVIDIA GPUs
| today go on to quickly add baseline support for Intel and
| AMD GPUs tomorrow.
|
| (and for one data point, I believe Blender is actively
| using HIP for AMD GPU support in Cycles.)
| Agingcoder wrote:
| Their target is hpc users, not ml researchers. I can
| understand why this would be valuable to this particular
| crowd.
| eslaught wrote:
| If your point is that HIP is not a zero-effort porting
| solution, that is correct. HIP is a _low_ -effort solution,
| not a zero effort solution. It targets users who already
| use and know CUDA, and minimizes the changes that are
| required from pre-existing CUDA code.
|
| In the case of these abstraction layers, then it would be
| the responsibility of the abstraction maintainers (or AMD)
| to port them. Obviously, someone who does not even use CUDA
| would not use HIP either.
|
| To be honest, I have a hard time believing that a truly
| zero-effort solution exists. Especially one that gets high
| performance. Once you start talking about the full stack,
| there are too many potholes and sharp edges to believe that
| it will really work. So I am highly skeptical of the original
| article. Not that I wouldn't want to be proved wrong. But
| what they're claiming to do is a big lift, even taking HIP
| as a starting point.
|
| The easiest, fastest (for end users), highest-performance
| solution for ML will come when the ecosystem integrates it
| natively. HIP would be a way to get there faster, but it
| will take nonzero effort from CUDA-proficient engineers to
| get there.
| currymj wrote:
| I agree completely with your last point.
|
| As other commenters have pointed out, this is probably a
| good solution for HPC jobs where everyone is using C++ or
| Fortran anyway and you frequently write your own CUDA
| kernels.
|
| From time to time I run into a decision maker who
| understandably wants to believe that AMD cards are now
| "ready" to be used for deep learning, and points to
| things like the fact that HIP mostly works pretty well. I
| was kind of reacting against that.
| ezekiel68 wrote:
| > if you're talking about building anything, that is
| already too hard for ML researchers.
|
| I don't think so. I agree it is too hard for the ML
| researchers at the companies which will have their rear ends
| handed to them by the other companies whose ML researchers
| can be bothered to follow a blog post and prompt ChatGPT to
| resolve error messages.
| jokethrowaway wrote:
| a lot of ML researchers stay pretty high level and
| reinstall conda when things stop working
|
| and rightly so, they have more complicated issues to
| tackle
|
| It's on developers to provide better infrastructure and
| solve these challenges
| LtWorf wrote:
| Not rightly. It'd be faster in the long term to address
| the issues.
| bayindirh wrote:
| Currently nobody thinks that long term. They just
| reinstall, that's it.
| currymj wrote:
| I'm not really talking about companies here for the most
| part, I'm talking about academic ML researchers (or
| industry researchers whose role is primarily academic-
| style research). In companies there is more incentive for
| good software engineering practices.
|
| I'm also speaking from personal experience: I once had to
| hand-write my own CUDA kernels (on official NVIDIA cards,
| not even this weird translation layer): it was useful and
| I figured it out, but everything was constantly breaking
| at first.
|
| It was a drag on productivity and more importantly, it
| made it too difficult for other people to run my code
| (which means they are less likely to cite my work).
| elashri wrote:
| As someone doing a lot of work with CUDA in a big research
| organization, there are few of us. If you are working with
| CUDA, then you are not from the type of people who wait to
| have something that just works like you describe. CUDA
| itself is a battle with poorly documented stuff.
| klik99 wrote:
| God this explains so much about my last month, working with
| tensorflow lite and libtorch in C++
| SushiHippie wrote:
| AMD has hipify for this, which converts cuda code to hip.
|
| https://github.com/ROCm/HIPIFY
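| (A minimal sketch of what that conversion amounts to;
| hipify-perl is the simple text-based converter from that repo,
| and the file names here are placeholders:)
|
|     // saxpy.cu, before:
|     cudaMalloc(&d_x, n * sizeof(float));
|     cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
|
|     // after `hipify-perl saxpy.cu > saxpy.hip.cpp`:
|     hipMalloc(&d_x, n * sizeof(float));
|     hipMemcpy(d_x, h_x, n * sizeof(float), hipMemcpyHostToDevice);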
| 3abiton wrote:
| There is a more glaring issue: ROCm doesn't even work well on
| most AMD devices nowadays, and HIP performance deteriorates
| on the same hardware compared to ROCm.
| boroboro4 wrote:
| It supports all current datacenter GPUs.
|
| If you want to write very efficient CUDA kernel for
| modern datacenter NVIDIA GPU (read H100), you need to
| write it with having hardware in mind (and preferably in
| hands, H100 and RTX 4090 behave _very_ differently in
| practice). So I don't think the difference between AMD
| and NVIDIA is as big as everyone perceives.
| jph00 wrote:
| Inline PTX is hardly an obscure feature. It's pretty widely
| used in practice, at least in the AI space.
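| (For readers who haven't seen it: inline PTX is just GCC-style
| asm with PTX text inside. As a small illustration, not taken
| from any particular codebase, a common idiom is a helper that
| reads the lane id from a PTX special register:)
|
|     __device__ unsigned int lane_id() {
|         unsigned int id;
|         // %laneid is a PTX special register (0..31 within a warp)
|         asm volatile("mov.u32 %0, %%laneid;" : "=r"(id));
|         return id;
|     }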
| saagarjha wrote:
| Yeah, a lot of the newer accelerators are not even
| available without using inline PTX assembly. Even the ones
| that are have weird shapes that are not amenable to high-
| performance work.
| HarHarVeryFunny wrote:
| Are you saying that the latest NVIDIA nvcc doesn't
| support the latest NVIDIA devices?
| adrian_b wrote:
| For any compiler, "supporting" a certain CPU or GPU only
| means that they can generate correct translated code with
| that CPU or GPU as the execution target.
|
| It does not mean that the compiler is able to generate
| code that has optimal performance, when that can be
| achieved by using certain instructions without a direct
| equivalent in a high-level language.
|
| No compiler that supports the Intel-AMD ISA knows how to
| use all the instructions available in this ISA.
| HarHarVeryFunny wrote:
| Sure, but I'm not sure if that is what the parent poster
| was saying (that nvcc generates poor quality PTX for
| newer devices).
|
| It's been a while since I looked at CUDA, but it used to
| be that NVIDIA were continually extending cuDNN to add
| support for kernels needed by SOTA models, and I assume
| these kernels were all hand optimized.
|
| I'm curious what kind of models people are writing where
| not only is there no optimized cuDNN support, but also
| solutions like Triton or torch.compile, and even hand-
| optimized CUDA C kernels, are too slow. Are hand-written
| PTX kernels really that common?
| pjmlp wrote:
| How does it run CUDA Fortran?
| viraptor wrote:
| Isn't cuDNN a much better case for reimplementing than CUDA? It
| has much more choice in how things actually happen and cuDNN
| itself chooses different implementations at runtime + does
| fusing. It seems way more generic and the reimplementation
| would allow using the best AMD-targeted kernel rather than one
| the original has.
| ckitching wrote:
| AMD have "MIOpen" which is _basically_ cuDNN-for-AMD. Ish.
| mmis1000 wrote:
| And that thing has been left unreleased on Windows for almost
| a whole year, for unknown reasons. Even though there is
| activity on github and frequent build fixes, there is just
| no .exe or .msi for you to download. In fact, ROCm for
| Linux is on its major 6 release (which includes MIOpen), but
| somehow Windows is still on major 5 (which doesn't have
| MIOpen) after almost a whole year.
|
| It almost makes me wonder: is there a shady deal somewhere
| asking AMD to never release the SDK for Windows, to push the
| price of nvidia cards higher? Why do they keep developing
| these without releasing them at all?
| anigbrowl wrote:
| Given AMDs prior lack of interest I'll take whatever options
| there are. My daily driver has a Vega 10 GPU and it's been
| quite frustrating not to be able to easily leverage it for
| doing basic ML tasks, to the point that I've been looking at
| buying an external nvidia GPU instead just to try out some of
| the popular Python libraries.
| Wowfunhappy wrote:
| > CUDA is not designed to be vendor agnostic and Nvidia can
| make things arbitrarily difficult [...] technically.
|
| (Let's put the legal questions aside for a moment.)
|
| nVidia changes GPU architectures every generation / few
| generations, right? How does CUDA work across those--and how
| can it have forwards compatibility in the future--if it's not
| designed to be technologically agnostic?
| andy_ppp wrote:
| One way is to make sure the hardware team does certain things
| to support easy transition to new architectures, we have seen
| this with Apple Silicon for example!
| saagarjha wrote:
| PTX is meant to be portable across GPU microarchitectures.
| That said, Nvidia owns the entire spec, so they can just keep
| adding new instructions that their GPUs now support but AMD
| GPUs don't.
| koolala wrote:
| CUDA v1...CUDA v2... CUDA v... CUDA isn't commonly associated
| with a version number...
| Uehreka wrote:
| ...yes it is? https://developer.nvidia.com/cuda-toolkit-
| archive
| rjurney wrote:
| Not having a layer like this has left AMD completely out of the
| AI game that has made NVDA the world's most valuable company.
| ChoGGi wrote:
| Self-inflicted wounds hurt the most.
| HarHarVeryFunny wrote:
| Well, they kinda have it with their hipify tool, although
| this is for porting CUDA code to AMD's HIP which supports
| both AMD and NVIDIA. This supports CUDA C code and libraries
| with AMD equivalents like cuDNN, cuBLAS, cuRAND, but doesn't
| support porting of CUDA C inline PTX assembler. AMD have
| their own inline GCN assembler, but seem to discourage its
| use.
|
| There are also versions of PyTorch, TensorFlow and JAX with
| AMD support.
|
| PyTorch's torch.compile can generate Triton (OpenAI's GPU
| compiler) kernels, with Triton also supporting AMD.
| neutrinobro wrote:
| _Cries in OpenCL_
| apatheticonion wrote:
| Agreed. Rather than making CUDA the standard, AMD should
| push/drive an open standard that can be run on any hardware.
|
| We have seen this succeed multiple times: FreeSync vs GSync,
| DLSS vs FSR, (not AMD but) Vulkan vs DirectX & Metal.
|
| All of the big tech companies are obsessed with ring-fencing
| developers behind the thin veil of "innovation" - where really
| it's just good for business (I swear it should be regulated
| because it's really bad for consumers).
|
| A CUDA translation layer is okay for now but it does risk CUDA
| becoming the standard API. Personally, I am comfortable with
| waiting on an open standard to take over - ROCm has serviced my
| needs pretty well so far.
|
| Just wish GPU sharing with VMs was as easy as CPU sharing.
| amy-petrik-214 wrote:
| we actually also saw this historically with openGL. openGL
| comes from an ancient company whispered about by the elderly
| programmers (30+ years old) known as SGI. Originally it was
| CLOSED SOURCE and SGI called it "SGI-GL" for a computer
| codename IRIS which was cool looking with bright popping
| color plastic and faux granite keyboard. Good guy SGI open
| sourced SGI-GL to become what we called "openGL" (get it, now
| it's open), and then it stuck.
|
| That's all to say NVIDIA _could_ pull a SGI and open their
| stuff, but they're going more sony style and trying to
| monopolize. Oh, and SGI also wrote another ancient lore
| library known as "STL" or the "SGI Template Library" which is
| like the original boost template metaprogramming granddaddy
| adrian_b wrote:
| Also the XFS file system.
| usr1106 wrote:
| Nice story, but is it correct? Wikipedia says STL was first
| implemented by HP and later by the same authors at SGI.
| adrian_b wrote:
| STL started even earlier, obviously without using the
| name "STL", as a library of generic algorithms for the
| programming language Ada (David R. Musser & Alexander A.
| Stepanov, 1987).
| pjmlp wrote:
| Vulkan only matters on Android (from version 10 onwards) and
| GNU/Linux.
|
| Zero impact on Switch, Playstation, XBox, Windows, macOS,
| iOS, iPadOS, Vision OS.
| ChoGGi wrote:
| "Windows"
|
| dxvk-gplasync is a game changer for dx9-11 shader stutter.
| pjmlp wrote:
| Sure, for the 2% folks that enjoy Windows games, written
| against DirectX, on the Linux Steam Store.
|
| Which Android Studios can't even be bothered to target
| with their NDK engines, based on GL ES, Vulkan.
| ChoGGi wrote:
| I'm on windows 11, if I see not dx12 in my afterburner
| overlay, I use it.
|
| Even if there's no shader stutter, Vulkan tends to use
| less juice than DX.
| ChoGGi wrote:
| "We have seen this succeed multiple times: FreeSync vs GSync,
| DLSS vs FSR, (not AMD but) Vulkan vs DirectX & Metal."
|
| I'll definitely agree with you on Sync and Vulkan, but dlss
| and xess are both better than fsr.
|
| https://youtube.com/watch?v=el70HE6rXV4
| gjulianm wrote:
| OpenCL was released in 2009. AMD has had plenty of time to
| push and drive that standard. But OpenCL had a worse
| experience than CUDA, and AMD wasn't up to the task in terms
| of hardware, so it made no real sense to go for OpenCL.
| imtringued wrote:
| AMD shouldn't push on anything. They have the wrong
| incentives. They should just make sure that software runs on
| their GPUs and nothing else.
|
| Karol Herbst is working on Rusticl, which is mesa's latest
| OpenCL implementation and will pave the way for other things
| such as SYCL.
| consf wrote:
| A strategic and forward-thinking approach
| naasking wrote:
| > AMD should push/drive an open standard that can be run on
| any hardware.
|
| AMD has always been notoriously bad at the software side, and
| they frequently abandon their projects when they're almost
| usable, so I won't hold my breath.
| magic_hamster wrote:
| CUDA is the juice that built Nvidia in the AI space and allowed
| them to charge crazy money for their hardware. To be able to
| run CUDA on cost effective AMD hardware can be a big leap
| forward, allow more people to research, and break away from
| Nvidia's stranglehold over VRAM. Nvidia will never open source
| their own platform unless their hand is forced. I think we all
| should support this endeavor and contribute where possible.
| Sparkyte wrote:
| That is why an open standard should be made, so it isn't
| locked to a particular piece of hardware and can instead allow
| modular support for different hardware to interface with
| supported drivers.
| raxxorraxor wrote:
| I really hope they will do what you suggested. With some
| innovative product placement, GPUs with a lot of memory for
| example, they could dethrone nvidia if it doesn't change
| strategy.
|
| That said, easier said than done. You need very specialized
| developers to build a CUDA equivalent and have people start
| using it. AMD could do it with a more open development process
| leveraging the open source community. I believe this will
| happen at some point anyway by AMD or someone else. The market
| just gets more attractive by the day and at some point the high
| entry barrier will not matter much.
|
| So why should AMD skimp on their ambitions here? This would be
| a most sensible investment, few risks and high gains if
| successful.
| consf wrote:
| This expanding market provides AMD with a lucrative
| opportunity indeed
| consf wrote:
| The legal, technical and strategic challenges make it a less
| attractive option
| jarbus wrote:
| Really, really, _really_ curious as to how they managed to pull
| this off, if their project works as well as they claim it does.
| If stuff as complex as paged /flash attention can "just work",
| this is really cool.
| Straw wrote:
| My understanding from chatting with them is that tensor core
| operations aren't supported yet, so FlashAttention likely won't
| work. I think it's on their to-do list though!
|
| Nvidia actually has more and more capable matrix multiplication
| units, so even with a translation layer I wouldn't expect the
| same performance until AMD produces better ML cards.
|
| Additionally, these kernels usually have high sensitivity to
| cache and smem sizes, so they might need to be retuned.
| Der_Einzige wrote:
| So the only part that anyone actually cares about, as usual,
| is not supported. Same story as it was in 2012 with AMD vs
| Nvidia (and likely much before that too!). The more things
| change, the more they stay the same.
| Straw wrote:
| People did GPGPU computing long before GPUs. Simply look at
| the list of tested, supported projects on their docs page!
| Straw wrote:
| [EDIT] long before deep learning!
| JonChesterfield wrote:
| Cuda is a programming language. You implement it like any
| other. The docs are a bit sparse but not awful. Targeting
| amdgpu is probably about as difficult as targeting x64, mostly
| changes the compiler runtime.
|
| The inline ptx implementation is notable for being even more
| annoying to deal with than the cuda, but it's just bytes in /
| different bytes out. No magic.
| ckitching wrote:
| [I work on SCALE]
|
| CUDA has a couple of extra problems beyond just any other
| programming language:
|
| - CUDA is more than a language: it's a giant library (for
| both CPU and GPU) for interacting with the GPU, and for
| writing the GPU code. This needed reimplementing. At least
| for the device-side stuff we can implement it _in CUDA_, so
| when we add support for other GPU vendors the code can
| (mostly) just be recompiled and work there :D.
|
| - CUDA (the language) is not actually specified. It is,
| informally, "whatever nvcc does". This differs significantly
| from what Clang's CUDA support does (which is ultimately what
| the HIP compiler is derived from).
|
| PTX is indeed vastly annoying.
| JonChesterfield wrote:
| The openmp device runtime library was originally written in
| cuda. I ported that to hip for amdgpu, discovered the
| upstream hip compiler wasn't quite as solid as advertised,
| then ported it to openmp with some compiler intrinsics. The
| languages are all essentially C++ syntax with some spurious
| noise obfuscating llvm IR. The libc effort has gone with
| freestanding c++ based on that experience and we've now
| mostly fixed the ways that goes wrong.
|
| You might also find raw c++ for device libraries saner to
| deal with than cuda. In particular you don't need to jury
| rig the thing to not spuriously embed the GPU code in x64
| elf objects and/or pull the binaries apart. Though if
| you're feeding the same device libraries to nvcc with
| #ifdef around the divergence your hands are tied.
| ckitching wrote:
| > You might also find raw c++ for device libraries saner
| to deal with than cuda.
|
| Actually, we just compile all the device libraries to
| LLVM bitcode and be done with it. Then we can write them
| using all the clang-dialect, not-nvcc-emulating, C++23 we
| feel like, and it'll still work when someone imports them
| into their c++98 CUDA project from hell. :D
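| (For context, upstream clang can already produce the bitcode
| half of this on its own; something roughly like the following
| emits device-side LLVM bitcode from a CUDA source file, with
| the flags and target architecture here being illustrative
| rather than anything SCALE-specific:)
|
|     clang++ -x cuda --cuda-device-only --cuda-gpu-arch=sm_70 \
|             -emit-llvm -c devlib.cu -o devlib.bc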
| m3kw9 wrote:
| This isn't a solution for pros because it will always play catch
| up and Nvidia can always add things to make it difficult. This is
| like emulation.
| bachmeier wrote:
| > it will always play catch up
|
| That's not important if the goal is to run existing CUDA code
| on AMD GPUs. All you have to do is write portable CUDA code in
| the future regardless of what Nvidia does if you want to keep
| writing CUDA.
|
| I don't know the economics here, but if the AMD provides a
| significant cost saving, companies are going to make it work.
|
| > Nvidia can always add things to make it difficult
|
| Sounds like Microsoft embedding the browser in the OS. It's
| hard to see how doing something like that wouldn't trigger an
| antitrust case.
| dboreham wrote:
| Pros will end up overruled by bean counters if it works.
| ok123456 wrote:
| It's not emulation. It's a compiler.
| joe_the_user wrote:
| This sounds fabulous. I look forward to AMD being drawn kicking
| and screaming into direct competition with Nvidia.
| gizajob wrote:
| Is Nvidia not likely to sue or otherwise bork this into non-
| existence?
| chx wrote:
| Sue over what...?
| gizajob wrote:
| Whatever IP-related issues they'd want to sue over. Sorry, I
| don't know specifically what this would infringe, but I'm sure
| expensive legal brains could come up with something.
| CoastalCoder wrote:
| I wonder if nVidia's current anti-trust woes would make them
| reluctant to go that route at the moment.
| sakras wrote:
| One question I always have about these sorts of translation
| layers is how they deal with the different warp sizes. I'd
| imagine a lot of CUDA code relies on 32-wide warps, while as far
| as I know AMD tends to have 64-wide warps. Is there some sort of
| emulation that needs to happen?
| mpreda wrote:
| The older AMD _GCN_ had a 64-wide wavefront, but the newer AMD
| "RDNA" GPUs support both 64- and 32-wide wavefronts, and this is
| configurable at runtime. It appears the narrower wavefronts are
| better suited for games in general.
|
| Not sure what the situation is with "CDNA", which is the
| compute-oriented evolution of "GCN", i.e. whether CDNA is
| 64-wavefront only or dual like RDNA.
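| (A minimal sketch of the portable pattern this implies, in
| plain CUDA/HIP and not specific to SCALE: derive lane and warp
| indices from the built-in warpSize variable instead of a
| literal 32, and the same source covers 32- and 64-wide
| hardware:)
|
|     __global__ void tag_warp_leaders(int* out, int warps_per_block) {
|         // warpSize is 32 on NVIDIA but may be 64 on AMD, so never
|         // hard-code the width.
|         int lane = threadIdx.x % warpSize;
|         int warp = threadIdx.x / warpSize;
|         if (lane == 0)
|             out[blockIdx.x * warps_per_block + warp] = 1;
|     }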
| msond wrote:
| SCALE is not a "translation layer", it's a full source-to-
| target compiler from CUDA-like C++ code to AMD GPUs.
|
| See this part of the documentation for more details regarding
| warp sizes: https://docs.scale-lang.com/manual/language-
| extensions/#impr...
| ladberg wrote:
| I don't really see how any code that depends heavily on the
| underlying hardware can "just work" on AMD. Most serious CUDA
| code is aware of register file and shared memory sizes, wgmma
| instructions, optimal tensor core memory & register layouts,
| tensor memory accelerator instructions, etc...
|
| Presumably that stuff doesn't "just work" but they don't want to
| mention it?
| lmeyerov wrote:
| Sort of
|
| A lot of our hw-aware bits are parameterized where we fill in
| constants based on the available hw. Doable to port, same as
| we do whenever new Nvidia architectures come out.
|
| But yeah, we have tricky bits that inline PTX, and.. that will
| be more annoying to redo.
| Retr0id wrote:
| > SCALE accepts CUDA programs as-is. [...] This is true even
| if your program uses inline PTX asm
| lmeyerov wrote:
| Oh that will be interesting to understand, as PTX gets
| more into trickier hw-arch-specific phenomena that diff
| brands disagree on, like memory models. Neat!
| lmeyerov wrote:
| Looks like the PTX translation is via another project
| ZLUDA, though how they bridge the differences in
| memory/consistency/etc models safely remains unclear to
| me...
| ckitching wrote:
| Hi! Spectral engineer here!
|
| SCALE does not use any part of ZLUDA. We have modified
| the clang frontend to convert inline PTX asm block to
| LLVM IR.
|
| To put in a less compiler-engineer-ey way: for any given
| block of PTX, there exists a hypothetical sequence of
| C++/CUDA code you could have written to achieve the same
| effect, but on AMD (perhaps using funky __builtin_...
| functions if the code includes shuffles/ballots/other-
| weird-gpu-stuff). Our compiler effectively converts the
| PTX into that hypothetical C++.
|
| Regarding memory consistency etc.: NVIDIA document the
| "CUDA memory consistency model" extremely thoroughly, and
| likewise, the consistency guarantees for PTX. It is
| therefore sufficient to ensure that we use operations at
| least as synchronising as those called for in the
| documented semantics of the language (be it CUDA or PTX,
| for each operation).
|
| Differing consistency _between architectures_ is the
| AMDGPU backend's problem.
| ladberg wrote:
| Just to check here, if you're given something like the
| following PTX:
| wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16
|
| Do you reverse it back into C++ that does the
| corresponding FMAs manually instead of using tensor
| hardware? Or are you able to convert it into a series of
| __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt instructions that
| emulate the same behavior?
| ckitching wrote:
| Rather awkwardly, you've asked about an instruction that
| isn't currently implemented. :D Support for wmma and
| friends is in development.
|
| But in general the answer to your question is yes: we use
| AMD-specific builtins where available/efficient to make
| things work. Otherwise many things would be
| unrepresentble, not just slow!
| saagarjha wrote:
| What do you do when a builtin doesn't exist?
| ckitching wrote:
| Add one: it's trivial to add a compiler builtin to carry
| the instruction from the frontend to the backend if an
| instruction exists and the backend knows about it.
|
| If there's no instruction, either, you can write a C++
| function to replicate the behaviour and codegen a call to
| it. Since the PTX blocks are expanded during initial IR
| generation, it all inlines nicely by the end. Of course,
| such software emulation is potentially suboptimal
| (depends on the situation).
| lmeyerov wrote:
| Ah I was reading the 'deeper dive' section on my phone
| and missed it was a comparison, not a warning, thank you
|
| I'm curious how something like this example would
| translate:
|
| ===
|
| Mapping lower-level ptx patterns to higher-level AMD
| constructs like __ballot, and knowing it's safe
|
| ```
| #ifdef INLINEPTX
| inline uint ptx_thread_vote(float rSq, float rCritSq) {
|     uint result = 0;
|     asm("{\n\t"
|         ".reg .pred cond, out;\n\t"
|         "setp.ge.f32 cond, %1, %2;\n\t"
|         "vote.sync.all.pred out, cond, 0xffffffff;\n\t"
|         "selp.u32 %0, 1, 0, out;\n\t"
|         "}\n\t"
|         : "=r"(result)
|         : "f"(rSq), "f"(rCritSq));
|     return result;
| }
| #endif
| ```
|
| ===
|
| Again, I'm guessing there might be an equiv simpler
| program involving AMD's __ballot, but I'm unsure of the
| true equivalence wrt safety, and it seems like a tricky
| rewrite as it needs to (afaict) decompile to recover the
| higher-level abstraction. Normally it's easier to compile
| down or sideways (translate), and it's not clear to me
| these primitives are 1:1 for safely doing so.
|
| ===
|
| FWIW, this is all pretty cool. We stay away from PTX --
| most of our app code is higher-level, whether RAPIDS (GPU
| dataframes, GPU ML, etc libs), minimal cuda, and minimal
| opencl, with only small traces of inline ptx. So more
| realistically, if we had the motivation, we'd likely
| explore just #ifdef'ing it with something predictable.
| ckitching wrote:
| I compiled your function with SCALE for gfx1030:
|         .p2align 2                 ; -- Begin function _Z15ptx_thread_voteff
|         .type   _Z15ptx_thread_voteff,@function
| _Z15ptx_thread_voteff:             ; @_Z15ptx_thread_voteff
| ; %bb.0:                           ; %entry
|         s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
|         s_waitcnt_vscnt null, 0x0
|         v_cmp_ge_f32_e32 vcc_lo, v0, v1
|         s_cmp_eq_u32 vcc_lo, -1
|         s_cselect_b32 s4, -1, 0
|         v_cndmask_b32_e64 v0, 0, 1, s4
|         s_setpc_b64 s[30:31]
| .Lfunc_end1:
|         .size   _Z15ptx_thread_voteff, .Lfunc_end1-_Z15ptx_thread_voteff
|                                    ; -- End function
|
| What were the safety concerns you had? This code seems to
| be something like `return __all_sync(rSq >= rCritSq) ? 1
| : 0`, right?
| lmeyerov wrote:
| It's supposed to be waiting for all threads to vote
|
| I'm not familiar with AMD enough to know if additional
| synchronization is needed. ChatGPT recommended adding
| barriers beyond what that gave, but again, I'm not
| familiar with AMD commands.
| ckitching wrote:
| Indeed, no extra synchronisation is needed here due to
| the nature of the hardware (threads in a warp can't get
| out of sync with each other).
|
| Even on NVIDIA, you could've written this without the asm
| as discussed above!
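| (Concretely, a sketch of the asm-free version being alluded to,
| using the standard CUDA warp-vote intrinsic; the full-warp mask
| mirrors the 0xffffffff in the original PTX:)
|
|     __device__ unsigned int thread_vote(float rSq, float rCritSq) {
|         // __all_sync is nonzero iff the predicate holds on every
|         // thread named in the mask.
|         return __all_sync(0xffffffffu, rSq >= rCritSq) ? 1u : 0u;
|     }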
| lmeyerov wrote:
| Yeah I think, after this snippet was written, cuda added
| __all_sync as an intrinsic. The divergent code before
| this was plain-ish cuda, and this snippet ensures they
| wait on the comparison vote before recurring.
|
| So in the AMD version, the compiler correctly realized
| the synchronization was on the comparison, so adds the
| AMD version right before it. That seems like a
| straightforward transform here.
|
| It'd be interesting to understand the comparison of what
| Nvidia primitives map vs what doesn't. The above is a
| fairly simple barrier. We avoided PTX as much as we could
| and wrote it as simply as we could, I'd expect most of
| our PTX to port for similar reasons. The story is a bit
| diff for libraries we call. E.g., cudf probably has
| little compute-tier ptx directly, but will call nvidia
| libs, and use weird IO bits like cufile / gpu direct
| storage.
| Moldoteck wrote:
| It's speculation, but I think it's similar to processors:
| nobody guarantees the code will run exactly the way you set it
| up. You may want to use some specific register, but if the
| processor thinks another register can fulfill the task, it'll
| use that one while still reporting that your code executed as
| expected. Maybe the internal GPU machinery of AMD can
| sufficiently simulate the behavior of nvidia hardware so that
| higher abstractions are unaware that something different is
| happening under the hood.
| consf wrote:
| It involves significant challenges
| shmerl wrote:
| Compiler isn't open source? That feels like DOA in this day and
| age. There is ZLUDA already which is open.
|
| If they plan to open it up, it can be something useful to add to
| options of breaking CUDA lock-in.
| uyzstvqs wrote:
| ZLUDA is pretty good, except that it lacks cuDNN which makes
| most PyTorch projects just not work. Not sure if this project
| does cover that? That could be a game changer, otherwise yeah
| ZLUDA is the better open-source option.
| cheptsov wrote:
| Sounds really awesome. Any chance someone can suggest if this
| works also inside a Docker container?
| ckitching wrote:
| It works exactly as well as other AMDGPU-related software (HIP
| etc.) works inside Docker.
|
| There are some delightful AMD driver issues that make certain
| models of GPU intermittently freeze the kernel when used from
| docker. That was great fun when building SCALE's CI system :D.
| cheptsov wrote:
| Would love to give it a try! Thanks for answering my
| question.
| SushiHippie wrote:
| Works as described in the rocm documentation (at least the
| scaleinfo worked for me, haven't tested further)
|
| https://rocm.docs.amd.com/projects/install-on-linux/en/lates...
| cheptsov wrote:
| Thank you! This link is very helpful.
| cheptsov wrote:
| Wow, somebody doesn't like Docker enough to downvote my
| question.
| resters wrote:
| The main cause of Nvidia's crazy valuation is AMD's unwillingness
| to invest in making its GPUs as useful as Nvidia's for ML.
|
| Maybe AMD fears antitrust action, or maybe there is something
| about its underlying hardware approach that would limit
| competitiveness, but the company seems to have left billions of
| dollars on the table during the crypto mining GPU demand spike
| and now during the AI boom demand spike.
| karolist wrote:
| I think this could be cultural differences; AMD's software
| department has been underfunded and doing poorly for a long
| time now.
|
| * https://www.levels.fyi/companies/amd/salaries/software-
| engin...
|
| * https://www.levels.fyi/companies/nvidia/salaries/software-
| en...
|
| And it's probably better now. Nvidia was paying much more long
| before this, and their growing stock attracts even more talent.
| 1024core wrote:
| > I think this could be cultural differences, AMD's software
| department is underfunded and doing poorly for a long time
| now.
|
| Rumor is that ML engineers (that AMD really needs) are
| expensive; and AMD doesn't want to give them more money than
| the rest of the SWEs they have (for fear of pissing off the existing
| SWEs). So AMD is caught in a bind: can't pay to get top MLE
| talent and can't just sit by and watch NVDA eat its lunch.
| mepian wrote:
| AMD recently acquired Silo AI.
| karolist wrote:
| I find this strange to believe. Every big company has
| levels, unless your existing L7+ IC is below market, you
| can just pull L7+ salaried ML engineers with some secret
| signing bonus like literally everyone else.
| Der_Einzige wrote:
| The dirty secret in the tech industry is that AMD, Intel,
| IBM, and historically Nvidia/Oracle (this changed post 2022)
| were the 2nd-3rd tier tech companies. Staffed heavily by
| rejects from the FAANGs,
| they were still happy to have their 100-200K in their
| MCOL areas, but no free food and a much more boring work
| culture. Intel's "great place to work" corporate
| propaganda was known as "great place to leetcode" while I
| worked there, as Intel was always seen as a stepping
| stone before you "made it" in a FAANG.
|
| Culturally, none of these companies were happy to pay
| anyone except the tip, top "distinguished" engineers more
| than 300K. AMD seems to be stuck in this mentality, just
| as IBM is.
| quotemstr wrote:
| > AMD seems to be stuck in this mentality, just as IBM
| is.
|
| And that's why creative destruction is essential for
| technological progress. It's common for organizations to
| get stuck in stable-but-suboptimal social equilibria:
| everyone knows there's a problem but nobody can fix it.
| The only way out is to make a new organization and let
| the old one die.
| xboxnolifes wrote:
| > So AMD is caught in a bind: can't pay to get top MLE
| talent and can't just sit by and watch NVDA eat its lunch.
|
| This isn't being caught in a bind. This is, if true, just
| making a poor decision. Nothing is _really_ preventing them
| from paying more for specialized work.
| DaoVeles wrote:
| So nothing has changed since the era of ATI.
| dist-epoch wrote:
| There are stories from credible sources that AMD software
| engineers had to buy AMD GPUs with their own money to use in CI
| machines.
| ClassyJacket wrote:
| I like to watch YouTube retrospectives on old failed tech
| companies - LGR has some good ones.
|
| When I think of AMD ignoring machine learning, I can't help
| imagine a future YouTuber's voiceover explaining how this
| caused their downfall.
|
| There's a tendency sometimes to think "they know what they're
| doing, they must have good reasons". And sometimes that's
| right, and sometimes that's wrong. Perhaps there's some great
| technical, legal, or economic reason I'm just not aware of. But
| when you actually look into these things, it's surprising how
| often the answer is indeed just shortsightedness.
|
| They could end up like BlackBerry, Blockbuster, Nokia, and
| Kodak. I guess it's not quite as severe, since they will still
| have a market in games and therefore may well continue to
| exist, but it will still be looked back on as a colossal
| mistake.
|
| Same with Toyota ignoring electric cars.
|
| I'm not an investor, but I still have stakes in the sense that
| Nvidia has no significant competition in the machine learning
| space, and that sucks. GPU prices are sky high and there's
| nobody else to turn to if there's something about Nvidia you
| just don't like or if they decide to screw us.
| robertlagrant wrote:
| There's also just the idea of endeavour - Nvidia tried
| something, and it worked. Businesses (or rather their
| shareholders) take risks with their capital sometimes, and it
| doesn't always work. But in this case it did.
| robocat wrote:
| And NVidia has a reputation for going all-in on certain
| market decisions. That is hard to compete against when the
| bet works.
| _boffin_ wrote:
| If you haven't heard of this book, you might like it: Dealers
| of Lightning.
| hedora wrote:
| In fairness to AMD, they bet on crypto, and nvidia bet on AI.
| Crypto was the right short term bet.
|
| Also, ignoring is a strong word: I'm staring at a little <<
| $1000, silent 53 watt mini-PC with an AMD SoC. It has an NPU
| comparable to an M1. In a few months, with the ryzen 9000
| series, NPUs for devices of its class will bump from 16 tops
| to 50 tops.
|
| I'm pretty sure the linux taint bit is off, and everything
| just worked out of the box.
| daedrdev wrote:
| Toyota is extremely strong in the hybrid car market, and with
| ravenous competition for electric cars and slowing demand
| Toyota may have made the right decision after all
| gukov wrote:
| The companies' CEOs are related. My conspiracy theory is that
| they don't want to step on each other's toes. Not sure if that
| works with fiduciary duty, though.
| arendtio wrote:
| I searched for it and found this (in case someone else might
| want to read it):
|
| https://www.tomshardware.com/news/jensen-huang-and-lisa-
| su-f...
| anticensor wrote:
| AMD fears anti-collusion action; remember, the CEOs of the two
| are just barely distant enough in kinship not to be
| automatically considered to be colluding with each other.
| paulmist wrote:
| Doesn't seem to mention CDNA?
| JonChesterfield wrote:
| This is technically feasible so might be the real thing. Parsing
| inline ptx and mapping that onto amdgpu would be a huge pain.
|
| Working from cuda source that doesn't use inline ptx to target
| amdgpu is roughly regex find and replace to get hip, which has
| implemented pretty much the same functionality.
|
| Some of the details would be dubious, e.g. the atomic models
| probably don't match, and volta has a different instruction
| pointer model, but it could all be done correctly.
|
| Amd won't do this. Cuda isn't a very nice thing in general and
| the legal team would have kittens. But other people totally
| could.
| ckitching wrote:
| [I work on SCALE]
|
| Mapping inline ptx to AMD machine code would indeed _suck_.
| Converting it to LLVM IR right at the start of compilation
| (when the initial IR is being generated) is much simpler, since
| it then gets "compiled forward" with the rest of the code.
| It's as if you wrote C++/intrinsics/whatever instead.
|
| Note that nvcc accepts a different dialect of C++ from clang
| (and hence hipcc), so there is in fact more that separates CUDA
| from hip (at the language level) than just find/replace. We
| discuss this a little in [the manual](https://docs.scale-
| lang.com/manual/dialects/)
|
| Handling differences between the atomic models is, indeed,
| "fun". But since CUDA is a programming language with documented
| semantics for its memory consistency (and so is PTX) it is
| entirely possible to arrange for the compiler to "play by
| NVIDIA's rules".
| JonChesterfield wrote:
| Huh. Inline assembly is strongly associated in my mind with
| writing things that can't be represented in LLVM IR, but in
| the specific case of PTX - you can only write things that
| ptxas understands, and that probably rules out wide classes
| of horrendous behaviour. Raw bytes being used for
| instructions and for data, ad hoc self modifying code and so
| forth.
|
| I believe nvcc is roughly an antique clang build hacked out
| of all recognition. I remember it rejecting templates with
| 'I' as the type name and working when changing to 'T',
| nonsense like that. The HIP language probably corresponds
| pretty closely to clang's cuda implementation in terms of
| semantics (a lot of the control flow in clang treats them
| identically), but I don't believe an exact match to nvcc was
| considered particularly necessary for the clang -x cuda work.
|
| The ptx to llvm IR approach is clever. I think upstream would
| be game for that, feel free to tag me on reviews if you want
| to get that divergence out of your local codebase.
| saagarjha wrote:
| nvcc is nowhere near that bad these days, it supports most
| C++ code directly (for example, I've written kernels that
| include headers like <span> or <algorithm> and they work
| just fine).
| ckitching wrote:
| NVCC is doing much better than before in terms of "broken
| C++". There was indeed a time when lots of modern C++
| just _didn't work_.
|
| Nowadays the issues are more subtle and nasty. Subtle
| differences in overload resolution. Subtle differences in
| lambda handling. Enough to break code in "spicy" ways
| when you try to port it over.
| ckitching wrote:
| I certainly would not attempt this feat with x86 `asm`
| blocks :D. PTX is indeed very pedestrian: it's more like IR
| than machine code, really. All the usual "machine-level
| craziness" that would otherwise make this impossible is
| just unrepresentable in PTX (though you do run into cases
| of "oopsie, AMD don't have hardware for this so we have to
| do something insane").
| JonChesterfield wrote:
| It's a beautiful answer to a deeply annoying language
| feature. I absolutely love it. Yes, inline asm containing
| PTX definitely should be burned off at the compiler front
| end, regardless of whether it ultimately codegens as PTX
| or something else.
|
| I've spawned a thread on the llvm board asking if anyone
| else wants that as a feature
| https://discourse.llvm.org/t/fexpand-inline-ptx-as-a-
| feature... in the upstream. That doesn't feel great -
| you've done something clever in a proprietary compiler
| and I'm suggesting upstream reimplement it - so I hope
| that doesn't cause you any distress. AMD is relatively
| unlikely to greenlight me writing it so it's _probably_
| just more marketing unless other people are keen to parse
| asm in string literals.
| ur-whale wrote:
| If this actually works (remains to be seen), I can only say:
| 1) Kudos 2) Finally !
| gedy wrote:
| or: 1) CUDAs
| anthonix1 wrote:
| I just tried it with llm.c ... seems to be missing quite a few
| key components such as cublaslt, bfloat16 support, nvtx3,
| compiler flags such as -t
|
| And its linked against an old release of ROCm.
|
| So unclear to me how it is supposed to be an improvement over
| something like hipify
| ckitching wrote:
| Greetings, I work on SCALE.
|
| It appears we implemented `--threads` but not `-t` for the
| compiler flag. Oeps. In either case, the flag has no effect
| at present, since fatbinary support is still in development,
| and that's the only part of the process that could
| conceivably be parallelised.
|
| _That said_: clang (and hence the SCALE compiler) tends to
| compile CUDA much faster than nvcc does, so this lack of the
| parallelism feature is less problematic than it might at
| first seem.
|
| NVTX support (if you want more than just "no-ops to make the
| code compile") requires cooperation with the authors of
| profilers etc., which has not so far been available
|
| bfloat16 is not properly supported by AMD anyway: the
| hardware doesn't do it, and HIP's implementation just lies
| and does the math in `float`. For that reason we haven't
| prioritised putting together the API.
|
| cublasLt is a fair cop. We've got a ticket :D.
| anthonix1 wrote:
| Hi, why do you believe that bfloat16 is not supported? Can
| you please provide some references (specifically the part
| about the hardware "doesn't do it")?
|
| For the hardware you are focussing on (gfx11), the
| reference manual [2] and the list of LLVM gfx11
| instructions supported [1] describe the bfloat16 vdot &
| WMMA operations, and these are in fact implemented and
| working in various software such as composable kernels and
| rocBLAS, which I have used (and can guarantee they are not
| simply being run as float). I've also used these in the AMD
| fork of llm.c [3]
|
| Outside of gfx11, I have also used bfloat16 in CDNA2 & 3
| devices, and they are working and being supported.
|
| Regarding cublasLt, what is your plan for support there?
| Pass everything through to hipblasLt (hipify style) or
| something else?
|
| Cheers, -A
|
| [1] https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX11.html [2]
| https://www.amd.com/content/dam/amd/en/documents/radeon-
| tech... [3] http://github.com/anthonix/llm.c
| ckitching wrote:
| > Hi, why do you believe that bfloat16 is not supported?
|
| Apologies, I appear to be talking nonsense. I conflated
| bfloat16 with nvidia's other wacky floating point
| formats. This is probably my cue to stop answering
| reddit/HN comments and go to bed. :D
|
| So: ahem: bfloat16 support is basically just missing the
| fairly boring header.
|
| > Regarding cublasLt, what is your plan for support
| there? Pass everything through to hipblasLt (hipify
| style) or something else?
|
| Pretty much that, yes. Not much point reimplementing all
| the math libraries when AMD is doing that part of the
| legwork already.
| anthonix1 wrote:
| OK, so in the case of llm.c, if you're just including the
| HIP headers, using hipblasLt, etc, what would be the
| benefit of using scale instead of hipify?
| Straw wrote:
| Seems like a big benefit would come from not forking the
| codebase into two versions!
| ashvardanian wrote:
| It's great that there is a page about current limitations [1],
| but I am afraid that what most people describe as "CUDA" is a
| small subset of the real CUDA functionality. Would be great to
| have a comparison table for advanced features like warp shuffles,
| atomics, DPX, TMA, MMA, etc. Ideally a table, mapping every PTX
| instruction to a direct RDNA counterpart or a list of
| instructions used to emulate it.
|
| [1]: https://docs.scale-lang.com/manual/differences/
| ckitching wrote:
| You're right that most people only use a small subset of cuda:
| we prioritised support for features based on what was needed for
| various open-source projects, as a way to try to capture the
| most common things first.
|
| A complete API comparison table is coming soon, I believe. :D
|
| In a nutshell:
|
| - DPX: Yes.
| - Shuffles: Yes. Including the PTX versions, with all their
|   weird/wacky/insane arguments.
| - Atomics: Yes, except the 128-bit atomics nvidia added very
|   recently.
| - MMA: In development, though of course we can't fix the fact
|   that nvidia's hardware in this area is just _better_ than
|   AMD's, so don't expect performance to be as good in all cases.
| - TMA: On the same branch as MMA, though it'll just be using
|   AMD's async copy instructions.
|
| > mapping every PTX instruction to a direct RDNA counterpart or
| a list of instructions used to emulate it.
|
| We plan to publish a compatibility table of which instructions
| are supported, but a list of the instructions used to produce
| each PTX instruction is not in general meaningful. The inline
| PTX handler works by converting the PTX block to LLVM IR at the
| start of compilation (at the same time the rest of your code
| gets turned into IR), so it then "compiles forward" with the
| rest of the program. As a result, the actual instructions
| chosen vary on a case-by-case basis due to the whims of the
| optimiser. This design in principle produces better performance
| than a hypothetical solution that turned PTX asm into AMD asm,
| because it conveniently eliminates the optimisation barrier an
| asm block typically represents. Care, of course, is taken to
| handle the wacky memory consistency concerns that this implies!
|
| We're documenting which ones are expected to perform worse than
| on NVIDIA, though!
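|
| To make that concrete, here's a rough, hypothetical example of
| the sort of input the inline-PTX handler has to cope with - a
| trivial kernel doing its addition through a PTX block:
|
|         __global__ void add_ptx(int* out, const int* a,
|                                 const int* b) {
|             int i = threadIdx.x;
|             int r;
|             // The string below is parsed and lowered to LLVM IR
|             // up front, rather than passed through as asm text.
|             asm volatile("add.s32 %0, %1, %2;"
|                          : "=r"(r)
|                          : "r"(a[i]), "r"(b[i]));
|             out[i] = r;
|         }
|
| Because the asm block stops being an opaque blob, the optimiser
| is free to fold it into the surrounding code, which is where the
| performance claim above comes from.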
| ashvardanian wrote:
| Have you seen anyone productively using TMA on Nvidia or
| async instructions on AMD? I'm currently looking at a 60%
| throughput degradation for 2D inputs on H100:
| https://github.com/ashvardanian/scaling-
| democracy/blob/a8092...
| einpoklum wrote:
| > You're right that most people only use a small subset of
| cuda
|
| This is true first and foremost for the host-side API. From
| my StackOverflow and NVIDIA forums experience - I'm often the
| first and only person to ask about any number of nooks and
| crannies of the CUDA Driver API, with issues which nobody
| seems to have stumbled onto before; or at least - not
| stumbled and written anything in public about it.
| ckitching wrote:
| Oh yes, we found all kinds of bugs in Nvidia's cuda
| implementation during this project :D.
|
| There's a bunch of pretty obscure functions in the device
| side apis too: some esoteric math functions, old simd
| "intrinsics" that are mostly irrelevant with modern
| compilers, etc.
| qwerty456127 wrote:
| > gfx1030, gfx1100, gfx1010, gfx1101, gfx900...
|
| How do I find out which one I have?
| ckitching wrote:
| Like this:
|
| https://docs.scale-lang.com/manual/how-to-use/#identifying-g...
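|
| If you already have ROCm installed, you can also query it
| programmatically. A small, untested HIP snippet - nothing
| SCALE-specific - that prints the architecture name of each
| device:
|
|         #include <hip/hip_runtime.h>
|         #include <cstdio>
|
|         int main() {
|             int count = 0;
|             hipGetDeviceCount(&count);
|             for (int i = 0; i < count; ++i) {
|                 hipDeviceProp_t prop;
|                 hipGetDeviceProperties(&prop, i);
|                 // gcnArchName is e.g. "gfx1100", possibly with
|                 // target feature suffixes appended.
|                 std::printf("device %d: %s\n", i,
|                             prop.gcnArchName);
|             }
|             return 0;
|         }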
| systemBuilder wrote:
| gfx1101 : https://www.techpowerup.com/gpu-specs/amd-
| navi-32.g1000
|
| gfx1100 : https://www.techpowerup.com/gpu-specs/amd-
| navi-31.g998
|
| gfx1030 : https://www.techpowerup.com/gpu-specs/amd-
| navi-21.g923
|
| gfx1010 : https://www.techpowerup.com/gpu-specs/amd-
| navi-10.g861
|
| gfx900 : https://www.techpowerup.com/gpu-specs/amd-vega-10.g800
| galaxyLogic wrote:
| Companies selling CUDA software should no doubt adopt this tool
| yieldcrv wrote:
| the real question here is whether anybody has gotten cheap,
| easily available AMD GPUs to run their AI workloads, and if we
| can predict more people will do so
| JonChesterfield wrote:
| Microsoft have their production models running on amdgpu. I
| doubt it was easy but it's pretty compelling as an existence
| proof
| anthonix1 wrote:
| I ported Karpathy's llm.c repo to AMD devices [1], and have
| trained GPT2 from scratch with 10B tokens of fineweb-edu on a
| 4x 7900XTX machine in just a few hours (about $2 worth of
| electricity) [2].
|
| I've also trained the larger GPT2-XL model from scratch on
| bigger CDNA machines.
|
| Works fine.
|
| [1] https://github.com/anthonix/llm.c [2]
| https://x.com/zealandic1
| EGreg wrote:
| But the question is, can it also run SHUDA and WUDA?
| nabogh wrote:
| I've written a bit of CUDA before. If I want to go pretty bare-
| bones, what's the equivalent setup for writing code for my AMD
| card?
| JonChesterfield wrote:
| HIP works very similarly. Install rocm from your Linux
| distribution or from amd's repo, or build it from
| github.com/rocm. Has the nice feature of being pure userspace
| if you use the driver version that's already in your kernel.
|
| How turn-key / happy an experience that is depends on how
| closely your system correlates with one of the
| documented/tested distro versions and what GPU you have. If
| it's one that doesn't have binary versions of rocblas etc in
| the binary blob, either build rocm from source or don't bother
| with rocblas.
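|
| To give a flavour of "bare-bones": a minimal, untested HIP
| saxpy, built with hipcc, looks almost exactly like the CUDA
| equivalent with the header swapped:
|
|         #include <hip/hip_runtime.h>
|
|         __global__ void saxpy(int n, float a,
|                               const float* x, float* y) {
|             int i = blockIdx.x * blockDim.x + threadIdx.x;
|             if (i < n) y[i] = a * x[i] + y[i];
|         }
|
|         int main() {
|             const int n = 1 << 20;
|             float *x = nullptr, *y = nullptr;
|             hipMalloc((void**)&x, n * sizeof(float));
|             hipMalloc((void**)&y, n * sizeof(float));
|             // ... fill x and y, e.g. hipMemcpy from host ...
|             saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
|             hipDeviceSynchronize();
|             hipFree(x);
|             hipFree(y);
|             return 0;
|         }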
| spfd wrote:
| Very impressive!
|
| But I can't help but think: if something like this can be done
| to this extent, I wonder what went wrong/why it's a struggle for
| OpenCL to unify the two fragmented communities. While this is
| very practical and has a significant impact for people who
| develop GPGPU/AI applications, for the heterogeneous computing
| community as a whole, relying on/promoting a proprietary
| interface/API/language to become THE interface to work with
| different GPUs sounds like bad news.
|
| Can someone educate me on why OpenCL seems to be out of the
| picture in the comments/any of the recent discussions related
| to this topic?
| vedranm wrote:
| If you are going the "open standard" route, SYCL is much more
| modern than OpenCL and also nicer to work with.
| JonChesterfield wrote:
| Opencl gives you the subset of capability that a lot of
| different companies were confident they could implement. That
| subset turns out to be intensely annoying to program in - it's
| just the compiler saying no over and over again.
|
| Or you can compile as freestanding c++ with clang extensions
| and it works much like a CPU does. Or you can compile as cuda
| or openmp and most stuff you write actually turns into code,
| not a semantic error.
|
| Currently cuda holds the lead position but it should lose that
| place because it's horrible to work in (and to a lesser extent
| because more than one company knows how to make a GPU). Openmp
| is an interesting alternative - need to be a little careful to
| get fast code out but lots of things work somewhat intuitively.
|
| Personally, I think raw C++ is going to win out and the many
| heterogeneous languages will ultimately be dropped as basically
| a bad idea. But time will tell. Opencl looks very DoA.
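|
| For anyone curious what the OpenMP route mentioned above looks
| like in practice, here is an untested sketch (the offload flags
| differ between clang builds, so treat the build line as an
| assumption):
|
|         // Build with something like:
|         //   clang++ -O2 -fopenmp --offload-arch=gfx1100 ...
|         void saxpy(int n, float a, const float* x, float* y) {
|             #pragma omp target teams distribute parallel for \
|                 map(to: x[0:n]) map(tofrom: y[0:n])
|             for (int i = 0; i < n; ++i)
|                 y[i] = a * x[i] + y[i];
|         }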
| mschuetz wrote:
| OpenCL isn't nice to use and lacks tons of quality of life
| features. I wouldn't use it, even if it were twice as fast as
| CUDA.
| localfirst wrote:
| > SCALE does not require the CUDA program or its build system to
| be modified.
|
| how big of a deal is this?
| JonChesterfield wrote:
| People can be wildly hostile to changing their programs. The
| people who wrote it aren't here any more, the program was
| validated as-is, changing it tends to stop the magic thing
| working and so forth.
|
| That changing the compiler is strongly equivalent to changing
| the source doesn't necessarily influence this pattern of
| thinking. Customer requests to keep the performance gains from
| a new compiler but not change the UB they were relying on with
| the old are definitely a thing.
| rjurney wrote:
| If it's efficient, this is very good for competition.
| ekelsen wrote:
| A major component of many CUDA programs these days involves NCCL
| and high bandwidth intra-node communication.
|
| Does NCCL just work? If not, what would be involved in getting it
| to work?
| pjmlp wrote:
| This targets CUDA C++, not CUDA the NVIDIA infrastructure for C,
| C++, Fortran, and anything else targeting PTX.
| ckitching wrote:
| The CUDA C APIs are supported as much in C as in C++ using
| SCALE!
|
| CUDA Fortran is not currently supported by SCALE since we
| haven't seen much use of it "in the wild" to push it up our
| priority list.
| anon291 wrote:
| It doesn't matter though. NVIDIA distributes tons of
| libraries built atop CUDA that you cannot distribute or use
| on AMD chips legally. Cutlass, CuBLAS, NCCL, etc.
| tama_sala wrote:
| Correct, which is one of the main moats Nvidia has when it
| comes to training.
| ckitching wrote:
| SCALE doesn't use cuBLAS and friends. For those APIs, it
| uses either its own implementations of the functions, or
| delegates to an existing AMD library (such as rocBLAS).
|
| It wouldn't even be technically possible for SCALE to
| distribute and use cuBLAS, since the source code is not
| available. I suppose you could distribute cuBLAS and run it
| through ZLUDA, but that would likely become legally
| troublesome.
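|
| To illustrate the delegation idea (purely a sketch, not SCALE's
| actual code, and with hypothetical wrapper types standing in
| for the real cuBLAS ones): a cuBLAS-style sgemm maps almost
| one-to-one onto rocBLAS:
|
|         #include <rocblas/rocblas.h>  // path varies by ROCm
|
|         struct FakeHandle { rocblas_handle roc; }; // made up
|         enum FakeOp { FAKE_OP_N, FAKE_OP_T };      // made up
|
|         static rocblas_operation to_roc(FakeOp op) {
|             return op == FAKE_OP_N
|                 ? rocblas_operation_none
|                 : rocblas_operation_transpose;
|         }
|
|         int fake_sgemm(FakeHandle h, FakeOp ta, FakeOp tb,
|                        int m, int n, int k, const float* alpha,
|                        const float* A, int lda,
|                        const float* B, int ldb,
|                        const float* beta, float* C, int ldc) {
|             // Parameter order matches the cuBLAS sgemm call.
|             return rocblas_sgemm(h.roc, to_roc(ta), to_roc(tb),
|                                  m, n, k, alpha, A, lda,
|                                  B, ldb, beta, C, ldc);
|         }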
| uptownfunk wrote:
| Very clearly the business motive makes sense: go after Nvidia's
| GPU monopoly. Can someone help a layperson understand the
| pitfalls here that prevent this from being an intelligent
| venture?
| JonChesterfield wrote:
| It's technically non-trivial and deeply irritating to implement
| in places as people expect bugward compatibility with cuda.
|
| Also nvidia might savage you with lawyers for threatening their
| revenue stream. Big companies can kill small ones by strangling
| them in the courts then paying the fine when they lose a decade
| later.
| einpoklum wrote:
| At my workplace, we were reluctant to make the choice between
| writing OpenCL and being AMD-compliant, but missing out on CUDA
| features and tooling; and writing CUDA and being vendor-locked.
|
| Our jerry-rigged solution for now is writing kernels that are the
| same source for both OpenCL and CUDA, with a few macros doing a
| bit of adaptation (e.g. the syntax for constructing a struct).
| This requires no special library or complicated runtime work -
| but it does have the downside of forcing our code to be C'ish
| rather than C++'ish, which is quite annoying if you want to write
| anything that's templated.
|
| Note that all of this regards device-side, not host-side, code.
| For the host-side, I would like, at some point, to take the
| modern-C++ CUDA API wrappers (https://github.com/eyalroz/cuda-
| api-wrappers/) and derive from them something which supports
| CUDA, OpenCL and maybe HIP/ROCm. Unfortunately, I don't have the
| free time to do this on my own, so if anyone is interested in
| collaborating on something like that, please drop me a line.
|
| -----
|
| You can find the OpenCL-that-is-also-CUDA mechanism at:
|
| https://github.com/eyalroz/gpu-kernel-runner/blob/main/kerne...
|
| and
|
| https://github.com/eyalroz/gpu-kernel-runner/blob/main/kerne...
|
| (the files are provided alongside a tool for testing, profiling
| and debugging individual kernels outside of their respective
| applications.)
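|
| For a sense of what the adaptation macros look like (a
| simplified sketch, not our exact headers; the macro names here
| are made up):
|
|         #ifdef __OPENCL_VERSION__
|           #define KERNEL        __kernel
|           #define GLOBAL        __global
|           // dimension 0 only, for brevity
|           #define GLOBAL_ID()   get_global_id(0)
|         #else  // CUDA
|           #define KERNEL        extern "C" __global__
|           #define GLOBAL
|           #define GLOBAL_ID() \
|               (blockIdx.x * blockDim.x + threadIdx.x)
|         #endif
|
|         KERNEL void scale_array(GLOBAL float* data,
|                                 float factor, int n)
|         {
|             int i = GLOBAL_ID();
|             if (i < n) data[i] = data[i] * factor;
|         }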
| JonChesterfield wrote:
| Freestanding c++ with compiler intrinsics is a nicer
| alternative. You can do things like take the address of a
| function.
|
| Use an interface over memory allocation/queue launch with
| implementations in cuda, hsa, opencl whatever.
|
| All the rest of the GPU side stuff is syntax sugar/salt over
| slightly weird semantics, totally possible to opt out of all of
| that.
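|
| A very rough sketch of that interface shape (the names and the
| launch signature are hand-waved; real implementations would do
| the vendor-specific launch plumbing underneath):
|
|         #include <cstddef>
|
|         struct DeviceQueue {
|             virtual void* alloc(std::size_t bytes) = 0;
|             virtual void  release(void* p) = 0;
|             // "kernel" is an opaque handle here; each backend
|             // (cuda, hsa, opencl) resolves and launches it in
|             // its own way.
|             virtual void  launch(const void* kernel,
|                                  void** args,
|                                  std::size_t blocks,
|                                  std::size_t threads) = 0;
|             virtual void  wait() = 0;
|             virtual ~DeviceQueue() = default;
|         };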
| stuaxo wrote:
| What's the licensing? Will I be able to run this as a hobbyist
| for free software?
| tallmed wrote:
| I wonder if this thing has anything in common with ZLUDA; it's
| permissively licensed, after all.
| EGreg wrote:
| Does it translate to OpenCL?
|
| This sounds like the DirectX vs OpenGL debate from when I was
| younger lol
| lukan wrote:
| Ok, so I just stumbled on this problem: I tried out Whisper
| (from OpenAI), but on my CPU, because I have no CUDA and the
| workarounds seem hacky. So the headline sounds good!
|
| But can this help me directly? Or would OpenAI have to use this
| tool for me to benefit?
|
| It is not immediately clear to me (but I am a beginner in this
| space).
___________________________________________________________________
(page generated 2024-07-16 23:00 UTC)