[HN Gopher] Run CUDA, unmodified, on AMD GPUs
       ___________________________________________________________________
        
       Run CUDA, unmodified, on AMD GPUs
        
       Author : Straw
       Score  : 1147 points
       Date   : 2024-07-15 19:05 UTC (1 day ago)
        
 (HTM) web link (docs.scale-lang.com)
 (TXT) w3m dump (docs.scale-lang.com)
        
       | dagmx wrote:
       | Has anyone tried this and knows how well it works? It definitely
       | sounds very compelling
        
       | arjvik wrote:
       | Who is this Spectral Compute, and where can we see more about
       | them?
        
         | msond wrote:
         | You can learn more about us on https://spectralcompute.co.uk
        
           | JonChesterfield wrote:
           | The branch free regex engine is an interesting idea. I would
           | have said that can't be implemented in finite code.
           | 
           | Compile to DFA by repeatedly differentiating then unroll the
           | machine? You'd still have back edges for the repeating
           | sections.
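            | 
            | Something like this for the derivative half, at least (a
            | minimal Python sketch of Brzozowski derivatives, not
            | Spectral's actual code): modulo simplification, a regex
            | has finitely many distinct derivatives, so enumerating
            | them yields the DFA whose transition table you'd then
            | unroll.
            | 
            |     # Tiny regex AST: None = empty set, "" = epsilon,
            |     # a 1-char string, ('cat',r,s), ('alt',r,s), ('star',r)
            |     def nullable(r):            # accepts ""?
            |         if r is None: return False
            |         if r == "": return True
            |         if isinstance(r, str): return False
            |         op = r[0]
            |         if op == 'cat':
            |             return nullable(r[1]) and nullable(r[2])
            |         if op == 'alt':
            |             return nullable(r[1]) or nullable(r[2])
            |         return True             # 'star'
            | 
            |     def deriv(r, c):            # derivative wrt char c
            |         if r is None or r == "": return None
            |         if isinstance(r, str):
            |             return "" if r == c else None
            |         op = r[0]
            |         if op == 'cat':
            |             left = ('cat', deriv(r[1], c), r[2])
            |             if nullable(r[1]):
            |                 return ('alt', left, deriv(r[2], c))
            |             return left
            |         if op == 'alt':
            |             return ('alt', deriv(r[1], c),
            |                     deriv(r[2], c))
            |         return ('cat', deriv(r[1], c), r)   # 'star'
            | 
            |     def matches(r, s):
            |         for c in s:
            |             r = deriv(r, c)
            |         return nullable(r)
            | 
            |     print(matches(('star', 'a'), "aaa"))    # True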
        
       | pixelpoet wrote:
       | Isn't this a bit legally dubious, like zluda?
        
         | janice1999 wrote:
         | It's advertised as a "clean room" re-implementation. What part
         | would be illegal?
        
           | ekelsen wrote:
           | If they had to reverse engineer any compiled code to do this,
           | I think that would be against licenses they had to agree to?
           | 
           | At least grounds for suing and starting an extensive
           | discovery process and possibly a costly injunction...
        
             | msond wrote:
             | We have not reverse engineered any compiled code in the
             | process of developing SCALE.
             | 
             | It was clean-room implemented purely from the API surface
             | and by trial-and-error with open CUDA code.
        
             | RockRobotRock wrote:
             | Isn't that exactly what a "clean room" approach avoids?
        
               | ekelsen wrote:
               | oh definitely. But if I was NVIDIA I'd want to verify
               | that in court after discovery rather than relying on
               | their claim on a website.
        
               | RockRobotRock wrote:
               | good point
        
               | ekelsen wrote:
               | FWIW, I think this is really great work and I wish only
               | the best for scale. Super impressed.
        
           | Keyframe wrote:
           | Can't run useful shit on it: https://docs.nvidia.com/deeplear
           | ning/cudnn/latest/reference/...
           | 
           | Namely:
           | 
           | "4.1 License Scope. The SDK is licensed for you to develop
           | applications only for use in systems with NVIDIA GPUs."
        
             | mkl wrote:
             | So add a cheap NVidia card alongside grunty AMD ones, and
             | check for its existence. It doesn't seem to say it needs to
             | run on NVidia GPUs.
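              | 
              | Something like this would do for the existence check (a
              | sketch in Python via ctypes; whether a $30 card in the
              | chassis satisfies clause 4.1 is a question for a
              | lawyer, not a script):
              | 
              |     # Count NVIDIA GPUs via the CUDA driver API;
              |     # returns 0 when no driver/device is present.
              |     import ctypes
              | 
              |     def nvidia_gpu_count():
              |         try:
              |             cuda = ctypes.CDLL("libcuda.so.1")
              |         except OSError:
              |             return 0             # no NVIDIA driver
              |         if cuda.cuInit(0) != 0:  # 0 == CUDA_SUCCESS
              |             return 0
              |         n = ctypes.c_int(0)
              |         cuda.cuDeviceGetCount(ctypes.byref(n))
              |         return n.value
              | 
              |     print("NVIDIA GPUs found:", nvidia_gpu_count())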
        
               | Keyframe wrote:
               | Heh, true. On the other hand, I bet companies are eager
               | to challenge the wrath of a $3T company for a promise of
               | "maybe it'll work, not all of it but at least it'll run
               | worse, at least for now".
        
             | JonChesterfield wrote:
             | I don't think the terms of the Nvidia SDK can restrict
             | running software without said SDK. Nvidia's libraries don't
             | seem to be involved here. Their hardware isn't involved
             | either. It's just some ascii in a bunch of text files being
             | hacked around with before running on someone else's
             | hardware.
        
       | adzm wrote:
       | I'd love to see some benchmarks but this is something the market
       | has been yearning for.
        
         | msond wrote:
          | We're putting together benchmarks to publish at a later
          | time, and we've asked some independent third parties to
          | produce their own as well.
        
       | acheong08 wrote:
        | Impressive if true. Unfortunately it's not open source and
        | scarce on exact details of how it works.
       | 
       | Edit: not sure why I just sort of expect projects to be open
       | source or at least source available these days.
        
         | tempaccount420 wrote:
         | They might be hoping to be acquired by AMD
        
         | ipsum2 wrote:
         | They're using Docusaurus[1] for their website, which is most
         | commonly used with open source projects.
         | 
         | https://docusaurus.io/docs
        
           | msond wrote:
            | Actually, we use MkDocs and the excellent Material for
            | MkDocs theme: https://squidfunk.github.io/mkdocs-material/
        
         | msond wrote:
          | We're going to be publishing more details in later blog
          | posts and documentation about how this works and how we've
          | built it.
         | 
          | Yes, we're not open source; however, our license is very
         | permissive. It's both in the software distribution and viewable
         | online at https://docs.scale-lang.com/licensing/
        
           | breck wrote:
           | How about trying _Early_ Source?
           | 
           | It's open source with a long delay, but paying users get the
           | latest updates.
           | 
           | Make the git repo from "today - N years" open source, where N
           | is something like 1 or 2.
           | 
            | That way, students can learn on old versions, and when they
            | grow into professionals they can pay for access to the
            | cutting-edge builds.
           | 
           | Win win win win
           | 
           | ( https://breckyunits.com/earlySource.html)
        
             | msond wrote:
             | We're still thinking about our approach but this is a nice
             | suggestion, thank you.
             | 
             | I'm curious, for what reasons are you interested in the
             | source code yourself?
        
               | mindcrime wrote:
               | I'm not the person you replied to, and I can't speak for
               | them. But I can say that for myself, and a not small
               | number of other people, it's an ideological issue. I
                | simply do not use software that isn't F/OSS, to the
                | greatest extent possible. I might use a VERY small
                | amount of non-F/OSS stuff, but it's very hard to get me
                | to adopt something new if it isn't.
               | 
               | Now should you make business decisions based on that?
               | Probably not. But while I don't claim to be a
               | representative sample, I am pretty sure the number of
               | people who share my beliefs in this regard is
               | substantially "non zero". _shrug_
        
               | atq2119 wrote:
                | Not GP, but guaranteed source availability means users
               | can fix issues themselves in the future if the original
               | provider goes belly-up.
        
               | breck wrote:
               | > I'm curious, for what reasons are you interested in the
               | source code yourself?
               | 
               | I am the founder/editor of PLDB. So I try to do my best
               | to help people "build the next great programming
               | language".
               | 
               | We clone the git repos of over 1,000 compilers and
               | interpreters and use cloc to determine what languages the
               | people who are building languages are using. The people
                | who build languages obviously are the experts, so as
                | they go, so goes the world.
               | 
               | We call this measurement "Foundation Score". A Foundation
                | Score of 100 means 100 other languages use this language
               | somehow in their primary implementation.
               | 
               | It is utterly dominated by open source languages, and the
               | disparity is only getting more extreme.
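                | 
                | The mechanics are roughly this (a sketch, not PLDB's
                | exact pipeline; assumes git and cloc are installed):
                | 
                |     import json, subprocess, tempfile
                | 
                |     def language_breakdown(repo_url):
                |         with tempfile.TemporaryDirectory() as d:
                |             subprocess.run(
                |                 ["git", "clone", "--depth", "1",
                |                  repo_url, d], check=True)
                |             out = subprocess.run(
                |                 ["cloc", "--json", d], check=True,
                |                 capture_output=True, text=True)
                |         stats = json.loads(out.stdout)
                |         stats.pop("header", None)  # cloc metadata
                |         stats.pop("SUM", None)     # totals row
                |         return {lang: v["code"]
                |                 for lang, v in stats.items()}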
               | 
               | You can see for yourself here:
               | 
               | https://pldb.io/lists/explorer.html#columns=rank~name~id~
               | app...
               | 
               | Some that might have become irrelevant have gained a
               | second wind after going open source.
               | 
               | But some keep falling further behind.
               | 
               | I look at Mathematica, a very powerful and amazing
               | language, and it makes me sad to see so few other
                | language designers using it, and the reason is that
                | it's closed source. So it is not doing so hot, and
                | that's a language from one of the world's smartest and
                | most prolific thinkers, one that's been around for
                | decades.
               | 
               | I don't see a way for a new language to catch on nowadays
               | that is not open source.
        
               | msond wrote:
               | Very interesting, thank you for sharing!
               | 
               | We do believe in open source software and we do want to
               | move the GPGPU market away from fully closed languages.
               | The future is open for discussion but regardless, the
                | status quo at the moment is a proprietary and dominant
               | implementation which only supports a single vendor.
               | 
               | > I don't see a way for a new language to catch on
               | nowadays that is not open source.
               | 
               | I do note that CUDA is itself closed source -- while
               | there's an open source implementation in the LLVM
               | project, it is not as bleeding edge as NVIDIA's own.
        
               | breck wrote:
               | > I do note that CUDA is itself closed source
               | 
                | And this is a good point. However, it also has a 17-year
                | head start, and many of those years were spent developing
               | before people realized what a huge market there was.
               | 
               | All it will take is one committed genius to create an
               | open source alternative to CUDA to dethrone it.
               | 
               | But they would have to have some Mojo (hint hint) to pull
               | that off.
        
               | idonotknowwhy wrote:
                | I'm a big fan of open source for most things, but if what
               | you've got actually works, you could probably earn big
               | money selling it. The biggest companies in the world are
               | building / using this sort of thing.
               | 
                | Imagine the shift of capital if, for example, Intel GPUs
                | suddenly had the same ML software compatibility as Nvidia.
        
         | TaylorAlexander wrote:
         | Makes sense to expect this kind of thing to be open source. The
         | whole point of providing improved compatibility is to make
         | people's lives easier, and open source is usually an important
         | feature to ensure wide compatibility. It also means projects
         | can live on after the creators move to other things, people can
          | submit patches for important features or bug fixes, and the
          | system generally becomes much more useful.
        
           | dylan604 wrote:
           | I don't find it wrong for someone to attempt to make money
           | back on their time and experience of doing the work. I don't
           | mind people that offer that back as open source either.
            | However, I do have a problem with people expecting
            | everything to be open/free, especially those who then go
            | on a crusade chastising those who do try to make money.
        
             | TaylorAlexander wrote:
             | I'm really trying to keep this about the engineering
             | features of a system rather than moral judgments. Open
             | source systems are simply more flexible and adaptable than
             | proprietary systems, which have their own benefits. In
             | today's world, the engineering value of open source systems
             | is becoming so important that people are looking for other
             | ways to provide for the developers creating these systems.
             | It can be surprising when a project creator builds
             | something in an area that is usually all open source, but
             | they choose a proprietary path. Just look at the problems
             | created by NVIDIA for their use of proprietary software in
             | CUDA and their GPUs. This software is an attempt to fix
             | issues created by proprietary software with another piece
             | of proprietary software, which is if nothing else an
             | interesting decision.
        
               | dylan604 wrote:
               | UNIX wasn't free. Windows wasn't free. It wasn't until
               | some knucklehead came along and did something abnormal
               | and gave away their thing. Bakers don't give away their
               | goods. Mechanics don't typically repair things for free.
               | Builders don't build things for free. Gas stations don't
               | give away gas.
               | 
               | Why do we think all software should be free, and then
               | think that those that don't give it away are the abnormal
               | ones?
        
               | talldayo wrote:
               | > Why do we think all software should be free
               | 
               | Why do people return Windows laptops when they have to
               | pay for a Windows License Activation? Because every
               | single OEM pays for it; you don't _expect_ to buy Windows
               | because it is a failed B2C business model. Nobody wants
               | it. Same goes for proprietary UNIX, and people _wish_ it
               | was the case for Nvidia drivers. I own CUDA hardware and
               | lament the fact that cross-industry GPGPU died so FAANG
               | could sell licensed AI SDKs. The only thing stopping AI
               | from being  "free" is the limitations OEMs impose on
               | their hardware.
               | 
               | > that those that don't give it away are the abnormal
               | ones?
               | 
               | They are. Admit it; the internet is the new normal, if
               | your software isn't as "free" as opening a website,
               | you're weird. If I have to pay to access your little
               | forum, I won't use it. If I have to buy your app to see
               | what it's like, I'll never know what you're offering.
               | Part of what makes Nvidia's business model so successful
               | is that they _do_ "give away" CUDA to anyone that owns
               | their hardware. There is no developer fee or mandatory
               | licensing cost, it is plug-and-play with the hardware.
               | Same goes for OpenAI, they'd have never succeeded if you
               | had to buy "the ChatGPT App" from your App Store.
        
               | dylan604 wrote:
               | > Why do people return Windows laptops when they have to
               | pay for a Windows License Activation?
               | 
               | The internet echo chamber strikes again. Exactly how many
                | people are actually doing this? Not many, and they all
                | hang out together. The rest of the world just
               | blindly goes about their day using Windows while surfing
               | the web using Chrome. Sometimes, it's a good thing to get
               | outside your bubble. It's a big world out there, and not
                | everybody sees the world as you do.
        
               | talldayo wrote:
               | > The rest of the world just blindly goes about their day
               | using Windows while surfing the web using Chrome.
               | 
               | Paying for Windows? I think you missed my point. If your
               | computer doesn't ship with an OS, paid or otherwise,
               | people think it's a glitch. The average consumer will
                | sooner return their laptop than buy a Windows license,
                | create install media from their old device, and flash
                | the new hardware with it. They'll get a Chromebook
                | instead; people don't _buy_ Windows today.
               | 
               | The internet has conditioned the majority of modern
               | technology users to reject and habitually avoid non-free
               | experiences. Ad-enabled free platforms and their
               | pervasive success is all the evidence you need.
               | Commercial software as it existed 20 or 30 years ago is a
               | dead business. Free reigns supreme.
        
               | dylan604 wrote:
               | Who/where/how does someone buy a laptop without an OS?
                | I'm just not able to follow the hypothetical path that
                | you are insisting on blazing.
        
               | hamilyon2 wrote:
                | That is kind of his point. You don't; Windows is
                | bundled with the laptop. Not that I agree with his
                | points. Windows, for example, isn't open source in the
                | remotest sense.
        
               | dylan604 wrote:
               | Dell offers laptops with a version of Linux preinstalled
               | and supports them. System76, Lenovo, Purism as well to
               | name a few. Apple also sells laptops without Windows on
               | them. There are actually quite a few options that do
               | this. If you don't want Windows, we have options now.
               | Yes, historically, it was Windows or Apple's OS, but
                | that's no longer true, and not recognizing that just
                | makes you look like you're pushing a false narrative on
                | the situation, for what purpose only you know.
        
               | alt227 wrote:
               | > Commercial software as it existed 20 or 30 years ago is
               | a dead business. Free reigns supreme.
               | 
               | What nonsense. Go into any business and you will find
               | every single piece of software they use is bought and
               | paid for with bells on. The 'Free World' you speak of is
               | only there to get you, an individual, used to using the
               | software so that businesses are made to purchase it. In
               | the old days we called this 'demo' or 'shareware'. Now
                | it's a 'free' or 'personal' tier subscription.
               | 
                | Go and ask any designer if their copy of Adobe Creative
                | Cloud, 3ds Max, or AutoCAD is free. Ask any office
                | worker about Microsoft Office (including Teams and
                | SharePoint etc.) or even Google Docs for business. The
                | majority of developers are running paid versions of
                | JetBrains tools. Running an online shop? Chances are
                | you are paying for Shopify, or for something like Zoho
                | to manage your customers and orders.
               | 
               | 'Free' as you put it is very much only in the online
               | individual consumer world, a very small part of the
               | software world.
               | 
               | The commercial software market is more alive and
               | expensive than it has ever been.
        
               | TaylorAlexander wrote:
               | > Bakers don't give away their goods. Mechanics don't
               | typically repair things for free. Builders don't build
               | things for free. Gas stations don't give away gas.
               | 
                | These all share the property of being scarce physical
                | goods or services. Software is not scarce
               | (though of course the labor to create it is), so this is
               | a really bad comparison.
               | 
               | And again I did not say it should or should not be free,
               | I said there are engineering benefits to open source
               | software and more and more people recognize those
               | benefits and choose to make things free because they see
               | the value and are willing to recognize the tradeoffs. I
               | never said what "should" be done. "Should" is kind of a
               | nonsense term when used in this way as it hides a lot of
               | assumptions, so I generally do not use it, and notably
               | did not use it in my comment. I want to point out the
               | peculiarity in your rather strong response to a word and
               | concept I never used. I think you are having an argument
               | with imagined people, not a discussion with me.
               | 
               | And for what it is worth, I am a robotics engineer and I
               | am designing a completely open source solar powered
               | farming robot designed to be made in a small shop in any
               | city in the world (see my profile), funded by a wealthy
               | robotics entrepreneur who recognizes the value in making
               | this technology available to people all over the world.
               | 
               | So I am one of those engineers making this choice, and
               | not someone just asking for things without doing the same
               | of my work. Everything I produce is open source,
                | including personal projects and even my personal writing.
        
               | dTal wrote:
               | Because software is information. It is closer to a
               | scientific paper than a loaf of bread, and I do expect
               | those to be free. I do not expect scientists to _work_
               | for free, but the marginal cost of copying their output
               | is 0 and the social benefit is huge.
               | 
               | Free software, like open science, clearly has something
               | going for it pragmatically. The developer hours put into
                | it have paid for themselves many times over.
               | Megacorps hire people to work on free software. If you
               | can't see the value, that's a you problem.
        
               | acuozzo wrote:
               | > the social benefit is huge
               | 
               | It will be interesting to see if this is the case in the
               | long run, assuming "huge" has a positive connotation in
               | your post, of course.
               | 
               | If AGI comes to pass and it winds up being a net negative
               | for humanity, then the ethics of any practice which
               | involves freely distributing information that can be
               | endlessly copied for very little cost must be
               | reevaluated.
        
               | TaylorAlexander wrote:
               | > If AGI comes to pass
               | 
               | Increasingly, I am not putting much weight in any
               | predictions about whether this will happen in the way we
               | think it will, or what it could possibly mean. We might
               | as well be talking about the rapture.
        
               | voidUpdate wrote:
               | If all software was free and made no money, how could
               | developers pay their bills?
        
               | TaylorAlexander wrote:
               | Free software is so important to society that I believe
               | the most reasonable solution is to provide for all people
                | without their needing to work for survival. Automate as much
               | as possible such that work is not compulsory, and enough
               | people simply want something to do (and possibly
               | additional pay depending on how the system is arranged)
               | that everything that needs to get done by people does get
               | done.
               | 
               | For now that is fiction, but so is "if all software was
               | free". I do think though that both would lead to a faster
               | rate of innovation in society versus one where critical
               | information is withheld from society to pay someone's
               | rent and food bills.
        
               | einpoklum wrote:
               | Most software is free and makes no money - and that has
               | always been the case. There are some very popular and
               | widely-used non-free systems, but most software isn't
               | that, and its developers still pay the bills.
               | 
               | This is somewhat analogous to music or books/literature.
               | Most composers and performers and authors make no money
               | from people copying and sharing their works. Some pay the
               | bills working professionally for entities who want their
               | product enough to pay for it; some do other things in
               | life. Some indeed give up their work on music because
               | they can't afford to not do more gainful work. And still,
               | neither music nor books go away as copying them gets
               | closer to being free.
        
               | voidUpdate wrote:
               | If my current employer can't make any money from the code
               | we write, then it would collapse faster than a souffle
               | taken out of the oven too early, and I would be out of a
               | job
        
               | einpoklum wrote:
               | That does not contradict my point... also, there are
               | other ways to make money from writing code than forcing
               | people to pay for copies of that code.
        
               | napoleongl wrote:
                | OTOH, recipes and drawings are commonly available for
                | free. So if you can do it yourself, the cake and the
                | engine repair are free. But if you need support, you
                | can get someone to bake or build for you.
        
           | nicce wrote:
           | > Makes sense to expect this kind of thing to be open source.
           | The whole point of providing improved compatibility is to
           | make people's lives easier, and open source is usually an
           | important feature to ensure wide compatibility. It also means
           | projects can live on after the creator
           | 
            | AMD just bought a company working on similar things for
            | more than $600M.
        
         | dheera wrote:
         | Also, can I even buy an AMD GPU? I don't see a "buy now" button
         | or a PCIe version anywhere here
         | 
         | https://www.amd.com/en/products/accelerators/instinct/mi300/...
         | 
         | Another big AMD fuckup in my opinion. Nobody is going to drop
         | millions on these things without being able to test them out
         | first.
         | 
         | First rule of sales: If you have something for sale, take my
         | money.
        
           | nwiswell wrote:
           | > I don't see a "buy now" button or a PCIe version anywhere
           | here
           | 
           | "Buy now" buttons and online shopping carts are not generally
           | how organizations looking to spend serious money on AI buy
           | their hardware.
           | 
           | They have a long list of server hardware partners, and odds
           | are you'd already have an existing relationship with one or
           | more of them, and they'd provide a quote.
           | 
           | They even go one step further and show off some of their
           | partners' solutions:
           | 
           | https://www.amd.com/en/graphics/servers-instinct-deep-
           | learni...
           | 
           | FWIW I believe Supermicro and Exxact actually do have web-
           | based shopping carts these days, so maybe you could skip the
           | quotation and buy directly if you were so motivated? Seems
           | kind of weird at this price point.
           | 
           | https://www.exxactcorp.com/Exxact-TS4-185328443-E185328443
        
             | dheera wrote:
             | ... and that's why AMD is losing.
             | 
             | They could break the trend and offer a "buy now" button
             | instead of offering quotes and coffee chats. It's very
             | likely that will kickstart the software snowball with early
             | adopters.
             | 
             | Nobody is going to drop millions on an unproven platform.
             | 
             | > Seems kind of weird at this price point.
             | 
             | Yeah that $234K server is too much for people to do a
             | trial. It has 8xMI300X GPUs along with a bunch of other
             | shit.
             | 
             | Give me a single MI300X GPU in PCIe form factor for $20K
              | and I'd very seriously consider it. I'm sure there are
              | many people who would help adapt the ecosystem if they
              | were truly available.
        
               | nwiswell wrote:
               | Why would you be looking to dip your toe into the AMD
               | ecosystem for the first time using an MI300X? It doesn't
               | make any sense. It's not entry level hardware.
        
               | dheera wrote:
               | To help fix the ecosystem. It's way more affordable than
               | Nvidia.
               | 
               | I'm not looking for entry level hardware.
        
               | nwiswell wrote:
               | Yes, that's why you'd choose AMD, I'm saying that you
               | don't enter the ecosystem for the first time by
               | purchasing the absolute cutting edge hardware.
               | 
               | As far as I'm aware you can't simply buy an Nvidia B200
               | PCIe card over the counter, either.
        
               | dheera wrote:
               | I'm not looking to enter the ecosystem, I'm already deep
               | in it and want to fix the AMD problem so that I can build
               | big projects around it and undercut everyone who's using
               | Nvidia.
               | 
               | You can purchase H100 and A100 PCIe cards over the
               | counter. They're great for compiling CUDA code, testing
               | code before you launch a multi-node job into a cluster,
               | and for running evaluations.
               | 
               | AMD has nothing of the sort, and it's hurting them.
               | 
               | I cannot blow 250K on an SMCI server, nor do I have the
               | electricity setup for it. I _can_ blow 20K on a PCIe GPU
               | and start contributing to the ecosystem, or maybe prove
               | out an idea on one GPU before trying to raise millions
               | from a VC to build a more cost-effective datacenter that
               | actually works.
        
               | nwiswell wrote:
               | > AMD has nothing of the sort, and it's hurting them.
               | 
               | What are you talking about? Have you looked?
               | 
               | https://www.dell.com/en-us/shop/amd-mi210-300w-pcie-64gb-
               | pas...
               | 
               | https://www.bitworks.io/product/amd-instinct-mi210-64gb-
               | hbm2...
        
               | shaklee3 wrote:
               | A 20k GPU will be passively cooled and you'll need a real
                | server for that. Even the old MI210 another poster
                | linked is passive.
        
               | latchkey wrote:
               | > _Give me a single MI300X GPU in PCIe form factor for
               | $20K and I 'd very seriously consider. I'm sure there are
               | many people who would help adapt the ecosystem if they
               | were truly available._
               | 
               | I know this isn't what you're looking for entirely, but
               | my business, Hot Aisle, is working on making MI300x
               | available for rental. Our pricing isn't too crazy given
               | that the GPU has 192GB and one week minimum isn't too
               | bad. We will add on-demand hourly pricing as soon as we
               | technically can.
               | 
               | I'm also pushing hard on Dell and AMD to pre-purchase
               | developer credits on our hardware, that we can then give
               | away to people who want to "kick the tires".
               | 
               | https://hotaisle.xyz/pricing/
        
       | juujian wrote:
       | I don't understand how AMD has messed up so badly that I feel
       | like celebrating a project like this. Features of my laptop are
       | just physically there but not usable, particularly in Linux. So
       | frustrating.
        
         | djbusby wrote:
          | Same boat: AMD CPU but nothing else. I feel like a
          | moderate improvement in their FOSS support and drivers
          | would open up new hardware revenue - to say nothing of the
          | AI channel.
        
         | ActorNightly wrote:
          | I don't know if I would call it a mess-up. AMD still has a
          | massive market in server chips, and their ARM stuff is on the
         | horizon. We all assume that graphics cards are the way forward
         | for ML, which may not be the case in the future.
         | 
         | Nvidia were just ahead in this particular category due to CUDA,
         | so AMD may have just let them run with it for now.
        
         | jeroenhd wrote:
         | AMD hardware works fine, the problem is that the major research
         | projects everyone copies are all developed specifically for
         | Nvidia.
         | 
         | Now AMD is spinning up CUDA compatibility layer after CUDA
         | compatibility layer. It's like trying to beat Windows by
         | building another ReactOS/Wine. It's an approach doomed to fail
         | unless AMD somehow manages to gain vastly more resources than
         | the competition.
         | 
         | Apple's NPU may not be very powerful, but many models have been
         | altered specifically to run on them, making their NPUs vastly
         | more useful than most equivalently powerful iGPUs. AMD doesn't
         | have that just yet, they're always catching up.
         | 
         | It'll be interesting to see what Qualcomm will do to get
         | developers to make use of their NPUs on the new laptop chips.
        
           | JonChesterfield wrote:
           | Interesting analogy. The last few programs from the windows
           | world I tried to run were flawless under wine and abjectly
           | failed under windows 11.
        
       | deliveryboyman wrote:
       | Would like to see benchmarks for the applications in the test
       | suite.
       | 
       | E.g., how does Cycles compare on AMD vs Nvidia?
        
       | Straw wrote:
        | I worked for Spectral Compute a few years ago. Very smart and
       | capable technical team.
       | 
        | At the time, not only did they target AMD (with less
        | compatibility than they have now), but they also outperformed
        | the default LLVM PTX backend, and even NVCC, when compiling
        | for Nvidia GPUs!
        
       | modeless wrote:
       | A lot of people think AMD should support these translation layers
       | but I think it's a bad idea. CUDA is not designed to be vendor
       | agnostic and Nvidia can make things arbitrarily difficult both
       | technically and legally. For example I think it would be against
       | the license agreement of cuDNN or cuBLAS to run them on this. So
       | those and other Nvidia libraries would become part of the API
       | boundary that AMD would need to reimplement and support.
       | 
       | Chasing bug-for-bug compatibility is a fool's errand. The
       | important users of CUDA are open source. AMD can implement
       | support directly in the upstream projects like pytorch or
       | llama.cpp. And once support is there it can be maintained by the
       | community.
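        | 
        | This is already how the upstream route works in practice:
        | ROCm builds of PyTorch expose AMD GPUs through the existing
        | torch.cuda API, so device-agnostic code needs no changes (a
        | sketch; "cuda" here is the device label, not the vendor):
        | 
        |     import torch
        | 
        |     # The same source runs on NVIDIA (CUDA) and AMD (ROCm)
        |     # builds of PyTorch; the backend is chosen at install
        |     # time, not in user code.
        |     device = "cuda" if torch.cuda.is_available() else "cpu"
        |     x = torch.randn(1024, 1024, device=device)
        |     y = x @ x   # cuBLAS on NVIDIA, rocBLAS on AMD
        |     if device == "cuda":
        |         print(torch.cuda.get_device_name(0))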
        
         | DeepYogurt wrote:
         | Ya, honestly better to leave that to third parties who can
         | dedicate themselves to it and maybe offer support or whatever.
         | Let AMD work on good first party support first.
        
         | fngjdflmdflg wrote:
         | >Nvidia can make things arbitrarily difficult both technically
         | and legally.
         | 
         | I disagree. AMD can simply not implement those APIs, similar to
         | how game emulators implement the most used APIs first and
         | sometimes never bother implementing obscure ones. It would only
          | matter that NVIDIA added e.g. patented APIs to CUDA if those
         | APIs were useful. In which case AMD should have a way to do
         | them anyway. Unless NVIDIA comes up with a new patented API
         | which is both useful and impossible to implement in any other
         | way, which would be bad for AMD in any event. On the other
         | hand, if AMD start supporting CUDA and people start using AMD
         | cards, then developers will be hesitant to use APIs that only
         | work on NVIDIA cards. Right now they are losing billions of
         | dollars on this. Then again they barely seem capable of
          | supporting ROCm on their cards, much less CUDA.
         | 
         | You have a fair point in terms of cuDNN and cuBLAS but I don't
         | know that that kind of ToS is actually binding.
        
           | selimnairb wrote:
           | Patented API? I thought Google v. Oracle settled this? Making
           | an implementation of an API spec is fair use, is it not?
        
             | fngjdflmdflg wrote:
             | My understanding is that Google v. Oracle only applies to
             | copyright.
        
               | nl wrote:
               | Well you can't patent an API so....
        
               | fngjdflmdflg wrote:
               | You can patent the implementation. You can't patent the
               | API name DecodeH265Video() but you can still sue someone
               | for implementing that function correctly.
        
               | anticensor wrote:
               | If there is only one way to solve a problem, there is
               | nothing to invent, just discover, and discoveries are
               | decidedly not patentable.
        
         | dietr1ch wrote:
         | How's this situation different than the one around Java,
         | Sun/Oracle and Google?
        
           | dboreham wrote:
           | The judge might not be a coder next time.
        
             | viraptor wrote:
                | US law is highly dependent on precedents. The Google-
                | Oracle case has fortunately set one, so anything
                | following it won't start from scratch, and we may not
                | need a coder judge next time.
        
               | jjk166 wrote:
                | Google-Oracle sidestepped the issue of API
                | copyrightability by saying Google's particular
               | implementation would fall under fair use. Whether APIs
               | are copyrightable remains an open question.
        
               | dylan604 wrote:
               | Until you get an activist court
        
         | blitzar wrote:
         | It would be good if AMD did something, anything.
         | 
          | Support this, reimplement that, support upstream efforts; don't
         | really care. Any of those would cost a couple of million and be
         | worth a trillion dollars to AMD shareholders.
        
           | oezi wrote:
           | A couple of million doesn't get you anything in corporate
           | land
        
             | spacebanana7 wrote:
             | A couple dozen billion for a 10% chance of becoming NVIDIA
             | competitive is worth it, looking at the stock prices.
        
               | oezi wrote:
               | Billions. Now we are talking.
        
           | slashdave wrote:
           | ROCm counts as "something"
        
             | curt15 wrote:
             | Pretty much any modern NVIDIA GPU supports CUDA. You don't
             | have to buy a datacenter-class unit to get your feet wet
             | with CUDA programming. ROCm will count as "something" when
             | the same is true for AMD GPUs.
        
               | muxr wrote:
                | Personally, I don't think AMD needs to support 5+ year
                | old GPUs. And all the recent generations are already
                | practically supported.
               | 
               | AMD only claims support for a select few GPUs, but in my
               | testing I find all the GPUs work fine if the architecture
                | is supported. I've tested the RX 6600 and RX 6700 XT,
                | for example, and even though they aren't officially
                | supported, they work fine on ROCm.
        
               | Dylan16807 wrote:
               | > 5+ year old GPUs
               | 
                | AMD had a big architecture switchover _exactly_ 5 years
                | ago, and the full launch wasn't over until 4.5 years
                | ago. I think that generation _should_ have full support.
                | Especially because it's not like they're cutting support
                | now. They didn't support it at launch, and they didn't
                | support it after 1, 2, 3, 4 years either.
               | 
                | Looking at it the other way, I'd say that for a mid-
                | to high-tier GPU to be obsolete based on performance,
                | the replacement model needs to be over twice as fast.
                | The 7700 XT is just over 50% faster than the 5700 XT.
        
               | imtringued wrote:
               | I'm on a 5+ year old GPU, because I don't trust AMD to
                | offer a compelling GPU that actually works. An RX 570
                | is good enough for the little gaming I do. It mostly
                | acts
               | as an oversized iGPU that has good Linux drivers, but
               | since AMD is not supporting ROCm on this GPU, there is no
               | need to hurry on upgrading to a better GPU or to get my
               | feet wet on running things locally on the GPU like Stable
               | Diffusion, LLMs, etc.
        
               | jacoblambda wrote:
                | ROCm supports current-gen consumer GPUs officially and
                | a decent chunk of recent-gen consumer GPUs
                | unofficially. Not all of them, of course, but a decent
                | chunk.
               | 
               | It's not ideal but I'm pretty sure CUDA didn't support
                | everything from day 1. And ROCm is part of AMD's
                | vendor side of the Windows AI stack, so from the
                | upcoming generation onward, basically anything that
                | outputs video should support ROCm.
        
               | ChoGGi wrote:
                | No, but CUDA at least supported the 8800 GT on release
                | [1]. ROCm didn't support any consumer cards on release;
                | it looks like they didn't support any till last year? [2]
               | 
               | [1]https://www.gamesindustry.biz/nvidia-unveils-cuda-the-
               | gpu-co...
               | 
               | [2]https://www.tomshardware.com/news/amd-rocm-comes-to-
               | windows-...
        
               | squidgyhead wrote:
               | Here is the support list:
               | 
               | https://rocm.docs.amd.com/projects/install-on-
               | linux/en/lates...
        
               | mappu wrote:
               | AMD's definition of "support" I think is different than
               | what people expect, and pretty misleading - ROCm itself
               | will run on almost anything, back as far as the RX
               | 400/500 series:
               | 
               | https://en.wikipedia.org/wiki/ROCm#:~:text=GCN%205%20%2D%
               | 20V...
               | 
               | Stable Diffusion ran fine for me on RX 570 and RX 6600XT
               | with nothing but distro packages.
        
               | imtringued wrote:
               | I don't buy it. Even running things like llama.cpp on my
               | RX 570 via Vulkan crashes the entire system.
        
               | Nab443 wrote:
               | The last time I checked, I was stuck with a pretty old
                | kernel if I wanted the latest version of ROCm
                | available for my RX 470. It was compatible at some
                | point in time, but hasn't been kept compatible with
                | recent kernels.
        
               | mappu wrote:
               | It's the responsibility of your distro to ship things
                | that work together.
        
               | slavik81 wrote:
               | There are out-of-bounds writes in the BLAS libraries for
               | gfx803 GPUs (such as the RX 570). That hardware might
                | work fine for your use case, but there are a lot of
               | failures in the test suites.
               | 
               | I agree that the official support list is very
               | conservative, but I wouldn't recommend pre-Vega GPUs for
               | use with ROCm. Stick to gfx900 and newer, if you can.
        
               | bavell wrote:
               | Huh? I've been running ROCm for SD and LLMs for over a
                | year and a half on my puny consumer 6750 XT - not even
               | latest gen.
        
               | slashdave wrote:
               | AMD should focus their efforts on competitive hardware
               | offerings, because that is where the need and the money
               | is. Sorry, I don't think the hobbyist should be a
               | priority.
        
           | chatmasta wrote:
           | Is it weird how the comments here are blaming AMD and not
           | Nvidia? Sure, the obvious argument is that Nvidia has no
           | practical motivation to build an open platform. But there are
           | counterexamples that suggest otherwise (Android). And there
           | is a compelling argument that long term, their proprietary
           | firmware layer will become an insufficient moat to their
           | hardware dominance.
           | 
           | Who's the root cause? The company with the dominant platform
           | that refuses to open it up, or the competitor who can't catch
           | up because they're running so far behind? Even if AMD made
           | their own version of CUDA that was better in every way, it
           | still wouldn't gain adoption because CUDA has become the
           | standard. No matter what they do, they'll need to have a
           | compatibility layer. And in that case maybe it makes sense
           | for them to invest in the best one that emerges from the
           | community.
        
             | lmm wrote:
             | > Is it weird how the comments here are blaming AMD and not
             | Nvidia?
             | 
                | Nvidia has put in the legwork and is reaping the rewards.
             | They've worked closely with the people who are actually
             | using their stuff, funding development and giving loads of
             | support to researchers, teachers and so on, for probably a
             | decade now. Why should they give all that away?
             | 
             | > But there are counterexamples that suggest otherwise
             | (Android).
             | 
             | How is Android a counterexample? Google makes no money off
             | of it, nor does anyone else. Google keeps Android open so
             | that Apple can't move everyone onto their ad platform, so
             | it's worth it for them as a strategic move, but Nvidia has
             | no such motive.
             | 
             | > Even if AMD made their own version of CUDA that was
             | better in every way, it still wouldn't gain adoption
             | because CUDA has become the standard.
             | 
             | Maybe. But again, that's because NVidia has been putting in
             | the work to make something better for a decade or more. The
             | best time for AMD to start actually trying was 10 years
             | ago; the second-best time is today.
        
               | Zambyte wrote:
               | > Google makes no money off of it, nor does anyone else
               | 
               | Google makes no money off of Android? That seems like a
               | really weird claim to make. Do you really think Google
               | would be anywhere near as valuable of a company if iOS
               | had all of the market share that the data vacuum that is
               | Android has? I can't imagine that being the case.
               | 
               | Google makes a boatload off of Android, just like AMD
               | would if they supported open GPGPU efforts aggressively.
        
               | michaelt wrote:
               | Google gave away the software platform - Android - to
               | hardware vendors for free, vendors compete making the
               | hardware into cheap, low-margin commodity items, and
               | google makes boatloads of money from ads, tracking and
               | the app store.
               | 
               | nvidia _could_ give away the software platform - CUDA -
               | to hardware vendors for free, making the hardware into
               | cheap, low-margin commodity items. But how would they
               | make boatloads of money when there 's nowhere to put ads,
               | tracking or an app store?
        
               | rjurney wrote:
               | Android is a complement to Google's business, which is
                | the case where open source works. What would be the
                | complement
               | worth $1 Trillion to NVIDIA to build a truly open
               | platform? There isn't one. That was his point.
        
               | chatmasta wrote:
               | There's an entire derivative industry of GPUs, namely
               | GenAI and LLM providers, that could be the "complement"
               | to an open GPU platform. The exact design and interface
               | between such a complement and platform is yet undefined,
               | but I'm sure there are creative approaches to this
               | problem.
        
               | rjurney wrote:
               | And NVIDIA is playing in that game too. Why would they
               | not play in higher level services as well? They already
               | publish the source to their entire software stack. A
               | comparison to Android is completely useless. Google is a
               | multi-sided platform that does lots of things for free
               | for some people (web users, Android users) so it can
               | charge other people for their data (ad buyers). That
               | isn't the chip business whatsoever. The original comment
               | only makes sense if you know nothing about their
               | respective business models.
        
               | chatmasta wrote:
               | Yes, so when the ground inevitably shifts below their
               | feet (it might happen years from now, but it _will_
               | happen - open platforms always emerge and eventually
               | proliferate), wouldn't it be better for them to own that
               | platform?
               | 
               | On the other hand, they could always wait for the most
               | viable threat to emerge and then pay a few billion
               | dollars to acquire it and own its direction. Google
               | didn't invent Android, after all...
               | 
               | > Google is a multi-sided platform that does lots of
               | things for free for some people... That isn't the chip
               | business whatsoever.
               | 
               | This is a reductionist differentiation that overlooks the
               | similarities between the platforms of "mobile" and "GPU"
               | (and also mischaracterizes the business model of Google,
               | who does in fact make money directly from Android sales,
               | and even moved all the way down the stack to selling
               | hardware). In fact there is even a potentially direct
               | analogy between the two platforms: LLM is the top of the
               | stack with GPU on the bottom, just like Advertising is
               | the top of the stack with Mobile on the bottom.
               | 
               | Yes, Google's top level money printer is advertising, and
               | everything they do (including Android) is about
               | controlling the maximum number of layers below that money
               | printer. But that doesn't mean there is no benefit to
               | Nvidia doing the same. They might approach it
               | differently, since they currently own the bottom layer
               | whereas Google started from the top layer. But the end
               | result of controlling the whole stack will lead to the
               | same benefits.
               | 
               | And you even admit in your comment that Nvidia is
               | investing in these higher levels. My argument is that
               | they are jeopardizing the longevity of these high-level
               | investments due to their reluctance to invest in an open
               | platform at the bottom layer (not even the bottom, but
               | one level above their hardware). This will leave them
               | vulnerable to encroachment by a player that comes from a
               | higher level, like OpenAI for example, who gets to define
               | the open platform before Nvidia ever has a chance to own
               | it.
        
             | roenxi wrote:
             | > Is it weird how the comments here are blaming AMD and not
             | Nvidia?
             | 
             | Not even a little bit. It simply isn't Nvidia's job to
             | provide competitive alternatives to Nvidia. Competing is
             | something AMD must take responsibility for.
             | 
             | The only reason CUDA is such a big talking point is because
             | AMD tripped over their own feet supporting accelerated BLAS
             | on AMD GPUs. Realistically it probably is hard to implement
             | (AMD have a lot of competent people on staff) but Nvidia
              | hasn't done anything unfair apart from executing so well that
             | they make all the alternatives look bad.
        
               | jkmcf wrote:
               | I agree with you, but replace NVIDIA with Apple. What
               | would the EU say?
        
               | LtWorf wrote:
               | I don't think nvidia bans anyone from running code on
               | their devices.
        
               | padthai wrote:
               | They do from time to time:
               | https://wirelesswire.jp/2017/12/62708/
        
               | kbolino wrote:
               | This seems to be more about certain devices (consumer-
               | grade GPUs) in certain settings (data centers), though I
               | do question how enforceable it actually is. My guess is
               | that it can only apply when you try to get discounts from
               | bulk-ordering GPUs.
               | 
               | Also, was there any followup to this story? It seems a
               | bit unnecessary because nVidia has already neutered
               | consumer cards for many/most data center purposes by not
               | using ECC and by providing so few FP64 units that double
               | precision FLOPS is barely better than CPU SIMD.
        
               | paulmd wrote:
               | it's also not really a thing anymore because of the open
               | kernel driver... at that point it's just MIT licensed.
               | 
               | of course people continued to melt down about that for
               | some reason too, in the customary "nothing is ever libre
               | enough!" circular firing squad. Just like streamline etc.
               | 
               | There's a really shitty strain of fanboy thought that
               | wants libre software to be actively worsened (even
               | stonewalled by the kernel team if necessary) so that they
               | can continue to argue against nvidia as a bad actor that
               | doesn't play nicely with open source. You saw it with all
               | these things but especially with the open kernel driver,
               | people were really happy it didn't get upstreamed. Shitty
               | behavior all around.
               | 
               | You see it every time someone quotes Linus Torvalds on
               | the issue. Some slight from 2006 is more important than
               | users having good, open drivers upstreamed. Some petty
                | brand preferences are legitimately far more important than
               | working with and bringing that vendor into the fold long-
               | term, _for a large number of people_. Most of whom don't
               | even consider themselves fanboys! They just say all the
               | things a fanboy would say, and act all the ways a fanboy
               | would act...
        
               | Zambyte wrote:
               | https://www.pcgamer.com/nvidia-officially-confirms-hash-
               | rate...
               | 
               | Also: look into why the Nouveau driver performance is
               | limited.
        
               | paulmd wrote:
               | so terrible that vendors can enforce these proprietary
               | licenses on software they paid to develop /s
        
               | Zambyte wrote:
               | Huh? Why the sarcasm? You think it's a good thing that
               | someone besides the person who owns the hardware has the
               | final say on what the hardware is allowed to be used for?
        
             | nemothekid wrote:
             | > _Is it weird how the comments here are blaming AMD and
             | not Nvidia?_
             | 
              | It's not. Even as it is, I do not trust HIP or ROCm to be a
             | viable alternative to Cuda. George Hotz did plenty of work
             | trying to port various ML architectures to AMD and was met
             | with countless driver bugs. The problem isn't nvidia won't
             | build an open platform - the problem is AMD won't invest in
             | a competitive platform. 99% of ML engineers do not write
             | CUDA. For the vast majority of workloads, there are
              | probably 20 engineers at Meta who write the CUDA backend
              | for PyTorch that every other engineer uses. Meta could hire
             | another 20 engineers to support whatever AMD has (they did,
             | and it's not as robust as CUDA).
             | 
             | Even if CUDA was open - do you expect nvidia to also write
             | drivers for AMD? I don't believe 3rd parties will get
             | anywhere writing "compatibility layers" because AMD's own
              | GPUs aren't optimized or tested for CUDA-like workloads.
        
             | pjmlp wrote:
              | Khronos, AMD and Intel have had 15 years to make something
             | out of OpenCL that could rival CUDA.
             | 
              | Instead they managed 15 years of disappointment: a
              | standard stuck in C99 that adopted C++ and a polyglot
              | bytecode too late to matter, and never produced an
              | ecosystem of IDE tooling and GPU libraries.
             | 
             | Naturally CUDA became the standard, when NVIDIA provided
             | what the GPU community cared about.
        
             | whywhywhywhy wrote:
             | >Is it weird how the comments here are blaming AMD and not
             | Nvidia?
             | 
             | Because it IS AMD/Apple/etcs fault for the position they're
             | in right now. CUDA showed where the world was heading and
             | where the gains in compute would be made well over a decade
             | ago now.
             | 
             | They even had OpenCL, didn't put the right amount of effort
             | into it, all the talent found CUDA easier to work with so
             | built there. Then what did AMD, Apple do? Double down and
             | try and make something better and compete? Nah they
             | fragmented and went their own way, AMD with what feels like
             | a fraction of the effort even Apple put in.
             | 
             | From the actions of the other teams in the game it's not
             | hard to imagine a world without CUDA being a world where
              | this tech is running at a fraction of its potential.
        
             | immibis wrote:
             | It's always been on the straggler to catch up by cheating.
             | That's just how the world works - even in open source. If
             | AMD supported CUDA, it would have a bigger market share.
             | That's a fact. Nvidia doesn't want that. That's a fact. But
             | when Reddit started, it just scraped feeds from Digg, and
             | when Facebook started, it let you link your MySpace
             | credentials and scraped your MySpace account. Adversarial
             | interoperability is nothing new.
        
             | cogman10 wrote:
             | Funnily, who I blame the most for there not being real
              | competition to CUDA is Apple. As of late, Apple has been
              | really pushing for vendor lock-in APIs rather than adopting
             | open standards. The end result is you can get AMD and Intel
             | onboard with some standard which is ultimately torpedoed by
             | apple. (See apple departing from and rejecting everything
             | that comes from the khronos group).
             | 
             | With the number of devs that use Apple silicon now-a-days,
             | I have to think that their support for khronos initiatives
             | like SYCL and OpenCL would have significantly accelerated
             | progress and adoption in both.
             | 
             | We need an open standard that isn't just AMD specific to be
             | successful in toppling CUDA.
        
         | Const-me wrote:
         | > Nvidia can make things arbitrarily difficult both technically
         | and legally
         | 
         | Pretty sure APIs are not copyrightable, e.g.
         | https://www.law.cornell.edu/supremecourt/text/18-956
         | 
         | > against the license agreement of cuDNN or cuBLAS to run them
         | on this
         | 
         | They don't run either of them, they instead implement an
         | equivalent API on top of something else. Here's a quote: "Open-
         | source wrapper libraries providing the "CUDA-X" APIs by
         | delegating to the corresponding ROCm libraries. This is how
         | libraries such as cuBLAS and cuSOLVER are handled."
        
           | dralley wrote:
           | I believe it was decided that they are copyrightable but that
           | using them for compatibility purposes is fair use.
        
             | kbolino wrote:
             | No, it's stranger than that: SCOTUS did not rule on
             | copyrightability of APIs at all, but simply ruled that even
             | _if_ they are copyrightable, what Google did (completely
              | reimplement Sun/Oracle's public API) was still fair use.
        
               | mrandish wrote:
               | It would have been nice to get a clear SCOTUS precedent
               | on this. On the other hand, I also value a SCOTUS which
               | rules minimally and narrowly by default (I also
               | appreciate SCOTUS' return to stricter constitutional
               | grounding in the past decade).
        
               | hobs wrote:
                | Incredibly loud laughter from the lawyers, whose study
                | of law is being thrown around willy-nilly because of
                | all the unprecedented joke decisions they are making
                | right now.
        
               | kbolino wrote:
               | We are stuck between a rock and a hard place politically.
               | The real decisions should be coming from Congress not the
               | courts. However, Congress is too disorganized and
               | disconnected to answer the important questions, leaving
               | the courts to either muddle along or else become semi-
               | dictatorial. In most countries, this would cause a
               | constitutional crisis, but the modern U.S. system seems
               | to be a little too resilient to such otherwise concerning
               | signals.
        
               | hobs wrote:
               | We're far past a constitutional crisis, and the courts
               | taking power nobody wanted to give to them (who wasn't
               | interested in a unitary executive at least) isn't a good
               | solution.
        
               | kbolino wrote:
               | What constitutional crisis has occurred that hasn't been
               | resolved?
               | 
               | Constitutional crises involve fundamental breaks in the
               | working of government that bring two or more of its
               | elements into direct conflict that can't be reconciled
               | through the normal means. The last of these by my
               | accounting was over desegregation, which was resolved
               | with the President ordering the Army to force the
               | recalcitrant states to comply. Before that was a showdown
               | between the New Deal Congress and the Supreme Court,
               | which the former won by credibly threatening to pack the
               | latter (which is IMO a much less severe crisis but still
               | more substantial than anything happening today). However,
               | that was almost a century ago, and Congress has not been
               | that coherent lately.
        
               | ted_dunning wrote:
                | I would think the latest one, where SCOTUS ruled that
                | the president is a king except in matters where SCOTUS
                | decides he isn't, counts as a constitutional crisis.
        
               | FeepingCreature wrote:
               | Constitutional crises are not a matter of opinion but of
               | occurrence, arising from an _actual power conflict_
               | between arms of the government that is caused by a
               | conflicted reading of the constitutional text. Basically,
                | if the system just ticks on, it's not a constitutional
               | crisis.
               | 
               | If "I think this is a very bad decision" was cause for a
               | constitutional crisis, any state with more than three
               | digit population would be in constitutional crisis
               | perpetually.
        
               | jolux wrote:
               | > Constitutional crises are not a matter of opinion but
               | of occurrence, arising from an actual power conflict
               | between arms of the government that is caused by a
               | conflicted reading of the constitutional text. Basically,
               | if the system just ticks on, it's not a constitutional
               | crisis.
               | 
                | This happened as recently as 2021-01-06; there is
                | strong evidence that the military subverted the
                | president to call the National Guard into Washington DC
                | and secure the electoral count.
        
               | hnfong wrote:
                | If Trump hadn't backed down, it could definitely have
                | been a constitutional crisis.
               | 
               | I'd say it was narrowly averted though.
        
               | kbolino wrote:
               | That's close. Both the excessively long lame duck period
               | (2 months for Congress and 2.5 months for the President)
               | and disunity between the President and the rest of the
               | executive branch have also been fodder for crises in the
               | past (Marbury v Madison, Andrew Johnson's impeachment).
        
               | not2b wrote:
               | That is how the SC used to work: they would decide cases
               | on the narrowest possible grounds. If they don't have to
               | decide a tough question, but they can finesse it with
               | something simpler, good enough. More recently they have
               | been willing to tear up decades of established law on a
               | regular basis.
        
               | hnfong wrote:
               | "Used to work"... this was 2021.
               | 
               | And generally courts/judges just choose the scope of
               | their legal opinions based on how far reaching they want
               | the legal principles to apply.
               | 
               | IMHO, copyright-ability of APIs is so far away from their
               | political agenda that they probably just decided to leave
               | the issue on a cliffhanger...
        
               | immibis wrote:
               | Yes, "used to". Now, in 2024, the same supreme court has
               | decided that presidents have immunity in all official
               | acts, from stealing documents, up to and including
               | assassination attempts on their opponents. This is a
               | radical shift in how the court operates.
        
               | kbolino wrote:
               | This "opponent assassination" hypothetical gets bandied
               | about a lot but I have not seen any evidence that any
               | court considers that to be an "official act". Official
               | acts are constrained to legitimate exercises of
               | constitutional authority and are not merely anything a
               | President (or especially, an ex-President) does.
        
               | not2b wrote:
               | It's specifically mentioned in the dissents.
        
               | jpadkins wrote:
               | the only thing radical is the opinions of people you are
               | listening to if you believe SCOTUS enabled legally
               | sanctioned assassinations. It was political hyperbole
               | based on nothing, and it worked (with you). Think for
               | yourself.
        
           | consf wrote:
           | You're correct! Fair Use Doctrine
        
         | amelius wrote:
         | Like supporting x86 was a bad idea as well?
        
           | karolist wrote:
            | Was there a large entity steering the x86 spec alone, with
            | a huge feature lead over their competition, free to steer
            | the spec in any way they chose? Also, hardware is not
            | open-source software: get big players onboard and they
            | will be able to implement the spec they want every
            | generation, whereas software has more moving parts and
            | unaligned parties involved.
        
             | cherryteastain wrote:
             | > Was there a large entity steering x86 spec alone with a
             | huge feature lead against their competition, free to steer
             | the spec in any ways they choose?
             | 
             | Ever heard of Intel?
        
               | karolist wrote:
                | I hadn't considered that angle. Is your point that Intel
               | was the creator of x86, but software chose to support it,
               | then AMD had nothing else but to play catch up in x86
               | support to be part of the software target market? If so
               | and factual (I've no idea), fair point, I didn't know.
        
               | marshray wrote:
               | It was exactly the same instruction set.
               | 
               | C compilers didn't offer an "AMD" CPU target* until AMD
               | came out with the "AMD64" instruction set. Today we call
               | this "x86_64" or "x64".
               | 
               | * Feel free to point out some custom multimedia vector
               | extensions for Athlons or something, but the point
               | remains.
        
               | gmokki wrote:
                | And Intel named its licensed implementation of AMD64
                | "IA-32e", just to make it clear to everyone that it is
                | based on the 32-bit Intel architecture with an
                | extension. Luckily they dropped that name a few years
                | later.
        
           | modeless wrote:
           | Before starting, AMD signed an agreement with Intel that gave
           | them an explicit license to x86. And x86 was a whole lot
           | smaller and simpler back then in _1982_. A completely
           | different and incomparable situation.
        
             | nostrademons wrote:
             | Technically it was after starting - AMD was founded in 1969
             | as a second-sourcer for Fairchild and National
             | Semiconductor, and had reverse-engineered the 8080 by 1975
             | and acquired a formal license to it by 1976.
             | 
             | The 1982 deal you speak of was actually pretty interesting:
             | as a condition of the x86's use in the IBM PC, IBM
             | requested a second source for x86 chips. AMD was that
             | source, and so they cross-licensed the x86 in 1982 to allow
             | the IBM PC project to proceed forward. This makes the
             | Intel/AMD deal even more important for both companies: _the
             | PC market would never have developed_ without the cross-
              | licensing, which would've been bad for all companies
             | involved. This gave Intel an ongoing stake in AMD's success
             | at least until the PC market consolidated on the x86
             | standard.
        
         | eslaught wrote:
         | Are you aware of HIP? It's officially supported and, for code
         | that avoids obscure features of CUDA like inline PTX, it's
         | pretty much a find-and-replace to get a working build:
         | 
         | https://github.com/ROCm/HIP
         | 
         | Don't believe me? Include this at the top of your CUDA code,
         | build with hipcc, and see what happens:
         | 
         | https://gitlab.com/StanfordLegion/legion/-/blob/master/runti...
         | 
         | It's incomplete because I'm lazy but you can see most things
         | are just a single #ifdef away in the implementation.
        
           | currymj wrote:
           | if you're talking about building anything, that is already
           | too hard for ML researchers.
           | 
           | you have to be able to pip install something and just have it
           | work, reasonably fast, without crashing, and also it has to
           | not interfere with 100 other weird poorly maintained ML
           | library dependencies.
        
             | bootsmann wrote:
              | Don't most orgs that are deep enough to run custom CUDA
              | kernels have dedicated engineers for this stuff? I can't
              | imagine a person who can write raw CUDA not being able to
              | handle things more difficult than pip install.
        
               | gaogao wrote:
               | Engineers who are really, really good at CUDA are worth
               | their weight in gold, so there's more projects for them
               | than they have time. Worth their weight in gold isn't
               | figurative here - the one I know has a ski house more
               | expensive than 180 lbs of gold (~$5,320,814).
        
               | bbkane wrote:
               | Would you (or your friend) be able to drop any good CUDA
               | learning resources? I'd like to be worth my weight in
               | gold...
        
               | throwaway81523 wrote:
               | A working knowledge of C++, plus a bit of online reading
               | about CUDA and the NVidia GPU architecture, plus studying
               | the LCZero chess engine source code (the CUDA neural net
               | part, I mean) seems like enough to get started. I did
               | that and felt like I could contribute to that code, at
               | least at a newbie level, given the hardware and build
               | tools. At least in the pre-NNUE era, the code was pretty
               | readable. I didn't pursue it though.
               | 
               | Of course becoming "really good" is a lot different and
               | like anything else, it presumably takes a lot of callused
               | fingertips (from typing) to get there.
        
               | 8n4vidtmkvmk wrote:
               | Does this pay more than $500k/yr? I already know C++,
               | could be tempted to learn CUDA.
        
               | throwaway81523 wrote:
               | I kinda doubt it. Nobody paid me to do that though. I was
               | just interested in LCZero. To get that $500k/year, I
               | think you need up to date ML understanding and not just
               | CUDA. CUDA is just another programming language while ML
               | is a big area of active research. You could watch some of
               | the fast.ai ML videos and then enter some Kaggle
               | competitions if you want to go that route.
        
               | almostgotcaught wrote:
               | You're wrong. The people building the models don't write
               | CUDA kernels. The people optimizing the models write CUDA
               | kernels. And you don't need to know a bunch of ML bs to
               | optimize kernels. Source: I optimize GPU kernels. I don't
                | make 500k but I'm not that far from it.
        
               | throwaway81523 wrote:
               | Heh I'm in the wrong business then. Interesting. Used to
               | be that game programmers spent lots of time optimizing
               | non-ML CUDA code. They didn't make anything like 500k at
               | that time. I wonder what the ML industry has done to game
               | development, or for that matter to scientific
               | programming. Wow.
        
               | HarHarVeryFunny wrote:
               | How much performance difference is there between writing
               | a kernel in a high level language/framework like PyTorch
               | (torch.compile) or Triton, and hand optimizing? Are you
               | writing kernels in PTX?
               | 
               | What's your opinion on the future of writing optimized
               | GPU code/kernels - how long before compilers are as good
               | or better than (most) humans writing hand-optimized PTX?
        
               | throwaway81523 wrote:
               | The CUDA version of LCZero was around 2x or 3x faster
               | than the Tensorflow(?) version iirc.
        
               | mosselman wrote:
               | The real challenge is probably getting your hands on a
               | 4090 for a price you can pay before you are worth your
               | weight in gold. Because an arm and a limb in gold is
               | quite a lot.
        
               | throwaway81523 wrote:
               | You don't really need a 4090. An older board is plenty.
               | The software is basically the same. I fooled around with
               | what I think was a 1080 on Paperspace for something like
               | 50 cents an hour, but it was mostly with some Pytorch
               | models rather than CUDA directly.
        
               | ahepp wrote:
               | I was looking into this recently and it seems like the
               | cheapest AWS instance with a CUDA GPU is something on the
               | order of $1/hr. It looks like an H100 instance might be
               | $15/hr (although I'm not sure if I'm looking at a monthly
               | price).
               | 
               | So yeah it's not ideal if you're on a budget, but it
               | seems like there are some solutions that don't involve
               | massive capex.
        
               | throwaway81523 wrote:
               | Look on vast.ai instead of AWS, you can rent machines
               | with older GPU's dirt cheap. I don't see how they even
               | cover the electricity bills. A 4090 machine starts at
               | about $.25/hour though I didn't examine the
               | configuration.
               | 
               | A new 4090 costs around $1800
               | (https://www.centralcomputer.com/asus-tuf-
               | rtx4090-o24g-gaming...) and that's probably affordable to
               | AWS users. I see a 2080Ti on Craigslist for $300
               | (https://sfbay.craigslist.org/scz/sop/d/aptos-nvidia-
               | geforce-...) though used GPU's are possibly thrashed by
               | bitcoin mining. I don't have a suitable host machine,
               | unfortunately.
        
               | dotancohen wrote:
                | Thrashed? What type of damage could a mostly-solid
                | state device suffer? Fan problems? Worn PCIe
                | connectors? Deteriorating Arctic Ice from repeated heat
                | cycling?
        
               | ssl-3 wrote:
               | Nope, none of those.
               | 
               | When people were mining Ethereum (which was the last
               | craze that GPUs were capable of playing in -- BTC has
               | been off the GPU radar for a long time), profitable
               | mining was fairly kind to cards compared to gaming.
               | 
               | Folks wanted their hardware to produce as much as
               | possible, for as little as possible, before it became
               | outdated.
               | 
               | The load was constant, so heat cycles weren't really a
               | thing.
               | 
               | That heat was minimized; cards were clocked (and voltages
               | tweaked) to optimize the ratio of crypto output to Watts
               | input. For Ethereum, this meant undervolting and
               | underclocking the GPU -- which are kind to it.
               | 
               | Fan speeds were kept both moderate and tightly
               | controlled; too fast, and it would cost more (the fans
               | themselves cost money to run, and money to replace). Too
               | slow, and potential output was left on the table.
               | 
               | For Ethereum, RAM got hit hard. But RAM doesn't
               | necessarily care about that; DRAM in general is more or
               | less just an array of solid-state capacitors. And people
               | needed that RAM to work reliably -- it's NFG to spend
               | money producing bad blocks.
               | 
               | Power supplies tended to be stable, because good, cheap,
               | stable, high-current, and stupidly-efficient are
               | qualities that go hand-in-hand thanks to HP server PSUs
               | being cheap like chips.
               | 
               | There were exceptions, of course: Some people did not
               | mine smartly.
               | 
               | ---
               | 
               | But this is broadly very different from how gamers treat
                | hardware, wherein: Heat cycles are real, overclocking
                | everything to eke out an extra few FPS is real, pushing
                | things a bit too far and producing glitches can be
                | tolerated sometimes, fan speeds are whatever, and power
                | supplies are picked based on what they _look like_
                | instead of an actual price/performance comparison.
               | 
               | A card that was used for mining is not implicitly worse
               | in any way than one that was used for gaming. Purchasing
               | either thing involves non-zero risk.
        
               | mschuster91 wrote:
               | Heat. A lot of components - and not just in computers but
                | _everything_ hardware - are spec'd for something called
               | "duty cycles", basically how long a thing is active in a
               | specific time frame.
               | 
               | Gaming cards/rigs, which many of the early miners were
               | based on, rarely run at 100% all the time, the workload
               | is burst-y (and distributed amongst different areas of
               | the system). In comparison, a miner runs at 100% all the
               | time.
               | 
               | On top of that, for silicon there is an effect called
                | electromigration [1], where momentum transfer from the
                | moving electrons erodes the metal over time - made worse by
               | ever shrinking feature sizes as well as, again, the chips
               | being used in exactly the same way all the time.
               | 
               | [1] https://en.wikipedia.org/wiki/Electromigration
        
               | SonOfLilit wrote:
                | replying to sibling @dotancohen: they melt, and they
                | suffer from thermal expansion and contraction
        
               | robotnikman wrote:
               | Are there any certifications or other ways to prove your
               | knowledge to employers in order to get your foot in the
               | door?
        
               | suresk wrote:
               | Having dabbled in CUDA, but not worked on it
               | professionally, it feels like a lot of the complexity
               | isn't really in CUDA/C++, but in the algorithms you have
               | to come up with to really take advantage of the hardware.
               | 
               | Optimizing something for SIMD execution isn't often
               | straightforward and it isn't something a lot of
               | developers encounter outside a few small areas. There are
               | also a lot of hardware architecture considerations you
               | have to work with (memory transfer speed is a big one) to
               | even come close to saturating the compute units.
        
               | iftheshoefitss wrote:
                | Oh bro, forget gold, I'd like to be worth my weight in
                | paper lmao
        
               | eigenvalue wrote:
               | That's pretty funny. Good test of value across the
               | millennia. I wonder if the best aqueduct engineers during
               | the peak of Ancient Rome's power had villas worth their
               | body weight in gold.
        
               | Winse wrote:
               | Lol. For once being overweight may come with some
               | advantages here.
        
               | necovek wrote:
               | Or disadvantages: you may be as rich as your skinny
               | neighbour, but they are the only ones worth their weight
               | in gold ;)
        
               | Willish42 wrote:
               | The fact that "worth their weight in cold" typically
               | means in the single-digit millions is _fascinating_ to me
               | (though I doubt I 'll be able to get there myself, maybe
               | someday). I looked it up though and I think this is
               | undercounting the current value of gold per ounce/lb/etc.
               | 
               | 5320814 / 180 / 16 = ~1847.5
               | 
               | Per https://www.apmex.com/gold-price and
               | https://goldprice.org/, current value is north of $2400 /
               | oz. It was around $1800 in 2020. That growth for _gold_
               | of all things (up 71% in the last 5 years) is crazy to
               | me.
               | 
               | It's worth noting that anyone with a ski house that
               | expensive probably has a net worth well over twice the
               | price of that ski house. I guess it's time to start
               | learning CUDA!
        
               | boulos wrote:
                | Note: gold uses _troy_ ounces, so adjust by ~10%. It's
               | easier to just use grams or kilograms :).
        
               | Willish42 wrote:
               | Thanks, I'm a bit new to this entire concept. Do _troy_
               | lbs also exist, or is that just a term when measuring
               | ounces?
        
               | atwrk wrote:
                | > _That growth for gold of all things (up 71% in the
                | last 5 years) is crazy to me._
                | 
                | For comparison: the S&P 500 grew about the same during
                | that period (more than 100% from Jan 2019, about 70%
                | from Dec 2019), so the higher price of gold did not
                | outperform the growth of the general (financial)
                | economy.
        
               | dash2 wrote:
               | But that's still surprising performance, because the S&P
               | generates income and pays dividends. Its increase
               | reflects (at least, is supposed to!) expectations of
               | future higher income. Gold doesn't even bear interest....
        
               | iftheshoefitss wrote:
                | What do people study to figure out CUDA? I'm studying
                | to get my GED and hope to go to school one day
        
               | paulmd wrote:
               | Computer science. This is a grad level topic probably.
               | 
               | Nvidia literally wrote most of the textbooks in this
               | field and you'd probably be taught using one of these
               | anyway:
               | 
               | https://developer.nvidia.com/cuda-books-archive
               | 
               | "GPGPU Gems" is another "cookbook" sort of textbook that
               | might be helpful starting out but you'll want a good
               | understanding of the SIMT model etc.
        
               | amelius wrote:
               | Just wait until someone trains an ML model that can
               | translate any CUDA code into something more portable like
               | HIP.
               | 
               | GP says it is just some #ifdefs in most cases, so an LLM
               | should be able to do it, right?
        
               | FuriouslyAdrift wrote:
               | OpenAI Triton? Pytorch 2.0 already uses it.
               | 
               | https://openai.com/index/triton/
        
               | phkahler wrote:
               | >> Don't most orgs that are deep enough to run custom
               | cuda kernels have dedicated engineers for this stuff. I
               | can't imagine a person who can write raw cuda not being
               | able to handle things more difficult than pip install.
               | 
                | This seems to be a fairly common problem with software.
                | The people who create software regularly deal with
                | complex tool chains, dependency management,
                | configuration files, and so on. As a result they think
                | that if a solution "exists" everything is fine. Need to
                | edit a config file for your particular setup? No
                | problem. The thing is, _I_ have been programming stuff
                | for decades and I really _hate_ having to do that stuff
                | and will avoid tools that make me do it. I have my own
                | problems to solve, and don't want to deal with figuring
                | out tools no matter how "simple" the author thinks that
                | is to do.
               | 
                | A huge part of the reason commercial software exists
                | today is probably because open source projects don't
                | take things to this extreme. I look at some things that
                | qualify as products and think they're really
                | simplistic, but they take care of some minutiae that
                | regular people are willing to pay for so they don't
                | have to learn or deal with them. The same can be true
                | for developers and ML researchers or whatever.
        
             | jchw wrote:
             | The target audience of interoperability technology is
             | whoever is building, though. Ideally, interoperability
             | technology can help software that supports only NVIDIA GPUs
             | today go on to quickly add baseline support for Intel and
             | AMD GPUs tomorrow.
             | 
             | (and for one data point, I believe Blender is actively
             | using HIP for AMD GPU support in Cycles.)
        
             | Agingcoder wrote:
              | Their target is HPC users, not ML researchers. I can
             | understand why this would be valuable to this particular
             | crowd.
        
             | eslaught wrote:
             | If your point is that HIP is not a zero-effort porting
              | solution, that is correct. HIP is a _low_-effort solution,
             | not a zero effort solution. It targets users who already
             | use and know CUDA, and minimizes the changes that are
             | required from pre-existing CUDA code.
             | 
              | In the case of these abstraction layers, it would be
             | the responsibility of the abstraction maintainers (or AMD)
             | to port them. Obviously, someone who does not even use CUDA
             | would not use HIP either.
             | 
             | To be honest, I have a hard time believing that a truly
             | zero-effort solution exists. Especially one that gets high
             | performance. Once you start talking about the full stack,
             | there are too many potholes and sharp edges to believe that
              | it will really work. So I am highly skeptical of the
              | original article. Not that I wouldn't want to be proved
              | wrong. But
             | what they're claiming to do is a big lift, even taking HIP
             | as a starting point.
             | 
             | The easiest, fastest (for end users), highest-performance
             | solution for ML will come when the ecosystem integrates it
             | natively. HIP would be a way to get there faster, but it
             | will take nonzero effort from CUDA-proficient engineers to
             | get there.
        
               | currymj wrote:
               | I agree completely with your last point.
               | 
               | As other commenters have pointed out, this is probably a
               | good solution for HPC jobs where everyone is using C++ or
               | Fortran anyway and you frequently write your own CUDA
               | kernels.
               | 
               | From time to time I run into a decision maker who
               | understandably wants to believe that AMD cards are now
               | "ready" to be used for deep learning, and points to
               | things like the fact that HIP mostly works pretty well. I
               | was kind of reacting against that.
        
             | ezekiel68 wrote:
             | > if you're talking about building anything, that is
             | already too hard for ML researchers.
             | 
             | I don't think so. I agree it is too hard for the ML
              | researchers at the companies which will have their rear ends
             | handed to them by the other companies whose ML researchers
             | can be bothered to follow a blog post and prompt ChatGPT to
             | resolve error messages.
        
               | jokethrowaway wrote:
               | a lot of ML researchers stay pretty high level and
               | reinstall conda when things stop working
               | 
               | and rightly so, they have more complicated issues to
               | tackle
               | 
               | It's on developers to provide better infrastructure and
               | solve these challenges
        
               | LtWorf wrote:
                | Not rightly. It'd be faster in the long term to address
                | the issues.
        
               | bayindirh wrote:
                | Currently nobody thinks that long term. They just
                | reinstall, and that's it.
        
               | currymj wrote:
               | I'm not really talking about companies here for the most
               | part, I'm talking about academic ML researchers (or
               | industry researchers whose role is primarily academic-
               | style research). In companies there is more incentive for
               | good software engineering practices.
               | 
               | I'm also speaking from personal experience: I once had to
               | hand-write my own CUDA kernels (on official NVIDIA cards,
               | not even this weird translation layer): it was useful and
               | I figured it out, but everything was constantly breaking
               | at first.
               | 
               | It was a drag on productivity and more importantly, it
               | made it too difficult for other people to run my code
               | (which means they are less likely to cite my work).
        
             | elashri wrote:
             | As someone doing a lot of work with CUDA in a big research
             | organization, there are few of us. If you are working with
              | CUDA, then you are not the type of person who waits for
              | something that just works like you describe. CUDA
             | itself is a battle with poorly documented stuff.
        
             | klik99 wrote:
             | God this explains so much about my last month, working with
             | tensorflow lite and libtorch in C++
        
           | SushiHippie wrote:
            | AMD has hipify for this, which converts CUDA code to HIP.
           | 
           | https://github.com/ROCm/HIPIFY
        
             | 3abiton wrote:
              | There is a more glaring issue: ROCm doesn't even work
              | well on most AMD devices nowadays, and performance-wise
              | HIP deteriorates on the same hardware compared to ROCm.
        
               | boroboro4 wrote:
                | It supports all of the current datacenter GPUs.
               | 
               | If you want to write very efficient CUDA kernel for
               | modern datacenter NVIDIA GPU (read H100), you need to
               | write it with having hardware in mind (and preferably in
               | hands, H100 and RTX 4090 behave _very_ differently in
                | practice). So I don't think the difference between AMD
               | and NVIDIA is as big as everyone perceives.
        
           | jph00 wrote:
           | Inline PTX is hardly an obscure feature. It's pretty widely
           | used in practice, at least in the AI space.
        
             | saagarjha wrote:
             | Yeah, a lot of the newer accelerators are not even
             | available without using inline PTX assembly. Even the ones
             | that are have weird shapes that are not amenable to high-
             | performance work.
        
               | HarHarVeryFunny wrote:
               | Are you saying that the latest NVIDIA nvcc doesn't
               | support the latest NVIDIA devices?
        
               | adrian_b wrote:
               | For any compiler, "supporting" a certain CPU or GPU only
               | means that they can generate correct translated code with
               | that CPU or GPU as the execution target.
               | 
               | It does not mean that the compiler is able to generate
               | code that has optimal performance, when that can be
               | achieved by using certain instructions without a direct
               | equivalent in a high-level language.
               | 
               | No compiler that supports the Intel-AMD ISA knows how to
               | use all the instructions available in this ISA.
        
               | HarHarVeryFunny wrote:
               | Sure, but I'm not sure if that is what the parent poster
               | was saying (that nvcc generates poor quality PTX for
               | newer devices).
               | 
               | It's been a while since I looked at CUDA, but it used to
               | be that NVIDIA were continually extending cuDNN to add
               | support for kernels needed by SOTA models, and I assume
               | these kernels were all hand optimized.
               | 
               | I'm curious what kind of models people are writing where
                | not only is there no optimized cuDNN support, but also
               | solutions like Triton or torch.compile, and even hand
               | optimized CUDA C kernels are too slow. Are hand written
               | PTX kernels really that common ?
        
           | pjmlp wrote:
           | How does it run CUDA Fortran?
        
         | viraptor wrote:
         | Isn't cuDNN a much better case for reimplementing than CUDA? It
         | has much more choice in how things actually happen and cuDNN
         | itself chooses different implementations at runtime + does
         | fusing. It seems way more generic and the reimplementation
          | would allow using the best AMD-targeted kernel rather than
          | the one the original has.
        
           | ckitching wrote:
           | AMD have "MIOpen" which is _basically_ cuDNN-for-AMD. Ish.
        
             | mmis1000 wrote:
              | And that thing has been left unreleased on Windows for
              | almost a whole year, for unknown reasons. Even though
              | there is activity on GitHub and builds are fixed
              | frequently, there is just no .exe or .msi for you to
              | download. In fact, ROCm for Linux is on a major 6
              | release (which includes MIOpen), but somehow Windows is
              | still on major 5 (without MIOpen) after almost a whole
              | year.
              | 
              | It almost makes me wonder: is there a shady deal
              | somewhere for AMD to never release the SDK on Windows,
              | to push the price of Nvidia cards higher? Why do they
              | keep developing these things without ever releasing
              | them?
        
         | anigbrowl wrote:
          | Given AMD's prior lack of interest, I'll take whatever options
         | there are. My daily driver has a Vega 10 GPU and it's been
         | quite frustrating not to be able to easily leverage it for
         | doing basic ML tasks, to the point that I've been looking at
         | buying an external nvidia GPU instead just to try out some of
         | the popular Python libraries.
        
         | Wowfunhappy wrote:
         | > CUDA is not designed to be vendor agnostic and Nvidia can
         | make things arbitrarily difficult [...] technically.
         | 
         | (Let's put the legal questions aside for a moment.)
         | 
         | nVidia changes GPU architectures every generation / few
         | generations, right? How does CUDA work across those--and how
         | can it have forwards compatibility in the future--if it's not
         | designed to be technologically agnostic?
        
           | andy_ppp wrote:
           | One way is to make sure the hardware team does certain things
           | to support easy transition to new architectures, we have seen
           | this with Apple Silicon for example!
        
           | saagarjha wrote:
           | PTX is meant to be portable across GPU microarchitectures.
           | That said, Nvidia owns the entire spec, so they can just keep
           | adding new instructions that their GPUs now support but AMD
           | GPUs don't.
        
         | koolala wrote:
          | CUDA v1... CUDA v2... CUDA v... CUDA isn't commonly
          | associated with a version number...
        
           | Uehreka wrote:
           | ...yes it is? https://developer.nvidia.com/cuda-toolkit-
           | archive
        
         | rjurney wrote:
         | Not having a layer like this has left AMD completely out of the
         | AI game that has made NVDA the world's most valuable company.
        
           | ChoGGi wrote:
           | Self-inflicted wounds hurt the most.
        
           | HarHarVeryFunny wrote:
           | Well, they kinda have it with their hipify tool, although
           | this is for porting CUDA code to AMD's HIP which supports
           | both AMD and NVIDIA. This supports CUDA C code and libraries
           | with AMD equivalents like cuDNN, cuBLAS, cuRAND, but doesn't
           | support porting of CUDA C inline PTX assembler. AMD have
            | their own inline GCN assembler, but seem to discourage its
           | use.
           | 
           | There are also versions of PyTorch, TensorFlow and JAX with
           | AMD support.
           | 
           | PyTorch's torch.compile can generate Triton (OpenAI's GPU
           | compiler) kernels, with Triton also supporting AMD.
        
         | neutrinobro wrote:
         | _Cries in OpenCL_
        
         | apatheticonion wrote:
          | Agreed. Rather than making CUDA the standard, AMD should
          | push/drive an open standard that can be run on any hardware.
         | 
         | We have seen this succeed multiple times: FreeSync vs GSync,
         | DLSS vs FSR, (not AMD but) Vulkan vs DirectX & Metal.
         | 
         | All of the big tech companies are obsessed with ring-fencing
         | developers behind the thin veil of "innovation" - where really
         | it's just good for business (I swear it should be regulated
         | because it's really bad for consumers).
         | 
         | A CUDA translation layer is okay for now but it does risk CUDA
         | becoming the standard API. Personally, I am comfortable with
         | waiting on an open standard to take over - ROCm has serviced my
         | needs pretty well so far.
         | 
         | Just wish GPU sharing with VMs was as easy as CPU sharing.
        
           | amy-petrik-214 wrote:
            | we actually also saw this historically with OpenGL. OpenGL
            | comes from an ancient company whispered about by the elderly
            | programmers (30+ years old) known as SGI. Originally it was
            | CLOSED SOURCE and SGI called it "IRIS GL", after a computer
            | codenamed IRIS which was cool looking with bright popping
            | color plastic and a faux granite keyboard. Good guy SGI open
            | sourced IRIS GL to become what we call "OpenGL" (get it, now
            | it's open), and then it stuck.
           | 
           | That's all to say NVIDIA _could_ pull a SGI and open their
            | stuff, but they're going more Sony style and trying to
           | monopolize. Oh, and SGI also wrote another ancient lore
           | library known as "STL" or the "SGI Template Library" which is
           | like the original boost template metaprogramming granddaddy
        
             | adrian_b wrote:
             | Also the XFS file system.
        
             | usr1106 wrote:
             | Nice story, but is it correct? Wikipedia says STL was first
             | implemented by HP and later by the same authors at SGI.
        
               | adrian_b wrote:
               | STL started even earlier, obviously without using the
               | name "STL", as a library of generic algorithms for the
               | programming language Ada (David R. Musser & Alexander A.
               | Stepanov, 1987).
        
           | pjmlp wrote:
           | Vulkan only matters on Android (from version 10 onwards) and
           | GNU/Linux.
           | 
           | Zero impact on Switch, Playstation, XBox, Windows, macOS,
           | iOS, iPadOS, Vision OS.
        
             | ChoGGi wrote:
             | "Windows"
             | 
             | dxvk-gplasync is a game changer for dx9-11 shader stutter.
        
               | pjmlp wrote:
                | Sure, for the 2% of folks that enjoy Windows games,
                | written against DirectX, on the Linux Steam store.
                | 
                | Which Android studios can't even be bothered to target
                | with their NDK engines, based on GL ES and Vulkan.
        
               | ChoGGi wrote:
                | I'm on Windows 11; if I see something that's not DX12
                | in my Afterburner overlay, I use it.
               | 
               | Even if there's no shader stutter, Vulkan tends to use
               | less juice than DX.
        
           | ChoGGi wrote:
           | "We have seen this succeed multiple times: FreeSync vs GSync,
           | DLSS vs FSR, (not AMD but) Vulkan vs DirectX & Metal."
           | 
            | I'll definitely agree with you on Sync and Vulkan, but
            | DLSS and XeSS are both better than FSR.
           | 
           | https://youtube.com/watch?v=el70HE6rXV4
        
           | gjulianm wrote:
           | OpenCL was released in 2009. AMD has had plenty of time to
           | push and drive that standard. But OpenCL had a worse
           | experience than CUDA, and AMD wasn't up to the task in terms
           | of hardware, so it made no real sense to go for OpenCL.
        
           | imtringued wrote:
           | AMD shouldn't push on anything. They have the wrong
           | incentives. They should just make sure that software runs on
           | their GPUs and nothing else.
           | 
           | Karol Herbst is working on Rusticl, which is mesa's latest
           | OpenCL implementation and will pave the way for other things
           | such as SYCL.
        
           | consf wrote:
           | A strategic and forward-thinking approach
        
           | naasking wrote:
           | > AMD should push/drive an open standard that can be run on
           | any hardware.
           | 
           | AMD has always been notoriously bad at the software side, and
           | they frequently abandon their projects when they're almost
           | usable, so I won't hold my breath.
        
         | magic_hamster wrote:
         | CUDA is the juice that built Nvidia in the AI space and allowed
          | them to charge crazy money for their hardware. Being able
          | to run CUDA on cost-effective AMD hardware could be a big
          | leap forward, allowing more people to do research and to
          | break away from Nvidia's stranglehold over VRAM. Nvidia
          | will never open source
         | their own platform unless their hand is forced. I think we all
         | should support this endeavor and contribute where possible.
        
         | Sparkyte wrote:
          | That is why an open standard should be made, so it isn't
          | locked to a particular piece of hardware and instead allows
          | modular support for different hardware to interface with
          | supported drivers.
        
         | raxxorraxor wrote:
         | I really hope they will do what you suggested. With some
         | innovative product placement, GPUs with a lot of memory for
         | example, they could dethrone nvidia if it doesn't change
         | strategy.
         | 
         | That said, easier said than done. You need very specialized
         | developers to build a CUDA equivalent and have people start
         | using it. AMD could do it with a more open development process
         | leveraging the open source community. I believe this will
         | happen at some point anyway by AMD or someone else. The market
         | just gets more attractive by the day and at some point the high
         | entry barrier will not matter much.
         | 
         | So why should AMD skimp on their ambitions here? This would be
         | a most sensible investment, few risks and high gains if
         | successful.
        
           | consf wrote:
           | This expanding market provides AMD with a lucrative
           | opportunity indeed
        
         | consf wrote:
         | The legal, technical and strategic challenges make it a less
         | attractive option
        
       | jarbus wrote:
        | Really, really, _really_ curious as to how they managed to pull
        | this off, if their project works as well as they claim it does.
        | If stuff as complex as paged/flash attention can "just work",
        | this is really cool.
        
         | Straw wrote:
         | My understanding from chatting with them is that tensor core
         | operations aren't supported yet, so FlashAttention likely won't
          | work. I think it's on their to-do list though!
         | 
          | Nvidia actually has more, and more capable, matrix
          | multiplication units, so even with a translation layer I
          | wouldn't expect the same performance until AMD produces better
          | ML cards.
         | 
         | Additionally, these kernels usually have high sensitivity to
         | cache and smem sizes, so they might need to be retuned.
        
           | Der_Einzige wrote:
           | So the only part that anyone actually cares about, as usual,
           | is not supported. Same story as it was in 2012 with AMD vs
           | Nvidia (and likely much before that too!). The more things
           | change, the more they stay the same.
        
             | Straw wrote:
             | People did GPGPU computing long before GPUs. Simply look at
             | the list of tested, supported projects on their docs page!
        
               | Straw wrote:
               | [EDIT] long before deep learning!
        
         | JonChesterfield wrote:
         | Cuda is a programming language. You implement it like any
         | other. The docs are a bit sparse but not awful. Targeting
         | amdgpu is probably about as difficult as targeting x64, mostly
         | changes the compiler runtime.
         | 
          | The inline ptx implementation is notable for being even more
         | annoying to deal with than the cuda, but it's just bytes in /
         | different bytes out. No magic.
        
           | ckitching wrote:
           | [I work on SCALE]
           | 
           | CUDA has a couple of extra problems beyond just any other
           | programming language:
           | 
            | - CUDA is more than a language: it's a giant library (for
            | both CPU and GPU) for interacting with the GPU, and for
            | writing the GPU code. This needed reimplementing. At least
            | for the device-side stuff we can implement it _in CUDA_, so
            | when we add support for other GPU vendors the code can
            | (mostly) just be recompiled and work there :D.
            | 
            | - CUDA (the language) is not actually specified. It is,
            | informally, "whatever nvcc does". This differs significantly
            | from what Clang's CUDA support does (which is ultimately
            | what the HIP compiler is derived from).
           | 
           | PTX is indeed vastly annoying.
        
             | JonChesterfield wrote:
             | The openmp device runtime library was originally written in
             | cuda. I ported that to hip for amdgpu, discovered the
             | upstream hip compiler wasn't quite as solid as advertised,
             | then ported it to openmp with some compiler intrinsics. The
             | languages are all essentially C++ syntax with some spurious
             | noise obfuscating llvm IR. The libc effort has gone with
              | freestanding c++ based on that experience, and we've now
             | mostly fixed the ways that goes wrong.
             | 
             | You might also find raw c++ for device libraries saner to
             | deal with than cuda. In particular you don't need to jury
             | rig the thing to not spuriously embed the GPU code in x64
             | elf objects and/or pull the binaries apart. Though if
             | you're feeding the same device libraries to nvcc with
             | #ifdef around the divergence your hands are tied.
        
               | ckitching wrote:
               | > You might also find raw c++ for device libraries saner
               | to deal with than cuda.
               | 
               | Actually, we just compile all the device libraries to
               | LLVM bitcode and be done with it. Then we can write them
               | using all the clang-dialect, not-nvcc-emulating, C++23 we
               | feel like, and it'll still work when someone imports them
               | into their c++98 CUDA project from hell. :D
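                | 
                | (With stock clang, the equivalent is roughly:
                | 
                |     clang++ -x cuda --cuda-device-only -emit-llvm -c \
                |         lib.cu -o lib.bc
                | 
                | modulo the CUDA-path and GPU-arch flags, which I've
                | omitted here.)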
        
       | m3kw9 wrote:
       | This isn't a solution for pros because it will always play catch
       | up and Nvidia can always add things to make it difficult. This is
       | like emulation.
        
         | bachmeier wrote:
          | > it will always play catch up
          | 
          | That's not important if the goal is to run existing CUDA code
          | on AMD GPUs. If you want to keep writing CUDA, all you have to
          | do is write portable CUDA code going forward, regardless of
          | what Nvidia does.
          | 
          | I don't know the economics here, but if AMD provides a
          | significant cost saving, companies are going to make it work.
         | 
         | > Nvidia can always add things to make it difficult
         | 
         | Sounds like Microsoft embedding the browser in the OS. It's
         | hard to see how doing something like that wouldn't trigger an
         | antitrust case.
        
         | dboreham wrote:
         | Pros will end up overruled by bean counters if it works.
        
         | ok123456 wrote:
         | It's not emulation. It's a compiler.
        
       | joe_the_user wrote:
       | This sounds fabulous. I look forward to AMD being drawn kicking
       | and screaming into direct competition with Nvidia.
        
       | gizajob wrote:
       | Is Nvidia not likely to sue or otherwise bork this into non-
       | existence?
        
         | chx wrote:
         | Sue over what...?
        
           | gizajob wrote:
            | Whatever IP-related issues they'd want to sue over. Sorry, I
            | don't know the specifics of what this would infringe, but I'm
            | sure expensive legal brains could come up with something.
        
         | CoastalCoder wrote:
          | I wonder if Nvidia's current anti-trust woes would make them
          | reluctant to go that route at the moment.
        
       | sakras wrote:
       | One question I always have about these sorts of translation
       | layers is how they deal with the different warp sizes. I'd
       | imagine a lot of CUDA code relies on 32-wide warps, while as far
       | as I know AMD tends to have 64-wide warps. Is there some sort of
       | emulation that needs to happen?
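        | 
        | For example, a classic warp-level reduction hard-codes both the
        | 32-lane mask and the shuffle distances, so it's not obvious what
        | a 64-wide wavefront would do with it. A hypothetical snippet of
        | the kind of code I mean:
        | 
        |     __device__ float warp_sum(float v) {
        |         // Assumes exactly 32 threads per warp: a full 32-bit
        |         // mask and log2(32) = 5 halving steps.
        |         for (int offset = 16; offset > 0; offset >>= 1)
        |             v += __shfl_down_sync(0xffffffffu, v, offset);
        |         return v;
        |     }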
        
         | mpreda wrote:
          | The older AMD _GCN_ had a 64-wide wavefront, but the newer AMD
          | "RDNA" GPUs support both 64- and 32-wide wavefronts, and this
          | is configurable at runtime. It appears the narrower wavefronts
          | are better suited for games in general.
          | 
          | Not sure what the situation is with "CDNA", the compute-
          | oriented evolution of "GCN", i.e. whether CDNA is 64-wavefront
          | only or dual like RDNA.
        
         | msond wrote:
         | SCALE is not a "translation layer", it's a full source-to-
         | target compiler from CUDA-like C++ code to AMD GPUs.
         | 
         | See this part of the documentation for more details regarding
         | warp sizes: https://docs.scale-lang.com/manual/language-
         | extensions/#impr...
        
       | ladberg wrote:
       | I don't really see how any code that depends heavily on the
       | underlying hardware can "just work" on AMD. Most serious CUDA
       | code is aware of register file and shared memory sizes, wgmma
       | instructions, optimal tensor core memory & register layouts,
       | tensor memory accelerator instructions, etc...
       | 
       | Presumably that stuff doesn't "just work" but they don't want to
       | mention it?
        
         | lmeyerov wrote:
         | Sort of
         | 
         | A lot of our hw-aware bits are parameterized where we fill in
          | constants based on the available hw. Doable to port, same as
         | we do whenever new Nvidia architectures come out.
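          | 
          | E.g., roughly like this (an illustrative sketch, not our
          | actual code):
          | 
          |     cudaDeviceProp prop;
          |     cudaGetDeviceProperties(&prop, 0);
          |     // Derive tuning constants from the hardware we actually
          |     // have, instead of hard-coding them per architecture.
          |     int    warp = prop.warpSize;            // 32 vs 64 lanes
          |     size_t smem = prop.sharedMemPerBlock;   // per-block budget
          |     int    tile = warp * 4;                 // derived tile size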
         | 
         | But yeah, we have tricky bits that inline PTX, and.. that will
         | be more annoying to redo.
        
           | Retr0id wrote:
           | > SCALE accepts CUDA programs as-is. [...] This is true even
           | if your program uses inline PTX asm
        
             | lmeyerov wrote:
              | Oh that will be interesting to understand, as PTX gets
              | more into trickier hw-arch-specific phenomena that diff
              | brands disagree on, like memory models. Neat!
        
               | lmeyerov wrote:
               | Looks like the PTX translation is via another project
               | ZLUDA, though how they bridge the differences in
               | memory/consistency/etc models safely remains unclear to
               | me...
        
               | ckitching wrote:
               | Hi! Spectral engineer here!
               | 
               | SCALE does not use any part of ZLUDA. We have modified
                | the clang frontend to convert inline PTX asm blocks to
               | LLVM IR.
               | 
               | To put in a less compiler-engineer-ey way: for any given
               | block of PTX, there exists a hypothetical sequence of
               | C++/CUDA code you could have written to achieve the same
               | effect, but on AMD (perhaps using funky __builtin_...
               | functions if the code includes shuffles/ballots/other-
               | weird-gpu-stuff). Our compiler effectively converts the
               | PTX into that hypothetical C++.
               | 
               | Regarding memory consistency etc.: NVIDIA document the
               | "CUDA memory consistency model" extremely thoroughly, and
               | likewise, the consistency guarantees for PTX. It is
               | therefore sufficient to ensure that we use operations at
               | least as synchronising as those called for in the
               | documented semantics of the language (be it CUDA or PTX,
               | for each operation).
               | 
               | Differing consistency _between architectures_ is the
               | AMDGPU backend's problem.
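                | 
                | As a rough illustration (not a literal dump of what the
                | compiler emits): a PTX acquire load such as
                | 
                |     ld.acquire.gpu.u32 %r0, [%rd1];
                | 
                | can be lowered to the equivalent of
                | 
                |     // p is an unsigned* here
                |     unsigned v = __atomic_load_n(p, __ATOMIC_ACQUIRE);
                | 
                | and the AMDGPU backend then picks instructions at least
                | as synchronising as that for the target architecture.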
        
               | ladberg wrote:
               | Just to check here, if you're given something like the
               | following PTX:
               | wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16
               | 
               | Do you reverse it back into C++ that does the
               | corresponding FMAs manually instead of using tensor
               | hardware? Or are you able to convert it into a series of
               | __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt instructions that
               | emulate the same behavior?
        
               | ckitching wrote:
               | Rather awkwardly, you've asked about an instruction that
               | isn't currently implemented. :D Support for wmma and
               | friends is in development.
               | 
                | But in general the answer to your question is yes: we use
                | AMD-specific builtins where available/efficient to make
                | things work. Otherwise many things would be
                | unrepresentable, not just slow!
        
               | saagarjha wrote:
               | What do you do when a builtin doesn't exist?
        
               | ckitching wrote:
               | Add one: it's trivial to add a compiler builtin to carry
               | the instruction from the frontend to the backend if an
               | instruction exists and the backend knows about it.
               | 
               | If there's no instruction, either, you can write a C++
               | function to replicate the behaviour and codegen a call to
               | it. Since the PTX blocks are expanded during initial IR
               | generation, it all inlines nicely by the end. Of course,
               | such software emulation is potentially suboptimal
               | (depends on the situation).
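                | 
                | For instance, if a target lacked a funnel-shift
                | instruction, PTX's shf.l.wrap.b32 could fall back to a
                | small C++ helper along these lines (illustrative, not
                | our actual code):
                | 
                |     // Upper 32 bits of ({hi, lo} << (shift & 31)).
                |     __device__ unsigned shf_l_wrap(unsigned lo,
                |                                    unsigned hi,
                |                                    unsigned shift) {
                |         unsigned s = shift & 31u;
                |         return s ? ((hi << s) | (lo >> (32u - s))) : hi;
                |     }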
        
               | lmeyerov wrote:
               | Ah I was reading the 'deeper dive' section on my phone
               | and missed it was a comparison, not a warning, thank you
               | 
               | I'm curious how something like this example would
               | translate:
               | 
               | ===
               | 
               | Mapping lower-level ptx patterns to higher-level AMD
               | constructs like __ballot, and knowing it's safe
               | 
                | ```
                | #ifdef INLINEPTX
                | inline uint ptx_thread_vote(float rSq, float rCritSq) {
                |     uint result = 0;
                |     asm("{\n\t"
                |         ".reg .pred cond, out;\n\t"
                |         "setp.ge.f32 cond, %1, %2;\n\t"
                |         "vote.sync.all.pred out, cond, 0xffffffff;\n\t"
                |         "selp.u32 %0, 1, 0, out;\n\t"
                |         "}\n\t"
                |         : "=r"(result)
                |         : "f"(rSq), "f"(rCritSq));
                |     return result;
                | }
                | #endif
                | ```
               | 
               | ===
               | 
                | Again, I'm guessing there might be an equivalent simpler
               | program involving AMD's __ballot, but I'm unsure of the
               | true equivalence wrt safety, and it seems like a tricky
               | rewrite as it needs to (afaict) decompile to recover the
               | higher-level abstraction. Normally it's easier to compile
               | down or sideways (translate), and it's not clear to me
               | these primitives are 1:1 for safely doing so.
               | 
               | ===
               | 
               | FWIW, this is all pretty cool. We stay away from PTX --
               | most of our app code is higher-level, whether RAPIDS (GPU
               | dataframes, GPU ML, etc libs), minimal cuda, and minimal
               | opencl, with only small traces of inline ptx. So more
               | realistically, if we had the motivation, we'd likely
               | explore just #ifdef'ing it with something predictable.
        
               | ckitching wrote:
               | I compiled your function with SCALE for gfx1030:
                | 
                |         .p2align 2              ; -- Begin function _Z15ptx_thread_voteff
                |         .type   _Z15ptx_thread_voteff,@function
                | _Z15ptx_thread_voteff:          ; @_Z15ptx_thread_voteff
                | ; %bb.0:                        ; %entry
                |         s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
                |         s_waitcnt_vscnt null, 0x0
                |         v_cmp_ge_f32_e32 vcc_lo, v0, v1
                |         s_cmp_eq_u32 vcc_lo, -1
                |         s_cselect_b32 s4, -1, 0
                |         v_cndmask_b32_e64 v0, 0, 1, s4
                |         s_setpc_b64 s[30:31]
                | .Lfunc_end1:
                |         .size   _Z15ptx_thread_voteff, .Lfunc_end1-_Z15ptx_thread_voteff
                |                                 ; -- End function
               | 
                | What were the safety concerns you had? This code seems to
                | be something like `return __all_sync(0xffffffff, rSq >=
                | rCritSq) ? 1 : 0`, right?
        
               | lmeyerov wrote:
               | It's supposed to be waiting for all threads to vote
               | 
               | I'm not familiar with AMD enough to know if additional
               | synchronization is needed. ChatGPT recommended adding
               | barriers beyond what that gave, but again, I'm not
               | familiar with AMD commands.
        
               | ckitching wrote:
               | Indeed, no extra synchronisation is needed here due to
               | the nature of the hardware (threads in a warp can't get
               | out of sync with each other).
               | 
               | Even on NVIDIA, you could've written this without the asm
                | as discussed above!
        
               | lmeyerov wrote:
               | Yeah I think, after this snippet was written, cuda added
               | __all_sync as an intrinsic. The divergent code before
               | this was plain-ish cuda, and this snippet ensures they
               | wait on the comparison vote before recurring.
               | 
               | So in the AMD version, the compiler correctly realized
                | the synchronization was on the comparison, so it adds the
                | AMD version right before it. That seems like a
               | straightforward transform here.
               | 
               | It'd be interesting to understand the comparison of what
               | Nvidia primitives map vs what doesn't. The above is a
               | fairly simple barrier. We avoided PTX as much as we could
               | and wrote it as simply as we could, I'd expect most of
                | our PTX to port for similar reasons. The story is a bit
                | different for libraries we call. E.g., cudf probably has
               | little compute-tier ptx directly, but will call nvidia
               | libs, and use weird IO bits like cufile / gpu direct
               | storage.
        
         | Moldoteck wrote:
          | It's speculation, but I think it's similar to processors:
          | nobody guarantees the code will run the way you set it up. You
          | may want to use some specific register, but if the processor
          | thinks another register can fulfill the task, it'll use that
          | while telling you your code executed as expected. Maybe the
          | internal GPU processor of AMD can sufficiently simulate the
          | behavior of Nvidia hardware so that higher abstractions will be
          | unaware that something different is happening under the hood.
        
         | consf wrote:
         | It involves significant challenges
        
       | shmerl wrote:
       | Compiler isn't open source? That feels like DOA in this day and
       | age. There is ZLUDA already which is open.
       | 
       | If they plan to open it up, it can be something useful to add to
       | options of breaking CUDA lock-in.
        
         | uyzstvqs wrote:
          | ZLUDA is pretty good, except that it lacks cuDNN, which makes
          | most PyTorch projects just not work. Not sure if this project
          | covers that? That could be a game changer; otherwise, yeah,
          | ZLUDA is the better open-source option.
        
       | cheptsov wrote:
        | Sounds really awesome. Any chance someone can say whether this
        | also works inside a Docker container?
        
         | ckitching wrote:
         | It works exactly as well as other AMDGPU-related software (HIP
         | etc.) works inside Docker.
         | 
         | There are some delightful AMD driver issues that make certain
         | models of GPU intermittently freeze the kernel when used from
         | docker. That was great fun when building SCALE's CI system :D.
        
           | cheptsov wrote:
           | Would love to give it a try! Thanks for answering my
           | question.
        
         | SushiHippie wrote:
          | Works as described in the ROCm documentation (at least
          | scaleinfo worked for me; I haven't tested further):
         | 
         | https://rocm.docs.amd.com/projects/install-on-linux/en/lates...
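          | 
          | i.e. pass the GPU device nodes through to the container;
          | something like this (from memory, double-check against the
          | docs):
          | 
          |     docker run -it --device=/dev/kfd --device=/dev/dri \
          |         --security-opt seccomp=unconfined <your-image>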
        
           | cheptsov wrote:
           | Thank you! This link is very helpful.
        
         | cheptsov wrote:
         | Wow, somebody doesn't like Docker enough to downvote my
         | question.
        
       | resters wrote:
       | The main cause of Nvidia's crazy valuation is AMD's unwillingness
       | to invest in making its GPUs as useful as Nvidia's for ML.
       | 
       | Maybe AMD fears antitrust action, or maybe there is something
       | about its underlying hardware approach that would limit
       | competitiveness, but the company seems to have left billions of
       | dollars on the table during the crypto mining GPU demand spike
       | and now during the AI boom demand spike.
        
         | karolist wrote:
          | I think this could be cultural differences; AMD's software
          | department has been underfunded and doing poorly for a long
          | time now.
         | 
         | * https://www.levels.fyi/companies/amd/salaries/software-
         | engin...
         | 
         | * https://www.levels.fyi/companies/nvidia/salaries/software-
         | en...
         | 
          | And it's probably better now. Nvidia was paying much more long
          | before that, and their growing stock attracts even more talent.
        
           | 1024core wrote:
            | > I think this could be cultural differences; AMD's software
            | department has been underfunded and doing poorly for a long
            | time now.
           | 
            | Rumor is that ML engineers (which AMD really needs) are
            | expensive, and AMD doesn't want to give them more money than
            | the rest of the SWEs they have (for fear of pissing off the
            | existing SWEs). So AMD is caught in a bind: it can't pay to
            | get top MLE talent, and it can't just sit by and watch NVDA
            | eat its lunch.
        
             | mepian wrote:
             | AMD recently acquired Silo AI.
        
             | karolist wrote:
              | I find this hard to believe. Every big company has levels;
              | unless your existing L7+ ICs are below market, you can just
              | pull in L7+ salaried ML engineers with some secret signing
              | bonus like literally everyone else.
        
               | Der_Einzige wrote:
                | The dirty secret in the tech industry is that AMD,
                | Intel, IBM, and historically Nvidia/Oracle (this changed
                | post-2022) were the 2nd-3rd tier tech companies. Staffed
                | heavily by the rejects of the FAANGs, their people were
                | still happy to have their 100-200K in their MCOL areas,
                | but had no free food and a much more boring work culture.
                | Intel's "great place to work" corporate propaganda was
                | known as "great place to leetcode" while I worked there,
                | as Intel was always seen as a stepping stone before you
                | "made it" in a FAANG.
               | 
               | Culturally, none of these companies were happy to pay
                | anyone except the tip-top "distinguished" engineers more
               | than 300K. AMD seems to be stuck in this mentality, just
               | as IBM is.
        
               | quotemstr wrote:
               | > AMD seems to be stuck in this mentality, just as IBM
               | is.
               | 
               | And that's why creative destruction is essential for
               | technological progress. It's common for organizations to
               | get stuck in stable-but-suboptimal social equilibria:
               | everyone knows there's a problem but nobody can fix it.
               | The only way out is to make a new organization and let
               | the old one die.
        
             | xboxnolifes wrote:
             | > So AMD is caught in a bind: can't pay to get top MLE
             | talent and can't just sit by and watch NVDA eat its lunch.
             | 
             | This isn't being caught in a bind. This is, if true, just
             | making a poor decision. Nothing is _really_ preventing them
             | from paying more for specialized work.
        
           | DaoVeles wrote:
           | So nothing has changed since the era of ATI.
        
         | dist-epoch wrote:
         | There are stories from credible sources that AMD software
         | engineers had to buy AMD GPUs with their own money to use in CI
         | machines.
        
         | ClassyJacket wrote:
         | I like to watch YouTube retrospectives on old failed tech
         | companies - LGR has some good ones.
         | 
         | When I think of AMD ignoring machine learning, I can't help
          | imagining a future YouTuber's voiceover explaining how this
         | caused their downfall.
         | 
         | There's a tendency sometimes to think "they know what they're
         | doing, they must have good reasons". And sometimes that's
         | right, and sometimes that's wrong. Perhaps there's some great
         | technical, legal, or economic reason I'm just not aware of. But
         | when you actually look into these things, it's surprising how
         | often the answer is indeed just shortsightedness.
         | 
         | They could end up like BlackBerry, Blockbuster, Nokia, and
         | Kodak. I guess it's not quite as severe, since they will still
         | have a market in games and therefore may well continue to
         | exist, but it will still be looked back on as a colossal
         | mistake.
         | 
         | Same with Toyota ignoring electric cars.
         | 
         | I'm not an investor, but I still have stakes in the sense that
         | Nvidia has no significant competition in the machine learning
         | space, and that sucks. GPU prices are sky high and there's
         | nobody else to turn to if there's something about Nvidia you
         | just don't like or if they decide to screw us.
        
           | robertlagrant wrote:
           | There's also just the idea of endeavour - Nvidia tried
           | something, and it worked. Businesses (or rather their
           | shareholders) take risks with their capital sometimes, and it
           | doesn't always work. But in this case it did.
        
             | robocat wrote:
              | And Nvidia has a reputation for going all-in on certain
              | market decisions. That is hard to compete against when the
              | bet works.
        
           | _boffin_ wrote:
            | If you haven't heard of this book, you might like it:
            | Dealers of Lightning.
        
           | hedora wrote:
            | In fairness to AMD, they bet on crypto, and Nvidia bet on
            | AI. Crypto was the right short-term bet.
           | 
           | Also, ignoring is a strong word: I'm staring at a little <<
           | $1000, silent 53 watt mini-PC with an AMD SoC. It has an NPU
           | comparable to an M1. In a few months, with the ryzen 9000
           | series, NPUs for devices of its class will bump from 16 tops
           | to 50 tops.
           | 
           | I'm pretty sure the linux taint bit is off, and everything
           | just worked out of the box.
        
           | daedrdev wrote:
           | Toyota is extremely strong in the hybrid car market, and with
            | ravenous competition for electric cars and slowing demand,
            | Toyota may have made the right decision after all.
        
         | gukov wrote:
          | The companies' CEOs are related. My conspiracy theory is that
         | they don't want to step on each other's toes. Not sure if that
         | works with fiduciary duty, though.
        
           | arendtio wrote:
           | I searched for it and found this (in case someone else might
           | want to read it):
           | 
           | https://www.tomshardware.com/news/jensen-huang-and-lisa-
           | su-f...
        
         | anticensor wrote:
          | AMD fears anti-collusion action; remember, the CEOs of the two
          | are just barely distant enough in kinship not to be
          | automatically considered colluding with each other.
        
       | paulmist wrote:
       | Doesn't seem to mention CDNA?
        
       | JonChesterfield wrote:
       | This is technically feasible so might be the real thing. Parsing
       | inline ptx and mapping that onto amdgpu would be a huge pain.
       | 
       | Working from cuda source that doesn't use inline ptx to target
       | amdgpu is roughly regex find and replace to get hip, which has
       | implemented pretty much the same functionality.
       | 
       | Some of the details would be dubious, e.g. the atomic models
       | probably don't match, and volta has a different instruction
       | pointer model, but it could all be done correctly.
       | 
       | Amd won't do this. Cuda isn't a very nice thing in general and
       | the legal team would have kittens. But other people totally
       | could.
        
         | ckitching wrote:
         | [I work on SCALE]
         | 
         | Mapping inline ptx to AMD machine code would indeed _suck_.
         | Converting it to LLVM IR right at the start of compilation
         | (when the initial IR is being generated) is much simpler, since
         | it then gets  "compiled forward" with the rest of the code.
         | It's as if you wrote C++/intrinsics/whatever instead.
         | 
         | Note that nvcc accepts a different dialect of C++ from clang
         | (and hence hipcc), so there is in fact more that separates CUDA
         | from hip (at the language level) than just find/replace. We
         | discuss this a little in [the manual](https://docs.scale-
         | lang.com/manual/dialects/)
         | 
         | Handling differences between the atomic models is, indeed,
         | "fun". But since CUDA is a programming language with documented
         | semantics for its memory consistency (and so is PTX) it is
         | entirely possible to arrange for the compiler to "play by
         | NVIDIA's rules".
        
           | JonChesterfield wrote:
           | Huh. Inline assembly is strongly associated in my mind with
           | writing things that can't be represented in LLVM IR, but in
           | the specific case of PTX - you can only write things that
           | ptxas understands, and that probably rules out wide classes
           | of horrendous behaviour. Raw bytes being used for
           | instructions and for data, ad hoc self modifying code and so
           | forth.
           | 
           | I believe nvcc is roughly an antique clang build hacked out
           | of all recognition. I remember it rejecting templates with
           | 'I' as the type name and working when changing to 'T',
           | nonsense like that. The HIP language probably corresponds
           | pretty closely to clang's cuda implementation in terms of
           | semantics (a lot of the control flow in clang treats them
           | identically), but I don't believe an exact match to nvcc was
           | considered particularly necessary for the clang -x cuda work.
           | 
           | The ptx to llvm IR approach is clever. I think upstream would
           | be game for that, feel free to tag me on reviews if you want
           | to get that divergence out of your local codebase.
        
             | saagarjha wrote:
             | nvcc is nowhere near that bad these days, it supports most
             | C++ code directly (for example, I've written kernels that
             | include headers like <span> or <algorithm> and they work
             | just fine).
        
               | ckitching wrote:
               | NVCC is doing much better than before in terms of "broken
               | C++". There was indeed a time when lots of modern C++
                | just _didn't work_.
               | 
               | Nowadays the issues are more subtle and nasty. Subtle
               | differences in overload resolution. Subtle differences in
               | lambda handling. Enough to break code in "spicy" ways
               | when you try to port it over.
        
             | ckitching wrote:
             | I certainly would not attempt this feat with x86 `asm`
             | blocks :D. PTX is indeed very pedestrian: it's more like IR
             | than machine code, really. All the usual "machine-level
             | craziness" that would otherwise make this impossible is
             | just unrepresentable in PTX (though you do run into cases
             | of "oopsie, AMD don't have hardware for this so we have to
             | do something insane").
        
               | JonChesterfield wrote:
               | It's a beautiful answer to a deeply annoying language
               | feature. I absolutely love it. Yes, inline asm containing
               | PTX definitely should be burned off at the compiler front
               | end, regardless of whether it ultimately codegens as PTX
               | or something else.
               | 
                | I've spawned a thread on the llvm board asking if anyone
               | else wants that as a feature
               | https://discourse.llvm.org/t/fexpand-inline-ptx-as-a-
               | feature... in the upstream. That doesn't feel great -
               | you've done something clever in a proprietary compiler
               | and I'm suggesting upstream reimplement it - so I hope
               | that doesn't cause you any distress. AMD is relatively
               | unlikely to greenlight me writing it so it's _probably_
               | just more marketing unless other people are keen to parse
               | asm in string literals.
        
       | ur-whale wrote:
       | If this actually works (remains to be seen), I can only say:
        | 1) Kudos. 2) Finally!
        
         | gedy wrote:
         | or: 1) CUDAs
        
         | anthonix1 wrote:
         | I just tried it with llm.c ... seems to be missing quite a few
         | key components such as cublaslt, bfloat16 support, nvtx3,
         | compiler flags such as -t
         | 
          | And it's linked against an old release of ROCm.
         | 
         | So unclear to me how it is supposed to be an improvement over
         | something like hipify
        
           | ckitching wrote:
           | Greetings, I work on SCALE.
           | 
           | It appears we implemented `--threads` but not `-t` for the
            | compiler flag. Oops. In either case, the flag has no effect
           | at present, since fatbinary support is still in development,
           | and that's the only part of the process that could
           | conceivably be parallelised.
           | 
           |  _That said_ : clang (and hence the SCALE compiler) tends to
           | compile CUDA much faster than nvcc does, so this lack of the
           | parallelism feature is less problematic than it might at
           | first seem.
           | 
            | NVTX support (if you want more than just "no-ops to make the
            | code compile") requires cooperation with the authors of
            | profilers etc., which has not so far been available.
            | 
            | bfloat16 is not properly supported by AMD anyway: the
            | hardware doesn't do it, and HIP's implementation just lies
            | and does the math in `float`. For that reason we haven't
            | prioritised putting together the API.
           | 
           | cublasLt is a fair cop. We've got a ticket :D.
        
             | anthonix1 wrote:
             | Hi, why do you believe that bfloat16 is not supported? Can
             | you please provide some references (specifically the part
             | about the hardware "doesn't do it")?
             | 
             | For the hardware you are focussing on (gfx11), the
             | reference manual [2] and the list of LLVM gfx11
             | instructions supported [1] describe the bfloat16 vdot &
             | WMMA operations, and these are in fact implemented and
             | working in various software such as composable kernels and
             | rocBLAS, which I have used (and can guarantee they are not
             | simply being run as float). I've also used these in the AMD
             | fork of llm.c [3]
             | 
             | Outside of gfx11, I have also used bfloat16 in CDNA2 & 3
             | devices, and they are working and being supported.
             | 
             | Regarding cublasLt, what is your plan for support there?
             | Pass everything through to hipblasLt (hipify style) or
             | something else?
             | 
             | Cheers, -A
             | 
             | [1] https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX11.html [2]
             | https://www.amd.com/content/dam/amd/en/documents/radeon-
             | tech... [3] http://github.com/anthonix/llm.c
        
               | ckitching wrote:
               | > Hi, why do you believe that bfloat16 is not supported?
               | 
               | Apologies, I appear to be talking nonsense. I conflated
               | bfloat16 with nvidia's other wacky floating point
               | formats. This is probably my cue to stop answering
               | reddit/HN comments and go to bed. :D
               | 
               | So: ahem: bfloat16 support is basically just missing the
               | fairly boring header.
               | 
               | > Regarding cublasLt, what is your plan for support
               | there? Pass everything through to hipblasLt (hipify
               | style) or something else?
               | 
                | Pretty much that, yes. Not much point reimplementing all
                | the math libraries when AMD is doing that part of the
                | legwork already.
        
               | anthonix1 wrote:
               | OK, so in the case of llm.c, if you're just including the
               | HIP headers, using hipblasLt, etc, what would be the
               | benefit of using scale instead of hipify?
        
               | Straw wrote:
               | Seems like a big benefit would come from not forking the
               | codebase into two versions!
        
       | ashvardanian wrote:
       | It's great that there is a page about current limitations [1],
       | but I am afraid that what most people describe as "CUDA" is a
       | small subset of the real CUDA functionality. Would be great to
       | have a comparison table for advanced features like warp shuffles,
       | atomics, DPX, TMA, MMA, etc. Ideally a table, mapping every PTX
       | instruction to a direct RDNA counterpart or a list of
       | instructions used to emulate it.
       | 
       | [1]: https://docs.scale-lang.com/manual/differences/
        
         | ckitching wrote:
          | You're right that most people only use a small subset of CUDA:
          | we prioritised support for features based on what was needed
          | for various open-source projects, as a way to try to capture
          | the most common things first.
          | 
          | A complete API comparison table is coming soon, I believe. :D
         | 
          | In a nutshell:
          | 
          | - DPX: Yes.
          | 
          | - Shuffles: Yes. Including the PTX versions, with all their
          | weird/wacky/insane arguments.
          | 
          | - Atomics: Yes, except the 128-bit atomics nvidia added very
          | recently.
          | 
          | - MMA: In development, though of course we can't fix the fact
          | that nvidia's hardware in this area is just _better_ than
          | AMD's, so don't expect performance to be as good in all cases.
          | 
          | - TMA: On the same branch as MMA, though it'll just be using
          | AMD's async copy instructions.
         | 
         | > mapping every PTX instruction to a direct RDNA counterpart or
         | a list of instructions used to emulate it.
         | 
          | We plan to publish a compatibility table of which instructions
         | are supported, but a list of the instructions used to produce
         | each PTX instruction is not in general meaningful. The inline
         | PTX handler works by converting the PTX block to LLVM IR at the
         | start of compilation (at the same time the rest of your code
         | gets turned into IR), so it then "compiles forward" with the
         | rest of the program. As a result, the actual instructions
          | chosen vary on a case-by-case basis due to the whims of the
         | optimiser. This design in principle produces better performance
         | than a hypothetical solution that turned PTX asm into AMD asm,
         | because it conveniently eliminates the optimisation barrier an
         | asm block typically represents. Care, of course, is taken to
         | handle the wacky memory consistency concerns that this implies!
         | 
         | We're documenting which ones are expected to perform worse than
         | on NVIDIA, though!
        
           | ashvardanian wrote:
           | Have you seen anyone productively using TMA on Nvidia or
           | async instructions on AMD? I'm currently looking at a 60%
           | throughput degradation for 2D inputs on H100:
           | https://github.com/ashvardanian/scaling-
           | democracy/blob/a8092...
        
           | einpoklum wrote:
           | > You're right that most people only use a small subset of
           | cuda
           | 
           | This is true first and foremost for the host-side API. From
           | my StackOverflow and NVIDIA forums experience - I'm often the
           | first and only person to ask about any number of nooks and
           | crannies of the CUDA Driver API, with issues which nobody
           | seems to have stumbled onto before; or at least - not
           | stumbled and wrote anything in public about it.
        
             | ckitching wrote:
             | Oh yes, we found all kinds of bugs in Nvidia's cuda
             | implementation during this project :D.
             | 
             | There's a bunch of pretty obscure functions in the device
             | side apis too: some esoteric math functions, old simd
             | "intrinsics" that are mostly irrelevant with modern
             | compilers, etc.
        
       | qwerty456127 wrote:
       | > gfx1030, gfx1100, gfx1010, gfx1101, gfx900...
       | 
        | How do I find out which one I have?
        
         | ckitching wrote:
         | Like this:
         | 
         | https://docs.scale-lang.com/manual/how-to-use/#identifying-g...
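          | 
          | (If you have ROCm installed, something like `rocminfo | grep
          | gfx` will also print the gfx target of each device.)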
        
         | systemBuilder wrote:
         | gfx1101 : https://www.techpowerup.com/gpu-specs/amd-
         | navi-32.g1000
         | 
         | gfx1100 : https://www.techpowerup.com/gpu-specs/amd-
         | navi-31.g998
         | 
         | gfx1030 : https://www.techpowerup.com/gpu-specs/amd-
         | navi-21.g923
         | 
         | gfx1010 : https://www.techpowerup.com/gpu-specs/amd-
         | navi-10.g861
         | 
         | gfx900 : https://www.techpowerup.com/gpu-specs/amd-vega-10.g800
        
       | galaxyLogic wrote:
       | Companies selling CUDA software should no doubt adopt this tool
        
       | yieldcrv wrote:
       | the real question here is whether anybody has gotten cheap,
       | easily available AMD GPUs to run their AI workloads, and if we
       | can predict more people will do so
        
         | JonChesterfield wrote:
         | Microsoft have their production models running on amdgpu. I
         | doubt it was easy but it's pretty compelling as an existence
         | proof
        
         | anthonix1 wrote:
          | I ported Karpathy's llm.c repo to AMD devices [1], and have
         | trained GPT2 from scratch with 10B tokens of fineweb-edu on a
         | 4x 7900XTX machine in just a few hours (about $2 worth of
         | electricity) [2].
         | 
         | I've also trained the larger GPT2-XL model from scratch on
         | bigger CDNA machines.
         | 
         | Works fine.
         | 
         | [1] https://github.com/anthonix/llm.c [2]
         | https://x.com/zealandic1
        
       | EGreg wrote:
       | But the question is, can it also run SHUDA and WUDA?
        
       | nabogh wrote:
       | I've written a bit of CUDA before. If I want to go pretty bare-
       | bones, what's the equivalent setup for writing code for my AMD
       | card?
        
         | JonChesterfield wrote:
         | HIP works very similarly. Install rocm from your Linux
         | distribution or from amd's repo, or build it from
         | github.com/rocm. Has the nice feature of being pure userspace
         | if you use the driver version that's already in your kernel.
         | 
         | How turn-key / happy an experience that is depends on how
         | closely your system correlates with one of the
         | documented/tested distro versions and what GPU you have. If
         | it's one that doesn't have binary versions of rocblas etc in
         | the binary blob, either build rocm from source or don't bother
         | with rocblas.
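          | 
          | A minimal HIP program looks almost identical to its CUDA
          | counterpart; an untested sketch, built with `hipcc axpy.cpp`:
          | 
          |     #include <hip/hip_runtime.h>
          | 
          |     __global__ void axpy(float a, const float* x, float* y,
          |                          int n) {
          |         int i = blockIdx.x * blockDim.x + threadIdx.x;
          |         if (i < n) y[i] += a * x[i];
          |     }
          | 
          |     int main() {
          |         const int n = 1 << 20;
          |         float *x, *y;
          |         hipMalloc(&x, n * sizeof(float));  // cf. cudaMalloc
          |         hipMalloc(&y, n * sizeof(float));
          |         // ...copy data in with hipMemcpy, as with CUDA...
          |         axpy<<<(n + 255) / 256, 256>>>(2.0f, x, y, n);
          |         hipDeviceSynchronize();
          |         hipFree(x); hipFree(y);
          |     }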
        
       | spfd wrote:
       | Very impressive!
       | 
        | But I can't help but think: if something like this can be done
        | to this extent, I wonder what went wrong/why it's a struggle for
        | OpenCL to unify the two fragmented communities. While this is
       | very practical and has a significant impact for people who
       | develop GPGPU/AI applications, for the heterogeneous computing
       | community as a whole, relying on/promoting a proprietary
       | interface/API/language to become THE interface to work with
       | different GPUs sounds like bad news.
       | 
       | Can someone educate me on why OpenCL seems to be out of scene in
       | the comments/any of the recent discussions related to this topic?
        
         | vedranm wrote:
         | If you are going the "open standard" route, SYCL is much more
         | modern than OpenCL and also nicer to work with.
        
         | JonChesterfield wrote:
         | Opencl gives you the subset of capability that a lot of
         | different companies were confident they could implement. That
         | subset turns out to be intensely annoying to program in - it's
         | just the compiler saying no over and over again.
         | 
         | Or you can compile as freestanding c++ with clang extensions
         | and it works much like a CPU does. Or you can compile as cuda
         | or openmp and most stuff you write actually turns into code,
         | not a semantic error.
         | 
         | Currently cuda holds lead position but it should lose that
         | place because it's horrible to work in (and to a lesser extent
         | because more than one company knows how to make a GPU). Openmp
         | is an interesting alternative - need to be a little careful to
         | get fast code out but lots of things work somewhat intuitively.
         | 
         | Personally, I think raw C++ is going to win out and the many
         | heterogeneous languages will ultimately be dropped as basically
         | a bad idea. But time will tell. Opencl looks very DoA.
        
         | mschuetz wrote:
         | OpenCL isn't nice to use and lacks tons of quality of life
          | features. I wouldn't use it even if it were twice as fast as
          | CUDA.
        
       | localfirst wrote:
       | > SCALE does not require the CUDA program or its build system to
       | be modified.
       | 
       | how big of a deal is this?
        
         | JonChesterfield wrote:
         | People can be wildly hostile to changing their programs. The
         | people who wrote it aren't here any more, the program was
         | validated as-is, changing it tends to stop the magic thing
         | working and so forth.
         | 
         | That changing the compiler is strongly equivalent to changing
         | the source doesn't necessarily influence this pattern of
         | thinking. Customer requests to keep the performance gains from
         | a new compiler but not change the UB they were relying on with
         | the old are definitely a thing.
        
       | rjurney wrote:
       | If it's efficient, this is very good for competition.
        
       | ekelsen wrote:
       | A major component of many CUDA programs these days involves NCCL
       | and high bandwidth intra-node communication.
       | 
       | Does NCCL just work? If not, what would be involved in getting it
       | to work?
        
       | pjmlp wrote:
       | This targets CUDA C++, not CUDA the NVIDIA infrastructure for C,
       | C++, Fortran, and anything else targeting PTX.
        
         | ckitching wrote:
         | The CUDA C APIs are supported as much in C as in C++ using
         | SCALE!
         | 
         | Cuda-fortran is not currently supported by scale since we
         | haven't seen much use of it "in the wild" to push it up our
         | priority list.
        
           | anon291 wrote:
           | It doesn't matter though. NVIDIA distributes tons of
           | libraries built atop CUDA that you cannot distribute or use
           | on AMD chips legally. Cutlass, CuBLAS, NCCL, etc.
        
             | tama_sala wrote:
              | Correct, which is one of the main moats Nvidia has when it
              | comes to training.
        
             | ckitching wrote:
             | SCALE doesn't use cuBlas and friends. For those APIs, it
             | uses either its own implementations of the functions, or
             | delegates to an existing AMD library (such as rocblas).
             | 
             | It wouldn't even be technically possible for SCALE to
             | distribute and use cuBlas, since the source code is not
              | available. I suppose maybe you could distribute cuBlas and
              | run it through ZLUDA, but that would likely become legally
              | troublesome.
        
       | uptownfunk wrote:
        | Very clearly the business motive makes sense: go after Nvidia's
        | GPU monopoly. Can someone help a lay person understand the
        | pitfalls here that might prevent this from being an intelligent
        | venture?
        
         | JonChesterfield wrote:
         | It's technically non-trivial and deeply irritating to implement
         | in places as people expect bugward compatibility with cuda.
         | 
         | Also nvidia might savage you with lawyers for threatening their
         | revenue stream. Big companies can kill small ones by strangling
         | them in the courts then paying the fine when they lose a decade
         | later.
        
       | einpoklum wrote:
        | At my workplace, we were reluctant to choose between writing
        | OpenCL - staying AMD-compatible but missing out on CUDA features
        | and tooling - and writing CUDA and being vendor-locked.
       | 
       | Our jerry-rigged solution for now is writing kernels that are the
       | same source for both OpenCL and CUDA, with a few macros doing a
       | bit of adaptation (e.g. the syntax for constructing a struct).
       | This requires no special library or complicated runtime work -
       | but it does have the downside of forcing our code to be C'ish
       | rather than C++'ish, which is quite annoying if you want to write
       | anything that's templated.
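        | 
        | The adaptation macros are in the spirit of this simplified
        | sketch (our real headers, linked below, cover more cases):
        | 
        |     #if defined(__CUDACC__) || defined(__HIPCC__)
        |       #define KERNEL          extern "C" __global__
        |       #define GLOBAL_MEM      /* unqualified in CUDA */
        |       #define GET_GLOBAL_ID() (blockIdx.x * blockDim.x + \
        |                                threadIdx.x)
        |     #else /* OpenCL C */
        |       #define KERNEL          __kernel
        |       #define GLOBAL_MEM      __global
        |       #define GET_GLOBAL_ID() get_global_id(0)
        |     #endif
        | 
        |     KERNEL void scale_by(GLOBAL_MEM float* data, float factor,
        |                          unsigned n)
        |     {
        |         unsigned i = GET_GLOBAL_ID();
        |         if (i < n)
        |             data[i] *= factor;
        |     }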
       | 
       | Note that all of this regards device-side, not host-side, code.
       | For the host-side, I would like, at some point, to take the
       | modern-C++ CUDA API wrappers (https://github.com/eyalroz/cuda-
       | api-wrappers/) and derive from them something which supports
       | CUDA, OpenCL and maybe HIP/ROCm. Unfortunately, I don't have the
       | free time to do this on my own, so if anyone is interested in
       | collaborating on something like that, please drop me a line.
       | 
       | -----
       | 
       | You can find the OpenCL-that-is-also-CUDA mechanism at:
       | 
       | https://github.com/eyalroz/gpu-kernel-runner/blob/main/kerne...
       | 
       | and
       | 
       | https://github.com/eyalroz/gpu-kernel-runner/blob/main/kerne...
       | 
       | (the files are provided alongside a tool for testing, profiling
       | and debugging individual kernels outside of their respective
       | applications.)
        
         | JonChesterfield wrote:
         | Freestanding c++ with compiler intrinsics is a nicer
         | alternative. You can do things like take the address of a
         | function.
         | 
         | Use an interface over memory allocation/queue launch with
         | implementations in cuda, hsa, opencl whatever.
         | 
         | All the rest of the GPU side stuff is syntax sugar/salt over
         | slightly weird semantics, totally possible to opt out of all of
         | that.
        
       | stuaxo wrote:
        | What's the licensing? Will I be able to run this as a hobbyist
        | for free software?
        
       | tallmed wrote:
        | I wonder if this thing has anything in common with ZLUDA; it's
        | permissively licensed, after all.
        
       | EGreg wrote:
       | Does it translate to OpenCL?
       | 
       | This sounds like DirectX vs OpenGL debate when I was younger lol
        
       | lukan wrote:
        | Ok, so I just stumbled on this problem: I tried out Whisper
        | (from OpenAI), but on my CPU, because of no CUDA, and the
        | workarounds seem hacky. So the headline sounds good!
       | 
       | But can this help me directly? Or would OpenAI have to use this
       | tool for me to benefit?
       | 
       | It is not immediately clear to me (but I am a beginner in this
       | space).
        
       ___________________________________________________________________
       (page generated 2024-07-16 23:00 UTC)