[HN Gopher] AMD may get across the CUDA moat
       ___________________________________________________________________
        
       AMD may get across the CUDA moat
        
       Author : danzheng
       Score  : 218 points
       Date   : 2023-10-06 17:35 UTC (5 hours ago)
        
 (HTM) web link (www.hpcwire.com)
 (TXT) w3m dump (www.hpcwire.com)
        
       | pama wrote:
       | There is only limited empirical evidence of AMD closing the gap
       | that NVidia has created in the science or ML software. Even when
       | considering pytorch only, the engineering effort to maintain
       | specialized ROCm along with CUDA solutions is not trivial (think
       | flashattention, or any customization that optimizes your own
        | model). If your GPUs only need to run a simple ML workflow
        | nonstop for a few years, maybe there exist corner cases where the
        | finances make sense. It is hard for AMD now to close the gap
        | across the scientific/industrial software base of CUDA. NVidia
        | feels like a software company for the hardware they produce;
        | luckily they make their money from hardware and thus don't lock
        | down the software libraries.
       | 
       | (Edited "no" to limited empirical evidence after a fellow user
       | mentioned El Capitan.)
        
         | Certhas wrote:
         | The fact that El Capitan is AMD says that at least for
         | Science/HPC there definitely is evidence of a closing gap.
        
           | pama wrote:
           | Thanks. You are actually right that this new supercomputer
           | might move the needle once it is in production mode. I will
           | wait and see how it goes.
        
         | fotcorn wrote:
         | ROCm has HIP (1) which is a compatibility layer to run CUDA
         | code on AMD GPUs. In theory, you only have to adjust #includes,
         | and everything should just work, but as usual, reality is
         | different.
         | 
          | Newer backends for AI frameworks like OpenXLA and OpenAI Triton
          | directly generate GPU-native code using MLIR and LLVM; they do
          | not use CUDA apart from some glue code to actually load the
          | code onto the GPU and get the data there. Both already support
          | ROCm, but from what I've read the support is not yet as mature
          | as it is for NVIDIA.
         | 
         | 1: https://github.com/ROCm-Developer-Tools/HIP
        
       | binarymax wrote:
       | And the question for most that remains once AMD catches up: will
       | the duopoly result in lower prices to a reasonable level for
       | hobbyists or bootstrapped startups, or will AMD just gouge like
       | NVidia?
        
         | evanjrowley wrote:
         | AMD prices will go up because of the newfound ability to gouge
         | for AI/ML/GPGPU workloads. Nvidia's will likely go down, but I
         | don't expect it will be by much. The market demand is high, so
         | the equilibrium price will also be high. Supply isn't at
         | pandemic / crypto-rush lows, but the supply of cards useful for
         | CUDA/ROCm still is.
        
         | klysm wrote:
         | A simplistic economic take would suggest that the competition
         | would result in lower prices, but given two players in the
         | market who knows.
        
           | sumtechguy wrote:
           | It is oligopoly pricing.
           | 
           | https://www.investopedia.com/terms/o/oligopoly.asp
           | 
           | With that few competitors pricing would not change much.
        
             | ad404b8a372f2b9 wrote:
              | Prices did seem to come down when AMD came out with CPUs
              | competitive with Intel's.
        
             | tibbydudeza wrote:
             | Price difference between 13900K and AMD Ryzen 9 7950x is
             | not big - the latest 7950X3D is about on par with the
             | higher clocked 13900KS as well.
        
               | redeeman wrote:
               | because intel lowered their prices
        
             | AnthonyMouse wrote:
             | That's mostly when there isn't a lot of price elasticity of
             | demand. If you're Comcast and Verizon, each customer wants
             | one internet connection and you're not going to change the
             | size of the market much by offering better prices.
             | 
             | If you're AMD and NVIDIA and lowering the price would
             | double the number of customers, you might very well want to
             | do that, unless you're supply constrained -- which has been
             | the issue because they're both bidding against everyone
             | else for limited fab capacity. But that should be
             | temporary.
             | 
             | This is also a market with a network effect. If all your
             | GPUs are $1000 and nobody can afford them then nobody is
             | going to write code for them, and then who wants them? So
             | the winning strategy is actually to make sure that there
             | are kind of okay GPUs available for less than $300 and make
             | sure lots of people have them, then sell very expensive
             | ones that use the same architecture but are faster.
             | 
             | That has been the traditional model, but the lack of
             | production capacity meant that they've only been making the
             | overpriced ones recently. Which isn't actually in their
             | interests once the supply of fab capacity loosens up.
        
           | binarymax wrote:
            | My intuition is that if AMD had had a competing product
            | earlier, it would have kept prices down. But since Nvidia has
            | shown what the market will pay, AMD won't be able to resist
            | overcharging. It will probably come down a little, but
            | nowhere near the point of affordability.
           | 
           | I sure hope I'm wrong.
        
             | tyre wrote:
             | AMD might have to charge less to break into customers that
             | are already bought into Nvidia. There has to be a discount
             | to cover the switching costs + still provide savings (or
             | access).
        
               | zirgs wrote:
               | AMD will have to provide a REALLY steep discount to
               | convince me to come back.
        
         | wil421 wrote:
         | Why would their investors allow anything else? I'm sure they
         | see it as a huge loss like intel and mobile.
        
         | quitit wrote:
         | I think in this case the changes needed to make AMD useful will
         | open the market to other players as well (e.g. Intel).
         | 
         | PyTorch is already walking down this path and while CUDA-based
         | performance is significantly better, that is changing and of
         | course an area of continued focus.
         | 
         | It's not that people don't like Nvidia, rather it's just that
         | there is a lot of hardware out there that can technically
         | perform competitively, but the work needs to be done to bring
         | it into the circle.
        
           | binarymax wrote:
           | Last I checked I saw the H100 was about two gens more
           | advanced for certain components (tensor cores, bfloats,
           | cache, mem bandwidth) - but my research may have been wrong
           | as admittedly I'm not as familiar with AMDs offerings for
           | GPU.
        
             | FuriouslyAdrift wrote:
             | They are not behind...
             | https://www.tomshardware.com/news/amd-expands-mi300-with-
             | gpu...
             | 
             | You can also actually buy them as opposed to the nVidia
             | offerings which you are going to have to fight for.
        
         | rdsubhas wrote:
          | Demand will push AMD prices up by a couple hundred bucks and
          | Nvidia cards down by a couple hundred bucks. A hobbyist
          | customer will be neither better nor worse off.
        
         | rafaelmn wrote:
          | If the margins and demand are there, Intel will eventually
          | show up.
        
           | Havoc wrote:
           | Is either in doubt?
        
             | rafaelmn wrote:
              | Wouldn't be surprised if a bunch of the investment is hype
              | bubble and a demand correction forces a price correction.
              | Maybe not immediately, but at Intel's pace - they managed
              | to miss out on the mining bubble; wouldn't be surprised if
              | they release into a correction.
        
           | wmf wrote:
           | Intel already showed up three or four times but their
           | software is as bad as AMD's used to be.
        
             | ilc wrote:
             | Thankfully, software can be fixed over time as AMD has
             | shown. Lack of another competitor can't be fixed as easily.
        
       | ris wrote:
       | I don't understand the author's argument (if there is one) -
       | pytorch has existed for ages. AMD's Instinct MI* range has
       | existed for years now. If these are the key ingredients why has
       | it not already happened?
        
       | jiggawatts wrote:
       | Can I buy an MI300 or even rent one in a cloud?
        
       | pjmlp wrote:
       | Unless they get their act together regarding CUDA polyglot
       | tooling, I seriously doubt it.
        
       | javchz wrote:
       | CUDA is the only reason I have an Nvidia card, but if more
       | projects start migrating to a more agnostic environment, I'll be
       | really grateful.
       | 
       | Running Nvidia in Linux isn't as much fun. Fedora and Debian can
       | be incredibly reliable systems, but when you add an Nvidia card,
       | I feel like I am back in Windows Vista with kernel crashes from
       | time to time.
        
         | kombine wrote:
         | I use a rolling distro (OpenSUSE Tumbleweed) and have had zero
         | issues with my NVIDIA card despite it pulling the kernel and
         | driver updates as they get released. The driver repo is
         | maintained by NVIDIA itself, which is amazing.
        
           | filterfiber wrote:
           | Do you use wayland, multiple monitors, and/or play games or
           | is it just for ML/AI?
        
             | smoldesu wrote:
             | I do all of those things with my 3070 and it works just
             | fine. Most of them will depend on your DE's Wayland
             | implementation.
             | 
              | I'm not here to disparage anyone experiencing issues, but
             | my experience on the NixOS rolling-release channel has also
             | been pretty boring. There was a time when my old 1050 Ti
             | struggled, but the modern upstream drivers feel just as
             | smooth as my Intel system does.
        
         | smoldesu wrote:
         | Those problems might just be GNOME-related at this point. I've
         | been daily-driving two different Nvidia cards for ~3 years now
         | (1050 Ti then 3070 Ti) and Wayland has felt pretty stable for
         | the past 12 months. The worst problem I had experienced in that
         | time was Electron and Java apps drawing incorrectly in
         | xWayland, but both of those are fixed upstream.
         | 
         | I'm definitely not against better hardware support for AI, but
         | I think your problems are more GNOME's fault than Nvidia's.
         | KDE's Wayland session is almost flawless on Nvidia nowadays.
        
           | arsome wrote:
           | If GNOME can tank the kernel, it ain't GNOME's fault.
        
           | kombine wrote:
           | I really hope that with KDE 6 I can finally switch to
           | Wayland!
        
         | PH95VuimJjqBqy wrote:
          | I see these complaints from time to time and I never understand
         | them.
         | 
         | I've literally been running nvidia on linux since the TNT2 days
         | and have _never_ had this sort of issue. That's across many
         | drivers and many cards over the many many years.
        
           | ant6n wrote:
           | Well tnt2 should be pretty well supported by now ;-)
        
             | PH95VuimJjqBqy wrote:
             | lmao, touche :)
        
           | temp0826 wrote:
           | I understand it, but I also haven't had any trouble since I
           | figured out the right procedure for me on fedora (which
           | probably took some time, but it's been so long that I can't
           | remember). Whenever I read people having issues it sounds
           | like they are using a package installed via dnf for the
           | driver/etc. I've always had issues with dkms and the like and
           | just install the latest .run from nvidia's website whenever I
           | have a kernel update (I made a one-line script to call it
           | with the silent option and flags for signing for secure boot
           | so I don't really think about it). No issues in a very long
           | time even with the whackiness of prime/optimus offloading on
           | my old laptop.
        
             | PH95VuimJjqBqy wrote:
             | actually, it's a good point because that's how I always
             | install nvidia drivers as well. Never from the local
             | package manager.
        
           | einpoklum wrote:
            | I have been using NVIDIA cards for compute capabilities only, both
           | personally and at work, for nearly a decade. I've had dozens
           | and dozens of different issues involving the hardware, the
           | drivers, integration with the rest of the OS, version
           | compatibilities, ensuring my desktop environment doesn't try
           | to use the NVIDIA cards, etc. etc.
           | 
           | Having said that - I (or rarely, other people) have almost
           | always managed to work out those issues and get my systems to
           | work. Not in all cases though.
        
           | jjoonathan wrote:
            | Same, but the Linux experience is a steep and bumpy function
            | of hardware.
           | 
           | My guess: something like laptop GPU switching failed badly in
           | the nvidia binary, earning it a reputation.
        
             | HideousKojima wrote:
             | That was my experience, Nvidia Optimus (which is what
             | allows dynamic switching between the integrated and
             | dedicated GPU in laptops) was completely broken (as in a
             | black screen, not just crashes or other issues) for several
             | years, and Nvidia didn't care to do anything about it.
        
               | PH95VuimJjqBqy wrote:
               | I don't run laptops except when work requires it and that
               | tends to be windows so that may explain the difference in
               | experience.
        
               | lhl wrote:
               | Yeah, Optimus was a huge PITA. I remember fighting with
               | workarounds like bumblebee and prime for years. Also
               | Nvidia dragged their feet on Wayland support for a few
               | years too (and simultaneously was seemingly intent on
               | sabotaging Nouveau).
        
               | distract8901 wrote:
               | I tried bumblebee again recently, and it works shockingly
               | well now. I have a thinkpad T530 from 2013 with an
               | NVS5400m.
               | 
               | There is some strange issue with some games where they
               | don't get full performance from the dGPU, but more than
               | the iGPU. I have to use optirun to get full performance.
               | 
               | It also has problems when the computer wakes from sleep.
               | For whatever reason, hardware video decoding doesn't work
               | after entering standby. Makes steam in home streaming
               | crash on the client, but flipping to software decoding
               | usually works fine.
               | 
               | The important part is that battery life is almost as good
               | with bumblebee as it is with the dGPU turned off. No more
               | fucking with Prime or rebooting into BIOS to turn the GPU
               | back on.
        
         | wubrr wrote:
         | Yeah, nvidia linux support is meh, but still much better than
         | amd.
        
           | silisili wrote:
           | In the closed source days of fglrx or whatever it's called
           | I'd agree. Since they went open source, hard disagree. AMD
           | graphics work in Linux about as well as Intel always has.
        
           | phkahler wrote:
           | >> Yeah, nvidia linux support is meh, but still much better
           | than amd.
           | 
           | Can not confirm. I used nvidia for years when it was the only
           | option. Then used the nouveau driver on a well supported card
           | because it worked well and eliminated hassle. Now I'm on AMD
           | APU and it just works out of the box. YMMV of course. We do
           | get reports of issues with AMD on specific driver versions,
           | but I can't reproduce.
        
           | bryanlarsen wrote:
           | Not my experience. The open source AMD drivers are much more
           | pleasant to deal with than the closed source Nvidia ones.
        
           | acomjean wrote:
           | As someone who was tasked with trying to get nvidia working
           | on Ubuntu, it's a pretty terrible experience.
           | 
           | I have a nvidia laptop with popos. That works well.
        
           | Zambyte wrote:
           | Is it better than AMD? I have had literally no graphics
           | issues on my 6650 XT with swaywm using the built in kernel
           | drivers.
        
             | christkv wrote:
             | I think the problems are pro drivers and the issues with
             | ROCm being buggy not the open source graphics drivers.
        
             | treprinum wrote:
             | I never had an issue with nVidia drivers on Linux in the
             | past 5 years, but recently bought a laptop with a 4090 and
              | AMD CPU. Now I get random freezes, often right after I log
              | into Cinnamon, but I can't really tell whether it's the
              | nVidia driver for the 4090, the AMDGPU driver for the
              | integrated RDNA, kernel 6.2, or a Cinnamon issue. The
              | laptop just hangs and stops responding to the keyboard, so
              | I can't log in to a console and dmesg it.
        
               | SoftTalker wrote:
               | The main issue with Nvidia on Linux AIUI is that they
               | don't release the source code for their drivers.
        
               | treprinum wrote:
               | That might be a philosophical problem that never
               | prevented me from training models on Linux. The half-
               | baked half-crashing AMD solutions just lead to wasting
               | time I can spend on ML research instead.
        
             | aseipp wrote:
             | This week I upgraded my kernel on a 2017 workstation to
             | 6.5.5 and when I rebooted and looked at 'dmesg' there were
             | no less than 7 kernel faults with stack traces in my
             | 'dmesg' from amdgpu. Just from booting up. This is a no-
             | graphical-desktop system using a Radeon Pro W5500, which is
             | 3.5 years old (I just had the card and needed something to
             | plug in for it to POST.)
             | 
             | I have come to accept that graphics card drivers and
             | hardware stability ultimately comes down to whether or not
             | ghosts have decided to haunt you.
        
             | HansHamster wrote:
             | Guess I'm also doing something wrong. Never had any serious
             | issues with either Nvidia or AMD on Linux (and only a few
              | annoyances on RDNA2 shortly after release)...
        
         | distract8901 wrote:
         | My Arch system would occasionally boot to a black screen. When
         | this happened, no amount of tinkering could get it back. I had
         | to reinstall the whole OS.
         | 
         | Turns out it was a conflict between nvidia drivers and my (10
         | year old) Intel integrated GPU. But once I switched to an AMD
         | card, everything works flawlessly.
         | 
         | Ubuntu based systems barely worked at all. Incredibly unstable
         | and would occasionally corrupt the output and barf colors and
         | fragments of the desktop all over my screens.
         | 
         | AMD on arch has been an absolute delight. It just. Works. It's
         | more stable than nvidia on windows.
         | 
         | For a lot of reasons-- but mainly Linux drivers-- I've totally
         | sworn off nvidia cards. AMD just works better for me.
        
       | fluxem wrote:
       | I call it the 90% problem. If AMD works for 90% of my projects, I
       | would still buy NVIDIA, which works for 100%, even though I'm
       | paying a premium
        
         | hot_gril wrote:
         | I'm lazy, so it's 99% for me. I don't even mess with AMD CPUs;
         | I know they're not _exactly_ the same instruction set as Intel,
         | and more importantly they work with a different (and less
          | mainstream) set of mobos, so I don't want em. If AMD manages
         | to pull more customers their way, that's great, it just means
         | lower Intel premium for me.
        
       | Zetobal wrote:
       | They are just too late even if they catch up. Until they make a
       | leap like they did with ryzen nothing will happen.
        
         | Havoc wrote:
         | >They are just too late even if they catch up.
         | 
         | Late certainly, too late I don't think so.
         | 
         | If you can field a competitively priced consumer card that can
         | run llama fast then you're already halfway there because then
         | the ecosystem takes off. Especially since nvidia is being
         | really stingy with their vram amounts.
         | 
         | H100 & datacenter is a separate battle certainly, but on
         | mindshare I think some deft moves from AMD will get them there
         | quite fast once they pull their finger out their A and actually
         | try sorting out the driver stack.
        
           | dylan604 wrote:
           | >If you can field a competitively priced consumer card
           | 
           | if this unicorn were to show up, what's to say that all the
           | non-consumers won't just scarf up these equally performant
           | yet lower priced cards causing the supply-demand situation
           | we're in now? the only difference would be a sudden supply of
           | the expensive Nvidia cards that nobody wants because of their
           | price.
        
             | AnthonyMouse wrote:
             | The thing that causes it to be competitively priced is
             | having enough production capacity to prevent that from
             | happening.
             | 
             | One way to do that may be to produce a card on an older
             | process node (or the existing one when a new one comes out)
             | that has a lot of VRAM. There is less demand for the older
             | node so they can produce more of them and thereby sell them
             | for a lower price without running out.
        
             | Havoc wrote:
             | >if this unicorn were to show up
             | 
             | A unicorn like that showed up a couple hours ago. Someone
             | posted a guide for getting llama to run on a 7900xtx
             | 
             | https://old.reddit.com/r/LocalLLaMA/comments/170tghx/guide_
             | i...
             | 
             | It's still slow and janky but this really isn't that far
             | away.
             | 
             | I don't buy that AMD can't make this happen if they
             | actually tried.
             | 
             | Go on fiverr, get them to compile a list of top 100 people
             | in the DIY LLM space, send them all free 7900XTXs. Doesn't
             | matter if half of it is wrong, just send it. Next take 1.2m
             | USD, post a dozen 100k bounties against llama.cpp that are
             | AMD specific - support & optimise the gear. Rinse and
             | repeat with every other hobbyist LLM/stable diffusion
             | project. A lot of these are zero profit open source /
             | passion / hobby projects. If 6 figure bounties show up
             | it'll absolute raise pulses. Next do all the big youtubers
             | in the space - carefully on that one so that it doesn't
             | come across as an attempted pay-off...but you want them to
             | know that you want this space to grow and are willing to
             | put your money where your mouth is.
             | 
              | That'll cost AMD what, 2-3m? To move the needle on a
              | multi-billion market? That's the cheapest marketing you've
              | ever seen.
             | 
             | As I said the datacenter & enterprise market is another
             | beast entirely full of moats and strategy, but I don't see
             | why a suitably motivated senior AMD exec can't tackle the
             | enthusiast market single handedly with a couple of emails,
             | a cheque book and a tshirt that has the nike slogan on it.
             | 
             | >what's to say that all the non-consumers won't just scarf
             | up these equally performant yet lower priced cards
             | 
             | It doesn't matter. They're in the business of selling
             | cards. To consumers, to datacenters, to your grandmother.
             | From a profit driven capitalist company the details don't
             | matter as long as there is traction & volume. The above -
             | opening up even the possibility of a new market - is gold
             | in that perspective. And from a consumer perspective
             | anything that breaks the nvidia cuda monopoly is a win.
        
               | lhl wrote:
               | llama.cpp, ExLlama, and MLC LLM have all had ROCm
               | inferencing for months (here are a bunch of setup
               | instructions I've written up, for Linux and Windows:
               | https://llm-tracker.info/books/howto-guides/page/amd-gpus
               | ) - but I don't think that's the problem (and wouldn't
               | drive lots of volume or having downstream impact in any
               | case).
               | 
               | The bigger problem is on the training/research support.
                | Eg, there's no official support for AMD GPUs for
               | bitsandbytes, and no support at all for
               | FlashAttention/FA2 (nothing that 100K in hardware/grants
               | to Dettmers or Dao's labs wouldn't fix I suspect).
               | 
                | The real elephant, though, is AMD's continuing
                | disconnect: the lack of support for consumer cards and
                | home/academic devs in general has been disastrous (while
               | Nvidia supports CUDA on basically every single GPU
               | they've made since 2010) - just last week there was this
               | mindblowing thread where it turns out an AMD employee is
               | paying out of pocket for AMD GPUs to support build/CI for
               | drivers on Debian. I mean, WTF, that's stupidity that's
                | beyond embarrassing and gets into negligence territory
               | IMO: https://news.ycombinator.com/item?id=37665784
        
               | dylan604 wrote:
               | >an AMD employee is paying out of pocket for AMD GPUs
               | 
               | I hope he's at least getting an employee discount! I
               | guess AMD is not a fan of the 20% concept either
        
       | omneity wrote:
       | I was able to use ROCm recently with Pytorch and after pulling
       | some hair it worked quite well. The Radeon GPU I had on hand was
       | a bit old and underpowered (RDNA2) and it only supported matmul
       | on fp64, but for the job I needed done I saw a 200x increase in
       | it/s over CPU despite the need to cast everywhere, and that made
       | me super happy.
       | 
       | Best of all is that I simply set the device to
       | `torch.device('cuda')` rather than openCL, which does wonders for
       | compatibility and to keep code simple.
       | 
        | Protip: Use the official ROCm PyTorch base docker image [0]. The
        | AMD setup is finicky and dependent on specific versions of the
        | SDK/drivers/libraries, and it will be much harder to get working
        | if you try to install them separately.
       | 
       | [0]:
       | https://rocm.docs.amd.com/en/latest/how_to/pytorch_install/p...
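        | 
        | For reference, here is a minimal sketch of what that looks like
        | (hedged: it assumes a ROCm build of PyTorch, where HIP devices
        | are exposed through the "cuda" device type):
        | 
        |     import torch
        |     
        |     # torch.version.hip is a version string on ROCm builds of
        |     # PyTorch and None on CUDA builds, so it tells the stacks apart.
        |     print(torch.version.hip)
        |     
        |     # The same "cuda" device string works on both vendors.
        |     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        |     
        |     x = torch.randn(1024, 1024, device=device)
        |     y = x @ x.T  # the matmul runs on the GPU (via ROCm on AMD)
        |     print(y.device)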
        
         | wyldfire wrote:
         | > Best of all is that I simply set the device to
         | `torch.device('cuda')` rather than openCL, which does wonders
         | for compatibility
         | 
         | Man oh man where did we go wrong that cuda is the more
         | compatible option over OpenCL?
        
           | KeplerBoy wrote:
           | It must be a misnomer on PyTorch's side. Clearly it's neither
           | CUDA nor OpenCL.
           | 
            | AMD should just get its shit together. This is ridiculous.
           | Not the name, but the fact that you can only do FP64 on a
           | GPU. Everybody is moving to FP16 and AMD is stuck on doubles?
        
             | omneity wrote:
             | I believe the fp64 limitation came from the laptop-grade
             | GPU I had rather than inherent to AMD or ROCm.
             | 
             | The API level I could target was at least two or three
             | versions behind the latest they have to offer.
        
               | KeplerBoy wrote:
               | Might very well be true. I don't blame anyone for not
               | diving deeper into figuring out why this stuff doesn't
               | work.
               | 
               | But this is one of the great strengths of CUDA: I can
               | develop a kernel on my workstation, my boss can demo it
               | on his laptop and we can deploy it on Jetsons or the
               | multi-gpu cluster with minimal changes and i can be sure
               | that everything runs everywhere.
        
         | RockRobotRock wrote:
         | Have you gotten it to work with Whisper by any chance?
        
         | mikepurvis wrote:
         | Sigh. It's great that these container images exist to give
         | people an easy on-ramp, but they definitely don't work for
         | every use case (especially once you're in embedded where space
         | matters and you might not be online to pull multi-gb updates
         | from some registry).
         | 
         | So it's important that vendors don't feel let off the hook to
         | provide sane packaging just because there's an option to use a
         | kitchen-sink container image they rebuild every day from
         | source.
        
           | fwsgonzo wrote:
           | I feel the same way, especially about build systems. OpenSSL
           | and v8 are among a large list of things that have horrid
           | build systems. Only way to build them sanely is to use some
            | rando's CMake fork, then it Just Works. Literally a two-liner
           | in your build system to add them to your project with a sane
           | CMake script.
        
             | mikepurvis wrote:
             | I was part of a Nix migration over the past two years, and
             | literally one of the first things we checked is that there
             | was already a community-maintained tensorflow+gpu package
             | in nixpkgs because without that the whole thing would have
             | been a complete non-starter, and we sure as heck didn't
             | have the resources or know-how to figure it out for
             | ourselves as a small DevOps team just trying to do basic
             | packaging.
        
           | amelius wrote:
           | > So it's important that vendors don't feel let off the hook
           | to provide sane packaging just because there's an option to
           | use a kitchen-sink container image they rebuild every day.
           | 
           | Sadly if e.g. 95% of their users can use the container, then
           | it could make economical sense to do it that way.
        
           | xahrepap wrote:
           | I know it's still different than what you're looking for, so
           | you probably already know this, but many projects like this
           | have the Dockerfile on github which shows exactly how they
           | set up the image. For example:
           | 
           | https://github.com/RadeonOpenCompute/ROCm-
           | docker/blob/master...
           | 
           | They also have some for Fedora. Looks like for this you need
            | to install their repo:
            | 
            |     curl -sL https://repo.radeon.com/rocm/rocm.gpg.key \
            |         | apt-key add - \
            |       && printf "deb [arch=amd64] https://repo.radeon.com/rocm/apt/$ROCM_VERSION/ jammy main" \
            |         | tee /etc/apt/sources.list.d/rocm.list \
            |       && printf "deb [arch=amd64] https://repo.radeon.com/amdgpu/$AMDGPU_VERSION/ubuntu jammy main" \
            |         | tee /etc/apt/sources.list.d/amdgpu.list \
           | 
           | then install Python, a couple other dependencies (build-
           | essential, etc) and then the package in question: rocm-dev
           | 
           | So they are doing the packaging. There might even be
           | documentation elsewhere for that type of setup.
        
             | mikepurvis wrote:
             | Oh yeah, I mean... having the source for the container
             | build is kind of table stakes at this point. No one would
             | accept a 10gb mystery meat blob as the basis of their
             | production system. It's bad enough that we still accept
             | binary-only drivers and proprietary libraries like
             | TensorRT.
             | 
              | I think my issue is more just with the _mindset_ that it's
             | okay to have one narrow slice of supported versions of
             | everything that are "known to work together" and those are
             | what's in the container and anything outside of those and
             | you're immediately pooched.
             | 
             | This is not hypothetical btw, I've run into real problems
             | around it with libraries like gproto, where tensorflow's
             | bazel build pulls in an exact version that's different from
             | the default one in nixpkgs, and now you get symbol
             | conflicts when something tries to link to the tensorflow
             | c++ API while linking to another component already using
              | the default gproto. I know these problems are solvable
             | with symbol visibility control and whatever, but that stuff
             | is far from universal and hard to get right, especially if
             | the person setting up the build rules for the library
             | doesn't themselves use it in that type of heterogeneous
             | environment (like, everyone at Google just links the same
             | global proto version from the monorepo so it doesn't
             | matter).
        
           | mathisfun123 wrote:
           | > especially once you're in embedded
           | 
           | is this a real problem? exactly which embedded platform has a
           | device that ROCm supports?
        
             | mikepurvis wrote:
             | Robotic perception is the one relevant to me. You want to
             | do object recognition on an industrial x86 or Jetson-type
             | machine, without having to use Ubuntu or whatever the one
             | "blessed" underlay system is (either natively or implicitly
             | because you pulled a container based on it).
        
               | mathisfun123 wrote:
               | >industrial x86 or Jetson-type machine
               | 
               | that's not embedded dev. if you
               | 
               | 1. use underpowered devices to perform sophisticated
               | tasks
               | 
               | 2. using code/tools that operate at extremely high levels
               | of "abstraction"
               | 
               | don't be surprised when all the inherent complexity is
               | tamed using just more layers of "abstraction". if that
               | becomes a problem for your cost/power/space budget then
               | reconsider choice 1 or choice 2.
        
               | mikepurvis wrote:
               | Not sure this is worth an argument over semantics, but
               | modern "embedded" development is a lot bigger than just
               | microcontrollers and wearables. IMO as soon as you're
               | deploying a computer into any kind of "appliance", or
               | you're offline for periods of time, or you're running on
               | batteries or your primary network connection is
               | wireless... then yeah, you're starting to hit the
               | requirements associated with embedded and need to seek
               | established solutions for them, including using distros
               | which account for those requirements.
        
               | mathisfun123 wrote:
               | > IMO as soon as you're deploying a computer into any
               | kind of "appliance", or you're offline for periods of
               | time, or you're running on batteries or your primary
               | network connection is wireless
               | 
               | yes and in those instances you do not reach for
               | pytorch/tensorflow on top of ubuntu on top of x86 with a
               | discrete gpu and 32gb of ram. instead you reach for C and
               | micro or some arm soc that supports baremetal or at most
               | rtos. that's embedded dev.
               | 
               | so i'll repeat myself: if you want to run extremely high-
               | level code then don't be "surprised pikachu" when your
               | underpowered platform, that you chose due to concrete,
               | tight budgets doesn't work out.
        
       | IronWolve wrote:
        | Yup, thank the hobbyists. PyTorch is allowing other hardware.
        | Stable Diffusion is working on M-series chips, Intel Arc, and AMD.
       | 
       | Now what I'd like to see is real benchmarks for compute power.
       | Might even get a few startups to compete in this new area.
        
         | mattnewton wrote:
          | Re: startups, Geohot raised a few million for this already.
         | https://tinygrad.org/
        
           | nomel wrote:
           | Obligatory Lex Fridman podcast, where he discusses it:
           | https://youtu.be/dNrTrx42DGQ?t=2408
        
           | IntelMiner wrote:
            | Didn't he do what he always does? Rake in a ton of money,
           | fart around and then cash out exclaiming it's everyone else's
           | fault?
           | 
           | The way he stole Fail0verflow's work with the PS3 security
           | leak after failing to find a hypervisor exploit for months
           | absolutely soured any respect I had for him at the time
        
             | adastra22 wrote:
             | Wow, TIL
        
             | throwitawayfam wrote:
             | Yep, did exactly that. IMO he threw a fit, even though AMD
             | was working with him squashing bugs. https://github.com/Rad
             | eonOpenCompute/ROCm/issues/2198#issuec...
        
               | [deleted]
        
               | nomel wrote:
                | To be fair, kernel crashes from running an AMD-provided
                | demo loop aren't something he should have to work with
                | them on. That's borderline incompetence. His perspective
               | was around integration into his product, where every AMD
               | bug is a bug in his product. They deserve criticism, and
               | responded accordingly (actual resources to get their shit
               | together). It's not like GPU accelerated ML is some new
               | thing.
        
               | aeyes wrote:
               | He's back on it after getting AMD's CEO to commit
               | resources to this:
               | 
               | https://twitter.com/realGeorgeHotz/status/166980346408248
               | 934...
               | 
               | https://twitter.com/LisaSu/status/1669848494637735936
        
             | kinematikk wrote:
             | Do you have a source on the stealing part? A quick Google
             | search didn't result in anything
        
               | IntelMiner wrote:
               | Marcan (of Asahi Linux fame) has talked about it _many_
               | times before. But an abridged version
               | 
               | Fail0verflow demoed how they were able to derive the
               | private signing keys for the Sony Playstation 3 console
                | at, I believe, CCC.
               | 
               | Geohot after watching the livestream raced into action to
               | demo a "hello world!" jailbreak application and
               | absolutely stole their thunder without giving any credit
        
         | mandevil wrote:
         | It isn't the hobbyists who are making sure that PyTorch and
          | other frameworks run well on these chips, but teams of
         | engineers who work for NVIDIA, AMD, Intel, etc. who are doing
         | this as their primary assigned jobs, in exchange for money from
         | their employer, who are paying those salaries because they want
         | to sell chips into the enormous demand for running PyTorch
         | faster.
         | 
         | Hobbyist and open-source are definitely not synonyms.
        
           | Eisenstein wrote:
           | People don't usually get employed to make things with no
           | demand, and people who work for companies with a budget line
           | don't really care how much the nVidia tax is. You can thank
            | hobbyists for creating a lot of demand for compatibility with
           | other cards.
        
             | kiratp wrote:
              | There are so many billions of dollars being spent on this
             | hardware that everyone other than Nvidia is doing
             | everything they can to make competition happen.
             | 
             | Eg: https://www.intel.com/content/www/us/en/developer/video
             | s/opt...
             | 
             | https://www.intel.com/content/www/us/en/developer/tools/one
             | a...
             | 
             | https://developer.apple.com/metal/tensorflow-plugin/
             | 
             | Large scale opensource is, outside of a few exceptions,
             | built by engineers paid to build it.
        
         | jauntywundrkind wrote:
         | Pytorch is just using Google's OpenXLA now, & OpenXLA is the
         | actual cross platform thing, no? I'm not very well versed in
         | this area, so pardon if mistaken.
         | https://pytorch.org/blog/pytorch-2.0-xla-path-forward/
        
           | fotcorn wrote:
           | You can use OpenXLA, but it's not the default. The main use-
           | case for OpenXLA is running PyTorch on Google TPUs. OpenXLA
           | also supports GPUs, but I am not sure how many people use
           | that. Afaik JAX uses OpenXLA as backend to run on GPUs.
           | 
            | If you use torch.compile() in PyTorch, you use TorchInductor
            | and OpenAI's Triton by default.
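            | 
            | A minimal sketch of that default path (hedged example; the
            | model and input shapes here are made up):
            | 
            |     import torch
            |     import torch.nn as nn
            |     
            |     model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
            |     
            |     # "inductor" is the default backend: on GPUs TorchInductor
            |     # emits Triton kernels, on CPUs it emits C++/OpenMP code.
            |     compiled = torch.compile(model, backend="inductor")
            |     
            |     x = torch.randn(32, 128)
            |     out = compiled(x)  # first call triggers the actual compilation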
        
       | nabla9 wrote:
       | > Crossing the CUDA moat for AMD GPUs may be as easy as using
       | PyTorch.
       | 
        | Nvidia has spent a huge amount of work to make code run smoothly
        | and fast. AMD has to work hard to catch up. ROCm code is slower,
        | has more bugs, doesn't have enough features, and there are
        | compatibility issues between cards.
        
         | latchkey wrote:
         | Lisa has said that they are committed to improving ROCm,
         | especially for AI workloads. Recent releases (5.6/5.7) prove
         | that.
        
         | einpoklum wrote:
         | > Nvidia has spent huge amount of work to make code run
         | smoothly and fast.
         | 
         | Well, let's say "smoother" rather than "smoothly".
         | 
         | > ROCm code is slower
         | 
         | On physically-comparable hardware? Possible, but that's not an
         | easy claim to make, certainly not as expansively as you have.
         | References?
         | 
         | > has more bugs
         | 
         | Possible, but - NVIDIA keeps their bug database secret. I'm
         | guessing you're concluding this from anecdotal experience?
         | That's fair enough, but then - say so.
         | 
         | > ROCm ... don't have enough features and
         | 
          | Likely. AMD has spent less in that department (and had less to
          | spend, I guess); plus, and no less importantly, it tried to go
          | along with the OpenCL initiative as specified by the Khronos
          | consortium, while NVIDIA sort of "betrayed" the initiative by
          | investing in its vendor-locked, incompatible ecosystem and
          | letting their OpenCL support decay in some respects.
         | 
         | > they have compatibility issues between cards.
         | 
         | such as?
        
       | whywhywhywhy wrote:
       | Anyone who has to work in this ecosystem surely thinks this is a
       | naive take
        
         | freedomben wrote:
         | For someone who doesn't work in this ecosystem, can you
         | elaborate? What's the real situation currently?
        
       | superkuh wrote:
       | >There is also a version of PyTorch that uses AMD ROCm, an open-
       | source software stack for AMD GPU programming. Crossing the CUDA
       | moat for AMD GPUs may be as easy as using PyTorch.
       | 
       | Unfortunately since the AMD firmware doesn't reliably do what
       | it's supposed to those ROCm calls often don't either. That's if
       | your AMD card is even still supported by ROCm: the AMD RX 580 I
        | bought in 2021 (the great GPU shortage) had its ROCm support
        | dropped in 2022 (4 years of support total).
       | 
       | The only reliable interface in my experience has been via opencl.
        
         | zucker42 wrote:
         | Do you mean OpenCL using Rusticl or something else? And what DL
         | framework, if any?
        
           | superkuh wrote:
           | I should clarify that I mean for human person uses. Not
           | commercial or institutional. But, clBLAST via llama.cpp for
           | LLM currently. Or far in the past just pure opencl for things
           | with AMD cards.
        
         | htrp wrote:
         | has opencl actually improved enough to be competitive?
        
           | orangepurple wrote:
              | I thought ONNX was supposed to be the ultimate common
              | denominator for cross-platform machine learning model
              | compatibility
        
             | [deleted]
        
       | the__alchemist wrote:
       | When coding using Vulkan, for graphics or compute (The latter is
        | the relevant one here), you need to have CPU code (written in
        | C++, Rust, etc.), then serialize it as bytes, then have shaders
       | which run on the graphics card. This 3-step process creates
       | friction, much in the same way as backend/serialization/frontend
       | does in web dev. Duplication of work, type checking not going
       | across the bridge, the shader language being limited etc.
       | 
       | My understanding is CUDA's main strength is avoiding this. Do you
       | agree? Is that why it's such a big deal? Ie, why this article was
       | written, since you could always do compute shaders on AMD etc
       | using Vulkan.
        
       | atemerev wrote:
        | Nope. PyTorch is not enough, you have to do some C++ occasionally
       | (as the code there can be optimized radically, as we see in
       | llama.cpp and the like). ROCm is unusable compared to CUDA (4x
       | more code for the same problem).
       | 
       | I don't understand why everyone neglects good, usable and
       | performant lower-level APIs. ROCm is fast, low-level, but much
       | much harder to use than CUDA, and the market seems to agree.
        
       | alecco wrote:
       | Regurgitated months-old content. blogspam
        
       | einpoklum wrote:
       | TL;DR:
       | 
       | 1. Since PyTorch has grown very popular, and there's an AMD
       | backend for that, one can switch GPU vendors when doing
       | Generative AI work.
       | 
       | 2. Like NVIDIA's Grace+Hopper CPU-GPU combo, AMD is/will be
       | offering "Instinct MI300A", which improves performance over
       | having the GPU across a PCIe bus from a regular CPU.
        
       | bigcat12345678 wrote:
        | CUDA is the foundation.
        | 
        | NVIDIA's moat is the years of work built by the OSS community,
        | big corporations, and research institutes.
        | 
        | They spend all their time building for CUDA, and a lot of
        | implicit designs are derived from CUDA's characteristics.
        | 
        | That will be the main challenge.
        
         | mikepurvis wrote:
         | It depends on the domain. Increasingly people's interfaces to
         | this stuff are the higher level libraries like tensorflow,
         | pytorch, numpy/cupy, and to a lesser degree accelerated
         | processing libraries such as opencv, PCL, suitesparse, ceres-
         | solver, and friends.
         | 
         | If you can add hardware support to a major library _and_
         | improve on the packaging and deployment front while also
          | undercutting on price, that's the moat gone overnight. CUDA
         | itself only matters in terms of lock-in if you're calling
         | CUDA's own functions.
        
           | bigcat12345678 wrote:
            | What I meant is that all this stuff has 15 years of implicit
            | accumulation of knowledge, tips, and even hacks built into
            | the software.
            | 
            | No matter what you depend on, you'll have a slew of major or
            | minor obstacles and annoyances.
            | 
            | That, collectively, is the moat itself.
            | 
            | As you said, it's already clear that replacing CUDA itself
            | is not that daunting.
        
       | ddtaylor wrote:
       | It's worth noting that AMD also has a ROCm port of Tensorflow.
        
         | ginko wrote:
         | When I try to install rocm-ml-sdk on Arch linux it'll tell me
         | the total installed size would be about 18GB.
         | 
         | What can possibly explain this much bloat for what should
         | essentially be a library on top of a graphics driver as well as
         | some tools (compiler, profiler etc.)? A couple hundred MB I
         | could understand if they come with graphical apps and demos,
         | but not this..
        
           | tomsmeding wrote:
           | A regular TensorFlow installation, just the Python library,
            | is a 184 MB wheel that unpacks to about 1.2 GB of stuff. I
           | have no clue what mess goes in there, but it's a lot.
           | 
           | Still, if you're right that this package seems to take 18 GB
           | disk size, something weird is going on.
        
             | slavik81 wrote:
             | There's a lot of kernels that are specialized for
             | particular sets of input parameters and tuned for improved
             | performance on specific hardware, which makes the libraries
             | a couple hundred megabytes per architecture. The ROCm
             | libraries are huge because they are fat binaries containing
             | native machine code for ~13 different GPU architectures.
        
       | RcouF1uZ4gsC wrote:
       | I am not so sure.
       | 
       | Everyone knows that CUDA is a core competency of Nvidia and they
       | have stuck to it for years and years refining it, fixing bugs,
       | and making the experience smoother on Nvidia hardware.
       | 
       | On the other hand, AMD has not had the same level of commitment.
       | They used to sing the praises of OpenCL. And then there is ROCm.
       | Tomorrow, it might be something else.
       | 
       | Thus, Nvidia CUDA will get a lot more attention and tuning from
       | even the portability layers because they know that their
       | investment in it will reap dividends even years from now, whereas
       | their investment in AMD might be obsolete in a few years.
       | 
       | In addition, even if there is theoretical support, getting
       | specific driver support and working around driver bugs is likely
       | to be more of a pain with AMD.
        
         | AnthonyMouse wrote:
         | This is what people complain about, but at the same time there
         | aren't enough cards, so the people with AMD cards want to use
         | them. So they fix the bugs, or report them to AMD so they can
         | fix them, and it gets better. Then more people use them and
          | submit patches and bug reports, and it gets better.
         | 
         | At some point the old complaints are no longer valid.
        
       | pixelesque wrote:
        | Does AMD have a solution to forward device compatibility (like
       | PTX for NVidia)?
       | 
       | Last time I looked into ROCm (two years ago?), you seemed to have
       | to compile stuff explicitly for the architecture you were using,
       | so if a new card came out, you couldn't use it without a
       | recompile.
        
         | mnau wrote:
          | Not natively, but AdaptiveCpp (previously hipSYCL, then Open
          | SYCL) has a single-source, single-compiler-pass mode, where
          | they basically store LLVM IR as an intermediate representation.
         | 
         | https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/...
         | 
          | The performance penalty was within a few percent, at least
          | according to the paper (figures 9 and 10):
         | https://cdrdv2-public.intel.com/786536/Heidelberg_IWOCL__SYC...
        
         | einpoklum wrote:
         | I don't know what they do with ROCm, but with OpenCL, the
         | answer is: Certainly. It's called SPIR:
         | 
         | https://www.khronos.org/spir/
        
       | ur-whale wrote:
       | > AMD May Get Across the CUDA Moat
       | 
       | I really wish they would, and properly, as in: fully open
       | solution to match CUDA.
       | 
       | CUDA is a cancer on the industry.
        
       | hot_gril wrote:
       | People complain about Nvidia being anticompetitive with CUDA, but
       | I don't really see it. They saw a gap in the standards for on-GPU
       | compute and put tons of effort into a proprietary alternative.
       | They tied CUDA to their own hardware, which sorta makes technical
       | sense given the optimizations involved, but it's their choice
       | anyway. They still support the open standards, but many prefer
       | CUDA and will pay the Nvidia premium for it because it's actually
       | nicer. They also don't have CPU marketshare to tie things to.
       | 
       | Good for them. We can hope the open side catches up either by
       | improving their standards, or adding more layers like this
       | article describes.
        
         | zirgs wrote:
         | CUDA was released in 2007 and the development of it started
         | even earlier - possibly even in the 90s. Back then nobody else
         | cared about GPU compute. OpenCL came out 2 years after that.
        
           | killerstorm wrote:
           | Not true. People got interested in general-purpose GPU
           | compute (GPGPU) in early 2000s when video cards with
           | programmable shaders became available.
           | https://en.wikipedia.org/wiki/General-
           | purpose_computing_on_g...
           | 
           | People made a programming language & a compiler/runtime for
           | GPGPU in 2004: https://en.wikipedia.org/wiki/BrookGPU
        
       | frnkng wrote:
       | As a former ETH miner I learned the hard way that saving a few
       | bucks on hardware may not be worth operational issues.
       | 
        | I had a miner running with Nvidia cards and a miner running with
       | AMD cards. One of them had massive maintenance demand and the
       | other did not. I will not state which brand was better imho.
       | 
       | Currently I estimate that running miners and running gpu servers
       | has similar operational requirements and finally at scale similar
       | financial considerations.
       | 
       | So, whatever is cheapest to operate in terms of time expenditure,
       | hw cost, energy use,... will be used the most.
       | 
        | P.S.: I ran the mining operation not to earn money but mainly out
        | of curiosity. It was a small-scale business powered by a PV
        | system and an attached heat pump.
        
         | latchkey wrote:
         | I ran 150,000+ AMD cards for mining ETH. Once I fully automated
         | all the vbios installs and individual card tuning, it ran
         | beautifully. Took a lot of work to get there though!
         | 
         | Fact is that every single GPU chip is a snowflake. No two
         | operate the same.
        
           | rottencupcakes wrote:
           | Have you ever written about this enterprise? This sounds
           | super unique and I would be very interested in hearing about
           | how it was run and how it turned out.
        
             | latchkey wrote:
              | It was unique; there are not many people on the planet,
              | that I know of, who've run as many GPUs as I have. Especially not
             | working for a giant company with large teams of people. For
             | the tech team, it was just me and one other guy. Everything
             | _had_ to be automated because there was no way we could
             | survive otherwise.
             | 
             | I've put a bunch of comments here on HN about the stuff I
             | can talk about.
             | 
             | It no longer exists after PoS.
        
               | freedomben wrote:
               | what type of cards did you have? what did you do with
               | them after PoS? How did you even buy so many cards?
               | Sorry, like the other commenter I'm extremely curious
        
               | latchkey wrote:
               | Primarily 470,480,570,580. We also ran a very large
               | cluster of PS5 APU chips too.
               | 
               | Got the chips directly from AMD. Since these are 4-5 year
               | old chips, they were not going to ever be used. It is
               | more ROI efficient with ETH mining to use older cards
               | than newer ones.
               | 
                | Had a couple of OEMs manufacture the cards specially for us
               | with 8gb, heatsinks instead of fans (lower power usage)
               | and no display ports (lower cost).
               | 
               | They will be recycled as there isn't much use for them
               | now.
               | 
               | I'm also no longer with the company.
        
       ___________________________________________________________________
       (page generated 2023-10-06 23:00 UTC)