[HN Gopher] A dive into the AMD driver workflow
       ___________________________________________________________________
        
       A dive into the AMD driver workflow
        
       Author : tikkun
       Score  : 116 points
       Date   : 2023-07-30 13:06 UTC (9 hours ago)
        
 (HTM) web link (geohot.github.io)
 (TXT) w3m dump (geohot.github.io)
        
       | aliljet wrote:
       | I'm so thoroughly confused about why AMD wouldn't be falling over
       | themselves to enable geohot and his followers to build an
       | alternative to CUDA and NVIDIA. This feels like a conversation
       | that geohot is attempting with feckless product and software
       | managers who certainly can't make bold decisions. Has the CEO of
       | AMD effectively spoken about this problem?
        
         | izacus wrote:
         | Why would they spend limited resources on a fight they don't
         | want?
         | 
         | Part of leading a company is knowing which markets to pass up.
        
       | Roark66 wrote:
        | You can say a lot about Nvidia, but for me all their products
        | mostly just work on Linux (I use CUDA a lot). I don't understand
        | why AMD is having such trouble doing the same. Likewise with
        | CPUs: it is ironic that I have to use Intel's math libraries to
        | get good performance out of my AMD CPU.
        
         | shmerl wrote:
          | For gaming and desktop usage, the situation is completely the
          | opposite. Nvidia is plagued by its non-upstreamed drivers, and
          | AMD just works.
        
           | amlib wrote:
            | Really, on Linux you would be better off having a primary AMD
            | or Intel GPU and then a secondary Nvidia GPU just for
            | compute/CUDA tasks. But alas, many new AM5 motherboards don't
            | even have a secondary x16 PCIe slot capable of even PCIe 4.0
            | x4 bandwidth anymore. I guess we should thank this cloud
            | compute craze for that...
        
             | shmerl wrote:
              | Some, like ASRock, reduced the number of PCIe slots and
              | increased the number of USB 4 ports. So there is at least
              | the option of using an external GPU now, since that seems
              | to be where they are re-allocating available bandwidth.
        
         | sosodev wrote:
          | This hasn't been my experience on Linux. The Nvidia drivers
          | appeared to "just work" but actually caused a lot of
          | instability. My desktop no longer crashes every other day (or
          | more often when gaming) since I switched to an AMD GPU.
        
           | jeroenhd wrote:
           | They work absolutely fine if you stick with what worked five
           | years ago and don't update anything until some Nvidia blog
           | says you should.
           | 
           | Just don't use Wayland, don't use too many screens, don't use
           | a laptop, don't run a recent kernel and don't expect software
           | features like their special sauce screen recorder, or that
           | trick where you can get a free camera inside a game, or
           | anything else packed into their gaming toolkit on Windows.
           | Oh, and accept a very high idle power draw if you choose not
           | to go with Windows.
           | 
           | With all of that, most games work out of the box. CUDA works,
           | video encoding and decoding works (though you're severely
           | limited in terms of the number of simultaneous streams
           | without hex editing the driver).
           | 
           | I do get the occasional Nvidia related crash, but it's been a
           | while. Still, I don't think I'll ever consider buying Nvidia
           | for a Linux device again. I was a fool to think "Nvidia has
           | come a long way, I'm sure a laptop with an Nvidia GPU will
           | work fine with a few tweaks".
        
             | adrian_b wrote:
             | I assume that how an Nvidia GPU works on a laptop depends a
             | lot on the laptop.
             | 
              | On a Lenovo gaming laptop, I lost two days figuring out how
              | to configure Linux on it for Nvidia Optimus, but after that
              | it has worked fine.
             | 
             | On the other hand, on several Dell Precision laptops with
             | Nvidia GPUs (sold with Ubuntu, but I have wiped their
             | Ubuntu and I have installed another Linux distribution from
             | scratch) the Nvidia GPUs have worked perfectly, out of the
             | box, without any effort.
             | 
             | I have not tried Wayland, but I have used 3 monitors for
             | many years, with various Nvidia GPUs and without any
             | problems. The configuration of multiple monitors with the
             | NVIDIA X Server Settings program is much simpler than in
             | Linux systems with AMD or Intel GPUs.
             | 
              | Because almost every new Linux kernel version breaks out-
              | of-tree device drivers, it is unavoidable that some time
              | must pass until NVIDIA releases a compatible driver
              | version, though that might change soon, once their new
              | open-source kernel module is integrated into the kernel
              | sources.
             | 
             | Nevertheless, the NVIDIA driver has always supported the
             | latest long-term kernel, so if you update only between
             | long-term versions there are no problems with incompatible
             | NVIDIA drivers.
        
             | formerly_proven wrote:
             | > Just don't use Wayland
             | 
             | Still reasonable
             | 
             | > don't use too many screens,
             | 
              | Is there such a thing as too many?
             | 
             | > don't use a laptop
             | 
             | Never bought one with a dGPU due to NVH concerns
             | 
             | > don't run a recent kernel
             | 
             | Using nvidia drivers on Arch for years now, what seems to
             | be the problem, officer?
             | 
             | > don't expect software features like their special sauce
             | screen recorder, or that trick where you can get a free
             | camera inside a game, or anything else packed into their
             | gaming toolkit on Windows.
             | 
              | Yeah, their desktop software on Linux is clearly the same
              | stuff they had in 2004, right down to the Qt 3.1 version
              | it's built with and the gamma ramp editor.
             | 
             | > Oh, and accept a very high idle power draw if you choose
             | not to go with Windows.
             | 
              | Power management worked identically to Windows with every
              | Nvidia card I've ever seen on Linux.
             | 
             | > a laptop
             | 
             | I think this is the actual salient issue here. Firmware
             | quality varies wildly on laptops, not just for stuff like
             | this, but even much more... basic things. Intel AX2xx wifi
             | cards cause crashes and freezes in both Windows and Linux
             | when used in some laptops (Thinkpads, namely), but are
             | perfectly fine and dandy in others, or on desktops.
             | 
              | I suspect firmware quality correlates strongly with how
              | much the OEM decided to write, which is to say, it gets
              | worse with every line of code added by the OEM. I think
              | that's why Thinkpads, HPs, Dells etc. are so notorious for
              | their shitty firmware and trashy ECs, while practically no-
              | name brands have none of these problems, simply because
              | they're more or less just sticking the Intel (or AMD)
              | reference platform in a box - and suddenly it "just works".
        
           | Gordonjcp wrote:
           | [dead]
        
           | theresistor wrote:
           | I always see this comment, but I have used NVidia GPUs for
           | gaming on Linux for a decade and they have always worked
           | perfectly with the proprietary drivers.
        
       | oldgradstudent wrote:
       | > A note culturally, I do sadly feel like what they responded to
       | was george is upset and saying bad things about us which is bad
       | for our brand and not holy shit we have a broken driver released
       | panicing thousands of people's kernels and crashing their GPUs.
       | But it's a start.
        
         | LargeTomato wrote:
         | George is right.
         | 
         | But
         | 
         | He has a history of being an arrogant prick. That will color
         | people's perceptions of you even if it's not relevant to the
         | immediate interaction.
        
           | brucethemoose2 wrote:
           | I didn't even know who Hotz was, but that YouTube rant on
           | this very issue made me extremely skeptical.
           | 
            | Should someone who will just drop a GPU vendor if they get
            | flustered _really_ be leading an ML framework?
        
             | BearOso wrote:
             | Hotz is known for hacking the iPhone and PS3. But you
             | should look deeper into that before giving him any credit.
             | 
             | His post is mostly hot air. AMD driver development is
             | _very_ open, excepting only ROCm. Not knowing where things
             | are discussed or how the various driver systems, excluding
             | Nvidia, interact on Linux is not an excuse to rant.
             | 
              | Corporate AMD kernel development starts in amd-staging-
              | drm-next, with internal AMD pull requests available on the
              | mailing list or here:
              | https://patchwork.freedesktop.org/project/amd-xorg-
              | ddx/serie..., before going through airlied, the maintainer
              | of drm in the mainline kernel, and then to Linus.
             | 
             | Everything user-space regarding OpenGL/Vulkan and some
             | rusty OpenCL, aside from the proprietary amdgpu-pro driver
             | which should almost never be used, is in Mesa.
             | 
             | ROCm is the only thing with huge code dumps, obviously
             | because it's a new effort, and it's said outright that it's
             | unsupported for consumer GPUs. Yes, it's a buggy mess. Did
             | his issue warrant the corporate response he got? No.
        
             | cavisne wrote:
              | To be fair, one of the issues with AMD and ML is that they
              | have been pretending it works for years now, when nothing
              | actually works.
             | 
             | Calling that out publicly is probably needed at this point,
             | or they will keep putting AI in their earnings reports
             | without anyone actually being able to train a model on an
             | AMD chip.
        
             | [deleted]
        
             | vbezhenar wrote:
             | You're welcome to fork and find another leader, I guess.
        
               | brucethemoose2 wrote:
                | I don't think some stability or a formal decision-making
                | structure is too much to ask. TVM, the MLIR-based
                | projects, and GGML seem to have stable leadership.
        
           | wmf wrote:
           | Everybody saying "poor AMD is such an underdog" for years =
           | perma-broken drivers
           | 
           | George being a prick one time = suddenly working driver
           | appears?
           | 
           | My perception is indeed colored but in the opposite
           | direction.
        
           | ysleepy wrote:
            | I found his video pretty dumb. Everybody is assuming the
            | driver is shitty, but what he tried is an officially
            | unsupported[1] GPU with ROCm, and it failed. Big whoop.
           | 
           | He is a serial shit stirrer and this is no exception.
           | 
           | Did he try Windows?
           | 
           | As much as I hate that AMD is not really supporting consumer
           | GPUs for compute, presenting his problems as some variant of
           | production drivers breaking is a stretch at best.
           | 
           | [1]: https://rocm.docs.amd.com/en/latest/release/gpu_os_suppo
           | rt.h...
        
             | mindcrime wrote:
             | _As much as I hate that AMD is not really supporting
             | consumer GPUs for compute,_
             | 
              | Keep in mind this post is around two months old now, and
              | since it was published AMD has already officially
              | announced plans to support (at least some) consumer GPUs
              | in ROCm.
        
             | jjoonathan wrote:
             | Leaving consumer GPUs unsupported is part of the problem
             | and AMD deserves to have its shit stirred for it.
             | 
              | They need to be better than Nvidia, not "you'll take what
              | you can get." We can get Nvidia. That's the bar. Nvidia
              | charges a premium but their shit works (comparatively
              | speaking). AMD has been half-assing their compute
              | offerings for 15 years; it finally became important, and
              | now they need to play catch-up, not drag their heels and
              | toss out excuses as to why that's OK. My prediction: AMD
              | won't be second place in this race for long. Someone who
              | actually wants the #2 spot will take it from them, and AMD
              | will sit around wondering how they blew a 15-year head
              | start.
        
               | latchkey wrote:
               | The AI race is so hot right now that AMD will sit in #2
               | and pick up all of the people who can't get access to
               | NVIDIA cards. There is a huge centralization issue around
               | only writing your code for NVIDIA... and that is a huge
               | business risk. People are going to wake up to that _fast_
               | as the supplies of NVIDIA go to zero.
               | 
                | That said, I don't think people realize that there is
                | literally no more large-scale Tier 3 (redundant) data
                | center power in the US. Even if you are sitting on 1,000
                | or 10k NVIDIA cards, you can't deploy them anywhere.
                | 
                | They also need to be in the same data center for speed.
                | You can't just colo in 2-3 data centers to get what you
                | want. Want to train a large model across 1,000 GPUs?
                | You're screwed.
               | 
               | Think you can just go to the cloud? Go try to sign up for
               | coreweave. They are full and not taking any more
               | customers. A lot of the other sites advertising nvidia
               | gpus are just reselling coreweave under the covers.
               | 
               | Forget the software problems. There are far bigger issues
               | and they are not getting better any time soon.
        
               | Aeolos wrote:
               | The problem is AMD is not #2 in this field, not by a long
               | shot.
               | 
               | Google and Amazon offer their own hardware that is both
               | better supported and actually available for customers.
               | Apple has fast inference hardware in every computer and
               | mobile device they offer.
               | 
                | I raised this issue with AMD 8 years ago at a technology
                | conference, and the answer I got back then was a shoulder
                | shrug: "we don't think this is an important market." 8
                | years later, they have all but lost the war.
        
               | latchkey wrote:
                | G/A can't be used for large-scale training; nobody is
                | going to give their data to them. Major trust issues
                | there.
                | 
                | Apple is Apple. Not public. Let's also not mix consumer
                | needs with enterprise.
                | 
                | You're correct: 8 years ago, and up until recently, AMD
                | only cared about gamers. They are waking up fast though.
                | 
                | ROCm 5.6 is a visible first step in that regard. MI300
                | will blow the A100/H100 out of the water.
               | 
               | But again, hardware/software isn't the problem here. The
               | problem is much deeper than that... even if you have
               | those things resolved, you can't put them anywhere.
        
               | [deleted]
        
               | jjoonathan wrote:
               | > Let's also not mix consumer needs with enterprise.
               | 
               | NVidia mixed them and now everything is written in CUDA.
               | Lol.
        
               | latchkey wrote:
               | And we now have hipcc to go back to AMD. Sweet!
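                | 
                | For a sense of what that looks like: a minimal vector-add
                | (just a sketch, assuming the ROCm HIP toolchain is
                | installed; the file and variable names are illustrative)
                | is nearly identical to its CUDA counterpart and builds
                | with hipcc:
                | 
                |     // vecadd.hip.cpp -- illustrative sketch, assumes the
                |     // ROCm HIP toolchain. Build: hipcc vecadd.hip.cpp
                |     #include <hip/hip_runtime.h>
                |     #include <cstdio>
                |     #include <vector>
                | 
                |     __global__ void vecadd(const float* a,
                |                            const float* b,
                |                            float* c, int n) {
                |       int i = blockIdx.x * blockDim.x + threadIdx.x;
                |       if (i < n) c[i] = a[i] + b[i];
                |     }
                | 
                |     int main() {
                |       const int n = 1 << 20;
                |       const size_t bytes = n * sizeof(float);
                |       std::vector<float> ha(n, 1.f), hb(n, 2.f), hc(n);
                |       float *da, *db, *dc;
                |       hipMalloc(&da, bytes);
                |       hipMalloc(&db, bytes);
                |       hipMalloc(&dc, bytes);
                |       hipMemcpy(da, ha.data(), bytes,
                |                 hipMemcpyHostToDevice);
                |       hipMemcpy(db, hb.data(), bytes,
                |                 hipMemcpyHostToDevice);
                |       // hipcc accepts the CUDA-style launch syntax
                |       vecadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
                |       hipMemcpy(hc.data(), dc, bytes,
                |                 hipMemcpyDeviceToHost);
                |       printf("c[0] = %f\n", hc[0]);  // expect 3.0
                |       hipFree(da); hipFree(db); hipFree(dc);
                |       return 0;
                |     }
                | 
                | The point is that the kernel and launch code carry over
                | unchanged; whether the runtime underneath behaves on a
                | given card is the part people are arguing about here.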
        
               | jjoonathan wrote:
               | Have fun with that. I burned my hand badly enough on
               | OpenCL that I now know to wait for proof, not promises.
        
               | latchkey wrote:
                | People are doing benchmarks on older ROCm releases and it
                | is looking pretty good.
                | 
                | https://www.mosaicml.com/blog/amd-mi250
                | 
                | Waiting on the updates.
                | 
                | I'll add that I have learned over time not to discount
                | motivation. If AMD is motivated, they can do it. This has
                | been proven already with their dominance over the server
                | CPU market.
        
               | querez wrote:
               | > G/A can't be used for large scale training, nobody is
               | going to give their data to them. Major trust issues
               | there.
               | 
                | And yet, that's what a lot of the big AI startups are
                | doing. Granted, it's not what everyday businesses are
                | doing (yet). But TPUs offer a pretty impressive perf/cost
                | ratio, so I'd be surprised if it's actually "nobody".
        
               | latchkey wrote:
               | > that's what a lot of the big AI startups are doing
               | 
               | They don't have any other choice or they are just dumb...
               | 
               | https://www.popsci.com/technology/google-ai-lawsuit/
        
               | rrdharan wrote:
               | The fact that this lawsuit exists doesn't prove anything.
               | 
                | Real evidence that Google or Amazon actually introspected
                | the contents of their cloud platform customers' VMs,
                | databases, GPUs, disks, blob storage buckets, etc. would
                | be far more convincing, but such evidence doesn't exist -
                | because it doesn't happen.
        
               | photonbeam wrote:
               | > no more large scale tier 3 (redundant) data center
               | power in the US.
               | 
               | This is interesting to me, what is the constraining
               | factor? Raw generator output? Transmission lines getting
               | power to the right places?
        
               | latchkey wrote:
               | Both. Transformers are a big one.
               | 
               | All the large FAANGs have been sucking up availability.
        
               | [deleted]
        
               | imtringued wrote:
                | AMD's cookie-cutter AI business model is getting million-
                | dollar contracts with a handful of companies. The
                | peasants don't get access to or use AMD's data center
                | cards.
        
               | Jasper_ wrote:
               | Remember, geohot was going to write the driver -- he came
               | out of the gate saying "AMD has ROCm but you can't use it
               | on consumer cards, so I'm going to write my own software
               | stack that works with AMD's consumer grade GPUs". He
               | raised $5.1m on that promise.
               | 
                | So far, with all that money, he has compiled their
                | driver, run it in an unsupported configuration, and then
                | had a complete public mental breakdown because it didn't
                | work. Something he already knew going in.
               | 
               | Should AMD support ROCm on its consumer grade GPUs?
               | Probably. But that's really not geohot's choice to make,
               | unless he wants to get his hands dirty and actually write
               | the software he promised to write.
               | 
                | Having worked on GPU drivers, I can say this is not a
                | couple-of-lines fix; it would be a pretty big investment
                | to add stable ROCm support for consumer GPUs. AMD higher-
                | ups responding with anything other than "lol what, it's
                | unsupported, what did you expect" is extending a pretty
                | long olive branch here.
        
               | AshamedCaptain wrote:
               | > "AMD has ROCm but you can't use it on consumer cards,
               | so I'm going to write my own software stack that works
               | with AMD's consumer grade GPUs". He raised $5.1m on that
               | promise.
               | 
               | Having not followed this, I don't understand what the
               | promise was. ROCm works just fine on at least some
               | consumer cards like the 6xxx, and by fine I mean "as bad
               | as on AMD's pro cards", but at least it works out of the
               | box.
               | 
                | Certainly it's not supported, and therefore they are not
                | shipping precompiled binaries, but it does seem to
                | work...
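                | 
                | FWIW, a quick way to see whether the ROCm runtime even
                | enumerates a given card is the HIP device-query API. A
                | minimal sketch (assuming hipcc and the HIP runtime are
                | installed; the file name is illustrative):
                | 
                |     // hipquery.cpp -- build with: hipcc hipquery.cpp
                |     #include <hip/hip_runtime.h>
                |     #include <cstdio>
                | 
                |     int main() {
                |       int count = 0;
                |       if (hipGetDeviceCount(&count) != hipSuccess ||
                |           count == 0) {
                |         printf("HIP runtime sees no usable devices\n");
                |         return 1;
                |       }
                |       for (int i = 0; i < count; ++i) {
                |         hipDeviceProp_t p;
                |         hipGetDeviceProperties(&p, i);
                |         // e.g. a 6800/6900 XT reports "gfx1030"
                |         printf("device %d: %s (%s)\n",
                |                i, p.name, p.gcnArchName);
                |       }
                |       return 0;
                |     }
                | 
                | If a card shows up here, the "unsupported" part is mostly
                | about AMD not testing or shipping binaries for it, not
                | about the runtime refusing to see it.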
        
               | hedgehog wrote:
                | To give him credit, I don't think he was saying he was
                | going to write new drivers, just build an ML software
                | stack that worked on whatever they ship for the consumer
                | cards. Certainly a large project by itself, with
                | questionable value at this point, but not writing new
                | drivers.
        
               | jjoonathan wrote:
                | The problem is the mismatch between spec sheet and
                | customer needs, not a mismatch between spec sheet and
                | card. I don't know whether this is a management problem
                | at AMD or an engineering problem at AMD or both. I don't
                | really care, either -- a "wasn't me!" from engineering is
                | _completely_ uninteresting to me. The problem is that
                | AMD's consumer cards don't run ROCm while Nvidia's
                | consumer cards all run CUDA.
                | 
                | I am amazed that geohot reached out to AMD and extended
                | an olive branch so far as to force their failure of a
                | product to work despite itself, because frankly I'd have
                | expected AMD to just spin excuses like you did. I am
                | encouraged to see that they took the higher road;
                | hopefully that translates to actual execution. We'll see.
        
             | imtringued wrote:
              | Your comment reminds me of another comment (by SilverBirch,
              | I believe) saying that AMD can do this and will get away
              | with it, and that George Hotz is unimportant to AMD. Then a
              | few weeks later Lisa Su tweeted this:
              | https://twitter.com/LisaSu/status/1669848494637735936
        
       | WesolyKubeczek wrote:
        | I've known for years that AMD's development model leaves a lot to
        | be desired.
        | 
        | They like to throw big balls of source over the wall, and very
        | soon after, the bugs that keep haunting the previous generation
        | of hardware just stop getting fixed. You're SOL unless Dave
        | Airlie himself runs into identical problems on his personal gear
        | and gets angry enough about it to make a fix.
        
         | iforgotpassword wrote:
         | Yeah, I've ranted about this here in the past. I'm glad someone
         | high-profile enough is finally doing the same, as that might
         | actually lead somewhere.
         | 
          | Their closed development process is so moronic. It means their
          | code is always ahead of and out of sync with what is public.
          | There have been user-provided fixes and improvements to ROCm on
          | GitHub with no reaction from AMD, probably because they
          | wouldn't apply cleanly to whatever they currently have. It's
          | sad to see your customers having to fix your drivers. It's even
          | sadder to see you ignore it.
        
           | ethbr0 wrote:
           | This is the greatest travesty of closed/open models, IMHO.
           | 
           | I get exactly what happens -- devs are shipping their sprint
           | pipeline internally, and pull requests or issues from the
           | current open source head never make it onto their radar.
           | 
            | This gatekeeps bug visibility behind whatever PM is running
            | sprint planning, which always leads to delivering what
            | management wants instead of what users are dealing with.
           | 
           | In complex, diverse environments, you're always going to have
           | a myriad of bugs that are caused by a specific configuration.
           | 
           | Absent a path to actually fix that, you end up with an
           | enterprise-only product that's only stable in extremely
           | specific configurations.
           | 
            | And the _greatest_ evil... you're also ignoring helpful
            | reports of configurations where it's broken!
        
       | znpy wrote:
       | > And in order to beat them, you must be playing to win, not just
       | playing not to lose. Combine the driver openness with public
       | hardware docs and you have a competitive advantage.
       | 
        | This is so immensely true. I've been enthusiastically loyal to
        | Intel GPUs for the last ten years or so (avoiding laptops with
        | any kind of discrete GPU) because dealing with closed drivers is
        | so much pain.
        | 
        | I'm still skeptical about AMD GPUs, even though I hear good stuff
        | about them.
        | 
        | I just want hardware I can trust, knowing I won't have dumb
        | driver issues.
        
       | tikkun wrote:
       | A possible prescription for AMD regarding AI and CUDA:
       | 
        | 1) Open-source driver development, as mentioned in this post.
        | 
        | 2) Set up 24/7 free software tech support on Discord. Maybe for
        | all use cases, maybe only for AI use cases. Do the tech support
        | via screen sharing, and have a writer join all calls, so that
        | every issue, once solved, gets a blog post.
        | 
        | 3) Have employees run all popular AI tools and get them working
        | on AMD hardware, and publish written guides and videos showing
        | how to do it.
        
         | zirgs wrote:
         | 4) Release a consumer GPU with 32-80 GB of VRAM.
        
           | Tuna-Fish wrote:
            | It sounds silly, but people would endure a lot of pain to fit
            | significantly larger models into RAM.
        
           | ciupicri wrote:
           | What are you planning to play on that thing?
        
             | Filligree wrote:
             | Llama 2.
        
           | latchkey wrote:
            | Or at least make it so that you can rent the larger GPUs by
            | the hour.
            | 
            | As far as I know, there isn't a single service that offers
            | bare-metal access to MI210s/MI250s.
        
           | fbdab103 wrote:
           | It could even be 2-3 generations behind and that thing would
           | still sell.
        
         | hedgehog wrote:
          | The problem is not that people within the company don't have
          | good ideas about how to improve the deep learning research
          | end-user experience; it's just not a priority for AMD. It's
          | annoying as a potential customer, but arguably, whatever their
          | overall strategy is, it's working.
        
           | [deleted]
        
         | lul_open wrote:
          | Yes, let's build an open source community on top of a closed
          | platform. It's not like Twitter, Reddit, Facebook, etc. have
          | taught us anything.
          | 
          | Open a mailing list like every project that survives more than
          | 5 years. Hell, the barrier to entry will ensure you get people
          | who can use a text editor, and will spare you half the 'how do
          | I install this on an Intel a2 Mac???' questions.
        
       ___________________________________________________________________
       (page generated 2023-07-30 23:01 UTC)