[HN Gopher] A dive into the AMD driver workflow
___________________________________________________________________
A dive into the AMD driver workflow
Author : tikkun
Score : 116 points
Date : 2023-07-30 13:06 UTC (9 hours ago)
(HTM) web link (geohot.github.io)
(TXT) w3m dump (geohot.github.io)
| aliljet wrote:
| I'm so thoroughly confused about why AMD wouldn't be falling over
| themselves to enable geohot and his followers to build an
| alternative to CUDA and NVIDIA. This feels like a conversation
| that geohot is attempting with feckless product and software
| managers who certainly can't make bold decisions. Has the CEO of
| AMD effectively spoken about this problem?
| izacus wrote:
| Why would they spend limited resources on a fight they don't
| want?
|
| Part of leading a company is knowing which markets to pass up.
| Roark66 wrote:
| You can say a lot about nVidia, but for me all their products
| mostly just work on Linux (I use CUDA a lot). I don't understand
| why AMD is having such trouble doing the same. Likewise with
| CPUs: it is ironic that I have to use Intel's math libraries to
| get good performance out of my AMD CPU.
| shmerl wrote:
| For gaming and desktop usage, the situation is the complete
| opposite: Nvidia is plagued by its lack of upstreamed drivers,
| while AMD just works.
| amlib wrote:
| Really, on Linux you would be better off having a primary AMD
| or Intel GPU and then a secondary Nvidia GPU just for
| compute/CUDA tasks. But alas, many new AM5 motherboards don't
| even have a secondary x16 PCIe slot capable of even PCIe 4.0 x4
| bandwidth anymore. I guess we should thank this cloud compute
| craze for that...
| shmerl wrote:
| Some, like ASRock, reduced the number of PCIe slots and
| increased the number of USB 4 ports. So at least there is now
| an option to use an external GPU, since that's where they seem
| to be re-allocating the available bandwidth.
| sosodev wrote:
| This hasn't been my experience on Linux. The nvidia drivers
| appear to "just work" but actually caused a lot of instability.
| My desktop no longer crashes every other day (or more often
| when gaming) since switching to an AMD GPU.
| jeroenhd wrote:
| They work absolutely fine if you stick with what worked five
| years ago and don't update anything until some Nvidia blog
| says you should.
|
| Just don't use Wayland, don't use too many screens, don't use
| a laptop, don't run a recent kernel and don't expect software
| features like their special sauce screen recorder, or that
| trick where you can get a free camera inside a game, or
| anything else packed into their gaming toolkit on Windows.
| Oh, and accept a very high idle power draw if you choose not
| to go with Windows.
|
| With all of that, most games work out of the box. CUDA works,
| video encoding and decoding works (though you're severely
| limited in terms of the number of simultaneous streams
| without hex editing the driver).
|
| I do get the occasional Nvidia related crash, but it's been a
| while. Still, I don't think I'll ever consider buying Nvidia
| for a Linux device again. I was a fool to think "Nvidia has
| come a long way, I'm sure a laptop with an Nvidia GPU will
| work fine with a few tweaks".
| adrian_b wrote:
| I assume that how an Nvidia GPU works on a laptop depends a
| lot on the laptop.
|
| On a Lenovo gaming laptop, I lost two days figuring out how to
| configure Linux against Nvidia Optimus, but after that it has
| worked fine.
|
| On the other hand, on several Dell Precision laptops with
| Nvidia GPUs (sold with Ubuntu, but I wiped their Ubuntu and
| installed another Linux distribution from scratch), the Nvidia
| GPUs have worked perfectly out of the box, without any effort.
|
| I have not tried Wayland, but I have used 3 monitors for
| many years, with various Nvidia GPUs and without any
| problems. The configuration of multiple monitors with the
| NVIDIA X Server Settings program is much simpler than in
| Linux systems with AMD or Intel GPUs.
|
| Because almost every new Linux kernel version breaks the
| out-of-tree device drivers, it is unavoidable that some
| time must pass until NVIDIA releases a compatible driver
| version, though that might change soon, when their new
| open-source kernel module will be integrated in the kernel
| sources.
|
| Nevertheless, the NVIDIA driver has always supported the
| latest long-term kernel, so if you update only between
| long-term versions there are no problems with incompatible
| NVIDIA drivers.
| formerly_proven wrote:
| > Just don't use Wayland
|
| Still reasonable
|
| > don't use too many screens,
|
| Is there too many?
|
| > don't use a laptop
|
| Never bought one with a dGPU due to NVH concerns
|
| > don't run a recent kernel
|
| Using nvidia drivers on Arch for years now, what seems to
| be the problem, officer?
|
| > don't expect software features like their special sauce
| screen recorder, or that trick where you can get a free
| camera inside a game, or anything else packed into their
| gaming toolkit on Windows.
|
| Yeah their desktop-software on Linux is clearly the same
| stuff they had in 2004, right down to the Qt 3.1 version
| it's built with and the gamma ramp editor.
|
| > Oh, and accept a very high idle power draw if you choose
| not to go with Windows.
|
| Power management worked exactly identically to Windows with
| every nVidia card I've ever seen on Linux.
|
| > a laptop
|
| I think this is the actual salient issue here. Firmware
| quality varies wildly on laptops, not just for stuff like
| this, but even much more... basic things. Intel AX2xx wifi
| cards cause crashes and freezes in both Windows and Linux
| when used in some laptops (Thinkpads, namely), but are
| perfectly fine and dandy in others, or on desktops.
|
| I suspect firmware quality correlates strongly with how
| much the OEM decided to write, which is to say, it gets
| worse with every line of code added by the OEM. I think
| that's why Thinkpads, HPs, Dells etc. are so notorious for
| their shitty firmware and trashy ECs, while practically no-name
| brands have none of these problems, simply because they're more
| or less just sticking the Intel (or AMD) reference platform in
| a box - and suddenly it "just works".
| Gordonjcp wrote:
| [dead]
| theresistor wrote:
| I always see this comment, but I have used NVidia GPUs for
| gaming on Linux for a decade and they have always worked
| perfectly with the proprietary drivers.
| oldgradstudent wrote:
| > A note culturally, I do sadly feel like what they responded to
| was george is upset and saying bad things about us which is bad
| for our brand and not holy shit we have a broken driver released
| panicing thousands of people's kernels and crashing their GPUs.
| But it's a start.
| LargeTomato wrote:
| George is right.
|
| But
|
| He has a history of being an arrogant prick. That will color
| people's perceptions of you even if it's not relevant to the
| immediate interaction.
| brucethemoose2 wrote:
| I didn't even know who Hotz was, but that YouTube rant on
| this very issue made me extremely skeptical.
|
| Should someone who will just drop a GPU vendor if they get
| flustered _really_ be leading a ML framework?
| BearOso wrote:
| Hotz is known for hacking the iPhone and PS3. But you
| should look deeper into that before giving him any credit.
|
| His post is mostly hot air. AMD driver development is
| _very_ open, excepting only ROCm. Not knowing where things
| are discussed or how the various driver systems, excluding
| Nvidia, interact on Linux is not an excuse to rant.
|
| Corporate AMD kernel development starts in amd-staging-drm-
| next, with internal AMD pull requests available on the mailing
| list or here:
| https://patchwork.freedesktop.org/project/amd-xorg-
| ddx/serie..., before going through airlied, the maintainer of
| drm in the main kernel, and then to Linus.
|
| Everything user-space regarding OpenGL/Vulkan and some
| rusty OpenCL, aside from the proprietary amdgpu-pro driver
| which should almost never be used, is in Mesa.
|
| ROCm is the only thing with huge code dumps, obviously
| because it's a new effort, and it's said outright that it's
| unsupported for consumer GPUs. Yes, it's a buggy mess. Did
| his issue warrant the corporate response he got? No.
| cavisne wrote:
| To be fair one of the issues with AMD and ML is they have
| been pretending it works for years now, when nothing
| actually works.
|
| Calling that out publicly is probably needed at this point,
| or they will keep putting AI in their earnings reports
| without anyone actually being able to train a model on an
| AMD chip.
| [deleted]
| vbezhenar wrote:
| You're welcome to fork and find another leader, I guess.
| brucethemoose2 wrote:
| I don't think some stability or a formal decision making
| structure is too much to ask. TVM, MLIR based projects,
| GGML seem to have stable leadership.
| wmf wrote:
| Everybody saying "poor AMD is such an underdog" for years =
| perma-broken drivers
|
| George being a prick one time = suddenly working driver
| appears?
|
| My perception is indeed colored but in the opposite
| direction.
| ysleepy wrote:
| I found his video pretty dumb. Everybody is assuming the driver
| is shitty, but he tried an officially unsupported[1] GPU with
| ROCm and it failed. Big whoop.
|
| He is a serial shit stirrer and this is no exception.
|
| Did he try Windows?
|
| As much as I hate that AMD is not really supporting consumer
| GPUs for compute, presenting his problems as some variant of
| production drivers breaking is a stretch at best.
|
| [1]: https://rocm.docs.amd.com/en/latest/release/gpu_os_suppo
| rt.h...
| mindcrime wrote:
| _As much as I hate that AMD is not really supporting
| consumer GPUs for compute,_
|
| Keep in mind this post is around two months old now, and since
| it was published AMD has already officially announced plans to
| support (at least some) consumer GPUs in ROCm.
| jjoonathan wrote:
| Leaving consumer GPUs unsupported is part of the problem
| and AMD deserves to have its shit stirred for it.
|
| They need to be better than nvidia, not "you'll take what
| you can get." We can get nvidia. That's the bar. NVidia
| charges a premium but their shit works (comparatively
| speaking). AMD has been half-assing their compute offerings
| for 15 years, it finally became important, and now they
| need to play catch-up not drag their heels and toss out
| excuses as to why that's OK. My prediction: AMD won't be
| second place in this race for long, someone who actually
| wants the #2 spot will take it from them and AMD will sit
| around wondering how they blew a 15 year head start.
| latchkey wrote:
| The AI race is so hot right now that AMD will sit in #2
| and pick up all of the people who can't get access to
| NVIDIA cards. There is a huge centralization issue around
| only writing your code for NVIDIA... and that is a huge
| business risk. People are going to wake up to that _fast_
| as the supplies of NVIDIA go to zero.
|
| That said, I don't think people realize that there is
| literally no more large scale tier 3 (redundant) data
| center power in the US. Even if you are sitting on 1000
| or 10k NVIDIA cards, you can't deploy them anywhere.
|
| They also need to be in the same data center for speed.
| You can't just colo in 2-3 data centers to get what you
| want. If you want to train a large model across 1000
| gpus? You're screwed.
|
| Think you can just go to the cloud? Go try to sign up for
| coreweave. They are full and not taking any more
| customers. A lot of the other sites advertising nvidia
| gpus are just reselling coreweave under the covers.
|
| Forget the software problems. There are far bigger issues
| and they are not getting better any time soon.
| Aeolos wrote:
| The problem is AMD is not #2 in this field, not by a long
| shot.
|
| Google and Amazon offer their own hardware that is both
| better supported and actually available for customers.
| Apple has fast inference hardware in every computer and
| mobile device they offer.
|
| I raised this issue with AMD 8 years ago at a technology
| conference, and the answer I got back then was a shoulder
| shrug: "we don't think this is an important market." 8 years
| later, and they have all but lost the war.
| latchkey wrote:
| G/A can't be used for large scale training, nobody is
| going to give their data to them. Major trust issues
| there.
|
| Apple is Apple. Not public. Let's also not mix consumer
| needs with enterprise.
|
| You're correct, 8 years ago and up until recently, AMD
| only cared about gamers. They are waking up fast though.
|
| ROCm 5.6 is a visible first step in that regard. MI300 will
| blow the A100/H100 out of the water.
|
| But again, hardware/software isn't the problem here. The
| problem is much deeper than that... even if you have
| those things resolved, you can't put them anywhere.
| [deleted]
| jjoonathan wrote:
| > Let's also not mix consumer needs with enterprise.
|
| NVidia mixed them and now everything is written in CUDA.
| Lol.
| latchkey wrote:
| And we now have hipcc to go back to AMD. Sweet!
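(Editorial aside: for readers unfamiliar with hipcc, the usual CUDA-to-HIP porting flow looks roughly like the sketch below. The file name and GPU target are hypothetical, not taken from the thread.)

```shell
# Sketch of a typical CUDA-to-HIP port (hypothetical file names).
# hipify-perl rewrites CUDA API calls (cudaMalloc, cudaMemcpy,
# <<<...>>> kernel launches) into their HIP equivalents.
hipify-perl vector_add.cu > vector_add.hip.cpp

# hipcc then compiles the result for an AMD GPU target; the same
# source can also be built back through nvcc on NVIDIA hardware.
hipcc --offload-arch=gfx90a vector_add.hip.cpp -o vector_add
```

The point of the design is that a single HIP source tree can target both vendors, which is what makes "going back to AMD" a porting exercise rather than a rewrite.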
| jjoonathan wrote:
| Have fun with that. I burned my hand badly enough on
| OpenCL that I now know to wait for proof, not promises.
| latchkey wrote:
| People are doing benchmarks on older ROCm releases and it is
| looking pretty good.
|
| https://www.mosaicml.com/blog/amd-mi250
|
| Waiting on the updates.
|
| I'll add that I have learned over time not to discount
| motivation. If AMD is motivated, they can do it. They have
| already proven this with their dominance of the server CPU
| market.
| querez wrote:
| > G/A can't be used for large scale training, nobody is
| going to give their data to them. Major trust issues
| there.
|
| And yet, that's what a lot of the big AI startups are doing.
| Granted, it's not what everyday businesses are doing (yet). But
| TPUs offer a pretty impressive perf/cost ratio, so I'd be
| surprised if it's actually "nobody".
| latchkey wrote:
| > that's what a lot of the big AI startups are doing
|
| They don't have any other choice or they are just dumb...
|
| https://www.popsci.com/technology/google-ai-lawsuit/
| rrdharan wrote:
| The fact that this lawsuit exists doesn't prove anything.
|
| Real evidence that Google or Amazon actually introspected the
| contents of their cloud customers' VMs, databases, GPUs, disks,
| blob storage buckets, etc. would be far more convincing, but
| such evidence doesn't exist - because it doesn't happen.
| photonbeam wrote:
| > no more large scale tier 3 (redundant) data center
| power in the US.
|
| This is interesting to me, what is the constraining
| factor? Raw generator output? Transmission lines getting
| power to the right places?
| latchkey wrote:
| Both. Transformers are a big one.
|
| All the large FAANGs have been sucking up availability.
| [deleted]
| imtringued wrote:
| AMD's cookie-cutter AI business model is getting million-dollar
| contracts with a handful of companies. The peasants don't get
| access to, or use of, AMD's data center cards.
| Jasper_ wrote:
| Remember, geohot was going to write the driver -- he came
| out of the gate saying "AMD has ROCm but you can't use it
| on consumer cards, so I'm going to write my own software
| stack that works with AMD's consumer grade GPUs". He
| raised $5.1m on that promise.
|
| So far, with all that money, he has compiled their
| driver, ran it in an unsupported configuration, and then
| had a complete public mental breakdown because it didn't
| work. Something he already knew going in.
|
| Should AMD support ROCm on its consumer grade GPUs?
| Probably. But that's really not geohot's choice to make,
| unless he wants to get his hands dirty and actually write
| the software he promised to write.
|
| Having worked on GPU drivers, it's not a couple-lines fix
| situation, it would be a pretty big investment to add
| stable ROCm support for consumer GPUs. AMD higher-ups
| responding with anything other than "lol what it's
| unsupported what did you expect" is extending a pretty
| long olive branch here.
| AshamedCaptain wrote:
| > "AMD has ROCm but you can't use it on consumer cards,
| so I'm going to write my own software stack that works
| with AMD's consumer grade GPUs". He raised $5.1m on that
| promise.
|
| Having not followed this, I don't understand what the
| promise was. ROCm works just fine on at least some
| consumer cards like the 6xxx, and by fine I mean "as bad
| as on AMD's pro cards", but at least it works out of the
| box.
|
| Certainly it's not supported, and so therefore they are
| not shipping precompiled binaries, but it does seem to
| work...
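(Editorial aside: the "works out of the box" experience on unsupported consumer cards often leans on an unofficial environment override. The fragment below is a commonly reported workaround, not official AMD guidance; the version number shown is the one typically used to make RDNA2 parts report as gfx1030.)

```shell
# Commonly reported (unofficial) ROCm workaround for consumer
# RDNA2 cards: override the GFX version the runtime detects so
# the card is treated as the gfx1030 target ROCm ships kernels for.
export HSA_OVERRIDE_GFX_VERSION=10.3.0
```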
| hedgehog wrote:
| To give him credit I don't think he was saying he was
| going to write new drivers, just build a ML software
| stack that worked on whatever they ship for the consumer
| cards. Certainly a large project by itself with
| questionable value at this point, but not writing new
| drivers.
| jjoonathan wrote:
| The problem is mismatch between spec sheet and customer
| needs, not mismatch between spec sheet and card. I don't
| know whether this is a management problem at AMD or an
| engineering problem at AMD or both. I don't really care,
| either -- a "wans't me!" from engineering is _completely_
| uninteresting to me. The problem is that AMD 's consumer
| cards don't run ROCm while NVidia's consumer cards all
| run CUDA.
|
| I am amazed that Geohot reached out to AMD to extend an
| olive branch so far as to force their failure of a
| product to work despite itself, because frankly I'd just
| have expected AMD to spin excuses like you did. I am
| encouraged to see that they took a higher road; hopefully
| that translates to actual execution. We'll see.
| imtringued wrote:
| Your comment reminds me of another comment (by SilverBirch, I
| believe) that AMD can do this and will get away with it, and
| that George Hotz is unimportant to AMD. Then a few weeks later
| Lisa Su tweeted this:
| https://twitter.com/LisaSu/status/1669848494637735936.
| WesolyKubeczek wrote:
| I've known for years that AMD's development model leaves a lot
| to be desired.
|
| They like to throw big balls of source over the wall, and very
| soon after the bugs that keep haunting the previous generation of
| hardware just stop getting fixed. You're SOL unless Dave Airlie
| himself runs into identical problems on his personal gear and
| gets angry enough about it to make a fix.
| iforgotpassword wrote:
| Yeah, I've ranted about this here in the past. I'm glad someone
| high-profile enough is finally doing the same, as that might
| actually lead somewhere.
|
| Their closed development process is so moronic. It means their
| code is always ahead of, and out of sync with, what is in
| public.
| There have been user provided fixes and improvements to ROCm on
| GitHub with no reaction from AMD. Probably because it wouldn't
| apply cleanly to whatever they currently have. It's sad to see
| your customers having to fix your drivers. It's even sadder to
| see you ignore it.
| ethbr0 wrote:
| This is the greatest travesty of closed/open models, IMHO.
|
| I get exactly what happens -- devs are shipping their sprint
| pipeline internally, and pull requests or issues from the
| current open source head never make it onto their radar.
|
| This gatekeeps bug visibility behind whatever PM is running
| sprint planning, which always leads to delivering what
| management wants instead of what users are dealing with.
|
| In complex, diverse environments, you're always going to have
| a myriad of bugs that are caused by a specific configuration.
|
| Absent a path to actually fix that, you end up with an
| enterprise-only product that's only stable in extremely
| specific configurations.
|
| And the _greatest_ evil... you're also ignoring helpful reports
| of configurations where it's broken!
| znpy wrote:
| > And in order to beat them, you must be playing to win, not just
| playing not to lose. Combine the driver openness with public
| hardware docs and you have a competitive advantage.
|
| This is so immensely true. I've been so enthusiastically loyal
| to Intel GPUs in the last ten years or so (avoiding laptops
| with any kind of discrete GPU) because dealing with closed
| drivers is so much pain.
|
| I'm still skeptical about AMD GPUs, even though I hear good
| stuff about them.
|
| I just want hardware I can trust, knowing I won't have dumb
| driver issues.
| tikkun wrote:
| A possible prescription for AMD regarding AI and CUDA:
|
| 1) Open-source driver development as mentioned in this post
|
| 2) Set up 24/7 free software tech support on Discord. Maybe for
| all use cases, maybe only AI use cases. Do the tech support via
| screen sharing, and have a writer join all calls, so that every
| issue, once solved, gets a blog post.
|
| 3) Have employees run all popular AI tools and get them working
| on AMD hardware, publish written guides and videos showing how to
| do it.
| zirgs wrote:
| 4) Release a consumer GPU with 32-80 GB of VRAM.
| Tuna-Fish wrote:
| It sounds silly, but people would endure a lot of pain to fit
| significantly larger models into RAM.
| ciupicri wrote:
| What are you planning to play on that thing?
| Filligree wrote:
| Llama 2.
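(Editorial aside: rough, illustrative math on why a 32-80 GB card matters for models like Llama 2. Parameter counts are the commonly cited model sizes; activation and KV-cache memory are ignored, so these are lower bounds.)

```python
def weights_gib(params_billion: float, bytes_per_param: float) -> float:
    """GiB needed just to hold the weights at a given precision."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# fp16 is 2 bytes per weight; a rough 4-bit quantization is 0.5.
for params in (7, 13, 70):
    print(f"Llama {params}B: ~{weights_gib(params, 2.0):.0f} GiB fp16, "
          f"~{weights_gib(params, 0.5):.0f} GiB 4-bit")
```

Even 4-bit Llama 70B (roughly 33 GiB of weights alone) overflows every 24 GB consumer card, which is exactly the gap a 32-80 GB part would close.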
| latchkey wrote:
| Or at least make it so that you could rent the larger GPUs by
| the hour.
|
| As far as I know, there isn't a single service that offers
| bare metal access to MI210/MI250s.
| fbdab103 wrote:
| It could even be 2-3 generations behind and that thing would
| still sell.
| hedgehog wrote:
| The problem is not that people within the company don't have
| good ideas about how to improve the deep-learning end-user
| experience; it's just not a priority for AMD. It's annoying as
| a potential customer, but arguably, whatever their overall
| strategy is, it's working.
| [deleted]
| lul_open wrote:
| Yes, let's build an open source community on top of a closed
| platform. It's not like Twitter, Reddit, Facebook, etc. have
| taught us anything.
|
| Open a mailing list like every project that survives more than
| 5 years. Hell, the barrier to entry will ensure you get people
| who can use a text editor, and save you half the questions of
| "how do I install this on an Intel a2 Mac???"
___________________________________________________________________
(page generated 2023-07-30 23:01 UTC)