[HN Gopher] AMD GPU Inference
___________________________________________________________________
AMD GPU Inference
Author : fazkan
Score : 142 points
Date : 2024-10-02 07:16 UTC (15 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| a2128 wrote:
| It seems to use a two-year-old version of ROCm (5.4.2), which
| I'm doubtful would support my RX 7900 XTX. I personally found it
| easiest to just use the latest `rocm/pytorch` image and run what
| I need from there.
| slavik81 wrote:
| The RX 7900 XTX (gfx1100) was first enabled in the math
| libraries (e.g. rocBLAS) for ROCm 5.4, but I don't think the AI
| libraries (e.g. MIOpen) had it enabled until ROCm 5.5. I
| believe the performance improved significantly in later
| releases, as well.
| ashirviskas wrote:
| I'm all for having more open source projects, but I don't see
| how this one can be useful in the current ecosystem, especially
| for people with newer AMD GPUs (not supported in this project)
| that are already supported in most popular projects.
| fazkan wrote:
| Just something we found helpful. Support for new architectures
| is just a package update; this is more of a cookie-cutter
| setup.
| lhl wrote:
| For inference, if you have a supported card (or probably
| architecture if you are on Linux and can use
| HSA_OVERRIDE_GFX_VERSION), then you can probably run anything
| with (upstream) PyTorch and transformers. Also, compiling
| llama.cpp has been pretty trouble-free for me for at least a
| year.
|
| (If you are on Windows, there is usually a win-hip binary of
| llama.cpp in the project's releases or if things totally refuse
| to work, you can use the Vulkan build as a (less performant)
| fallback).
|
| Having more options can't hurt, but ROCm 5.4.2 is almost 2 years
| old, and things have come a long way since then, so I'm curious
| about this being published freshly today, in October 2024.
|
| BTW, I recently went through and updated my compatibility doc
| (focused on RDNA3) w/ ROCm 6.2 for those interested. A lot has
| changed just in the past few months (upstream bitsandbytes,
| upstream xformers, and Triton-based Flash Attention):
| https://llm-tracker.info/howto/AMD-GPUs
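|
| For concreteness, a rough sketch of what "upstream PyTorch +
| transformers" amounts to (model name is a placeholder, and the
| HSA override value depends on your architecture):
|
|     # ROCm builds of PyTorch expose AMD GPUs through the
|     # torch.cuda API, so "cuda" below is the AMD device.
|     # For unsupported-but-compatible cards, launch with e.g.:
|     #   HSA_OVERRIDE_GFX_VERSION=11.0.0 python infer.py
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder
|     tok = AutoTokenizer.from_pretrained(model_id)
|     model = AutoModelForCausalLM.from_pretrained(
|         model_id, torch_dtype=torch.float16
|     ).to("cuda")
|
|     prompt = "Briefly explain what ROCm is."
|     inputs = tok(prompt, return_tensors="pt").to("cuda")
|     out = model.generate(**inputs, max_new_tokens=64)
|     print(tok.decode(out[0], skip_special_tokens=True))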
| woodrowbarlow wrote:
| i also have been playing with inference on the amd 7900xtx, and
| i agree. there are no hoops to jump through these days. just
| make sure to install the rocm version of torch (if using a1111
| or similar, don't trust requirements.txt), as shown clearly on
| the pytorch homepage. obsidian is a similar story. hip is
| straightforward, at least on arch and ubuntu (fedora still
| requires some twiddling, though). i didn't realize xformers is
| also functional! that's good news.
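|
| for reference, the rocm torch install is roughly this (check
| the pytorch homepage for the current rocm suffix, it changes
| between releases):
|
|     pip3 install torch torchvision \
|         --index-url https://download.pytorch.org/whl/rocm6.1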
| conshama wrote:
| related: https://www.nonbios.ai/post/deploying-
| large-405b-models-in-f...
|
| tldr: uses the latest ROCm 6.2 to run full-precision inference
| for Llama 405B on a single node with 8x MI300X AMD GPUs.
|
| How mature do you think the ROCm 6.2 AMD stack is compared to
| Nvidia's?
| fazkan wrote:
| this uses vllm?
| qamononep wrote:
| It would be great if you included a section on running with
| Docker on Linux. The only one that worked out of the box was
| Ollama, and it had an example:
| https://github.com/ollama/ollama/blob/main/docs/docker.md
|
| llama.cpp has a Docker image but no examples to run it:
| https://github.com/ggerganov/llama.cpp/blob/master/docs/dock...
|
| koboldcpp has a Docker image but no examples to run it:
| https://github.com/LostRuins/koboldcpp?tab=readme-ov-file#do...
|
| text-generation-webui's Docker image was broken for me on a
| 7800 XT running RHEL 9:
| https://github.com/Atinoda/text-generation-webui-docker
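|
| For reference, the Ollama invocation that worked for me was
| roughly what their docker.md shows:
|
|     docker run -d --device /dev/kfd --device /dev/dri \
|         -v ollama:/root/.ollama -p 11434:11434 \
|         --name ollama ollama/ollama:rocm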
| fazkan wrote:
| Good feedback, thanks. Would you be able to open an issue?
| leonheld wrote:
| People use "Docker-based" all the time but what they mean is that
| they ship $SOFTWARE in a Docker image.
|
| "Docker-based" reads, to me, as if you were doing Inference on
| AMD cards with Docker somehow, which doesn't make sense.
| a_vanderbilt wrote:
| You can do inference from a Docker container, just as you'd do
| it with NVidia. OpenAI runs a K8s cluster doing this. I have
| personally only worked with NVidia, but the docs are present
| for AMD too.
|
| Like anything AI and AMD, you need the right card(s) and rocm
| version along with sheer dumb luck to get it working. AMD has
| Docker images with rocm support, so you could merge your app in
| with that as the base layer. Just pass through the GPU to the
| container and you should get it working.
|
| It might just be the software in a Docker image, but it removes
| a variable I would otherwise have to worry about during
| deployment. It literally is inference on AMD with Docker, if
| that's what you meant.
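|
| A minimal sketch of that "base layer" approach (image tag,
| packages, and script name are just placeholders):
|
|     FROM rocm/pytorch:latest
|     RUN pip install transformers accelerate
|     WORKDIR /app
|     COPY app.py .
|     CMD ["python", "app.py"]
|
| Then pass the GPU through at run time (--device /dev/kfd
| --device /dev/dri) and it behaves like any other container.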
| mikepurvis wrote:
| Docker became part of the standard toolkit for ML because
| deploying Python that links to underlying system libraries is a
| gong show unless you ship that layer too.
| tannhaeuser wrote:
| Even Docker doesn't guarantee reproducible results, due to
| sensitivity to host GPU drivers and to ML
| frontends/integrations bringing their own "helpful",
| newbie-friendly all-in-one dependency checks and updater
| services.
| jeffhuys wrote:
| Why doesn't it make sense? You can talk to devices from a
| Docker container - you just have to attach it.
| fazkan wrote:
| You can mount a specific device into the Docker container. If
| you read the script, we are mounting the GPUs:
|
| https://github.com/slashml/amd_inference/blob/main/run-docke...
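|
| In generic terms (not quoting the script, just the flags a ROCm
| container typically needs), the mount looks something like:
|
|     docker run -it \
|         --device=/dev/kfd --device=/dev/dri \
|         --group-add video \
|         --security-opt seccomp=unconfined \
|         rocm/pytorch:latest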
| steeve wrote:
| Hi, we (ZML) fix that: https://github.com/zml/zml
| fazkan wrote:
| This is pretty cool. Is there a document that shows which AMD
| drivers are supported out of the box?
| steeve wrote:
| We are in line with ROCm 6.2 support. We actually just
| opened a PR to bump to 6.2.2:
| https://github.com/zml/zml/pull/39
| khurdula wrote:
| Are we supposed to use AMD GPUs for this to work, or does it
| work on any GPU?
| karamanolev wrote:
| > This project provides a Docker-based inference engine for
| running Large Language Models (LLMs) on AMD GPUs.
|
| First sentence of the README in the repo. Was it somehow
| unclear?
| tomxor wrote:
| I almost tried to install AMD ROCm a while ago after
| discovering the simplicity of llamafile.
|
|     sudo apt install rocm
|
|     Summary:
|       Upgrading: 0, Installing: 203, Removing: 0, Not Upgrading: 0
|     Download size: 2,369 MB / 2,371 MB
|     Space needed: 35.7 GB / 822 GB available
|
| I don't understand how 36 GB can be justified for what amounts to
| a GPU driver.
| greenavocado wrote:
| It's not just you; AMD manages to completely shit-up the Linux
| kernel with their drivers:
| https://www.phoronix.com/news/AMD-5-Million-Lines
| anthk wrote:
| OpenBSD, too.
| striking wrote:
| > Of course, much of that is auto-generated header files... A
| large portion of it with AMD continuing to introduce new
| auto-generated header files with each new generation/version
| of a given block. These verbose header files has been AMD's
| alternative to creating exhaustive public documentation on
| their GPUs that they were once known for.
| NekkoDroid wrote:
| There have been talks about moving those headers to a
| separate repo and only including the needed headers
| upstream[1]
|
| [1]: https://gitlab.freedesktop.org/drm/amd/-/issues/3636
| atq2119 wrote:
| So no doubt modern software is ridiculously bloated, but ROCm
| isn't just a GPU driver. It includes all sorts of tools and
| libraries as well.
|
| By comparison, if you go and download the CUDA toolkit as a
| single file, you get a download file that's over 4GB, so quite
| a bit larger than the download size you quoted. I haven't
| checked how much that expands to (it seems the ROCm install has
| a lot of redundancy given how well it compresses), but the
| point is, you get something that seems insanely large either
| way.
| tomxor wrote:
| I suspected that, but any binaries being that large just seems
| wrong. I mean, the whole thing is 35 times larger than my
| entire OS install.
|
| Do you know what is included in ROCm that could be so big?
| Does it include training datasets or something?
| skirmish wrote:
| My understanding is that ROCm contains all included kernels
| for each supported architecture, so it would have (made up):
|
|     -- matrix multiply 2048x2048 for Navi 31,
|     -- same for Navi 32,
|     -- same for Navi 33,
|     -- same for Navi 21,
|     -- same for Navi 22,
|     -- same for Navi 23,
|     -- same for Navi 24, etc.
|     -- matrix multiply 4096x4096 for Navi 31,
|     -- ...
| burnte wrote:
| GPU drivers are complete OSes that run on the GPUs now.
| steeve wrote:
| You can look us up at https://github.com/zml/zml; we fix that.
| andyferris wrote:
| Wait, looking at that link I don't see how it avoids
| downloading CUDA or ROCm. Do you use MLIR to compile to GPU
| without using the vendor-provided tooling at all?
| sandGorgon wrote:
| Is anyone using the new HX370-based laptops for any LLM work?
| I mean, the ipex-llm libraries for Intel's new Lunar Lake
| already support Llama 3.2
| (https://www.intel.com/content/www/us/en/developer/articles/t...),
| but AMD's new Zen 5 chips don't seem to see much activity
| here.
| dhruvdh wrote:
| Why would you use this over vLLM?
| fazkan wrote:
| We have vLLM in certain production instances; it is a pain for
| most non-Nvidia architectures. A bit of digging around and we
| realized that most of it is just a wrapper on top of PyTorch
| function calls. If we can do without the batch processing that
| vLLM supports, we're good, and that is what we did here.
| dhruvdh wrote:
| Batching is how you get ~350 tokens/sec on Qwen 14b on vLLM
| (7900XTX). By running 15 requests at once.
|
| Also, there is a Dockerfile.rocm at the root of vLLM's repo.
| How is it a pain?
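|
| For reference, that kind of throughput comes from vLLM's
| offline batched API; a rough sketch (model name is a
| placeholder):
|
|     from vllm import LLM, SamplingParams
|
|     llm = LLM(model="Qwen/Qwen2.5-14B-Instruct")  # placeholder
|     params = SamplingParams(max_tokens=256)
|     prompts = [f"Question {i}: ..." for i in range(15)]
|     # vLLM schedules these with continuous batching internally
|     outputs = llm.generate(prompts, params)
|     for o in outputs:
|         print(o.outputs[0].text)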
| fazkan wrote:
| Driver mismatch issues. We mostly use publicly available
| instances, so the drivers change as the instances change,
| according to their base image. Not saying it won't work, but
| it was more painful to figure out vLLM than to write a simple
| inference script and do it ourselves.
| white_waluigi wrote:
| Isn't this just a wrapper for huggingface-transformers?
| fazkan wrote:
| Yes, but it handles all the dependencies for the AMD
| architecture, so technically it's just a requirements file :).
| (Author of the repo above.)
| freeqaz wrote:
| What's the best bang-for-your-buck AMD GPU these days? I just
| bought 2 used 3090s for $750ish refurb'd on eBay. Curious what
| others are using for running LLMs locally.
| 3abiton wrote:
| Personal experience: it's not even worth it. AMD (i)GPUs break
| with every PyTorch, ROCm, xformers, or Ollama update. You'll
| sleep more comfortably at night.
| fazkan wrote:
| That's our observation, which is why we wrote the scripts
| ourselves; that way we can at least control the dependencies.
| sangnoir wrote:
| When dealing with ROCM, it's critical that once you have a
| working configuration, you freeze everything in place (except
| your application). Docker is one way to achieve this if your
| host machine is subject to kernel or package updates
| elorant wrote:
| Probably the 7900 XTX. $1k for 24GB of VRAM.
| freeqaz wrote:
| That's about the same price as a 3090 and it's also 24GB. Are
| they faster at inference?
| elorant wrote:
| I doubt it, but the 3090 is a four year old card which
| means it might have a lot of mileage from the previous
| owner. A lot of them are from mining rigs.
| sipjca wrote:
| it is not, at least in llama.cpp/llamafile
|
| https://benchmarks.andromeda.computer/compare
| stefan_ wrote:
| This seems to be some AI generated wrapper around a wrapper of a
| wrapper.
|
| > # Other AMD-specific optimizations can be added here
|
| > # For example, you might want to set specific flags or use AMD-
| optimized libraries
|
| What are we doing here, then?
| fazkan wrote:
| It's just a big requirements file and a Dockerfile :) The rest
| are mostly helper scripts.
| phkahler wrote:
| Does it work with an APU? I just put 64GB in my system and gonna
| drop in a 5700G. Will that be enough? SFF inference if so.
| Const-me wrote:
| The integrated GPU of the 5700G uses an old architecture from
| 2017 (https://en.wikipedia.org/wiki/Radeon_RX_Vega_series).
| Pretty sure it does not support ROCm.
|
| BTW, if you just want to play with a local LLM, you can try my
| old port of Mistral:
| https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral...
| Unlike CUDA or ROCm, my port is based on the Direct3D 11 GPU
| API and runs on all GPUs regardless of brand.
| fazkan wrote:
| @Const-me, according to this it should work:
| https://github.com/ROCm/ROCm/issues/2216
| fazkan wrote:
| Haven't tested it, but it should work, according to
| https://github.com/ROCm/ROCm/issues/2216
|
| You just need to update the version check here
|
| https://github.com/slashml/amd_inference/blob/4b9ec069c4b2ac...
|
| Feel free to open an issue with the requirements and we will
| test it.
| BaculumMeumEst wrote:
| How about they follow up the 7900 XTX with a card that actually
| has some VRAM?
| skirmish wrote:
| They prefer you pay $3,600 for AMD Radeon Pro W7900, 48GB VRAM.
___________________________________________________________________
(page generated 2024-10-02 23:00 UTC)