[HN Gopher] AMD GPU Inference
       ___________________________________________________________________
        
       AMD GPU Inference
        
       Author : fazkan
       Score  : 142 points
       Date   : 2024-10-02 07:16 UTC (15 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | a2128 wrote:
        | It seems to use an old version of ROCm (5.4.2, about two years
        | old), which I doubt would support my RX 7900 XTX. I personally
        | found it easiest to just use the latest `rocm/pytorch` image and
        | run what I need from there.
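        | 
        | Roughly like this (the tag and device flags here are assumptions,
        | lifted from AMD's container docs; adjust for your setup):
        | 
        |     docker run -it --rm \
        |       --device=/dev/kfd --device=/dev/dri \
        |       --group-add video --ipc=host \
        |       --security-opt seccomp=unconfined \
        |       rocm/pytorch:latest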
        
         | slavik81 wrote:
         | The RX 7900 XTX (gfx1100) was first enabled in the math
         | libraries (e.g. rocBLAS) for ROCm 5.4, but I don't think the AI
         | libraries (e.g. MIOpen) had it enabled until ROCm 5.5. I
         | believe the performance improved significantly in later
         | releases, as well.
        
       | ashirviskas wrote:
        | I'm all for having more open source projects, but I do not see
        | how this one can be useful in this ecosystem, especially for
        | people with newer AMD GPUs (not supported by this project) that
        | are already supported in most popular projects.
        
         | fazkan wrote:
          | Just something that we found helpful; support for new
          | architectures is just a package update. This is more of a
          | cookie cutter.
        
       | lhl wrote:
        | For inference, if you have a supported card (or probably a
        | supported architecture, if you are on Linux and can use
        | HSA_OVERRIDE_GFX_VERSION), then you can probably run anything
        | with (upstream) PyTorch and transformers. Also, compiling
        | llama.cpp has been pretty trouble-free for me for at least a
        | year.
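        | 
        | A minimal sketch of the override (assuming an RDNA3 card that can
        | pretend to be gfx1100; other architectures need a different
        | value):
        | 
        |     # spoof the GPU target so libraries built for gfx1100 load
        |     export HSA_OVERRIDE_GFX_VERSION=11.0.0
        |     python -c "import torch; print(torch.cuda.is_available())"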
       | 
       | (If you are on Windows, there is usually a win-hip binary of
       | llama.cpp in the project's releases or if things totally refuse
       | to work, you can use the Vulkan build as a (less performant)
       | fallback).
       | 
       | Having more options can't hurt, but ROCm 5.4.2 is almost 2 years
       | old, and things have come a long way since then, so I'm curious
       | about this being published freshly today, in October 2024.
       | 
       | BTW, I recently went through and updated my compatibility doc
       | (focused on RDNA3) w/ ROCm 6.2 for those interested. A lot has
       | changed just in the past few months (upstream bitsandbytes,
       | upstream xformers, and Triton-based Flash Attention):
       | https://llm-tracker.info/howto/AMD-GPUs
        
         | woodrowbarlow wrote:
          | I have also been playing with inference on the AMD 7900 XTX,
          | and I agree: there are no hoops to jump through these days.
          | Just make sure to install the ROCm version of torch (if using
          | a1111 or similar, don't trust requirements.txt), as shown
          | clearly on the PyTorch homepage. obsidian is a similar story.
          | HIP is straightforward, at least on Arch and Ubuntu (Fedora
          | still requires some twiddling, though). I didn't realize
          | xformers is also functional! That's good news.
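          | 
          | Something along these lines, for example (the ROCm version in
          | the index URL is an assumption; the PyTorch homepage shows the
          | current one):
          | 
          |     pip install torch torchvision \
          |       --index-url https://download.pytorch.org/whl/rocm6.1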
        
         | conshama wrote:
          | Related: https://www.nonbios.ai/post/deploying-
          | large-405b-models-in-f...
          | 
          | tl;dr: uses the latest ROCm 6.2 to run full-precision inference
          | for Llama 405B on a single node with 8x MI300X AMD GPUs.
          | 
          | How mature do you think the ROCm 6.2 AMD stack is compared to
          | Nvidia's?
        
           | fazkan wrote:
           | this uses vllm?
        
         | qamononep wrote:
          | It would be great if you included a section on running with
          | Docker on Linux. The only one that worked out of the box for me
          | was Ollama, and it had an example:
          | https://github.com/ollama/ollama/blob/main/docs/docker.md
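          | 
          | For reference, the command from those docs that worked for me
          | was roughly this (the model name is just an example):
          | 
          |     docker run -d --device /dev/kfd --device /dev/dri \
          |       -v ollama:/root/.ollama -p 11434:11434 \
          |       --name ollama ollama/ollama:rocm
          |     docker exec -it ollama ollama run llama3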
         | 
          | llama.cpp has a Docker image but no examples of how to run it:
          | https://github.com/ggerganov/llama.cpp/blob/master/docs/dock...
          | 
          | koboldcpp has a Docker image but no examples of how to run it:
          | https://github.com/LostRuins/koboldcpp?tab=readme-ov-file#do...
          | 
          | The text-generation-webui-docker image was broken for me on a
          | 7800 XT running RHEL 9:
          | https://github.com/Atinoda/text-generation-webui-docker
        
           | fazkan wrote:
            | Good feedback, thanks. Would you be able to open an issue?
        
       | leonheld wrote:
       | People use "Docker-based" all the time but what they mean is that
       | they ship $SOFTWARE in a Docker image.
       | 
       | "Docker-based" reads, to me, as if you were doing Inference on
       | AMD cards with Docker somehow, which doesn't make sense.
        
         | a_vanderbilt wrote:
         | You can do inference from a Docker container, just as you'd do
         | it with NVidia. OpenAI runs a K8s cluster doing this. I have
         | personally only worked with NVidia, but the docs are present
         | for AMD too.
         | 
         | Like anything AI and AMD, you need the right card(s) and rocm
         | version along with sheer dumb luck to get it working. AMD has
         | Docker images with rocm support, so you could merge your app in
         | with that as the base layer. Just pass through the GPU to the
         | container and you should get it working.
         | 
         | It might just be the software in a Docker image, but it removes
         | a variable I would otherwise have to worry about during
         | deployment. It literally is inference on AMD with Docker, if
         | that's what you meant.
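          | 
          | A minimal sketch of that layering (the image tag and app layout
          | are assumptions):
          | 
          |     # Dockerfile: your app on top of AMD's ROCm PyTorch image
          |     FROM rocm/pytorch:latest
          |     COPY . /app
          |     RUN pip install -r /app/requirements.txt
          | 
          |     # run with the GPU devices passed through to the container:
          |     #   docker run --device=/dev/kfd --device=/dev/dri \
          |     #     --group-add video my-app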
        
         | mikepurvis wrote:
         | Docker became part of the standard toolkit for ML because
         | deploying Python that links to underlying system libraries is a
         | gong show unless you ship that layer too.
        
           | tannhaeuser wrote:
           | Even Docker doesn't guarantee reproducible results due to
           | sensitivity towards host GPU drivers, and ML
            | frontends/integrations bringing their own "helpful" newbie-
            | friendly all-in-one dependency checks and updater services.
        
         | jeffhuys wrote:
         | Why doesn't it make sense? You can talk to devices from a
         | Docker container - you just have to attach it.
        
         | fazkan wrote:
          | You can mount a specific device into a Docker container. If you
          | read the script, we are mounting the GPUs:
          | 
          | https://github.com/slashml/amd_inference/blob/main/run-docke...
        
         | steeve wrote:
          | Hi, we (ZML) fix that: https://github.com/zml/zml
        
           | fazkan wrote:
           | This is pretty cool. Is there a document that shows which AMD
           | drivers are supported out of the box?
        
             | steeve wrote:
             | We are in line with ROCm 6.2 support. We actually just
             | opened a PR to bump to 6.2.2:
             | https://github.com/zml/zml/pull/39
        
       | khurdula wrote:
        | Are we supposed to use AMD GPUs for this to work? Or does it work
        | on any GPU?
        
         | karamanolev wrote:
         | > This project provides a Docker-based inference engine for
         | running Large Language Models (LLMs) on AMD GPUs.
         | 
         | First sentence of the README in the repo. Was it somehow
         | unclear?
        
       | tomxor wrote:
        | I almost tried to install AMD rocm a while ago after discovering
        | the simplicity of llamafile.
        | 
        |     sudo apt install rocm
        | 
        |     Summary:
        |       Upgrading: 0, Installing: 203, Removing: 0, Not Upgrading: 0
        |       Download size: 2,369 MB / 2,371 MB
        |       Space needed: 35.7 GB / 822 GB available
       | 
       | I don't understand how 36 GB can be justified for what amounts to
       | a GPU driver.
        
         | greenavocado wrote:
         | It's not just you; AMD manages to completely shit-up the Linux
         | kernel with their drivers:
         | https://www.phoronix.com/news/AMD-5-Million-Lines
        
           | anthk wrote:
           | OpenBSD, too.
        
           | striking wrote:
           | > Of course, much of that is auto-generated header files... A
           | large portion of it with AMD continuing to introduce new
           | auto-generated header files with each new generation/version
           | of a given block. These verbose header files has been AMD's
           | alternative to creating exhaustive public documentation on
           | their GPUs that they were once known for.
        
             | NekkoDroid wrote:
             | There have been talks about moving those headers to a
             | separate repo and only including the needed headers
             | upstream[1]
             | 
             | [1]: https://gitlab.freedesktop.org/drm/amd/-/issues/3636
        
         | atq2119 wrote:
         | So no doubt modern software is ridiculously bloated, but ROCm
         | isn't just a GPU driver. It includes all sorts of tools and
         | libraries as well.
         | 
         | By comparison, if you go and download the CUDA toolkit as a
         | single file, you get a download file that's over 4GB, so quite
         | a bit larger than the download size you quoted. I haven't
         | checked how much that expands to (it seems the ROCm install has
         | a lot of redundancy given how well it compresses), but the
         | point is, you get something that seems insanely large either
         | way.
        
           | tomxor wrote:
            | I suspected that, but any binaries being that large just
            | seems wrong. I mean, the whole thing is 35 times larger than
            | my entire OS install.
           | 
           | Do you know what is included in ROCm that could be so big?
           | Does it include training datasets or something?
        
             | skirmish wrote:
              | My understanding is that ROCm contains all included kernels
              | for each supported architecture, so it would have (made up):
              | 
              |     -- matrix multiply 2048x2048 for Navi 31,
              |     -- same for Navi 32,
              |     -- same for Navi 33,
              |     -- same for Navi 21,
              |     -- same for Navi 22,
              |     -- same for Navi 23,
              |     -- same for Navi 24, etc.
              |     -- matrix multiply 4096x4096 for Navi 31,
              |     -- ...
        
         | burnte wrote:
          | GPU drivers are complete OSes that run on the GPU now.
        
         | steeve wrote:
          | You can look us up at https://github.com/zml/zml; we fix that.
        
           | andyferris wrote:
           | Wait, looking at that link I don't see how it avoids
           | downloading CUDA or ROCM. Do you use MLIR to compile to GPU
           | without using the vendor provided tooling at all?
        
       | sandGorgon wrote:
        | Is anyone using the new HX370-based laptops for any LLM work? I
        | mean, the ipex-llm libraries for Intel's new Lunar Lake already
        | support Llama 3.2 (https://www.intel.com/content/www/us/en/dev
        | eloper/articles/t...), but AMD's new Zen 5 chips don't seem to be
        | very active here.
        
       | dhruvdh wrote:
       | Why would you use this over vLLM?
        
         | fazkan wrote:
          | We have vLLM in certain production instances; it is a pain for
          | most non-Nvidia architectures. A bit of digging around and we
          | realized that most of it is just a wrapper on top of PyTorch
          | function calls. If we can do away with the batch processing
          | that vLLM supports, we are good; this is what we did here.
        
           | dhruvdh wrote:
           | Batching is how you get ~350 tokens/sec on Qwen 14b on vLLM
           | (7900XTX). By running 15 requests at once.
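            | 
            | Roughly like this with vLLM's offline Python API (the model
            | name and sampling settings are just examples):
            | 
            |     from vllm import LLM, SamplingParams
            | 
            |     llm = LLM(model="Qwen/Qwen2.5-14B-Instruct")
            |     params = SamplingParams(temperature=0.7, max_tokens=256)
            |     prompts = [f"Prompt {i}" for i in range(15)]
            |     # all 15 requests are batched inside one generate() call
            |     outputs = llm.generate(prompts, params)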
           | 
           | Also, there is a Dockerfile.rocm at the root of vLLM's repo.
           | How is it a pain?
        
             | fazkan wrote:
              | Driver mismatch issues. We mostly use publicly available
              | instances, so the drivers change as the instances change,
              | according to their base image. Not saying it won't work,
              | but it was more painful to figure out vLLM than to write a
              | simple inference script and do it ourselves.
        
       | white_waluigi wrote:
       | Isn't this just a wrapper for huggingface-transformers?
        
         | fazkan wrote:
          | Yes, but it handles all the dependencies for the AMD
          | architecture. So technically it's just a requirements file :).
          | Author of the repo above.
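          | 
          | A minimal sketch of the kind of transformers flow it wraps (the
          | model name is just an example; with a ROCm build of PyTorch the
          | GPU is addressed through the regular torch.cuda API):
          | 
          |     import torch
          |     from transformers import AutoModelForCausalLM, AutoTokenizer
          | 
          |     model_id = "meta-llama/Llama-3.1-8B-Instruct"
          |     tok = AutoTokenizer.from_pretrained(model_id)
          |     model = AutoModelForCausalLM.from_pretrained(
          |         model_id, torch_dtype=torch.float16)
          |     model.to("cuda")  # ROCm's HIP backend shows up as "cuda"
          |     inputs = tok("Hello", return_tensors="pt").to("cuda")
          |     out = model.generate(**inputs, max_new_tokens=64)
          |     print(tok.decode(out[0], skip_special_tokens=True))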
        
       | freeqaz wrote:
       | What's the best bang-for-your-buck AMD GPU these days? I just
       | bought 2 used 3090s for $750ish refurb'd on eBay. Curious what
       | others are using for running LLMs locally.
        
         | 3abiton wrote:
          | Personal experience: it's not even worth it. AMD (i)GPUs break
          | with every PyTorch, ROCm, xformers, or ollama update. You'll
          | sleep more comfortably at night.
        
           | fazkan wrote:
            | That's our observation, which is why we wrote the scripts
            | ourselves; that way we can at least control the dependencies.
        
           | sangnoir wrote:
            | When dealing with ROCm, it's critical that once you have a
            | working configuration, you freeze everything in place (except
            | your application). Docker is one way to achieve this if your
            | host machine is subject to kernel or package updates.
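            | 
            | For example, pinning the base image by digest instead of a
            | moving tag (the digest here is a placeholder):
            | 
            |     FROM rocm/pytorch@sha256:<digest-of-known-good-image>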
        
         | elorant wrote:
            | Probably the 7900 XTX. $1k for 24GB of VRAM.
        
           | freeqaz wrote:
           | That's about the same price as a 3090 and it's also 24GB. Are
           | they faster at inference?
        
             | elorant wrote:
             | I doubt it, but the 3090 is a four year old card which
             | means it might have a lot of mileage from the previous
             | owner. A lot of them are from mining rigs.
        
             | sipjca wrote:
             | it is not, at least in llama.cpp/llamafile
             | 
             | https://benchmarks.andromeda.computer/compare
        
       | stefan_ wrote:
       | This seems to be some AI generated wrapper around a wrapper of a
       | wrapper.
       | 
       | > # Other AMD-specific optimizations can be added here
       | 
       | > # For example, you might want to set specific flags or use AMD-
       | optimized libraries
       | 
       | What are we doing here, then?
        
         | fazkan wrote:
          | It's just a big requirements file and a Dockerfile :) The rest
          | are mostly helper scripts.
        
       | phkahler wrote:
       | Does it work with an APU? I just put 64GB in my system and gonna
       | drop in a 5700G. Will that be enough? SFF inference if so.
        
         | Const-me wrote:
          | The integrated GPU of the 5700G uses an old architecture from
          | 2017, this one:
          | https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Pretty sure
          | it does not support ROCm.
          | 
          | BTW, if you just want to play with a local LLM, you can try my
          | old port of Mistral: https://github.com/Const-
          | me/Cgml/tree/master/Mistral/Mistral... Unlike CUDA or ROCm, my
          | port is based on the Direct3D 11 GPU API and runs on all GPUs
          | regardless of brand.
        
           | fazkan wrote:
           | @Const-me according to this it should work,
           | https://github.com/ROCm/ROCm/issues/2216
        
         | fazkan wrote:
          | Haven't tested it, but it should work according to
          | https://github.com/ROCm/ROCm/issues/2216
          | 
          | You just need to update the version check here:
          | 
          | https://github.com/slashml/amd_inference/blob/4b9ec069c4b2ac...
          | 
          | Feel free to open an issue with the requirements and we will
          | test it.
        
       | BaculumMeumEst wrote:
        | How about they follow up the 7900 XTX with a card that actually
        | has some VRAM?
        
         | skirmish wrote:
         | They prefer you pay $3,600 for AMD Radeon Pro W7900, 48GB VRAM.
        
       ___________________________________________________________________
       (page generated 2024-10-02 23:00 UTC)