[HN Gopher] Stable Diffusion on AMD RDNA3
       ___________________________________________________________________
        
       Stable Diffusion on AMD RDNA3
        
       Author : tomtomlapomme
       Score  : 144 points
       Date   : 2022-12-21 08:53 UTC (14 hours ago)
        
 (HTM) web link (nod.ai)
 (TXT) w3m dump (nod.ai)
        
       | chem83 wrote:
       | > SHARK is an open source cross platform (Windows, macOS and
       | Linux) Machine Learning Distribution packaged with torch-mlir
       | (for seamless PyTorch integration), LLVM/MLIR for re-targetable
       | compiler technologies along with IREE (for efficient codegen,
       | compilation and runtime) and Nod.ai's tuning. IREE is part of the
       | OpenXLA Project
       | 
        | Google has been doing a good job advancing the IREE ML
        | compiler project, which I think is what will bring other HW
        | platforms like AMD and Intel to the ML game. The industry
        | can only benefit from increased hardware portability.
        
       | thunkshift1 wrote:
        | Can someone explain what exactly nod.ai does? It's not clear
        | at all from their page.
        
       | imhoguy wrote:
       | Any chance to get SD running on mobile Ryzen APU e.g. Ryzen Pro
       | 4750U (Renoir)?
        
         | sosborn wrote:
         | This worked for me with a 5700xt:
         | 
         | https://www.travelneil.com/stable-diffusion-windows-amd.html
        
           | negativegate wrote:
           | nod-ai/SHARK from the original submission is by far the
           | fastest way I've found to run Stable Diffusion on a 5700 XT.
           | 
           | For 50 iterations:
           | 
           | * ONNX on Windows was 4-5 minutes
           | 
           | * ROCm on Arch Linux was ~2.5 minutes
           | 
           | * SHARK on Windows is ~30 seconds
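Those timings can be turned into rough throughput numbers with a quick back-of-the-envelope calculation (taking 270 s as the midpoint of the quoted 4-5 minute ONNX range; the labels are just shorthand for the configurations above):

```python
# Back-of-the-envelope throughput from the reported timings for
# 50 iterations of Stable Diffusion on a 5700 XT.
ITERATIONS = 50
timings_s = {
    "ONNX (Windows)": 270,     # midpoint of the 4-5 minute range
    "ROCm (Arch Linux)": 150,  # ~2.5 minutes
    "SHARK (Windows)": 30,     # ~30 seconds
}

# Convert each total runtime into iterations per second.
for backend, seconds in timings_s.items():
    print(f"{backend}: {ITERATIONS / seconds:.2f} it/s")
```

By that estimate, SHARK is roughly 5x faster than the ROCm path and roughly 9x faster than ONNX on the same card.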
        
         | magic_at_nodai wrote:
          | Can you give SHARK a try and let us know on our Discord?
          | We can try to help. People have been using it on older AMD
          | GPUs back to the Polaris arch.
        
         | delijati wrote:
          | Short answer: no. Long answer: "in theory" yes. I tried
          | this [1] but gave up, as building ROCm + deps takes up to
          | 6h :/ Official statement: [2]
         | 
         | [1] https://github.com/xuhuisheng/rocm-build [2]
         | https://github.com/RadeonOpenCompute/ROCm/issues/1587
        
           | nicolaslem wrote:
           | For anyone on Arch, there is a third-party repository called
           | arch4edu[0] that provides up to date builds of ROCm and its
           | dependencies. On my iGPU, OpenCL sometimes works, sometimes
           | crashes. Even finding a list of supported hardware is close
           | to impossible. The whole situation is just ridiculous and
           | makes AMD look bad.
           | 
           | [0] https://github.com/arch4edu/arch4edu
        
             | my123 wrote:
             | AMD doesn't actually care.
             | 
              | For them, GPGPU is a pro-level feature not worth
              | supporting on most consumer GPUs. They are doing much
              | more feature segmentation than NVIDIA ever did.
        
       | AstixAndBelix wrote:
        | Does anyone know the current state of AMD's tools for
        | migrating from CUDA? There's so much untapped potential with
        | these cards; it's crazy that basically only gamers can make
        | use of their competitive prices.
        
         | epmaybe wrote:
          | I don't think there's truly a competitor, but OpenCL is
          | the alternative to shoot for. Otherwise, for machine
          | learning purposes, AMD helps develop ROCm.
        
           | pjmlp wrote:
            | OpenCL is hardly an alternative: plain old C, compiled
            | from source at runtime, with very basic tooling
            | available.
           | 
            | Versus a polyglot compiler infrastructure, IDE tooling
            | that includes shader debugging, and a rich ecosystem of
            | GPU-based libraries.
           | 
            | Even with SYCL and SPIR-V, that has hardly improved, and
            | while Intel bases oneAPI on top of SYCL, oneAPI
            | naturally also goes beyond the standard.
        
             | amelius wrote:
              | Shouldn't we have an API that can speak to both CUDA
              | and OpenCL? Or is OpenCL sufficiently capable?
        
               | pjmlp wrote:
                | No, it isn't, because it lacks CUDA's polyglot
                | infrastructure. It now has SPIR-V, but hardly anyone
                | targets it, as PTX gets used instead.
        
               | snvzz wrote:
                | I understand AMD HIP is a CUDA clone, where library
                | functions have the same syntax but with "hip"
                | replacing "cuda" in the function names.
                | 
                | Behind the scenes, it can target AMD and NVIDIA
                | hardware alike. Thus, the idea is that through
                | typically negligible effort porting to HIP, your
                | code becomes vendor-independent.
               | 
               | In practice, I do not know how true this is.
        
               | my123 wrote:
                | > Thus, the idea is that through typically
                | negligible effort porting to HIP, your code becomes
                | vendor-independent.
               | 
               | Here, the big AMD mistake was to rename those function
               | prefixes in the first place. It's a mistake that they
               | could have avoided...
               | 
                | What a lot of SW codebases did to support AMD (see
                | PyTorch code notably): the codebase is still CUDA,
                | with the conversion pass to HIP done at build time.
               | 
               | See https://github.com/ROCm-Developer-
               | Tools/HIPIFY/blob/amd-stag... for the Perl script to do
               | it.
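As a rough illustration of what such a build-time conversion pass does, here is a toy Python stand-in for the real HIPIFY script. The real tool handles hundreds of APIs, header names, and edge cases; the mapping table below is a tiny illustrative subset:

```python
import re

# Toy subset of the CUDA -> HIP identifier mappings that a
# hipify-style source translator applies at build time.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cuda_runtime.h": "hip/hip_runtime.h",
}

def hipify(source: str) -> str:
    # Replace longer names first so a short mapping never clobbers
    # the prefix of a longer identifier.
    for cuda_name in sorted(CUDA_TO_HIP, key=len, reverse=True):
        source = re.sub(re.escape(cuda_name), CUDA_TO_HIP[cuda_name], source)
    return source

cuda_src = "#include <cuda_runtime.h>\ncudaMalloc(&p, n); cudaFree(p);"
print(hipify(cuda_src))
```

The point is that the translation is almost purely textual, which is why keeping the codebase in CUDA and hipifying at build time is workable, and also why renaming the prefixes in the first place bought AMD so little.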
               | 
               | Then comes the problem of AMD not supporting ROCm HIP on
               | most of their hardware or user base.
               | 
               | On Windows, the ROCm HIP SDK is private and only
               | available under NDA. This means that while you can use
               | Blender w/ HIP on Windows, the Blender builds that you
               | compile yourself will not be able to use ROCm HIP.
               | 
                | On Linux, the supported GPUs are few and far
                | between: only Vega20 onwards are supported today.
                | APUs, RDNA1, and lower-end RDNA2 (6700 XT and below)
                | are excluded without unsupported hacks.
        
               | tormeh wrote:
               | It's quite baffling. AMD is behaving like an incumbent
               | trying to segment users etc. when they really should
               | behave more like an upstart trying to make things easy.
               | But their drivers for Linux are the best, so I don't
               | think I'll switch to Nvidia...
        
               | paulmd wrote:
               | > What a lot of SW codebases did to support AMD (see
               | PyTorch code notably): codebase is still CUDA, have the
               | conversion pass to HIP done at build time.
               | 
               | This is sort of echoed in AMD's stance on FSR2/upscaling,
               | where they have _explicitly stated_ they will not support
               | any API that allows plugging, regardless of whether the
               | API is open-source or not, or who owns it, because a
               | pluggable API might allow plugging proprietary
                | implementations. Their opinion is that their
                | solution is what's best for everyone, in every
                | situation, so you don't
               | really need DLLs or pluggability because why would you
               | want to plug something worse? FSR2 is the best for
               | everyone and you should really just be compiling it
               | directly into your application.
               | 
               | https://youtu.be/8ve5dDQ6TQE?t=974
               | 
               | (this of course also makes it impossible to update
               | versions of FSR2 if better ones come out subsequently -
               | you can't do the DLSS thing where you swap in newer DLLs
               | (at your own risk of course) and benefit from later
               | improvements to a modular grouping of code. You know,
               | sort of the whole concept of libraries in the first
               | place...)
               | 
               | The HIP stuff is the same thing... AMD really wants you
               | to convert once to HIP and be locked in forever, because
               | Theirs Is The Best, Why Would You Need Anything Else? But
               | of course HIP doesn't have a PTX-like concept so you
               | really need to distribute as source and compile
               | everything at runtime... because who would want library
               | code or dynamic linking?
               | 
               | Anyway, like, I know it's not really a shocker but the
               | "we love open-source!" thing is a bit of an act. They
               | love it when it's an angle for them as the underdog to
               | leverage their way into marketshare... and as the
               | underdog when it's not favorable for them (like FSR)
               | they'll abandon their pro-freeness stance. And they too
               | have their closed, proprietary technologies (like their
               | CXL alternative that only works with their
               | CPU+peripherals and nobody else can use) that they don't
               | open up either. Nor is AMD racing to open up chipsets
               | (like the NForce or Abit days) either, that's all locked
               | down and proprietary too.
               | 
               | I know that's not really a shocker when you put it like
               | that, but, AMD really gets a ton of the benefit-of-the-
               | doubt all the time. They have on _multiple occasions_
                | shipped defective/marginal silicon at launch, for
               | example, and it all just gets brushed over and people
               | forget all about it. Both Zen2 (low-quality silicon in
               | the launch batch meant chips were missing advertised
               | boost clocks by 10%+) and RDNA1 had massive incidents of
               | the community downplaying very real problems because AMD
               | Is Good Now, many of the affected users never had their
               | problems resolved and they just kinda sighed and lived
               | with it or sold the hardware and bought something better,
               | and the fans swept it all under the rug and never talked
               | of it again. Same for pandemic profiteering (while Intel
               | cut prices), etc. There's just a ton of shit that people
               | bend over backwards to find justifications for with AMD
               | that just wouldn't fly with more reputable vendors.
        
             | schmorptron wrote:
              | Do you have an opinion on the new OpenCL
              | implementation that recently got merged into Mesa? It
              | doesn't touch on tooling or the other points you
              | mentioned, but performance seems to be pretty good!
             | 
             | https://www.phoronix.com/news/Rusticl-2022-XDC-State
        
               | pjmlp wrote:
               | No, it doesn't seem to matter for what makes CUDA
               | relevant anyway.
        
               | my123 wrote:
               | It's still a heavy work in progress. Not usable for SYCL
               | programs as SVM isn't implemented yet.
        
         | CodeArtisan wrote:
          | Performance in Blender is atrocious; the RX 7900 XTX is
          | noticeably slower than an RTX 3060.
        
           | snvzz wrote:
            | The latest Blender release does not yet include the
            | optimization work.
            | 
            | AIUI, what's in current git master is very different.
        
         | marcyb5st wrote:
          | Last time I seriously checked (6 months ago or so), ROCm
          | was still a far cry from CUDA. Setup was a mess, support
          | was hit-and-miss, and some operations were not
          | particularly performant compared to their CUDA
          | counterparts. Additionally, there are TensorFlow and
          | probably PyTorch forks that should work with it, but they
          | lag behind the official repositories quite a bit.
         | 
          | I hope that now that generative AI is becoming mainstream,
          | AMD steps up its game on both its consumer and
          | professional lineups. If I were to buy a video card right
          | now (mostly for gaming + ML hobby projects + running
          | Stable Diffusion), I wouldn't pick AMD, because I could do
          | just one of my three use cases properly without headaches
          | (gaming).
        
           | rowanG077 wrote:
            | OpenCL works pretty well. I can't say I notice large
            | performance gaps between CUDA and OpenCL in my HPC work.
        
             | lalaland1125 wrote:
             | Have you done any benchmarks with vulkan?
        
               | rowanG077 wrote:
               | No I haven't used vulkan for compute.
        
             | my123 wrote:
             | Thankfully for a good chunk of number crunching that works
             | fine. But the other side of the coin is notably AI
             | workloads. There's no OpenCL or Vulkan standard for
             | exposing matrix units, only vendor specific ones.
             | 
             | For OpenCL: cl_qcom_ml_ops (Qualcomm) notably, for Vulkan:
             | VK_NV_cooperative_matrix (NVIDIA)
        
       | tehsauce wrote:
       | "There has also been a wide variety of accuracy-degrading
       | performance optimizations like Xformers and Flash Attention,
       | which are great tools if you are open to trading accuracy for
       | performance"
       | 
        | This is incorrect. Those optimizations do identical
        | computations but leverage memory bandwidth on the GPU more
        | effectively, so there is no accuracy tradeoff there.
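The reason an exact tiled attention is possible at all is the online-softmax trick that Flash Attention builds on: the softmax can be computed in one streaming pass, keeping only a running max and a running sum, so no accuracy is given up relative to the naive version. A minimal pure-Python sketch with made-up scores (not the actual Flash Attention kernel):

```python
import math

def softmax(xs):
    # Naive softmax: needs the whole row in memory at once.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def online_softmax(xs):
    # Streaming pass keeping only a running max m and running sum s,
    # rescaling s whenever a new max is found. Mathematically
    # identical to the naive version; only the memory traffic differs.
    m, s = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / s for x in xs]

scores = [0.3, 2.1, -1.0, 0.7]  # made-up attention scores
print(max(abs(a - b) for a, b in zip(softmax(scores), online_softmax(scores))))
```

Any difference between the two is floating-point rounding noise, which is the sense in which these kernels are "identical computations".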
        
         | magic_at_nodai wrote:
          | Here is a list of potential issues:
          | https://github.com/AUTOMATIC1111/stable-diffusion-webui/disc...
         | 
          | That said, we (the Nod.ai team) will add support for
          | xformers soon, so you can opt in to it anyway.
        
       | ggerganov wrote:
       | > There has also been a wide variety of accuracy-degrading
       | performance optimizations like Xformers and Flash Attention,
       | which are great tools if you are open to trading accuracy for
       | performance ..
       | 
       | I wasn't aware that Flash Attention trades accuracy for
       | performance. Either I have a wrong understanding of what FA is,
       | or this statement is not fully accurate.
       | 
       | Either way - looks like great work
        
         | marcyb5st wrote:
         | From the flash attention paper:
         | 
         | We also extend FlashAttention to block-sparse attention,
         | yielding an approximate attention algorithm that is faster than
         | any existing approximate attention method.
         | 
         | So I assume they are using the approximate version as they also
         | have an exact version.
        
           | ggerganov wrote:
            | Thanks for that - I had missed the block-sparse
            | extension of the algorithm when I first read about it.
            | And indeed this seems to be what the author means.
        
       | lalaland1125 wrote:
        | I really wish more GPU libraries had focused on Vulkan
        | instead of CUDA...
        
         | andy_ppp wrote:
          | I thought Vulkan was a graphics-specific layer and CUDA
          | was specifically for machine learning?
        
           | lalaland1125 wrote:
           | Vulkan is designed for all GPU needs, from rendering to
           | general purpose compute.
           | 
           | CUDA is only for general purpose compute.
        
           | mhh__ wrote:
           | CUDA is for GPGPU (general purpose GPU) which includes
           | machine learning.
           | 
              | Vulkan is primarily for graphics but does have options
              | for GPGPU too. Vulkan is, however, not like OpenGL, in
              | that it's fairly close to the hardware in terms of
              | abstraction.
        
             | my123 wrote:
                | Vulkan has an atrocious developer experience by
                | GPGPU standards.
             | 
                | The chance of it winning over CUDA is zero. And
                | that's _before_ considering its API gaps compared to
                | modern OpenCL.
             | 
             | (Yes, even OpenCL is a much better compute API choice than
             | Vulkan. Vulkan does not even have SVM)
        
               | mhh__ wrote:
               | Vulkan is fairly atrocious, GPGPU or not IMO. Obviously
               | it tries to do something intrinsically complex but I've
               | never enjoyed working with it.
        
               | VHRanger wrote:
                | You should think of Vulkan more as an IR endpoint
                | than as an actually usable API here.
                | 
                | Vulkan is well supported by most GPUs because it's
                | so low level. Performance tends to be good
                | everywhere.
                | 
                | What would make Vulkan successful is having APIs
                | that "compile" to this IR. Projects like Vulkan
                | Kompute are good ideas in this direction.
        
               | my123 wrote:
               | Vulkan is not a suitable API for even implementing
               | Khronos's very own SYCL on top of. SYCL requires shared
               | virtual memory capabilities that Vulkan just doesn't
               | have.
        
               | lalaland1125 wrote:
               | Does CUDA have SVM either? Seems like a pretty niche
               | feature IMHO
        
               | my123 wrote:
               | Yes CUDA does, since Kepler, under the CUDA Unified
               | Memory naming.
               | 
               | It's not a niche feature at all, but one that is
               | essential to lower the barrier for developer adoption.
        
           | mattnewton wrote:
            | CUDA is general-purpose compute, but NVIDIA also
            | releases cuDNN, which all the major libraries use
            | because it is fast and good (if a little complex). There
            | are efforts underway to build a comparable library on
            | open-source general-compute stacks, but none is as
            | mature or effective as cuDNN, so people just pay NVIDIA
            | to use that in practice, which lets them invest even
            | more in pulling ahead.
           | 
            | As an aside, I've been kinda surprised that this gap has
            | existed for as long as it has, but I am probably biased
            | and think ML acceleration is more important than most
            | large businesses do today.
        
         | rowanG077 wrote:
          | It's one of the reasons NVIDIA is basically untouchable at
          | this time. The AI field willingly enslaved itself to
          | NVIDIA.
        
           | my123 wrote:
            | It's because NVIDIA actually cared and AMD does not,
            | where it matters (consumer HW).
           | 
            | Openness is totally secondary to functionality. You
            | know, the same kind of reason why Linux on the desktop
            | was not a mass-market thing compared to Windows for a
            | very long time.
        
             | rowanG077 wrote:
              | OpenCL is totally functional, on AMD and even on Intel
              | iGPUs. The reason NVIDIA won was that they made it
              | easier, and the AI people ate it up. The tooling
              | NVIDIA offers is second to none. But you can build
              | almost anything CUDA does with OpenCL; it's simply
              | harder to do.
              | 
              | The AI crowd cared more about that than about the
              | impact of tying the entire ecosystem to a single
              | company.
             | 
              | Who knows what OpenCL might have been if it were the
              | premier implementation language. I'd wager it would
              | have gotten a LOT more love.
        
               | paulmd wrote:
                | > OpenCL is totally functional, on AMD and even on
                | Intel iGPUs.
               | 
               | The fact that it's not functional on non-NVIDIA platforms
               | is the whole reason Blender dropped OpenCL support. If
               | you're going to write a bunch of implementation-specific
               | code to handle AMD's bugs/non-compliant runtime, why not
               | just target CUDA directly?
               | 
               | "it's open-source, if you want it fixed then pay someone
               | to do it or just spend a month making a patch instead of
               | doing your work???"
               | 
               | bugs-bunny-no.gif
               | 
                | Same with ROCm: AMD just wants to externalize all
                | their costs onto customers yet still wants them to
                | adopt it instead of the turnkey solution that
                | everyone else already uses. Why would anyone do
                | that? It's great for NVIDIA but terrible for users.
               | 
               | (and as for the "AMD drivers have been good for like ten
               | years now!!!" crowd... counterpoint: the entire 5700XT
               | thing, drivers broken for the first 18 months, just like
               | Vega before it. And oh look 7900XTX is turning into a
               | trainwreck too. There's just constant showstopping bugs
               | with AMD drivers. Just like with ROCm too... patchwork
               | support and endless bugs that don't exist in the
               | industry-standard solution. Nobody wants to spend their
               | time doing AMD's job for them.)
               | 
               | To their credit this is one thing Intel got right... they
               | probably spent more dev time on oneAPI in the last year
               | than AMD spent on ROCm and all their previous
               | attempts/projects/resume-driven-development fodder
               | combined.
        
               | [deleted]
        
               | CamperBob2 wrote:
                | When I catch myself writing a sentence like _It's
                | simply harder to do_ in order to promote or justify
                | an alternative engineering approach, I try to, well,
                | catch myself before making a really weak argument in
                | favor of an inferior solution. Making life easy for
                | developers is important.
        
               | rowanG077 wrote:
                | I'm not claiming it's not important, and it's not
                | very nice of you to say I did.
                | 
                | When you have basically doomed humanity to rely on a
                | single (malicious) company for a technology as
                | important as AI, then maybe, just maybe, the trade-
                | off of it being harder to implement is worth it.
        
               | my123 wrote:
                | The tradeoff was not that. The OpenCL SW ecosystem
                | was just not there at all. It's not a coincidence
                | that nobody has a good AI training stack on OpenCL
                | even today. The cross-vendor infrastructure for that
                | doesn't exist.
               | 
               | And NV was far from malicious here, they are who made
               | building this ecosystem possible.
               | 
               | Without NV what would have plausibly happened was not
               | having AI training on GPUs at all, but on bespoke
               | accelerators (which _did_ exist back then) at a totally
                | inaccessible cost to customers. It's hard to overstate
               | their role in building this ecosystem.
        
               | rowanG077 wrote:
                | What exactly is the issue? I use OpenCL without
                | significant issues every day.
        
               | my123 wrote:
                | The whole library ecosystem, for example (but far
                | from only that): say you want a BLAS. OpenCL only
                | provides much lower-level infrastructure bricks.
                | 
                | CUDA also has unmatched performance compared to the
                | alternatives.
        
               | hdjeoslsjdjrhe wrote:
               | We havn't doomed anything... These things happen in
               | cycles. Companies try to force control and compliance and
               | then customers look for alternatives ... The cost had
               | become worth it at point and we have found our point of
               | inflection.
        
               | CamperBob2 wrote:
               | _When you basically have doomed humanity to rely on a
               | single (malicious) company for a technology that is as
               | important as AI_
               | 
               | I don't disagree, but how did this argument fare against
               | Microsoft? Is there a reason you expect it to fare better
               | against Nvidia? That sweaty guy jumping around yelling
               | "Developers! Developers! Developers!" had a point.
        
               | rowanG077 wrote:
               | Well I wouldn't have recommended building anything
               | foundational on .NET either. But .NET is open source and
               | runs almost everywhere now.
               | 
                | I would be fine with CUDA if NVIDIA would allow
                | anyone (AMD/Intel) to make implementations for their
                | GPUs as well.
        
               | my123 wrote:
               | See ROCm HIP which is basically just that. AMD chose to
               | rename all the function prefixes but it's what you are
               | asking for here.
               | 
               | AMD fucked up by not having a stable IR between GPU
               | generations and not having a public Windows SDK. But
               | that's their own problem, not NVIDIA's.
        
               | paulmd wrote:
               | > AMD fucked up by not having a stable IR between GPU
               | generations
               | 
               | The lack of a stable IR is probably deliberate. Much like
               | the "we won't support DLLs or pluggable APIs, only
               | statically compiling it into your application" with FSR2,
               | once you port to HIP you're locked in. AMD wants you
               | working in HIP, compiling from HIP, not treating them as
                | an IR - they don't _want_ to be an alternate runtime
                | for NVIDIA's ecosystem.
               | 
               | And again, much like FSR2, they are in fact willing to
               | compromise end-user experience (no updates) or developer
               | convenience (continual patching) in order to do it. No
               | libraries, only distribute as source, ever.
               | 
               | It's not about library pluggability or runtime
               | compatibility (after all GPU Ocelot already existed),
               | what they want is you building the ROCm Ecosystem and not
               | the CUDA Ecosystem or OneAPI Ecosystem.
               | 
               | That's understandable from a corporate strategy
               | perspective, as a corporation you don't want to be
               | building a product on someone else's platform, because
               | that gives a lot of freedom for the platform owner to
               | fuck with you. But like, the whole "we won't even do
               | libraries/IR" is a little crass from a customer
               | experience/developer experience perspective, and it kinda
               | goes against the whole good-guy-AMD mythos they've built
               | up.
        
               | frognumber wrote:
               | No. It's not.
               | 
               | AMD only officially supports GPGPU headless. That
               | discounts 90% of the market. Old graphics cards lose
               | support randomly. That discounts much of the rest. The
               | whole thing is a horrible, bug-ridden mess.
               | 
               | I'd pick AMD over NVidia if it was e.g. 50% slower at the
               | same price point -- open source is worth waiting for --
               | but I can't take nonworking.
               | 
               | AMD also has no support. I'm now building tooling reliant
               | on NVidia, so if AMD ever gets their stuff working, we're
               | many backports away from a working ecosystem. The longer
               | AMD takes, the deeper the hole.
        
         | AceJohnny2 wrote:
         | CUDA predates Vulkan by over 8 years.
         | 
         | There's a lot of established ecosystem for CUDA, thanks to
         | Nvidia's investment.
        
       ___________________________________________________________________
       (page generated 2022-12-21 23:01 UTC)