[HN Gopher] Stable Diffusion on AMD RDNA3
___________________________________________________________________
Stable Diffusion on AMD RDNA3
Author : tomtomlapomme
Score : 144 points
Date : 2022-12-21 08:53 UTC (14 hours ago)
(HTM) web link (nod.ai)
(TXT) w3m dump (nod.ai)
| chem83 wrote:
| > SHARK is an open source cross platform (Windows, macOS and
| Linux) Machine Learning Distribution packaged with torch-mlir
| (for seamless PyTorch integration), LLVM/MLIR for re-targetable
| compiler technologies along with IREE (for efficient codegen,
| compilation and runtime) and Nod.ai's tuning. IREE is part of the
| OpenXLA Project
|
| Google has been doing a good job advancing the IREE ML compiler
| project, which I think is what will bring other hw platforms like
| AMD and Intel to the ML game. Industry only has to benefit from
| increased hardware portability.
| thunkshift1 wrote:
 | Can someone explain what exactly nod.ai does? It's not
 | clear at all from their page.
| imhoguy wrote:
| Any chance to get SD running on mobile Ryzen APU e.g. Ryzen Pro
| 4750U (Renoir)?
| sosborn wrote:
| This worked for me with a 5700xt:
|
| https://www.travelneil.com/stable-diffusion-windows-amd.html
| negativegate wrote:
| nod-ai/SHARK from the original submission is by far the
| fastest way I've found to run Stable Diffusion on a 5700 XT.
|
| For 50 iterations:
|
| * ONNX on Windows was 4-5 minutes
|
| * ROCm on Arch Linux was ~2.5 minutes
|
| * SHARK on Windows is ~30 seconds
| magic_at_nodai wrote:
| Can you give SHARK a try and let us know on our discord? We can
| try to help. People have been using it on older AMD GPUs back
| to Polaris arch.
| delijati wrote:
| Short answer no. Long answer "in theory" yes. I tried this [1]
| but gave up as building rocm + deps takes up to 6h :/ Official
| statement [2]
|
| [1] https://github.com/xuhuisheng/rocm-build [2]
| https://github.com/RadeonOpenCompute/ROCm/issues/1587
| nicolaslem wrote:
| For anyone on Arch, there is a third-party repository called
| arch4edu[0] that provides up to date builds of ROCm and its
| dependencies. On my iGPU, OpenCL sometimes works, sometimes
| crashes. Even finding a list of supported hardware is close
| to impossible. The whole situation is just ridiculous and
| makes AMD look bad.
|
| [0] https://github.com/arch4edu/arch4edu
| my123 wrote:
| AMD doesn't actually care.
|
| For them, GPGPU is a pro level feature not worth supporting
| on most customer GPUs. They are doing much more feature
| segmentation than NVIDIA ever did.
| AstixAndBelix wrote:
| Does anyone know what's the current state of AMD's tools to
| migrate from CUDA? There's so much untapped potential with these
| cards, it's crazy that basically only gamers can make use of
| their competitive prices
| epmaybe wrote:
 | I don't think there's truly a competitor, but OpenCL is
 | the alternative to shoot for. Otherwise, for machine
 | learning purposes, AMD helps develop ROCm.
| pjmlp wrote:
 | OpenCL is hardly an alternative: plain old C, compiled
 | from source at runtime, with only very basic tooling
 | available.
 |
 | Versus a polyglot compiler infrastructure, IDE tooling that
 | includes shader debugging, and a rich ecosystem of GPU-based
 | libraries.
 |
 | Even with SYCL and SPIR-V, that has hardly improved, and
 | while Intel bases oneAPI on top of SYCL, that naturally also
 | goes beyond the standard.
| amelius wrote:
 | Shouldn't we have an API that can speak to both CUDA and
 | OpenCL? Or is OpenCL sufficiently capable?
| pjmlp wrote:
 | No, it isn't, because it lacks CUDA's polyglot
 | infrastructure. It now has SPIR-V, but hardly anyone
 | targets it the way PTX gets used.
| snvzz wrote:
| I understand AMD HiP is a CUDA clone, where library
| functions have the same syntax but with hip replacing
| cuda in the function names.
|
 | Behind the scenes, it can target AMD and NVIDIA hardware
 | alike. Thus, the idea is that with typically negligible
 | effort porting to HIP, your code becomes vendor-
 | independent.
|
| In practice, I do not know how true this is.
| my123 wrote:
| > Thus, the idea is that through typically negligible
| effort porting to HiP, your code becomes vendor-
| independent.
|
| Here, the big AMD mistake was to rename those function
| prefixes in the first place. It's a mistake that they
| could have avoided...
|
| What a lot of SW codebases did to support AMD (see
| PyTorch code notably): codebase is still CUDA, have the
| conversion pass to HIP done at build time.
|
| See https://github.com/ROCm-Developer-
| Tools/HIPIFY/blob/amd-stag... for the Perl script to do
| it.
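The build-time conversion described above can be sketched minimally. This is an illustrative toy only: the real hipify-perl script linked above handles far more (types, headers, kernel-launch syntax, many library prefixes), and the rule table below is a hypothetical simplification of the renaming idea:

```python
import re

# Sketch of the hipify idea: the codebase stays written against the
# CUDA names, and a build-time pass rewrites API prefixes to their
# HIP equivalents. Only a tiny subset of the real rules is shown.
RENAMES = [
    # Runtime header (lowercase, so the rule below won't touch it).
    (re.compile(r"#include\s*<cuda_runtime\.h>"),
     "#include <hip/hip_runtime.h>"),
    # cudaMalloc -> hipMalloc, cudaMemcpy -> hipMemcpy, ...
    (re.compile(r"\bcuda([A-Z]\w*)"), r"hip\1"),
    # cuBLAS entry points: cublasSgemm -> hipblasSgemm, ...
    (re.compile(r"\bcublas(\w*)"), r"hipblas\1"),
]

def hipify(source: str) -> str:
    """Apply each rename rule over the whole translation unit."""
    for pattern, replacement in RENAMES:
        source = pattern.sub(replacement, source)
    return source
```

For example, `hipify("cudaMalloc(&p, n);")` yields `hipMalloc(&p, n);`, which is why projects like PyTorch can keep a single CUDA codebase and run the pass at build time.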
|
| Then comes the problem of AMD not supporting ROCm HIP on
| most of their hardware or user base.
|
| On Windows, the ROCm HIP SDK is private and only
| available under NDA. This means that while you can use
| Blender w/ HIP on Windows, the Blender builds that you
| compile yourself will not be able to use ROCm HIP.
|
| On Linux, the supported GPUs are few and far between,
| Vega20 onwards are supported today. APUs, RDNA1, and
| lower end RDNA2 w/o unsupported hacks (6700 XT and below)
| are excluded.
| tormeh wrote:
| It's quite baffling. AMD is behaving like an incumbent
| trying to segment users etc. when they really should
| behave more like an upstart trying to make things easy.
| But their drivers for Linux are the best, so I don't
| think I'll switch to Nvidia...
| paulmd wrote:
| > What a lot of SW codebases did to support AMD (see
| PyTorch code notably): codebase is still CUDA, have the
| conversion pass to HIP done at build time.
|
| This is sort of echoed in AMD's stance on FSR2/upscaling,
| where they have _explicitly stated_ they will not support
| any API that allows plugging, regardless of whether the
| API is open-source or not, or who owns it, because a
| pluggable API might allow plugging proprietary
 | implementations. Their position is that their solution is
 | what's best for everyone, in every situation, so you don't
| really need DLLs or pluggability because why would you
| want to plug something worse? FSR2 is the best for
| everyone and you should really just be compiling it
| directly into your application.
|
| https://youtu.be/8ve5dDQ6TQE?t=974
|
| (this of course also makes it impossible to update
| versions of FSR2 if better ones come out subsequently -
| you can't do the DLSS thing where you swap in newer DLLs
| (at your own risk of course) and benefit from later
| improvements to a modular grouping of code. You know,
| sort of the whole concept of libraries in the first
| place...)
|
| The HIP stuff is the same thing... AMD really wants you
| to convert once to HIP and be locked in forever, because
| Theirs Is The Best, Why Would You Need Anything Else? But
| of course HIP doesn't have a PTX-like concept so you
| really need to distribute as source and compile
| everything at runtime... because who would want library
| code or dynamic linking?
|
| Anyway, like, I know it's not really a shocker but the
| "we love open-source!" thing is a bit of an act. They
| love it when it's an angle for them as the underdog to
| leverage their way into marketshare... and as the
| underdog when it's not favorable for them (like FSR)
| they'll abandon their pro-freeness stance. And they too
| have their closed, proprietary technologies (like their
| CXL alternative that only works with their
| CPU+peripherals and nobody else can use) that they don't
| open up either. Nor is AMD racing to open up chipsets
| (like the NForce or Abit days) either, that's all locked
| down and proprietary too.
|
| I know that's not really a shocker when you put it like
| that, but, AMD really gets a ton of the benefit-of-the-
| doubt all the time. They have on _multiple occasions_
 | shipped defective/marginal silicon at launch, for
| example, and it all just gets brushed over and people
| forget all about it. Both Zen2 (low-quality silicon in
| the launch batch meant chips were missing advertised
| boost clocks by 10%+) and RDNA1 had massive incidents of
| the community downplaying very real problems because AMD
| Is Good Now, many of the affected users never had their
| problems resolved and they just kinda sighed and lived
| with it or sold the hardware and bought something better,
| and the fans swept it all under the rug and never talked
| of it again. Same for pandemic profiteering (while Intel
| cut prices), etc. There's just a ton of shit that people
| bend over backwards to find justifications for with AMD
| that just wouldn't fly with more reputable vendors.
| schmorptron wrote:
 | Do you have an opinion on the new OpenCL implementation
 | that recently got merged into Mesa? It doesn't touch on
| tooling or the other points you mentioned, but performance
| seems to be pretty good!
|
| https://www.phoronix.com/news/Rusticl-2022-XDC-State
| pjmlp wrote:
| No, it doesn't seem to matter for what makes CUDA
| relevant anyway.
| my123 wrote:
| It's still a heavy work in progress. Not usable for SYCL
| programs as SVM isn't implemented yet.
| CodeArtisan wrote:
 | Performance in Blender is atrocious; the RX 7900 XTX is
 | noticeably slower than an RTX 3060.
| snvzz wrote:
| Latest Blender release does not have the optimization work in
| yet.
|
| AIUI, what's in current git master is very different.
| marcyb5st wrote:
| Last time I seriously checked (6 months ago or so) ROCm was
| still a far cry from CUDA. Set up was a mess, support was hit
 | and miss, and some operations were not particularly performant
| compared to the CUDA counterparts. Additionally, there are
| Tensorflow and probably PyTorch forks that should work with it,
| but they lag behind the official repositories quite a bit.
|
| I hope that now that generative AI is becoming mainstream AMD
| steps up their game both on their consumer and professional
 | lineups. If I were to buy a video card right now (mostly
 | for gaming + ML hobby projects + running Stable Diffusion),
 | I wouldn't pick AMD, because I could only do one of my
 | three use cases (gaming) properly, without headaches.
| rowanG077 wrote:
 | OpenCL works pretty well. I can't say I notice large
 | performance gaps between CUDA and OpenCL for my HPC work.
| lalaland1125 wrote:
| Have you done any benchmarks with vulkan?
| rowanG077 wrote:
| No I haven't used vulkan for compute.
| my123 wrote:
| Thankfully for a good chunk of number crunching that works
| fine. But the other side of the coin is notably AI
| workloads. There's no OpenCL or Vulkan standard for
| exposing matrix units, only vendor specific ones.
|
| For OpenCL: cl_qcom_ml_ops (Qualcomm) notably, for Vulkan:
| VK_NV_cooperative_matrix (NVIDIA)
| tehsauce wrote:
| "There has also been a wide variety of accuracy-degrading
| performance optimizations like Xformers and Flash Attention,
| which are great tools if you are open to trading accuracy for
| performance"
|
 | This is incorrect. Those optimizations perform identical
 | computations but use GPU memory bandwidth more effectively,
 | so there is no accuracy tradeoff there.
| magic_at_nodai wrote:
 | Here is a list of potential issues:
| https://github.com/AUTOMATIC1111/stable-diffusion-webui/disc...
|
| That said we (Nod.ai team) will add support for xformers soon
| so you can opt in for xformers anyway.
| ggerganov wrote:
| > There has also been a wide variety of accuracy-degrading
| performance optimizations like Xformers and Flash Attention,
| which are great tools if you are open to trading accuracy for
| performance ..
|
| I wasn't aware that Flash Attention trades accuracy for
| performance. Either I have a wrong understanding of what FA is,
| or this statement is not fully accurate.
|
| Either way - looks like great work
| marcyb5st wrote:
| From the flash attention paper:
|
| We also extend FlashAttention to block-sparse attention,
| yielding an approximate attention algorithm that is faster than
| any existing approximate attention method.
|
| So I assume they are using the approximate version as they also
| have an exact version.
| ggerganov wrote:
| Thanks for that - I have missed the block-sparse extension of
| the algorithm when I first read about it. And indeed this
| seems to be what the author means.
| lalaland1125 wrote:
| I really wish more GPU libraries had focused on vulkan instead of
| CUDA ...
| andy_ppp wrote:
| I thought Vulkan was a graphics specific layer and CUDA was
| specifically for machine learning?
| lalaland1125 wrote:
| Vulkan is designed for all GPU needs, from rendering to
| general purpose compute.
|
| CUDA is only for general purpose compute.
| mhh__ wrote:
| CUDA is for GPGPU (general purpose GPU) which includes
| machine learning.
|
 | Vulkan is primarily for graphics but does have options for
 | GPGPU too. Vulkan is, however, not like OpenGL, in that
 | it's fairly close to the hardware in terms of abstraction.
| my123 wrote:
| Vulkan has a very atrocious developer experience by GPGPU
| standards.
|
| The chance of it winning over CUDA is at zero. And that's
| _before_ considering its API gaps compared to modern
| OpenCL.
|
| (Yes, even OpenCL is a much better compute API choice than
| Vulkan. Vulkan does not even have SVM)
| mhh__ wrote:
| Vulkan is fairly atrocious, GPGPU or not IMO. Obviously
| it tries to do something intrinsically complex but I've
| never enjoyed working with it.
| VHRanger wrote:
 | You should think of Vulkan more as an IR endpoint than as
 | the actually usable API here.
|
| Vulkan is well supported by most GPUs because it's so low
| level. Performance tends to be good everywhere.
|
 | What would make Vulkan successful is having APIs that
 | "compile" to this IR. Projects like Vulkan Kompute are good
 | ideas in this direction.
| my123 wrote:
| Vulkan is not a suitable API for even implementing
| Khronos's very own SYCL on top of. SYCL requires shared
| virtual memory capabilities that Vulkan just doesn't
| have.
| lalaland1125 wrote:
| Does CUDA have SVM either? Seems like a pretty niche
| feature IMHO
| my123 wrote:
 | Yes, CUDA does, since Kepler, under the CUDA Unified
 | Memory name.
|
| It's not a niche feature at all, but one that is
| essential to lower the barrier for developer adoption.
| mattnewton wrote:
| CUDA is general purpose compute, but nvidia also releases
| cudnn which all the major libraries use because it is fast
 | and good (if a little complex). There are efforts underway to
| have a comparable library on open source general compute
| packages but none as mature or effective as cudnn so people
| just pay nvidia to use that in practice, which lets them
| invest even more in pulling ahead.
|
| As an aside, I've been kinda surprised that this has existed
 | for as long as it has, but I am probably biased and think ML
 | acceleration is more important than most large businesses do
 | today.
| rowanG077 wrote:
| It's one of the reasons Nvidia is basically untouchable at this
| time. The AI field willingly enslaved itself to NVidia.
| my123 wrote:
| It's because NVIDIA actually cared and AMD does not where it
| matters (customer HW).
|
 | Openness is totally secondary to being functional. You
 | know, the same kind of reason why Linux on the desktop was
 | not a mass-market thing compared to Windows for a very
 | long time.
| rowanG077 wrote:
| OpenCL is totally functional, on AMD and even iGPU Intel.
| The reason Nvidia won was because they made it easier. And
| the AI people ate it up. The tooling Nvidia offers is
| second to none. But you can build almost anything CUDA does
| with openCL. It's simply harder to do.
|
 | The AI crowd cared more about that than about the impact
 | of tying the entire ecosystem to a single company.
|
| Who knows what openCL might have been if it would be the
| premier implementation language. I'd wager it would have
| gotten a LOT more love.
| paulmd wrote:
| > OpenCL is totally functional, on AMD and even iGPU
| Intel.
|
| The fact that it's not functional on non-NVIDIA platforms
| is the whole reason Blender dropped OpenCL support. If
| you're going to write a bunch of implementation-specific
| code to handle AMD's bugs/non-compliant runtime, why not
| just target CUDA directly?
|
| "it's open-source, if you want it fixed then pay someone
| to do it or just spend a month making a patch instead of
| doing your work???"
|
| bugs-bunny-no.gif
|
 | Same with ROCm: AMD just wants to externalize
| all their costs onto customers yet still wants them to
| adopt it instead of the turnkey solution that everyone
| else already uses. Why would anyone do that? It's great
| for NVIDIA but terrible for users.
|
| (and as for the "AMD drivers have been good for like ten
| years now!!!" crowd... counterpoint: the entire 5700XT
| thing, drivers broken for the first 18 months, just like
| Vega before it. And oh look 7900XTX is turning into a
| trainwreck too. There's just constant showstopping bugs
| with AMD drivers. Just like with ROCm too... patchwork
| support and endless bugs that don't exist in the
| industry-standard solution. Nobody wants to spend their
| time doing AMD's job for them.)
|
| To their credit this is one thing Intel got right... they
| probably spent more dev time on oneAPI in the last year
| than AMD spent on ROCm and all their previous
| attempts/projects/resume-driven-development fodder
| combined.
| [deleted]
| CamperBob2 wrote:
 | When I catch myself writing a sentence like _It's simply
 | harder to do_ in order to promote or justify an
| alternative engineering approach, I try to, well, catch
| myself before making a really weak argument in favor of
| an inferior solution. Making life easy for developers is
| important.
| rowanG077 wrote:
 | I'm not claiming it's not important, and it's not very
 | nice of you to say I did.
|
| When you basically have doomed humanity to rely on a
| single (malicious) company for a technology that is as
| important as AI. Then maybe, just maybe, the trade off
| that it is a harder to implement is worth it.
| my123 wrote:
| The tradeoff was not that. The OpenCL SW ecosystem was
| just not there at all. It's not a coincidence that nobody
| has a good AI training on OpenCL stack even today. The
| cross-vendor infrastructure for that doesn't exist.
|
| And NV was far from malicious here, they are who made
| building this ecosystem possible.
|
| Without NV what would have plausibly happened was not
| having AI training on GPUs at all, but on bespoke
| accelerators (which _did_ exist back then) at a totally
 | inaccessible cost to customers. It's hard to overstate
| their role in building this ecosystem.
| rowanG077 wrote:
| What exactly is the issue? I use OpenCl without
| significant issues everyday.
| my123 wrote:
 | The whole library ecosystem; for example (but far from the
 | only gap), if you want a BLAS. OpenCL only provides much
 | lower-level infrastructure bricks.
|
| With CUDA having unmatched performance compared to
| alternatives too.
| hdjeoslsjdjrhe wrote:
 | We haven't doomed anything... These things happen in
 | cycles. Companies try to force control and compliance, and
 | then customers look for alternatives... At some point the
 | cost becomes worth it, and we have found our point of
 | inflection.
| CamperBob2 wrote:
| _When you basically have doomed humanity to rely on a
| single (malicious) company for a technology that is as
| important as AI_
|
| I don't disagree, but how did this argument fare against
| Microsoft? Is there a reason you expect it to fare better
| against Nvidia? That sweaty guy jumping around yelling
| "Developers! Developers! Developers!" had a point.
| rowanG077 wrote:
| Well I wouldn't have recommended building anything
| foundational on .NET either. But .NET is open source and
| runs almost everywhere now.
|
| I would be fine with CUDA if Nvidia would allow
| anyone(AMD/Intel) to make implementations for their GPUs
| as well.
| my123 wrote:
| See ROCm HIP which is basically just that. AMD chose to
| rename all the function prefixes but it's what you are
| asking for here.
|
| AMD fucked up by not having a stable IR between GPU
| generations and not having a public Windows SDK. But
| that's their own problem, not NVIDIA's.
| paulmd wrote:
| > AMD fucked up by not having a stable IR between GPU
| generations
|
| The lack of a stable IR is probably deliberate. Much like
| the "we won't support DLLs or pluggable APIs, only
| statically compiling it into your application" with FSR2,
| once you port to HIP you're locked in. AMD wants you
| working in HIP, compiling from HIP, not treating them as
| an IR - they don't _want_ to be an alternate runtime for
 | NVIDIA's ecosystem.
|
| And again, much like FSR2, they are in fact willing to
| compromise end-user experience (no updates) or developer
| convenience (continual patching) in order to do it. No
| libraries, only distribute as source, ever.
|
| It's not about library pluggability or runtime
| compatibility (after all GPU Ocelot already existed),
| what they want is you building the ROCm Ecosystem and not
| the CUDA Ecosystem or OneAPI Ecosystem.
|
| That's understandable from a corporate strategy
| perspective, as a corporation you don't want to be
| building a product on someone else's platform, because
| that gives a lot of freedom for the platform owner to
| fuck with you. But like, the whole "we won't even do
| libraries/IR" is a little crass from a customer
| experience/developer experience perspective, and it kinda
| goes against the whole good-guy-AMD mythos they've built
| up.
| frognumber wrote:
| No. It's not.
|
| AMD only officially supports GPGPU headless. That
| discounts 90% of the market. Old graphics cards lose
| support randomly. That discounts much of the rest. The
| whole thing is a horrible, bug-ridden mess.
|
| I'd pick AMD over NVidia if it was e.g. 50% slower at the
| same price point -- open source is worth waiting for --
| but I can't take nonworking.
|
| AMD also has no support. I'm now building tooling reliant
| on NVidia, so if AMD ever gets their stuff working, we're
| many backports away from a working ecosystem. The longer
| AMD takes, the deeper the hole.
| AceJohnny2 wrote:
| CUDA predates Vulkan by over 8 years.
|
| There's a lot of established ecosystem for CUDA, thanks to
| Nvidia's investment.
___________________________________________________________________
(page generated 2022-12-21 23:01 UTC)