[HN Gopher] Use pytorch2+cu118 with ADA hardware for 50%+ speedup
___________________________________________________________________
Use pytorch2+cu118 with ADA hardware for 50%+ speedup
Author : vans554
Score : 116 points
Date : 2023-07-19 14:55 UTC (8 hours ago)
(HTM) web link (gpux.ai)
(TXT) w3m dump (gpux.ai)
| mrwizrd wrote:
| Can the same speedup be obtained on a 3090?
| vans554 wrote:
| I accidentally stumbled upon this and did not expect such a
| speedup. It seems anything less than cu118 does not properly
| support the RTX 4090 (or H100).
|
| Bumping to CUDA 12.2 with PyTorch 2.0.1+cu118 made my SDXL
| go 50% faster and ESRGAN 80% faster on the 4090.
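|
| To sanity-check which CUDA and cuDNN your install was
| actually built against, a quick illustrative snippet (the
| printed values are just examples):
|
|     # Report the CUDA/cuDNN versions this PyTorch build ships with.
|     import torch
|
|     print(torch.__version__)               # e.g. "2.0.1+cu118"
|     print(torch.version.cuda)              # e.g. "11.8"
|     print(torch.backends.cudnn.version())  # e.g. 8700
|     print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA GeForce RTX 4090"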
| latchkey wrote:
| Good find!
| brucethemoose2 wrote:
| You can also run the PyTorch cu121 nightly builds.
|
| These also allow `torch.compile` to function properly with
| dynamic input, which should net another 30%+ boost to SD.
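|
| Roughly, with a stand-in model (not from the thread):
|
|     # dynamic=True asks the compiler for shape-polymorphic
|     # kernels instead of recompiling per input size.
|     import torch
|
|     model = torch.nn.Linear(64, 64).cuda()
|     compiled = torch.compile(model, dynamic=True)
|
|     for n in (8, 16, 32):  # varying batch sizes, one compile
|         x = torch.randn(n, 64, device="cuda")
|         y = compiled(x)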
| cheald wrote:
| Is there a trick to getting pytorch+cu121 and xformers to
| play nicely together? All the xformers packages I can find
| are built against torch==2.0.1+cu118.
|
| Edit: After a bit more research it looks like scaled dot
| product attention in Pytorch 2 provides much the same benefit
| as xformers without the need for xformers proper. Nice.
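|
| For reference, the built-in looks roughly like this (toy
| tensor shapes, not SD's):
|
|     # PyTorch 2 dispatches to a fused kernel (FlashAttention or
|     # memory-efficient attention) automatically when it can.
|     import torch
|     import torch.nn.functional as F
|
|     q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
|     k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
|     v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
|
|     out = F.scaled_dot_product_attention(q, k, v)  # (batch, heads, seq, dim)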
| brucethemoose2 wrote:
| xformers has to match the PyTorch build. For PyTorch
| nightly, you need to build from source.
|
| xformers still has a tiny performance benefit (especially
| at higher resolutions IIRC), but yeah, PyTorch's SDP is
| good.
| vans554 wrote:
| Pretty interesting. Using nightly + cu121 I'm getting 8.18
| it/s, another 5% improvement over the 7.78 it/s that cu118
| gave.
| voz_ wrote:
| This comment brings a tear to my eye.
| doctorpangloss wrote:
| The underlying problem is the community's decision to make
| users manage this in the first place.
|
| This is an example of a setup.py that correctly installs
| the accelerated PyTorch for your platform:
|
| https://github.com/comfyanonymous/ComfyUI/blob/9aeaac4af5e1
| 9...
|
| As you can see, it was never merged, for philosophical
| reasons I believe. The author wanted to merge it earlier
| and changed his mind.
|
| Like why make end users deal with this at all? The ROI from
| a layperson choosing these details is very low.
|
| Python has a packaging problem; this is well known. Fixing
| setuptools would be the highest-yield change. Other package
| tooling can't install PyTorch either, for example:
| https://github.com/python-
| poetry/poetry/issues/6409#issuecom....
|
| PyTorch itself is wonkily packaged, but I'm sure they have
| a good reason for this. Anyway, it goes to show that you
| can put a huge amount of effort into fixing this particular
| problem that everyone touching this technology has, and
| maintainers everywhere will still go nowhere with it. And I
| don't think this is a "me" problem, because there is so much
| demand for packaging PyTorch correctly: all the easy UIs,
| etc.
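|
| The gist of such a setup.py, as a hypothetical sketch (this
| is not the linked ComfyUI code; only the index URL is
| PyTorch's real wheel index):
|
|     # Pick the right PyTorch wheel index for the platform, then
|     # shell out to pip. Real logic would also handle ROCm,
|     # CPU-only builds, etc.
|     import platform
|     import subprocess
|     import sys
|
|     def install_torch():
|         cmd = [sys.executable, "-m", "pip", "install", "torch"]
|         if platform.system() != "Darwin":  # macOS wheels come from plain PyPI
|             cmd += ["--index-url", "https://download.pytorch.org/whl/cu118"]
|         subprocess.check_call(cmd)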
| brucethemoose2 wrote:
| > But I'm sure they have a good reason for this.
|
| CUDA and ROCm make this an intractable problem. Basically
| there is no way to sanely package everything users need,
| and the absolutely enormous, cuda/rocm-versioned pytorch
| packages with missing libs are already a compromise.
|
| TBH the whole ecosystem is not really meant for end-user
| inference anyway.
| voz_ wrote:
| Sorry, no idea what you are talking about.
|
| I am talking about dynamic shapes in torch.compile.
|
| You seem to be talking about software packaging. You also
| make heavy use of the word "this" without it being clear
| what "this" is.
| brucethemoose2 wrote:
| The two most popular stable diffusion UIs (automatic1111
| and comfy) have longstanding issues with a few known but
| poorly documented bugs, like the ADA performance issue.
|
| For instance, the torch.compile thing we are talking
| about is (last I checked) totally irrelevant for those
| UIs because they are still using the Stability AI
| implementation, not the Huggingface diffusers package,
| which is what gets checked for graph breaks. This may
| extend to SDXL.
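|
| One way to surface graph breaks, sketched with a stand-in
| module rather than a real UNet:
|
|     # fullgraph=True makes torch.compile raise on a graph break
|     # instead of silently splitting and falling back to eager.
|     import torch
|
|     unet = torch.nn.Conv2d(4, 4, 3, padding=1).cuda()  # stand-in
|     compiled = torch.compile(unet, fullgraph=True)
|
|     x = torch.randn(1, 4, 64, 64, device="cuda")
|     y = compiled(x)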
| hospitalJail wrote:
| This was one of the reasons I skipped the 4090.
|
| So few people have the technology that I knew I'd be spending
| significant time figuring out solutions to problems.
|
| The other reason is that I'd wait a few years and get some 6090
| with 4x the VRAM.
| VadimPR wrote:
| I can confirm that it's true on RTX 4080 on Ubuntu 22.04 LTS.
| SekstiNi wrote:
| Surprised people don't know about this, as it has been common
| knowledge in the SD community [1] since October last year.
| Strictly speaking you don't even need CUDA 11.8+ to get the
| speedup; it's sufficient to use cuDNN 8.6+, though you should
| use the newest versions for other reasons.
|
| [1]: https://github.com/AUTOMATIC1111/stable-diffusion-
| webui/issu...
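|
| You can verify the bundled cuDNN meets that bar (8.6 reports
| as 8600):
|
|     # cuDNN versions are reported as a single integer, e.g. 8600.
|     import torch
|
|     assert torch.backends.cudnn.version() >= 8600, "cuDNN too old"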
| voz_ wrote:
| Always cool to see :)
|
| If you build from source, it should be even faster compared to
| release builds, if only because we keep on landing fixes and
| speedups regularly.
|
| If anyone tries this and runs into bugs or issues, feel free to
| respond here and I can take a look.
| alfalfasprout wrote:
| Oh man, I deal with CUDA version nuances all the time. ML
| dependency management in particular is always extra fun.
| Between all the different CUDA, cuDNN, and NCCL versions,
| versions of TF frameworks, numpy dependencies, etc., it can
| quickly become a mess.
|
| We've started really investing in a better solution; it's
| always interesting to see just how big a difference getting
| the right CUDA version for a given build of, e.g., torch
| makes.
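|
| One trick that helps: dump the whole version matrix in one
| place when debugging a build (illustrative, torch-centric):
|
|     # Print every version that has to line up for a working stack.
|     import numpy as np
|     import torch
|
|     print("torch:", torch.__version__)
|     print("cuda :", torch.version.cuda)
|     print("cudnn:", torch.backends.cudnn.version())
|     print("nccl :", torch.cuda.nccl.version())  # e.g. (2, 14, 3)
|     print("numpy:", np.__version__)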
| WithinReason wrote:
| PyTorch has been listing this install option for months, just
| click the "CUDA 11.8" button:
|
| https://pytorch.org/get-started/locally/
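|
| That button produces a command along these lines (verify
| against the site for your OS and package manager):
|
|     # pip3 install torch torchvision torchaudio \
|     #     --index-url https://download.pytorch.org/whl/cu118
|     # Then confirm the build:
|     import torch
|     assert torch.version.cuda == "11.8"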
| baby_souffle wrote:
| Yes, but 11.7 has been the "stable" release:
| https://github.com/pytorch/pytorch/blob/main/RELEASE.md#rele...
| lostmsu wrote:
| Does it apply to Windows?
| boredumb wrote:
| Wow, if those benchmarks are true, that is amazing to read.
| valine wrote:
| It's true. I've been installing nightly builds of pytorch
| for months specifically to access this fix. I've been
| getting 40 it/s outputting a 512x512 image on my 4090.
| Prior to the fix I would get around 19 it/s.
| photoGrant wrote:
| Why am I only getting 3 it/s with a 3090?
|
| Am I doing something heavily wrong? All through WSL2
| valine wrote:
| it/s depends on resolution and other factors like batch
| size. What are you getting for a 512x512 image?
| photoGrant wrote:
| Fair, 12.3 it/s. My numbers are with the dev branch at
| 1024x1024 with the XL model.
| valine wrote:
| Yeah, that'll do it. 3 it/s sounds normal then.
| capableweb wrote:
| Also sampler and bunch of other parameters.
| bilsbie wrote:
| ELI5?
| thangngoc89 wrote:
| Using a newer CUDA version with supported hardware and
| software boosts performance.
___________________________________________________________________
(page generated 2023-07-19 23:01 UTC)