[HN Gopher] High-performance image generation using Stable Diffu...
___________________________________________________________________
High-performance image generation using Stable Diffusion in KerasCV
Author : tosh
Score : 317 points
Date : 2022-09-28 08:28 UTC (14 hours ago)
(HTM) web link (keras.io)
(TXT) w3m dump (keras.io)
| ShamelessC wrote:
| Nice! I'll take anything over the huggingface version - the API
| design by huggingface, where CLIP is in transformers and
| everything else is in diffusers, is not a great developer
| experience (unless you're the type of person who likes their
| Python to look like half-baked J2EE).
| capableweb wrote:
| Tried to get this running on my 2080 Ti (11GB VRAM) but I'm
| hitting OOM issues. So while performance seems better, I can't
| actually verify it myself since it doesn't run. Some of the
| PyTorch forks work on as little as 6GB of VRAM (or maybe even
| 4GB?), but it's always good to have implementations that optimize
| for different factors; this one seems to trade memory usage for
| raw generation speed.
|
| Edit: there seems to be a more "full" version of the same work
| available here, made by one of the authors of the submission
| article: https://github.com/divamgupta/stable-diffusion-
| tensorflow
| WithinReason wrote:
| Just breaking the attention matrix multiply into parts allows a
| significant reduction in memory consumption at minimal cost.
| There are variants out there that do that and more.
|
| Short version: attention is a matrix product that looks like
| this: s(QK^T)V, where QK^T is a large matrix but Q, K, V and the
| result are all small. You can break Q into horizontal strips.
| Then the result is the vertical concatenation of:
| s(Q1*K^T)*V  s(Q2*K^T)*V  s(Q3*K^T)*V  ...  s(QN*K^T)*V
|
| Since you're reusing the memory for the computation of each
| block, you can get away with much less simultaneous RAM use.
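|
| A minimal NumPy sketch of that chunking (illustrative only; real
| implementations do the same thing inside a fused GPU kernel):
|
|     import numpy as np
|
|     def softmax(x, axis=-1):
|         x = x - x.max(axis=axis, keepdims=True)
|         e = np.exp(x)
|         return e / e.sum(axis=axis, keepdims=True)
|
|     def attention(q, k, v):
|         # reference: materializes the full (n x n) score matrix
|         return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
|
|     def chunked_attention(q, k, v, chunk=64):
|         # same result, but only a (chunk x n) slice of the score
|         # matrix is alive at any one time
|         scale = np.sqrt(q.shape[-1])
|         out = np.empty_like(q)
|         for i in range(0, q.shape[0], chunk):
|             qi = q[i:i + chunk]
|             out[i:i + chunk] = softmax(qi @ k.T / scale) @ v
|         return out
|
|     n, d = 1024, 64
|     q, k, v = (np.random.randn(n, d) for _ in range(3))
|     assert np.allclose(attention(q, k, v),
|                        chunked_attention(q, k, v))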
| liuliu wrote:
| PyTorch doesn't offer an in-place softmax, which contributes
| about 1GiB of extra memory during inference (of Stable
| Diffusion). Although all of these are insignificant improvements
| compared to just switching to FlashAttention inside the UNet
| model.
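|
| You can hack one together, though -- a rough sketch that just
| reuses the scores buffer in place (not an official PyTorch API):
|
|     import torch
|
|     def softmax_inplace_(scores, dim=-1):
|         # numerically stable softmax reusing the input buffer
|         scores.sub_(scores.amax(dim=dim, keepdim=True))
|         scores.exp_()
|         scores.div_(scores.sum(dim=dim, keepdim=True))
|         return scores
|
|     x = torch.randn(4, 8)
|     expected = torch.softmax(x, dim=-1)
|     assert torch.allclose(softmax_inplace_(x), expected)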
| GistNoesis wrote:
| Yeah, the problem is indeed in the attention computation.
|
| You can do something like that but it's far from optimal.
|
| From a memory-consumption perspective, the right way to do it is
| to never materialize the intermediate matrices.
|
| You can do that with a custom op that computes att =
| scaledAttention(Q,K,V) and the gradient dQ,dK,dV =
| scaledAttentionBackward(Q,K,V,att,datt).
|
| The memory needed for these ops is the memory to store
| Q,K,V,att,dQ,dK,dV,datt plus some extra temporary memory.
|
| When you do the work to minimize memory consumption, this extra
| temporary memory is really small: 6 x attention_horizon^2 x
| number_of_cores_running_in_parallel numbers.
|
| But even though there is not much recomputation, this kernel
| won't run as fast due to the pattern of memory access, unless you
| spend some time manually optimizing it.
|
| The place to do it is at the level of the autodiff framework,
| i.e. TensorFlow or PyTorch, with low-level C++/CUDA code.
|
| Anybody can write a custom kernel, but deploying, maintaining and
| distributing them is a nightmare. So the only people who could
| and should have done it are the TensorFlow or PyTorch folks.
|
| In fact they probably have, but it's considered a strategic
| advantage and reserved for internal use only.
|
| Mere mortals like us have to use workarounds (splitting matrices,
| KeOps, gradient checkpointing...) to not be penalized too much by
| the limited ops of out-of-the-box autodiff frameworks like
| TensorFlow or PyTorch.
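|
| To make the interface concrete, here is a rough PyTorch sketch
| (math only; it still materializes the attention matrix with plain
| tensor ops, whereas a real kernel would tile the Q*K^T product
| the way FlashAttention does):
|
|     import torch
|
|     class ScaledAttention(torch.autograd.Function):
|         @staticmethod
|         def forward(ctx, q, k, v):
|             scale = q.shape[-1] ** -0.5
|             att = torch.softmax(q @ k.transpose(-2, -1) * scale,
|                                 dim=-1)
|             ctx.save_for_backward(q, k, v, att)
|             return att @ v
|
|         @staticmethod
|         def backward(ctx, dout):
|             q, k, v, att = ctx.saved_tensors
|             scale = q.shape[-1] ** -0.5
|             dv = att.transpose(-2, -1) @ dout
|             datt = dout @ v.transpose(-2, -1)
|             # softmax backward: dS = A * (dA - rowsum(dA * A))
|             ds = att * (datt
|                         - (datt * att).sum(dim=-1, keepdim=True))
|             dq = ds @ k * scale
|             dk = ds.transpose(-2, -1) @ q * scale
|             return dq, dk, dv
|
|     q, k, v = (torch.randn(16, 32, dtype=torch.double,
|                            requires_grad=True) for _ in range(3))
|     torch.autograd.gradcheck(ScaledAttention.apply, (q, k, v))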
| Karuma wrote:
| There are forks that even work on 1.8GB of VRAM! They work great
| on my GTX 1050 2GB.
|
| This is by far the most popular and active right now:
| https://github.com/AUTOMATIC1111/stable-diffusion-webui
| jtap wrote:
| Just as another point of reference: I followed the Windows
| install and I'm running this on my 1060 with 6GB of memory. With
| no setting changes it takes about 10 seconds to generate an
| image. I often run with sampling steps up to 50, and that takes
| about 40 seconds per image.
| rmurri wrote:
| What settings and repo are you using for GTX 1050 with 2GB?
| Karuma wrote:
| I'm using the one I linked in my original post:
| https://github.com/AUTOMATIC1111/stable-diffusion-webui
|
| The only command line argument I'm using is --lowvram, and I
| usually generate pictures at the default settings at 512x512
| image size.
|
| You can see all the command line arguments and what they do
| here: https://github.com/AUTOMATIC1111/stable-diffusion-
| webui/wiki...
| [deleted]
| extesy wrote:
| > This is by far the most popular and active right now:
| https://github.com/AUTOMATIC1111/stable-diffusion-webui
|
| While technically the most popular, I wouldn't call it "by
| far". This one is a very close second (500 vs 580 forks):
| https://github.com/sd-webui/stable-diffusion-webui/tree/dev
| Karuma wrote:
| That's why I said "right now", since I feel that most
| people have moved from the one you linked to AUTOMATIC's
| fork by now. hlky's fork (the one you linked) was by far
| the most popular one until a couple of weeks ago, but some
| problems with the main developer's attitude and a never-
| ending migration from Gradio to Streamlit filled with
| issues made it lose its popularity.
|
| AUTOMATIC has the attention of most devs nowadays. When you
| see any new ideas come up, they usually appear in
| AUTOMATIC's fork first.
| jaggs wrote:
| This needs Windows 10/11 though?
| Karuma wrote:
| Nope. There are instructions for Windows, Linux and Apple
| Silicon in the readme:
| https://github.com/AUTOMATIC1111/stable-diffusion-webui
|
| There's also this fork of AUTOMATIC1111's fork, which also
| has a Colab notebook ready to run, and it's way, way faster
| than the KerasCV version:
| https://github.com/TheLastBen/fast-stable-diffusion
|
| (It also has many, many more options and some nice, user-
| friendly GUIs. It's the best version for Google Colab!)
| jaggs wrote:
| Brilliant thanks.
| sophrocyne wrote:
| While AUTOMATIC is certainly popular, calling it the most
| active/popular would be ignoring the community working on
| Invoke. Forks don't lie.
|
| https://github.com/invoke-ai/InvokeAI
| counttheforks wrote:
| > Forks don't lie.
|
| They sure do. InvokeAI is a fork of the original repo
| CompVis/stable-diffusion and thus shares its fork counter.
| Those 4.1k forks are coming from CompVis/stable-diffusion,
| not InvokeAI.
|
| Meanwhile AUTOMATIC1111/stable-diffusion-webui is not a
| fork itself, and has 511 forks.
| pwillia7 wrote:
| Subjectively, AUTOMATIC has taken over -- I have not
| heard of invoke yet but will check it out.
| toqy wrote:
| The only reason to use it imo has been if you need mac/m1
| support, but that's probably in other forks by now
| sophrocyne wrote:
| Welp - TIL.
|
| Thanks for the correction.
|
| Any idea on how to count forks of a downstream fork? If
| anyone would know... :)
| rcarmo wrote:
| This is _markedly_ faster than the PyTorch versions I've seen
| (nothing against the library, just categorizing the
| implementations). It would be nice to see this include the little
| quality-of-life additional models (eye fixes, upscaling, etc.),
| but I suspect the optimizations are transferrable.
|
| Either way, getting 3 images at 25 iterations in under 10 seconds
| (in a quick Colab test, which is where I've taken to comparing
| these things) is just ridiculously fast.
| zone411 wrote:
| Which GPU did you test on Colab? Are you comparing with one of
| the fp16 PyTorch versions? Their test shows little improvement
| on V100.
|
| PyTorch is now quite a bit more popular than Keras in research-
| type code (except when it comes from Google) so I don't know if
| these enhancements will get ported. This port was done by
| people working on Keras which is kind of telling - there isn't
| a lot of outside interest.
| _ntka wrote:
| This is not true; the initial Keras port of the model was done by
| Divam Gupta, who is not affiliated with Keras or Google. He works
| at Meta.
|
| The benchmark in the article uses mixed precision (and
| equivalent generation settings) for both implementations,
| it's a fair benchmark.
|
| In the latest StackOverflow global developer survey,
| TensorFlow had 50% more users than PyTorch.
| zone411 wrote:
| Two Keras creators are listed as authors on this post. If
| they were not involved, this should be specified. I
| specifically talked about research and StackOverflow is not
| in any way representative of what's used. Do you disagree
| that the majority of neural net research papers now only
| have PyTorch implementations, not TensorFlow? Also,
| according to Google Trends, PyTorch is more popular: https:
| //trends.google.com/trends/explore?geo=US&q=pytorch,te....
| BTW, I would love it if TF made a strong comeback, it's
| always better to have two big competing frameworks and I
| have some issues with PyTorch, including with its
| performance.
| polygamous_bat wrote:
| > In the latest StackOverflow global developer survey,
| TensorFlow had 50% more users than PyTorch.
|
| It also doesn't help that PyTorch has its own discussion
| forum [1] where most pytorch questions end up.
|
| [1]: https://discuss.pytorch.org/
| kgwgk wrote:
| Should we expect people not working on keras to have the
| interest and ability to get it to work on keras?
| zone411 wrote:
| If these people have existing Keras code they want to
| integrate or they are interested in developing it further
| in Keras, then it shouldn't require any insider knowledge
| to create a Keras version of a small but popular open-
| source project like this. I am very sure we'd get a PyTorch
| version made by outsiders quickly if Stable Diffusion was
| originally released in Keras/TF.
| kgwgk wrote:
| What is your definition of outsider?
|
| We got a Keras version made by Divam Gupta very quickly
| after Stable Diffusion was released.
|
| Is he not an outsider?
| zone411 wrote:
| From what I can tell this Keras version was just released
| (the date on the post is Sep. 25) and the first author
| listed is the creator of Keras. Is this incorrect? I am
| not familiar with Divam Gupta and I would consider
| outsiders to be people not paid by Google.
| kgwgk wrote:
| https://mobile.twitter.com/divamgupta/status/157123450432
| 020...
|
| https://github.com/divamgupta/stable-diffusion-tensorflow
|
| Now they are working together. That may be "telling" to
| you but I'm not sure why that should cast a negative
| light on Keras, really.
| zone411 wrote:
| I didn't say that it casts a negative light on Keras.
| Just on its popularity among outsiders. There are
| thousands of great libraries out there that are much less
| popular than Keras or PyTorch. And BTW, JAX is a useful
| Google-created framework that's growing in popularity
| among researchers and pushed PyTorch to improve
| (functorch), so I have nothing against Google projects.
| kgwgk wrote:
| The reason why we're having this discussion is that what
| you call a Keras outsider ported Stable Diffusion to
| Keras last week.
|
| It's hard to understand how that can say anything
| negative about the popularity of Keras among outsiders.
| zone411 wrote:
| So why are Keras creators listed as authors on this post
| and why is it on Keras' official site? Compare this to
| hundreds of PyTorch SD forks that have been thrown up on
| GitHub.
|
| The OP was wondering whether additional enhancements will also be
| ported, and that's what I was responding to. It's simply much
| less likely that a new paper will get a Keras implementation than
| a PyTorch implementation.
| nextaccountic wrote:
| Is this faster even after applying the optimizations that reduce
| VRAM usage (some of which the Keras version seems to lack)?
| labarilem wrote:
| Very interesting performance. Also a very good write-up. Can't
| wait to try this.
| gpderetta wrote:
| I have a mediocre GPU but a fast CPU (with a lot of RAM). Would I
| see improvements there?
|
| I guess I should give it a try.
| senthilnayagam wrote:
| Tried it yesterday; on an Intel i9 MacBook Pro it takes about 300
| seconds per image.
| gpderetta wrote:
| You mean the Keras version? How does it compare to the original
| one? Currently on my 10850K I get 2.4s/iteration, which is
| borderline usable. I haven't managed (nor tried very hard) to get
| the CUDA version working on my 1070; I expect it to be a little
| better, but I don't want to fight with RAM issues.
| ttflee wrote:
| How many steps did you perform?
|
| I tried some and found no major differences after 16 steps or so
| with a given random seed.
| ttflee wrote:
| On an Intel MacBook Pro (2020), CPU-only, the original
| implementation [1] using PyTorch utilized only one core. A
| TensorFlow implementation [2] with oneDNN support, which utilized
| most of the cores, ran at ~11 s/iteration. Another OpenVINO-based
| implementation [3] ran at ~6.0 s/iteration.
|
| [1] https://github.com/CompVis/stable-diffusion/
|
| [2] https://github.com/divamgupta/stable-diffusion-tensorflow/
|
| [3] https://github.com/bes-dev/stable_diffusion.openvino/
| gpderetta wrote:
| Yes, I use [3] and I get 2.4s/iter on my 10-core machine. I was
| wondering whether Keras would give additional help here. I'll
| have to try it, I guess.
| erwinh wrote:
| Not necessarily my expertise, but if, as stated in the article, 2
| lines of code can already get a 2x performance gain, what more
| can be done to improve performance in the coming years?
| londons_explore wrote:
| It's not two lines of code... It's 2 lines that enable tens of
| thousands of lines of library code by invoking a new
| optimizer...
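|
| For reference, the two lines in question (roughly as they appear
| in the keras.io article) enable mixed precision and XLA
| compilation:
|
|     import keras
|     import keras_cv
|
|     # compute in float16 while keeping float32 variables
|     keras.mixed_precision.set_global_policy("mixed_float16")
|
|     # XLA-compile the end-to-end generation graph
|     model = keras_cv.models.StableDiffusion(jit_compile=True)
|     images = model.text_to_image(
|         "A cute otter in a rainbow whirlpool holding shells, "
|         "watercolor",
|         batch_size=3,
|     )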
| MintsJohn wrote:
| I'm curious whether this really is "the fastest model yet"; there
| are PyTorch optimizations as well.
|
| Something like global optimization has been done in PyTorch;
| here's a blog about it: https://www.photoroom.com/tech/stable-
| diffusion-25-percent-f...
|
| Mixed precision seems pretty much default looking at a few
| Stable Diffusion notebooks.
|
| More intriguingly, there's also a more local optimization that
| makes PyTorch faster: https://www.photoroom.com/tech/stable-
| diffusion-100-percent-...
|
| Unless it's already there, that last one would be interesting
| to add to keras.
|
| All in all, this machine learning ecosystem is wild. As a
| software dev, things like cache locality and preferring
| computation over memory access are basic optimizations, yet in
| machine learning they seem widely disregarded; I've seen models
| happily swapping between GPU and system memory to do NumPy
| calculations.
|
| Hopefully Stable Diffusion changes things; the work on
| optimizations is there, it just often seems disregarded. Since
| Stable Diffusion is a popular open model that, when optimized,
| can be run locally (and not as SaaS, where you just add extra
| compute power, which seems cheaper than engineers), and has a lot
| of enthusiasm behind it, it might just be the spark that makes
| optimization sexy again.
| shadowgovt wrote:
| Bonus points for this article being one of the clearest
| explanations of how Stable Diffusion works that I've seen to
| date.
| unspecldn wrote:
| How do I deploy this? Can someone offer some guidance please?
| monkmartinez wrote:
| Is the H5 file type that much different from whatever the PyTorch
| versions are using?
|
| The model is loaded from Huggingface during instantiation of the
| StableDiffusion class. It is loaded as an H5 file, which I
| believe is unique to Keras [0]. I don't have any experience with
| Keras, so I can't say whether that is good or bad. I wanted to
| see where they were getting the weights, as the blog post didn't
| demonstrate an explicit loading function/call like PyTorch does.
|
| Gonna run it and see... although I have like 40GB of stable
| diffusion weights on my computer now.
|
| [0] https://github.com/keras-team/keras-
| cv/blob/master/keras_cv/...
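|
| For comparison, the Keras side of that workflow is roughly the
| following (toy model; keras_cv presumably fetches the real H5
| file with keras.utils.get_file() and then calls load_weights()
| the same way):
|
|     from tensorflow import keras
|
|     # toy stand-in for the text encoder / diffusion model
|     model = keras.Sequential(
|         [keras.layers.Dense(8, input_shape=(4,))])
|     model.save_weights("toy_weights.h5")
|
|     # a fresh model with the same architecture restores them
|     restored = keras.Sequential(
|         [keras.layers.Dense(8, input_shape=(4,))])
|     restored.load_weights("toy_weights.h5")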
| mikereen wrote:
| enhance
| xiphias2 wrote:
| "Note that when running on a M1 MacBookPro, you should not enable
| mixed precision, as it is not yet well supported by Apple's Metal
| runtime"
|
| It is a bit sad if this is just a closed-software issue that
| cannot be fixed :(
| ribit wrote:
| Mixed precision won't do anything on Apple Silicon anyway since
| there is no performance advantage to using FP16 (aside from
| decreasing register pressure and RAM bandwidth which won't
| happen here as data is FP32 to start with).
| capableweb wrote:
| Is it really that sad? Closed software/hardware won't get support
| (official or community) for things until the maintainer of the
| software adds it, and people who buy that kind of hardware are
| more than aware of this pitfall (and in fact sometimes even see
| it as a benefit).
| lynndotpy wrote:
| I'm a new macOS user and, while I did anticipate some of these
| issues, I still often find myself surprised when running into
| them. This was one such surprise I hit recently.
| nextaccountic wrote:
| Does this run on AMD?
|
| A problem I see is that a lot of the time everything works fine
| on ROCm+HIP, but since Nvidia dominates the machine learning
| market (and thus most researchers run Nvidia), most forks don't
| bother checking and just advertise compatibility with Nvidia and
| sometimes Apple M1.
|
| Problem is, AMD GPUs are much cheaper!
| mrtksn wrote:
| Well, the high-end stuff is always on Nvidia, and Apple Silicon
| seems to get some love because of its unified memory, which makes
| it possible in the first place, plus its popularity among
| developers.
|
| AMD seems to be popular among gamers on a budget, and the budget
| cards often don't have the required VRAM by default. So AMD seems
| to be in this weird place where the people who can make it work
| don't care.
| mrtranscendence wrote:
| For what it's worth, at the consumer level AMD cards -- at
| least recently -- have tended to have more VRAM than Nvidia
| cards. My 3080 Ti, which I bought for $1400 (though it now
| goes for ~$1k), has less RAM (12GB) than a 6800 XT that you
| can get for $600 (16GB).
| cypress66 wrote:
| > Problem is, AMD GPUs are much cheaper!
|
| Are they? I believe Nvidia (consumer) gpus have better
| price/performance than amd for AI.
| nextaccountic wrote:
| I don't know about AI performance (does this happen only because
| of the overhead of providing CUDA through ROCm+HIP?), but I was
| just checking, and at least in my country (Brazil), for any given
| memory size (12GB, 8GB, 4GB) I can find cheaper AMD GPUs than
| Nvidia GPUs.
|
| Here I'm considering that the main constraint is VRAM, and while
| Stable Diffusion now runs even on GPUs with 2GB of RAM, there are
| always new developments that require more VRAM (for example,
| Dreambooth requires 12GB as of today).
| mrtranscendence wrote:
| Maybe for AI? For other tasks, especially gaming, they punch
| well above their weight relative to Nvidia (though they lack
| features in comparison). It's also possible to get a 16GB
| card for much cheaper from AMD than Nvidia.
| gdubs wrote:
| Has anyone tried running this with an AMD card on Mac? At first
| glance it's able to run on Metal (given the M1 compatibility)...
| mrtksn wrote:
| On a 16GB 8c8g MacBook Air M1, the PyTorch implementation takes
| about 3.6s/step, which is about 3 minutes per image with the
| default parameters. I wonder how much faster this would be. If
| there's anyone out there with a similar system who wants to
| compare, could you please write up your findings?
| thisisjasononhn wrote:
| Not an M1 comparison, but I'm working on testing various GPU vs
| M1 setups with a few accessible cloud providers. My impression is
| that times should be about the same, but it's nice to hear other
| real-world stats for M1 with SD. Makes me really want to rent the
| Hetzner M1 now.
|
| Which repo or build are you using, BTW? Is it the one related to
| this readme?
|
| https://github.com/magnusviri/stable-diffusion/blob/main/REA...
| stared wrote:
| I would love to see it, but this file is not accessible.
| thisisjasononhn wrote:
| Sorry about that, web link rot sure is real eh.
|
| This is an example of the original file:
| https://github.com/magnusviri/stable-
| diffusion/blob/79ac0f34...
|
| Which seems to have been renamed, and cleaned up a bit
| here: https://github.com/magnusviri/stable-
| diffusion/blob/main/doc...
|
| However, per the note on the magnusviri repo, the following
| repo should be used for a stable set of this SD Toolkit:
| https://github.com/invoke-ai/InvokeAI
|
| with instructions here https://github.com/invoke-
| ai/InvokeAI/blob/main/docs/install...
| mrtksn wrote:
| >Which repo or build are you using BTW, is it the one related
| to this readme? https://github.com/magnusviri/stable-
| diffusion/blob/main/REA...
|
| Yes, this one. However it was like a month ago I think, so
| speeds might have improved. I'm getting ~2.2s/step with
| another implementation:
| https://news.ycombinator.com/item?id=33006447
| thisisjasononhn wrote:
| Wow, that sounds like a good improvement.
|
| I am also wondering, do you follow the general advice of 1
| iteration and 1 sample, for example:
|
| --n_samples 1 --n_iter 1 (when referencing commands using
| txt2img.py)
|
| I figure you could wait a bit for things to process going
| further, but I'm curious whether you're getting results like that
| with higher sample/iter settings.
| mrtksn wrote:
| I usually go with the default parameters.
| mft_ wrote:
| I've not tried it, but this approach apparently takes 10-20s
| per image?
|
| https://reddit.com/r/StableDiffusion/comments/xbo3y7/oneclic...
| mrtksn wrote:
| I just gave it a spin: it took 1 min 52 sec for a 50-step image,
| which is ~2.2s/step. It seems faster than my original
| installation (which might also have improved, as it was at a very
| beta stage when I tried it), but definitely not 20 seconds for a
| 50-step image at 512x512 resolution.
|
| Maybe they use lower parameters.
|
| edit:
|
| 50 steps at 256x256 resolution took 55 seconds.
|
| 50 steps at 768x768 resolution took exactly 8 minutes.
|
| PS: my MacBook Air is modified with thermal pads, so it takes a
| bit longer to start throttling than usual. Either way, it's very
| dependent on the ambient temperature.
| WatchDog wrote:
| I don't quite understand the benefit of mixed precision.
|
| It seems like using high precision is useful for training, but if
| not training, why not just use float16 weights and save the
| memory?
| NavinF wrote:
| Converting the weights to float16 after training will reduce
| quality/accuracy, whereas mixed precision has a negligible effect
| on quality/accuracy and dramatically improves performance.
|
| If you really just want to save memory, there's plenty of other
| low-hanging fruit. It's just not a priority for most devs, since
| mid-tier GPUs start at 10GB whereas a typical model only has
| ~0.5GB of weights. Activations and intermediate calculations use
| far more memory.
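|
| To make the two options concrete, a minimal CUDA-only PyTorch
| sketch (nothing to do with the Stable Diffusion code itself):
|
|     import copy
|     import torch
|     import torch.nn as nn
|
|     model = nn.Sequential(nn.Linear(512, 512), nn.GELU(),
|                           nn.Linear(512, 512)).cuda()
|     x = torch.randn(8, 512, device="cuda")
|
|     # Option A: cast a copy of the weights to float16 -- halves
|     # weight memory, but can cost quality/accuracy
|     model_fp16 = copy.deepcopy(model).half()
|     out_a = model_fp16(x.half())
|
|     # Option B: mixed precision -- weights stay float32, matmuls
|     # run in float16 under autocast
|     with torch.autocast(device_type="cuda", dtype=torch.float16):
|         out_b = model(x)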
| zone411 wrote:
| You usually can. But it can take some work if you're using any
| libraries that expect FP32 and it might be slower, depending on
| the GPU. The FP16 support isn't quite as good as FP32.
| dennisy wrote:
| This is amazing! I am more used to TF so very happy to see this!
|
| Has anyone got a suggestion on how to fine tune this model?
| itronitron wrote:
| someone should compare results with just doing a keyword search
| on deviantart
| JoeAltmaier wrote:
| The otter examples highlight something you can't control using
| these things: the 'eats shoots and leaves' phenomenon.
|
| The prompt was "A cute otter in a rainbow whirlpool holding
| shells, watercolor"
|
| Seems like the otter should be holding shells, the way a normal
| human parses it.
|
| The tool showed the otter 'in holding-shells', which are shells
| that hold otters apparently. Also some random shells strewn
| about, as the technique is sensitive to spurious detail sprouting
| up from single words.
|
| Until the tool permits some kind of syntactic diagramming or so
| forth, we'll not be able to control for this.
|
| Just the other day here, I saw a picture of a fork and some
| plastic mushrooms. The prompt was 'plastic eating mushrooms'
| which was ambiguous even to humans. The tool chose to illustrate
| the subclass of mushrooms 'eating-mushrooms' (as opposed to
| poison mushrooms or decorative mushrooms I suppose) made of
| plastic.
|
| When we're playing around this can seem whimsical and artistic.
| But a graphic designer might want some semblance of control over
| the process.
|
| Not sure how a solution would work.
| CuriouslyC wrote:
| Graphic designers lean on img2img in their workflows more than
| txt2img, as that gives you the control you speak of.
| UncleEntity wrote:
| My favorite is when you do "<whatever> bla, bla, bla, wearing a
| t-shirt by <artist>" and it gives you an image of <whatever>
| wearing a t-shirt with a print in the style of that artist. Which
| adds extra dimensions to play with, so it isn't all that bad.
| CrazyStat wrote:
| This is the compositionality problem -- the language model
| sometimes doesn't quite know how to put the words together.
| Better language models will help in the future; in the meantime
| you can give it a helping hand with prompt engineering or by
| using img2img.
| honksillet wrote:
| Can this be used to train your own model? I have a moderately
| large medical image dataset that I would like to try this with
| for data augmentation.
| jawadch93 wrote:
___________________________________________________________________
(page generated 2022-09-28 23:00 UTC)