[HN Gopher] The Illustrated Stable Diffusion
___________________________________________________________________
The Illustrated Stable Diffusion
Author : mariuz
Score : 223 points
Date : 2022-10-04 17:59 UTC (5 hours ago)
(HTM) web link (jalammar.github.io)
(TXT) w3m dump (jalammar.github.io)
| uptown wrote:
| "We then compare the resulting embeddings using cosine
| similarity. When we begin the training process, the similarity
| will be low, even if the text describes the image correctly."
|
| How is this training performed? How is accuracy rated?
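|
| (For context, CLIP-style training compares a batch of images
| against their captions: matching pairs should get high cosine
| similarity, mismatched pairs low. A rough sketch in PyTorch,
| not the article's code; the two encoders are hypothetical
| stand-ins:)
|
|     import torch
|     import torch.nn.functional as F
|
|     def clip_step(image_encoder, text_encoder, images, captions):
|         img = F.normalize(image_encoder(images), dim=-1)   # (N, d)
|         txt = F.normalize(text_encoder(captions), dim=-1)  # (N, d)
|         sims = img @ txt.T / 0.07        # cosine similarities / temperature
|         target = torch.arange(len(images))  # i-th image matches i-th caption
|         # Loss falls as matching pairs become the most similar in the batch,
|         # so the similarity itself is the training signal.
|         return (F.cross_entropy(sims, target) +
|                 F.cross_entropy(sims.T, target)) / 2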
| marshray wrote:
| Nope, still don't understand it. :-/
| torbTurret wrote:
| Love the visual explainers for machine learning nowadays.
|
| The author has more here: https://jalammar.github.io/
|
| Amazon has some highly interactive ones here: https://mlu-
| explain.github.io/
|
| Google had: distill.pub
|
| Hope to see education in this space grow more.
| vanjajaja1 wrote:
| This is the perfect level of description, thank you. Looking
| forward to checking out more of your work.
| minism wrote:
| Great overview, I think the part for me which is still very
| unintuitive is the denoising process.
|
| If the diffusion process is removing noise by predicting a final
| image and comparing it to the current one, why can't we just jump
| to the final predicted image? Or is the point that because it's an
| iterative process, each noise step results in a different "final
| image" prediction?
| mota7 wrote:
| The problem is that predicting a pixel requires knowing what
| the pixels around it look like. But if we start with lots of
| noise, then the neighboring pixels are all just noise and have
| no signal.
|
| You could also think of this as: We start with a terrible
| signal-to-noise ratio. So we need to average over very large
| areas to get any reasonable signal. But as we increase the
| signal, we can average over a smaller area to get the same
| signal-to-noise ratio.
|
| In the beginning, we're averaging over large areas, so all the
| fine detail is lost. We just get 'might be a dog? maybe??'.
| What the network is doing is saying "if this is a dog, there
| should be a head somewhere over here. So let me make it more
| like a head". Which improves the signal-to-noise ratio a bit.
|
| After a few more steps, the signal is strong enough that we can
| get sufficient signal from smaller areas, so it starts saying
| 'head of a dog' in places. So the network will then start doing
| "Well, if this is a dog's head, there should be some eyes.
| Maybe two, but probably not three. And they'll be kinda
| somewhere around here".
|
| Why do it this way?
|
| Doing it this way means the network doesn't need to learn
| "Here are all the ways dogs can look". Instead, it can learn a
| factored representation: A dog has a head and a body. The
| network only needs to learn a very fuzzy representation at this
| level. Then a head has some eyes and maybe a nose. Again, it
| only needs to learn a very fuzzy representation and (very)
| rough relative locations.
|
| So it's only when it gets right down into fine detail that it
| actually needs to learn a pixel-perfect representation. But
| this is _way_ easier, because over small areas images have
| surprisingly low entropy.
|
| The 'text-to-image' bit is just a twist on the basic idea. At
| the start when the network is going "dog? or it might be a
| horse?", we fiddle with the probabilities a bit so that the
| network starts out convinced there's a dog in there somewhere.
| At which point it starts making the most likely places look a
| little more like a dog.
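|
| (In Stable Diffusion this "fiddling" is classifier-free
| guidance: the noise prediction is pushed away from the
| unconditional prediction and towards the text-conditioned one.
| A rough sketch, with a hypothetical noise-predicting unet:)
|
|     def guided_noise(unet, latents, t, text_emb, uncond_emb, scale=7.5):
|         noise_text = unet(latents, t, text_emb)      # "assume there's a dog"
|         noise_uncond = unet(latents, t, uncond_emb)  # no prompt at all
|         # Exaggerate whatever the prompt adds to the prediction.
|         return noise_uncond + scale * (noise_text - noise_uncond)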
| astrange wrote:
| Research is still ongoing here, but it seems like diffusion
| models, despite being named after the noise addition/removal
| process, don't actually owe their success to it.
|
| There's a paper (which I can't remember the name of) that shows
| the process still works with different information removal
| operators, including one with a circle wipe, and one where it
| blends the original picture with a cat photo.
|
| Also, this article describes CLIP being trained on text-image
| pairs, but Google's Imagen uses an off the shelf text model so
| that part doesn't seem to be needed either.
| krackers wrote:
| I think it might be this paper [1] succinctly described by the
| author in this twitter thread [2]
|
| [1] https://arxiv.org/abs/2208.09392 [2] https://twitter.com/
| tomgoldsteincs/status/156250381442263040...
| jayalammar wrote:
| Two diffusion processes are involved:
|
| 1- Forward Diffusion (adding noise, and training the Unet to
| predict how much noise is added in each step)
|
| 2- Generating the image by denoising. This doesn't predict the
| final image; each step only predicts a small slice of noise
| (the removal of which leads to images similar to what the model
| encountered in step 1).
|
| So it is indeed an iterative process in that way, with each
| step moving a little closer to the final image.
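|
| (A rough sketch of process 1, the training objective, assuming
| a hypothetical noise-predicting unet and a noise schedule
| alpha_bar; this is the standard DDPM-style loss, not code from
| the post:)
|
|     import torch
|     import torch.nn.functional as F
|
|     def training_step(unet, x0, alpha_bar):
|         # Pick a random timestep for each training image.
|         t = torch.randint(0, len(alpha_bar), (x0.shape[0],))
|         noise = torch.randn_like(x0)
|         a = alpha_bar[t].view(-1, 1, 1, 1)
|         # Forward diffusion: jump straight to the noisy version at step t.
|         noisy = a.sqrt() * x0 + (1 - a).sqrt() * noise
|         # The UNet is trained to predict how much noise was added.
|         return F.mse_loss(unet(noisy, t), noise)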
| [deleted]
| hanrelan wrote:
| I was wondering the same and this video [1] helped me better
| understand how the prediction is used. The original paper isn't
| super clear about this either.
|
| The diffusion process predicts the total noise that was added
| to the image. But that prediction isn't great and applying it
| immediately wouldn't result in a good output. So instead, the
| noise is multiplied by a small epsilon and then subtracted from
| the noisy image. That process is iterated to get to the final
| result.
|
| [1]https://www.youtube.com/watch?v=J87hffSMB60
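|
| (A minimal sketch of that loop, with a hypothetical unet; real
| samplers like DDPM/DDIM compute the per-step coefficients from
| the noise schedule instead of using a fixed epsilon:)
|
|     import torch
|
|     def sample(unet, shape, steps=50, eps=0.01):
|         x = torch.randn(shape)            # start from pure noise
|         for t in reversed(range(steps)):
|             pred_noise = unet(x, t)       # estimate of the noise in x
|             x = x - eps * pred_noise      # remove only a small slice of it
|         return x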
| cgearhart wrote:
| I'm pretty sure it's a stability issue. With small steps the
| noise is correlated between steps; if you tried it in one big
| jump then you would essentially just memorize the input data.
| The maximum noise would act as a "key" and the model would
| memorize the corresponding image as the "value". But if we do
| it as a bunch of little steps then the nearby steps are
| correlated and in the training set you'll find lots of groups
| of noise that are similar which allows the model to generalize
| instead of memorizing.
| nullc wrote:
| You can think of it like solving a differential equation
| numerically. The diffusion model encodes the relationships
| between values in sensible images (technically in the
| compressed representations of sensible images). You can try to
| jump directly to the solution but the result won't be very good
| compared to taking small steps.
| Waterluvian wrote:
| Closer. But I still get lost when words like "tensor" are used.
| "structured lists of numbers" really doesn't seem to explain it
| usefully.
|
| This reminds me that explaining seemingly complex things in
| simple terms is one of the most valuable and rarest skills in
| engineering. Most people just can't. And often that's because
| they no longer remember what's not general knowledge. You end
| up with a recursive Feynmannian "now explain what that means"
| situation.
|
| This is probably why I admire a whole bunch of engineering
| YouTubers and other engineering "PR" people for their brilliance
| at making complex stuff seem very very simple.
| 6gvONxR4sf7o wrote:
| You're talking like using jargon makes something a bad
| explanation, but maybe you just aren't the audience? Why not
| use words like that if it's a super basic concept to your
| intended audience?
| Waterluvian wrote:
| I saw the scientific term, "Text Understander" and wrongly
| thought I was the audience.
| Waterluvian wrote:
| I should have added that the images/figures really help. I
| think I'm about there.
| netruk44 wrote:
| If it helps you to understand at all, assuming you have a CS
| background, any time you see the word "tensor" you can replace
| it with "array" and you'll be 95% of the way to understanding
| it. Or "matrix" if you have a mathematical background.
|
| Whereas CS arrays tend to be 1-dimensional, and sometimes
| 2-dimensional, tensors can have as many dimensions as you need. A
| 256x256 photo with RGB channels would be stored as a [256 x 256
| x 3] tensor/array. If you want to store a bunch of them? Add a
| dimension to store each image. Want rows and columns of images?
| Make the dimensions [width x height x channels x rows x
| columns].
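|
| (A quick illustration of the same idea with PyTorch shapes;
| nothing here is from the article:)
|
|     import torch
|
|     image = torch.rand(256, 256, 3)       # one 256x256 RGB image
|     batch = torch.rand(8, 256, 256, 3)    # a stack of 8 such images
|     grid = torch.rand(4, 5, 256, 256, 3)  # 4 rows x 5 columns of images
|     print(image.ndim, batch.ndim, grid.ndim)  # 3 4 5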
| Waterluvian wrote:
| This helps. Thank you. Any advice on where to look to
| understand why the word tensor was used?
| pvarangot wrote:
| A Tensor is a mathematical object for symbolic manipulation
| of relationships between other objects that belong in
| conceptually similar universes or spaces. Literature on
| Deep Learning, like Goodfellow, calls for the CS-minded
| reader to just assume it's a fancy word for a matrix of
| more than two dimensions. That makes matters confusing
| because mathematically you could have scalar or vectorial
| tensors. The classic mathematical definition puts more
| restrictions on the "shape" of the "matrix" by requiring
| certain properties so that it can create a symbolic
| language for a tensor calculus. Then you can study how the
| relationships change as variables on the universes or
| spaces change.
|
| Understanding the inertia tensor in classical mechanics or
| the stress tensor may illustrate where tensors come from,
| and I understand that GR also makes use of a lot of tensor
| calculus that came to be as mathematics developed
| manipulating and talking about tensors. I have a kinda firm
| grasp on some very rudimentary tensor calculus from trying
| to learn GR, and a pretty solid grasp on classical
| mechanics. I've had hour-long conversations with deep
| learning popes and thought leaders in varying states of
| mind, and after that my understanding is that they use the
| word tensor in an overreaching fashion, somewhat like
| calling a solar panel a nuclear fission reactor power source.
| These thought leaders include people with books and 1M+ view
| Youtube videos on the subject that use the word tensor and
| I'm not saying their names because they off-the-record
| admitted that it's a poor choice of term but it harms the
| ego of many Google engineers to publicly admit that.
| amelius wrote:
| How are sum-types handled in deep learning?
|
| E.g., a type that holds "a 3x2 tensor OR a 4x6x2x5
| tensor".
| sva_ wrote:
| It should be noted that it doesn't have all that much to do
| with the rigorous mathematical definition of a Tensor.
| jamessb wrote:
| Very loosely, a number/vector/matrix/tensor can be
| considered to be objects where specifying the values of
| 0/1/2/3 indexes will give a number.
|
| (A mathematician might object to this on several grounds,
| such as that vectors/matrices/tensors are geometric objects
| which need not be expressed numerically as coordinates in
| any coordinate system)
| amelius wrote:
| A tensor can take any number of dimensions (not just 3).
| jayalammar wrote:
| I updated the post to say "multi-dimensional array".
|
| In a context like this, we use tensor because it allows for
| any number of dimensions (while a vector/array is only one and
| a matrix is two). When you get into ML libraries, both
| popular packages PyTorch and TensorFlow use the "tensor"
| terminology.
|
| It's a good point. Hope it's clearer for devs with "array"
| terminology.
| zestyping wrote:
| > we use tensor because it allows for any number of
| dimensions
|
| "Vector" implies one dimension and "matrix" strongly
| implies two. But an array can have any number of
| dimensions, so "array" is the best word.
|
| We don't need the word "tensor"; when the context is
| programming, "tensor" is only confusing and doesn't
| really add any useful meaning.
| [deleted]
| avereveard wrote:
| To reduce it a little, matrix holds numbers, tensor holds
| whatever. Numbers. Vectors. Operations.
| yreg wrote:
| So it is math lingo for array.
| minimaxir wrote:
| A more practical example of the added dimensionality of
| tensors is the addition of a batch dimension, so an 8-image
| batch per training step would be a (8, 256, 256, 3) tensor.
|
| Tools such as PyTorch's DataLoader can efficiently collate
| multiple inputs into a batch.
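|
| (Roughly, in PyTorch; note that PyTorch itself usually puts the
| channel dimension first. The random dataset is a stand-in:)
|
|     import torch
|     from torch.utils.data import DataLoader, TensorDataset
|
|     images = torch.rand(100, 3, 256, 256)   # 100 RGB images, channels-first
|     loader = DataLoader(TensorDataset(images), batch_size=8, shuffle=True)
|     (batch,) = next(iter(loader))
|     print(batch.shape)                       # torch.Size([8, 3, 256, 256])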
| TigeriusKirk wrote:
| One of my favorite moments in Geoffrey Hinton's otherwise
| pretty info-dense Coursera neural network class was when he
| said-
|
| "To deal with a 14-dimensional space, visualize a 3-D space and
| say 'fourteen' to yourself very loudly. Everyone does it."
| [deleted]
| renewiltord wrote:
| Isn't it really cool? It's like the AI is asking itself what
| shapes the clouds are making and whether the moon has a face,
| over and over again.
| coldcode wrote:
| I find SD to be amazing technology, but it still (mostly) sucks
| at producing "intelligent" images. It's basically fancy math that
| turns noise into images (the reverse of the noising it was
| trained on) but still has no idea what it is producing. If you
| run it long enough you eventually get lucky and find a gem. I
| like to try "George Washington riding a Unicorn in Times
| Square"; I've so far never gotten anything close to what a
| first-year art student could draw. I wonder how long it will
| take before something more "AI" than "ML" will have an
| understanding even close to what a simple human brain can
| process.
|
| In the meantime it's fun to play with it, plus I'd like to better
| understand the noise training process.
| jw1224 wrote:
| > "George Washington riding a Unicorn in Times Square"
|
| The "secret" to Stable Diffusion (and other CLIP-based models)
| is being as descriptive as possible. This prompt, whilst easy
| for humans to imagine, actually has a whole lot of ambiguity
| baked in.
|
| How high is the unicorn flying? Is the unicorn even flying, or
| on the ground? How old is George Washington? What visual style
| is the image in? Is the image from the perspective of a
| pedestrian at ground level, or from up at skyscraper level?
|
| The more ambiguous the prompt, the less cohesive the image.
|
| To demonstrate, here's 4 renders from your original prompt:
| https://imgur.com/a/Jo4qfOp
|
| And here's 4 using the prompt "George Washington riding a
| unicorn in Times Square, cinematic composition, concept art,
| digital illustration, detailed": https://imgur.com/a/lB36JqC
|
| Certainly not perfect, but for an additional 15 seconds of
| effort, far better.
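|
| (If anyone wants to reproduce this, a rough sketch with Hugging
| Face's diffusers; the model id, seed and fp16 settings are just
| one reasonable choice, not what was used for the imgur renders:)
|
|     import torch
|     from diffusers import StableDiffusionPipeline
|
|     pipe = StableDiffusionPipeline.from_pretrained(
|         "CompVis/stable-diffusion-v1-4",
|         torch_dtype=torch.float16).to("cuda")
|
|     vague = "George Washington riding a unicorn in Times Square"
|     detailed = vague + (", cinematic composition, concept art, "
|                         "digital illustration, detailed")
|
|     # Same seed for both, so the only difference is the prompt.
|     for i, prompt in enumerate((vague, detailed)):
|         gen = torch.Generator("cuda").manual_seed(42)
|         pipe(prompt, generator=gen).images[0].save(f"out_{i}.png")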
| CrazyStat wrote:
| I love how the unicorn horn got stuck on Washington's head
| in the bottom right instead of on the unicorn.
| minimaxir wrote:
| With SD, you _have_ to use quality/positional/artist modifier
| keywords, as vanilla inputs give the model too much freedom.
| dr_dshiv wrote:
| > I like to try "George Washington riding a Unicorn in Times
| Square"; I've so far never gotten anything a first year art
| student can draw.
|
| Why the hell would a first year art student draw that? Flunk
| their ass. God damn dumb ass prompts I have to deal with.
|
| --Stable Diffusion
| Psychoshy_bc1q wrote:
| you might try "Pony Diffusion" for that :-)
| https://huggingface.co/AstraliteHeart/pony-diffusion
| l33tman wrote:
| The reason you can't get the images you want from it is not the
| noise diffusion process (after all, that's probably the closest
| analogue to how a human gets a flash of creativity) but the lack
| of a large language model in SD - it was deliberately scaled
| down so the model could fit on consumer GPUs.
|
| DALLE-2 uses a much larger language model and you can explain
| more complicated concepts to it. Google's Imagen likewise
| (though it hasn't been released).
|
| It's mostly a matter of scaling to get this better.
| astrange wrote:
| It's not just size but also model architecture. DALLE mini
| (craiyon.com) has the opposite priority because of its
| different architecture; you can enter a complex prompt and it
| will follow it, but it's much slower and the image quality is
| a lot worse. SD prefers to make aesthetic pictures over
| listening to everything you tell it.
|
| You can improve this in SD by raising cfg_scale at the cost
| of some weird "oversharpening" artifacts. Or, you can make a
| crappy image in DallE mini and use that as the img2img prompt
| with SD to make it prettier.
|
| The real sign it's lacking intelligence is that if you ask it a
| question, it won't draw the answer; it'll just draw the
| question. Of course, they could fix that too: it's got a GPT
| in it, they just don't let it recurse...
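|
| (With diffusers, cfg_scale is the guidance_scale argument and
| the "crappy image first" trick is the img2img pipeline; a
| sketch only, and argument names have shifted between diffusers
| versions:)
|
|     import torch
|     from PIL import Image
|     from diffusers import StableDiffusionImg2ImgPipeline
|
|     pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
|         "CompVis/stable-diffusion-v1-4",
|         torch_dtype=torch.float16).to("cuda")
|
|     draft = Image.open("draft.png").convert("RGB").resize((512, 512))
|     out = pipe(prompt="a castle on a cliff at sunset, oil painting",
|                image=draft,         # e.g. a rough DALL-E mini output
|                strength=0.6,        # how much SD may change the draft
|                guidance_scale=12    # higher = follow the prompt more
|                ).images[0]
|     out.save("prettier.png")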
| ilaksh wrote:
| It says the final output before pixel space is 64x64x4? How can
| that be enough information?
| Imnimo wrote:
| The way I think of it, we have a 512x512x3 target, so that's
| 48x the information. I don't think it's unreasonable to say
| that far less than 1/48th of the space of 512x512x3 outputs are
| natural images (meaning an image that might actually exist,
| rather than meaningless pixels). So if we think about that
| 64x64x4 tensor as telling us what in the smaller space of
| natural images we should draw, it seems like plenty of
| information. Especially since we also have the information
| stored in the weights of the output network.
| zestyping wrote:
| The amount of information in a 64x64x4 array would depend on
| the precision of the numbers in it, right? For example, a
| 512x512 image in 24-bit colour could be completely encoded in
| a 64x64x4 array if each of the 64 x 64 x 4 = 16,384 values
| had 384 bits of precision.
|
| So, I wonder -- what's the minimum number of bits of
| precision in the 64x64x4 array that would be sufficient for
| this to work?
| l33tman wrote:
| The autoencoder that maps between that and the 512x512x3 RGB
| space was trained together with the model, so it is specialized
| in upscaling the 64x64x4 info to pixel space for this
| particular purpose. It's "just" a factor of 48
| (de)compression.
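|
| (The arithmetic, plus what the decode step looks like with the
| diffusers AutoencoderKL; a sketch, and the real pipeline also
| rescales the latents by a fixed constant before decoding:)
|
|     import torch
|     from diffusers import AutoencoderKL
|
|     print(512 * 512 * 3 / (64 * 64 * 4))  # 48.0, the compression factor
|
|     vae = AutoencoderKL.from_pretrained(
|         "CompVis/stable-diffusion-v1-4", subfolder="vae")
|     latents = torch.randn(1, 4, 64, 64)   # what the denoising loop works on
|     image = vae.decode(latents).sample    # shape (1, 3, 512, 512)
|     print(image.shape)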
| astrange wrote:
| And remember we only care about 8 bits per channel in the
| output (actually less due to JPEG compression), but each
| latent value is a 32-bit float.
| ginger2016 wrote:
| This is awesome! Thank you for creating it. I have been wanting
| to read about Stable Diffusion. I added this to my reading list
| in Safari.
| minimaxir wrote:
| Hugging Face's diffusers library and explainer Colab notebook
| (https://colab.research.google.com/github/huggingface/noteboo...)
| are good resources on how diffusion works in practice codewise.
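|
| (That notebook essentially loads the individual pieces the
| article describes and wires them together by hand; roughly, and
| with the exact loading calls depending on your diffusers
| version:)
|
|     from transformers import CLIPTextModel, CLIPTokenizer
|     from diffusers import (AutoencoderKL, UNet2DConditionModel,
|                            PNDMScheduler)
|
|     repo = "CompVis/stable-diffusion-v1-4"
|     tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
|     text_encoder = CLIPTextModel.from_pretrained(repo,
|                                                  subfolder="text_encoder")
|     unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
|     vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
|     scheduler = PNDMScheduler.from_pretrained(repo, subfolder="scheduler")
|     # text -> tokens -> embeddings -> unet denoising loop -> latents
|     # -> vae decode -> pixels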
| jayalammar wrote:
| Agreed. "Stable Diffusion with Diffusers" and "The Annotated
| Diffusion Model" were excellent and are linked in the article.
| The code in Diffusers was also a good reference.
| yieldcrv wrote:
| What are you guys currently using for Stable Diffusion on OSX
| with M1?
|
| There are so many variants and forks that I don't know which one
| to install any more. Something that takes advantage of Metal and
| the CPU cores.
|
| Any that retains the "upload a sketch and then add a description"
| feature?
| pram wrote:
| I'm using InvokeAI. Follow the instructions and it will work
| flawlessly.
|
| https://github.com/invoke-ai/InvokeAI
| subdane wrote:
| https://github.com/divamgupta/diffusionbee-stable-diffusion-...
| jerpint wrote:
| Every time I need a refresher on transformers, I read the same
| author's post on transformers. Looking forward to this one!
| jmartrican wrote:
| So it's like the How to Draw an Owl meme.
| minimaxir wrote:
| SD has made this meme into a reality, given how easy it is to
| take a sketch and use img2img to get something workable out of
| it.
| thunderbird120 wrote:
| You can do that, yes https://0x0.st/oJVK.webm
| swyx wrote:
| i've been collecting other explanations of how SD works here:
| https://github.com/sw-yx/prompt-eng#sd-model-values
___________________________________________________________________
(page generated 2022-10-04 23:00 UTC)