[HN Gopher] Like diffusion but faster: The Paella model for fast...
___________________________________________________________________
Like diffusion but faster: The Paella model for fast image
generation
Author : webmaven
Score : 72 points
Date : 2023-06-25 13:10 UTC (1 day ago)
(HTM) web link (www.deeplearning.ai)
(TXT) w3m dump (www.deeplearning.ai)
| tehsauce wrote:
| Their main claim of "faster" is unfortunately false.
|
| > Running on an Nvidia A100 GPU, Paella took 0.5 seconds to
| produce a 256x256-pixel image in eight steps, while Stable
| Diffusion took 3.2 seconds
|
| Using the latest methods (torch 2.0 compile, improved schedulers),
| Stable Diffusion takes only about 1 second to generate a 512x512
| image on an A100 GPU. A 256x256 image, 1/4 the size, presumably
| takes less than half that time.
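|
| For reference, this is the kind of setup I mean (a sketch with
| Hugging Face diffusers; the model id and step count are just
| illustrative):
|
|   import torch
|   from diffusers import (StableDiffusionPipeline,
|                          DPMSolverMultistepScheduler)
|
|   pipe = StableDiffusionPipeline.from_pretrained(
|       "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
|   ).to("cuda")
|   # Faster multistep scheduler: similar quality in ~20 steps.
|   pipe.scheduler = DPMSolverMultistepScheduler.from_config(
|       pipe.scheduler.config)
|   # torch 2.0: compile the UNet, the hot loop of the pipeline.
|   pipe.unet = torch.compile(pipe.unet)
|   image = pipe("a plate of paella",
|                num_inference_steps=20).images[0]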
|
| So the corrected title is "Like diffusion but slightly slower and
| lower quality."
| jeron wrote:
| To add, there's a finetuned version of Stable Diffusion 1.5
| that can output 5 fps at 256x256 (0.2 seconds per image)[0],
| so over 2x faster than Paella at 256x256.
|
| [0]:https://www.reddit.com/r/StableDiffusion/comments/z3m97e/mi
| n...
| x3874 wrote:
| Okay, and what has this 'model' to do with paella (patella)?
| Another stupid project name.
| GaggiX wrote:
| Paella is a Spanish dish; one of the researchers is Spanish, and
| food was among the first things the model was good at.
| x3874 wrote:
| I know what a paella is, I am not some backwoods American! I
| would never choose such a stupid name which already has
| long-standing, well-established meanings elsewhere in daily
| life! "Looking for paella? Well, do you mean the food, or some
| tech framework / database etc. du jour?"
|
| I'll give you a hint - an example of a very well-chosen name,
| because it's unique, overloaded with no other (and especially no
| older) meanings, and directly linked to the main product's
| functionality: "Keycloak"
| GaggiX wrote:
| If that's the reason you're whining, then I will explain a
| simple trick: "Paella model"
| moron4hire wrote:
| No worse than Apple computers, Internet cookies, email
| spam...
| eutropia wrote:
| I think it's an analogy.
|
| Paella is a Spanish rice dish: rice cooked with various other
| ingredients like chicken, seafood, peppers, tomatoes, etc.
|
| If an image (a 2-d array of pixels) is a plate of rice,
| standard diffusion models denoise starting from each grain of
| rice (pixel). If you've ever seen a step-by-step output from a
| diffusion model, you know what I'm talking about.
|
| This model makes use of a CNN (convolutional neural net) to
| decode tokens from the image. A CNN takes m-by-m sections of
| elements (i.e. a square of pixels) as input, and translates
| them into 1d vectors as part of the input to the NN. Taking
| "chunks" of the image like this allows the net to learn things
| like edges, shapes, groups of color, etc.
|
| You could consider the convolutional samples as "chunks" in the
| paella: the meat, vegetables, and other goodies that make the
| dish beloved by many.
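|
| Concretely, the chunking looks something like this (a toy numpy
| sketch of patch extraction, not the actual model code):
|
|   import numpy as np
|
|   def extract_patches(img, m):
|       """Slice an HxW image into non-overlapping m-by-m "chunks"
|       and flatten each into a 1-d vector, like a conv input."""
|       h, w = img.shape
|       patches = []
|       for i in range(0, h - m + 1, m):
|           for j in range(0, w - m + 1, m):
|               patches.append(img[i:i+m, j:j+m].reshape(-1))
|       return np.stack(patches)
|
|   rice = np.random.rand(256, 256)    # the "plate of rice"
|   chunks = extract_patches(rice, 8)  # (1024, 64): 64-element rows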
| dang wrote:
| " _Please don 't complain about tangential annoyances--e.g.
| article or website formats, name collisions, or back-button
| breakage. They're too common to be interesting._"
|
| https://news.ycombinator.com/newsguidelines.html
| marcus0x62 wrote:
| I don't understand your parenthetical reference - paella and
| patella are very different things...
|
| In any case, I took Paella to be a play on infusion (used in
| making paella) vs. diffusion.
|
| Or maybe these guys just really like rice.
| ben_w wrote:
| Half a second. Can't even type a good prompt that fast.
|
| One of my dad's preferred anecdotes about how much computers sped
| up over his career was the number of digits of pi that the
| company mainframe could compute.
|
| He was born in '39.
|
| And now I can generate images from descriptions faster than I can
| give those descriptions.
|
| At this rate, websites will be replaced with image generators and
| LLMs, and the loading speed won't change.
| interroboink wrote:
| Just to be clear, it's half a second for 256x256, where Stable
| Diffusion takes 3.2 seconds. Still a great speed-up, but not
| producing the big hi-res images people might be thinking of.
| gcanyon wrote:
| Way back when, it was pretty easy to recognize diagrams created
| with MacDraw. There was a particular visual style to the
| primitives it included that flowed through to the final product.
| This was of course easier to notice because there were so few
| alternatives at the time.
|
| Given that Paella uses tokens instead of the source image, I
| wonder if the results will have a (human- or machine-) detectable
| "style" to them.
| sbierwagen wrote:
| The usual answer to "all AI art looks the same" is
| https://i.redd.it/jvwyyqn7776a1.jpg
| omnicognate wrote:
| Perhaps if you ask an 11 year old.
| RobotToaster wrote:
| GitHub, for those looking for the code:
| https://github.com/dome272/Paella
| FloatArtifact wrote:
| A question out of curiosity: why can't it train on 256x256
| pixels yet generate any size of image? And if it was trained on
| multiple sizes of images, could you generate at a larger size
| without upscaling?
| dplavery92 wrote:
| Presumably a transformer model or similar that uses positional
| encodings for the tokens could do that, but the U-Net decoder
| here uses a fixed-shape output and learns relationships between
| tokens (and sizes of image features) based on the positions of
| those tokens in a fixed-size vector. You could still apply this
| process convolutionally and slide the entire network around to
| generate an image that is an arbitrary multiple of the token
| size, but image content in one area of the image will only be
| "aware" of image content at a fixed-size neighborhood (e.g.
| 256x256).
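|
| Roughly what I mean by sliding it around (a toy sketch; `net` is
| a stand-in for the fixed-input-size network, and the grid is
| assumed to be a multiple of the window size):
|
|   import numpy as np
|
|   def slide_network(tokens, net, win=32):
|       """Apply a fixed-size net to each win x win window of a
|       larger token grid; each output region only "sees" its own
|       fixed-size neighborhood."""
|       H, W = tokens.shape[:2]
|       out = np.zeros_like(tokens, dtype=float)
|       for i in range(0, H, win):
|           for j in range(0, W, win):
|               window = tokens[i:i+win, j:j+win]
|               out[i:i+win, j:j+win] = net(window)
|       return out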
| wincy wrote:
| As someone who doesn't know the "why" but uses Stable Diffusion
| a lot and has an intuitive feel for the "what" of what happens,
| it's like trying to use a low-res pattern for your wallpaper on
| Windows. Either it just repeats over and over, so you end up
| with weird multi-headed people with heads on top of their heads,
| or you upscale, which hallucinates details in a totally
| different way that doesn't really add new interesting detail.
|
| With automatic1111 you can get around this by upscaling then
| inpainting the spots you want more detail and specifying a
| specific prompt for that particular area.
| kmeisthax wrote:
| Well, it's complicated.
|
| The model can't just work on arbitrary image sizes because the
| model was trained with a fixed number of input and output
| neurons. For example, 512x512 is Stable Diffusion's "native
| size." However, there are tricks to work around this.
|
| Diffusion models work by predicting image noise, which is then
| subtracted from the image iteratively until you get a result
| that matches the prompt. Stable Diffusion specifically has the
| following architectural features:
|
| - A Variational Autoencoder (VAE) layer that encodes the
| 512x512 input into a 64x64 latent space[0]
|
| - Three cross-attention blocks that take the encoded text
| prompt and input latent-space image, and output a downscaled
| image to the next layer
|
| - A simpler downscaling block that just has a linear and
| convolutional layer
|
| - _Skip connections_ between the last four downscaling blocks
| and corresponding _upscaling_ blocks that do the opposite, in
| the opposite order (e.g. simple upscale, then three cross-
| attention blocks).
|
| - The aforementioned opposite blocks (upscale + cross-attn
| upscale)
|
| - VAE decoder that goes from latent space back to a 512x512
| output
|
| At the end of this process you get what the combined model
| thinks is noise in the image according to the prompt you gave
| it. You then subtract the noise and repeat for a certain number
| of iterations until done. So obviously, if you wanted a smaller
| image, you could crop the input and output at each iteration so
| that the model can only draw in the 'center'.
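|
| Schematically, the outer loop looks like this (pseudo-Python
| with stand-in callables, not Stable Diffusion's real API):
|
|   import torch
|
|   def denoise(unet, scheduler, vae, text_emb, steps=20):
|       """Predict noise, remove a bit of it, repeat, then decode
|       the final latent back to pixels."""
|       latents = torch.randn(1, 4, 64, 64)  # start from pure noise
|       for t in scheduler.timesteps[:steps]:
|           # "What part of this latent is noise, given the prompt?"
|           noise_pred = unet(latents, t, text_emb)
|           # Subtract (a scaled amount of) the predicted noise.
|           latents = scheduler.step(noise_pred, t, latents)
|       return vae.decode(latents)  # 64x64 latent -> 512x512 image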
|
| Larger images are a bit trickier: you have to feed the image
| through in halves and then merge the noise predictions together
| before subtracting. This of course has limitations: since the
| model is looking at only half the image, there's nothing to
| steer the overall process, so it will draw things that look
| locally coherent but make no sense globally[1].
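|
| In sketch form (toy torch code; `predict` is a stand-in for the
| fixed-size noise predictor, with a plain average in the overlap):
|
|   import torch
|
|   def tiled_noise_pred(latents, predict, tile=64, overlap=16):
|       """Run a fixed-size noise predictor over overlapping tiles
|       of a larger latent, averaging where tiles overlap."""
|       H, W = latents.shape[-2:]
|       acc = torch.zeros_like(latents)
|       cnt = torch.zeros_like(latents)
|       step = tile - overlap
|       for i in range(0, H - tile + 1, step):
|           for j in range(0, W - tile + 1, step):
|               patch = latents[..., i:i+tile, j:j+tile]
|               acc[..., i:i+tile, j:j+tile] += predict(patch)
|               cnt[..., i:i+tile, j:j+tile] += 1
|       return acc / cnt.clamp(min=1)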
|
| I _suspect_ - as in, I'm totally guessing here - that we might
| be able to fix that by also running the diffusion process on a
| downscaled version of the image and then scaling the noise
| prediction back up to average with the other outputs. As far as
| I'm aware no SD frontends do this. But if that worked you could
| build up a resolution pyramid of models at different sizes
| taking fragments of the image and working together to denoise
| the image. If you were training from scratch you could even add
| scale and position information to the condition vector so the
| model can learn what image features should exist at what sizes.
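|
| In code, the guess would look something like this (building on
| the tiled sketch above; entirely hypothetical):
|
|   import torch.nn.functional as F
|
|   def pyramid_noise_pred(latents, predict, native=64):
|       """Blend myopic tiled predictions with one global
|       prediction made at the model's native size (hypothetical)."""
|       local = tiled_noise_pred(latents, predict)
|       small = predict(F.interpolate(latents, size=(native, native)))
|       coarse = F.interpolate(small, size=latents.shape[-2:])
|       return (local + coarse) / 2  # simple average of the two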
|
| [0] Think of it as if every pixel of the latent-space image
| were, instead of RGB, four different channels' worth of
| information about the distribution of pixels in the color-space
| image. This compresses the image so that the U-Net part of the
| model can be architecturally simpler - in fact, a lot of machine
| learning research is about finding new ways to compress data
| into a smaller number of input neurons.
|
| [1] More so than diffusion models normally do
| nodja wrote:
| Note that Paella is a bit old in image-model terms (Nov 2022),
| and modern Stable Diffusion tools have access to optimized
| workflows.
|
| My 3060 can generate a 256x256 image in 8 steps in 0.5 seconds, no
| A100 needed. A 3090 is double the performance of a 3060 at
| 512x512, and an A100 is 50% faster than a 3090...
|
| If you have access to a high-end consumer GPU (4090) you can
| generate 512x512 images in less than a second; it's reached the
| point where you can increase the batch size and have it show 2-4
| images per prompt without adversely affecting your workflow.
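|
| With diffusers that's a single keyword argument (assuming a
| pipeline set up like the one sketched upthread):
|
|   # Generate several candidates per prompt in one batched pass.
|   images = pipe("a plate of paella", num_inference_steps=20,
|                 num_images_per_prompt=4).images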
|
| Too bad SD1.5 is too small* and we'll require models with more
| parameters if we want a true general-purpose image model. If
| SD1.5 were the end-game, we'd have truly instant high-res image
| generation in just a couple more generations of GPUs: think
| generating images in real time as you type the prompt, or having
| sliders that affect the strength of certain tokens and show the
| effects in real time, etc. Though I heard that SDXL is actually
| faster at higher resolutions (>1024x1024) due to removing
| attention on the first layer, making it scale better with
| resolution even though SDXL has 4x the parameter count.
|
| * Current SD1.5 models that can generate consistently
| high-quality images have been fine-tuned and merged so many
| times that a lot of general knowledge has been lost; e.g. they
| can be great at generating landscapes but lacking at generating
| humans, or very good at a certain style like comics but able to
| do only that style, losing the ability to generate more dynamic
| face variations, etc.
___________________________________________________________________
(page generated 2023-06-26 23:00 UTC)