[HN Gopher] Like diffusion but faster: The Paella model for fast...
       ___________________________________________________________________
        
       Like diffusion but faster: The Paella model for fast image
       generation
        
       Author : webmaven
       Score  : 72 points
        Date   : 2023-06-25 13:10 UTC (1 day ago)
        
 (HTM) web link (www.deeplearning.ai)
 (TXT) w3m dump (www.deeplearning.ai)
        
       | tehsauce wrote:
        | Unfortunately, their main claim of being "faster" is false.
       | 
       | > Running on an Nvidia A100 GPU, Paella took 0.5 seconds to
       | produce a 256x256-pixel image in eight steps, while Stable
       | Diffusion took 3.2 seconds
       | 
        | Using the latest methods (torch 2.0 compile, improved schedulers),
        | Stable Diffusion takes only about 1 second to generate a 512x512
        | image on an A100 GPU. A 256x256 image, a quarter of the pixels,
        | presumably takes less than half that time.
       | 
       | So the corrected title is "Like diffusion but slightly slower and
       | lower quality."
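        | 
        | For reference, a minimal sketch of that kind of optimized setup,
        | assuming the Hugging Face diffusers library (the model id, step
        | count, and prompt here are illustrative, not a benchmark):
        | 
        |     import torch
        |     from diffusers import (StableDiffusionPipeline,
        |                            DPMSolverMultistepScheduler)
        | 
        |     # Load SD 1.5 in fp16 on the GPU.
        |     pipe = StableDiffusionPipeline.from_pretrained(
        |         "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
        |     ).to("cuda")
        |     # Swap in a faster scheduler: good images in ~20 steps.
        |     pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        |         pipe.scheduler.config)
        |     # torch 2.0 compile of the UNet, the main per-step cost.
        |     pipe.unet = torch.compile(pipe.unet)
        | 
        |     image = pipe("a plate of paella",
        |                  num_inference_steps=20).images[0]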
        
         | jeron wrote:
          | To add to that, there's a fine-tuned version of Stable Diffusion
          | 1.5 that can output 5 fps at 256x256 (0.2 seconds per image)[0],
          | so it's over 2x faster than Paella at 256x256.
         | 
         | [0]:https://www.reddit.com/r/StableDiffusion/comments/z3m97e/mi
         | n...
        
       | x3874 wrote:
        | Okay, and what does this 'model' have to do with paella (patella)?
        | Another stupid project name.
        
         | GaggiX wrote:
          | Paella is a Spanish dish, one of the researchers is Spanish, and
          | food was among the first things the model was good at.
        
           | x3874 wrote:
            | I know what a paella is, I am not some backwoods American! I
            | would never choose such a stupid name which already has
            | long-term, well-established meanings elsewhere / in daily
            | life! "Looking for paella? Well, do you mean the food, or
            | some tech framework / database etc. du jour?"
           | 
            | I'll give you a hint - an example of a very well-chosen name,
            | b/c it is unique, not overloaded with other and especially
            | older meanings, and directly linked to the main product
            | functionality: "Keycloak"
        
             | GaggiX wrote:
              | If that's the reason why you're whining, then I will
              | explain a simple trick: "Paella model"
        
             | moron4hire wrote:
             | No worse than Apple computers, Internet cookies, email
             | spam...
        
         | eutropia wrote:
         | I think it's an analogy.
         | 
          | Paella is a Spanish rice dish: rice cooked with various other
          | ingredients like chicken, seafood, peppers, tomatoes, etc.
         | 
         | If an image (a 2-d array of pixels) is a plate of rice,
         | standard diffusion models denoise starting from each grain of
         | rice (pixel). If you've ever seen a step-by-step output from a
         | diffusion model, you know what I'm talking about.
         | 
         | This model makes use of a CNN (convolutional neural net) to
         | decode tokens from the image. A CNN takes m-by-m sections of
         | elements (i.e. a square of pixels) as input, and translates
         | them into 1d vectors as part of the input to the NN. Taking
         | "chunks" of the image like this allows the net to learn things
         | like edges, shapes, groups of color, etc.
         | 
         | You could consider the convolutional samples as "chunks" in the
         | paella: the meat, vegetables, and other goodies that make the
         | dish beloved by many.
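          | 
          | A minimal sketch of that "chunking" step in PyTorch (toy
          | shapes, not the actual Paella architecture): a convolution
          | reads each m-by-m neighborhood and maps it to one feature
          | vector.
          | 
          |     import torch
          |     import torch.nn as nn
          | 
          |     m = 4
          |     img = torch.randn(1, 3, 256, 256)  # one RGB image
          |     # Non-overlapping m x m patches -> 64-dim vectors.
          |     conv = nn.Conv2d(3, 64, kernel_size=m, stride=m)
          |     features = conv(img)                # (1, 64, 64, 64)
          |     # One vector per "chunk" of the image.
          |     vectors = features.flatten(2).transpose(1, 2)  # (1, 4096, 64)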
        
         | dang wrote:
         | " _Please don 't complain about tangential annoyances--e.g.
         | article or website formats, name collisions, or back-button
         | breakage. They're too common to be interesting._"
         | 
         | https://news.ycombinator.com/newsguidelines.html
        
         | marcus0x62 wrote:
         | I don't understand your parenthetical reference - paella and
         | patella are very different things...
         | 
          | In any case, I took Paella to be a play on infusion (used in
          | making paella) vs. diffusion.
         | 
         | Or maybe these guys just really like rice.
        
       | ben_w wrote:
        | Half a second. You can't even type in a good prompt that fast.
       | 
        | One of my dad's preferred anecdotes about how much computers sped
        | up over his career was the number of digits of pi that the
        | company mainframe could compute.
       | 
       | He was born in '39.
       | 
       | And now I can generate images from descriptions faster than I can
       | give those descriptions.
       | 
       | At this rate, websites will be replaced with image generators and
       | LLMs, and the loading speed won't change.
        
         | interroboink wrote:
         | Just to be clear, it's half a second for 256x256, where Stable
         | Diffusion takes 3.2 seconds. Still a great speed-up, but not
         | producing the big hi-res images people might be thinking of.
        
       | gcanyon wrote:
       | Way back when, it was pretty easy to recognize diagrams created
       | with MacDraw. There was a particular visual style to the
       | primitives it included that flowed through to the final product.
       | This was of course easier to notice because there were so few
       | alternatives at the time.
       | 
       | Given that Paella uses tokens instead of the source image, I
       | wonder if the results will have a (human- or machine-) detectable
       | "style" to them.
        
         | sbierwagen wrote:
         | The usual answer to "all AI art looks the same" is
         | https://i.redd.it/jvwyyqn7776a1.jpg
        
           | omnicognate wrote:
           | Perhaps if you ask an 11 year old.
        
       | RobotToaster wrote:
        | GitHub, for those looking for the code:
        | https://github.com/dome272/Paella
        
       | FloatArtifact wrote:
        | A question out of curiosity: why can't it train on 256x256 pixels
        | yet generate an image of any size? And if it was trained on
        | multiple sizes of images, could you also generate at a larger
        | size without upscaling?
        
         | dplavery92 wrote:
         | Presumably a transformer model or similar that uses positional
         | encodings for the tokens could do that, but the U-Net decoder
         | here uses a fixed-shape output and learns relationships between
         | tokens (and sizes of image features) based on the positions of
         | those tokens in a fixed-size vector. You could still apply this
         | process convolutionally and slide the entire network around to
         | generate an image that is an arbitrary multiple of the token
          | size, but image content in one area of the image will only be
          | "aware" of content within a fixed-size neighborhood (e.g.
          | 256x256).
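          | 
          | A toy illustration of that point, assuming PyTorch (nothing
          | here is the actual Paella decoder): a purely convolutional net
          | slides to any input size, while a layer with a baked-in size is
          | tied to its training resolution.
          | 
          |     import torch
          |     import torch.nn as nn
          | 
          |     conv_net = nn.Sequential(
          |         nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
          |         nn.Conv2d(16, 3, 3, padding=1))
          |     conv_net(torch.randn(1, 3, 256, 256))  # works
          |     conv_net(torch.randn(1, 3, 384, 512))  # also works
          | 
          |     fixed = nn.Linear(3 * 256 * 256, 10)   # baked-in 256x256 input
          |     # fixed(torch.randn(1, 3 * 384 * 512)) # shape-mismatch error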
        
         | wincy wrote:
         | As someone who doesn't know the "why" but uses Stable diffusion
         | a lot and has an intuitive feel for the "what" of what happens,
         | it's like trying to use a low res pattern for your wallpaper on
         | Windows. It'll either just repeat over and over so you'll end
         | up with weird multi headed people with heads on top of their
         | heads, or you just upscale which hallucinates details in a
         | totally different way that doesn't really add new interesting
         | details.
         | 
         | With automatic1111 you can get around this by upscaling then
         | inpainting the spots you want more detail and specifying a
         | specific prompt for that particular area.
        
         | kmeisthax wrote:
         | Well, it's complicated.
         | 
         | The model can't just work on arbitrary image sizes because the
         | model was trained with a fixed number of input and output
         | neurons. For example, 512x512 is Stable Diffusion's "native
         | size." However, there are tricks to work around this.
         | 
         | Diffusion models work by predicting image noise, which is then
         | subtracted from the image iteratively until you get a result
         | that matches the prompt. Stable Diffusion specifically has the
         | following architectural features:
         | 
          | - A Variational Autoencoder (VAE) layer that encodes the
          | 512x512 input into a 64x64 latent space[0]
         | 
         | - Three cross-attention blocks that take the encoded text
         | prompt and input latent-space image, and output a downscaled
         | image to the next layer
         | 
         | - A simpler downscaling block that just has a linear and
         | convolutional layer
         | 
         | - _Skip connections_ between the last four downscaling blocks
         | and corresponding _upscaling_ blocks that do the opposite, in
         | the opposite order (e.g. simple upscale, then three cross-
         | attention blocks).
         | 
         | - The aforementioned opposite blocks (upscale + cross-attn
         | upscale)
         | 
         | - VAE decoder that goes from latent space back to a 512x512
         | output
         | 
         | At the end of this process you get what the combined model
         | thinks is noise in the image according to the prompt you gave
         | it. You then subtract the noise and repeat for a certain number
         | of iterations until done. So obviously, if you wanted a smaller
         | image, you could crop the input and output at each iteration so
         | that the model can only draw in the 'center'.
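          | 
          | In code the loop looks roughly like this (a toy sketch; the
          | real update also rescales by the noise schedule, and "unet"
          | and "text_emb" are stand-ins, not the actual SD modules):
          | 
          |     import torch
          | 
          |     def denoise(unet, latent, text_emb, steps=20, step_size=0.1):
          |         for t in range(steps, 0, -1):
          |             # "What the model thinks is noise" given the prompt.
          |             predicted_noise = unet(latent, t, text_emb)
          |             # Subtract it and repeat.
          |             latent = latent - step_size * predicted_noise
          |         return latent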
         | 
          | Larger images are a bit trickier: you have to feed the image
         | through in halves and then merge the noise predictions together
         | before subtracting. This of course has limitations: since the
         | model is looking at only half the image, there's nothing to
         | steer the overall process, so it will draw things that look
         | locally coherent but make no sense globally[1].
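          | 
          | The halves trick itself is simple enough to sketch (toy shapes
          | and the same stand-in unet as above; real tools blend
          | overlapping tiles rather than hard halves):
          | 
          |     import torch
          | 
          |     def predict_noise_halves(unet, latent, t, text_emb):
          |         # latent is twice the native width, e.g. (1, 4, 64, 128)
          |         half = latent.shape[-1] // 2
          |         left = unet(latent[..., :half], t, text_emb)
          |         right = unet(latent[..., half:], t, text_emb)
          |         # Merge the two noise predictions before subtracting.
          |         return torch.cat([left, right], dim=-1)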
         | 
          | I _suspect_ - as in, I'm totally guessing here - that we might
         | be able to fix that by also running the diffusion process on a
         | downscaled version of the image and then scaling the noise
         | prediction back up to average with the other outputs. As far as
         | I'm aware no SD frontends do this. But if that worked you could
         | build up a resolution pyramid of models at different sizes
         | taking fragments of the image and working together to denoise
         | the image. If you were training from scratch you could even add
         | scale and position information to the condition vector so the
         | model can learn what image features should exist at what sizes.
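          | 
          | As a sketch of that guess (entirely speculative, reusing
          | predict_noise_halves from the sketch above): also denoise a
          | downscaled copy and average its upscaled prediction back in, so
          | a low-res "global view" steers the full-res halves.
          | 
          |     import torch.nn.functional as F
          | 
          |     def predict_noise_pyramid(unet, latent, t, text_emb):
          |         local = predict_noise_halves(unet, latent, t, text_emb)
          |         small = F.interpolate(latent, scale_factor=0.5,
          |                               mode="bilinear")
          |         global_noise = unet(small, t, text_emb)
          |         global_up = F.interpolate(global_noise,
          |                                   size=latent.shape[-2:],
          |                                   mode="bilinear")
          |         return (local + global_up) / 2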
         | 
          | [0] Think of it as if every pixel of the latent-space image
          | held, instead of RGB, four channels' worth of information about
          | the distribution of pixels in the color-space image. This
          | compresses the image so that the U-Net part of the model can be
          | architecturally simpler - in fact, a lot of machine learning
          | research is about finding new ways to compress data into a
          | smaller number of input neurons.
         | 
          | [1] More so than diffusion models normally do
        
       | nodja wrote:
        | Note that Paella is a bit old in image-model terms (Nov 2022), and
        | modern Stable Diffusion tools have access to optimized workflows.
       | 
        | My 3060 can generate a 256x256 8-step image in 0.5 seconds, no
        | A100 needed. A 3090 has double the performance of a 3060 at
        | 512x512, and an A100 is 50% faster than a 3090...
       | 
        | If you have access to a high-end consumer GPU (a 4090) you can
        | generate 512x512 images in less than a second; it's reached the
        | point where you can increase the batch size and have it show 2-4
        | images per prompt without adversely affecting your workflow.
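        | 
        | With diffusers, batching per prompt is a single argument (a
        | sketch; the model id and prompt are just placeholders):
        | 
        |     import torch
        |     from diffusers import StableDiffusionPipeline
        | 
        |     pipe = StableDiffusionPipeline.from_pretrained(
        |         "runwayml/stable-diffusion-v1-5",
        |         torch_dtype=torch.float16).to("cuda")
        |     # Generate a small batch in one pass; on a fast GPU this adds
        |     # little wall-clock time versus a single image.
        |     images = pipe("a beach at sunset",
        |                   num_images_per_prompt=4).images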
       | 
        | Too bad SD1.5 is too small* and we'll require models with more
        | parameters if we want a truly general-purpose image model. If
        | SD1.5 were the end game, we'd have truly instant high-res image
        | generation in just a couple more generations of GPUs: think
        | generating images in real time as you type the prompt, or having
        | sliders that affect the strength of certain tokens and seeing the
        | effects in real time, etc. Though I've heard that SDXL is
        | actually faster at higher resolutions (>1024x1024) due to
        | removing attention from the first layer, making it scale better
        | with resolution even though SDXL has 4x the parameter count.
       | 
        | * Current SD1.5 models that can generate consistently
        | high-quality images have been fine-tuned and merged so many times
        | that a lot of general knowledge has been lost; e.g. they can be
        | great at generating landscapes but lacking at generating humans,
        | or they can be very good at a certain style like comics but only
        | do that style, losing the ability to generate more varied faces,
        | etc.
        
       ___________________________________________________________________
       (page generated 2023-06-26 23:00 UTC)