[HN Gopher] Real-time image editing using latent consistency models
       ___________________________________________________________________
        
       Real-time image editing using latent consistency models
        
       Author : dvrp
       Score  : 34 points
       Date   : 2023-11-10 20:06 UTC (2 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | joerambo808 wrote:
       | why is it faster than latent diffusion models?
        
         | vipermu wrote:
          | it uses a new technique called "consistency" that lets latent
          | diffusion models predict images in far fewer steps.
         | 
          | some links here:
          | 
          | - https://arxiv.org/abs/2310.04378
          | - https://arxiv.org/abs/2311.05556
        
         | billconan wrote:
          | latent diffusion is an iterative process: the image becomes
          | clearer one step at a time.
          | 
          | The process can be viewed as a particle moving through image
          | space, one step at a time, until it reaches its final
          | position, which is the generated image.
          | 
          | A consistency model instead tries to predict the end of that
          | trajectory directly, given the current position in image
          | space. Hence, what used to be a step-by-step process becomes
          | a one-step process.
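          | 
          | roughly, the interface difference looks like this (a toy
          | python sketch, illustrative only, not either paper's exact
          | formulation):
          | 
          |     def diffusion_sample(denoise_step, x_T, sigmas):
          |         # classical sampling: walk the trajectory one noise
          |         # level at a time, calling the network every step
          |         x = x_T
          |         for s_hi, s_lo in zip(sigmas[:-1], sigmas[1:]):
          |             x = denoise_step(x, s_hi, s_lo)
          |         return x
          | 
          |     def consistency_sample(f, x_T, sigma_max):
          |         # consistency model: jump from the current position
          |         # straight to the trajectory's endpoint in one call
          |         return f(x_T, sigma_max)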
        
           | dvrp wrote:
           | oh wow, never thought of it that way
        
             | swyx wrote:
             | > consistency model tries to predict the movement
             | trajectory by providing the current position in the image
             | space. Hence, what used to be a step-by-step process
             | becomes a one-step process.
             | 
              | no, that wasn't a sufficient explanation for me. what is the
             | prediction method here? why was diffusion necessary in the
             | past? what tradeoffs does this approach have?
        
               | quadrature wrote:
                | from https://arxiv.org/abs/2310.04378 it sounds like it's
                | a form of distillation of an SD model. So I'm guessing it
                | can't be directly trained, but once you have a trained
                | diffusion model you can distil a predictor that cuts out
                | the iterative steps.
                | 
                | While it can do 1-step generation, the output quality
                | looks a ton better with additional steps.
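                | 
                | for a feel of the step-count knob, here's a rough sketch
                | using the diffusers LCM pipeline (assuming a recent
                | diffusers release and the SimianLuo/LCM_Dreamshaper_v7
                | checkpoint):
                | 
                |     import torch
                |     from diffusers import DiffusionPipeline
                | 
                |     pipe = DiffusionPipeline.from_pretrained(
                |         "SimianLuo/LCM_Dreamshaper_v7",
                |         torch_dtype=torch.float16,
                |     ).to("cuda")
                | 
                |     # 1 step works, but 4-8 steps usually looks better
                |     image = pipe(
                |         "a photo of a red fox in the snow",
                |         num_inference_steps=4,
                |         guidance_scale=8.0,
                |     ).images[0]
                |     image.save("fox.png")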
        
               | dvrp wrote:
                | You can train it directly. This is from the paper: "An LCM
                | demands merely 32 A100 GPU hours of training for 2-step
                | inference [...]"
        
               | billconan wrote:
                | In my defense, if you look at the original paper
               | 
               | https://arxiv.org/pdf/2303.01469.pdf
               | 
                | the exact neural network used for the prediction is left
                | open. Apparently many network architectures can be used
                | for it, as long as they fulfill certain requirements.
               | 
               | > why was diffusion necessary in the past?
               | 
               | in the paper, one way to train a consistency model is
               | distilling an existing diffusion model. But it can be
               | trained from scratch too.
               | 
               | "why was it necessary in the past " doesn't bother me
               | that much. Before people know to use molding to make
               | candles, they do it by dipping threads into wax. Why was
               | thread dipping necessary? it's just the stepping stone of
               | technology development.
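                | 
                | for a flavor of the from-scratch ("consistency training")
                | objective, here is a toy sketch that uses tiny MLPs over
                | flat vectors in place of U-Nets (not the paper's exact
                | parametrization, which also enforces f(x, sigma_min) = x):
                | 
                |     import torch
                |     import torch.nn as nn
                | 
                |     def make_net(d=16):
                |         return nn.Sequential(
                |             nn.Linear(d + 1, 64), nn.SiLU(), nn.Linear(64, d))
                | 
                |     def f(net, x, sigma):
                |         # consistency function: (noisy x, noise level)
                |         # mapped directly to a clean-sample estimate
                |         return net(torch.cat([x, sigma[:, None]], dim=-1))
                | 
                |     student, target = make_net(), make_net()
                |     target.load_state_dict(student.state_dict())
                |     opt = torch.optim.Adam(student.parameters(), lr=1e-4)
                |     sigmas = torch.linspace(0.01, 1.0, 20)  # toy schedule
                | 
                |     for step in range(100):
                |         x0 = torch.randn(8, 16)  # stand-in for real data
                |         z = torch.randn_like(x0)
                |         n = torch.randint(0, len(sigmas) - 1, (8,))
                |         x_hi = x0 + sigmas[n + 1][:, None] * z
                |         x_lo = x0 + sigmas[n][:, None] * z
                |         # distillation would instead produce x_lo via one
                |         # ODE-solver step of a pretrained diffusion model
                |         loss = (f(student, x_hi, sigmas[n + 1]) -
                |                 f(target, x_lo, sigmas[n]).detach()
                |                 ).pow(2).mean()
                |         opt.zero_grad(); loss.backward(); opt.step()
                |         with torch.no_grad():  # EMA update of target net
                |             for p, q in zip(target.parameters(),
                |                             student.parameters()):
                |                 p.mul_(0.95).add_(q, alpha=0.05)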
        
       | tasgon wrote:
       | Disclaimer: I'm actively working on this tool.
       | 
       | This is in a closed beta for now (while we work on provisioning
       | enough GPU compute) but we're hoping to make this public later.
        
       | gailees wrote:
       | How do you plan on stopping people from using this tool
       | maliciously?
        
         | monkellipse wrote:
         | The same way you stop people from using a hammer maliciously?
        
       | byrneml wrote:
        | Could you apply the same technique to real-time video editing?
        
       | kmavm wrote:
       | (Disclaimer: I'm an investor in Krea AI.)
       | 
       | When Diego first showed me this animation, I wasn't completely
       | sure what I was looking at, because I assumed the left and right
       | sides were like composited together or something. But it's a
       | unified screen recording; the right, generated side is keeping
       | pace with the riffing the artist does in the little paint program
       | on the left.
       | 
       | There is no substitute for low latency in creative tools; if you
       | have to sit there holding your breath every time you try
        | something, you aren't just linearly slowed down. There are points
        | that are simply too hard to reach with the slow, deliberate, 30+
        | second steps that classical diffusion generation requires.
       | 
       | When I first heard about consistency, my assumption was that it
       | was just an accelerator. I expected we'd get faster, cheaper
       | versions of the same kinds of interactions with visual models
       | we're used to seeing. The fine hackers at Krea did not take long
       | to prove me wrong!
        
         | dvrp wrote:
         | Exactly.
         | 
         | There is no substitute for real-time when you're doing creative
         | work.
         | 
         | That's why GitHub Copilot works so well; that's why ChatGPT
         | struck a chord with people--it streamed the characters back to
         | you quite fast.
         | 
          | At first, I was skeptical too. I asked myself "what about
          | Photoshop 1.0? They surely couldn't do it in real-time." It
          | turns out that even then you needed it. Of course, the compute
          | wasn't there to translate _all_ of the rasterized pixel values
          | that form an image within a layer, but they had a trick: they
          | showed you an outline that told you, the user, where the
          | content _would_ render once you let the mouse go.
         | 
         | You can see the workflow here:
         | 
         | > https://www.youtube.com/watch?v=ftaIzyrMDqE
         | 
          | The same applies to general tools too; you can see it in this
          | Mac OS 8 demo (it runs in the browser!):
         | 
         | > https://infinitemac.org/1998/Mac%20OS%208.1
        
       | vmatsiiako wrote:
        | Does this have anything at all to do with LoRAs?
        
         | vipermu wrote:
         | indeed; we're able to make it work with SDXL thanks to a new
         | technique that got released yesterday called LCM-LoRA.
         | 
          | with LCM-LoRA you can turn models like SDXL into LCMs without
          | the need for training, and you can add other style LoRAs like
          | the ones you find on civit.ai
         | 
         | in case you're interested, here's the technical report about
         | LCM-LoRA: https://arxiv.org/abs/2311.05556
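          | 
          | if you want to try it yourself, here's a rough sketch with
          | diffusers (assuming a recent diffusers release and the
          | latent-consistency/lcm-lora-sdxl weights; not our production
          | setup):
          | 
          |     import torch
          |     from diffusers import DiffusionPipeline, LCMScheduler
          | 
          |     pipe = DiffusionPipeline.from_pretrained(
          |         "stabilityai/stable-diffusion-xl-base-1.0",
          |         torch_dtype=torch.float16,
          |         variant="fp16",
          |     ).to("cuda")
          | 
          |     # swap in the LCM scheduler, then load the LCM-LoRA weights
          |     pipe.scheduler = LCMScheduler.from_config(
          |         pipe.scheduler.config)
          |     pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
          |     # (style LoRAs, e.g. from civitai, can be combined on top)
          | 
          |     image = pipe(
          |         "close-up photo of an astronaut, studio lighting",
          |         num_inference_steps=4,
          |         guidance_scale=1.0,  # low/no CFG works best here
          |     ).images[0]
          |     image.save("astronaut.png")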
        
       ___________________________________________________________________
       (page generated 2023-11-10 23:00 UTC)