[HN Gopher] Stable Diffusion XL technical report [pdf]
       ___________________________________________________________________
        
       Stable Diffusion XL technical report [pdf]
        
       Author : GaggiX
       Score  : 145 points
       Date   : 2023-07-04 13:00 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | mkaic wrote:
       | Some cool stuff from the paper:
       | 
       | * Earlier SD versions would often generate images where the head
       | or feet of the subject was cropped out of frame. This was because
       | random cropping was applied to its training data as data
       | augmentation, so it learned to make images that looked randomly
       | cropped -- not ideal! To fix this issue, they still used random
       | cropping during training, but also gave the crop coordinates to
       | the model so that it would _know_ it was training on a cropped
       | image. Then, they set those crop coordinates to 0 at test-time,
       | and the model keeps the subject centered! They also did a similar
       | thing with the pixel dimensions of the image, so that the model
        | can learn to operate at different "DPI" ranges.
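        | 
        | A minimal sketch of the conditioning trick (my own PyTorch-
        | style code; names like fourier_embed are illustrative, not
        | from the paper):
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     def fourier_embed(x, dim=256):
        |         # Sinusoidal features for a scalar conditioning
        |         # value (crop coordinate, original size, etc.).
        |         freqs = (1.0 / 10000.0) ** (
        |             torch.arange(dim // 2) / (dim // 2))
        |         angles = x[:, None].float() * freqs
        |         return torch.cat([angles.sin(), angles.cos()], -1)
        | 
        |     b, d = 4, 256
        |     mlp = nn.Linear(2 * d, d)  # project into t-emb space
        |     t_emb = torch.zeros(b, d)  # stand-in timestep embedding
        | 
        |     # Training: tell the model how the sample was cropped.
        |     c_top = torch.randint(0, 512, (b,))
        |     c_left = torch.randint(0, 512, (b,))
        |     cond = torch.cat([fourier_embed(c_top),
        |                       fourier_embed(c_left)], dim=-1)
        |     t_emb = t_emb + mlp(cond)
        | 
        |     # Inference: claim "no crop" to keep subjects centered.
        |     cond = torch.cat([fourier_embed(torch.zeros(b)),
        |                       fourier_embed(torch.zeros(b))], -1)
        |     t_emb = t_emb + mlp(cond)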
       | 
       | * They're using a two-stage model instead of a single monolithic
       | model. They have one model trained to get the image "most of the
       | way there" and second model to take the output of the first and
       | refine it, fixing textures and small details. Sort of mixture-of-
       | experts-y. It makes sense that different "skillsets" would be
       | required for the different stages of the denoising process, so
       | it's reasonable to train separate models for each of the stages.
       | Raises the question of whether unrolling the process further
        | might yield more improvements -- maybe a 3- or 4-stage model
        | next?
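        | 
        | Roughly how the two stages chain, assuming the diffusers SDXL
        | 0.9 pipelines (model IDs and API details may differ once 1.0
        | ships):
        | 
        |     import torch
        |     from diffusers import DiffusionPipeline
        | 
        |     base = DiffusionPipeline.from_pretrained(
        |         "stabilityai/stable-diffusion-xl-base-0.9",
        |         torch_dtype=torch.float16).to("cuda")
        |     refiner = DiffusionPipeline.from_pretrained(
        |         "stabilityai/stable-diffusion-xl-refiner-0.9",
        |         torch_dtype=torch.float16).to("cuda")
        | 
        |     prompt = "an astronaut riding a horse"
        |     # Stage 1: base model gets it "most of the way there".
        |     latents = base(prompt=prompt,
        |                    output_type="latent").images
        |     # Stage 2: refiner fixes textures and small details.
        |     image = refiner(prompt=prompt, image=latents).images[0]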
       | 
       | * Maybe I missed it, but I don't see in the paper whether the
       | images they're showing come from the raw model or the RLHF-tuned
       | variant. SDXL has been available for DreamStudio users to play
       | with since April, and Emad indicated that the reason for this was
       | to collect tons of human preference data. He also said that when
       | the full SDXL 1.0 release happens later this month, both RLHF'd
       | and non-RLHF'd variants of the weights will be available for
       | download. I look forward to seeing detailed comparisons between
       | the two.
       | 
       | * They removed the lowest-resolution level of the U-Net -- the 8x
       | downsample block -- which makes sense to me. I don't think
       | there's really that much benefit from wasting flops on a tiny 4x4
       | or 8x8 latent tbh. Also thought it was interesting that they got
       | rid of the cross-attention on the highest-resolution level of the
       | U-Net.
        
         | jack_riminton wrote:
         | The two stage process using different skill sets reminds me of
         | the old painting masters. Often the master would come up with
         | the overall composition, which requires a lot of creativity and
         | judgement of forms, apprentice painters would then come in and
         | paint clothes, trees or whatever and then the master would
         | finish off the critical details such as eyes, mouth and hands
         | etc.
         | 
          | It makes sense to have different criteria for what "good"
          | means at each stage.
        
         | loudmax wrote:
         | The two-stage model sounds a lot like the "Hires fix" checkbox
         | in the Automatic1111 interface. If you enable that, it will
         | generate an image, and then generate a scaled up image based on
         | the first image. You can do the same thing yourself by sending
         | an image to the "Image to Image" tab and then upscaling. If you
         | do it that way, you also have the option of swapping the image
         | model or adding LoRAs.
         | 
          | Presumably the two parts of the SDXL model are complementary: a
         | first pass that's an expert on overall composition, and a
         | second pass that's an expert on details.
        
       | in_the_bay wrote:
       | Does it say anywhere what hardware they used for training?
        
         | rhogar wrote:
         | The report does not detail hardware -- though it states that
         | SDXL has 2.6B parameters in its UNet component, compared to SD
         | 1.4/1.5 with 860M and SD 2.0/2.1 with 865M. So SDXL has roughly
         | 3x more UNet parameters. In January, MosaicML claimed a model
         | comparable to Stable Diffusion V2 could be trained with 79,000
          | A100-hours in 13 days. Some rough inference can be made from
          | those numbers; I'd be interested to hear someone with more
          | insight provide perspective.
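          | 
          | Back-of-the-envelope from those figures (my own arithmetic,
          | not from the report):
          | 
          |     a100_hours = 79_000      # MosaicML, SD2-class model
          |     days = 13
          |     gpus = a100_hours / (days * 24)   # ~253 A100s
          |     # Naively scaling by ~3x UNet params would suggest a
          |     # cluster in the high hundreds of A100s, ignoring
          |     # data, resolution and schedule differences.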
        
           | darqwolff wrote:
            | Wouldn't that mean more VRAM is required to load the model?
            | They are claiming it will still work on 8 GB cards.
        
             | brucethemoose2 wrote:
             | I am guessing 8 bit quantization will be a thing for SDXL.
             | 
              | It should be easy(TM) with bitsandbytes or ML compiler
              | frameworks.
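              | 
              | For example, something in this spirit with
              | bitsandbytes (a sketch, untested against the
              | SDXL UNet; convs and norms are left alone):
              | 
              |     import torch.nn as nn
              |     import bitsandbytes as bnb
              | 
              |     def to_8bit(model):
              |         # Recursively swap Linear layers for
              |         # 8-bit ones; weights quantize on .cuda().
              |         for name, m in model.named_children():
              |             if isinstance(m, nn.Linear):
              |                 q = bnb.nn.Linear8bitLt(
              |                     m.in_features, m.out_features,
              |                     bias=m.bias is not None,
              |                     has_fp16_weights=False)
              |                 q.load_state_dict(m.state_dict())
              |                 setattr(model, name, q)
              |             else:
              |                 to_8bit(m)
              |         return model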
        
             | Filligree wrote:
             | Stable Diffusion 1/2 were made to run on cards with as
             | little as 3GB of memory.
             | 
             | Using the same techniques, yes, this will fit in 8.
        
       | p1esk wrote:
       | "Images generated with our code use the invisible-watermark
       | library to embed an invisible watermark into the model output. We
       | also provide a script to easily detect that watermark."
       | 
       | Interesting - if it can be detected then it can be removed,
       | right?
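        | 
        | For reference, the library's round trip looks roughly like
        | this (the payload here is an example, not the one SD embeds):
        | 
        |     import cv2
        |     from imwatermark import WatermarkEncoder, WatermarkDecoder
        | 
        |     bgr = cv2.imread("generated.png")
        |     enc = WatermarkEncoder()
        |     enc.set_watermark("bytes", b"SDXL")  # example payload
        |     marked = enc.encode(bgr, "dwtDct")   # DWT/DCT embed
        | 
        |     dec = WatermarkDecoder("bytes", 32)  # 32 bits = 4 bytes
        |     print(dec.decode(marked, "dwtDct"))  # -> b'SDXL'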
        
         | menzoic wrote:
         | It's open source. It can be removed directly from the code.
        
           | donkeyd wrote:
           | Also, most people use other tools like web-ui to run these
           | models, which don't seem to apply any watermarks.
        
           | p1esk wrote:
           | Sure, I just don't understand why they're doing it.
           | 
           | In theory they could also train the model to always add a
           | watermark.
        
             | tjoff wrote:
             | Many good reasons have been presented here. And on the
             | other side, is there any reason not to do it?
             | 
              | (apart from the effort required to put it in place)
        
               | bogdan wrote:
               | I guess it's a form of tracking and we don't like that
        
               | brucethemoose2 wrote:
               | There is nothing personally identifiable in the watermark
               | though, no more than other image metadata.
               | 
                | And yeah, you can see the watermarking code in the
                | huggingface pipeline now. It's pretty simple, and there's
                | really no reason to disable it.
        
               | adventured wrote:
               | What exactly is in the watermark? Other image metadata is
               | commonly used to track people, or used in combination
               | with a few other hints to track people.
               | 
               | If it's strictly a primitive watermark with absolutely
               | nothing unique per mark, then it might be no big deal.
        
               | brucethemoose2 wrote:
               | It is. See for yourself:
               | 
               | https://github.com/huggingface/diffusers/blob/2367c1b9fa3
               | 126...
               | 
                | I see no reason to disable it. It's fast, and it's no
                | more personal than basic metadata like image format and
                | resolution... I can't think of any use case for disabling
                | it other than passing off AI-generated images as
                | authentic.
        
             | saynay wrote:
             | There have been some talk of various countries requiring AI
             | images be watermarked. They may be trying to signal their
             | willingness to comply, and hopefully ensure they are not
             | outright banned. That fact it can be bypassed isn't
             | necessarily their problem to fix.
        
             | aenvoker wrote:
             | Stability, Midjourney, Dall-e, etc don't want their
             | products to be used for misinformation. They really do want
             | AI generated images to be trivially identifiable.
             | 
             | There are papers around describing how to bake a watermark
             | into a model. Don't know if anyone is doing that yet.
        
               | Der_Einzige wrote:
               | And many will start working on techniques to unbake it
               | from the model. A cat and mouse game, like all other
               | attempts to track.
        
             | rcme wrote:
             | One reason is so that they can detect generated images and
             | exclude them from future training sets.
        
         | lumost wrote:
         | We'll almost certainly see models trained to produce watermarks
         | directly in the next few months. A highly diffuse statistical
         | abnormality would be nearly impossible to detect or remove.
        
           | p1esk wrote:
           | And as soon as such models appear there will appear methods
           | to process images destroying such "statistical anomalies"
           | without changing visible appearance.
        
           | enlyth wrote:
           | Would this not be trivially defeatable by something like a
           | single pass of img2img with a very low denoise?
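            | 
            | E.g. something like this with diffusers (illustrative;
            | whether it actually strips the mark is an empirical
            | question):
            | 
            |     import torch
            |     from diffusers import StableDiffusionImg2ImgPipeline
            |     from PIL import Image
            | 
            |     pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
            |         "runwayml/stable-diffusion-v1-5",
            |         torch_dtype=torch.float16).to("cuda")
            | 
            |     img = Image.open("marked.png")
            |     # strength=0.1: a single very light denoise pass.
            |     out = pipe(prompt="", image=img,
            |                strength=0.1).images[0]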
        
             | goldenkey wrote:
             | Microstates vs macrostates. Doesn't it depend on the basis
             | for the encoding and how many bits it is? Watermark, we
             | usually think of something in the pixel plane, but a
             | watermark can be in the frequency spectrum, and other
             | steganographic (clever) resilient basises.
        
               | bob1029 wrote:
                | Simply running a cycle of "ok" quality JPEG compression
               | can completely devastate information encoded at higher
               | frequencies.
               | 
               | Quantization & subsampling do a hell of a job at getting
               | "unwanted" information gone. If a human cannot perceive
               | the watermark, then processes that aggressively approach
               | this threshold of perception will potentially succeed at
               | removing it.
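                | 
                | E.g. a single OpenCV round trip (illustrative):
                | 
                |     import cv2
                | 
                |     img = cv2.imread("marked.png")
                |     ok, buf = cv2.imencode(
                |         ".jpg", img,
                |         [cv2.IMWRITE_JPEG_QUALITY, 75])
                |     img = cv2.imdecode(buf, cv2.IMREAD_COLOR)
                |     # High-frequency payloads often don't
                |     # survive the quantization step.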
        
               | daniel_reetz wrote:
               | At this point the watermark can be meaningful content
               | like "every time there's a bird there is also a cirrus
               | cloud", or "blades of grass lean slightly further to the
               | left than a natural distribution".
               | 
               | Because our main interest is this meaningful content, it
               | will be harder to scrub from the image.
        
               | washadjeffmad wrote:
               | That would be indistinguishable from a model that was
               | also trained on that output, wouldn't it?
               | 
               | It seems much more likely that it's their solution to
               | detect and filter AI images from being used in their
               | training corpus - kind of a latent "robots.txt".
        
             | TheOtherHobbes wrote:
             | Depends on the watermark. The current watermark uses a DCT
             | to add a bit of undetectable noise around the edges. It
             | survives resizing but tends to fail after rotation or
             | cropping.
             | 
              | A robust, rotationally invariant, repeating-but-not-obvious
              | micro watermark would be quite a neat trick.
        
             | rcxdude wrote:
             | Watermarks can be made to be pretty robust, though having
             | an open source detection routine will make it much easier
             | to remove (if nothing else, if you can differentiate the
             | detection algorithm then you can basically just optimise it
             | away). The kind of watermarking that is e.g. used for
             | 'traitor tracking' tends to rely on at least a symmetric
             | key being secret, if not the techniques used as a whole
             | (which can include more specific changes to the content
             | that are carefully varied between different receivers of
             | the watermarked information, like slightly different cuts
             | in a film, for example).
        
       | Jack000 wrote:
        | Not seeing anything about the dataset -- are they still using
        | LAION? There's no mention of LAION in the paper and the results
        | look quite different from 1.5, so I'm guessing no.
       | 
       | > the model may encounter challenges when synthesizing intricate
       | structures, such as human hands
       | 
        | I think there are two main reasons for poor hands/text:
       | 
        | - Humans care about certain areas of the image more than others,
        | giving high saliency to faces, hands, body shape, etc., and lower
        | saliency to backgrounds and textures. Due to the way the UNet is
        | trained, it cares about all areas of the image equally. This
        | means model capacity per area is uniform, leading to capacity
        | problems for objects that have a large number of configurations
        | and that humans care more about.
       | 
        | - The sampling procedure implicitly assumes a uniform amount of
        | variance over the entire image. Text glyphs basically never
        | change, which means we should have effectively infinite CFG in
        | the parts of the image that contain text.
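        | 
        | For context, classifier-free guidance applies one global
        | scalar to the whole image:
        | 
        |     import torch
        | 
        |     def cfg(eps_uncond, eps_cond, w=7.5):
        |         # One scalar w for every pixel; nothing lets you
        |         # crank guidance up only where glyphs (near-zero
        |         # variance regions) appear.
        |         return eps_uncond + w * (eps_cond - eps_uncond)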
       | 
       | I'm not sure if there's any point in working on this though,
       | since both can be fixed by simply making a bigger model.
        
         | sorenjan wrote:
         | Are any of these txt2img models being partially trained on
         | synthetic datasets? Automatically rendering tens of thousands
         | of images with different textures, backgrounds, camera poses,
         | etc should be trivial with a handful of human models, or text
         | using different fonts.
        
         | mkaic wrote:
         | From hanging out around the LAION Discord server a bunch over
         | the past few months, I've gathered that they're still using
         | LAION-5B in some capacity, but they've done a bunch of
         | filtering on it to remove low-quality samples. I believe Emad
         | tweeted something to this effect at some point, too, but I
         | can't find the tweet right now.
        
         | Zetobal wrote:
         | The problem with laion was never the picture quality it's the
         | tag quality and a lot of unusable data points. They traded some
         | compute for a horde of users to tag them for them though.
         | https://dbzer0.com/blog/a-collaboration-begins-between-stabl...
        
       | naillo wrote:
       | I love stability AI
        
       | chaos_emergent wrote:
       | how refreshing it is to see technical details in a technical
       | report /s
        
       | brucethemoose2 wrote:
       | > At inference time, a user can then set the desired apparent
       | resolution of the image via this size-conditioning.
       | 
       | This is really cool.
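        | 
        | In diffusers terms this presumably surfaces as pipeline
        | kwargs along these lines (names from the SDXL pipeline, which
        | may shift before 1.0):
        | 
        |     # `pipe` is an SDXL text-to-image pipeline instance.
        |     image = pipe(
        |         prompt="a close-up photo of a fox",
        |         original_size=(256, 256),    # apparent "DPI"
        |         target_size=(1024, 1024),
        |     ).images[0]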
       | 
        | So is the difference between the first and second stage. The
        | output of the first stage looks good enough for generating
        | "drafts" which can then be picked by the user before being
        | finished by the second stage.
       | 
       | The improved cropping data augmentation is also a big deal. Bad
       | framing is a constant issue with SD 1.5.
        
       | ilaksh wrote:
       | You can access this with the Stability.ai API but it was really
       | slow and unstable when I tried it (the day it came out). It was
       | 20 seconds per image.
       | 
       | Supposedly the Clipdrop API was also going to support it.
       | Clipdrop has a web page for SDXL 0.9 but nothing in the API docs
       | about it.
       | 
       | Anyone know what the story is with the Clipdrop API and SDXL 0.9?
        
       | cypress66 wrote:
       | Are the weights out?
        
         | GaggiX wrote:
          | Only the research weights, but they should release the weights
          | for everyone this month.
        
       ___________________________________________________________________
       (page generated 2023-07-04 23:01 UTC)