[HN Gopher] Stable Diffusion XL technical report [pdf]
___________________________________________________________________
Stable Diffusion XL technical report [pdf]
Author : GaggiX
Score : 145 points
Date : 2023-07-04 13:00 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| mkaic wrote:
| Some cool stuff from the paper:
|
| * Earlier SD versions would often generate images where the head
| or feet of the subject was cropped out of frame. This was because
| random cropping was applied to its training data as data
| augmentation, so it learned to make images that looked randomly
| cropped -- not ideal! To fix this issue, they still used random
| cropping during training, but also gave the crop coordinates to
| the model so that it would _know_ it was training on a cropped
| image. Then, they set those crop coordinates to 0 at test-time,
| and the model keeps the subject centered! They also did a similar
| thing with the pixel dimensions of the image, so that the model
| can learn to operate at different "DPI" ranges. (A sketch of how
| both conditionings surface at inference time follows after this
| list.)
|
| * They're using a two-stage model instead of a single monolithic
| model. They have one model trained to get the image "most of the
| way there" and a second model to take the output of the first and
| refine it, fixing textures and small details. Sort of mixture-of-
| experts-y. It makes sense that different "skillsets" would be
| required for the different stages of the denoising process, so
| it's reasonable to train separate models for each of the stages.
| Raises the question of whether unrolling the process further
| might yield more improvements -- maybe a 3- or 4-stage model next?
|
| * Maybe I missed it, but I don't see in the paper whether the
| images they're showing come from the raw model or the RLHF-tuned
| variant. SDXL has been available for DreamStudio users to play
| with since April, and Emad indicated that the reason for this was
| to collect tons of human preference data. He also said that when
| the full SDXL 1.0 release happens later this month, both RLHF'd
| and non-RLHF'd variants of the weights will be available for
| download. I look forward to seeing detailed comparisons between
| the two.
|
| * They removed the lowest-resolution level of the U-Net -- the 8x
| downsample block -- which makes sense to me. I don't think
| there's really that much benefit to spending flops on a tiny 4x4
| or 8x8 latent tbh. Also thought it was interesting that they got
| rid of the cross-attention on the highest-resolution level of the
| U-Net.
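|
| (The sketch promised in the first bullet: roughly how the crop-
| and size-conditioning are exposed at inference time in the
| diffusers SDXL pipeline. The parameter names and the 0.9 model id
| are assumptions based on the research release, not something the
| paper specifies.)
|
|     import torch
|     from diffusers import StableDiffusionXLPipeline
|
|     # 0.9 research weights; swap in the 1.0 id once it ships.
|     pipe = StableDiffusionXLPipeline.from_pretrained(
|         "stabilityai/stable-diffusion-xl-base-0.9",
|         torch_dtype=torch.float16).to("cuda")
|
|     image = pipe(
|         prompt="a photo of an astronaut riding a horse",
|         # Size conditioning: ask for a native high-res "look"
|         # rather than an upscaled low-res training image.
|         original_size=(1024, 1024),
|         target_size=(1024, 1024),
|         # Crop conditioning: (top, left) = (0, 0) means "no crop",
|         # which is what keeps heads and feet in frame.
|         crops_coords_top_left=(0, 0),
|     ).images[0]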
| jack_riminton wrote:
| The two stage process using different skill sets reminds me of
| the old painting masters. Often the master would come up with
| the overall composition, which requires a lot of creativity and
| judgement of forms; apprentice painters would then come in and
| paint clothes, trees or whatever, and then the master would
| finish off the critical details such as eyes, mouth, hands,
| etc.
|
| It makes sense to have different criteria for what "good" means
| at each stage.
| loudmax wrote:
| The two-stage model sounds a lot like the "Hires fix" checkbox
| in the Automatic1111 interface. If you enable that, it will
| generate an image, and then generate a scaled up image based on
| the first image. You can do the same thing yourself by sending
| an image to the "Image to Image" tab and then upscaling. If you
| do it that way, you also have the option of swapping the image
| model or adding LoRAs.
|
| Presumably the two parts of the SDXL model are complementary: a
| first pass that's an expert on overall composition, and a
| second pass that's an expert on details.
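|
| For illustration, a rough two-stage pass with diffusers might
| look like the following. The model ids are the 0.9 research
| weights and the strength value is just a guess, not a setting
| taken from the report:
|
|     import torch
|     from diffusers import StableDiffusionXLPipeline
|     from diffusers import StableDiffusionXLImg2ImgPipeline
|
|     base = StableDiffusionXLPipeline.from_pretrained(
|         "stabilityai/stable-diffusion-xl-base-0.9",
|         torch_dtype=torch.float16).to("cuda")
|     refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
|         "stabilityai/stable-diffusion-xl-refiner-0.9",
|         torch_dtype=torch.float16).to("cuda")
|
|     prompt = "close-up portrait, detailed skin texture"
|     draft = base(prompt=prompt).images[0]        # composition
|     final = refiner(prompt=prompt, image=draft,  # detail pass
|                     strength=0.3).images[0]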
| in_the_bay wrote:
| Does it say anywhere what hardware they used for training?
| rhogar wrote:
| The report does not detail hardware -- though it states that
| SDXL has 2.6B parameters in its UNet component, compared to SD
| 1.4/1.5 with 860M and SD 2.0/2.1 with 865M. So SDXL has roughly
| 3x more UNet parameters. In January, MosaicML claimed a model
| comparable to Stable Diffusion V2 could be trained with 79,000
| A100-hours in 13 days. Some rough inference can be made from
| this; I'd be interested to hear someone with more insight
| provide perspective.
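|
| A very naive back-of-the-envelope, assuming cost scales linearly
| with UNet parameter count (it almost certainly does not, given
| the higher base resolution, the refiner stage, and whatever data
| and schedule changes they made):
|
|     # Naive linear scaling from MosaicML's SD2-class figure.
|     sd2_a100_hours = 79_000
|     sd2_unet_params = 0.865e9
|     sdxl_unet_params = 2.6e9
|
|     scale = sdxl_unet_params / sd2_unet_params     # ~3x
|     est_hours = sd2_a100_hours * scale             # ~237k A100-hours
|     gpus_for_30_days = est_hours / (30 * 24)       # ~330 A100s
|     print(round(est_hours), round(gpus_for_30_days))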
| darqwolff wrote:
| Wouldn't that mean more VRAM is required to load the model?
| They are claiming it will still work on 8 GB cards.
| brucethemoose2 wrote:
| I am guessing 8 bit quantization will be a thing for SDXL.
|
| It should be easy(TM) with bitsandbytes, or ML compiler
| frameworks.
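|
| Nothing official, but the usual bitsandbytes pattern would be to
| swap the UNet's nn.Linear layers for int8 ones, roughly like the
| sketch below. Whether that holds up quality-wise for diffusion
| models is an open question:
|
|     import torch.nn as nn
|     import bitsandbytes as bnb
|
|     def to_int8(module: nn.Module) -> nn.Module:
|         """Recursively swap nn.Linear layers for int8 versions."""
|         for name, child in module.named_children():
|             if isinstance(child, nn.Linear):
|                 int8 = bnb.nn.Linear8bitLt(
|                     child.in_features, child.out_features,
|                     bias=child.bias is not None,
|                     has_fp16_weights=False, threshold=6.0)
|                 int8.load_state_dict(child.state_dict())
|                 setattr(module, name, int8)
|             else:
|                 to_int8(child)
|         return module
|
|     # pipe.unet = to_int8(pipe.unet).to("cuda")  # quantizes here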
| Filligree wrote:
| Stable Diffusion 1/2 were made to run on cards with as
| little as 3GB of memory.
|
| Using the same techniques, yes, this will fit in 8.
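|
| The standard diffusers memory levers (fp16 weights, CPU offload,
| attention/VAE slicing) should carry over to SDXL too. A sketch,
| not a tested 8 GB config:
|
|     import torch
|     from diffusers import StableDiffusionXLPipeline
|
|     pipe = StableDiffusionXLPipeline.from_pretrained(
|         "stabilityai/stable-diffusion-xl-base-0.9",
|         torch_dtype=torch.float16)   # fp16 halves the weights
|     pipe.enable_model_cpu_offload()  # idle submodules stay on CPU
|     pipe.enable_attention_slicing()  # trade speed for peak VRAM
|     pipe.enable_vae_slicing()        # decode the latent in slices
|     image = pipe("a lighthouse at dusk").images[0]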
| p1esk wrote:
| "Images generated with our code use the invisible-watermark
| library to embed an invisible watermark into the model output. We
| also provide a script to easily detect that watermark."
|
| Interesting - if it can be detected then it can be removed,
| right?
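|
| For reference, the library's basic usage looks roughly like this
| (the payload and bit length here are made up; the real pipeline
| embeds its own fixed payload):
|
|     import cv2
|     from imwatermark import WatermarkEncoder, WatermarkDecoder
|
|     bgr = cv2.imread("generated.png")
|
|     encoder = WatermarkEncoder()
|     encoder.set_watermark("bytes", b"SDXL")   # assumed payload
|     bgr_wm = encoder.encode(bgr, "dwtDct")    # DWT+DCT embedding
|     cv2.imwrite("generated_wm.png", bgr_wm)
|
|     decoder = WatermarkDecoder("bytes", 32)   # length in bits
|     recovered = decoder.decode(
|         cv2.imread("generated_wm.png"), "dwtDct")
|     print(recovered)                          # b"SDXL" if intact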
| menzoic wrote:
| It's open source. It can be removed directly from the code.
| donkeyd wrote:
| Also, most people use other tools like web-ui to run these
| models, which don't seem to apply any watermarks.
| p1esk wrote:
| Sure, I just don't understand why they're doing it.
|
| In theory they could also train the model to always add a
| watermark.
| tjoff wrote:
| Many good reasons have been presented here. And on the
| other side, is there any reason not to do it?
|
| (apart from the effort required to put it in place)
| bogdan wrote:
| I guess it's a form of tracking and we don't like that
| brucethemoose2 wrote:
| There is nothing personally identifiable in the watermark
| though, no more than other image metadata.
|
| And yeah, you can see the watermarking code in the
| huggingface pipeline now. It's pretty simple, and there's
| really no reason to disable it.
| adventured wrote:
| What exactly is in the watermark? Other image metadata is
| commonly used to track people, either on its own or in
| combination with a few other hints.
|
| If it's strictly a primitive watermark with absolutely
| nothing unique per mark, then it might be no big deal.
| brucethemoose2 wrote:
| It is. See for yourself:
|
| https://github.com/huggingface/diffusers/blob/2367c1b9fa3
| 126...
|
| I see no reason to disable it. It's fast, and it's no more
| personal than basic metadata like image format and
| resolution... I can't think of any use case for disabling
| it other than passing off AI-generated images as authentic.
| saynay wrote:
| There has been some talk of various countries requiring AI
| images to be watermarked. They may be trying to signal their
| willingness to comply, and hopefully ensure they are not
| outright banned. The fact that it can be bypassed isn't
| necessarily their problem to fix.
| aenvoker wrote:
| Stability, Midjourney, Dall-e, etc don't want their
| products to be used for misinformation. They really do want
| AI generated images to be trivially identifiable.
|
| There are papers around describing how to bake a watermark
| into a model. Don't know if anyone is doing that yet.
| Der_Einzige wrote:
| And many will start working on techniques to unbake it
| from the model. A cat and mouse game, like all other
| attempts to track.
| rcme wrote:
| One reason is so that they can detect generated images and
| exclude them from future training sets.
| lumost wrote:
| We'll almost certainly see models trained to produce watermarks
| directly in the next few months. A highly diffuse statistical
| abnormality would be nearly impossible to detect or remove.
| p1esk wrote:
| And as soon as such models appear, methods will appear to
| process images and destroy such "statistical anomalies"
| without changing their visible appearance.
| enlyth wrote:
| Would this not be trivially defeatable by something like a
| single pass of img2img with a very low denoise?
| goldenkey wrote:
| Microstates vs macrostates. Doesn't it depend on the basis
| for the encoding and how many bits it is? By "watermark" we
| usually think of something in the pixel plane, but a
| watermark can live in the frequency spectrum, or in other
| clever, resilient steganographic bases.
| bob1029 wrote:
| Simply running a cycle of "ok" quality jpeg compression
| can completely devastate information encoded at higher
| frequencies.
|
| Quantization & subsampling do a hell of a job at getting
| rid of "unwanted" information. If a human cannot perceive
| the watermark, then processes that aggressively approach
| this threshold of perception will potentially succeed at
| removing it.
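|
| This is easy to test empirically; a self-contained round-trip
| sketch using the invisible-watermark API as I understand it
| (random image, watermark, one pass of moderate-quality JPEG,
| then try to decode -- whether it survives depends on the
| quality setting):
|
|     import cv2
|     import numpy as np
|     from imwatermark import WatermarkEncoder, WatermarkDecoder
|
|     bgr = np.random.randint(0, 256, (768, 768, 3), dtype=np.uint8)
|
|     enc = WatermarkEncoder()
|     enc.set_watermark("bytes", b"test")
|     bgr_wm = enc.encode(bgr, "dwtDct")
|
|     ok, buf = cv2.imencode(".jpg", bgr_wm,
|                            [cv2.IMWRITE_JPEG_QUALITY, 70])
|     bgr_jpeg = cv2.imdecode(buf, cv2.IMREAD_COLOR)
|
|     dec = WatermarkDecoder("bytes", 32)
|     print(dec.decode(bgr_wm, "dwtDct"))    # b"test"
|     print(dec.decode(bgr_jpeg, "dwtDct"))  # may come back garbled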
| daniel_reetz wrote:
| At this point the watermark can be meaningful content
| like "every time there's a bird there is also a cirrus
| cloud", or "blades of grass lean slightly further to the
| left than a natural distribution".
|
| Because our main interest is this meaningful content, it
| will be harder to scrub from the image.
| washadjeffmad wrote:
| That would be indistinguishable from a model that was
| also trained on that output, wouldn't it?
|
| It seems much more likely that it's their solution to
| detect and filter AI images from being used in their
| training corpus - kind of a latent "robots.txt".
| TheOtherHobbes wrote:
| Depends on the watermark. The current watermark uses a DCT
| to add a bit of undetectable noise around the edges. It
| survives resizing but tends to fail after rotation or
| cropping.
|
| A robust, rotationally invariant, repeating-but-not-obvious
| micro-watermark would be quite a neat trick.
| rcxdude wrote:
| Watermarks can be made to be pretty robust, though having
| an open source detection routine will make it much easier
| to remove (if nothing else, if you can differentiate the
| detection algorithm then you can basically just optimise it
| away). The kind of watermarking that is e.g. used for
| 'traitor tracing' tends to rely on at least a symmetric
| key being kept secret, if not the techniques used as a whole
| (which can include more specific changes to the content
| that are carefully varied between different receivers of
| the watermarked information, like slightly different cuts
| in a film, for example).
| Jack000 wrote:
| Not seeing anything about the dataset -- are they still using
| LAION? There's no mention of LAION in the paper and the results
| look quite different from 1.5 so I'm guessing no.
|
| > the model may encounter challenges when synthesizing intricate
| structures, such as human hands
|
| I think there are two main reasons for poor hands/text:
|
| - Humans care about certain areas of the image more than others,
| giving high saliency to faces, hands, body shape etc and lower
| saliency to backgrounds and textures. Due to the way the unet is
| trained it cares about all areas of the image equally. This means
| model capacity per area is uniform, leading to capacity problems
| for objects with a large number of configurations that humans
| care more about. (A rough sketch of this idea as a weighted loss
| follows below.)
|
| - The sampling procedure implicitly assumes a uniform amount of
| variance over the entire image. Text glyphs basically never
| change, which means we should effectively have infinite CFG in
| the parts of the image that contain text.
|
| I'm not sure if there's any point in working on this though,
| since both can be fixed by simply making a bigger model.
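|
| Not something the paper does, but the first point could be
| written down as a saliency-weighted denoising loss, roughly as
| below. Where the saliency map comes from is left abstract; a
| face/hand detector or a learned saliency model would be the
| obvious candidates:
|
|     import torch
|
|     def weighted_denoise_loss(noise_pred, noise, saliency):
|         """MSE diffusion loss, upweighted where humans look.
|
|         noise_pred, noise: (B, C, H, W) tensors.
|         saliency: (B, 1, H, W) map in [0, 1] from some
|         external saliency model (hypothetical here).
|         """
|         weight = 1.0 + 4.0 * saliency  # 1x background, 5x salient
|         return (weight * (noise_pred - noise) ** 2).mean()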
| sorenjan wrote:
| Are any of these txt2img models being partially trained on
| synthetic datasets? Automatically rendering tens of thousands
| of images with different textures, backgrounds, camera poses,
| etc should be trivial with a handful of human models, or text
| using different fonts.
| mkaic wrote:
| From hanging out around the LAION Discord server a bunch over
| the past few months, I've gathered that they're still using
| LAION-5B in some capacity, but they've done a bunch of
| filtering on it to remove low-quality samples. I believe Emad
| tweeted something to this effect at some point, too, but I
| can't find the tweet right now.
| Zetobal wrote:
| The problem with LAION was never the picture quality; it's the
| tag quality and a lot of unusable data points. They traded some
| compute to a horde of users to tag the images for them, though.
| https://dbzer0.com/blog/a-collaboration-begins-between-stabl...
| naillo wrote:
| I love stability AI
| chaos_emergent wrote:
| how refreshing it is to see technical details in a technical
| report /s
| brucethemoose2 wrote:
| > At inference time, a user can then set the desired apparent
| resolution of the image via this size-conditioning.
|
| This is really cool.
|
| So is the difference between the first and second stage. The
| output of the first stage looks good enough for generating
| "drafts" which can then be picked by the user before being
| finished by the second stage.
|
| The improved cropping data augmentation is also a big deal. Bad
| framing is a constant issue with SD 1.5.
| ilaksh wrote:
| You can access this with the Stability.ai API, but it was really
| slow and unstable when I tried it (the day it came out) -- about
| 20 seconds per image.
|
| Supposedly the Clipdrop API was also going to support it.
| Clipdrop has a web page for SDXL 0.9 but nothing in the API docs
| about it.
|
| Anyone know what the story is with the Clipdrop API and SDXL 0.9?
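|
| From memory, a call through the v1 REST API looked roughly like
| the sketch below -- the engine id and field names are my best
| guess and may not match the current docs:
|
|     import base64, requests
|
|     url = ("https://api.stability.ai/v1/generation/"
|            "stable-diffusion-xl-1024-v0-9/text-to-image")
|     resp = requests.post(
|         url,
|         headers={"Authorization": "Bearer MY_API_KEY",
|                  "Accept": "application/json"},
|         json={"text_prompts": [{"text": "a watercolor fox"}],
|               "width": 1024, "height": 1024, "steps": 30},
|         timeout=120)
|     resp.raise_for_status()
|     for i, art in enumerate(resp.json()["artifacts"]):
|         with open(f"out_{i}.png", "wb") as f:
|             f.write(base64.b64decode(art["base64"]))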
| cypress66 wrote:
| Are the weights out?
| GaggiX wrote:
| Only research weights for now, but they should release the
| weights for everyone this month.
___________________________________________________________________
(page generated 2023-07-04 23:01 UTC)