[HN Gopher] I Made Stable Diffusion XL Smarter by Finetuning It ...
       ___________________________________________________________________
        
       I Made Stable Diffusion XL Smarter by Finetuning It on Bad AI-
       Generated Images
        
       Author : minimaxir
       Score  : 250 points
       Date   : 2023-08-21 16:09 UTC (6 hours ago)
        
 (HTM) web link (minimaxir.com)
 (TXT) w3m dump (minimaxir.com)
        
       | msp26 wrote:
       | >A minor weakness with LoRAs is that you can only have one active
       | at a time
       | 
       | Uh this isn't true at all, at least with auto1111.
        
         | minimaxir wrote:
         | IIRC it does merging/weighting behind the scenes.
        
           | cheald wrote:
           | I'm pretty sure that it's just serially summing the network
           | weights, which results in an accumulated offset to the self-
           | attention layers of the transformer. It's not doing any kind
           | of analysis of multiple networks prior to application to make
           | them "play nice" together; it's just looping and summing.
           | 
           | https://github.com/AUTOMATIC1111/stable-diffusion-
           | webui/blob...
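            | 
            | Roughly, the per-layer effect is something like the
            | following (a toy PyTorch sketch of my reading of that code,
            | not the actual webui implementation; shapes and scales are
            | made up):
            | 
            |     import torch
            | 
            |     # Hypothetical base weight for one attention projection,
            |     # plus two LoRAs as (up/B, down/A, user weight) triples.
            |     W = torch.randn(768, 768)
            |     loras = [
            |         (torch.randn(768, 8), torch.randn(8, 768), 0.7),
            |         (torch.randn(768, 8), torch.randn(8, 768), 0.5),
            |     ]
            | 
            |     # "Serially summing": each LoRA's low-rank delta is
            |     # scaled and added in turn -- no cross-LoRA analysis,
            |     # just a loop and a sum.
            |     W_merged = W.clone()
            |     for B, A, scale in loras:
            |         W_merged += scale * (B @ A)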
        
           | Der_Einzige wrote:
           | Source for this?
        
       | rabuse wrote:
       | Creating art with stable diffusion has become such a fun hobby of
       | mine. The difference between SD 1.5/2.0 and SDXL is massive, and
       | it's impressive how quickly the quality is improving with this
       | stuff.
        
         | hospitalJail wrote:
         | >The difference between SD 1.5/2.0 and SDXL is massive,
         | 
         | Can you explain?
         | 
          | I haven't used SDXL yet, but I spent a ton of time in 1.5.
         | 
         | So far I gathered:
         | 
         | >Higher res
         | 
         | >higher 'quality'
         | 
          | But given that I was using Realistic Vision 3 for so long, I
          | never had a quality issue. With upscaling, I never needed
          | higher res.
        
           | Sharlin wrote:
           | Yes, currently SDXL doesn't really beat the best SD1.5
           | checkpoints quality-wise. But it (and the currently available
            | checkpoints) shows awesome promise, so give it six months
           | or so.
        
             | CuriouslyC wrote:
              | The best 1.5 checkpoints are constrained in their output
              | flexibility to achieve the quality they get, though, and
              | they don't follow prompts nearly as well as SDXL. So if
              | the model doesn't naturally gravitate towards doing what
              | you want, it's very hard to steer it anywhere. SDXL also
              | does a better job with full anatomy, which is why shared
              | 1.5 generations tend to be torso-up or portrait shots.
        
             | AuryGlenz wrote:
             | Currently SDXL is better than SD1.5 checkpoints at pretty
             | much everything other than portraits (or anime drawings) of
             | pretty women.
             | 
             | Unfortunately it seems that's all people want to generate,
             | as is evident when you search for SD on Twitter.
        
               | ChatGTP wrote:
               | Stable diffusion doesn't grant the user a good
               | imagination or taste unfortunately
        
               | Sharlin wrote:
                | Yes, point conceded, I should've said something about the
               | flexibility and capability of SDXL rather than just image
               | quality in a narrow sense.
        
           | AuryGlenz wrote:
           | Here's an example using my dog - a trained checkpoint on one
           | of the nicer SD 1.5 models and a LoRA for the SDXL ones:
           | https://imgur.com/a/PklEKwC
           | 
           | The first 3 images are some of my attempts at making her into
           | a Pokemon. Some turned out pretty good (after generating 50+
           | per type), but I struggled with water in particular. It was
           | hard to get her to have a fin, especially with no additional
           | tail.
           | 
            | I haven't done many in SDXL, but that's the point. I've
            | probably generated... 10 images of her as a Pokemon, just
            | when I was first trying out the LoRA. The next 2 images are
            | from that, and that was before I had a good ComfyUI
            | workflow to boot.
           | 
           | The rest are various sample images from SDXL showing how
           | versatile it is. In most of those, I only had to generate a
           | few images per prompt to get something pretty darn great. In
           | the Halo 2 one the prompt was literally "an xbox 360
           | screenshot of cinderdog in Halo 2, multiplayer."
           | 
           | And it made her into a freaking Elite, and it worked
           | wonderfully. I previously tried to generate ones like those
           | candyland images in 1.5 models and the foreground and
           | background just didn't look good. In SDXL it just works.
        
             | jononor wrote:
              | Very cool! How many images did you use to create the LoRA
             | of your dog? Do you have any guide to recommend?
        
               | AuryGlenz wrote:
               | It was about 30 images, though I'm planning on adding
               | more and training again sometime. Either that or
               | splitting it up between when her hair is short and when
               | it's long, as it really changes how she looks.
               | 
                | This isn't what I used for my dog's LoRA, but I used it
               | for my wife and it worked better than what I was doing
               | before (Adafactor):
               | https://civitai.notion.site/SDXL-1-0-Training-
               | Overview-4fb03...
               | 
               | I'd recommend increasing the network dimension to at
               | least 64, if your VRAM can take it. I can do 64 with my
               | 12GB card. At least for people, I've had better luck
               | using a token that's a celebrity. I'm not sure how to try
               | that with my dog - perhaps just "terrier dog" or
               | something.
        
               | jononor wrote:
               | Thanks! Looks like I'll need to rent a GPU to use SDXL
               | fine tuning. Poor old RTX2060 not gonna cut it.
        
           | zirgs wrote:
           | From my experiments it seems that SD XL understands prompts
           | much better. While SD 1.5 is great at generating your typical
           | "anime girl with big boobs" stuff - if you try to generate
           | something a little bit more unusual - it usually doesn't
           | generate exactly what you want and seems to straight up
           | ignore large parts of the prompt.
           | 
           | SD XL seems to understand weird and unusual prompts a lot
           | better.
           | 
           | SD XL is capable of generating 1024x1024 images without hacks
           | like "hires fix". That's a very good thing, because hires fix
           | sometimes introduces additional glitches while upscaling.
           | Especially at higher denoising strength. Hires fix fixed the
           | broken face - yay, but the subject now has 3 legs instead of
           | two. Things like that happen far less often with SD XL.
        
             | 3abiton wrote:
             | > While SD 1.5 is great at generating your typical "anime
             | girl with big boobs" stuff - if you try to generate
             | something a little bit more unusual - it usually doesn't
             | generate exactly what you want and seems to straight up
             | ignore large parts of the prompt.
             | 
              | Pretty much my experience with SD 1.5, but I'll give XL a
              | try.
        
           | davely wrote:
           | I hope you'll forgive me for a bit of a self promotion here,
           | but I think I have an interesting example of SD 1.5 (what
           | most people are familiar with and what most models are based
           | off of) vs SDXL.
           | 
           | Before Phony Stark shut down the Twitter API, I was running a
           | bot that created landscape images with Stable Diffusion v1.5.
           | Its name is Mr. RossBot [1]. Check out the Twitter page for
           | some examples of the quality.
           | 
           | This weekend, I finally updated the code to get it running on
           | Mastodon. In the process, I updated the model to use SDXL
           | [2]. It's running the exact same code otherwise to randomly
           | generate prompts.
           | 
           | The image caption is a simplified version of the prompt.
           | e.g., "Snowcapped mountain peaks with an oxbow lake at golden
           | hour."
           | 
           | Behind the scenes, a whole bunch of extra descriptive stuff
           | is added, so the prompt that SD v1.5 / SDXL get is:
           | "beautiful painting of snowcapped mountain peaks with an
           | oxbow lake at golden hour, concept art, trending on
           | artstation, 8k, very sharp, extremely detailed, volumetric,
           | beautiful lighting, serene, oil painting, wet-on-wet brush
           | strokes, bob ross style"
           | 
           | Anyway, I feel like the quality of SDXL is sharper and it
           | just nails subjects a lot better. It also tries to add
           | reflections and shadows (not always correctly), whereas that
           | didn't happen as much with SD v1.5.
           | 
            | I'm pretty impressed! Especially because Stability.ai had
            | released updated models of Stable Diffusion before SDXL: SD
            | v2.0 and SD v2.1. The results (IMHO) were absolute garbage
            | using the same prompts.
           | 
           | [1] https://twitter.com/mrrossbot
           | 
           | [2] https://botsin.space/@MrRossBot
        
       | politelemon wrote:
        | Please consider posting the LoRA on civitai.com as well as the
        | Stable Diffusion subreddit.
        | 
        | These results look pretty good; looking forward to trying it
        | out. I hadn't realized that the generative image buzz was dying
        | down; since I'm using it regularly, I guess it always feels
        | like there's buzz to me.
        
         | minimaxir wrote:
         | I posted the original release to /r/StableDiffusion but all the
         | comments are "why not compatable with A1111?" and I can't find
         | a good script to do the conversion:
         | https://www.reddit.com/r/StableDiffusion/comments/15r5k3i/i_...
         | 
         | Civitai has syndicated the LoRA:
         | https://civitai.com/models/128708/sdxl-wrong-lora
        
           | Zetobal wrote:
            | You will get more users if you provide a safetensors file
            | instead of bin/pickle tensors. A lot of people got really
            | scared by the malware scare that was going around social
            | media a few months ago.
        
             | Sharlin wrote:
             | And for a good reason. A big hunk of floating-point numbers
             | really shouldn't be able to execute arbitrary code. Or any
             | code at all.
        
             | araes wrote:
             | Thank you for note on this. I had not heard there were
             | already trojan horse malware being slipped into tensor
             | files as python scripts. Apparently torch pickle uses eval
             | on the tensor file with no filter.
             | 
             | Heard surprisingly little commentary on this topic. The
             | full explanation of how Safetensors are "Safe" can be found
             | from the developer at:
             | https://github.com/huggingface/safetensors/discussions/111
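              | 
              | A minimal sketch of the safer load path (file names are
              | placeholders):
              | 
              |     import torch
              |     from safetensors.torch import load_file
              | 
              |     # safetensors is a flat, data-only format, so no code
              |     # can be smuggled into the file.
              |     state_dict = load_file("lora.safetensors")
              | 
              |     # If you must load a pickle-based .bin/.pt file,
              |     # recent PyTorch versions can restrict unpickling to
              |     # plain tensor data:
              |     state_dict = torch.load("lora.bin", weights_only=True)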
        
               | homarp wrote:
               | also safetensors security audit:
               | https://huggingface.co/blog/safetensors-security-audit
        
             | 0cf8612b2e1e wrote:
             | I would also ask that sha hashes are posted somewhere. It
             | annoys me to know end how difficult it can be to confirm
             | you are using the real model.
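              | 
              | For what it's worth, checking is easy once a hash is
              | published (a small sketch; the file name is a
              | placeholder):
              | 
              |     import hashlib
              | 
              |     # Stream the file so multi-GB checkpoints don't need
              |     # to fit in memory.
              |     h = hashlib.sha256()
              |     with open("model.safetensors", "rb") as f:
              |         for chunk in iter(lambda: f.read(1 << 20), b""):
              |             h.update(chunk)
              |     print(h.hexdigest())  # compare to the published hash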
        
         | chankstein38 wrote:
          | Agreed. I feel like people (and I do this a lot as well) have
          | a tendency to track their own habits and assume everyone
          | follows them. From my perspective, the gen image buzz is
          | still as hot as ever!
          | 
          | If I lacked excitement for SDXL, it was because it felt like
          | there was no massive jump in image quality to me. Sure, the
          | size doubling is great, but it also presents a problem, as I
          | don't always want to generate 1024x1024 images. I still use
          | third-party trained 1.5 models because they create damned
          | good outputs, and I have like 5 different upscaling
          | solutions, at least one of which will add new detail as
          | things are upscaled.
        
           | Sharlin wrote:
            | SDXL is more resolution-agnostic than SD1.x; 768x768 works
            | fine, but admittedly going down to 512x512 does tend to
            | produce cropped images.
        
       | letitgo12345 wrote:
       | Similar to https://arxiv.org/abs/2307.12950
        
       | carbocation wrote:
       | Tangentially related: for reasons I don't yet really understand,
        | the LoRAs that I build for Stable Diffusion XL only work well if
       | I give a pretty generic negative prompt.
       | 
       | These are fine-tuned on 6 photos of my face, and if I use them
       | with positive prompts, the generated characters don't look much
       | like me. But if I add generic negative terms like "low quality",
       | suddenly the depiction of my face is almost exactly right.
       | 
       | I've trained several models and this has been true across a range
       | of learning rates and number of training epochs.
       | 
       | To me, this feels like it will somehow ultimately be connected to
       | whatever is driving minimaxir's observations in this post.
        
       | sorenjan wrote:
        | This is really interesting. As mentioned in the article, this
        | is a kind of RLHF, which is what takes GPT-3 from a difficult-
        | to-use LLM to a chatbot that is able to confuse some people
        | into thinking it has consciousness. It makes it much more
        | usable.
       | 
       | I don't know how these models are trained, but hopefully future
       | models will include bad results as negative training data, baking
       | it into the base model.
       | 
        | It's only mentioned in passing in the article, but apparently
        | it's possible to merge LoRAs? How would you do that? I'd like
        | to use one LoRA to include my own subjects, this LoRA to make
        | the results better, and maybe a third one for a particular
        | style.
        
         | minimaxir wrote:
         | Merging LoRAs is essentially taking a weighted average of the
         | LoRA adapter weights. It's more common in other UIs.
         | 
         | diffusers is working on a PR for it:
         | https://github.com/huggingface/diffusers/pull/4473
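          | 
          | Conceptually the merge is just this (a toy sketch; file names
          | and weights are placeholders, and the real PR also handles
          | rank/alpha bookkeeping):
          | 
          |     import torch
          |     from safetensors.torch import load_file, save_file
          | 
          |     lora_a = load_file("subject_lora.safetensors")
          |     lora_b = load_file("wrong_lora.safetensors")
          |     w_a, w_b = 0.6, 0.4  # merge weights
          | 
          |     merged = {}
          |     for key in set(lora_a) | set(lora_b):
          |         if key in lora_a and key in lora_b:
          |             merged[key] = w_a * lora_a[key] + w_b * lora_b[key]
          |         else:
          |             # Keys unique to one LoRA are carried over, scaled.
          |             src, w = (lora_a, w_a) if key in lora_a else (lora_b, w_b)
          |             merged[key] = w * src[key]
          | 
          |     save_file(merged, "merged_lora.safetensors")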
        
       | kwhitefoot wrote:
       | > XL
       | 
       | Extra Large? 40 times?
        
         | sschueller wrote:
         | 1024 x 1024 instead of 512 x 512.
        
           | Taek wrote:
           | XL more likely refers to the parameter count, which is 3
           | billion instead of <1 billion
        
             | not2b wrote:
             | No, I think it is mainly because it's optimized for 1024 x
             | 1024 images, rather than 512 x 512 as the previous version
             | was.
        
               | Our_Benefactors wrote:
               | It's both. More pixel space _and_ more parameters.
        
         | brianjking wrote:
         | What? XL is the current version of Stable Diffusion.
        
           | pbjtime wrote:
           | It's already on version 40?
        
             | bckr wrote:
             | Extra large
        
               | jfoutz wrote:
               | It's a Roman numeral joke.
        
               | ShamelessC wrote:
               | Not a very good one.
        
       | [deleted]
        
       | Jackson__ wrote:
       | >The release went mostly under-the-radar because the generative
       | image AI buzz has cooled down a bit. Everyone in the AI space is
       | too busy with text-generating AI like ChatGPT (including
       | myself!).
       | 
       | I disagree with this statement. The release went mostly under the
       | radar for 2 reasons, according to the people I've talked to.
       | 
        | 1. Higher VRAM and compute requirements
       | 
       | 2. Perceived lower quality outputs compared to specialized SD1.5
       | models.
       | 
       | If either of these points had been different, it would have
       | gained a lot more popularity I'm sure.
       | 
       | But alas, most people now simply wait and see if specialized SDXL
       | models can actually improve upon specialized 1.5 models.
        
       | Der_Einzige wrote:
        | This concept is not new. There are lots of "negative
        | embeddings" on civitai.com that you put into negative prompts
        | to fix hands and bad anatomy.
        
         | minimaxir wrote:
         | That was my previous textual inversion experiment that I
         | mentioned in the post: https://minimaxir.com/2022/11/stable-
         | diffusion-negative-prom...
         | 
         | This submission is about a negative LoRA which does not behave
         | the same way at a technical level.
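          | 
          | For anyone curious, usage with diffusers looks roughly like
          | this (a sketch; the repo id and parameters here are
          | placeholders, see the repo for the real thing):
          | 
          |     import torch
          |     from diffusers import StableDiffusionXLPipeline
          | 
          |     pipe = StableDiffusionXLPipeline.from_pretrained(
          |         "stabilityai/stable-diffusion-xl-base-1.0",
          |         torch_dtype=torch.float16,
          |     ).to("cuda")
          | 
          |     # Hypothetical repo id for the "wrong" LoRA.
          |     pipe.load_lora_weights("minimaxir/sdxl-wrong-lora")
          | 
          |     # The trained trigger word goes in the *negative* prompt.
          |     image = pipe(
          |         prompt="a photo of a corgi surfing a wave",
          |         negative_prompt="wrong",
          |         guidance_scale=7.5,
          |     ).images[0]
          |     image.save("corgi.png")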
        
       | theptip wrote:
       | In general I'm really interested by the concept of personalized
       | RLHF. As we have more and more interactions with a given
       | generative AI system, it seems we'll start to have enough
       | interaction data to meaningfully steer the output towards our
       | personal preferences. I hope the UIs improve to make this as
       | transparent as possible.
       | 
       | Just thinking about how to productize this flow, it should be
       | quite easy to implement the "thumbs up/down" feedback option on
       | every image generated in the UI, plus an optional text label to
       | override "wrong". Then when you have enough HF (or nightly) you
       | could have a batch job to re-train a new LoRA with your updated
       | preferences.
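        | 
        | The capture side of that is trivial, something like (a
        | hypothetical sketch, not anything from the post):
        | 
        |     import json, time
        | 
        |     # One JSONL record per generated image; a nightly job can
        |     # later split this into "good" and labeled "bad" sets and
        |     # retrain a personal LoRA.
        |     def log_feedback(path, image_id, prompt, rating, label="wrong"):
        |         record = {
        |             "ts": time.time(),
        |             "image_id": image_id,
        |             "prompt": prompt,
        |             "rating": rating,  # +1 thumbs up, -1 thumbs down
        |             "negative_label": label if rating < 0 else None,
        |         }
        |         with open(path, "a") as f:
        |             f.write(json.dumps(record) + "\n")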
       | 
       | In principle you could collect HF from the implicit tree-
       | traversal that happens when you generate N candidate images from
       | a prompt and then pick one to refine. Or more explicitly, have a
       | quick UI to rank/score a batch, or a trash bin in the digital
       | workspace to discard images you don't like at each iteration of
       | refinement (batching that negative feedback to update your
       | project/global LoRA later).
       | 
       | Going further I wonder what the fastest possible iteration loop
       | for feedback would be? For images in particular you should be
       | able to wire up a very short feedback loop with keypresses in
       | response to image generation. What happens if you strap yourself
       | to that rig for a few hours and collect ~10k preferences at 1/s?
       | Can you get the model to be substantially more likely to output
       | the sort of images that you're personally going to like? Also
       | sounds pretty intense, I'm getting Clockwork Orange vibes.
       | 
       | I didn't spot in the article, how many `wrong` images were there?
       | From a quick skim of the code it looks like maybe 6 per keyword
       | with 13 keywords, so not many at all. ~100 is surprisingly little
       | feedback to steer the model this well.
        
         | davely wrote:
         | > Just thinking about how to productize this flow, it should be
         | quite easy to implement the "thumbs up/down" feedback option on
         | every image generated in the UI, plus an optional text label to
         | override "wrong". Then when you have enough HF (or nightly) you
         | could have a batch job to re-train a new LoRA with your updated
         | preferences.
         | 
         | The AI Horde [1] (an open source distributed cluster of GPUs
         | contributed by volunteers) has a partnership with Stability.ai
         | to effectively do this [2]. They are contributing some GPU
         | resources to AI Horde to run an A/B test.
         | 
         | If a user of one of the AI Horde UIs (Lucid Creations[3] or
         | ArtBot[4]... made by me) requests an image using an SDXL model,
         | they get 2 images back. One was created using SDXL v1.0. The
         | other was created using an updated model (you don't know which
         | is which).
         | 
         | You're asked to pick which image you like better of the two.
         | That's pretty much it. The result is sent back to Stability.ai
         | for analysis and incorporation into future image models.
         | 
         | EDIT: There is a similar partnership between the AI Horde and
         | LAION to provide user-defined aesthetics ratings for the same
         | thing[5].
         | 
         | [1] https://aihorde.net/
         | 
         | [2] https://dbzer0.com/blog/stable-diffusion-xl-beta-on-the-
         | ai-h...
         | 
         | [3] https://dbzer0.itch.io/lucid-creations
         | 
         | [4] https://tinybots.net/artbot
         | 
         | [5] https://laion.ai/blog/laion-stable-horde/
        
         | leopoldhaller wrote:
         | You may be interested in the open source framework we're
         | developing at https://github.com/agentic-ai/enact
         | 
         | It's still early, but the core insight is that a lot of these
         | generative AI flows (whether text, image, single models, model
         | chains, etc) will need to be fit via some form of feedback
         | signal, so it makes sense to build some fundamental
         | infrastructure to support that. One of the early demos (not
         | currently live, but I plan on bringing it back soon) was
         | precisely the type of flow you're talking about, although we
         | used 'prompt refinement' as a cheap proxy for tuning the actual
         | model weights.
         | 
         | Roughly, we aim to build out core python-level infra that makes
         | it easy to write flows in mostly native python and then allows
         | you track executions of your generative flows, including
         | executions of 'human components' such as raters. We also
         | support time travel / rewind / replay, automatic gradio UIs,
         | fastAPI (the latter two very experimental atm).
         | 
         | Medium term we want to make it easy to take any generative
         | flow, wrap it in a 'human rating' flow, auto-deploy as an API
         | or gradio UI and then fit using a number of techniques, e.g.,
         | RLHF, finetuning, A/B testing of generative subcomponents, etc,
         | so stay tuned.
         | 
         | At the moment, we're focused on getting the 'bones' right, but
         | between the quickstart (https://github.com/agentic-
         | ai/enact/blob/main/examples/quick...) and our readme
         | (https://github.com/agentic-ai/enact/tree/main#why-enact) you
         | get a decent idea of where we're headed.
         | 
         | We're looking for people to kick the tires / contribute, so if
         | this sounds interesting, please check it out.
        
         | MuffinFlavored wrote:
         | > RLHF
         | 
         | Reinforcement Learning from Human Feedback
         | 
          | Aren't these systems already trained to score good things
          | higher and bad things lower, as dictated by human feedback?
        
         | BoorishBears wrote:
         | Implicit RLHF works better than explicit.
         | 
          | It's just like the Mom Test: if you ask people to rate, you
          | affect their rating.
          | 
          | You can have the upscale flow, but you're not limited like
          | Discord-based Midjourney was: you can even show all the full-
          | sized images and detect that the person copied/saved/right-
          | clicked, for example.
        
         | minimaxir wrote:
         | > I didn't spot in the article, how many `wrong` images were
         | there? From a quick skim of the code it looks like maybe 6 per
         | keyword with 13 keywords, so not many at all. ~100 is
         | surprisingly little feedback to steer the model this well.
         | 
         | Correct: 6 CFG values * 13 keywords = 78 images. Some of them
         | aren't as useful though; apparently "random text" results in
         | old-school SMS applications sometimes!
         | 
         | LoRAs only need 4-5 images to work well, although that was for
         | older/smaller Stable Diffusion which is why I used more images
         | and trained the LoRA a bit longer for SDXL. The Ugly Sonic LoRA
         | in comparison used about 14 images and I suspect it overfit.
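          | 
          | The grid itself is just a nested loop over CFG values and
          | keywords, roughly (a sketch, not the exact code; the actual
          | CFG values, the 13 keywords, and the prompt template are in
          | the linked repo, not reproduced here):
          | 
          |     import torch
          |     from diffusers import StableDiffusionXLPipeline
          | 
          |     pipe = StableDiffusionXLPipeline.from_pretrained(
          |         "stabilityai/stable-diffusion-xl-base-1.0",
          |         torch_dtype=torch.float16,
          |     ).to("cuda")
          | 
          |     cfg_values = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]  # placeholders
          |     keywords = ["blurry", "deformed", "bad anatomy"]  # 13 in the post
          | 
          |     for keyword in keywords:
          |         for cfg in cfg_values:
          |             image = pipe(prompt=keyword, guidance_scale=cfg).images[0]
          |             name = f"wrong_{keyword.replace(' ', '_')}_{cfg}.png"
          |             image.save(name)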
        
           | theptip wrote:
           | It's really weird that this works. I can see how LoRA on a
           | specific fine-grained concept like Ugly Sonic can work with
           | so few samples, but naively I'd think such a diffuse concept
           | as "!wrong" should require more bits to specify! Like, isn't
           | the loss function already penalizing the model for being
           | "wrong" on all generated images?
           | 
           | (I wonder if there is a follow-up experiment to test if this
           | LoRA'd model actually has better loss on the original
           | training dataset? There's a very interesting interpretability
           | question here I think. Maybe it's just doing much better on a
           | small subset of possible images, but is slightly worse on the
           | remainder of the training data distribution.)
        
       | usrusr wrote:
       | Must be the formative years spent in the nineties' contradiction
       | field of "counter culture vs also counter culture, but counter
       | culture that's on MTV": there's something about prompts ending
       | with tag references like "award winning photo for vanity fair"
       | (or whatever the promptist's standard tag suffix turns out to be
       | in these posts) that inspires a very deep desire in me to not be
       | part of this generative image wave.
        
         | minimaxir wrote:
         | "award winning photo for vanity fair" is more a trick for good
         | photo composition (e.g. rule of threes) than anything else.
        
       | yantrams wrote:
        | Very cool. Will give this idea a spin soon. I'm a bit of a
        | scientist myself too :)
        | 
        | Here's something interesting I did a few days ago:
        | 
        | - Generated images using a mixture of different styles of
        | prompts with the SDXL base model (using Diffusers)
       | 
       | - Trained a LoRA with them
       | 
       | - Generated again with this LoRA + Prompts used to generate the
       | training set.
       | 
       | Ended up with results with enhanced effects - glitchier, weirder,
       | high def.
       | 
       | Results => https://imgur.com/gallery/vUobKPK
       | 
       | I'm gonna train another LoRA with these generations and repeat
       | the process obviously!
       | 
        | This is a pretty neat way to bypass the 77-token limit in
        | Diffusers and develop tons more styles, now that I think about
        | it.
       | 
       | You can play around with the LoRA at
       | https://replicate.com/galleri5/nammeh ( GitHub account needed )
       | 
       | Will publish it to CivitAI soon.
        
       | Footnote7341 wrote:
        | It became a trend among some data scientists maybe 5 years ago
        | to start recording every keystroke they made on their PC. I'm
        | kind of jealous now that that data is actually useful.
        | 
        | I have a collection of 30,000 anime art images that I like,
        | which I even competitively ranked for aesthetic score 5 years
        | ago; it would come in useful for something like this.
        
       | nullc wrote:
        | I wonder how much of this effect is just undoing Stability's
        | fine-tuning against inappropriate images.
        
       ___________________________________________________________________
       (page generated 2023-08-21 23:01 UTC)