[HN Gopher] I Made Stable Diffusion XL Smarter by Finetuning It ...
___________________________________________________________________
I Made Stable Diffusion XL Smarter by Finetuning It on Bad AI-
Generated Images
Author : minimaxir
Score : 250 points
Date : 2023-08-21 16:09 UTC (6 hours ago)
(HTM) web link (minimaxir.com)
(TXT) w3m dump (minimaxir.com)
| msp26 wrote:
| >A minor weakness with LoRAs is that you can only have one active
| at a time
|
| Uh this isn't true at all, at least with auto1111.
| minimaxir wrote:
| IIRC it does merging/weighting behind the scenes.
| cheald wrote:
| I'm pretty sure that it's just serially summing the network
| weights, which results in an accumulated offset to the self-
| attention layers of the transformer. It's not doing any kind
| of analysis of multiple networks prior to application to make
| them "play nice" together; it's just looping and summing.
|
| https://github.com/AUTOMATIC1111/stable-diffusion-
| webui/blob...
| Der_Einzige wrote:
| Source for this?
| rabuse wrote:
| Creating art with stable diffusion has become such a fun hobby of
| mine. The difference between SD 1.5/2.0 and SDXL is massive, and
| it's impressive how quickly the quality is improving with this
| stuff.
| hospitalJail wrote:
| >The difference between SD 1.5/2.0 and SDXL is massive,
|
| Can you explain?
|
| I haven't used SDXL yet, but I spent a ton of time in 1.5.
|
| So far I gathered:
|
| >Higher res
|
| >higher 'quality'
|
| But given that I was using Realistic Vision 3 for so long, I never
| had a quality issue. With upscaling, I never needed higher res.
| Sharlin wrote:
| Yes, currently SDXL doesn't really beat the best SD1.5
| checkpoints quality-wise. But it (and the currently available
| checkpoints) shows awesome promise, so give it six months
| or so.
| CuriouslyC wrote:
| The best 1.5 checkpoints are constrained in their output
| flexibility to achieve the quality they get though, and
| they don't follow prompts nearly as well as SDXL, so if the
| model doesn't naturally gravitate towards doing what you
| want it's very hard to steer it anywhere. SDXL also does a
| better job with full anatomy, which is the reason shared
| 1.5 generations tend to be torso up or portrait shots.
| AuryGlenz wrote:
| Currently SDXL is better than SD1.5 checkpoints at pretty
| much everything other than portraits (or anime drawings) of
| pretty women.
|
| Unfortunately it seems that's all people want to generate,
| as is evident when you search for SD on Twitter.
| ChatGTP wrote:
| Stable diffusion doesn't grant the user a good
| imagination or taste unfortunately
| Sharlin wrote:
| Yes, point conceded, I should've said something about the
| flexibility and capability of SDXL rather than just image
| quality in a narrow sense.
| AuryGlenz wrote:
| Here's an example using my dog - a trained checkpoint on one
| of the nicer SD 1.5 models and a LoRA for the SDXL ones:
| https://imgur.com/a/PklEKwC
|
| The first 3 images are some of my attempts at making her into
| a Pokemon. Some turned out pretty good (after generating 50+
| per type), but I struggled with water in particular. It was
| hard to get her to have a fin, especially with no additional
| tail.
|
| I haven't done many in SDXL, but that's the point. I've
| probably generated maybe 10 images of her as a Pokemon, just
| when I was first trying out the LoRA. The next 2 images are from that,
| and that was before I had a good ComfyUI workflow to boot.
|
| The rest are various sample images from SDXL showing how
| versatile it is. In most of those, I only had to generate a
| few images per prompt to get something pretty darn great. In
| the Halo 2 one the prompt was literally "an xbox 360
| screenshot of cinderdog in Halo 2, multiplayer."
|
| And it made her into a freaking Elite, and it worked
| wonderfully. I previously tried to generate ones like those
| candyland images in 1.5 models and the foreground and
| background just didn't look good. In SDXL it just works.
| jononor wrote:
| Very cool! How many images did you use to create the LoRA
| of your dog? Do you have any guide to recommend?
| AuryGlenz wrote:
| It was about 30 images, though I'm planning on adding
| more and training again sometime. Either that or
| splitting it up between when her hair is short and when
| it's long, as it really changes how she looks.
|
| This isn't what I used for my dog's LoRA, but I used it
| for my wife and it worked better than what I was doing
| before (Adafactor):
| https://civitai.notion.site/SDXL-1-0-Training-
| Overview-4fb03...
|
| I'd recommend increasing the network dimension to at
| least 64, if your VRAM can take it. I can do 64 with my
| 12GB card. At least for people, I've had better luck
| using a token that's a celebrity. I'm not sure how to try
| that with my dog - perhaps just "terrier dog" or
| something.
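|
| (If you're training on the diffusers/peft stack instead,
| "network dimension" is the LoRA rank; a minimal sketch, with
| target module names that are just my assumption and vary by
| trainer:)
|         from peft import LoraConfig
|
|         # rank 64 with matching alpha; attention
|         # projections as the adaptation targets
|         lora_config = LoraConfig(
|             r=64,
|             lora_alpha=64,
|             target_modules=["to_q", "to_k", "to_v", "to_out.0"],
|         )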
| jononor wrote:
| Thanks! Looks like I'll need to rent a GPU to use SDXL
| fine tuning. Poor old RTX2060 not gonna cut it.
| zirgs wrote:
| From my experiments it seems that SD XL understands prompts
| much better. While SD 1.5 is great at generating your typical
| "anime girl with big boobs" stuff - if you try to generate
| something a little bit more unusual - it usually doesn't
| generate exactly what you want and seems to straight up
| ignore large parts of the prompt.
|
| SD XL seems to understand weird and unusual prompts a lot
| better.
|
| SD XL is capable of generating 1024x1024 images without hacks
| like "hires fix". That's a very good thing, because hires fix
| sometimes introduces additional glitches while upscaling.
| Especially at higher denoising strength. Hires fix fixed the
| broken face - yay, but the subject now has 3 legs instead of
| two. Things like that happen far less often with SD XL.
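|
| For reference, native 1024x1024 generation with diffusers is
| just this (a minimal sketch; the prompt is only an example):
|         import torch
|         from diffusers import StableDiffusionXLPipeline
|
|         pipe = StableDiffusionXLPipeline.from_pretrained(
|             "stabilityai/stable-diffusion-xl-base-1.0",
|             torch_dtype=torch.float16,
|         ).to("cuda")
|
|         # no hires fix needed; SDXL targets 1024px natively
|         image = pipe("a lighthouse in a thunderstorm",
|                      width=1024, height=1024).images[0]
|         image.save("out.png")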
| 3abiton wrote:
| > While SD 1.5 is great at generating your typical "anime
| girl with big boobs" stuff - if you try to generate
| something a little bit more unusual - it usually doesn't
| generate exactly what you want and seems to straight up
| ignore large parts of the prompt.
|
| Pretty much my experience with SD 1.5, but I'll give XL a try.
| davely wrote:
| I hope you'll forgive me for a bit of self-promotion here,
| but I think I have an interesting example of SD 1.5 (what
| most people are familiar with and what most models are based
| off of) vs SDXL.
|
| Before Phony Stark shut down the Twitter API, I was running a
| bot that created landscape images with Stable Diffusion v1.5.
| Its name is Mr. RossBot [1]. Check out the Twitter page for
| some examples of the quality.
|
| This weekend, I finally updated the code to get it running on
| Mastodon. In the process, I updated the model to use SDXL
| [2]. It's running the exact same code otherwise to randomly
| generate prompts.
|
| The image caption is a simplified version of the prompt.
| e.g., "Snowcapped mountain peaks with an oxbow lake at golden
| hour."
|
| Behind the scenes, a whole bunch of extra descriptive stuff
| is added, so the prompt that SD v1.5 / SDXL actually gets is:
| "beautiful painting of snowcapped mountain peaks with an
| oxbow lake at golden hour, concept art, trending on
| artstation, 8k, very sharp, extremely detailed, volumetric,
| beautiful lighting, serene, oil painting, wet-on-wet brush
| strokes, bob ross style"
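|
| (Roughly this template - a reconstruction, not the actual
| Mr. RossBot code:)
|         STYLE_SUFFIX = (
|             ", concept art, trending on artstation, 8k,"
|             " very sharp, extremely detailed, volumetric,"
|             " beautiful lighting, serene, oil painting,"
|             " wet-on-wet brush strokes, bob ross style"
|         )
|
|         def build_prompt(caption):
|             # e.g. "Snowcapped mountain peaks with an oxbow
|             # lake at golden hour."
|             body = caption.rstrip(".").lower()
|             return "beautiful painting of " + body + STYLE_SUFFIX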
|
| Anyway, I feel like the quality of SDXL is sharper and it
| just nails subjects a lot better. It also tries to add
| reflections and shadows (not always correctly), whereas that
| didn't happen as much with SD v1.5.
|
| I'm pretty impressed! Especially because Stability.ai had
| released updated models of Stable Diffusion before SDXL: SD
| v2.0 and SD v2.1. The results (IMHO) were absolute garbage
| using the same prompts.
|
| [1] https://twitter.com/mrrossbot
|
| [2] https://botsin.space/@MrRossBot
| politelemon wrote:
| Please consider posting the LoRA on civitai.com as well as the
| Stable Diffusion subreddit.
|
| These results look pretty good; looking forward to trying it out.
| I hadn't realized that the generative image buzz was dying out;
| since I'm using it regularly, I guess it always feels like
| there's buzz to me.
| minimaxir wrote:
| I posted the original release to /r/StableDiffusion but all the
| comments are "why not compatable with A1111?" and I can't find
| a good script to do the conversion:
| https://www.reddit.com/r/StableDiffusion/comments/15r5k3i/i_...
|
| Civitai has syndicated the LoRA:
| https://civitai.com/models/128708/sdxl-wrong-lora
| Zetobal wrote:
| You will get more users if you provide a safetensors file
| instead of bin and pickle tensors. A lot of people got really
| scared by the malware scare that went through social media a
| few months ago.
| Sharlin wrote:
| And for a good reason. A big hunk of floating-point numbers
| really shouldn't be able to execute arbitrary code. Or any
| code at all.
| araes wrote:
| Thank you for the note on this. I had not heard that
| trojan-horse malware was already being slipped into tensor
| files as Python code. Apparently torch's pickle-based
| loading can execute arbitrary code from the file with no
| filtering.
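|
| Safetensors sidesteps that: the file is just raw tensor
| bytes plus a JSON header, so loading it is a drop-in swap
| (the file name below is made up):
|         from safetensors.torch import load_file
|
|         # cannot run code on load, unlike a pickle-based
|         # .bin/.ckpt opened with torch.load()
|         state_dict = load_file("sdxl_wrong_lora.safetensors")
|         n = sum(t.numel() for t in state_dict.values())
|         print(n, "parameters")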
|
| Heard surprisingly little commentary on this topic. The
| full explanation of how Safetensors are "Safe" can be found
| from the developer at:
| https://github.com/huggingface/safetensors/discussions/111
| homarp wrote:
| also safetensors security audit:
| https://huggingface.co/blog/safetensors-security-audit
| 0cf8612b2e1e wrote:
| I would also ask that SHA hashes be posted somewhere. It
| annoys me to no end how difficult it can be to confirm
| you are using the real model.
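|
| In the meantime, hashing the file yourself and comparing
| against whatever the author publishes works (a generic
| sketch):
|         import hashlib
|
|         def sha256_of(path, chunk_size=1 << 20):
|             # stream it so multi-GB checkpoints don't have
|             # to fit in memory
|             h = hashlib.sha256()
|             with open(path, "rb") as f:
|                 for chunk in iter(lambda: f.read(chunk_size), b""):
|                     h.update(chunk)
|             return h.hexdigest()
|
|         print(sha256_of("sdxl_wrong_lora.safetensors"))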
| chankstein38 wrote:
| Agreed. I feel like people (and I do this a lot as well) have a
| tendency to track their own habits and assume everyone follows
| them. From my perspective, the gen image buzz is still as hot
| as ever!
|
| If I lacked excitement for SDXL, it was because it felt like
| there was no massive jump in image quality to me. Sure the size
| doubling is great, but it also presents a problem, as I don't
| always want to generate 1024x1024 images. I still use third
| party trained 1.5 models because they create damned good
| outputs and I have like 5 different upscaling solutions and at
| least one will add new detail as things are upscaled.
| Sharlin wrote:
| SDXL is more resolution-agnostic than SD1.x; 768x768 works
| fine, but admittedly going down to 512x512 does tend to
| produce cropped images.
| letitgo12345 wrote:
| Similar to https://arxiv.org/abs/2307.12950
| carbocation wrote:
| Tangentially related: for reasons I don't yet really understand,
| the LoRAs that I build for Stable Diffusion XL only work well if
| I give a pretty generic negative prompt.
|
| These are fine-tuned on 6 photos of my face, and if I use them
| with positive prompts, the generated characters don't look much
| like me. But if I add generic negative terms like "low quality",
| suddenly the depiction of my face is almost exactly right.
|
| I've trained several models and this has been true across a range
| of learning rates and number of training epochs.
|
| To me, this feels like it will somehow ultimately be connected to
| whatever is driving minimaxir's observations in this post.
| sorenjan wrote:
| This is really interesting. As mentioned in the article, this
| is a kind of RLHF, and that's what takes GPT-3 from a difficult
| to use LLM to a chat bot which is able to confuse some people
| into thinking it has consciousness. It makes it much more usable.
|
| I don't know how these models are trained, but hopefully future
| models will include bad results as negative training data, baking
| it into the base model.
|
| It's only mentioned in passing in the article, but apparently
| it's possible to merge LoRAs? How would you do that? I'd like to
| use one LoRA to include my own subjects, this LoRA to make the
| results better, and maybe a third one for a particular style.
| minimaxir wrote:
| Merging LoRAs is essentially taking a weighted average of the
| LoRA adapter weights. It's more common in other UIs.
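|
| Done by hand it's just this (a sketch; it assumes the LoRAs
| share key names and shapes, i.e. the same rank, and the file
| names are hypothetical):
|         from safetensors.torch import load_file, save_file
|
|         def merge_loras(paths, weights):
|             # element-wise weighted sum of matching tensors
|             merged = {}
|             for path, w in zip(paths, weights):
|                 for key, tensor in load_file(path).items():
|                     merged[key] = merged.get(key, 0) + w * tensor
|             return merged
|
|         out = merge_loras(["subject.safetensors",
|                            "wrong.safetensors"], [0.7, 0.3])
|         save_file(out, "merged.safetensors")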
|
| diffusers is working on a PR for it:
| https://github.com/huggingface/diffusers/pull/4473
| kwhitefoot wrote:
| > XL
|
| Extra Large? 40 times?
| sschueller wrote:
| 1024 x 1024 instead of 512 x 512.
| Taek wrote:
| XL more likely refers to the parameter count, which is 3
| billion instead of <1 billion
| not2b wrote:
| No, I think it is mainly because it's optimized for 1024 x
| 1024 images, rather than 512 x 512 as the previous version
| was.
| Our_Benefactors wrote:
| It's both. More pixel space _and_ more parameters.
| brianjking wrote:
| What? XL is the current version of Stable Diffusion.
| pbjtime wrote:
| It's already on version 40?
| bckr wrote:
| Extra large
| jfoutz wrote:
| It's a Roman numeral joke.
| ShamelessC wrote:
| Not a very good one.
| [deleted]
| Jackson__ wrote:
| >The release went mostly under-the-radar because the generative
| image AI buzz has cooled down a bit. Everyone in the AI space is
| too busy with text-generating AI like ChatGPT (including
| myself!).
|
| I disagree with this statement. The release went mostly under the
| radar for 2 reasons, according to the people I've talked to.
|
| 1. Higher VRAM and compute requirements
|
| 2. Perceived lower quality outputs compared to specialized SD1.5
| models.
|
| If either of these points had been different, it would have
| gained a lot more popularity I'm sure.
|
| But alas, most people now simply wait and see if specialized SDXL
| models can actually improve upon specialized 1.5 models.
| Der_Einzige wrote:
| This concept is not new. There are lots of "negative embeddings"
| on civit.ai that you put into negative prompts to fix hands and
| bad anatomy.
| minimaxir wrote:
| That was my previous textual inversion experiment that I
| mentioned in the post: https://minimaxir.com/2022/11/stable-
| diffusion-negative-prom...
|
| This submission is about a negative LoRA which does not behave
| the same way at a technical level.
| theptip wrote:
| In general I'm really interested by the concept of personalized
| RLHF. As we have more and more interactions with a given
| generative AI system, it seems we'll start to have enough
| interaction data to meaningfully steer the output towards our
| personal preferences. I hope the UIs improve to make this as
| transparent as possible.
|
| Just thinking about how to productize this flow, it should be
| quite easy to implement the "thumbs up/down" feedback option on
| every image generated in the UI, plus an optional text label to
| override "wrong". Then when you have enough HF (or nightly) you
| could have a batch job to re-train a new LoRA with your updated
| preferences.
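|
| Even an append-only log would do for the collection side (a
| hypothetical sketch, all names made up):
|         import json, time
|
|         def log_feedback(image_path, prompt, rating,
|                          label="wrong", path="feedback.jsonl"):
|             # one record per judgment; a nightly batch job can
|             # turn the thumbs-down rows into a LoRA training set
|             record = {"image": image_path, "prompt": prompt,
|                       "rating": rating, "label": label,
|                       "ts": time.time()}
|             with open(path, "a") as f:
|                 f.write(json.dumps(record) + "\n")
|
|         log_feedback("out/0001.png", "a watercolor fox", "down")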
|
| In principle you could collect HF from the implicit tree-
| traversal that happens when you generate N candidate images from
| a prompt and then pick one to refine. Or more explicitly, have a
| quick UI to rank/score a batch, or a trash bin in the digital
| workspace to discard images you don't like at each iteration of
| refinement (batching that negative feedback to update your
| project/global LoRA later).
|
| Going further I wonder what the fastest possible iteration loop
| for feedback would be? For images in particular you should be
| able to wire up a very short feedback loop with keypresses in
| response to image generation. What happens if you strap yourself
| to that rig for a few hours and collect ~10k preferences at 1/s?
| Can you get the model to be substantially more likely to output
| the sort of images that you're personally going to like? Also
| sounds pretty intense, I'm getting Clockwork Orange vibes.
|
| I didn't spot in the article, how many `wrong` images were there?
| From a quick skim of the code it looks like maybe 6 per keyword
| with 13 keywords, so not many at all. ~100 is surprisingly little
| feedback to steer the model this well.
| davely wrote:
| > Just thinking about how to productize this flow, it should be
| quite easy to implement the "thumbs up/down" feedback option on
| every image generated in the UI, plus an optional text label to
| override "wrong". Then when you have enough HF (or nightly) you
| could have a batch job to re-train a new LoRA with your updated
| preferences.
|
| The AI Horde [1] (an open source distributed cluster of GPUs
| contributed by volunteers) has a partnership with Stability.ai
| to effectively do this [2]. They are contributing some GPU
| resources to AI Horde to run an A/B test.
|
| If a user of one of the AI Horde UIs (Lucid Creations[3] or
| ArtBot[4]... made by me) requests an image using an SDXL model,
| they get 2 images back. One was created using SDXL v1.0. The
| other was created using an updated model (you don't know which
| is which).
|
| You're asked to pick which image you like better of the two.
| That's pretty much it. The result is sent back to Stability.ai
| for analysis and incorporation into future image models.
|
| EDIT: There is a similar partnership between the AI Horde and
| LAION to provide user-defined aesthetics ratings for the same
| thing[5].
|
| [1] https://aihorde.net/
|
| [2] https://dbzer0.com/blog/stable-diffusion-xl-beta-on-the-
| ai-h...
|
| [3] https://dbzer0.itch.io/lucid-creations
|
| [4] https://tinybots.net/artbot
|
| [5] https://laion.ai/blog/laion-stable-horde/
| leopoldhaller wrote:
| You may be interested in the open source framework we're
| developing at https://github.com/agentic-ai/enact
|
| It's still early, but the core insight is that a lot of these
| generative AI flows (whether text, image, single models, model
| chains, etc) will need to be fit via some form of feedback
| signal, so it makes sense to build some fundamental
| infrastructure to support that. One of the early demos (not
| currently live, but I plan on bringing it back soon) was
| precisely the type of flow you're talking about, although we
| used 'prompt refinement' as a cheap proxy for tuning the actual
| model weights.
|
| Roughly, we aim to build out core python-level infra that makes
| it easy to write flows in mostly native python and then allows
| you to track executions of your generative flows, including
| executions of 'human components' such as raters. We also
| support time travel / rewind / replay, automatic gradio UIs,
| fastAPI (the latter two very experimental atm).
|
| Medium term we want to make it easy to take any generative
| flow, wrap it in a 'human rating' flow, auto-deploy as an API
| or gradio UI and then fit using a number of techniques, e.g.,
| RLHF, finetuning, A/B testing of generative subcomponents, etc,
| so stay tuned.
|
| At the moment, we're focused on getting the 'bones' right, but
| between the quickstart (https://github.com/agentic-
| ai/enact/blob/main/examples/quick...) and our readme
| (https://github.com/agentic-ai/enact/tree/main#why-enact) you
| get a decent idea of where we're headed.
|
| We're looking for people to kick the tires / contribute, so if
| this sounds interesting, please check it out.
| MuffinFlavored wrote:
| > RLHF
|
| Reinforcement Learning from Human Feedback
|
| Aren't these systems already trained to score good things
| higher and bad things lower, as dictated by human feedback?
| BoorishBears wrote:
| Implicit RLHF works better than explicit.
|
| It's just like the Mom Test: if you ask people to rate
| something, you affect their rating.
|
| You can have the upscale flow, but you're not limited like
| Discord-based Midjourney was: you can even show all the full-
| sized images and detect that the person copied/saved/right-
| clicked, for example.
| minimaxir wrote:
| > I didn't spot in the article, how many `wrong` images were
| there? From a quick skim of the code it looks like maybe 6 per
| keyword with 13 keywords, so not many at all. ~100 is
| surprisingly little feedback to steer the model this well.
|
| Correct: 6 CFG values * 13 keywords = 78 images. Some of them
| aren't as useful though; apparently "random text" results in
| old-school SMS applications sometimes!
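|
| (The grid is easy to reproduce in spirit - a hypothetical
| sketch; the CFG values and keywords here are placeholders,
| not the ones actually used:)
|         import torch
|         from diffusers import StableDiffusionXLPipeline
|
|         pipe = StableDiffusionXLPipeline.from_pretrained(
|             "stabilityai/stable-diffusion-xl-base-1.0",
|             torch_dtype=torch.float16,
|         ).to("cuda")
|
|         cfgs = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]  # placeholders
|         keywords = ["keyword_%d" % i for i in range(13)]
|
|         wrong_images = [pipe(k, guidance_scale=c).images[0]
|                         for c in cfgs for k in keywords]  # 78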
|
| LoRAs only need 4-5 images to work well, although that was for
| older/smaller Stable Diffusion which is why I used more images
| and trained the LoRA a bit longer for SDXL. The Ugly Sonic LoRA
| in comparison used about 14 images and I suspect it overfit.
| theptip wrote:
| It's really weird that this works. I can see how LoRA on a
| specific fine-grained concept like Ugly Sonic can work with
| so few samples, but naively I'd think such a diffuse concept
| as "!wrong" should require more bits to specify! Like, isn't
| the loss function already penalizing the model for being
| "wrong" on all generated images?
|
| (I wonder if there is a follow-up experiment to test if this
| LoRA'd model actually has better loss on the original
| training dataset? There's a very interesting interpretability
| question here I think. Maybe it's just doing much better on a
| small subset of possible images, but is slightly worse on the
| remainder of the training data distribution.)
| usrusr wrote:
| Must be the formative years spent in the nineties' contradiction
| field of "counter culture vs also counter culture, but counter
| culture that's on MTV": there's something about prompts ending
| with tag references like "award winning photo for vanity fair"
| (or whatever the promptist's standard tag suffix turns out to be
| in these posts) that inspires a very deep desire in me to not be
| part of this generative image wave.
| minimaxir wrote:
| "award winning photo for vanity fair" is more a trick for good
| photo composition (e.g. the rule of thirds) than anything else.
| yantrams wrote:
| Very cool. Will give this idea a spin soon. I'm a bit of a
| scientist myself too :)
|
| Here's something interesting I did a few days ago.
|
| - Generated images using a mixture of different styles of
| prompts with SDXL Base Model (using Diffusers)
|
| - Trained a LoRA with them
|
| - Generated again with this LoRA + Prompts used to generate the
| training set.
|
| Ended up with results with enhanced effects - glitchier, weirder,
| high def.
|
| Results => https://imgur.com/gallery/vUobKPK
|
| I'm gonna train another LoRA with these generations and repeat
| the process obviously!
|
| This is a pretty neat way to bypass the 77 token limit in
| Diffusers and develop tons more styles now that I think about
| it.
|
| You can play around with the LoRA at
| https://replicate.com/galleri5/nammeh ( GitHub account needed )
|
| Will publish it to CivitAI soon.
| Footnote7341 wrote:
| It became a trend among some data scientists maybe 5 years ago to
| start recording every keystroke they made on their PC. I'm kind
| of jealous now that such data is actually kind of useful.
|
| I have a large 30,000 image collection of anime art that I like,
| that I even competitively ranked for aesthetic score 5 years ago,
| which would come in useful for something like this.
| nullc wrote:
| I wonder how much of this effect is just undoing Stability's
| fine-tuning against inappropriate images.
___________________________________________________________________
(page generated 2023-08-21 23:01 UTC)