[HN Gopher] Stability AI Launches Stable Diffusion XL 0.9
       ___________________________________________________________________
        
       Stability AI Launches Stable Diffusion XL 0.9
        
       Author : seydor
       Score  : 143 points
       Date   : 2023-06-22 17:21 UTC (5 hours ago)
        
 (HTM) web link (stability.ai)
 (TXT) w3m dump (stability.ai)
        
       | EGreg wrote:
       | Is there image + text to image prompting?
       | 
       | Can I do dreambooth here? If so, what commands do I use?
        
         | brucethemoose2 wrote:
         | You can do this in SD 1.5/2.1 already. Encode the image like
         | you do for pix2pix, and the text, then average those latents
         | together.
         | 
         | Dreambooth is gonna require an A100 I think... I doubt it will
         | work on the free (16GB VRAM) Colab instances.
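          | 
          | A rough sketch of the simpler image + text route via the HF
          | diffusers img2img pipeline (not the exact latent-averaging
          | trick above; model ID and parameters are placeholders):
          | 
          |   import torch
          |   from PIL import Image
          |   from diffusers import StableDiffusionImg2ImgPipeline
          | 
          |   # Assumed SD 1.5 checkpoint; any compatible model ID works.
          |   pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
          |       "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
          |   ).to("cuda")
          | 
          |   init = Image.open("input.png").convert("RGB").resize((512, 512))
          |   out = pipe(
          |       prompt="a watercolor painting of the same scene",
          |       image=init,
          |       strength=0.6,        # how far to drift from the input image
          |       guidance_scale=7.5,  # how strongly to follow the text
          |   ).images[0]
          |   out.save("output.png")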
        
       | bbor wrote:
       | For anyone dumb like me: this is NOT a routine announcement,
       | definitely read it through. The improvements they show off are
       | honestly stunning.
       | 
       | Also, they've (re-) established a universal law of AI: fuck it,
       | just ensemble it
        
         | isoprophlex wrote:
         | Apparently that's what GPT4 is too: eight GPT3.5's ensembled
         | together.
         | 
         | Not sure if true, sounds plausible tho.
        
         | EGreg wrote:
         | How does ensembling work? What
        
           | minimaxir wrote:
           | Combining the results of multiple models and then adding
           | another layer onto the combined output tends to increase
           | accuracy / reduce error rates. (not new to AI: it's been done
           | for over a decade)
        
             | EGreg wrote:
             | But how to combine the results?
        
               | minimaxir wrote:
               | Usually just by concatenating the final hidden states
               | (before any classifier/regression/image output head)
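                | 
                | A minimal sketch of that kind of late fusion in
                | PyTorch: run two backbones, concatenate their final
                | hidden states, and train one extra head on top
                | (shapes and names are illustrative):
                | 
                |   import torch
                |   import torch.nn as nn
                | 
                |   class Ensemble(nn.Module):
                |       def __init__(self, a, b, dim_a, dim_b, n_out):
                |           super().__init__()
                |           self.a, self.b = a, b
                |           # extra layer learned on the combined output
                |           self.head = nn.Linear(dim_a + dim_b, n_out)
                | 
                |       def forward(self, x):
                |           ha = self.a(x)               # (batch, dim_a)
                |           hb = self.b(x)               # (batch, dim_b)
                |           return self.head(torch.cat([ha, hb], dim=-1))
                | 
                |   # toy usage with stand-in "models"
                |   ens = Ensemble(nn.Linear(32, 64), nn.Linear(32, 128),
                |                  64, 128, n_out=10)
                |   logits = ens(torch.randn(4, 32))     # (4, 10)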
        
           | liuliu wrote:
           | Many ways, for one: https://magicfusion.github.io/
        
       | xigency wrote:
        | The hands look better, but there are still hints of a sixth
        | finger in each of them.
        
         | brucethemoose2 wrote:
          | Nothing a hands LoRA can't fix.
        
           | xigency wrote:
           | True, but you still won't get a coherent picture of someone
           | picking their nose I bet.
        
         | seydor wrote:
         | and 5 phalanges
        
         | krunck wrote:
         | The last image has big toes for thumbs.
        
           | philshem wrote:
           | https://en.wikipedia.org/wiki/Brachydactyly_type_D
        
       | LorenDB wrote:
       | Any speculation why the AMD cards require twice the VRAM that
       | Nvidia cards do? I have an RX 6700 XT and I'm disappointed that
       | my 12 GB won't be enough.
        
         | jacooper wrote:
          | Your 6700XT won't work anyway, since it's not supported by ROCm
        
           | brucethemoose2 wrote:
            | The 6000 series "unofficially" works with ROCm, and that is
           | hopefully getting more official.
        
         | brucethemoose2 wrote:
         | You will want an optimized implementation from torch-mlir or
         | apache tvm anyway.
         | 
         | We had this for SD 1.5, but it always stayed obscure and
          | unpopular for some reason... I hope it's different this time
         | around.
        
         | brucethemoose2 wrote:
         | Probably no 4-bit quantization support? Or they are missing
         | some ops that the tensor core cards have?
         | 
         | My guess is AMD users will eventually get low VRAM
         | compatibility through Vulkan ports (like SHARK/Torch MLIR or
         | Apache TVM).
         | 
         | Then again, the existing Vulkan ports were kinda obscure and
         | unused with SD 1.5/2.1
        
       | jdalgetty wrote:
       | Is the mac studio with the Apple M2 Max with 12-core CPU, 30-core
       | GPU enough for something like this?
        
         | jrflowers wrote:
         | System requirements
         | 
         | Despite its powerful output and advanced model architecture,
         | SDXL 0.9 is able to be run on a modern consumer GPU, needing
         | only a Windows 10 or 11, or Linux operating system, with 16GB
         | RAM, an Nvidia GeForce RTX 20 graphics card (equivalent or
         | higher standard) equipped with a minimum of 8GB of VRAM. Linux
         | users are also able to use a compatible AMD card with 16GB
         | VRAM.
         | 
         | I'm guessing that it will work eventually, though I'm not sure
         | who will make that happen.
        
         | minimaxir wrote:
         | Likely not. Even with earlier models performance may be
          | finicky:
         | https://huggingface.co/docs/diffusers/optimization/mps
         | 
         | Additionally, for Apple Silicon you likely need 64 GB RAM
         | (since CPU/GPU memory is shared) which is expensive.
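          | 
          | For reference, the MPS path in that guide boils down to
          | something like this (model ID is a placeholder; SDXL support
          | would need its own pipeline):
          | 
          |   from diffusers import StableDiffusionPipeline
          | 
          |   pipe = StableDiffusionPipeline.from_pretrained(
          |       "runwayml/stable-diffusion-v1-5"
          |   )
          |   pipe = pipe.to("mps")  # Apple Silicon GPU backend
          | 
          |   # helps when unified memory is tight
          |   pipe.enable_attention_slicing()
          | 
          |   image = pipe("a photo of an astronaut riding a horse").images[0]
          |   image.save("astronaut.png")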
        
           | pkage wrote:
           | I've had good results with SD1.4/2 with MPS acceleration on
           | similar hardware (M1 Max, though with 64gb). No stability
           | issues with MPS, either. I'd say don't rule it out just yet.
        
           | SkyPuncher wrote:
           | I have an M2 MBP with 64 GB RAM. Performance with the older
            | models is very good in my opinion. It feels like it runs faster
            | locally than DreamStudio. I don't have benchmarks, but in any
           | case the performance is not bad.
        
         | itake wrote:
          | My guess is you will need CUDA support until someone ports it
          | to MPS.
        
         | brucethemoose2 wrote:
         | Apple ported Stable Diffusion 1.5/2.1 to MPS themselves.
         | 
          | If they don't do it for SDXL, the port will probably take
         | awhile (if it happens at all).
        
           | nozzlegear wrote:
           | I've used Apple's port of Stable Diffusion on my Mac Studio
           | with M1 Ultra and it worked flawlessly. I could even download
           | models from Hugging Face and convert them to a CoreML model
           | with little effort using Apple's conversion tool documented
           | in their Stable Diffusion repo [1]. Some models on Hugging
           | Face are already converted - I think anything tagged with
           | CoreML.
           | 
           | [1] https://github.com/apple/ml-stable-diffusion
        
       | heliophobicdude wrote:
        | If the image Emad tweeted [1] was made with SDXL, then text in
        | images could possibly be better!
       | 
        | 1: https://twitter.com/emostaque/status/1671885525639380992
        
         | jrflowers wrote:
         | That's probably DeepFloyd.
        
         | gwern wrote:
         | Text will be better due to simple scale, but the text will
         | still be limited due to the use of a CLIP for text encoding
          | (BPEs+contrastive). So that may be SDXL 0.9, but it should
          | still be worse than models that use T5, like
          | https://github.com/deep-floyd/IF
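          | 
          | For illustration, the CLIP text path looks roughly like this
          | (the checkpoint ID shown is the one SD 1.x uses; SDXL's exact
          | encoders may differ):
          | 
          |   from transformers import CLIPTokenizer, CLIPTextModel
          | 
          |   tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
          |   enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
          | 
          |   ids = tok(
          |       "a sign that says 'open 24 hours'",
          |       padding="max_length",
          |       max_length=tok.model_max_length,  # 77 BPE tokens
          |       truncation=True,
          |       return_tensors="pt",
          |   ).input_ids
          |   cond = enc(ids).last_hidden_state     # (1, 77, 768) conditioning
          | 
          |   # DeepFloyd IF swaps this for a much larger T5 text encoder,
          |   # which is why its spelling of text in images tends to be better.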
        
       | brucethemoose2 wrote:
       | What's the dataset? How commercially viable/legally questionable
       | is it?
       | 
       | This is critical, for legality of use, ethics concerns, _and_ the
       | quality of the output (as overly zealous filtering can degrade
       | the model like it did for SD 2.0).
        
         | xigency wrote:
         | If I recall correctly, Stability AI's process to skirt
         | copyright is to have the training data compiled and model
         | weights trained by a third-party university. Educational
         | research institutions have more lax requirements around
         | copyright. That may or may not be a legitimate way to work
         | under existing laws, but doesn't tell us much about what the
         | moral/ethical/legal considerations _should_ be, which seems
         | like an open question.
        
           | pmayrgundter wrote:
           | That sounds more like their story for SD 1.5 last year. I
           | think there was some kerfuffle between Stability.ai and
           | Runway.ai/Heidelberg Uni (see the Forbes article; won't link
           | as I'm unclear on the veracity), who they were working with,
           | and they may have parted ways by their first indie work on SD
           | 2.x around the holidays. Either way, the Uni connection story
           | may be old.
        
             | homarp wrote:
             | https://news.ycombinator.com/item?id=36185891 discusses the
             | Forbes article
        
       | xnx wrote:
       | The progress in AI is great, but very hard to keep up with (or
       | understand). It's real to me once it makes it into Automatic1111.
        
         | brucethemoose2 wrote:
         | Small rant:
         | 
         | I don't like how SD consolidated around the A1111 repo. The
         | features are great, and it was fantastic when SD was brand
          | new... but the performance and compatibility are awful, the
         | setup is tricky, and it sucked all the oxygen out of the room
         | that other SD UIs needed to flourish.
        
           | balthigor wrote:
           | I had the same issues with that repo in addition to some
           | early release controversy around its provenance.
           | 
           | I ended up using
           | https://github.com/easydiffusion/easydiffusion
           | 
           | which has served me well so far.
        
             | brucethemoose2 wrote:
             | VoltaML is making good progress. InvokeAI is also pretty
             | good (but not as optimized/bleeding edge).
        
           | ShamelessC wrote:
           | It's open source. The only way to compete is on your merits
           | in open source.
           | 
            | If you want another UI to flourish, clone both it and A1111,
            | copy and paste the bits from A1111 you'd like to have in yours
           | (with attribution) and push it up along with any features you
           | personally want.
           | 
           | That does require developer time, and developers may converge
           | on a popular implementation with good tests and lots of
           | features as it's easier to contribute.
           | 
           | The bottleneck isn't really the community though, it's the
           | developers.
        
             | brucethemoose2 wrote:
              | It's not that simple, as A1111 uses the old Stability AI
             | implementation while pretty much everything else uses HF
             | diffusers code.
             | 
              | I spent a while trying to add torch.compile support to A1111,
              | fixing some graph breaks locally, but... it was too
             | much. Some other things, like ML compilation backends, are
             | also basically impossible.
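              | 
              | For contrast, on the diffusers side torch.compile is
              | close to a one-liner (model ID is a placeholder):
              | 
              |   import torch
              |   from diffusers import StableDiffusionPipeline
              | 
              |   pipe = StableDiffusionPipeline.from_pretrained(
              |       "runwayml/stable-diffusion-v1-5",
              |       torch_dtype=torch.float16,
              |   ).to("cuda")
              | 
              |   # compile the UNet, the hot loop of sampling; graph breaks
              |   # in the model code are what make this hard to retrofit
              |   pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead")
              | 
              |   image = pipe("a lighthouse at dusk").images[0]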
        
       | vm90 wrote:
       | Curious to know how this compares to mid journey v5
        
         | orbital-decay wrote:
          | It's not better out of the box for text-to-image; this has been
          | known for quite some time. However, as soon as they release the
         | weights (in a month as they promise), it will benefit from the
         | tooling available for SD, without being limited to text to
         | image.
         | 
         | It's also a foundational model, not a finished product, and MJ
         | will possibly use it, like they did in v4 with SD 1.5.
        
           | Klaus4 wrote:
           | MJv4 is not related to SD at all
        
             | orbital-decay wrote:
              | I might be misremembering, but didn't they announce on
              | their Twitter that they were using SD somehow for MJ v4,
              | and later delete it along with a bunch of other tweets?
        
             | andybak wrote:
             | Got a link for that? I'm genuinely asking as I didn't know
             | this.
        
           | starshadowx2 wrote:
            | Midjourney only used a combination of SD and their own stuff
            | with the --beta and --test/testp models, which came between V3
            | and V4; other versions have no connection to SD.
        
       | varelse wrote:
       | [dead]
        
       | GaggiX wrote:
       | I'm already seeing the speedrun to implement this model
       | architecture in the automatic1111 webui.
        
         | swyx wrote:
          | implementing the model architecture and supporting it in a web
          | UI are two different things; maybe a link to a PR?
        
           | GaggiX wrote:
           | Do you mean that supporting the model in automatic webui is
           | much more difficult than just implementing the model in a
           | repo? I guess.
        
         | brucethemoose2 wrote:
         | The A1111 backend is kinda not set up for this, as it is built
         | around the old Stability AI 1.5/2.1 implementation (not HF
         | diffusers which most other backends use).
         | 
         | It would basically be a rewrite, if I were to guess... And at
          | that point they might as well port everything to diffusers.
        
       | ilaksh wrote:
       | So it's non-commercial but they are adding it to the API on
       | Monday? Does that mean it will be commercial then?
        
         | minimaxir wrote:
         | Clipdrop is owned by Stability AI and you can access the model
         | now, with its own API: https://clipdrop.co/stable-diffusion
         | 
         | The Stability AI API/DreamStudio API is slightly different.
         | Yes, it's confusing.
        
         | candiodari wrote:
          | This is how copyright works: the non-commercial terms apply to
          | everyone who copies the model from them, not to anyone producing
          | their own model.
        
         | binarymax wrote:
         | " The model can be accessed via ClipDrop today with API coming
         | shortly. Research weights are now available with an open
         | release coming mid-July as we move to 1.0."
         | 
         | I read this as: commercial use through our API now, self hosted
         | commercial use in July.
        
           | djsavvy wrote:
           | "Research weights" seems to imply non-commercial use only
           | though, right?
        
             | binarymax wrote:
             | "with an _open release_ coming mid-July "
        
       | joshuahedlund wrote:
       | Am I the only one who doesn't see an obvious difference in the
       | quality between the left and right photos? (Maybe the wolf one)
        | And these are extremely curated examples!
        
         | brucethemoose2 wrote:
         | Objective comparison is always so tricky with Stable Diffusion.
         | They should show off large batches, at the very least.
         | 
         | I think Stability is ostensibly showing that the images are
         | closer to the prompt (and the left wolf in particular has some
         | distortion around the eyes).
        
         | wodenokoto wrote:
         | I prefer the composition of the beta model over the release.
          | Quality-wise, I can't say one is better than the other. Maybe
         | the hand in the coffee picture is better for the 0.9 model.
        
         | og_kalu wrote:
         | It really is a much better base model aesthetically than 1.5,
          | 2.1, etc.
         | 
         | Comps here - https://imgur.com/a/FfECIMP
        
         | HelloMcFly wrote:
         | The wolf looks better, but also looks less like what you'd see
         | in a "nature documentary" (part of the prompt).
         | 
          | I think the coffee cup looks better in the right photo; it seems
          | a tad more real to me.
         | 
         | Like you I much prefer the alien photo on the left, but the
         | photos are so stylistically different I'm not sure that says
         | anything about the releases' respective capabilities.
        
         | thelogicguy wrote:
         | For the aliens, the right image has much more realistic
         | gradation. The one on the right looks like the grays have been
         | crushed out of it. There's also a funky glow coming from the
         | right edge of the alien.
         | 
         | I'd say the blur effects on the left images are much cleaner as
         | well. There are some weird artifacts at the fringes of objects
         | in the earlier version.
        
       | djbusby wrote:
       | So, there are at least a few dozen AI image generating sites,
        | some specialized, others not. Are they all powered by SD? Maybe
       | just with some better pre-prompting? Or are there other engines
        | (e.g. DALL-E)?
       | 
       | AFAIK only SD can be run locally?
        
         | pmayrgundter wrote:
         | I've only run across 3 primary models: Midjourney (via their
         | Discord), Dall-E and SD. And yes, there's a bunch of sites, but
         | I've seen very similar quality to SD and no mention yet of a
         | different base.
         | 
         | I do expect there are other bases out there, but haven't seen
         | any of quality yet.
         | 
         | Before this release (XL 0.9) it's been unclear how much of the
         | SD quality was in-house or came from their prior collab with
         | Runway/Heidelberg.
        
           | andybak wrote:
           | Kandinsky, Deep Floyd... Also Midjourney is derived from SD I
           | believe.
        
             | pmayrgundter wrote:
              | I don't think MJ is from SD... I found no mention of SD
              | on their site or on Wikipedia, besides a comparison. Any
              | citation?
        
             | og_kalu wrote:
             | None of those is derived from SD
        
               | andybak wrote:
               | I wasn't saying they were. Read back one more message.
               | 
               | (However, I thought Midjourney definitely was at some
               | point)
        
       | bogwog wrote:
       | > Nvidia GeForce RTX 20 graphics card (equivalent or higher
       | standard)
       | 
       | RIP my 1080 TI.
       | 
       | Does anyone know what specific feature they need which 20+ cards
       | have and older ones don't?
        
         | dist-epoch wrote:
         | RTX 20 have tensor cores. 1080 do not.
        
         | brucethemoose2 wrote:
         | My guess is low precision support, or some newer ops Pascal
         | does not support in a custom CUDA kernel.
        
         | binarymax wrote:
         | Not enough RAM in your 1080TI
         | 
         | Edit - it's not the RAM. 1080TI has 11GB and this press release
         | says it requires 8. So I'm going to speculate that it's because
         | 1080 lacks tensor cores compared to the 20x's Turing
         | architecture
        
           | nolok wrote:
            | Funny, knowing the stupidity Nvidia is pulling with the
            | 4xxx series regarding RAM amounts.
        
             | [deleted]
        
           | cma wrote:
           | Since it is now split into two models to do the generation,
           | you could load one and do the first stage of a bunch of
           | images, then load the second and complete them, with half the
           | vram usage.
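            | 
            | Sketch of that two-stage idea with the diffusers API (model
            | IDs follow Hugging Face naming and are assumptions here):
            | 
            |   import torch
            |   from diffusers import DiffusionPipeline
            | 
            |   prompts = ["a red fox in the snow", "a lighthouse at dusk"]
            | 
            |   base = DiffusionPipeline.from_pretrained(
            |       "stabilityai/stable-diffusion-xl-base-0.9",
            |       torch_dtype=torch.float16,
            |   ).to("cuda")
            |   # first stage for the whole batch, kept as latents
            |   latents = [base(p, output_type="latent").images for p in prompts]
            |   del base
            |   torch.cuda.empty_cache()  # free the first model's VRAM
            | 
            |   refiner = DiffusionPipeline.from_pretrained(
            |       "stabilityai/stable-diffusion-xl-refiner-0.9",
            |       torch_dtype=torch.float16,
            |   ).to("cuda")
            |   # second stage finishes each image
            |   images = [refiner(p, image=l).images[0]
            |             for p, l in zip(prompts, latents)]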
        
             | ShamelessC wrote:
             | I believe the HF pipeline can do this already and I assume
             | each stage uses more than 4 GB vram. There are other tricks
                | the open source community will come up with, though.
        
               | brucethemoose2 wrote:
               | 8 bit (and 4 bit?) quantization is low hanging fruit,
                | assuming it's not already running in 8 bit.
               | 
               | An 8GB requirement kinda sounds like they have already
               | quantized the model though.
        
           | bogwog wrote:
           | The post says it only needs 8GB, and my 1080 has 11GB.
        
       ___________________________________________________________________
       (page generated 2023-06-22 23:02 UTC)