[HN Gopher] Stable Cascade
       ___________________________________________________________________
        
       Stable Cascade
        
       Author : davidbarker
       Score  : 440 points
       Date   : 2024-02-13 17:23 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | obviyus wrote:
        | Been using it for a couple of hours and it seems much better at
        | following the prompt. Right away the quality seems worse compared
        | to some SDXL models, but I'll reserve judgement until I've had a
        | couple more days of testing.
       | 
       | It's fast too! I would reckon about 2-3x faster than non-turbo
       | SDXL.
        
         | kimoz wrote:
         | Can one run it on CPU?
        
           | ghurtado wrote:
            | You can run any ML model on the CPU. The question is the
            | performance.
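            | 
            | For a rough sense of what that looks like in practice, here
            | is a minimal diffusers sketch (the model name and settings
            | are just an example); the only thing that changes is which
            | device you move the pipeline to:
            | 
            |     import torch
            |     from diffusers import StableDiffusionPipeline
            | 
            |     # use the GPU if one is available, otherwise the CPU
            |     device = "cuda" if torch.cuda.is_available() else "cpu"
            |     # fp16 mainly helps on the GPU; CPUs generally want fp32
            |     dtype = torch.float16 if device == "cuda" else torch.float32
            | 
            |     pipe = StableDiffusionPipeline.from_pretrained(
            |         "runwayml/stable-diffusion-v1-5", torch_dtype=dtype
            |     ).to(device)
            | 
            |     image = pipe("an astronaut riding a horse").images[0]
            |     image.save("out.png")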
        
           | rwmj wrote:
            | Just to give you a rough idea of the performance: Stable
            | Diffusion on a 16-core AMD CPU takes me about 2-3 hours to
            | generate an image. (On the same machine's AMD iGPU it takes
            | 2 minutes or so.)
        
             | OJFord wrote:
              | Even older GPUs are worth using then, I take it?
              | 
              | For example, I pulled a (2GB I think, 4 tops) 6870 out of my
              | desktop because it's a beast (in physical size and power
              | consumption) and I wasn't using it for gaming or anything;
              | I figured I'd be fine with just the Intel integrated
              | graphics. But if I wanted to play around with some models
              | locally, it'd be worth putting it back & figuring out how
              | to use it as a secondary card?
        
               | rwmj wrote:
                | One counterintuitive advantage of the integrated GPU is
                | that it has access to system RAM (instead of a dedicated,
                | fixed amount of VRAM). That means I'm able to give the
                | iGPU 16 GB of RAM, and for me SD takes 8-9 GB of RAM when
                | running. System RAM is slower than VRAM, which is the
                | trade-off here.
        
               | OJFord wrote:
               | Yeah I did wonder about that as I typed, which is why I
               | mentioned the low amount (by modern standards anyway) on
               | the card. OK, thanks!
        
               | mat0 wrote:
               | No, I don't think so. I think you would need more VRAM to
               | start with.
        
               | purpleflame1257 wrote:
                | 2GB is really low. I've been able to use A1111 Stable
                | Diffusion on my old gaming laptop's 1060 (6GB VRAM) and
                | it takes a little less than a minute to generate an
                | image. You would probably need to try the --lowvram flag
                | on startup.
        
             | smoldesu wrote:
              | SDXL Turbo is much better, albeit kinda fuzzy and
              | distorted. I was able to get decent single-sample response
              | times (~80-100s) from my 4-core ARM Ampere instance, good
              | enough for a Discord bot with friends.
        
               | emadm wrote:
                | SD Turbo runs nicely on an M2 MacBook Air (as does
                | Stable LM 2!)
                | 
                | Much faster models will come.
        
             | adrian_b wrote:
              | If that is true, then the CPU variant must be a much worse
              | implementation of the algorithm than the GPU variant,
              | because the true ratio between GPU and CPU performance is
              | many times smaller than that.
        
           | sebzim4500 wrote:
           | Not if you want to finish the generation before you have
           | stopped caring about the results.
        
         | sorenjan wrote:
          | How much VRAM does it need? They mention that the largest model
          | uses 1.4 billion parameters more than SDXL, which in turn needs
          | a lot of VRAM.
        
           | adventured wrote:
            | There was a leak from Japan yesterday, prior to this release,
            | which suggested 20GB for the largest model.
            | 
            | This text was part of the Stability Japan leak (the 20GB VRAM
            | reference was dropped in the release today):
            | 
            | "Stages C and B will be released in two different models.
            | Stage C comes in 1B and 3.6B parameter versions, and Stage B
            | in 700M and 1.5B. However, if you want to minimize your
            | hardware needs, you can also use the 1B parameter version. In
            | Stage B, both give great results, but 1.5 billion is better
            | at reconstructing finer details. Thanks to Stable Cascade's
            | modular approach, the expected amount of VRAM required for
            | inference can be kept at around 20GB, but can be reduced even
            | further by using the smaller variants (as mentioned earlier,
            | this may reduce the final output quality)."
        
             | sorenjan wrote:
             | Thanks. I guess this means that fewer people will be able
             | to use it on their own computer, but the improved
             | efficiency makes it cheaper to run on servers with enough
             | VRAM.
             | 
              | Maybe running stage C first, unloading it from VRAM, and
              | then doing B and A would make it fit in 12 or even 8 GB,
              | but I wonder if the memory transfers would negate any time
              | saving. It might still be worth it if it produces better
              | images though.
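              | 
              | Roughly what I have in mind, as a sketch (the
              | StableCascadePriorPipeline / StableCascadeDecoderPipeline
              | names are assumed here, since diffusers support for this
              | model isn't out yet):
              | 
              |     import torch
              |     from diffusers import (
              |         StableCascadePriorPipeline,
              |         StableCascadeDecoderPipeline,
              |     )
              | 
              |     prompt = "a photo of a red fox in the snow"
              | 
              |     # Stage C (the prior): run it, keep the latent,
              |     # then free the VRAM
              |     prior = StableCascadePriorPipeline.from_pretrained(
              |         "stabilityai/stable-cascade-prior",
              |         torch_dtype=torch.bfloat16,
              |     ).to("cuda")
              |     prior_output = prior(prompt=prompt, num_inference_steps=20)
              |     del prior
              |     torch.cuda.empty_cache()
              | 
              |     # Stages B + A (the decoder): load only after stage C
              |     # has been released
              |     decoder = StableCascadeDecoderPipeline.from_pretrained(
              |         "stabilityai/stable-cascade",
              |         torch_dtype=torch.bfloat16,
              |     ).to("cuda")
              |     image = decoder(
              |         image_embeddings=prior_output.image_embeddings,
              |         prompt=prompt,
              |         num_inference_steps=10,
              |     ).images[0]
              |     image.save("fox.png")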
        
               | adventured wrote:
               | If it worked I imagine large batching could make it worth
               | the load/unload time cost.
        
               | Filligree wrote:
               | Sequential model offloading isn't too bad. It adds about
               | a second or less to inference, assuming it still fits in
               | main memory.
        
               | sorenjan wrote:
                | Sometimes I forget how fast modern computers are. PCIe 4.0
                | x16 has a transfer speed of 31.5 GB/s, so theoretically
                | it should take less than 100 ms to transfer stages B and
                | A. Maybe it's not so bad after all; it will be
                | interesting to see what happens.
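                | 
                | The back-of-the-envelope numbers, using fp16 weights and
                | the parameter counts from the leak above (stage A should
                | be tiny by comparison, so I'm ignoring it):
                | 
                |     PCIE4_X16_BYTES_PER_S = 31.5e9   # theoretical peak
                |     BYTES_PER_PARAM = 2              # fp16
                | 
                |     # parameter counts quoted in the leak
                |     sizes = {"stage C": 3.6e9, "stage B": 1.5e9}
                |     for name, params in sizes.items():
                |         size = params * BYTES_PER_PARAM
                |         ms = size / PCIE4_X16_BYTES_PER_S * 1e3
                |         print(f"{name}: {size / 1e9:.1f} GB, ~{ms:.0f} ms")
                | 
                |     # stage C: 7.2 GB, ~229 ms
                |     # stage B: 3.0 GB, ~95 ms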
        
           | liuliu wrote:
           | Should use no more than 6GiB for FP16 models at each stage.
           | The current implementation is not RAM optimized.
        
             | sorenjan wrote:
              | The large C model uses 3.6 billion parameters, which is 6.7
              | GiB if each parameter is 16 bits.
        
               | liuliu wrote:
                | The large C model has a fair bit of its parameters tied
                | to text conditioning, not to the main denoising process.
                | Similar to how we split the network for SDXL Base, I am
                | pretty confident we can split a non-trivial amount of
                | parameters off into text conditioning, and hence load
                | fewer than 3.6B parameters during the denoising process.
        
         | vergessenmir wrote:
          | I'll take prompt adherence over quality any day. The machinery
          | otherwise isn't worth it, i.e. the ControlNets, OpenPose and
          | depth maps just to force a particular look or to achieve depth.
          | The solution becomes bespoke for each generation.
          | 
          | Had a test of it, and my opinion is it's an improvement when it
          | comes to following prompts, and I do find the images more
          | visually appealing.
        
           | stavros wrote:
           | Can we use its output as input to SDXL? Presumably it would
           | just fill in the details, and not create whole new images.
        
       | ttpphd wrote:
       | That is a very tiny latent space. Wow!
        
       | yogorenapan wrote:
       | Very impressive.
       | 
       | From what I understand, Stability AI is currently VC funded. It's
       | bound to burn through tons of money and it's not clear whether
       | the business model (if any) is sustainable. Perhaps worthy of
       | government funding.
        
         | minimaxir wrote:
          | Stability AI has been burning through tons of money for a while
         | now, which is the reason newer models like Stable Cascade are
         | not commercially-friendly-licensed open source anymore.
         | 
         | > The company is spending significant amounts of money to grow
         | its business. At the time of its deal with Intel, Stability was
         | spending roughly $8 million a month on bills and payroll and
         | earning a fraction of that in revenue, two of the people
         | familiar with the matter said.
         | 
         | > It made $1.2 million in revenue in August and was on track to
         | make $3 million this month from software and services,
         | according to a post Mostaque wrote on Monday on X, the platform
         | formerly known as Twitter. The post has since been deleted.
         | 
         | https://fortune.com/2023/11/29/stability-ai-sale-intel-ceo-r...
        
           | littlestymaar wrote:
           | > which is the reason newer models like Stable Cascade are
           | not commercially-friendly-licensed open source anymore.
           | 
            | The main reason is probably Midjourney and OpenAI using
            | their tech without any kind of contribution back. AI
            | desperately needs a GPL equivalent...
        
             | yogorenapan wrote:
             | > AI desperately needs a GPL equivalent
             | 
             | Why not just the GPL then?
        
               | loudmax wrote:
               | The GPL was intended for computer code that gets compiled
               | to a binary form. You can share the binary, but you also
               | have to share the code that the binary is compiled from.
               | Pre-trained model weights might be thought of as
               | analogous to compiled code, and the training data may be
               | analogous to program code, but they're not the same
               | thing.
               | 
               | The model weights are shared openly, but the training
               | data used to create these models isn't. This is at least
               | partly because all these models, including OpenAI's, are
               | trained on copyrighted data, so the copyright status of
               | the models themselves is somewhat murky.
               | 
               | In the future we may see models that are 100% trained in
               | the open, but foundational models are currently very
               | expensive to train from scratch. Either prices would need
               | to come down, or enthusiasts will need some way to share
               | radically distributed GPU resources.
        
               | emadm wrote:
               | Tbh I think these models will largely be trained on
               | synthetic datasets in the future. They are mostly trained
               | on garbage now. We have been doing opt outs on these, has
               | been interesting to see quality differential (or lack
               | thereof), eg removing books3 from stableLM 3b zephyr
               | https://stability.wandb.io/stability-llm/stable-
               | lm/reports/S...
        
               | keenmaster wrote:
               | Why aren't the big models trained on synthetic datasets
               | now? What's the bottleneck? And how do you avoid
               | amplifying the weaknesses of LLMs when you train on LLM
               | output vs. novel material from the comparatively very
               | intelligent members of the human species. Would be
               | interesting to see your take on this.
        
               | sillysaurusx wrote:
               | I've wondered whether books3 makes a difference, and how
               | much. If you ever train a model with a proper books3
               | ablation I'd be curious to know how it does. Books are an
               | important data source, but if users find the model useful
               | without them then that's a good datapoint.
        
               | protomikron wrote:
                | What about CC licenses for model weights? They're common
                | for files (images, video, audio, ...), so maybe they'd be
                | appropriate.
        
             | ipsum2 wrote:
             | It's highly doubtful that Midjourney and OpenAI use Stable
             | Diffusion or other Stability models.
        
               | jonplackett wrote:
               | How do you know though?
        
               | minimaxir wrote:
               | You can't use off-the-shelf models to get the results
               | Midjourney and DALL-E generate, even with strong
               | finetuning.
        
               | cthalupa wrote:
               | I pay for both MJ and DALL-E (though OpenAI mostly gets
               | my money for GPT) and don't find them to produce
               | significantly better images than popular checkpoints on
               | CivitAI. What I do find is that they are significantly
               | easier to work with. (Actually, my experience with
               | hundreds of DALL-E generations is that it's actually
               | quite poor in quality. I'm in several IRC channels where
               | it's the image generator of choice for some IRC bots, and
               | I'm never particularly impressed with the visual
               | quality.)
               | 
               | For MJ in particular, knowing that they at least used to
               | use Stable Diffusion under the hood, it would not
               | surprise me if the majority of the secret sauce is
               | actually a middle layer that processes the prompt and
               | converts it to one that is better for working with SD.
               | Prompting SD to get output at the MJ quality level takes
               | significantly more tokens, lots of refinement, heavy
               | tweaking of negative prompting, etc. Also a stack of
               | embeddings and LoRAs, though I would place those more in
               | the category of finetuning like you had mentioned.
        
               | emadm wrote:
               | If you try diffusionGPT with regional prompting added and
               | a GAN corrector you can get a good idea of what is
               | possible https://diffusiongpt.github.io
        
               | euazOn wrote:
                | That looks very impressive unless the demo is
                | cherry-picked. It would be great if this could be
                | integrated into a frontend like Fooocus:
                | https://github.com/lllyasviel/Fooocus
        
               | millgrove wrote:
               | What do you use it for? I haven't found a great use for
               | it myself (outside of generating assets for landing pages
               | / apps, where it's really really good). But I have seen
               | endless subreddits / instagram pages dedicated to various
               | forms of AI content, so it seems lots of people are using
               | it for fun?
        
               | cthalupa wrote:
               | Nothing professional. I run a variety of tabletop RPGs
               | for friends, so I mostly use it for making visual aids
                | there. I've also got a large-format printer that I was no
                | longer using for its original purpose, so I bought a few
               | front-loading art frames that I generate art for and
               | rotate through periodically.
               | 
               | I've also used it to generate art for deskmats I got
               | printed at https://specterlabs.co/
               | 
               | For commercial stuff I still pay human artists.
        
               | soultrees wrote:
               | What IRC Channels do you frequent?
        
               | cthalupa wrote:
               | Largely some old channels from the 90s/00s that really
               | only exist as vestiges of their former selves - not
               | really related to their original purpose, just rooms for
               | hanging out with friends made there back when they had a
               | point besides being a group chat.
        
               | yreg wrote:
               | That's not really true, MJ and DALL-E are just more
               | beginner friendly.
        
               | cthalupa wrote:
               | Midjourney 100% at least used to use Stable Diffusion:
               | https://twitter.com/EMostaque/status/1561917541743841280
               | 
               | I am not sure if that is still the case.
        
               | refulgentis wrote:
                | They trialled it as an explicitly optional model for a
                | moment a couple of years ago (or only a year? time moves
                | so fast; somewhere in the v2/v3 timeframe, around when SD
                | came out). I am sure it is no longer the case.
        
               | liuliu wrote:
               | DALL-E shares the same autoencoders as SD v1.x. It is
               | probably similar to how Meta's Emu-class models work
               | though. They tweaked the architecture quite a bit,
                | trained on their own dataset, reused some components (or
                | in Emu's case, trained all the components from scratch
                | but reused the same arch).
        
             | minimaxir wrote:
             | More specifically, it's so Stability AI can theoretically
             | make a business on selling commercial access to those
             | models through a membership:
             | https://stability.ai/news/introducing-stability-ai-
             | membershi...
        
             | programjames wrote:
             | I think it'd be interesting to have a non-profit "model
             | sharing" platform, where people can buy/sell compute. When
             | you run someone's model, they get royalties on the compute
             | you buy.
        
             | thatguysaguy wrote:
              | The net flow of knowledge about text-to-image generation
              | from OpenAI has definitely been outward. The early open
              | source methods used CLIP, which OpenAI came up with. DALL-E
              | (1) was also the first demonstration that we could do text-
              | to-image at all. (There were some earlier papers, years
              | before, which could give you a red splotch if you said
              | "stop sign" or something.)
        
           | loudmax wrote:
           | I get the impression that a lot of open source adjacent AI
           | companies, including Stability AI, are in the "???" phase of
           | execution, hoping the "Profit" phase comes next.
           | 
           | Given how much VC money is chasing the AI space, this isn't
           | necessarily a bad plan. Give stuff away for free while
           | developing deep expertise, then either figure out something
            | to sell, or pivot to proprietary, or get acqui-hired by a tech
           | giant.
        
             | minimaxir wrote:
             | That is indeed the case, hence the more recent pushes
             | toward building moats by every AI company.
        
         | seydor wrote:
         | exactly my thought. stability should be receiving research
         | grants
        
           | emadm wrote:
           | We should, we haven't yet...
           | 
            | Instead we've given 10m+ supercomputer hours in grants to all
            | sorts of projects. Now we have our grant team in place, and
            | there is a huge increase in available funding for folk that
            | can actually build stuff, which we can tap into.
        
         | downrightmike wrote:
         | Finally a good use to burn VC money!
        
         | sveme wrote:
         | None of the researchers are associated with stability.ai, but
         | with universities in Germany and Canada. How does this work? Is
         | this exclusive work for stability.ai?
        
           | emadm wrote:
           | Dom and Pablo both work for Stability AI (Dom finishing his
           | degree).
           | 
           | All the original Stable Diffusion researchers (Robin Rombach,
           | Patrick Esser, Dominik Lorenz, Andreas Blattman) also work
           | for Stability AI.
        
         | diggan wrote:
         | I've seen Emad (Stability AI founder) commenting here on HN
         | somewhere about this before, what exactly their business model
         | is/will be, and similar thoughts.
         | 
          | HN search doesn't seem to agree with me today though, and I
          | cannot find the specific comment/s I have in mind; maybe
          | someone else will have better luck? This is their user:
          | https://news.ycombinator.com/user?id=emadm
        
           | emadm wrote:
           | https://x.com/EMostaque/status/1649152422634221593?s=20
           | 
            | We now have top models of every type, sites like
            | www.stableaudio.com, memberships, custom model deals etc, so
            | lots of demand.
           | 
           | We're the only AI company that can make a model of any type
           | for anyone from scratch & are the most liked / one of the
           | most downloaded on HuggingFace
           | (https://x.com/Jarvis_Data/status/1730394474285572148?s=20,
           | https://x.com/EMostaque/status/1727055672057962634?s=20)
           | 
            | It's going OK; the team is working hard and shipping good
            | models, and they are accelerating their work on building
            | ComfyUI to bring it all together.
           | 
            | My favourite recent model was CheXagent; I think medical
            | models should be open & will really save lives:
           | https://x.com/Kseniase_/status/1754575702824038717?s=20
        
       | jedberg wrote:
       | I'd say I'm most impressed by the compression. Being able to
       | compress an image 42x is huge for portable devices or bad
       | internet connectivity (or both!).
        
         | incrudible wrote:
         | That is 42x _spatial_ compression, but it needs 16 channels
         | instead of 3 for RGB.
        
           | ansk wrote:
            | Furthermore, each of those 16 channels would typically be
            | multibyte floats as opposed to single-byte RGB channels.
            | (Speaking generally; I haven't read the paper.)
        
           | zamadatix wrote:
            | Even assuming 32 bit floats (the 4 on the end):
            | 
            | 16*24*24*4 = 36,864 bytes
            | 
            | vs (3 channels at one byte each, no alpha)
            | 
            | 3*1024*1024*1 = 3,145,728 bytes
            | 
            | Or roughly 1/85 of the raw size, assuming I haven't fucked up
            | the math/understanding somewhere (very possible at the
            | moment).
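            | 
            | As a quick sanity-check script (same assumptions: a
            | 16-channel 24x24 fp32 latent vs a 3-channel 8-bit 1024x1024
            | image):
            | 
            |     latent_bytes = 16 * 24 * 24 * 4     # channels * h * w * fp32
            |     image_bytes = 3 * 1024 * 1024 * 1   # RGB * h * w * 1 byte
            | 
            |     print(latent_bytes)                 # 36864
            |     print(image_bytes)                  # 3145728
            |     print(image_bytes / latent_bytes)   # ~85.3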
        
         | flgstnd wrote:
         | a 42x compression is also impressive as it matches the answer
         | to the ultimate question of life, the universe, and everything,
         | maybe there is some deep universal truth within this model.
        
         | seanalltogether wrote:
         | I have to imagine at this point someone is working toward a
         | fast AI based video codec that comes with a small pretrained
         | model and can operate in a limited memory environment like a tv
         | to offer 8k resolution with low bandwidth.
        
           | jedberg wrote:
           | I would be shocked if Netflix was _not_ working on that.
        
       | yogorenapan wrote:
       | I see in the commits that the license was changed from MIT to
       | their own custom one: https://github.com/Stability-
       | AI/StableCascade/commit/209a526...
       | 
       | Is it legal to use an older snapshot before the license was
       | changed in accordance with the previous MIT license?
        
         | OJFord wrote:
         | Yes, you can continue to do what you want with that commit^ in
         | accordance with the MIT licence it was released under. Kind of
         | like if you buy an ebook, and then they publish a second
         | edition but only as a hardback - the first edition ebook is
         | still yours to read.
        
         | treesciencebot wrote:
          | I think the model architecture (training code etc.) itself is
          | still under MIT, while the weights (which are the result of
          | training on a huge GPU cluster, plus the dataset they used
          | [not sure if they have publicly talked about it]) are under
          | this new license.
        
           | emadm wrote:
           | Code is MIT, weights are under the NC license for now.
        
         | ed wrote:
         | It seems pretty clear the intent was to use a non-commercial
         | license, so it's probably something that would go to court, if
         | you really wanted to press the issue.
         | 
         | Generally courts are more holistic and look at intent, and
         | understand that clerical errors happen. One exception to this
         | is if a business claims it relied on the previous license and
         | invested a bunch of resources as a result.
         | 
         | I believe the timing of commits is pretty important-- it would
         | be hard to claim your business made a substantial investment on
         | a pre-announcement repo that was only MIT'ed for a few hours.
        
       | gorkemyurt wrote:
       | we have an optimized playground here:
       | https://www.fal.ai/models/stable-cascade
        
         | adventured wrote:
         | "sign in to run"
         | 
         | That's a marketing opportunity being missed, especially given
         | how crowded the space is now. The HN crowd is more likely to
         | run it themselves when presented with signing up just to test
         | out a single generation.
        
           | treesciencebot wrote:
           | Uh, thanks for noticing it! We generally turn it off for
           | popular models so people can see the underlying inference
           | speed and the results but we forgot about it for this one, it
           | should now be auth-less with a stricter rate limit just like
           | other popular models in the gallery.
        
           | MattRix wrote:
            | It uses GitHub auth; it's not some complex process. I can see
            | why they would need to require accounts so it's harder to
            | abuse it.
        
       | holoduke wrote:
        | Wow, I like the compression part. A fixed 42x compression, that
        | is really nice. Slow to unpack on the fly, but the future is
        | waiting.
        
       | GaggiX wrote:
        | I remember doing some random experiments with these two
        | researchers to find the best way to condition stage B on the
        | latent; my very fancy cross-attn with relative 2D positional
        | embeddings didn't work as well as just concatenating the channels
        | of the input with the nearest upsample of the latent, so I just
        | gave up, ahah.
       | 
       | This model used to be known as Wurstchen v3.
        
       | joshelgar wrote:
       | Why are they benchmarking it with 20+10 steps vs. 50 steps for
       | the other models?
        
         | liuliu wrote:
          | Prior generations usually take fewer steps than vanilla SDXL to
          | reach the same quality.
          | 
          | But yeah, the inference speed improvement is mediocre (until I
          | take a look at exactly what computation is performed, I won't
          | have a more informed opinion on whether it is an implementation
          | issue or a model issue).
          | 
          | The prompt alignment should be better though. It looks like the
          | model has more parameters to work with for text conditioning.
        
           | treesciencebot wrote:
            | in my observation, it yields amazing perf at higher batch
            | sizes (4, or better, 8). i assume it is due to memory
            | bandwidth and the constrained latent space helping.
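            | 
            | e.g. (a sketch, assuming a diffusers-style
            | StableCascadePriorPipeline class ends up existing) asking for
            | 8 images per prompt amortises the weight reads across the
            | whole batch:
            | 
            |     import torch
            |     from diffusers import StableCascadePriorPipeline  # assumed
            | 
            |     prior = StableCascadePriorPipeline.from_pretrained(
            |         "stabilityai/stable-cascade-prior",
            |         torch_dtype=torch.bfloat16,
            |     ).to("cuda")
            | 
            |     # one call, 8 samples: the weights are streamed from VRAM
            |     # once per step for the whole batch, not once per image
            |     out = prior(
            |         prompt="a watercolor painting of a lighthouse",
            |         num_images_per_prompt=8,
            |         num_inference_steps=20,
            |     )
            |     print(out.image_embeddings.shape)  # leading batch dim of 8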
        
         | GaggiX wrote:
          | I think that this model used a consistency loss during training
          | so that it can yield better results with fewer steps.
        
       | gajnadsgjoas wrote:
       | Where can I run it if I don't have a GPU? Colab didn't work
        
         | detolly wrote:
         | runpod, kaggle, lambda labs, or pretty much any other server
         | provider that gives you one or more gpus.
        
       | k2enemy wrote:
       | I haven't been following the image generation space since the
       | initial excitement around stable diffusion. Is there an easy to
       | use interface for the new models coming out?
       | 
       | I remember setting up the python env for stable diffusion, but
       | then shortly after there were a host of nice GUIs. Are there some
       | popular GUIs that can be used to try out newer models? Similarly,
        | what's the best GUI for some of the older models? Preferably for
        | macOS.
        
         | thot_experiment wrote:
         | Auto1111 and Comfy both get updated pretty quickly to support
         | most of the new models coming out. I expect they'll both
         | support this soon.
        
           | stereobit wrote:
           | Check out invoke.com
        
       | cybereporter wrote:
       | Will this get integrated into Stable Diffusion Web UI?
        
       | hncomb wrote:
       | Is there any way this can be used to generate multiple images of
       | the same model? e.g. a car model rotated around (but all images
       | are of the same generated car)
        
         | matroid wrote:
         | Someone with resources will have to train Zero123 [1] with this
         | backbone.
         | 
         | [1] https://zero123.cs.columbia.edu/
        
       ___________________________________________________________________
       (page generated 2024-02-13 23:00 UTC)