[HN Gopher] Stable Cascade
___________________________________________________________________
Stable Cascade
Author : davidbarker
Score : 440 points
Date : 2024-02-13 17:23 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| obviyus wrote:
| Been using it for a couple of hours and it seems much better at
| following the prompt. Right away the quality seems worse compared
| to some SDXL models, but I'll reserve judgement until after a
| couple more days of testing.
|
| It's fast too! I would reckon it's about 2-3x faster than
| non-turbo SDXL.
| kimoz wrote:
| Can one run it on CPU?
| ghurtado wrote:
| You can run any ML model on a CPU. The question is the
| performance.
| rwmj wrote:
| Stable Diffusion takes me about 2-3 hours to generate an image
| on a 16-core AMD CPU, just to give you a rough idea of the
| performance. (On the same AMD's iGPU it takes 2 minutes or
| so.)
| OJFord wrote:
| Even older GPUs are worth using then I take it?
|
| For example I pulled a (2GB I think, 4 tops) 6870 out of my
| desktop because it's a beast (in physical size, and power
| consumption) and I wasn't using it for gaming or anything,
| figured I'd be fine just with the Intel integrated
| graphics. But if I wanted to play around with some models
| locally, it'd be worth putting it back & figuring out how
| to use it as a secondary card?
| rwmj wrote:
| One counterintuitive advantage of the integrated GPU is
| it has access to system RAM (instead of using a dedicated
| and fixed amount of VRAM). That means I'm able to give
| the iGPU 16 GB of RAM. For me SD takes 8-9 GB of RAM when
| running. The system RAM is slower than VRAM which is the
| trade-off here.
| OJFord wrote:
| Yeah I did wonder about that as I typed, which is why I
| mentioned the low amount (by modern standards anyway) on
| the card. OK, thanks!
| mat0 wrote:
| No, I don't think so. I think you would need more VRAM to
| start with.
| purpleflame1257 wrote:
| 2GB is really low. I've been able to use A1111 Stable
| Diffusion on my old gaming laptop's 1060 (6GB VRAM) and
| it takes a little less than a minute to generate an
| image. You would probably need to try the --lowvram flag
| on startup.
| smoldesu wrote:
| SDXL Turbo is much better, albeit kinda fuzzy and
| distorted. I was able to get decent single-sample response
| times (~80-100s) from my 4 core ARM Ampere instance, good
| enough for a Discord bot with friends.
| emadm wrote:
| SD Turbo runs nicely on an M2 MacBook Air (as does Stable
| LM 2!)
|
| Much faster models will come
| adrian_b wrote:
| If that is true, then the CPU variant must be a much worse
| implementation of the algorithm than the GPU variant,
| because the true ratio of the GPU and CPU performances is
| many times less than that.
| sebzim4500 wrote:
| Not if you want to finish the generation before you have
| stopped caring about the results.
| sorenjan wrote:
| How much VRAM does it need? They mention that the largest model
| uses 1.4 billion parameters more than SDXL, which in turn needs
| a lot of VRAM.
| adventured wrote:
| There was a leak from Japan yesterday, prior to this release,
| which suggested 20GB for the largest model.
|
| This text was part of the Stability Japan leak (the 20GB VRAM
| reference was dropped in the release today):
|
| "Stages C and B will be released in two different models.
| Stage C uses parameters of 1B and 3.6B, and Stage B uses
| parameters of 700M and 1.5B. However, if you want to minimize
| your hardware needs, you can also use the 1B parameter
| version. In Stage B, both give great results, but 1.5 billion
| is better at reconstructing finer details. Thanks to Stable
| Cascade's modular approach, the expected amount of VRAM
| required for inference can be kept at around 20GB, but can be
| reduced even further by using the smaller variants (though, as
| mentioned earlier, this may reduce the final output quality)."
| sorenjan wrote:
| Thanks. I guess this means that fewer people will be able
| to use it on their own computer, but the improved
| efficiency makes it cheaper to run on servers with enough
| VRAM.
|
| Maybe running stage C first, unloading it from VRAM, and
| then doing B and A would make it fit in 12 or even 8 GB, but I
| wonder if the memory transfers would negate any time
| saving. Might still be worth it if it produces better
| images though.
| adventured wrote:
| If it worked I imagine large batching could make it worth
| the load/unload time cost.
| Filligree wrote:
| Sequential model offloading isn't too bad. It adds about
| a second or less to inference, assuming it still fits in
| main memory.
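|
| A minimal sketch of what per-stage offloading might look like
| in plain PyTorch (stage_c, stage_b and stage_a here are
| stand-in modules, not the actual pipeline API):
|
|     import torch
|
|     def generate(stage_c, stage_b, stage_a, prompt_emb):
|         # keep only one stage resident in VRAM at a time
|         with torch.no_grad():
|             stage_c.to("cuda")
|             latents = stage_c(prompt_emb)
|             stage_c.to("cpu")           # free VRAM for stage B
|
|             stage_b.to("cuda")
|             refined = stage_b(latents, prompt_emb)
|             stage_b.to("cpu")
|
|             stage_a.to("cuda")          # decoder back to pixels
|             image = stage_a(refined)
|             stage_a.to("cpu")
|         return image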
| sorenjan wrote:
| Sometimes I forget how fast modern computers are. PCIe v4
| x16 has a transfer speed of 31.5 GB/s, so theoretically
| it should take less than 100 ms to transfer stage B and
| A. Maybe it's not so bad after all, it will be
| interesting to see what happens.
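|
| Back-of-the-envelope version of that, assuming the
| 1.5B-parameter stage B stored as fp16 (stage A is
| comparatively tiny):
|
|     stage_b_bytes = 1.5e9 * 2             # ~3 GB at fp16
|     pcie4_x16_bps = 31.5e9                # bytes/s, theoretical
|     print(stage_b_bytes / pcie4_x16_bps)  # ~0.095 s, i.e. <100 ms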
| liuliu wrote:
| Should use no more than 6GiB for FP16 models at each stage.
| The current implementation is not RAM optimized.
| sorenjan wrote:
| The large C model uses 3.6 billion parameters, which is 6.7
| GiB if each parameter is 16 bits.
| liuliu wrote:
| The large C model has a fair number of parameters tied to
| text conditioning, not to the main denoising process.
| Similar to how we split the network for SDXL Base, I am
| pretty confident we can split a non-trivial number of
| parameters off to text conditioning, so that during the
| denoising process we load fewer than 3.6B parameters.
| vergessenmir wrote:
| I'll take prompt adherence over quality any day. The machinery
| otherwise isn't worth it, i.e. the ControlNets, OpenPose, and
| depth maps needed just to force a particular look or to achieve
| depth. The solution becomes bespoke for each generation.
|
| Had a test of it and my opinion is that it's an improvement
| when it comes to following prompts, and I do find the images
| more visually appealing.
| stavros wrote:
| Can we use its output as input to SDXL? Presumably it would
| just fill in the details, and not create whole new images.
| ttpphd wrote:
| That is a very tiny latent space. Wow!
| yogorenapan wrote:
| Very impressive.
|
| From what I understand, Stability AI is currently VC funded. It's
| bound to burn through tons of money and it's not clear whether
| the business model (if any) is sustainable. Perhaps worthy of
| government funding.
| minimaxir wrote:
| Stability AI has been burning through tons of money for a while
| now, which is the reason newer models like Stable Cascade are no
| longer open source under commercially friendly licenses.
|
| > The company is spending significant amounts of money to grow
| its business. At the time of its deal with Intel, Stability was
| spending roughly $8 million a month on bills and payroll and
| earning a fraction of that in revenue, two of the people
| familiar with the matter said.
|
| > It made $1.2 million in revenue in August and was on track to
| make $3 million this month from software and services,
| according to a post Mostaque wrote on Monday on X, the platform
| formerly known as Twitter. The post has since been deleted.
|
| https://fortune.com/2023/11/29/stability-ai-sale-intel-ceo-r...
| littlestymaar wrote:
| > which is the reason newer models like Stable Cascade are no
| longer open source under commercially friendly licenses.
|
| The main reason is probably Midjourney and OpenAI using
| their tech without any kind of contribution back. AI
| desperately needs a GPL equivalent...
| yogorenapan wrote:
| > AI desperately needs a GPL equivalent
|
| Why not just the GPL then?
| loudmax wrote:
| The GPL was intended for computer code that gets compiled
| to a binary form. You can share the binary, but you also
| have to share the code that the binary is compiled from.
| Pre-trained model weights might be thought of as
| analogous to compiled code, and the training data may be
| analogous to program code, but they're not the same
| thing.
|
| The model weights are shared openly, but the training
| data used to create these models isn't. This is at least
| partly because all these models, including OpenAI's, are
| trained on copyrighted data, so the copyright status of
| the models themselves is somewhat murky.
|
| In the future we may see models that are 100% trained in
| the open, but foundational models are currently very
| expensive to train from scratch. Either prices would need
| to come down, or enthusiasts will need some way to share
| radically distributed GPU resources.
| emadm wrote:
| Tbh I think these models will largely be trained on
| synthetic datasets in the future. They are mostly trained
| on garbage now. We have been doing opt-outs on these; it
| has been interesting to see the quality differential (or
| lack thereof), e.g. removing books3 from StableLM 3B Zephyr:
| https://stability.wandb.io/stability-llm/stable-
| lm/reports/S...
| keenmaster wrote:
| Why aren't the big models trained on synthetic datasets
| now? What's the bottleneck? And how do you avoid
| amplifying the weaknesses of LLMs when you train on LLM
| output vs. novel material from the comparatively very
| intelligent members of the human species? Would be
| interesting to see your take on this.
| sillysaurusx wrote:
| I've wondered whether books3 makes a difference, and how
| much. If you ever train a model with a proper books3
| ablation I'd be curious to know how it does. Books are an
| important data source, but if users find the model useful
| without them then that's a good datapoint.
| protomikron wrote:
| What about CC licenses for model weights? They're common for
| files (images, video, audio, ...), so maybe appropriate.
| ipsum2 wrote:
| It's highly doubtful that Midjourney and OpenAI use Stable
| Diffusion or other Stability models.
| jonplackett wrote:
| How do you know though?
| minimaxir wrote:
| You can't use off-the-shelf models to get the results
| Midjourney and DALL-E generate, even with strong
| finetuning.
| cthalupa wrote:
| I pay for both MJ and DALL-E (though OpenAI mostly gets
| my money for GPT) and don't find them to produce
| significantly better images than popular checkpoints on
| CivitAI. What I do find is that they are significantly
| easier to work with. (Actually, my experience with
| hundreds of DALL-E generations is that it's
| quite poor in quality. I'm in several IRC channels where
| it's the image generator of choice for some IRC bots, and
| I'm never particularly impressed with the visual
| quality.)
|
| For MJ in particular, knowing that they at least used to
| use Stable Diffusion under the hood, it would not
| surprise me if the majority of the secret sauce is
| actually a middle layer that processes the prompt and
| converts it to one that is better for working with SD.
| Prompting SD to get output at the MJ quality level takes
| significantly more tokens, lots of refinement, heavy
| tweaking of negative prompting, etc. Also a stack of
| embeddings and LoRAs, though I would place those more in
| the category of finetuning like you had mentioned.
| emadm wrote:
| If you try diffusionGPT with regional prompting added and
| a GAN corrector you can get a good idea of what is
| possible https://diffusiongpt.github.io
| euazOn wrote:
| That looks very impressive unless the demo is
| cherry-picked. It would be great if this could be
| implemented in a frontend like Fooocus:
| https://github.com/lllyasviel/Fooocus
| millgrove wrote:
| What do you use it for? I haven't found a great use for
| it myself (outside of generating assets for landing pages
| / apps, where it's really really good). But I have seen
| endless subreddits / instagram pages dedicated to various
| forms of AI content, so it seems lots of people are using
| it for fun?
| cthalupa wrote:
| Nothing professional. I run a variety of tabletop RPGs
| for friends, so I mostly use it for making visual aids
| there. I've also got a large-format printer that I was no
| longer using for its original purpose, so I bought a few
| front-loading art frames that I generate art for and
| rotate through periodically.
|
| I've also used it to generate art for deskmats I got
| printed at https://specterlabs.co/
|
| For commercial stuff I still pay human artists.
| soultrees wrote:
| What IRC Channels do you frequent?
| cthalupa wrote:
| Largely some old channels from the 90s/00s that really
| only exist as vestiges of their former selves - not
| really related to their original purpose, just rooms for
| hanging out with friends made there back when they had a
| point besides being a group chat.
| yreg wrote:
| That's not really true; MJ and DALL-E are just more
| beginner-friendly.
| cthalupa wrote:
| Midjourney 100% at least used to use Stable Diffusion:
| https://twitter.com/EMostaque/status/1561917541743841280
|
| I am not sure if that is still the case.
| refulgentis wrote:
| Midjourney trialled it as an explicitly optional model for
| a moment a couple of years ago (or only a year? time moves
| so fast; somewhere in the v2/v3 timeframe, around when SD
| came out). I am sure it is no longer the case.
| liuliu wrote:
| DALL-E shares the same autoencoders as SD v1.x. It is
| probably similar to how Meta's Emu-class models work
| though. They tweaked the architecture quite a bit,
| trained on their own dataset, reused some components (or
| in Emu case, trained all the components from scratch but
| reused the same arch).
| minimaxir wrote:
| More specifically, it's so Stability AI can theoretically
| make a business on selling commercial access to those
| models through a membership:
| https://stability.ai/news/introducing-stability-ai-
| membershi...
| programjames wrote:
| I think it'd be interesting to have a non-profit "model
| sharing" platform, where people can buy/sell compute. When
| you run someone's model, they get royalties on the compute
| you buy.
| thatguysaguy wrote:
| The net flow of knowledge about text-to-image generation
| from OpenAI has definitely been outward. The early open
| source methods used CLIP, which OpenAI came up with. Dall-e
| (1) was also the first demonstration that we could do text
| to image at all. (There were some earlier papers which
| could give you a red splotch if you said stop sign or
| something years earlier).
| loudmax wrote:
| I get the impression that a lot of open source adjacent AI
| companies, including Stability AI, are in the "???" phase of
| execution, hoping the "Profit" phase comes next.
|
| Given how much VC money is chasing the AI space, this isn't
| necessarily a bad plan. Give stuff away for free while
| developing deep expertise, then either figure out something
| to sell, or pivot to proprietary, or get acquihired by a tech
| giant.
| minimaxir wrote:
| That is indeed the case, hence the more recent pushes
| toward building moats by every AI company.
| seydor wrote:
| exactly my thought. stability should be receiving research
| grants
| emadm wrote:
| We should, we haven't yet...
|
| Instead we've given 10m+ supercomputer hours in grants to all
| sorts of projects, now we have our grant team in place &
| there is a huge increase in available funding for folk that
| can actually build stuff we can tap into.
| downrightmike wrote:
| Finally a good use to burn VC money!
| sveme wrote:
| None of the researchers are associated with stability.ai, but
| with universities in Germany and Canada. How does this work? Is
| this exclusive work for stability.ai?
| emadm wrote:
| Dom and Pablo both work for Stability AI (Dom finishing his
| degree).
|
| All the original Stable Diffusion researchers (Robin Rombach,
| Patrick Esser, Dominik Lorenz, Andreas Blattmann) also work
| for Stability AI.
| diggan wrote:
| I've seen Emad (Stability AI founder) commenting here on HN
| somewhere about this before, what exactly their business model
| is/will be, and similar thoughts.
|
| HN search doesn't seem to agree with me today though and I
| cannot find the specific comment/s I have in mind, maybe
| someone else has any luck? This is their user
| https://news.ycombinator.com/user?id=emadm
| emadm wrote:
| https://x.com/EMostaque/status/1649152422634221593?s=20
|
| We now have top models of every type, sites like
| www.stableaudio.com, memberships, custom model deals etc so
| lots of demand
|
| We're the only AI company that can make a model of any type
| for anyone from scratch & are the most liked / one of the
| most downloaded on HuggingFace
| (https://x.com/Jarvis_Data/status/1730394474285572148?s=20,
| https://x.com/EMostaque/status/1727055672057962634?s=20)
|
| It's going OK, the team is working hard and shipping good
| models, and they are accelerating their work on building
| ComfyUI to bring it all together.
|
| My favourite recent model was CheXagent, I think medical
| models should be open & will really save lives:
| https://x.com/Kseniase_/status/1754575702824038717?s=20
| jedberg wrote:
| I'd say I'm most impressed by the compression. Being able to
| compress an image 42x is huge for portable devices or bad
| internet connectivity (or both!).
| incrudible wrote:
| That is 42x _spatial_ compression, but it needs 16 channels
| instead of 3 for RGB.
| ansk wrote:
| Furthermore, each of those 16 channels would typically be
| multibyte floats as opposed to single-byte RGB channels.
| (Speaking generally, I haven't read the paper.)
| zamadatix wrote:
| Even assuming 32-bit floats (the trailing 4 below is bytes per
| float):
|
| 16*24*24*4 = 36,864
|
| vs (removing the alpha channel as it's unused here)
|
| 3*1024*1024 = 3,145,728
|
| Or roughly 1/85 of the raw size, assuming I haven't fucked up
| the math/understanding somewhere (very possible at the moment).
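|
| A quick check of those byte counts (assumes a 16-channel 24x24
| fp32 latent vs. an uncompressed 1024x1024 RGB image at one byte
| per channel):
|
|     latent_bytes = 16 * 24 * 24 * 4    # 36,864
|     image_bytes = 3 * 1024 * 1024      # 3,145,728
|     print(image_bytes / latent_bytes)  # ~85.3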
| flgstnd wrote:
| a 42x compression is also impressive as it matches the answer
| to the ultimate question of life, the universe, and everything,
| maybe there is some deep universal truth within this model.
| seanalltogether wrote:
| I have to imagine at this point someone is working toward a
| fast AI based video codec that comes with a small pretrained
| model and can operate in a limited memory environment like a tv
| to offer 8k resolution with low bandwidth.
| jedberg wrote:
| I would be shocked if Netflix was _not_ working on that.
| yogorenapan wrote:
| I see in the commits that the license was changed from MIT to
| their own custom one: https://github.com/Stability-
| AI/StableCascade/commit/209a526...
|
| Is it legal to use an older snapshot before the license was
| changed in accordance with the previous MIT license?
| OJFord wrote:
| Yes, you can continue to do what you want with that commit^ in
| accordance with the MIT licence it was released under. Kind of
| like if you buy an ebook, and then they publish a second
| edition but only as a hardback - the first edition ebook is
| still yours to read.
| treesciencebot wrote:
| I think the model architecture (training code etc.) itself is
| still under MIT, while the weights (which were the result of
| training on a huge GPU cluster, plus the dataset they used
| [not sure if they publicly talked about it]) are under this
| new license.
| emadm wrote:
| Code is MIT, weights are under the NC license for now.
| ed wrote:
| It seems pretty clear the intent was to use a non-commercial
| license, so it's probably something that would go to court, if
| you really wanted to press the issue.
|
| Generally courts are more holistic and look at intent, and
| understand that clerical errors happen. One exception to this
| is if a business claims it relied on the previous license and
| invested a bunch of resources as a result.
|
| I believe the timing of commits is pretty important-- it would
| be hard to claim your business made a substantial investment on
| a pre-announcement repo that was only MIT'ed for a few hours.
| gorkemyurt wrote:
| we have an optimized playground here:
| https://www.fal.ai/models/stable-cascade
| adventured wrote:
| "sign in to run"
|
| That's a marketing opportunity being missed, especially given
| how crowded the space is now. The HN crowd is more likely to
| run it themselves when presented with signing up just to test
| out a single generation.
| treesciencebot wrote:
| Uh, thanks for noticing it! We generally turn it off for
| popular models so people can see the underlying inference
| speed and the results but we forgot about it for this one, it
| should now be auth-less with a stricter rate limit just like
| other popular models in the gallery.
| MattRix wrote:
| It uses github auth, it's not some complex process. I can see
| why they would need to require accounts so it's harder to
| abuse it.
| holoduke wrote:
| Wow, I like the compression part. A fixed 42x compression; that
| is really nice. Slow to unpack on the fly, but the future is
| waiting.
| GaggiX wrote:
| I remember doing some random experiments with these two
| researchers to find the best way to condition the stage B on the
| latent, my very fancy cross-attn with relative 2D positional
| embeddings didn't work as well as just concatenating the channels
| of the input with the nearest upsample of the latent, so I just
| gave up ahah.
|
| This model used to be known as Wurstchen v3.
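|
| Roughly, the concatenation trick described above looks like
| this in PyTorch (shapes are illustrative, not the real stage B
| dimensions):
|
|     import torch
|     import torch.nn.functional as F
|
|     x = torch.randn(1, 4, 256, 256)      # stage B input (made up)
|     latent = torch.randn(1, 16, 24, 24)  # compressed stage C latent
|
|     # nearest-neighbor upsample of the latent to the input
|     # resolution, then concatenate along the channel dimension
|     latent_up = F.interpolate(latent, size=x.shape[-2:],
|                               mode="nearest")
|     conditioned = torch.cat([x, latent_up], dim=1)  # 20 channels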
| joshelgar wrote:
| Why are they benchmarking it with 20+10 steps vs. 50 steps for
| the other models?
| liuliu wrote:
| Prior (stage C) generation usually takes fewer steps than
| vanilla SDXL to reach the same quality.
|
| But yeah, the inference speed improvement is mediocre (until I
| take a look at exactly what computation is performed to have a
| more informed opinion on whether it is an implementation issue
| or a model issue).
|
| The prompt alignment should be better though. It looks like the
| model has more parameters to work with for text conditioning.
| treesciencebot wrote:
| In my observation, it yields amazing perf at higher batch
| sizes (4 or, better, 8). I assume it is due to memory
| bandwidth and the constrained latent space helping.
| GaggiX wrote:
| I think this model used a consistency loss during training,
| so it can yield better results with fewer steps.
| gajnadsgjoas wrote:
| Where can I run it if I don't have a GPU? Colab didn't work
| detolly wrote:
| runpod, kaggle, lambda labs, or pretty much any other server
| provider that gives you one or more gpus.
| k2enemy wrote:
| I haven't been following the image generation space since the
| initial excitement around stable diffusion. Is there an easy to
| use interface for the new models coming out?
|
| I remember setting up the python env for stable diffusion, but
| then shortly after there were a host of nice GUIs. Are there some
| popular GUIs that can be used to try out newer models? Similarly,
| what's the best GUI for some of the older models? Preferably for
| macos.
| thot_experiment wrote:
| Auto1111 and Comfy both get updated pretty quickly to support
| most of the new models coming out. I expect they'll both
| support this soon.
| stereobit wrote:
| Check out invoke.com
| cybereporter wrote:
| Will this get integrated into Stable Diffusion Web UI?
| hncomb wrote:
| Is there any way this can be used to generate multiple images of
| the same model? e.g. a car model rotated around (but all images
| are of the same generated car)
| matroid wrote:
| Someone with resources will have to train Zero123 [1] with this
| backbone.
|
| [1] https://zero123.cs.columbia.edu/
___________________________________________________________________
(page generated 2024-02-13 23:00 UTC)