[HN Gopher] Stable Cascade
       ___________________________________________________________________
        
       Stable Cascade
        
       Author : davidbarker
       Score  : 676 points
       Date   : 2024-02-13 17:23 UTC (1 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | obviyus wrote:
        | Been using it for a couple of hours and it seems much better
        | at following the prompt. Right away the quality seems worse
        | than some SDXL models, but I'll reserve judgement until after
        | a couple more days of testing.
       | 
       | It's fast too! I would reckon about 2-3x faster than non-turbo
       | SDXL.
        
         | kimoz wrote:
         | Can one run it on CPU?
        
           | ghurtado wrote:
            | You can run any ML model on a CPU. The question is the
            | performance.
        
           | rwmj wrote:
            | Stable Diffusion on a 16-core AMD CPU takes about 2-3
            | hours for me to generate an image, just to give you a
            | rough idea of the performance. (On the same machine's AMD
            | iGPU it takes 2 minutes or so).
        
             | OJFord wrote:
             | Even older GPUs are worth using then I take it?
             | 
             | For example I pulled a (2GB I think, 4 tops) 6870 out of my
             | desktop because it's a beast (in physical size, and power
             | consumption) and I wasn't using it for gaming or anything,
             | figured I'd be fine just with the Intel integrated
             | graphics. But if I wanted to play around with some models
             | locally, it'd be worth putting it back & figuring out how
             | to use it as a secondary card?
        
               | rwmj wrote:
               | One counterintuitive advantage of the integrated GPU is
               | it has access to system RAM (instead of using a dedicated
               | and fixed amount of VRAM). That means I'm able to give
               | the iGPU 16 GB of RAM. For me SD takes 8-9 GB of RAM when
               | running. The system RAM is slower than VRAM which is the
               | trade-off here.
        
               | OJFord wrote:
               | Yeah I did wonder about that as I typed, which is why I
               | mentioned the low amount (by modern standards anyway) on
               | the card. OK, thanks!
        
               | mat0 wrote:
               | No, I don't think so. I think you would need more VRAM to
               | start with.
        
               | purpleflame1257 wrote:
                | 2GB is really low. I've been able to use A1111 Stable
                | Diffusion on my old gaming laptop's 1060 (6GB VRAM) and
               | it takes a little bit less than a minute to generate an
               | image. You would probably need to try the --lowvram flag
               | on startup.
        
             | smoldesu wrote:
             | SDXL Turbo is much better, albeit kinda fuzzy and
             | distorted. I was able to get decent single-sample response
             | times (~80-100s) from my 4 core ARM Ampere instance, good
             | enough for a Discord bot with friends.
        
               | emadm wrote:
                | SD Turbo runs nicely on an M2 MacBook Air (as does
                | Stable LM 2!)
               | 
               | Much faster models will come
        
             | adrian_b wrote:
              | If that is true, then the CPU variant must be a much
              | worse implementation of the algorithm than the GPU
              | variant, because the true ratio of GPU to CPU
              | performance is many times smaller than that.
        
             | antman wrote:
             | Which AMD CPU/iGPU are these timings for?
        
               | rwmj wrote:
               | AMD Ryzen 9 7950X 16-Core Processor
               | 
               | The iGPU is gfx1036 (RDNA 2).
        
             | weebull wrote:
             | WTF!
             | 
              | On my 5900X, so 12 cores, I was able to get SDXL to
              | around 10-15 minutes. I did a few things to get there.
             | 
             | 1. I used an AMD Zen optimised BLAS library. In particular
             | the AMDBLIS one, although it wasn't that different to the
             | Intel MKL one.
             | 
              | 2. I preloaded the jemalloc library to get better-aligned
              | memory allocations.
             | 
             | 3. I manually set the number of threads to 12.
             | 
             | This is the start of my ComfyUI CPU invocation script.
              | export OMP_NUM_THREADS=12
              | export LD_PRELOAD=/opt/aocl/4.1.0/aocc/lib_LP64/libblis-mt.so:$LD_PRELOAD
              | export LD_PRELOAD=/usr/lib/libjemalloc.so:$LD_PRELOAD
              | export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:60000,muzzy_decay_ms:60000"
             | 
             | Honestly, 12 threads wasn't much better than 8, and more
             | than 12 was detrimental. I was memory bandwidth limited I
             | think, not compute.
        
           | sebzim4500 wrote:
           | Not if you want to finish the generation before you have
           | stopped caring about the results.
        
         | sorenjan wrote:
         | How much VRAM does it need? They mention that the largest model
          | uses 1.4 billion parameters more than SDXL, which in turn
          | needs a lot of VRAM.
        
           | adventured wrote:
            | There was a leak from Japan yesterday, prior to this release,
            | in which it was suggested the largest model needs about 20GB.
            | 
            | This text was part of the Stability Japan leak (the 20GB VRAM
            | reference was dropped in today's release):
           | 
           | "Stages C and B will be released in two different models.
           | Stage C uses parameters of 1B and 3.6B, and Stage B uses
           | parameters of 700M and 1.5B. However, if you want to minimize
           | your hardware needs, you can also use the 1B parameter
           | version. In Stage B, both give great results, but 1.5 billion
           | is better at reconstructing finer details. Thanks to Stable
           | Cascade's modular approach, the expected amount of VRAM
           | required for inference can be kept at around 20GB, but can be
            | reduced even further by using smaller variations (though,
            | as mentioned earlier, this may reduce the final output
            | quality)."
        
             | sorenjan wrote:
             | Thanks. I guess this means that fewer people will be able
             | to use it on their own computer, but the improved
             | efficiency makes it cheaper to run on servers with enough
             | VRAM.
             | 
             | Maybe running stage C first, unloading it from VRAM, and
             | then do B and A would make it fit in 12 or even 8 GB, but I
             | wonder if the memory transfers would negate any time
             | saving. Might still be worth it if it produces better
             | images though.
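              | 
              | Roughly what I have in mind, as an untested sketch in
              | plain PyTorch (the stage_c/stage_b/stage_a objects are
              | placeholders, not the real API; diffusers pipelines can
              | also do this automatically with enable_model_cpu_offload):
              | 
              |         import torch
              | 
              |         def generate(prompt, stage_c, stage_b, stage_a, device="cuda"):
              |             # Run the prior (stage C) on the GPU, then push it back to RAM.
              |             stage_c.to(device)
              |             latent = stage_c(prompt)   # tiny 24x24x16 latent
              |             stage_c.to("cpu"); torch.cuda.empty_cache()
              | 
              |             # Bring in the decoder stages one at a time.
              |             stage_b.to(device)
              |             upscaled = stage_b(latent, prompt)
              |             stage_b.to("cpu"); torch.cuda.empty_cache()
              | 
              |             stage_a.to(device)
              |             image = stage_a(upscaled)
              |             stage_a.to("cpu"); torch.cuda.empty_cache()
              |             return image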
        
               | adventured wrote:
               | If it worked I imagine large batching could make it worth
               | the load/unload time cost.
        
               | weebull wrote:
                | There shouldn't be any reason you couldn't do a ton of
                | Stage C work on different images, and then swap in
                | Stage B.
        
               | Filligree wrote:
               | Sequential model offloading isn't too bad. It adds about
               | a second or less to inference, assuming it still fits in
               | main memory.
        
               | sorenjan wrote:
               | Sometimes I forget how fast modern computers are. PCIe v4
               | x16 has a transfer speed of 31.5 GB/s, so theoretically
                | it should take less than 100 ms to transfer stages B
                | and A. Maybe it's not so bad after all; it will be
               | interesting to see what happens.
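                | 
                | Back-of-the-envelope, using the fp16 sizes from the
                | leak quoted above (the stage A size is my guess):
                | 
                |         pcie4_x16 = 31.5e9        # bytes/s, theoretical
                |         stage_b   = 1.5e9 * 2     # 1.5B params at fp16 ~= 3.0 GB
                |         stage_a   = 0.1e9 * 2     # stage A is small; guessing ~100M params
                |         print((stage_b + stage_a) / pcie4_x16)   # ~0.10 s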
        
               | whywhywhywhy wrote:
               | If you're serious about doing image gen locally you
               | should be running a 24GB card anyway because honestly
               | Nvidia's current generation 24GB is the sweet spot price
                | to performance. The 3080's RAM is laughably the same
                | as the 6-year-old 1080 Ti's, and the 4080's is only
                | slightly more at 16 GB while costing about 1.5 times a
                | second-hand 3090.
                | 
                | Any speed benefits of the 4080 are gonna be worthless
                | the second it has to cycle a model in and out of RAM
                | anyway vs the 3090 in image gen.
        
               | weebull wrote:
               | > because honestly Nvidia's current generation 24GB is
               | the sweet spot price to performance
               | 
               | How is the halo product of a range the "sweet spot"?
               | 
               | I think nVidia are extremely exposed on this front. The
               | RX 7900XTX is also 24GB and under half the price (In UK
                | at least - £800 vs £1,700 for the 4090). It's difficult
               | to get a performance comparison on compute tasks, but I
               | think it's around 70-80% of the 4090 given what I can
                | find. Even a 3090, if you can find one, is £1,500.
               | 
               | The software isn't as stable on AMD hardware, but it does
               | work. I'm running a RX7600 - 8GB myself, and happily
               | doing SDXL. The main problem is that exhausting VRAM
               | causes instability. Exceed it by a lot, and everything is
               | handled fine, but if it's marginal... problems ensue.
               | 
               | The AMD engineers are actively making the experience
               | better, and it may not be long before it's a practical
               | alternative. If/When that happens nVidia will need to
               | slash their prices to sell anything in this sphere, which
                | I can't really see them doing.
        
               | zargon wrote:
               | > If/When that happens nVidia will need to slash their
               | prices to sell anything in this sphere
               | 
               | It's just as likely that AMD will raise prices to
               | compensate.
        
               | weebull wrote:
               | You think they're going to say "Hey, compute became
               | competitive but nothing else changed performance
               | therefore... PRICE HIKE!"? They don't have the reputation
               | to burn in this domain for that IMHO.
               | 
               | Granted you could see a supply/demand related increase
               | from retailers if demand spiked, but that's the retailers
               | capitalising.
        
               | whywhywhywhy wrote:
               | >How is the halo product of a range the "sweet spot"?
               | 
                | Because it's actually a bargain second hand (got
                | another for £650 last week via eBay Buy It Now) and
                | cheap for the benefit it offers any professional who
                | needs it.
               | 
                | The 3090 is the iPhone of AI; people should be ecstatic
                | it even exists, not complaining about it.
        
               | weebull wrote:
               | > because honestly Nvidia's _current generation_ 24GB is
               | the sweet spot price to performance
               | 
               | You're aware the 3090 is not the current generation? You
               | can see why I would think you were talking about the
               | 4090?
        
           | liuliu wrote:
            | It should use no more than 6 GiB for the FP16 models at
            | each stage. The current implementation is not RAM-optimized.
        
             | sorenjan wrote:
             | The large C model uses 3.6 billion parameters which is 6.7
             | GiB if each parameter is 16 bits.
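              | 
              | Same arithmetic for all the variants from the leak quoted
              | above, at fp16 (weights only, ignoring activations):
              | 
              |         GiB = 2**30
              |         for name, params in {"C 3.6B": 3.6e9, "C 1B": 1.0e9,
              |                              "B 1.5B": 1.5e9, "B 700M": 0.7e9}.items():
              |             print(name, round(params * 2 / GiB, 2), "GiB")
              |         # C 3.6B 6.71 GiB, C 1B 1.86 GiB, B 1.5B 2.79 GiB, B 700M 1.3 GiB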
        
               | liuliu wrote:
                | The large C model has a fair bit of parameters tied to
                | text-conditioning, not to the main denoising process.
                | Similar to how we split the network for SDXL Base, I am
                | pretty confident we can split a non-trivial amount of
                | parameters off to text-conditioning and hence load fewer
                | than 3.6B parameters during the denoising process.
        
             | brucethemoose2 wrote:
             | What's more, they can presumably be swapped in and out like
             | the SDXL base + refiner, right?
        
         | vergessenmir wrote:
          | I'll take prompt adherence over quality any day. The machinery
          | otherwise isn't worth it, i.e. the controlnets, openpose,
          | depth maps just to force a particular look or to achieve depth.
          | The solution becomes bespoke for each generation.
          | 
          | Had a test of it, and my opinion is that it's an improvement
          | when it comes to following prompts, and I do find the images
          | more visually appealing.
        
           | stavros wrote:
           | Can we use its output as input to SDXL? Presumably it would
           | just fill in the details, and not create whole new images.
        
             | RIMR wrote:
              | I was thinking exactly that. You could use the same trick
              | as the hires-fix for an adherence-fix.
        
               | emadm wrote:
               | Yeah chain it in comfy to a turbo model for detail
        
               | Filligree wrote:
               | A turbo model isn't the first thing I'd think of when it
               | comes to finalizing a picture. Have you found one that
               | produces high-quality output?
        
               | dragonwriter wrote:
               | For detail, it'd probably be better to use a full model
               | with a small number of steps (something like KSampler
               | Advanced node with 40 total steps, but starting at step
               | 32-ish.) Might even try using the SDXL refiner model for
               | that.
               | 
                | Turbo models are decent at getting acceptable results in
                | few iterations, but not so much at adding fine details
                | to a mostly-done image.
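                | 
                | In diffusers terms (not the ComfyUI node, but roughly
                | the same idea; strength ~0.2 of 40 steps corresponds to
                | starting around step 32):
                | 
                |         import torch
                |         from diffusers import StableDiffusionXLImg2ImgPipeline
                | 
                |         refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
                |             "stabilityai/stable-diffusion-xl-refiner-1.0",
                |             torch_dtype=torch.float16).to("cuda")
                | 
                |         # `image` is the Stable Cascade output; only the last ~20%
                |         # of the denoising schedule is re-run to add detail.
                |         detailed = refiner(prompt=prompt, image=image,
                |                            num_inference_steps=40,
                |                            strength=0.2).images[0]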
        
       | ttpphd wrote:
       | That is a very tiny latent space. Wow!
        
       | yogorenapan wrote:
       | Very impressive.
       | 
       | From what I understand, Stability AI is currently VC funded. It's
       | bound to burn through tons of money and it's not clear whether
       | the business model (if any) is sustainable. Perhaps worthy of
       | government funding.
        
         | minimaxir wrote:
          | Stability AI has been burning through tons of money for a
          | while now, which is the reason newer models like Stable
          | Cascade are no longer released under commercially friendly
          | open-source licenses.
         | 
         | > The company is spending significant amounts of money to grow
         | its business. At the time of its deal with Intel, Stability was
         | spending roughly $8 million a month on bills and payroll and
         | earning a fraction of that in revenue, two of the people
         | familiar with the matter said.
         | 
         | > It made $1.2 million in revenue in August and was on track to
         | make $3 million this month from software and services,
         | according to a post Mostaque wrote on Monday on X, the platform
         | formerly known as Twitter. The post has since been deleted.
         | 
         | https://fortune.com/2023/11/29/stability-ai-sale-intel-ceo-r...
        
           | littlestymaar wrote:
           | > which is the reason newer models like Stable Cascade are
           | not commercially-friendly-licensed open source anymore.
           | 
            | The main reason is probably Midjourney and OpenAI using
            | their tech without any kind of contribution back. AI
            | desperately needs a GPL equivalent...
        
             | yogorenapan wrote:
             | > AI desperately needs a GPL equivalent
             | 
             | Why not just the GPL then?
        
               | loudmax wrote:
               | The GPL was intended for computer code that gets compiled
               | to a binary form. You can share the binary, but you also
               | have to share the code that the binary is compiled from.
               | Pre-trained model weights might be thought of as
               | analogous to compiled code, and the training data may be
               | analogous to program code, but they're not the same
               | thing.
               | 
               | The model weights are shared openly, but the training
               | data used to create these models isn't. This is at least
               | partly because all these models, including OpenAI's, are
               | trained on copyrighted data, so the copyright status of
               | the models themselves is somewhat murky.
               | 
               | In the future we may see models that are 100% trained in
               | the open, but foundational models are currently very
               | expensive to train from scratch. Either prices would need
               | to come down, or enthusiasts will need some way to share
               | radically distributed GPU resources.
        
               | emadm wrote:
               | Tbh I think these models will largely be trained on
               | synthetic datasets in the future. They are mostly trained
                | on garbage now. We have been doing opt-outs on these;
                | it has been interesting to see the quality differential
                | (or lack thereof), e.g. removing books3 from stableLM
                | 3b zephyr
               | https://stability.wandb.io/stability-llm/stable-
               | lm/reports/S...
        
               | keenmaster wrote:
               | Why aren't the big models trained on synthetic datasets
               | now? What's the bottleneck? And how do you avoid
               | amplifying the weaknesses of LLMs when you train on LLM
               | output vs. novel material from the comparatively very
               | intelligent members of the human species. Would be
               | interesting to see your take on this.
        
               | emadm wrote:
               | We are starting to see that, see phi2 for example
               | 
               | There are approaches to get the right type of augmented
               | and generated data to feed these models right, check out
               | our QDAIF paper we worked on for example
               | 
               | https://arxiv.org/pdf/2310.13032.pdf
        
               | sillysaurusx wrote:
               | I've wondered whether books3 makes a difference, and how
               | much. If you ever train a model with a proper books3
               | ablation I'd be curious to know how it does. Books are an
               | important data source, but if users find the model useful
               | without them then that's a good datapoint.
        
               | emadm wrote:
                | We did try stableLM 3b4 with books3 and it got worse
                | both in general use and on benchmarks
               | 
               | Just did some pes2o ablations too which were eh
        
               | sillysaurusx wrote:
               | What I mean is, it's important to train a model with
               | _and_ without books3. That's the only way to know whether
               | it was books3 itself causing the issue, or some artifact
               | of the training process.
               | 
               | One thing that's hard to measure is the knowledge
               | contained in books3. If someone asks about certain books,
               | it won't be able to give an answer unless the knowledge
               | is there in some form. I've often wondered whether
               | scraping the internet is enough rather than training on
               | books directly.
               | 
               | But be careful about relying too much on evals.
               | Ultimately the only benchmark that matters is whether
               | users find the model useful. The clearest test of this
               | would be to train two models side by side, with and
               | without books3, and then ask some people which they
               | prefer.
               | 
               | It's really tricky to get all of this right. But if
               | there's more details on the pes2o ablations I'd be
               | curious to see.
        
               | protomikron wrote:
              | What about CC licenses for model weights? They're common
              | for media files (images, video, audio, ...), so maybe
              | they're appropriate here.
        
             | ipsum2 wrote:
             | It's highly doubtful that Midjourney and OpenAI use Stable
             | Diffusion or other Stability models.
        
               | jonplackett wrote:
               | How do you know though?
        
               | minimaxir wrote:
               | You can't use off-the-shelf models to get the results
               | Midjourney and DALL-E generate, even with strong
               | finetuning.
        
               | cthalupa wrote:
               | I pay for both MJ and DALL-E (though OpenAI mostly gets
               | my money for GPT) and don't find them to produce
               | significantly better images than popular checkpoints on
               | CivitAI. What I do find is that they are significantly
               | easier to work with. (Actually, my experience with
               | hundreds of DALL-E generations is that it's actually
               | quite poor in quality. I'm in several IRC channels where
               | it's the image generator of choice for some IRC bots, and
               | I'm never particularly impressed with the visual
               | quality.)
               | 
               | For MJ in particular, knowing that they at least used to
               | use Stable Diffusion under the hood, it would not
               | surprise me if the majority of the secret sauce is
               | actually a middle layer that processes the prompt and
               | converts it to one that is better for working with SD.
               | Prompting SD to get output at the MJ quality level takes
               | significantly more tokens, lots of refinement, heavy
               | tweaking of negative prompting, etc. Also a stack of
               | embeddings and LoRAs, though I would place those more in
               | the category of finetuning like you had mentioned.
        
               | emadm wrote:
               | If you try diffusionGPT with regional prompting added and
               | a GAN corrector you can get a good idea of what is
               | possible https://diffusiongpt.github.io
        
               | euazOn wrote:
                | That looks very impressive, unless the demo is
                | cherry-picked. It would be great if this could be
                | integrated into a frontend like Fooocus
               | https://github.com/lllyasviel/Fooocus
        
               | millgrove wrote:
               | What do you use it for? I haven't found a great use for
               | it myself (outside of generating assets for landing pages
               | / apps, where it's really really good). But I have seen
               | endless subreddits / instagram pages dedicated to various
               | forms of AI content, so it seems lots of people are using
               | it for fun?
        
               | cthalupa wrote:
               | Nothing professional. I run a variety of tabletop RPGs
               | for friends, so I mostly use it for making visual aids
                | there. I've also got a large format printer that I was
                | no longer using for its original purpose, so I bought a few
               | front-loading art frames that I generate art for and
               | rotate through periodically.
               | 
               | I've also used it to generate art for deskmats I got
               | printed at https://specterlabs.co/
               | 
               | For commercial stuff I still pay human artists.
        
               | throwanem wrote:
               | Whose frames do you use? Do you like them? I print my
               | photos to frame and hang, and wouldn't at all mind being
               | able to rotate them more conveniently and inexpensively
               | than dedicating a frame to each allows.
        
               | cthalupa wrote:
               | https://www.spotlightdisplays.com/
               | 
               | I like them quite a bit, and you can get basically any
               | size cut to fit your needs even if they don't directly
               | offer it on the site.
        
               | throwanem wrote:
               | Perfectly suited to go alongside the style of frame I
               | already have lots of, and very reasonably priced off the
               | shelf for the 13x19 my printer tops out at. Thanks so
               | much! It'll be easier to fill that one blank wall now.
        
               | soultrees wrote:
               | What IRC Channels do you frequent?
        
               | cthalupa wrote:
               | Largely some old channels from the 90s/00s that really
               | only exist as vestiges of their former selves - not
               | really related to their original purpose, just rooms for
               | hanging out with friends made there back when they had a
               | point besides being a group chat.
        
               | yreg wrote:
               | That's not really true, MJ and DALL-E are just more
               | beginner friendly.
        
               | orbital-decay wrote:
               | Midjourney has absolutely nothing to offer compared to
                | proper finetunes. DALL-E does: it generalizes well (can
               | make objects interact properly for example) and has great
               | prompt adherence. But it can also be unpredictable as
               | hell because it rewrites the prompts. DALL-E's quality is
               | meh - it has terrible artifacts on all pixel-sized
               | details, hallucinations on small details, and limited
               | resolution. Controlnets, finetuning/zero-shot reference
               | transfer, and open tooling would have made a beast of a
               | model of it, but they aren't available.
        
               | cthalupa wrote:
               | Midjourney 100% at least used to use Stable Diffusion:
               | https://twitter.com/EMostaque/status/1561917541743841280
               | 
               | I am not sure if that is still the case.
        
               | refulgentis wrote:
               | It trialled it as an explicitly optional model for a
               | moment a couple years ago. (or only a year? time moves so
               | fast. somewhere in v2/v3 timeframe and around when SD
               | came out). I am sure it is no longer the case.
        
               | liuliu wrote:
               | DALL-E shares the same autoencoders as SD v1.x. It is
               | probably similar to how Meta's Emu-class models work
               | though. They tweaked the architecture quite a bit,
               | trained on their own dataset, reused some components (or
               | in Emu case, trained all the components from scratch but
               | reused the same arch).
        
             | minimaxir wrote:
             | More specifically, it's so Stability AI can theoretically
             | make a business on selling commercial access to those
             | models through a membership:
             | https://stability.ai/news/introducing-stability-ai-
             | membershi...
        
             | programjames wrote:
             | I think it'd be interesting to have a non-profit "model
             | sharing" platform, where people can buy/sell compute. When
             | you run someone's model, they get royalties on the compute
             | you buy.
        
             | thatguysaguy wrote:
             | The net flow of knowledge about text-to-image generation
             | from OpenAI has definitely been outward. The early open
             | source methods used CLIP, which OpenAI came up with. Dall-e
             | (1) was also the first demonstration that we could do text
             | to image at all. (There were some earlier papers which
             | could give you a red splotch if you said stop sign or
             | something years earlier).
        
           | loudmax wrote:
           | I get the impression that a lot of open source adjacent AI
           | companies, including Stability AI, are in the "???" phase of
           | execution, hoping the "Profit" phase comes next.
           | 
           | Given how much VC money is chasing the AI space, this isn't
           | necessarily a bad plan. Give stuff away for free while
           | developing deep expertise, then either figure out something
            | to sell, or pivot to proprietary, or get acquihired by a tech
           | giant.
        
             | minimaxir wrote:
             | That is indeed the case, hence the more recent pushes
             | toward building moats by every AI company.
        
         | seydor wrote:
         | exactly my thought. stability should be receiving research
         | grants
        
           | emadm wrote:
           | We should, we haven't yet...
           | 
           | Instead we've given 10m+ supercomputer hours in grants to all
           | sorts of projects, now we have our grant team in place &
           | there is a huge increase in available funding for folk that
           | can actually build stuff we can tap into.
        
         | downrightmike wrote:
         | Finally a good use to burn VC money!
        
         | sveme wrote:
         | None of the researchers are associated with stability.ai, but
         | with universities in Germany and Canada. How does this work? Is
         | this exclusive work for stability.ai?
        
           | emadm wrote:
           | Dom and Pablo both work for Stability AI (Dom finishing his
           | degree).
           | 
           | All the original Stable Diffusion researchers (Robin Rombach,
            | Patrick Esser, Dominik Lorenz, Andreas Blattmann) also work
           | for Stability AI.
        
         | diggan wrote:
         | I've seen Emad (Stability AI founder) commenting here on HN
         | somewhere about this before, what exactly their business model
         | is/will be, and similar thoughts.
         | 
         | HN search doesn't seem to agree with me today though and I
         | cannot find the specific comment/s I have in mind, maybe
         | someone else has any luck? This is their user
         | https://news.ycombinator.com/user?id=emadm
        
           | emadm wrote:
           | https://x.com/EMostaque/status/1649152422634221593?s=20
           | 
           | We now have top models of every type, sites like
           | www.stableaudio.com, memberships, custom model deals etc so
           | lots of demand
           | 
           | We're the only AI company that can make a model of any type
           | for anyone from scratch & are the most liked / one of the
           | most downloaded on HuggingFace
           | (https://x.com/Jarvis_Data/status/1730394474285572148?s=20,
           | https://x.com/EMostaque/status/1727055672057962634?s=20)
           | 
            | It's going OK; the team is working hard and shipping good
            | models, and accelerating their work on building ComfyUI to
            | bring it all together.
           | 
           | My favourite recent model was CheXagent, I think medical
           | models should be open & will really save lives:
           | https://x.com/Kseniase_/status/1754575702824038717?s=20
        
       | jedberg wrote:
       | I'd say I'm most impressed by the compression. Being able to
       | compress an image 42x is huge for portable devices or bad
       | internet connectivity (or both!).
        
         | incrudible wrote:
         | That is 42x _spatial_ compression, but it needs 16 channels
         | instead of 3 for RGB.
        
           | ansk wrote:
           | Furthermore, each of those 16 channels would typically be
            | multibyte floats as opposed to single-byte RGB channels.
           | (speaking generally, haven't read the paper)
        
           | zamadatix wrote:
           | Even assuming 32 bit floats (the extra 4 on the end):
           | 
           | 4*16*24*24*4 = 147,456
           | 
           | vs (removing the alpha channel as it's unused here)
           | 
           | 3*3*1024*1024 = 9,437,184
           | 
           | Or 1/64 raw size, assuming I haven't fucked up the
           | math/understanding somewhere (very possible at the moment).
        
             | incrudible wrote:
             | It is actually just 2/4 bytes x 16 latent channels x 24 x
             | 24, but the comparison to raw data needs to be taken with a
             | grain of salt, as there is quite a bit of hallucination
             | involved in reconstruction.
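              | 
              | In numbers, keeping fp16 for the latent:
              | 
              |         latent = 16 * 24 * 24 * 2    # 18,432 bytes
              |         raw    = 3 * 1024 * 1024     # 3,145,728 bytes of 8-bit RGB
              |         print(raw / latent)          # ~171x, before any entropy coding
              | 
              | and, as said, a lot of that ratio is the model
              | hallucinating plausible detail back in rather than
              | storing it.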
        
         | flgstnd wrote:
         | a 42x compression is also impressive as it matches the answer
         | to the ultimate question of life, the universe, and everything,
         | maybe there is some deep universal truth within this model.
        
         | seanalltogether wrote:
         | I have to imagine at this point someone is working toward a
         | fast AI based video codec that comes with a small pretrained
          | model and can operate in a limited-memory environment like a
          | TV, to offer 8K resolution with low bandwidth.
        
           | jedberg wrote:
           | I would be shocked if Netflix was _not_ working on that.
        
           | Lord-Jobo wrote:
            | I am 65% sure this is already extremely similar to LG's
            | upscaling approach in their most recent flagship.
        
       | yogorenapan wrote:
       | I see in the commits that the license was changed from MIT to
       | their own custom one: https://github.com/Stability-
       | AI/StableCascade/commit/209a526...
       | 
       | Is it legal to use an older snapshot before the license was
       | changed in accordance with the previous MIT license?
        
         | OJFord wrote:
         | Yes, you can continue to do what you want with that commit^ in
         | accordance with the MIT licence it was released under. Kind of
         | like if you buy an ebook, and then they publish a second
         | edition but only as a hardback - the first edition ebook is
         | still yours to read.
        
         | treesciencebot wrote:
         | I think the model architecture (training code etc.) itself is
         | still under MIT while the weights (which was the result of
         | training in a huge GPU cluster as well as the dataset they have
         | used [not sure if they publicly talked about it] is under this
         | new license.
        
           | emadm wrote:
           | Code is MIT, weights are under the NC license for now.
        
         | ed wrote:
         | It seems pretty clear the intent was to use a non-commercial
         | license, so it's probably something that would go to court, if
         | you really wanted to press the issue.
         | 
         | Generally courts are more holistic and look at intent, and
         | understand that clerical errors happen. One exception to this
         | is if a business claims it relied on the previous license and
         | invested a bunch of resources as a result.
         | 
         | I believe the timing of commits is pretty important-- it would
         | be hard to claim your business made a substantial investment on
         | a pre-announcement repo that was only MIT'ed for a few hours.
        
           | RIMR wrote:
           | If I clone/fork that repo before the license change, and
           | start putting any amount of time into developing my own fork
           | in good faith, they shouldn't be allowed to claim a clerical
           | error when they lied to me upon delivery about what I was
           | allowed to do with the code.
           | 
           | Licenses are important. If you are going to expose your code
           | to the world, make sure it has the right license. If you
           | publish your code with the wrong license, you shouldn't be
           | allowed to take it back. Not for an organization of this size
           | that is going to see a new repo cloned thousands of times
           | upon release.
        
             | ed wrote:
             | There's no case law here, so if you're volunteering to find
             | out what a judge thinks we'd surely appreciate it!
        
             | wokwokwok wrote:
             | No, sadly this won't fly in court.
             | 
             | For the same reason you cannot publish a private corporate
             | repo with an MIT license and then have other people claim
             | in "good faith" to be using it.
             | 
             | All they need is to assert that the license was published
             | in error, or that the person publishing it did not have the
             | authority to publish it.
             | 
             | You can't "magically" make a license stick by putting it in
             | a repo, any more than putting a "name here" sticker on
             | someone's car and then claiming to own it.
             | 
             | The license file in the repo is simply the _notice_ of the
             | license.
             | 
             | It does not indicate a binding legal agreement.
             | 
             | You of course, can challenge it in court, and ianal, but I
              | assure you, there is precedent for incorrectly labelled
              | repos removing and changing their licenses.
        
               | arcbyte wrote:
               | It could very well fly. Agency law, promissory estoppel,
               | ...
        
         | RIMR wrote:
         | MIT license is not parasitic like GPL. You can close an MIT
         | licensed codebase, but you cannot retroactively change the
         | license of the old code.
         | 
         | Stability's initial commit had an MIT license, so you can fork
         | that commit and do whatever you want with it. It's MIT
         | licensed.
         | 
         | Now, the tricky part here is that they committed a change to
         | the license that changes it from MIT to proprietary, but they
         | didn't change any code with it. That is definitely invalid,
         | because they cannot license the exact same codebase with two
         | different contradictory licenses. They can only license the
         | changes made to the codebase after the license change. I
         | wouldn't call it "illegal", but it wouldn't stand up in court
         | if they tried to claim that the software is proprietary,
         | because they already distributed it verbatim with an open
         | license.
        
           | kruuuder wrote:
           | > they didn't change any code with it. That is definitely
           | invalid, because they cannot license the exact same codebase
           | with two different contradictory licenses.
           | 
           | Why couldn't they? Of course they can. If you are the
           | copyright owner, you can publish/sell your stuff under as
           | many licenses as you like.
        
         | weebull wrote:
         | The code is MIT. The model has a non-commercial license. They
         | are separate pieces of work under different licenses. Stability
         | AI have said that the non-commercial license is because this is
         | a technology preview (like SDXL 0.9 was).
        
       | gorkemyurt wrote:
       | we have an optimized playground here:
       | https://www.fal.ai/models/stable-cascade
        
         | adventured wrote:
         | "sign in to run"
         | 
         | That's a marketing opportunity being missed, especially given
         | how crowded the space is now. The HN crowd is more likely to
         | run it themselves when presented with signing up just to test
         | out a single generation.
        
           | treesciencebot wrote:
           | Uh, thanks for noticing it! We generally turn it off for
           | popular models so people can see the underlying inference
           | speed and the results but we forgot about it for this one, it
           | should now be auth-less with a stricter rate limit just like
           | other popular models in the gallery.
        
             | RIMR wrote:
             | I just got rate-limited on my first generation. The message
             | is "You have exceeded the request limit per minute". This
             | was after showing me cli output suggesting that my image
             | was being generated.
             | 
             | I guess my zero attempts per minute was too much. You
             | really shouldn't post your product on HN if you aren't
             | prepared for it to work. Reputations are hard to earn, and
             | you're losing people's interest by directing them to a
             | broken product.
        
               | getcrunk wrote:
               | Are you using a vpn or at a large campus or office?
        
             | archerx wrote:
             | I wanted to use your service for a project but you can only
             | sign in through github, I emailed your support about this
             | and never got an answer, in the end I ended up installing
             | SD Turbo locally. I think that a github only auth is losing
             | you potential customers like myself.
        
           | MattRix wrote:
           | It uses github auth, it's not some complex process. I can see
           | why they would need to require accounts so it's harder to
           | abuse it.
        
             | arcanemachiner wrote:
             | After all the bellyaching from the HN crowd when PyPI
             | started requiring 2FA, nothing surprises me anymore.
        
       | holoduke wrote:
        | Wow, I like the compression part. A fixed 42x compression, that
        | is really nice. Slow to unpack on the fly, but the future is
        | waiting.
        
       | GaggiX wrote:
       | I remember doing some random experiments with these two
       | researchers to find the best way to condition the stage B on the
       | latent, my very fancy cross-attn with relative 2D positional
       | embeddings didn't work as well as just concatenating the channels
       | of the input with the nearest upsample of the latent, so I just
       | gave up ahah.
       | 
       | This model used to be known as Wurstchen v3.
        
       | joshelgar wrote:
       | Why are they benchmarking it with 20+10 steps vs. 50 steps for
       | the other models?
        
         | liuliu wrote:
         | prior generations usually take fewer steps than vanilla SDXL to
         | reach the same quality.
         | 
          | But yeah, the inference speed improvement is mediocre (until I
          | take a look at exactly what computation is performed to have a
          | more informed opinion on whether it is an implementation issue
          | or a model issue).
          | 
          | The prompt alignment should be better though. It looks like the
          | model has more parameters devoted to text conditioning.
        
           | treesciencebot wrote:
            | in my observation, it yields amazing perf at higher batch
            | sizes (4 or, better, 8). i assume it is due to memory
            | bandwidth and the constrained latent space helping.
        
             | Filligree wrote:
             | However, the outputs are so similar that I barely feel a
             | need for more than 1. 2 is plenty.
        
         | GaggiX wrote:
         | I think that this model used consistency loss during training
          | so that it can yield better results with fewer steps.
        
         | weebull wrote:
         | ...because they feel that at 20+10 it achieves a superior
         | output than at 50 steps for SDXL. They also benchmark it
         | against 1 step for SDXL-Turbo.
        
       | gajnadsgjoas wrote:
       | Where can I run it if I don't have a GPU? Colab didn't work
        
         | detolly wrote:
         | runpod, kaggle, lambda labs, or pretty much any other server
         | provider that gives you one or more gpus.
        
       | k2enemy wrote:
       | I haven't been following the image generation space since the
       | initial excitement around stable diffusion. Is there an easy to
       | use interface for the new models coming out?
       | 
       | I remember setting up the python env for stable diffusion, but
       | then shortly after there were a host of nice GUIs. Are there some
       | popular GUIs that can be used to try out newer models? Similarly,
       | what's the best GUI for some of the older models? Preferably for
       | macos.
        
         | thot_experiment wrote:
         | Auto1111 and Comfy both get updated pretty quickly to support
         | most of the new models coming out. I expect they'll both
         | support this soon.
        
           | stereobit wrote:
           | Check out invoke.com
        
             | sophrocyne wrote:
             | Thanks for calling us out - I'm one of the maintainers.
             | 
             | Not entirely sure we'll be in the Stable Cascade race quite
             | yet. Since Auto/Comfy aren't really built for businesses,
             | they'll get it incorporated sooner vs later.
             | 
              | Invoke's main focus is building open-source tools for the
              | pros who are using this for work and getting disrupted,
              | and non-commercial licenses don't really help those who
              | are trying to follow the letter of the license.
             | 
             | Theoretically, since we're just a deployment solution, it
             | might come up with our larger customers who want us to run
             | something they license from Stability, but we've had zero
             | interest on any of the closed-license stuff so far.
        
         | yokto wrote:
         | fal.ai is nice and fast:
         | https://news.ycombinator.com/item?id=39360800 Both in
         | performance and for how quickly they integrate new models
         | apparently: they already support Stable Cascade.
        
         | brucethemoose2 wrote:
         | Fooocus is the fastest way to try SDXL/SDXL turbo with good
         | quality.
         | 
         | ComfyUI is cool but very DIY. You don't get good results unless
         | you wrap your head around all the augmentations and defaults.
         | 
         | No idea if it will support cascade.
        
           | SpliffnCola wrote:
           | ComfyUI is similar to Houdini in complexity, but immensely
           | powerful. It's a joy to use.
           | 
            | There are also a large number of resources available for it
           | on YouTube, GitHub
           | (https://github.com/comfyanonymous/ComfyUI_examples), reddit
           | (https://old.reddit.com/r/comfyui), CivitAI, Comfy Workflows
           | (https://comfyworkflows.com/), and OpenArt Flow
           | (https://openart.ai/workflows/).
           | 
           | I still use AUTO1111
           | (https://github.com/AUTOMATIC1111/stable-diffusion-webui) and
           | the recently released and heavily modified fork of AUTO1111
           | called Forge (https://github.com/lllyasviel/stable-diffusion-
           | webui-forge).
        
             | emadm wrote:
              | Our team at Stability AI builds ComfyUI, so yeah, it is
              | supported
        
       | cybereporter wrote:
       | Will this get integrated into Stable Diffusion Web UI?
        
         | ttul wrote:
         | Surely within days. ComfyUI's maintainer said he is readying
         | the node for release perhaps by this weekend. The Stable
          | Cascade model is otherwise known as Wurstchen v3 and has been
         | floating around the open source generative image space since
         | fall.
        
           | dragonwriter wrote:
           | Third-party (using diffusers) node for ComfyUI is already
           | available for those who can't wait for native integration.
           | 
           | https://github.com/kijai/ComfyUI-DiffusersStableCascade
        
       | hncomb wrote:
       | Is there any way this can be used to generate multiple images of
       | the same model? e.g. a car model rotated around (but all images
       | are of the same generated car)
        
         | matroid wrote:
         | Someone with resources will have to train Zero123 [1] with this
         | backbone.
         | 
         | [1] https://zero123.cs.columbia.edu/
        
           | emadm wrote:
           | Heh https://stability.ai/news/stable-zero123-3d-generation
           | 
           | Better coming
        
         | refulgentis wrote:
         | Yes, input image => embedding => N images, and if you're
         | thinking 3D perspectives for rendering, you'd ControlNet the N.
         | 
         | ref.: "The model can also understand image embeddings, which
         | makes it possible to generate variations of a given image
         | (left). There was no prompt given here."
        
           | taejavu wrote:
           | The model looks different in each of those variations though.
           | Which seems to be intentional, but the post you're responding
           | to is asking whether it's possible to keep the model exactly
           | the same in each render, varying only by perspective.
        
       | ionwake wrote:
       | Does anyone have a link to a demo online?
        
         | martin82 wrote:
         | https://huggingface.co/spaces/multimodalart/stable-cascade
        
           | ionwake wrote:
            | Thank you, is there a demo of the "image to image" ability?
            | It doesn't seem to be in any of the demos I see.
        
       | pxoe wrote:
        | the way the Image Reconstruction section is written, as if it
        | were just an image compression thing, is kind of interesting.
        | its presented use there is very much about storing images and
        | reconstructing them, while "it doesn't actually store original
        | images" and "it can't actually give out original images" are
        | points that get used so often in arguments as a defense for
        | image generators. so it is just a multi-image compression file
        | format, just a very efficient one. sure, it's
        | "redrawing"/"rendering" its output and makes things look kinda
        | fuzzy, but any other lossy image format does that as well. what
        | was all that 'well it doesn't do those things' nonsense about
        | then? clearly it can do that.
        
         | wongarsu wrote:
          | In a way it's just an algorithm that can compress either text
         | or an image. The neat trick is that if you compress the text
         | "brown bear hitting Vladimir Putin" and then decompress it as
         | an image, you get an image of a bear hitting Vladimir Putin.
         | 
         | This principle is the idea behind all Stable Diffusion models,
         | this one "just" achieved a much better compression ratio
        
           | pxoe wrote:
            | well yeah. but it's not so much about what it actually does
            | as how they talk about it. maybe (probably) i missed them
            | putting out something described like that before, but this
            | is an open admission and demonstration of it. i guess
            | they're getting more brazen, given that they're not really
            | getting punished for what they're doing, be it piracy or
            | infringement or whatever.
        
             | Filligree wrote:
             | The model works on compressed data. That's all it is. Sure,
             | it could output a picture from its training set on
             | decompression, but only if you feed that same picture into
             | the compressor.
             | 
             | In which case what are you doing, exactly? Normally you
             | feed it a text prompt instead, which won't compress to the
             | same thing.
        
         | gmerc wrote:
         | Ultimately this is abstraction not compression.
        
         | GaggiX wrote:
         | >well it doesn't do those things' nonsense about then? clearly
         | it can do that.
         | 
          | There is a model that is trained to compress (very lossily)
          | and decompress the latent, but it's not the main generative
          | model, and of course the model doesn't store images in it.
          | You just give the encoder an image and it will encode it,
          | then you can decode it with the decoder and get back a very
          | similar image. This encoder and decoder are used during
          | training so that stage C can work on a compressed latent
          | instead of directly at the pixel level, which would be
          | expensive. But the main generative model (stage C) should be
          | able to generate any of the images that were present in the
          | dataset, or it fails to do its job. Stages C, B, and A do
          | not store any images.
          | 
          | The B and A stages work like an advanced image decoder, so
          | unless you have a problem with image decoders in general, I
          | don't see how this could be a problem (a JPEG decoder doesn't
          | store images either, of course).
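          | 
          | Schematically (just a sketch of the data flow, not real API
          | names):
          | 
          |         # generation: no encoder involved, stage C denoises from pure noise
          |         latent = stage_c.sample(text_embedding)          # 24x24x16
          |         image  = stage_a.decode(stage_b.sample(latent, text_embedding))
          | 
          |         # the "image reconstruction" demo: encode an existing image,
          |         # then decode it again through B and A
          |         latent    = encoder(some_image)                  # the lossy compressor
          |         roundtrip = stage_a.decode(stage_b.sample(latent))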
        
       | mise_en_place wrote:
       | Was anyone able to get this running on Colab? I got as far as
       | loading extras in text-to-inference, but it was complaining about
       | a dependency.
        
       | SECourses wrote:
        | It is pretty good. I shared a comparison on Medium:
       | 
       | https://medium.com/@furkangozukara/stable-cascade-prompt-fol...
       | 
        | My Gradio app even works amazingly on an 8 GB GPU with CPU
        | offloading.
        
       | lqcfcjx wrote:
        | I'm very impressed by the recent AI progress on making models
        | smaller and more efficient. I just have the feeling that every
        | week there's something big in this space (like what we saw
        | previously from ollama, llava, mixtral...). Apparently the
        | space for on-device models is not fully explored yet. Very
        | excited to see future products in that direction.
        
         | dragonwriter wrote:
         | > I'm very impressed by the recent AI progress on making models
         | smaller and more efficient.
         | 
         | That's an odd comment to place in a thread about an image
         | generation model that is bigger than SDXL. Yes, it works in a
          | smaller latent space, yes it's faster in the hardware
          | configuration they've used, but it's not _smaller_.
        
       | skybrian wrote:
       | Like every other image generator I've tried, it can't do a piano
       | keyboard [1]. I expect that some different approach is needed to
       | be able to count the black keys groups.
       | 
       | [1] https://fal.ai/models/stable-
       | cascade?share=13d35b76-d32f-45c...
        
         | Agraillo wrote:
          | I think it's more than this. In my case, in most of the
          | images I made about basketball there was more than one ball.
          | I'm not an expert, but some fundamental constraints of human
          | (cultural) life (like all piano keyboards being the same, or
          | there being only one ball in a game) are not grasped by the
          | training, or only partially grasped.
        
         | GaggiX wrote:
         | As with human hands, coherency is fixed by scaling the model
         | and the training.
        
       | sanroot99 wrote:
        | What are the system requirements needed to run this, in
        | particular how much VRAM would it take?
        
       | instagraham wrote:
       | Will this work on AMD? Found no mention of support. Kinda an
       | important feature for such a project, as AMD users running Stable
       | Diffusion will be suffering diminished performance.
        
         | drclegg wrote:
         | Apparently yes
         | https://news.ycombinator.com/item?id=39360106#39360497
        
       | xkgt wrote:
       | This model is built upon the Wurstchen architecture. Here is a
       | very good explanation of how this model works by one of its
       | authors.
       | 
       | https://www.youtube.com/watch?v=ogJsCPqgFMk
        
         | lordswork wrote:
         | Great video! And here's a summary of the video :)
         | Gemini Advanced> Summarize this video:
         | https://www.youtube.com/watch?v=ogJsCPqgFMk
         | 
         | This video is about a new method for training text-to-image
         | diffusion models called Wurstchen. The method is significantly
         | more efficient than previous methods, such as Stable Diffusion
         | 1.4, and can achieve similar results with 16 times less
         | training time and compute.
         | 
         | The key to Wurstchen's efficiency is its use of a two-stage
         | compression process. The first stage uses a VQ-VAE to compress
         | images into a latent space that is 4 times smaller than the
         | latent space used by Stable Diffusion. The second stage uses a
         | diffusion model to further compress the latent space by another
         | factor of 10. This results in a total compression ratio of 40,
         | which is significantly higher than the compression ratio of 8
         | used by Stable Diffusion.
         | 
         | The compressed latent space allows the text-to-image diffusion
         | model in Wurstchen to be much smaller and faster to train than
         | the model in Stable Diffusion. This makes it possible to train
         | Wurstchen on a single GPU in just 24,000 GPU hours, while
         | Stable Diffusion 1.4 requires 150,000 GPU hours.
         | 
         | Despite its efficiency, Wurstchen is able to generate images
         | that are of comparable quality to those generated by Stable
         | Diffusion. In some cases, Wurstchen can even generate images
         | that are of higher quality, such as images with higher
         | resolutions or images that contain more detail.
         | 
         | Overall, Wurstchen is a significant advance in the field of
         | text-to-image generation. It makes it possible to train text-
         | to-image models that are more efficient and affordable than
         | ever before. This could lead to a wider range of applications
         | for text-to-image generation, such as creating images for
         | marketing materials, generating illustrations for books, or
         | even creating personalized avatars.
        
       | nialv7 wrote:
       | Can Stable Cascade be used for image compression? 1024x1024 to
       | 24x24 is crazy.
        
         | anonuser1234 wrote:
         | That's definitely not lossless compression
        
       ___________________________________________________________________
       (page generated 2024-02-14 23:01 UTC)