[HN Gopher] Stable Cascade
___________________________________________________________________
Stable Cascade
Author : davidbarker
Score : 676 points
Date : 2024-02-13 17:23 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| obviyus wrote:
| Been using it for a couple of hours and it seems much better at
| following the prompt. Right away the quality seems worse compared
| to some SDXL models, but I'll reserve judgement until I've had a
| couple more days of testing.
|
| It's fast too! I would reckon about 2-3x faster than non-turbo
| SDXL.
| kimoz wrote:
| Can one run it on CPU?
| ghurtado wrote:
| You can run any ML model on a CPU. The question is the
| performance.
| rwmj wrote:
| Stable Diffusion on a 16-core AMD CPU takes about 2-3 hours
| for me to generate an image, just to give you a rough idea of
| the performance. (On the same AMD chip's iGPU it takes 2
| minutes or so.)
| OJFord wrote:
| Even older GPUs are worth using then I take it?
|
| For example I pulled a (2GB I think, 4 tops) 6870 out of my
| desktop because it's a beast (in physical size, and power
| consumption) and I wasn't using it for gaming or anything,
| figured I'd be fine just with the Intel integrated
| graphics. But if I wanted to play around with some models
| locally, it'd be worth putting it back & figuring out how
| to use it as a secondary card?
| rwmj wrote:
| One counterintuitive advantage of the integrated GPU is
| that it has access to system RAM (instead of using a
| dedicated and fixed amount of VRAM). That means I'm able to
| give the iGPU 16 GB of RAM. For me SD takes 8-9 GB of RAM
| when running. The system RAM is slower than VRAM, which is
| the trade-off here.
| OJFord wrote:
| Yeah I did wonder about that as I typed, which is why I
| mentioned the low amount (by modern standards anyway) on
| the card. OK, thanks!
| mat0 wrote:
| No, I don't think so. I think you would need more VRAM to
| start with.
| purpleflame1257 wrote:
| 2GB is really low. I've been able to use A1111 Stable
| Diffusion on my old gaming laptop's 1060 (6GB VRAM) and
| it takes a little bit less than a minute to generate an
| image. You would probably need to try the --lowvram flag
| on startup.
| smoldesu wrote:
| SDXL Turbo is much better, albeit kinda fuzzy and
| distorted. I was able to get decent single-sample response
| times (~80-100s) from my 4 core ARM Ampere instance, good
| enough for a Discord bot with friends.
| emadm wrote:
| SD Turbo runs nicely on an M2 MacBook Air (as does Stable
| LM 2!)
|
| Much faster models will come
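|
| For anyone who wants to try that locally, a minimal sketch using
| diffusers and the MPS backend (the model id and one-step settings
| follow the sd-turbo model card; treat the exact flags as
| assumptions, not a tested recipe):
|
| # Rough sketch: SD Turbo on Apple Silicon via diffusers' MPS backend.
| import torch
| from diffusers import AutoPipelineForText2Image
|
| pipe = AutoPipelineForText2Image.from_pretrained(
|     "stabilityai/sd-turbo", torch_dtype=torch.float16
| )
| pipe = pipe.to("mps")  # Apple Silicon GPU; use "cuda" or "cpu" elsewhere
|
| # Turbo models are distilled for very few steps and no CFG.
| image = pipe(
|     prompt="a photo of a corgi wearing sunglasses",
|     num_inference_steps=1,
|     guidance_scale=0.0,
| ).images[0]
| image.save("corgi.png")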
| adrian_b wrote:
| If that is true, then the CPU variant must be a much worse
| implementation of the algorithm than the GPU variant,
| because the true ratio between GPU and CPU performance is
| many times smaller than that.
| antman wrote:
| Which AMD CPU/iGPU are these timings for?
| rwmj wrote:
| AMD Ryzen 9 7950X 16-Core Processor
|
| The iGPU is gfx1036 (RDNA 2).
| weebull wrote:
| WTF!
|
| On my 5900X, so 12 cores, I was able to get SDXL to around
| 10-15 minutes. I did do a few things to get to that.
|
| 1. I used an AMD Zen optimised BLAS library. In particular
| the AMDBLIS one, although it wasn't that different to the
| Intel MKL one.
|
| 2. I preload the jemalloc library to get better aligned
| memory allocations.
|
| 3. I manually set the number of threads to 12.
|
| This is the start of my ComfyUI CPU invocation script.
| export OMP_NUM_THREADS=12
| export LD_PRELOAD=/opt/aocl/4.1.0/aocc/lib_LP64/libblis-mt.so:$LD_PRELOAD
| export LD_PRELOAD=/usr/lib/libjemalloc.so:$LD_PRELOAD
| export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:60000,muzzy_decay_ms:60000"
|
| Honestly, 12 threads wasn't much better than 8, and more
| than 12 was detrimental. I was memory bandwidth limited I
| think, not compute.
| sebzim4500 wrote:
| Not if you want to finish the generation before you have
| stopped caring about the results.
| sorenjan wrote:
| How much VRAM does it need? They mention that the largest model
| uses 1.4 billion parameters more than SDXL, which in turn needs
| a lot of VRAM.
| adventured wrote:
| There was a leak from Japan yesterday, prior to this release,
| in which 20GB was suggested for the largest model.
|
| This text was part of the Stability Japan leak (the 20GB VRAM
| reference was dropped in today's release):
|
| "Stages C and B will be released in two different models.
| Stage C uses parameters of 1B and 3.6B, and Stage B uses
| parameters of 700M and 1.5B. However, if you want to minimize
| your hardware needs, you can also use the 1B parameter
| version. In Stage B, both give great results, but 1.5 billion
| is better at reconstructing finer details. Thanks to Stable
| Cascade's modular approach, the expected amount of VRAM
| required for inference can be kept at around 20GB, but can be
| reduced even further by using smaller variations (though, as
| mentioned earlier, this may reduce the final output
| quality)."
| sorenjan wrote:
| Thanks. I guess this means that fewer people will be able
| to use it on their own computer, but the improved
| efficiency makes it cheaper to run on servers with enough
| VRAM.
|
| Maybe running stage C first, unloading it from VRAM, and
| then doing B and A would make it fit in 12 or even 8 GB, but I
| wonder if the memory transfers would negate any time
| saving. Might still be worth it if it produces better
| images though.
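|
| A rough sketch of that idea with a diffusers-style two-pipeline
| split (the pipeline classes, checkpoint ids and step counts are
| assumptions based on the Hugging Face integration, not something
| tested on a low-VRAM card):
|
| # Sketch: run stage C (the prior), free VRAM, then run stages B+A
| # (the decoder), so both never sit in VRAM at the same time.
| import torch
| from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline
|
| prompt = "an astronaut riding a horse, photorealistic"
|
| prior = StableCascadePriorPipeline.from_pretrained(
|     "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
| ).to("cuda")
| prior_out = prior(prompt=prompt, num_inference_steps=20, guidance_scale=4.0)
|
| # Unload stage C before loading the decoder.
| prior.to("cpu")
| del prior
| torch.cuda.empty_cache()
|
| decoder = StableCascadeDecoderPipeline.from_pretrained(
|     "stabilityai/stable-cascade", torch_dtype=torch.float16
| ).to("cuda")
| image = decoder(
|     image_embeddings=prior_out.image_embeddings.to(torch.float16),
|     prompt=prompt,
|     num_inference_steps=10,
|     guidance_scale=0.0,
| ).images[0]
| image.save("cascade.png")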
| adventured wrote:
| If it worked I imagine large batching could make it worth
| the load/unload time cost.
| weebull wrote:
| There shouldn't be any reason you couldn't do a ton of stage
| C work on different images and then swap in stage B.
| Filligree wrote:
| Sequential model offloading isn't too bad. It adds about
| a second or less to inference, assuming it still fits in
| main memory.
| sorenjan wrote:
| Sometimes I forget how fast modern computers are. PCIe v4
| x16 has a transfer speed of 31.5 GB/s, so theoretically
| it should take less than 100 ms to transfer stage B and
| A. Maybe it's not so bad after all, it will be
| interesting to see what happens.
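|
| A quick back-of-the-envelope check (the 1.5B stage B parameter
| count is from the announcement; the stage A size is a rough
| assumption, since it's a small VQGAN):
|
| # Rough transfer-time estimate for moving stages B and A over PCIe 4.0 x16.
| pcie4_x16 = 31.5e9            # bytes per second, theoretical
| stage_b   = 1.5e9 * 2         # 1.5B params at fp16 = 3.0 GB
| stage_a   = 20e6 * 2          # assumed ~20M params = 0.04 GB
|
| total = stage_b + stage_a
| print(f"{total / 1e9:.2f} GB -> {total / pcie4_x16 * 1e3:.0f} ms")
| # ~3.04 GB -> ~97 ms, so "less than 100 ms" checks out in theory.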
| whywhywhywhy wrote:
| If you're serious about doing image gen locally you
| should be running a 24GB card anyway because honestly
| Nvidia's current generation 24GB is the sweet spot price
| to performance. 3080 RAM is laughably the same as the
| 6-year-old 1080 Ti's, and 4080 RAM is only slightly more at
| 16GB while costing about 1.5 times a second-hand 3090.
|
| Any speed benefits of the 4080 are gonna be worthless the
| second it has to cycle a model in and out of RAM anyway vs
| the 3090 in image gen.
| weebull wrote:
| > because honestly Nvidia's current generation 24GB is
| the sweet spot price to performance
|
| How is the halo product of a range the "sweet spot"?
|
| I think nVidia are extremely exposed on this front. The
| RX 7900XTX is also 24GB and under half the price (In UK
| at least - £800 vs £1,700 for the 4090). It's difficult
| to get a performance comparison on compute tasks, but I
| think it's around 70-80% of the 4090 given what I can
| find. Even a 3090, if you can find one, is £1,500.
|
| The software isn't as stable on AMD hardware, but it does
| work. I'm running a RX7600 - 8GB myself, and happily
| doing SDXL. The main problem is that exhausting VRAM
| causes instability. Exceed it by a lot, and everything is
| handled fine, but if it's marginal... problems ensue.
|
| The AMD engineers are actively making the experience
| better, and it may not be long before it's a practical
| alternative. If/when that happens nVidia will need to
| slash their prices to sell anything in this sphere, which
| I can't really see them doing.
| zargon wrote:
| > If/When that happens nVidia will need to slash their
| prices to sell anything in this sphere
|
| It's just as likely that AMD will raise prices to
| compensate.
| weebull wrote:
| You think they're going to say "Hey, compute became
| competitive but nothing else changed performance-wise,
| therefore... PRICE HIKE!"? They don't have the reputation
| to burn in this domain for that, IMHO.
|
| Granted you could see a supply/demand related increase
| from retailers if demand spiked, but that's the retailers
| capitalising.
| whywhywhywhy wrote:
| >How is the halo product of a range the "sweet spot"?
|
| Because it's actually a bargain second hand (got another
| for £650 last week, buy-it-now on eBay) and cheap for the
| benefit it offers any professional who needs it.
|
| The 3090 is the iPhone of AI; people should be ecstatic it
| even exists, not complaining about it.
| weebull wrote:
| > because honestly Nvidia's _current generation_ 24GB is
| the sweet spot price to performance
|
| You're aware the 3090 is not the current generation? You
| can see why I would think you were talking about the
| 4090?
| liuliu wrote:
| Should use no more than 6GiB for FP16 models at each stage.
| The current implementation is not RAM optimized.
| sorenjan wrote:
| The large C model uses 3.6 billion parameters which is 6.7
| GiB if each parameter is 16 bits.
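|
| (Quick arithmetic check of that figure:)
|
| # fp16 weight size of the 3.6B-parameter stage C model
| params = 3.6e9
| bytes_fp16 = params * 2        # two bytes per parameter = 7.2e9 bytes
| print(bytes_fp16 / 2**30)      # ~6.7 GiB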
| liuliu wrote:
| The large C model has a fair bit of parameters tied to
| text-conditioning, not to the main denoising process.
| Similar to how we split the network for SDXL Base, I am
| pretty confident we can split a non-trivial amount of
| parameters off to text-conditioning and hence, during the
| denoising process, load less than 3.6B parameters.
| brucethemoose2 wrote:
| What's more, they can presumably be swapped in and out like
| the SDXL base + refiner, right?
| vergessenmir wrote:
| I'll take prompt adherence over quality any day. The machinery
| otherwise isn't worth it, i.e. the controlnets, openpose,
| depthmaps just to force a particular look or to achieve depth.
| The solution becomes bespoke for each generation.
|
| Had a test of it and my opinion is it's an improvement when it
| comes to following prompts, and I do find the images more
| visually appealing.
| stavros wrote:
| Can we use its output as input to SDXL? Presumably it would
| just fill in the details, and not create whole new images.
| RIMR wrote:
| I was thinking that exactly. You could use the same trick
| as the hires-fix for an adherence-fix.
| emadm wrote:
| Yeah chain it in comfy to a turbo model for detail
| Filligree wrote:
| A turbo model isn't the first thing I'd think of when it
| comes to finalizing a picture. Have you found one that
| produces high-quality output?
| dragonwriter wrote:
| For detail, it'd probably be better to use a full model
| with a small number of steps (something like KSampler
| Advanced node with 40 total steps, but starting at step
| 32-ish.) Might even try using the SDXL refiner model for
| that.
|
| Turbo models are decent at low-iteration-decent-results,
| but not so much at adding fine details to a mostly-done
| image.
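|
| As a rough diffusers equivalent of that (the checkpoint id and
| strength value are assumptions; strength=0.2 over 40 steps runs
| only the last ~8 steps, roughly like starting KSampler Advanced
| at step 32):
|
| # Sketch: a "detail pass" with the SDXL refiner as a low-strength
| # img2img step over an already-generated image.
| import torch
| from diffusers import StableDiffusionXLImg2ImgPipeline
| from diffusers.utils import load_image
|
| refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
|     "stabilityai/stable-diffusion-xl-refiner-1.0",
|     torch_dtype=torch.float16,
| ).to("cuda")
|
| base_image = load_image("cascade_output.png")  # e.g. a Stable Cascade result
|
| refined = refiner(
|     prompt="an astronaut riding a horse, photorealistic, detailed",
|     image=base_image,
|     num_inference_steps=40,
|     strength=0.2,          # only the final ~20% of the schedule runs
| ).images[0]
| refined.save("refined.png")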
| ttpphd wrote:
| That is a very tiny latent space. Wow!
| yogorenapan wrote:
| Very impressive.
|
| From what I understand, Stability AI is currently VC funded. It's
| bound to burn through tons of money and it's not clear whether
| the business model (if any) is sustainable. Perhaps worthy of
| government funding.
| minimaxir wrote:
| Stability AI has been burning through tons of money for a while
| now, which is the reason newer models like Stable Cascade are
| not commercially-friendly-licensed open source anymore.
|
| > The company is spending significant amounts of money to grow
| its business. At the time of its deal with Intel, Stability was
| spending roughly $8 million a month on bills and payroll and
| earning a fraction of that in revenue, two of the people
| familiar with the matter said.
|
| > It made $1.2 million in revenue in August and was on track to
| make $3 million this month from software and services,
| according to a post Mostaque wrote on Monday on X, the platform
| formerly known as Twitter. The post has since been deleted.
|
| https://fortune.com/2023/11/29/stability-ai-sale-intel-ceo-r...
| littlestymaar wrote:
| > which is the reason newer models like Stable Cascade are
| not commercially-friendly-licensed open source anymore.
|
| The main reason is probably Midjourney and OpenAI using
| their tech without any kind of contribution back. AI
| desperately needs a GPL equivalent...
| yogorenapan wrote:
| > AI desperately needs a GPL equivalent
|
| Why not just the GPL then?
| loudmax wrote:
| The GPL was intended for computer code that gets compiled
| to a binary form. You can share the binary, but you also
| have to share the code that the binary is compiled from.
| Pre-trained model weights might be thought of as
| analogous to compiled code, and the training data may be
| analogous to program code, but they're not the same
| thing.
|
| The model weights are shared openly, but the training
| data used to create these models isn't. This is at least
| partly because all these models, including OpenAI's, are
| trained on copyrighted data, so the copyright status of
| the models themselves is somewhat murky.
|
| In the future we may see models that are 100% trained in
| the open, but foundational models are currently very
| expensive to train from scratch. Either prices would need
| to come down, or enthusiasts will need some way to share
| radically distributed GPU resources.
| emadm wrote:
| Tbh I think these models will largely be trained on
| synthetic datasets in the future. They are mostly trained
| on garbage now. We have been doing opt-outs on these; it has
| been interesting to see the quality differential (or lack
| thereof), e.g. removing books3 from StableLM 3B Zephyr
| https://stability.wandb.io/stability-llm/stable-
| lm/reports/S...
| keenmaster wrote:
| Why aren't the big models trained on synthetic datasets
| now? What's the bottleneck? And how do you avoid
| amplifying the weaknesses of LLMs when you train on LLM
| output vs. novel material from the comparatively very
| intelligent members of the human species? Would be
| interesting to see your take on this.
| emadm wrote:
| We are starting to see that; see Phi-2, for example
|
| There are approaches to get the right type of augmented
| and generated data to feed these models right, check out
| our QDAIF paper we worked on for example
|
| https://arxiv.org/pdf/2310.13032.pdf
| sillysaurusx wrote:
| I've wondered whether books3 makes a difference, and how
| much. If you ever train a model with a proper books3
| ablation I'd be curious to know how it does. Books are an
| important data source, but if users find the model useful
| without them then that's a good datapoint.
| emadm wrote:
| We did try StableLM 3b4 with books3 and it got worse both in
| general and on benchmarks
|
| Just did some pes2o ablations too which were eh
| sillysaurusx wrote:
| What I mean is, it's important to train a model with
| _and_ without books3. That's the only way to know whether
| it was books3 itself causing the issue, or some artifact
| of the training process.
|
| One thing that's hard to measure is the knowledge
| contained in books3. If someone asks about certain books,
| it won't be able to give an answer unless the knowledge
| is there in some form. I've often wondered whether
| scraping the internet is enough rather than training on
| books directly.
|
| But be careful about relying too much on evals.
| Ultimately the only benchmark that matters is whether
| users find the model useful. The clearest test of this
| would be to train two models side by side, with and
| without books3, and then ask some people which they
| prefer.
|
| It's really tricky to get all of this right. But if there
| are more details on the pes2o ablations, I'd be curious to
| see them.
| protomikron wrote:
| What about CC licenses for model weights? They're common for
| files (images, video, audio, ...), so maybe they'd be
| appropriate here.
| ipsum2 wrote:
| It's highly doubtful that Midjourney and OpenAI use Stable
| Diffusion or other Stability models.
| jonplackett wrote:
| How do you know though?
| minimaxir wrote:
| You can't use off-the-shelf models to get the results
| Midjourney and DALL-E generate, even with strong
| finetuning.
| cthalupa wrote:
| I pay for both MJ and DALL-E (though OpenAI mostly gets
| my money for GPT) and don't find them to produce
| significantly better images than popular checkpoints on
| CivitAI. What I do find is that they are significantly
| easier to work with. (Actually, my experience with
| hundreds of DALL-E generations is that it's quite poor in
| quality. I'm in several IRC channels where
| it's the image generator of choice for some IRC bots, and
| I'm never particularly impressed with the visual
| quality.)
|
| For MJ in particular, knowing that they at least used to
| use Stable Diffusion under the hood, it would not
| surprise me if the majority of the secret sauce is
| actually a middle layer that processes the prompt and
| converts it to one that is better for working with SD.
| Prompting SD to get output at the MJ quality level takes
| significantly more tokens, lots of refinement, heavy
| tweaking of negative prompting, etc. Also a stack of
| embeddings and LoRAs, though I would place those more in
| the category of finetuning like you had mentioned.
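|
| To make that concrete, the kind of prompt/negative-prompt/LoRA
| stacking being described looks roughly like this in diffusers
| (the checkpoint and LoRA names below are made-up placeholders,
| not recommendations):
|
| # Sketch of prompt/negative-prompt/LoRA stacking to push SD(XL)
| # toward MJ-style output. Model and file names are hypothetical.
| import torch
| from diffusers import StableDiffusionXLPipeline
|
| pipe = StableDiffusionXLPipeline.from_pretrained(
|     "some-user/sdxl-finetune",                   # hypothetical checkpoint
|     torch_dtype=torch.float16,
| ).to("cuda")
| pipe.load_lora_weights("./loras", weight_name="detail-tweaker.safetensors")
|
| image = pipe(
|     prompt=(
|         "portrait of a woman in a rain-soaked neon street, cinematic "
|         "lighting, 85mm, depth of field, intricate detail, film grain"
|     ),
|     negative_prompt=(
|         "lowres, bad anatomy, extra fingers, watermark, jpeg artifacts, "
|         "oversaturated, blurry"
|     ),
|     num_inference_steps=30,
|     guidance_scale=6.0,
| ).images[0]
| image.save("out.png")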
| emadm wrote:
| If you try diffusionGPT with regional prompting added and
| a GAN corrector you can get a good idea of what is
| possible https://diffusiongpt.github.io
| euazOn wrote:
| That looks very impressive, unless the demo is
| cherry-picked. It would be great if this could be
| implemented in a frontend like Fooocus
| https://github.com/lllyasviel/Fooocus
| millgrove wrote:
| What do you use it for? I haven't found a great use for
| it myself (outside of generating assets for landing pages
| / apps, where it's really really good). But I have seen
| endless subreddits / instagram pages dedicated to various
| forms of AI content, so it seems lots of people are using
| it for fun?
| cthalupa wrote:
| Nothing professional. I run a variety of tabletop RPGs
| for friends, so I mostly use it for making visual aids
| there. I've also got a large format printer that I was no
| longer using for its original purpose, so I bought a few
| front-loading art frames that I generate art for and
| rotate through periodically.
|
| I've also used it to generate art for deskmats I got
| printed at https://specterlabs.co/
|
| For commercial stuff I still pay human artists.
| throwanem wrote:
| Whose frames do you use? Do you like them? I print my
| photos to frame and hang, and wouldn't at all mind being
| able to rotate them more conveniently and inexpensively
| than dedicating a frame to each allows.
| cthalupa wrote:
| https://www.spotlightdisplays.com/
|
| I like them quite a bit, and you can get basically any
| size cut to fit your needs even if they don't directly
| offer it on the site.
| throwanem wrote:
| Perfectly suited to go alongside the style of frame I
| already have lots of, and very reasonably priced off the
| shelf for the 13x19 my printer tops out at. Thanks so
| much! It'll be easier to fill that one blank wall now.
| soultrees wrote:
| What IRC Channels do you frequent?
| cthalupa wrote:
| Largely some old channels from the 90s/00s that really
| only exist as vestiges of their former selves - not
| really related to their original purpose, just rooms for
| hanging out with friends made there back when they had a
| point besides being a group chat.
| yreg wrote:
| That's not really true, MJ and DALL-E are just more
| beginner friendly.
| orbital-decay wrote:
| Midjourney has absolutely nothing to offer compared to
| proper finetunes. DALL-E has: it generalizes well (can
| make objects interact properly for example) and has great
| prompt adherence. But it can also be unpredictable as
| hell because it rewrites the prompts. DALL-E's quality is
| meh - it has terrible artifacts on all pixel-sized
| details, hallucinations on small details, and limited
| resolution. Controlnets, finetuning/zero-shot reference
| transfer, and open tooling would have made a beast of a
| model of it, but they aren't available.
| cthalupa wrote:
| Midjourney 100% at least used to use Stable Diffusion:
| https://twitter.com/EMostaque/status/1561917541743841280
|
| I am not sure if that is still the case.
| refulgentis wrote:
| It trialled it as an explicitly optional model for a
| moment a couple years ago. (or only a year? time moves so
| fast. somewhere in v2/v3 timeframe and around when SD
| came out). I am sure it is no longer the case.
| liuliu wrote:
| DALL-E shares the same autoencoders as SD v1.x. It is
| probably similar to how Meta's Emu-class models work
| though. They tweaked the architecture quite a bit,
| trained on their own dataset, reused some components (or
| in Emu case, trained all the components from scratch but
| reused the same arch).
| minimaxir wrote:
| More specifically, it's so Stability AI can theoretically
| make a business on selling commercial access to those
| models through a membership:
| https://stability.ai/news/introducing-stability-ai-
| membershi...
| programjames wrote:
| I think it'd be interesting to have a non-profit "model
| sharing" platform, where people can buy/sell compute. When
| you run someone's model, they get royalties on the compute
| you buy.
| thatguysaguy wrote:
| The net flow of knowledge about text-to-image generation
| from OpenAI has definitely been outward. The early open
| source methods used CLIP, which OpenAI came up with. Dall-e
| (1) was also the first demonstration that we could do text
| to image at all. (There were some earlier papers which
| could give you a red splotch if you said stop sign or
| something years earlier).
| loudmax wrote:
| I get the impression that a lot of open source adjacent AI
| companies, including Stability AI, are in the "???" phase of
| execution, hoping the "Profit" phase comes next.
|
| Given how much VC money is chasing the AI space, this isn't
| necessarily a bad plan. Give stuff away for free while
| developing deep expertise, then either figure out something
| to sell, or pivot to proprietary, or get acquihired by a tech
| giant.
| minimaxir wrote:
| That is indeed the case, hence the more recent pushes
| toward building moats by every AI company.
| seydor wrote:
| exactly my thought. stability should be receiving research
| grants
| emadm wrote:
| We should, we haven't yet...
|
| Instead we've given 10m+ supercomputer hours in grants to all
| sorts of projects. Now we have our grant team in place, and
| there is a huge increase in available funding for folks that
| can actually build stuff, which we can tap into.
| downrightmike wrote:
| Finally a good way to burn VC money!
| sveme wrote:
| None of the researchers are listed as associated with
| stability.ai, only with universities in Germany and Canada.
| How does this work? Is this exclusive work for stability.ai?
| emadm wrote:
| Dom and Pablo both work for Stability AI (Dom finishing his
| degree).
|
| All the original Stable Diffusion researchers (Robin Rombach,
| Patrick Esser, Dominik Lorenz, Andreas Blattmann) also work
| for Stability AI.
| diggan wrote:
| I've seen Emad (Stability AI founder) commenting here on HN
| somewhere about this before, what exactly their business model
| is/will be, and similar thoughts.
|
| HN search doesn't seem to agree with me today though, and I
| cannot find the specific comment/s I have in mind; maybe
| someone else will have better luck? This is their user
| https://news.ycombinator.com/user?id=emadm
| emadm wrote:
| https://x.com/EMostaque/status/1649152422634221593?s=20
|
| We now have top models of every type, sites like
| www.stableaudio.com, memberships, custom model deals etc so
| lots of demand
|
| We're the only AI company that can make a model of any type
| for anyone from scratch & are the most liked / one of the
| most downloaded on HuggingFace
| (https://x.com/Jarvis_Data/status/1730394474285572148?s=20,
| https://x.com/EMostaque/status/1727055672057962634?s=20)
|
| It's going OK; the team is working hard and shipping good
| models, and they are accelerating their work on building
| ComfyUI to bring it all together.
|
| My favourite recent model was CheXagent; I think medical
| models should be open & will really save lives:
| https://x.com/Kseniase_/status/1754575702824038717?s=20
| jedberg wrote:
| I'd say I'm most impressed by the compression. Being able to
| compress an image 42x is huge for portable devices or bad
| internet connectivity (or both!).
| incrudible wrote:
| That is 42x _spatial_ compression, but it needs 16 channels
| instead of 3 for RGB.
| ansk wrote:
| Furthermore, each of those 16 channels would typically be
| multibyte floats as opposed to single-byte RGB channels.
| (speaking generally, haven't read the paper)
| zamadatix wrote:
| Even assuming 32 bit floats (the extra 4 on the end):
|
| 4*16*24*24*4 = 147,456
|
| vs (removing the alpha channel as it's unused here)
|
| 3*3*1024*1024 = 9,437,184
|
| Or 1/64 raw size, assuming I haven't fucked up the
| math/understanding somewhere (very possible at the moment).
| incrudible wrote:
| It is actually just 2/4 bytes x 16 latent channels x 24 x
| 24, but the comparison to raw data needs to be taken with a
| grain of salt, as there is quite a bit of hallucination
| involved in reconstruction.
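|
| Plugging in those corrected figures (a quick sketch, assuming
| 8-bit RGB for the raw image):
|
| # Latent size vs. raw RGB size for a 1024x1024 image,
| # using 16 latent channels at 24x24.
| raw_rgb  = 3 * 1024 * 1024       # 3,145,728 bytes (1 byte per channel)
| latent16 = 2 * 16 * 24 * 24      #    18,432 bytes at fp16
| latent32 = 4 * 16 * 24 * 24      #    36,864 bytes at fp32
|
| print(raw_rgb / latent16)        # ~171x smaller in bytes (fp16)
| print(raw_rgb / latent32)        # ~85x smaller in bytes (fp32)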
| flgstnd wrote:
| a 42x compression is also impressive as it matches the answer
| to the ultimate question of life, the universe, and everything,
| maybe there is some deep universal truth within this model.
| seanalltogether wrote:
| I have to imagine at this point someone is working toward a
| fast AI-based video codec that comes with a small pretrained
| model and can operate in a limited-memory environment like a
| TV to offer 8K resolution with low bandwidth.
| jedberg wrote:
| I would be shocked if Netflix was _not_ working on that.
| Lord-Jobo wrote:
| I am 65% sure this is already extremely similar to LG's
| upscaling approach in their most recent flagship.
| yogorenapan wrote:
| I see in the commits that the license was changed from MIT to
| their own custom one: https://github.com/Stability-
| AI/StableCascade/commit/209a526...
|
| Is it legal to use an older snapshot before the license was
| changed in accordance with the previous MIT license?
| OJFord wrote:
| Yes, you can continue to do what you want with that commit^ in
| accordance with the MIT licence it was released under. Kind of
| like if you buy an ebook, and then they publish a second
| edition but only as a hardback - the first edition ebook is
| still yours to read.
| treesciencebot wrote:
| I think the model architecture (training code etc.) itself is
| still under MIT, while the weights (which were the result of
| training on a huge GPU cluster, as well as the dataset they
| have used [not sure if they publicly talked about it]) are
| under this new license.
| emadm wrote:
| Code is MIT, weights are under the NC license for now.
| ed wrote:
| It seems pretty clear the intent was to use a non-commercial
| license, so it's probably something that would go to court, if
| you really wanted to press the issue.
|
| Generally courts are more holistic and look at intent, and
| understand that clerical errors happen. One exception to this
| is if a business claims it relied on the previous license and
| invested a bunch of resources as a result.
|
| I believe the timing of commits is pretty important-- it would
| be hard to claim your business made a substantial investment on
| a pre-announcement repo that was only MIT'ed for a few hours.
| RIMR wrote:
| If I clone/fork that repo before the license change, and
| start putting any amount of time into developing my own fork
| in good faith, they shouldn't be allowed to claim a clerical
| error when they lied to me upon delivery about what I was
| allowed to do with the code.
|
| Licenses are important. If you are going to expose your code
| to the world, make sure it has the right license. If you
| publish your code with the wrong license, you shouldn't be
| allowed to take it back. Not for an organization of this size
| that is going to see a new repo cloned thousands of times
| upon release.
| ed wrote:
| There's no case law here, so if you're volunteering to find
| out what a judge thinks we'd surely appreciate it!
| wokwokwok wrote:
| No, sadly this won't fly in court.
|
| For the same reason you cannot publish a private corporate
| repo with an MIT license and then have other people claim
| in "good faith" to be using it.
|
| All they need is to assert that the license was published
| in error, or that the person publishing it did not have the
| authority to publish it.
|
| You can't "magically" make a license stick by putting it in
| a repo, any more than putting a "name here" sticker on
| someone's car and then claiming to own it.
|
| The license file in the repo is simply the _notice_ of the
| license.
|
| It does not indicate a binding legal agreement.
|
| You can, of course, challenge it in court, and IANAL, but I
| assure you there is precedent for incorrectly labelled
| repos removing and changing their licenses.
| arcbyte wrote:
| It could very well fly. Agency law, promissory estoppel,
| ...
| RIMR wrote:
| MIT license is not parasitic like GPL. You can close an MIT
| licensed codebase, but you cannot retroactively change the
| license of the old code.
|
| Stability's initial commit had an MIT license, so you can fork
| that commit and do whatever you want with it. It's MIT
| licensed.
|
| Now, the tricky part here is that they committed a change to
| the license that changes it from MIT to proprietary, but they
| didn't change any code with it. That is definitely invalid,
| because they cannot license the exact same codebase with two
| different contradictory licenses. They can only license the
| changes made to the codebase after the license change. I
| wouldn't call it "illegal", but it wouldn't stand up in court
| if they tried to claim that the software is proprietary,
| because they already distributed it verbatim with an open
| license.
| kruuuder wrote:
| > they didn't change any code with it. That is definitely
| invalid, because they cannot license the exact same codebase
| with two different contradictory licenses.
|
| Why couldn't they? Of course they can. If you are the
| copyright owner, you can publish/sell your stuff under as
| many licenses as you like.
| weebull wrote:
| The code is MIT. The model has a non-commercial license. They
| are separate pieces of work under different licenses. Stability
| AI have said that the non-commercial license is because this is
| a technology preview (like SDXL 0.9 was).
| gorkemyurt wrote:
| we have an optimized playground here:
| https://www.fal.ai/models/stable-cascade
| adventured wrote:
| "sign in to run"
|
| That's a marketing opportunity being missed, especially given
| how crowded the space is now. The HN crowd is more likely to
| just run it themselves than to sign up to test out a single
| generation.
| treesciencebot wrote:
| Uh, thanks for noticing it! We generally turn it off for
| popular models so people can see the underlying inference
| speed and the results, but we forgot about it for this one. It
| should now be auth-less with a stricter rate limit just like
| other popular models in the gallery.
| RIMR wrote:
| I just got rate-limited on my first generation. The message
| is "You have exceeded the request limit per minute". This
| was after showing me cli output suggesting that my image
| was being generated.
|
| I guess my zero attempts per minute was too much. You
| really shouldn't post your product on HN if you aren't
| prepared for it to work. Reputations are hard to earn, and
| you're losing people's interest by directing them to a
| broken product.
| getcrunk wrote:
| Are you using a vpn or at a large campus or office?
| archerx wrote:
| I wanted to use your service for a project, but you can only
| sign in through GitHub. I emailed your support about this
| and never got an answer; in the end I ended up installing
| SD Turbo locally. I think that GitHub-only auth is losing
| you potential customers like myself.
| MattRix wrote:
| It uses github auth, it's not some complex process. I can see
| why they would need to require accounts so it's harder to
| abuse it.
| arcanemachiner wrote:
| After all the bellyaching from the HN crowd when PyPI
| started requiring 2FA, nothing surprises me anymore.
| holoduke wrote:
| Wow, I like the compression part. A fixed 42x compression; that
| is really nice. Slow to unpack on the fly, but the future is
| waiting.
| GaggiX wrote:
| I remember doing some random experiments with these two
| researchers to find the best way to condition the stage B on the
| latent, my very fancy cross-attn with relative 2D positional
| embeddings didn't work as well as just concatenating the channels
| of the input with the nearest upsample of the latent, so I just
| gave up ahah.
|
| This model used to be known as Wurstchen v3.
| joshelgar wrote:
| Why are they benchmarking it with 20+10 steps vs. 50 steps for
| the other models?
| liuliu wrote:
| prior generations usually take fewer steps than vanilla SDXL to
| reach the same quality.
|
| But yeah, the inference speed improvement is mediocre (until I
| take a look at exactly what computation is performed, to have a
| more informed opinion on whether it is an implementation issue
| or a model issue).
|
| The prompt alignment should be better though. It looks like the
| model has more parameters to work with text conditioning.
| treesciencebot wrote:
| In my observation, it yields amazing perf at higher batch
| sizes (4, or better, 8). I assume it is due to memory bandwidth
| and the constrained latent space helping.
| Filligree wrote:
| However, the outputs are so similar that I barely feel a
| need for more than 1. 2 is plenty.
| GaggiX wrote:
| I think that this model used consistency loss during training
| so that it can yield better results with fewer steps.
| weebull wrote:
| ...because they feel that at 20+10 steps it achieves superior
| output to SDXL at 50 steps. They also benchmark it against 1
| step for SDXL-Turbo.
| gajnadsgjoas wrote:
| Where can I run it if I don't have a GPU? Colab didn't work
| detolly wrote:
| runpod, kaggle, lambda labs, or pretty much any other server
| provider that gives you one or more gpus.
| k2enemy wrote:
| I haven't been following the image generation space since the
| initial excitement around stable diffusion. Is there an easy to
| use interface for the new models coming out?
|
| I remember setting up the python env for stable diffusion, but
| then shortly after there were a host of nice GUIs. Are there some
| popular GUIs that can be used to try out newer models? Similarly,
| what's the best GUI for some of the older models? Preferably for
| macos.
| thot_experiment wrote:
| Auto1111 and Comfy both get updated pretty quickly to support
| most of the new models coming out. I expect they'll both
| support this soon.
| stereobit wrote:
| Check out invoke.com
| sophrocyne wrote:
| Thanks for calling us out - I'm one of the maintainers.
|
| Not entirely sure we'll be in the Stable Cascade race quite
| yet. Since Auto/Comfy aren't really built for businesses,
| they'll get it incorporated sooner vs later.
|
| Invoke's main focus is building open-source tools for the
| pros who use this for work and are getting disrupted, and
| non-commercial licenses don't really help the ones that are
| trying to follow the letter of the license.
|
| Theoretically, since we're just a deployment solution, it
| might come up with our larger customers who want us to run
| something they license from Stability, but we've had zero
| interest on any of the closed-license stuff so far.
| yokto wrote:
| fal.ai is nice and fast:
| https://news.ycombinator.com/item?id=39360800 Both in
| performance and for how quickly they integrate new models
| apparently: they already support Stable Cascade.
| brucethemoose2 wrote:
| Fooocus is the fastest way to try SDXL/SDXL turbo with good
| quality.
|
| ComfyUI is cool but very DIY. You don't get good results unless
| you wrap your head around all the augmentations and defaults.
|
| No idea if it will support cascade.
| SpliffnCola wrote:
| ComfyUI is similar to Houdini in complexity, but immensely
| powerful. It's a joy to use.
|
| There are also a lot of resources available for it
| on YouTube, GitHub
| (https://github.com/comfyanonymous/ComfyUI_examples), reddit
| (https://old.reddit.com/r/comfyui), CivitAI, Comfy Workflows
| (https://comfyworkflows.com/), and OpenArt Flow
| (https://openart.ai/workflows/).
|
| I still use AUTO1111
| (https://github.com/AUTOMATIC1111/stable-diffusion-webui) and
| the recently released and heavily modified fork of AUTO1111
| called Forge (https://github.com/lllyasviel/stable-diffusion-
| webui-forge).
| emadm wrote:
| Our team at Stability AI builds ComfyUI, so yeah, it is supported
| cybereporter wrote:
| Will this get integrated into Stable Diffusion Web UI?
| ttul wrote:
| Surely within days. ComfyUI's maintainer said he is readying
| the node for release perhaps by this weekend. The Stable
| Cascade model is otherwise known as Wurstchen v3 and has been
| floating around the open source generative image space since
| fall.
| dragonwriter wrote:
| Third-party (using diffusers) node for ComfyUI is already
| available for those who can't wait for native integration.
|
| https://github.com/kijai/ComfyUI-DiffusersStableCascade
| hncomb wrote:
| Is there any way this can be used to generate multiple images of
| the same model? e.g. a car model rotated around (but all images
| are of the same generated car)
| matroid wrote:
| Someone with resources will have to train Zero123 [1] with this
| backbone.
|
| [1] https://zero123.cs.columbia.edu/
| emadm wrote:
| Heh https://stability.ai/news/stable-zero123-3d-generation
|
| Better coming
| refulgentis wrote:
| Yes, input image => embedding => N images, and if you're
| thinking 3D perspectives for rendering, you'd ControlNet the N.
|
| ref.: "The model can also understand image embeddings, which
| makes it possible to generate variations of a given image
| (left). There was no prompt given here."
| taejavu wrote:
| The model looks different in each of those variations though.
| Which seems to be intentional, but the post you're responding
| to is asking whether it's possible to keep the model exactly
| the same in each render, varying only by perspective.
| ionwake wrote:
| Does anyone have a link to a demo online?
| martin82 wrote:
| https://huggingface.co/spaces/multimodalart/stable-cascade
| ionwake wrote:
| Thank you. Is there a demo of the "image to image" ability?
| It doesn't seem to be in any of the demos I see.
| pxoe wrote:
| The way it's written about in the Image Reconstruction section,
| as if it is just an image compression thing, is kind of
| interesting: that feature and its presented use there are very
| much about storing images and reconstructing them, while "it
| doesn't actually store original images" and "it can't actually
| give out original images" are points that get used so often in
| arguments as a defense for image generators. So it is just a
| multi-image compression file format, just a very efficient one.
| Sure, it's "redrawing"/"rendering" its output and makes things
| look kinda fuzzy, but any other compressed image format does
| that as well. What was all that 'well it doesn't do those
| things' nonsense about then? Clearly it can do that.
| wongarsu wrote:
| In a way it's just an algorithm that can compress either text
| or an image. The neat trick is that if you compress the text
| "brown bear hitting Vladimir Putin" and then decompress it as
| an image, you get an image of a bear hitting Vladimir Putin.
|
| This principle is the idea behind all Stable Diffusion models,
| this one "just" achieved a much better compression ratio
| pxoe wrote:
| well yeah. but it's not so much about what it actually does,
| but how they talk about it. maybe (probably) i missed them
| putting out something that's described like that before, but
| it's just the open admission in demonstration of it. i guess
| they're getting more brazen, given that they're not really
| getting punished for what they're doing, be it piracy or
| infringement or whatever.
| Filligree wrote:
| The model works on compressed data. That's all it is. Sure,
| it could output a picture from its training set on
| decompression, but only if you feed that same picture into
| the compressor.
|
| In which case what are you doing, exactly? Normally you
| feed it a text prompt instead, which won't compress to the
| same thing.
| gmerc wrote:
| Ultimately this is abstraction not compression.
| GaggiX wrote:
| >well it doesn't do those things' nonsense about then? clearly
| it can do that.
|
| There is a model that is trained to compress (very lossily) and
| decompress the latent, but it's not the main generative model,
| and of course the model doesn't store images in it. You just
| give the encoder an image and it will encode it, and then you
| can decode it with the decoder and get back a very similar
| image. This encoder and decoder are used during training so
| that stage C can work on a compressed latent instead of
| directly at the pixel level, because that would be expensive.
| But the main generative model (stage C) should be able to
| generate any of the images that were present in the dataset,
| or it fails to do its job. Stages C, B, and A do not store any
| images.
|
| The B and A stages work like an advanced image decoder, so
| unless you have something wrong with image decoders in general,
| I don't see how this could be a problem (a JPEG decoder doesn't
| store images either, of course).
| mise_en_place wrote:
| Was anyone able to get this running on Colab? I got as far as
| loading extras in text-to-inference, but it was complaining about
| a dependency.
| SECourses wrote:
| It is pretty good; I shared a comparison on Medium:
|
| https://medium.com/@furkangozukara/stable-cascade-prompt-fol...
|
| My Gradio app even works amazingly well on an 8 GB GPU with CPU offloading
| lqcfcjx wrote:
| I'm very impressed by the recent AI progress on making models
| smaller and more efficient. I just have the feeling that every
| week there's something big in this space (like what we saw
| previously from ollama, llava, mixtral...). Apparently the space
| for on-device models is not fully explored yet. Very excited
| to see future products in that direction.
| dragonwriter wrote:
| > I'm very impressed by the recent AI progress on making models
| smaller and more efficient.
|
| That's an odd comment to place in a thread about an image
| generation model that is bigger than SDXL. Yes, it works in a
| smaller latent space, and yes, it's faster in the hardware
| configuration they've used, but it's not _smaller_.
| skybrian wrote:
| Like every other image generator I've tried, it can't do a piano
| keyboard [1]. I expect that some different approach is needed to
| be able to count the groups of black keys.
|
| [1] https://fal.ai/models/stable-
| cascade?share=13d35b76-d32f-45c...
| Agraillo wrote:
| I think it's more than this. In my case, in most of the images
| I made about basketball there was more than one ball. I'm not
| an expert, but some fundamental constraints of human (cultural)
| life (like all piano keys being the same, or there being only
| one ball in a game) are not grasped by the training, or only
| grasped partially.
| GaggiX wrote:
| As with human hands, coherency is fixed by scaling the model
| and the training.
| sanroot99 wrote:
| What are the system requirements needed to run this, particularly
| how much VRAM would it take?
| instagraham wrote:
| Will this work on AMD? I found no mention of support. Kind of an
| important feature for such a project, as AMD users running Stable
| Diffusion will be suffering diminished performance.
| drclegg wrote:
| Apparently yes
| https://news.ycombinator.com/item?id=39360106#39360497
| xkgt wrote:
| This model is built upon the Wurstchen architecture. Here is a
| very good explanation of how this model works by one of its
| authors.
|
| https://www.youtube.com/watch?v=ogJsCPqgFMk
| lordswork wrote:
| Great video! And here's a summary of the video :)
| Gemini Advanced> Summarize this video:
| https://www.youtube.com/watch?v=ogJsCPqgFMk
|
| This video is about a new method for training text-to-image
| diffusion models called Wurstchen. The method is significantly
| more efficient than previous methods, such as Stable Diffusion
| 1.4, and can achieve similar results with 16 times less
| training time and compute.
|
| The key to Wurstchen's efficiency is its use of a two-stage
| compression process. The first stage uses a VQ-VAE to compress
| images into a latent space that is 4 times smaller than the
| latent space used by Stable Diffusion. The second stage uses a
| diffusion model to further compress the latent space by another
| factor of 10. This results in a total compression ratio of 40,
| which is significantly higher than the compression ratio of 8
| used by Stable Diffusion.
|
| The compressed latent space allows the text-to-image diffusion
| model in Wurstchen to be much smaller and faster to train than
| the model in Stable Diffusion. This makes it possible to train
| Wurstchen on a single GPU in just 24,000 GPU hours, while
| Stable Diffusion 1.4 requires 150,000 GPU hours.
|
| Despite its efficiency, Wurstchen is able to generate images
| that are of comparable quality to those generated by Stable
| Diffusion. In some cases, Wurstchen can even generate images
| that are of higher quality, such as images with higher
| resolutions or images that contain more detail.
|
| Overall, Wurstchen is a significant advance in the field of
| text-to-image generation. It makes it possible to train text-
| to-image models that are more efficient and affordable than
| ever before. This could lead to a wider range of applications
| for text-to-image generation, such as creating images for
| marketing materials, generating illustrations for books, or
| even creating personalized avatars.
| nialv7 wrote:
| Can Stable Cascade be used for image compression? 1024x1024 to
| 24x24 is crazy.
| anonuser1234 wrote:
| That's definitely not lossless compression
___________________________________________________________________
(page generated 2024-02-14 23:01 UTC)