[HN Gopher] DeepFloyd IF: open-source text-to-image model
___________________________________________________________________
DeepFloyd IF: open-source text-to-image model
Author : ea016
Score : 120 points
Date : 2023-04-26 18:15 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| 55555 wrote:
| So this one can create perfect text in images? If true, that's
| insane
| GaggiX wrote:
| LDM-400M (the predecessor of Stable Diffusion) was already
| able to generate text, thanks to the fact that every token in
| the text encoder (trained from scratch) was available in the
| attention layer.
| flangola7 wrote:
| >thanks to the fact that every token in the text encoder
| (trained from scratch) was available in the attention layer.
|
| >ChatGPT explain this like I'm 5
| coolspot wrote:
| "Every word in the text can be used to help create the
| image."
| GaggiX wrote:
| Interesting, there are different models:
| https://github.com/deep-floyd/IF#-model-zoo-
|
| I'm also very happy about the release of the two upscalers; I
| can use them to upscale the results of my small 64x64 DDIM
| models (maybe with some finetuning).
| jacob019 wrote:
| Any web-based front ends yet? I put together a system that
| runs a variety of web-based open-source AI image generation
| and editing tools on Vultr GPU instances. It spins up
| instances on demand, mounts an NFS filesystem with local
| caching and a COW layer, spawns the services, proxies the
| requests, and then spins down idle instances when I'm done.
| Would love to add this; I suppose I could whip something up if
| none exists.
| ronsor wrote:
| It'll probably be in the Auto1111 WebUI within a week.
| Taek wrote:
| For anyone who doesn't know, DeepFloyd is a StableDiffusion-style
| image model that more or less replaced CLIP with a full LLM (11B
| params). The result is that it is much better at responding to
| more complex prompts.
|
| In theory, it is also smarter at learning from its training data.
| GaggiX wrote:
| >StableDiffusion style
|
| Not really; it's a cascaded diffusion model conditioned on the
| T5 encoder. There's really nothing in common, unless you mean
| that using a diffusion model at all is "SD style".
| tmabraham wrote:
| It isn't like Stable Diffusion, it's more like Google's Imagen
| model.
| epivosism wrote:
| An example of how much better it does than Midjourney on a
| complex prompt:
| https://twitter.com/eb_french/status/1623823175170805760
|
| It is able to put people on the left/right and put the correct
| t-shirts and facial expressions on each one, whereas MJ just
| mixes every word you use into a soup and plops it out into the
| image. Huge MJ fan of course, it's amazing, but having
| compositional power is another step up.
| epivosism wrote:
| Here are some play-money markets on Manifold Markets tracking
| its release:
| https://manifold.markets/markets?s=relevance&f=all&q=deepflo...
|
| 35% odds of a full release by the end of the month, although
| that may not have adjusted yet.
| TheBlapse wrote:
| "Imagen free"
| TheBlapse wrote:
| Currently down on Hugging Face
| zimpenfish wrote:
| 16GB VRAM minimum is a bit steep. Sadly that excludes my 3080,
| which is annoying because I'd like something better than
| Stable Diffusion locally.
| specproc wrote:
| There's a note which suggests you might be able to get by on
| less. My 3060 struggles with SD on the defaults, but works
| fine with float16.
|
| _There are multiple ways to speed up the inference time and
| lower the memory consumption even more with diffusers. To do
| so, please have a look at the Diffusers docs:
| Optimizing for inference time [1]
| Optimizing for low memory during inference [2]_
|
| [1] https://huggingface.co/docs/diffusers/api/pipelines/if#optim...
|
| [2] https://huggingface.co/docs/diffusers/api/pipelines/if#optim...
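|
| For reference, the fp16 route looks roughly like this (a
| sketch; the model ID is guessed from the model zoo and may
| differ in the final release):
|
|     import torch
|     from diffusers import DiffusionPipeline
|
|     pipe = DiffusionPipeline.from_pretrained(
|         "DeepFloyd/IF-I-XL-v1.0",  # assumed ID
|         variant="fp16", torch_dtype=torch.float16)
|     # Keep only the active submodule on the GPU (needs
|     # accelerate installed).
|     pipe.enable_model_cpu_offload()
|     image = pipe("a photo of a red panda").images[0]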
| thewataccount wrote:
| Once these are quantized (I assume they can be), they should
| be ~1/4 the size.
|
| Can anyone explain why it needs so much RAM in the first
| place, though? 4.3B params is only ~9GB at 16-bit (I'm not as
| familiar with image models).
|
| I'm really happy to see that it fits under 24GB - that's what
| I consider the limit for being able to run on "consumer
| hardware".
| GaggiX wrote:
| >Can anyone explain why it needs so much RAM in the first
| place, though?
|
| The T5-XXL text encoder is really large. Also, we do not
| quantize the UNets: the UNet outputs 8-bit pixels, so
| quantizing the UNet itself to that precision would create
| pretty bad outputs.
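|
| Back-of-envelope, with the param counts from elsewhere in the
| thread (2 bytes per param at fp16, 4 at fp32):
|
|     params = {"T5-XXL encoder": 11e9, "Stage 1 UNet": 4.3e9,
|               "Stage 2 upscaler": 1.3e9}
|     for name, n in params.items():
|         print(f"{name}: {n * 2 / 1e9:.1f} GB fp16, "
|               f"{n * 4 / 1e9:.1f} GB fp32")
|     # T5-XXL encoder: 22.0 GB fp16, 44.0 GB fp32
|     # Stage 1 UNet: 8.6 GB fp16, 17.2 GB fp32
|     # Stage 2 upscaler: 2.6 GB fp16, 5.2 GB fp32
|
| So the text encoder alone dominates, which is why offloading
| or quantizing it matters most for fitting in 16GB.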
| SekstiNi wrote:
| They took down the blogpost, but from what I remember the
| model is composite and consists of a text encoder as well as
| 3 "stages":
|
| 1. (11B) T5-XXL text encoder [1]
|
| 2. (4.3B) Stage 1 UNet
|
| 3. (1.3B) Stage 2 upscaler (64x64 -> 256x256)
|
| 4. (?B) Stage 3 upscaler (256x256 -> 1024x1024)
|
| Resolution numbers could be off though. Also, the third stage
| can apparently use the existing Stable Diffusion x4 upscaler,
| or a new upscaler that they aren't releasing yet (ever?).
|
| > Once these are quantized (I assume they can be)
|
| Based on the success of LLaMA 4bit quantization, I believe
| the text encoder could be. As for the other modules, I'm not
| sure.
|
| edit: the text encoder is 11B, not 4.5B as I initially wrote.
|
| [1]: https://huggingface.co/google/t5-v1_1-xxl
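|
| A sketch of how the cascade chains together with diffusers
| (model IDs guessed from the model zoo naming; exact names and
| signatures may differ):
|
|     import torch
|     from diffusers import (DiffusionPipeline,
|                            StableDiffusionUpscalePipeline)
|
|     kw = dict(variant="fp16", torch_dtype=torch.float16)
|     stage1 = DiffusionPipeline.from_pretrained(
|         "DeepFloyd/IF-I-XL-v1.0", **kw)              # -> 64x64
|     stage2 = DiffusionPipeline.from_pretrained(
|         "DeepFloyd/IF-II-L-v1.0", text_encoder=None,
|         **kw)                                        # -> 256x256
|     stage3 = StableDiffusionUpscalePipeline.from_pretrained(
|         "stabilityai/stable-diffusion-x4-upscaler",
|         torch_dtype=torch.float16)                   # -> 1024x1024
|     for p in (stage1, stage2, stage3):
|         p.enable_model_cpu_offload()
|
|     prompt = "a raccoon reading a newspaper"
|     pe, ne = stage1.encode_prompt(prompt)            # T5 runs once
|     img = stage1(prompt_embeds=pe, negative_prompt_embeds=ne,
|                  output_type="pt").images
|     img = stage2(image=img, prompt_embeds=pe,
|                  negative_prompt_embeds=ne,
|                  output_type="pt").images
|     img = stage3(prompt=prompt, image=img).images[0]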
| gwern wrote:
| You'll be able to optimize it a lot to make it fit on small
| systems if you are willing to modify your workflow a bit:
| instead of 1 prompt -> 1 image, _n_ times, do 1 prompt -> _n_
| images, once, and upscale _m_ of them. For a given prompt, run
| it through the T5 model once and store the embedding; you can
| do that in CPU RAM if you have to, because you only need the
| embedding once, so you don't need a GPU that can hold T5-XXL.
| Then you can get a large batch of samples from #2; 64px is
| enough to preview. Only once you pick some do you run them
| through #3, and then from those through #4. Your peak VRAM
| should be 1 image in #2 or #4, and that can be quantized or
| pruned down to something that will fit on many GPUs.
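|
| A sketch of that workflow, under the same assumed diffusers
| API as above (cache the embedding, then sample previews in
| bulk without the 11B text encoder resident):
|
|     import torch
|     from diffusers import DiffusionPipeline
|
|     # 1 prompt -> embedding, once; CPU RAM is fine for this.
|     text_only = DiffusionPipeline.from_pretrained(
|         "DeepFloyd/IF-I-XL-v1.0", unet=None)
|     torch.save(text_only.encode_prompt("a raccoon in a library"),
|                "prompt.pt")
|     del text_only
|
|     # Batch of cheap 64px previews from #2, no text encoder.
|     stage1 = DiffusionPipeline.from_pretrained(
|         "DeepFloyd/IF-I-XL-v1.0", text_encoder=None,
|         variant="fp16", torch_dtype=torch.float16).to("cuda")
|     pe, ne = torch.load("prompt.pt")
|     previews = stage1(prompt_embeds=pe,
|                       negative_prompt_embeds=ne,
|                       num_images_per_prompt=16).images
|     # ...pick the keepers, then run only those through #3 and #4.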
| TaylorAlexander wrote:
| If you don't mind the power consumption, I noticed that older
| NVIDIA P6000s (24GB) are pretty cheap on eBay! My 16GB P5000
| is pretty handy for this stuff.
| coolspot wrote:
| Looks like a P6000 24GB goes for $800-$1200, while you can get
| a superior 3090 24GB for $800-$1000.
| CamperBob2 wrote:
| 4090s are only $1600 or so now, for that matter.
| TaylorAlexander wrote:
| Oh! My mistake, thanks for letting me know.
| NBJack wrote:
| An M40 24GB is less than $200, if you don't mind the trouble
| of getting its drivers installed, cooling it, etc. It's also
| important to note your motherboard must support larger VRAM
| addressing; many older chipsets won't be able to boot with it
| (i.e. some, perhaps almost all, Zen 1 boards).
| connerruhl wrote:
| The full release will be soon!
|
| https://twitter.com/EMostaque/status/1651328161148174337
| atleastoptimal wrote:
| > Text
|
| > Hands
|
| good god it solves the two biggest meme issues with image models
| in one go. Will this be the new state of the art every other
| model is compared to?
| Taek wrote:
| There are good reasons to believe that this will be the new
| state of the art by a comfortable margin. Hard to know until we
| can actually play with it.
| gwern wrote:
| We already knew those were going to be solved by scale (like
| using T5 instead of the really small, bad text encoder SD
| used), because they _were_ solved by Imagen etc.
| orra wrote:
| Neither the source code nor the weights are open source... This
| is actually worse than Stability AI's previous offering, in that
| regard.
| connerruhl wrote:
| That'll change when the full non-research release occurs...
| https://twitter.com/EMostaque/status/1651328161148174337
| orra wrote:
| That tweet is vague. Besides, it just says 'like SD', so I
| will be pleasantly shocked if the models are open source.
| ilaksh wrote:
| They are technically open source. It's just that the model
| license prohibits commercial use and the code license prohibits
| bypassing the filters. So it's kind of worse than closed source
| in a way because it's like a tease. With no API apparently.
|
| Theoretically large companies or rich people might be able to
| make a licensing agreement.
| rgbrgb wrote:
| > model license prohibits commercial use
|
| I thought that at first, but I think it only prohibits
| commercial use that breaks regional copyright or privacy
| laws.
| orra wrote:
| It prohibits commercial use outright, whether or not you break
| regional laws; and it prohibits breaking certain laws. As
| another user said, encoding the law into a licence is
| pointless but makes it non-free.
|
| There are also problematic restrictions on your ability to
| modify the software under clause 2(c). Nor do you have the
| right to sublicense; it's not clear to me what rights somebody
| has if you give them a copy.
| yellowapple wrote:
| That's already prohibited by, you know, those very same
| copyright and privacy laws. Adding those same prohibitions
| to the license not only makes the software nonfree, but
| _pointlessly_ does so.
| dragonwriter wrote:
| It's not pointless: it means the model _licensor_ has a claim
| against you, in addition to whoever would have one for
| violating the referenced laws; it also means, and this is
| probably more important, that in some jurisdictions the model
| licensor has a better defense against liability for
| contributory infringement if the licensee infringes.
|
| EDIT: That said, it's unambiguously not open source.
| jrm4 wrote:
| I am a lawyer, and as flimsy and wishy-washy as the term
| "open-source" already is, I can't even fathom what is meant
| by "open source" here.
|
| Are people suggesting that "look at the code but don't touch"
| actually fits what some people think of as open source?
| yellowapple wrote:
| > They are technically open source. It's just that the model
| license prohibits commercial use and the code license
| prohibits bypassing the filters.
|
| Your second sentence contradicts the first. Prohibiting
| commercial use and prohibiting modification are each, in and
| of themselves, mutually exclusive with being "technically
| open source" (let alone both at the same time).
| kingcharles wrote:
| The examples on the README are extremely compelling; the state of
| the art has been raised yet again.
| lalaithion wrote:
| Has anyone tried the Scott Alexander AI bet prompts?
|
| 1. A stained glass picture of a woman in a library with a raven
| on her shoulder with a key in its mouth
|
| 2. An oil painting of a man in a factory looking at a cat wearing
| a top hat
|
| 3. A digital art picture of a child riding a llama with a bell on
| its tail through a desert
|
| 4. A 3D render of an astronaut in space holding a fox wearing
| lipstick
|
| 5. Pixel art of a farmer in a cathedral holding a red basketball
| swyx wrote:
| where are these prompts from?
| epivosism wrote:
| Yes, I tried them here on an earlier version of IF:
| https://twitter.com/eb_french/status/1618354180577714176
| epivosism wrote:
| I thought it was pretty definitive at the time, but when you
| look really closely (as Scott's opponent is likely to do), it
| didn't seem like a clear win yet. But that was 3 months ago,
| and hopefully DF is even better now.
| hunkins wrote:
| New restriction in their License suggests the software can't be
| modified.
|
| "2. All persons obtaining a copy or substantial portion of the
| Software, a modified version of the Software (or substantial
| portion thereof), or a derivative work based upon this Software
| (or substantial portion thereof) must not delete, remove,
| disable, diminish, or circumvent any inference filters or
| inference filter mechanisms in the Software, or any portion of
| the Software that implements any such filters or filter
| mechanisms."
| thewataccount wrote:
| > New restriction in their License suggests the software can't
| be modified.
|
| It can be modified. That just says it can't be modified to
| bypass their filters.
| [deleted]
| GaggiX wrote:
| >New restriction in their License suggests the software can't
| be modified.
|
| To remove filters.
|
| "Permission is hereby granted, free of charge, to any person
| obtaining a copy of this software and associated documentation
| files (the "Software"), to deal in the Software without
| restriction, including without limitation the rights to use,
| copy, modify, merge, publish, distribute, sublicense, and/or
| sell copies of the Software, and to permit persons to whom the
| Software is furnished to do so, subject to the following
| conditions:"
| oh_sigh wrote:
| You can't remove the filters per the license, but the weights
| will be available soon and so anyone can just reimplement this
| code using the weights
| Jackson__ wrote:
| There is a similar license clause for the weights[0] as well,
| so I'm not sure this would apply unless you write the code
| and train your model from scratch.
|
| [0] https://github.com/deep-floyd/IF/blob/main/LICENSE-MODEL#L54
| dragonwriter wrote:
| Or unless, as seems to be fairly widely expected but
| untested, model weights are not actually copyrightable, so
| model licenses are superfluous.
| ronsor wrote:
| That's already been done:
| https://github.com/lucidrains/imagen-pytorch
| RobotToaster wrote:
| Then by definition it isn't open source, violating points 3, 4,
| and 6 of the open source definition.
| https://opensource.org/osd/
| yellowapple wrote:
| Yep. It's getting really exhausting seeing projects falsely
| advertising themselves as "open source". Either be FOSS or
| don't be; don't pretend to be while using some nonsense like
| the BSL or whatever adhocery is in play here.
| Mizza wrote:
| In the README they even call it "Modified MIT", the
| modification being where they turned it from a very
| permissive license into a fully proprietary one. Very cool
| model though.
| kmeisthax wrote:
| As someone who's largely "OK" with morality clauses in
| otherwise liberal AI licenses, I think we should start calling
| these "weights-available" models to distinguish from capital-F
| Free Software[1] ones.
|
| I'm starting to get irritated by all these 'non-commercial'
| licensed models, though, because there is no such thing as a
| non-commercial license. In copyright law, merely having the
| work in question is considered a commercial benefit. So you
| need to specify every single act you think is 'non-commercial',
| and users of the license have to read and understand that. Even
| Creative Commons' NC clause only specifies one; they say that
| filesharing is not commercial. So it's just a fancy covenant
| not to sue BitTorrent users.
|
| And then there's LLaMA, whose model weights were only ever
| shared privately with other researchers. Everyone using LLaMA
| publicly is likely pirating it. _Actual_ weights-available or
| Free models already exist, such as BLOOM, Dolly, StableLM[0],
| Pythia, GPT-J, GPT-NeoX, and CerebrasGPT.
|
| [0] Untuned only; the instruction-tuned models are
| frustratingly CC-BY-NC-SA because apparently nobody made an
| open dataset for instruction tuning.
|
| [1] Inasmuch as an AI model trained on copyrighted data can
| even be considered Free.
| simonw wrote:
| It looks like the model on Hugging Face either hasn't been
| published yet or was withdrawn. I got this error in their Colab
| notebook:
|
| OSError: DeepFloyd/IF-I-IF-v1.0 is not a local folder and is not
| a valid model identifier listed on
| 'https://huggingface.co/models' If this is a private repository,
| make sure to pass a token having permission to this repo with
| `use_auth_token` or log in with `huggingface-cli login` and pass
| `use_auth_token=True`.
| Zetobal wrote:
| You need to accept the license on the HuggingFace model card.
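|
| i.e. click through the license on the model page while logged
| in, then authenticate before downloading. A sketch (repo ID
| taken from the Colab error above):
|
|     from huggingface_hub import login
|     login()  # paste a token from huggingface.co/settings/tokens
|
|     from diffusers import DiffusionPipeline
|     pipe = DiffusionPipeline.from_pretrained(
|         "DeepFloyd/IF-I-IF-v1.0", use_auth_token=True)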
| lerchmo wrote:
| It doesn't seem like they have anything published:
| https://huggingface.co/DeepFloyd
| thewataccount wrote:
| I swear I saw it a few minutes ago but I might be crazy.
| Zetobal wrote:
| Same, got the weights on gdrive.
| og_kalu wrote:
| Could you link them?
| simonw wrote:
| https://huggingface.co/DeepFloyd/IF-I-IF-v1.0 is a 404
| currently.
___________________________________________________________________
(page generated 2023-04-26 23:01 UTC)